Index build technologies are generally optimized to distribute data evenly across index shards. While conventional distributed indices are powerful, they also disaggregate the semantics of the data itself, under an assumption that the data is homogeneous. Under some circumstances, such as a general text search problem, that assumption holds. In other circumstances, such as when specific data collections are needed, the assumption is not useful. Providing a search engine for a specific data collection may involve either creating collection-specific indices or developing complex data joiners for aggregation and filtering, along with complex queries, while relying on intrinsically generated, structured metadata.
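By way of non-limiting illustration, the contrast between even distribution across shards and collection-specific indices can be sketched as follows. The sketch is a simplified assumption and not a description of any particular index build technology: the shard count, the `collection` metadata field, the sample documents, and all function names are hypothetical.

```python
import hashlib
from collections import defaultdict

NUM_SHARDS = 4  # assumed shard count, chosen only for illustration


def shard_for(doc_id: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a document ID to a shard with a stable hash (even distribution)."""
    digest = hashlib.sha256(doc_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def build_hash_sharded(docs):
    """Spread documents across shards without regard to their collection."""
    shards = defaultdict(list)
    for doc in docs:
        shards[shard_for(doc["id"])].append(doc)
    return shards


def build_collection_specific(docs):
    """Alternative: keep one index per collection."""
    indices = defaultdict(list)
    for doc in docs:
        indices[doc["collection"]].append(doc)
    return indices


def search_collection_via_shards(shards, collection):
    """Recover one collection from hash shards by filtering on metadata.

    Every shard must be visited, because hashing scatters the collection.
    """
    return [
        doc
        for shard in shards.values()
        for doc in shard
        if doc["collection"] == collection
    ]


# Hypothetical documents: two collections, interleaved by ID.
docs = [
    {"id": f"doc-{i}", "collection": "patents" if i % 2 == 0 else "news"}
    for i in range(20)
]
shards = build_hash_sharded(docs)
indices = build_collection_specific(docs)
```

In this sketch, `search_collection_via_shards(shards, "patents")` returns the same documents as `indices["patents"]`, but only after scanning every shard and filtering on the `collection` field, which is the kind of metadata-dependent filtering described above.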
Search engines or databases that deal with specific data collections often use a searchable index that includes a compilation of documents or other data for the purposes of the particular search engine. The size and content of the search index depend on the purposes of the search engine. For example, a search engine for patents may search repositories of all issued patents and published patent applications for a particular patent system, such as that of the United States Patent and Trademark Office, the patent office of another country, or both. In order to create the patent search engine, a searchable index including all issued patents and published applications for the particular system can be built. Building large searchable indices is a time-consuming process and may require a large amount of computer processing power.
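As a minimal sketch of what building a searchable index can involve, the following example constructs a simple inverted index that maps each term to the documents containing it. An inverted index is one common structure for this purpose and is assumed here for illustration only; the document identifiers, sample text, and function names are hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(documents):
    """Map each term to the set of document IDs that contain it.

    Every document is tokenized in full, so build cost grows with the
    total amount of text in the collection.
    """
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the IDs of documents that contain every term in the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical documents standing in for a patent collection.
documents = {
    "US-A": "Method for building a distributed search index",
    "US-B": "Apparatus for cooling a search server",
    "US-C": "Distributed index shard rebalancing",
}
index = build_inverted_index(documents)
```

Because every document in the collection is read and tokenized during the build, the work scales with the size of the collection, which is consistent with the observation that building large indices is time-consuming.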