Traditional search techniques rely heavily on indexing. Scanning every document in a large corpus for each query is generally impractical and inefficient. Indexing can greatly reduce the computing power required for search operations by abstracting data into one or more data structures external to the corpus. An index is a representation of data that is organized to allow faster data retrieval operations, often at the expense of greater storage requirements and/or increased write latency.
One common type of search index is the inverted index. To construct an inverted index, each document in the corpus is analyzed and a dictionary is constructed based on the analysis. The dictionary is an index of all unique searchable terms found in the corpus documents. Each term in the dictionary is associated with a list of document identifiers, where each document identifier references a document containing at least one instance of the term. Thus, it is possible to perform a textual search of the dictionary for a specified term and to identify each document containing the term. Although the use of indices can greatly reduce the read latency of a search engine, extremely large data sets can present a challenge for search engine designers. Most large indexed data sets are implemented as distributed systems involving complicated synchronization algorithms to maintain data consistency.
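The construction described above can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions, not a description of any particular search engine: the names tokenize, build_inverted_index, and search are hypothetical, the three-document corpus is invented sample data, and the "analysis" step is reduced to lowercasing and splitting on non-alphanumeric characters.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lowercase alphanumeric terms (a simplifying assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(corpus):
    """Build a dictionary mapping each unique term to a sorted list of document ids."""
    postings = defaultdict(set)
    for doc_id, text in corpus.items():
        for term in tokenize(text):
            # A set records each document at most once per term.
            postings[term].add(doc_id)
    return {term: sorted(doc_ids) for term, doc_ids in postings.items()}


def search(index, term):
    """Look the term up in the dictionary; no document is scanned at query time."""
    return index.get(term.lower(), [])


# Hypothetical sample corpus: document identifier -> document text.
corpus = {
    1: "Indexing speeds up search",
    2: "Real-time search needs fast indexing",
    3: "Scanning every document is slow",
}
index = build_inverted_index(corpus)
print(search(index, "search"))
```

A query touches only the dictionary entry for the requested term, which is why read latency drops; the cost is the extra storage for the dictionary and the work of updating it whenever a document is added.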
Complicating matters further, real-time messaging platforms can have requirements entirely different from those of traditional search architectures. For example, a real-time platform may place greater emphasis on timeliness as a search criterion. Moreover, the corpus of searchable data itself may differ substantially depending on the method chosen for ingesting and assimilating new documents. A key requirement of real-time search is the ability to ingest content rapidly and index it for searchability, while concurrently supporting low-latency, high-throughput query evaluation. As a result, the traditional search engine architecture may be inadequate for indexing and searching real-time information.
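The tension between rapid ingestion and concurrent querying can be made concrete with a small Python sketch. It is a toy model under stated assumptions rather than a real-time search architecture: the class RealTimeIndex and its methods are hypothetical, terms are obtained by whitespace splitting, and a single coarse lock is used purely for brevity. Postings are kept in arrival order so that a query can return the most recently ingested documents first, reflecting timeliness as a search criterion.

```python
import threading
from collections import defaultdict


class RealTimeIndex:
    """Toy in-memory index that accepts writes while serving reads (hypothetical)."""

    def __init__(self):
        self._postings = defaultdict(list)  # term -> document ids in arrival order
        self._lock = threading.Lock()       # one coarse lock, chosen for brevity

    def ingest(self, doc_id, text):
        """Index a new document so that it becomes searchable immediately."""
        terms = set(text.lower().split())
        with self._lock:
            for term in terms:
                self._postings[term].append(doc_id)

    def search(self, term, limit=10):
        """Return up to `limit` matching document ids, most recently ingested first."""
        with self._lock:
            doc_ids = self._postings.get(term.lower(), [])
            return doc_ids[::-1][:limit]


# Simulate many writers ingesting messages concurrently.
index = RealTimeIndex()
writers = [
    threading.Thread(target=index.ingest, args=(i, f"message {i} about search"))
    for i in range(100)
]
for writer in writers:
    writer.start()
for writer in writers:
    writer.join()
print(len(index.search("search", limit=1000)))
```

Because every reader and writer contends for the same lock, this design would likely limit throughput under heavy load, which hints at why an architecture built for mostly static corpora may not carry over directly to real-time data.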