The disclosure relates to the fields of search engines and content indexing and, in particular, to methods, devices, and systems for high-throughput indexing and ad hoc query activation.
As user activity with networked applications (e.g., websites or services) has increased, more complex systems have been built and, accordingly, an increasing amount of data has been, and continues to be, generated. For example, web-based mail applications generate vast amounts of content as millions of users create messages, send attachments, and perform other operations. Similarly, other user applications can result in terabytes (or more) of data being stored and associated with users.
In parallel with this trend, search engines have become increasingly advanced, and increasingly necessary, as the amount of data grows. Generally, search engines are focused on crawling the Internet and creating an index of content for future keyword searches. In time, this methodology was applied to user-facing applications. For example, users may now search electronic mail or social networks using keywords.
Despite advances in search engines, the addition of search engine technology to user-focused platforms suffers from numerous technical problems. First, existing search indexing techniques are unable to cope efficiently with historical data and out-of-order data. That is, content such as mail is indexed a single time, as it is received. Future content is simply added to the existing indexed data. While this approach may work for a mail provider, since the mail provider has no concept of history (as used herein), it surfaces problems in extending the capabilities of the system. Specifically, when mail providers attempt to add new technical features, the existing mail must be completely re-indexed (or the current index relied upon in the meantime) before the features are available for public use. Thus, new features cannot be deployed quickly.
Additionally, current systems are unable to reliably present content while such content is undergoing processing. Specifically, due to the out-of-order nature of event processing, current systems often display “stale” data (e.g., documents that have been deleted). This arises because a race condition may occur wherein current systems first process a document but only later process a deletion event (or similar event) for that document. Thus, current systems present inconsistent views of a data source to a user. Moreover, current systems often present duplicated data, since data from multiple sources is not reconciled in a consistent manner with events associated with documents from those sources.
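By way of illustration only, the stale-data problem described above may be sketched as follows. The sketch shows one way the problem can arise: an indexer applies events strictly in arrival order, so a deletion event that arrives ahead of its corresponding document leaves the deleted document visible. The `Event` record, the `NaiveIndex` class, the logical `timestamp` field, and the `msg-1` identifier are all hypothetical names introduced for this example and do not describe any particular existing system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    doc_id: str
    kind: str       # "add" or "delete" (hypothetical event kinds)
    timestamp: int  # logical time at which the event occurred at the source


class NaiveIndex:
    """Hypothetical indexer that applies events in arrival order,
    ignoring the time at which each event actually occurred."""

    def __init__(self):
        self.docs = set()

    def apply(self, event):
        if event.kind == "add":
            self.docs.add(event.doc_id)
        elif event.kind == "delete":
            # Deleting a document that has not been indexed yet is a no-op.
            self.docs.discard(event.doc_id)

    def visible(self):
        return sorted(self.docs)


# At the source, the delete (timestamp 2) happened after the add (timestamp 1),
# but the two events reach the indexer in the opposite order.
arrivals = [
    Event("msg-1", "delete", 2),
    Event("msg-1", "add", 1),
]

index = NaiveIndex()
for event in arrivals:
    index.apply(event)

# The deleted document is still presented to the user.
stale_view = index.visible()
```

In this sketch the index holds the timestamps needed to detect the inversion but never consults them, which is why the deleted document remains in `stale_view`.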