Search engines typically search Web pages in response to queries submitted from a browser running on a client device. A search engine receives a search term entered by a user and retrieves a search result list of Web pages associated with the search term. The search engine displays the search results as a series of subsets of the search list based on certain criteria. Common criteria used during a search operation include whether the search term appears fully or partly on a given Web page, the number of times the search string appears in the search result, alphabetical order, etc. Further, the user can open a link by clicking on it with a mouse button and then browse the linked page. Some of the user interactions with the search results and/or user information may be monitored and collected by the search engine to provide better searches subsequently.
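By way of illustration only, the criteria mentioned above (full or partial appearance of the search term, number of occurrences, and alphabetical order) can be sketched as a simple sort key, with the results then divided into subsets for display. The names Page, rank_pages, and paginate are hypothetical and are not taken from any particular search engine; the ordering of the criteria is an assumption made for this example.

```python
from dataclasses import dataclass


@dataclass
class Page:
    title: str
    text: str


def rank_pages(pages, search_term):
    """Return matching pages ordered by the illustrative criteria.

    Assumed order: pages containing the full search string first, then
    pages with more occurrences, then alphabetical order by title.
    """
    term = search_term.lower()
    words = term.split()
    scored = []
    for page in pages:
        body = page.text.lower()
        # Occurrences of the whole search string (substring match).
        full_hits = body.count(term) if term else 0
        # Occurrences of each individual word of the search string.
        partial_hits = sum(body.count(word) for word in words)
        if partial_hits == 0:
            continue  # the term does not appear even partly
        key = (0 if full_hits else 1, -full_hits, -partial_hits, page.title.lower())
        scored.append((key, page))
    scored.sort(key=lambda item: item[0])
    return [page for _, page in scored]


def paginate(results, page_size):
    """Split a result list into a series of subsets of at most page_size items."""
    return [results[i:i + page_size] for i in range(0, len(results), page_size)]
```

In this sketch, matching is a plain substring count, so a word such as "data" would also be counted inside "database"; an actual search engine would typically rely on an index and more elaborate relevance scoring.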
Typically, the content provided by a search engine online may also be analyzed by an analytics system offline. FIG. 1 is a block diagram illustrating a conventional network configuration for online searching and offline data analytics. Referring to FIG. 1, client devices 101-102 are communicatively coupled to Web server 104 and analytics server 105 over network 103 (e.g., Internet). In this example, Web server 104 includes search engine 130 to provide online content searching to clients 101 (e.g., browsing users) based on content 116 stored in content storage system or server 110. The online searching system (e.g., Web server 104 and storage system 110) requires a low latency capability, because the content has to be searched and returned in a very short period of time in response to a search query. Clients 102 (e.g., analytics users) access analytics system or server 105 to perform offline data analysis via analytics engine 140 on content 118 stored on storage system or server 112. The offline analytics system (e.g., analytics server 105 and storage system 112) requires a high throughput capability, because a large amount of data will be accessed. Content 116 and content 118 may be received from data collection system or data sources 150 via ETL (extract, transform, and load) pipelines 121-122, respectively.
In this configuration, the same content has to be loaded into separate storage systems 110 and 112, which may be unnecessarily redundant, requires more storage space, and makes the data difficult to synchronize or manage. In addition, data collection system 150 or other data sources have to maintain at least two separate ETL pipelines to feed the data to both storage systems 110 and 112, which requires more network bandwidth and other processing resources.
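The redundancy described above can be illustrated with a minimal sketch, in which two independent ETL pipelines load the same records into two separate stores. The functions extract, transform, and run_pipeline, and the in-memory dictionaries standing in for storage systems 110 and 112, are hypothetical simplifications assumed for this example only.

```python
def extract(source):
    """Read every record from the data source (one full pass)."""
    return list(source)


def transform(records):
    """Normalize each record; here the text is only trimmed and lowercased."""
    return [{"id": r["id"], "text": r["text"].strip().lower()} for r in records]


def run_pipeline(source, store):
    """One ETL pipeline: extract, transform, and load into the given store.

    Returns the number of records read from the source.
    """
    records = transform(extract(source))
    for record in records:
        store[record["id"]] = record
    return len(records)


# Hypothetical stand-ins for data source 150 and storage systems 110 and 112.
data_source = [
    {"id": 1, "text": "  First Page "},
    {"id": 2, "text": "Second Page"},
]
online_store = {}     # plays the role of storage system 110
analytics_store = {}  # plays the role of storage system 112

# Each pipeline reads the whole source and stores its own copy.
source_reads = run_pipeline(data_source, online_store)
source_reads += run_pipeline(data_source, analytics_store)
stored_copies = len(online_store) + len(analytics_store)

# If only one pipeline runs after new data arrives, the stores diverge.
data_source.append({"id": 3, "text": "Third Page"})
run_pipeline(data_source, online_store)
in_sync = online_store == analytics_store
```

In this sketch, two source records result in four reads and four stored copies, and the two stores fall out of synchronization as soon as one pipeline runs without the other.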