Advances in information technology are causing more and more information to be generated. Many ecommerce websites, for example, are seeing rapid increases in the volume of data generated by their users. To assist people in finding the information they require, search services are provided to enable a full-text search of data sources. The search can be based on keywords or descriptive information contained in a search query that is provided by a user. The search result is then returned to the user.
To implement the search service, a process of collecting, parsing, and storing data from data sources is commonly used to facilitate fast and accurate information retrieval; this process is called search engine indexing. Since keywords provided by users for searching are usually in text format, the indices for keyword searches are usually in text format as well.
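As an illustration only, the collect, parse, and store steps described above can be sketched as a small inverted index in Python. The names `tokenize`, `build_index`, and `search`, as well as the sample records, are hypothetical and are not taken from any particular search engine; the sketch assumes that a query matches a record only when the record contains every query term.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Parse step: lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(records):
    """Store step: map each term to the sorted ids of the records containing it."""
    postings = defaultdict(set)
    for record_id, text in records.items():
        for term in tokenize(text):
            postings[term].add(record_id)
    return {term: sorted(ids) for term, ids in postings.items()}


def search(index, query):
    """Return the ids of records that contain every term of the query."""
    terms = tokenize(query)
    if not terms:
        return []
    result = set(index.get(terms[0], []))
    for term in terms[1:]:
        result &= set(index.get(term, []))
    return sorted(result)


# Collect step: hypothetical source records keyed by record id.
records = {
    1: "Red running shoes",
    2: "Blue running jacket",
    3: "Red wool scarf",
}
index = build_index(records)
```

In this sketch, `search(index, "running red")` consults only the stored term-to-record mapping and never rescans the source text, which is the reason an index is built in the first place.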
A document to be indexed may be a webpage snapshot provided by a search engine, or a portion of a webpage snapshot. A webpage snapshot may have various formats. Some search engines support multiple document formats. The index may include various information from a data source; for example, if part of the content in the data source is textual, the index may include such text; if the data source file is in an image, audio, or video format, the index may have a field indicating the source of such file, for example, a field indicating the file's web address.
An index server is often used to manage indices. When a user initiates a search, a query provided by the user is received by a search server. The search server determines which index covers the data required by the user, looks up the query term(s) in that index, and provides the user with the search result retrieved from the index.
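The query flow described above can be sketched as follows. This is a minimal sketch under stated assumptions, not a description of any actual search server: the class `SearchServer`, its methods, and the idea of selecting an index by a category name supplied with the query are all hypothetical, since the text does not specify how the search server decides which index holds the required data.

```python
class SearchServer:
    """Hypothetical search server that routes a query to one of several indices."""

    def __init__(self, indices):
        # indices: category name -> {term: [document ids]} (assumed layout)
        self.indices = indices

    def select_index(self, category):
        """Pick the index assumed to cover the requested data; empty if unknown."""
        return self.indices.get(category, {})

    def search(self, category, query):
        """Look up every query term in the selected index and intersect the hits."""
        index = self.select_index(category)
        terms = query.lower().split()
        if not terms:
            return []
        hits = set(index.get(terms[0], []))
        for term in terms[1:]:
            hits &= set(index.get(term, []))
        return sorted(hits)


# Hypothetical indices, one per product category.
server = SearchServer({
    "books": {"python": [10, 12], "cooking": [11]},
    "shoes": {"running": [20, 21], "red": [21]},
})
```

Here a call such as `server.search("shoes", "red running")` touches only the index selected for that category, so the other indices are never consulted for the query.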
To index massive amounts of data, website operators often use a number of index servers. The index may comprise a series of index data items, and each data item is referred to as a document. Usually each document corresponds to a record in the source data. The index servers typically convert the records extracted from the source data into the indices. A major challenge in managing the indices is the management of parallel computing processes. There are many opportunities for race conditions and coherency faults. For example, multiple index servers may need to write to the same index file concurrently. In a conventional implementation, to avoid coherency faults, while one of the multiple index servers is writing data to the index file, the other index servers remain idle and may write data to the index file only after the first index server finishes writing. Such contention among multiple index servers writing to the same shared resource (e.g., the index file) is referred to as a write sharing conflict. Thus, the conventional method may result in low performance and sharing conflicts during the indexing process.
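The conventional behavior described above, in which only one index server at a time may write to a shared index file, can be sketched with threads standing in for index servers and a lock guarding the shared file. The names `SharedIndexFile` and `run_index_server`, the in-memory list used in place of a real file, and the batch sizes are assumptions made for illustration only.

```python
import threading


class SharedIndexFile:
    """In-memory stand-in for one index file shared by several index servers."""

    def __init__(self):
        self.documents = []
        self._lock = threading.Lock()

    def write(self, server_name, records):
        # Only one writer can hold the lock; every other writer blocks here,
        # i.e. sits idle, until the current writer has finished its batch.
        with self._lock:
            for record in records:
                self.documents.append((server_name, record))


def run_index_server(name, records, index_file):
    """Simulate one index server writing its converted records to the file."""
    index_file.write(name, records)


index_file = SharedIndexFile()
workers = [
    threading.Thread(
        target=run_index_server,
        args=(f"server-{i}", [f"record-{i}-{j}" for j in range(100)], index_file),
    )
    for i in range(4)
]
for worker in workers:
    worker.start()
for worker in workers:
    worker.join()
```

Because each writer holds the lock for its whole batch, the resulting file stays coherent, but the four writers run strictly one after another; the total write time is roughly the sum of the individual write times, which is the performance cost noted above.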
Furthermore, during the query searching process, the sizes of the indices can affect the search efficiency. If the indices are too large, looking up data in any one index may take a long time; if the indices are too small, a single query may require access to many indices.
Therefore, a more efficient search engine indexing and searching method or system is needed.