The Internet provides access to a wide variety of resources, for example, video files, image files, audio files, or Web pages, including content for particular subjects, book articles, or news articles. A search engine can select one or more resources in response to receiving a search query. A search query is data that a user submits to (or causes, knowingly or unknowingly, to be submitted to or obtained by) a search engine to conduct a search to satisfy the user's informational needs. Search queries almost always include data in the form of text (e.g., one or more search query terms), as well as other information. The search engine selects and scores resources based on their relevance to the search query and on their importance relative to other resources, and provides search results that link to the selected resources. The search results are typically ranked according to the scores and presented in that order.
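By way of illustration only, the selecting, scoring, and ranking described above can be sketched in a few lines of Python. The function name `rank`, the term-overlap relevance measure, the `relevance_weight` parameter, the precomputed `importance` values, and the example URLs are all assumptions made for this sketch; actual search engines use far more elaborate signals.

```python
def rank(resources, query, relevance_weight=0.7):
    """Select resources matching the query and order them by score.

    Relevance is the fraction of query terms found in the resource text.
    `importance` is assumed to be a precomputed, query-independent value
    in the range [0, 1]. Both choices are illustrative assumptions.
    """
    terms = query.lower().split()
    if not terms:
        return []
    scored = []
    for resource in resources:
        words = set(resource["text"].lower().split())
        relevance = sum(term in words for term in terms) / len(terms)
        if relevance == 0:
            continue  # resource is not selected for this query
        score = (relevance_weight * relevance
                 + (1 - relevance_weight) * resource["importance"])
        scored.append((score, resource["url"]))
    # Present results in descending order of score.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [url for _, url in scored]


# Hypothetical resources with made-up importance values.
resources = [
    {"url": "a.example", "text": "news articles about search engines", "importance": 0.2},
    {"url": "b.example", "text": "search query ranking", "importance": 0.9},
    {"url": "c.example", "text": "audio files", "importance": 1.0},
]
ranked = rank(resources, "search query")
print(ranked)  # ['b.example', 'a.example']
```

In this sketch the third resource is never returned, however high its importance, because it contains none of the query terms.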
Today's large data centers manage collections of data comprising billions of data items. In such large collections, searching for particular items that meet the conditions of a given search query is a task that consumes a significant amount of computing resources. It also takes a noticeable amount of time, even on the most powerful multiprocessor computer systems. In many applications, search query response time is critical, either because of specific technical requirements or because of the high expectations of human users. Various conventional methods are used to reduce search query execution time.
Typically, in building a search-efficient data collection management system, data items are indexed according to some or all of the possible search terms that may be contained in search queries. An “inverted index” of the data collection is created (and maintained and updated) by the system for use in the execution of search queries. An inverted index comprises a number of “posting lists”. Each posting list corresponds to a search term and contains references to the data items that include that search term (or that otherwise satisfy some other condition expressed by the search term). For example, if the data items are text documents, as is often the case for Internet search engines, then the search terms are individual words (and/or some of their most frequently used combinations), and the inverted index has one posting list for every word that has been encountered in at least one of the documents. In another example, the data collection is a database comprising one or more very long tables. The data items are individual records (i.e., the lines in a table) having a number of attributes represented by values in the appropriate columns of the table. The search terms are specific attribute values, or other conditions or attributes. The posting list for a search term is a list of references (indexes, ordinal numbers) to the records that satisfy the search term.
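As an illustration only, the following minimal Python sketch builds an inverted index for a small collection of text documents. The function name `build_inverted_index`, the whitespace tokenization, and the sample documents are assumptions made for this example and are not taken from any particular system.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each search term to a sorted posting list of document ids.

    `documents` is a dict of {doc_id: text}. Tokenization is a
    simplifying assumption: lowercase, then split on whitespace.
    """
    postings = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    # Sorted lists make later merging of posting lists straightforward.
    return {term: sorted(ids) for term, ids in postings.items()}


# Hypothetical sample collection.
documents = {
    1: "inverted index of posting lists",
    2: "search query terms",
    3: "posting lists answer a search query",
}
index = build_inverted_index(documents)
print(index["posting"])  # [1, 3]
print(index["query"])    # [2, 3]
```

Each key of `index` plays the role of a search term, and each value is the posting list of references to the documents containing that term.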
To speed up execution of search queries, the inverted index is typically stored in a fast access memory device (e.g., RAM) of one or more computer systems, while the data items themselves are stored on larger but slower storage media (e.g., on magnetic or optical disks or other similar large capacity devices). In this way, the processing of a search query will involve searching through one or more posting lists of the inverted index in the fast access memory device rather than through the data items themselves (in the slower access storage device). This generally allows search queries to be performed at a much higher speed.
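The following Python sketch illustrates this division of labor under simplifying assumptions: a dictionary held in memory stands in for the inverted index, and a second dictionary accessed through a hypothetical `load_item` callback stands in for the slower storage device. The names `intersect` and `execute_query`, and the treatment of a query as a conjunction of terms, are choices made for this example only.

```python
def intersect(a, b):
    """Merge two sorted posting lists, keeping ids present in both."""
    i, j, out = 0, 0, []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def execute_query(terms, index, load_item):
    """Resolve a conjunctive query using only the in-memory index,
    then fetch just the matching items from slower storage."""
    if not terms:
        return []
    # Start from the shortest list so intermediate results stay small.
    lists = sorted((index.get(term, []) for term in terms), key=len)
    matches = lists[0]
    for posting_list in lists[1:]:
        matches = intersect(matches, posting_list)
    return [load_item(doc_id) for doc_id in matches]


# Hypothetical in-memory index and a dict standing in for disk storage.
index = {"posting": [1, 3], "search": [2, 3], "query": [2, 3]}
slow_store = {1: "doc one", 2: "doc two", 3: "doc three"}
fetched = []


def load_item(doc_id):
    fetched.append(doc_id)  # record which items were read from "disk"
    return slow_store[doc_id]


results = execute_query(["posting", "query"], index, load_item)
print(results)  # ['doc three']
print(fetched)  # [3]
```

Only the single matching item is read from the slow store; the other data items are never touched while the query is resolved.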
Given the volume of information available via the Internet, and the lack of consistency between various resources, it is not always easy for a user to formulate search query terms that will quickly and easily provide the information that the user seeks. Further, in many cases a resource of interest to the user is not associated directly with the search terms in the search query or with search query suggestions. A highly relevant page may not be included in the posting lists for the search query and thus cannot be found by means of the usual inverted index. For example, a highly relevant document may be a web resource containing only a picture or diagram, with no textual reference to the search query (for example, in its URL, name, etc.).
There is a need to improve upon existing search engine technology in order to provide more complete search results and a more satisfactory search experience to users.
U.S. Patent Application Publication No. 2007/0038608 describes a computer search system for improved web page ranking and presentation based on additional information related to the contents of the retrieved documents. The additional information is directly related to the contents of the retrieved web pages but does not appear on the retrieved web pages and/or in the link structure. The new search system searches a conventional web page collection together with databases containing publications and semantic web data, which provide the aforesaid additional information. This concept-related information is then used in determining the final page rank, which results in more relevant and objective page ranking.
U.S. Patent Application Publication No. 2013/0132381 describes a system for tagging entities with descriptive phrases. A plurality of description phrases associated with a first domain may be determined, based on an analysis of a first plurality of documents to determine co-occurrences of the description phrases with one or more name labels associated with the first domain. An entity associated with the first domain may be obtained. An analysis of a second plurality of documents may be initiated to identify co-occurrences of mentions of the obtained entity and one or more of the plurality of description phrases, and contexts associated with each of the co-occurrences of the mentions and description phrases, in each one of the second plurality of documents. A description tag association between the obtained entity and one of the description phrases may be determined, based on analysis of the identified contexts.
U.S. Pat. No. 8,095,538 teaches an annotation index system and method. It describes a method, performed on a computer system, of encoding annotations in an inverted list structure for information retrieval, including collecting a group of documents and storing them in a digital format, determining a group of annotations referencing the group of documents, and forming a snippet index by grouping the group of annotations by unique annotation identifier. The method also includes forming a snippet dictionary which, for each unique annotation identifier, indexes a corresponding position in the snippet index for the group of annotations having that unique annotation identifier.
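One possible reading of the snippet index and snippet dictionary described above is sketched below in Python. This is an interpretation for illustration and not the method of the cited patent, which may define these structures differently. The representation of an annotation as an `(annotation_id, doc_id)` pair, the function names, and the `(start, count)` dictionary entries are all assumptions.

```python
def build_snippet_index(annotations):
    """Group annotations by identifier into one flat list.

    `annotations` is an iterable of (annotation_id, doc_id) pairs
    (an assumed representation). Returns the flat snippet index and a
    dictionary mapping each annotation_id to (start, count), i.e. the
    position of its group inside the snippet index.
    """
    snippet_index = []
    positions = {}
    # Sorting places all entries with the same identifier side by side.
    for annotation_id, doc_id in sorted(annotations):
        if annotation_id not in positions:
            positions[annotation_id] = [len(snippet_index), 0]
        snippet_index.append(doc_id)
        positions[annotation_id][1] += 1
    snippet_dictionary = {key: tuple(value) for key, value in positions.items()}
    return snippet_index, snippet_dictionary


def lookup(annotation_id, snippet_index, snippet_dictionary):
    """Return the documents referenced by one annotation identifier."""
    start, count = snippet_dictionary.get(annotation_id, (0, 0))
    return snippet_index[start:start + count]


# Hypothetical annotations referencing documents 2, 3, 7 and 9.
annotations = [("person", 7), ("place", 2), ("person", 3),
               ("place", 9), ("date", 3)]
snippet_index, snippet_dictionary = build_snippet_index(annotations)
print(snippet_index)                                          # [3, 3, 7, 2, 9]
print(snippet_dictionary["person"])                           # (1, 2)
print(lookup("person", snippet_index, snippet_dictionary))    # [3, 7]
```

In this reading, the dictionary lets a lookup jump directly to the group for one annotation identifier without scanning the rest of the snippet index.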