Computer users use search engines to retrieve information that meets specific criteria from information stored on a computer system. For example, computer users may use search engines to search for information on the World Wide Web, on a corporate network, or on a personal computer. Typically, a user provides a search term, which is one or more words or a phrase, to the search engine and requests that the search engine conduct a search for documents containing the search term. Depending on the search term provided, the information returned by the search engine could be voluminous. Consequently, most search engines provide the user with relevance rankings of all the information returned. The relevance rankings aid the user in determining which results to view to get the information the user needs.
Current searching technologies are represented by monolithic general-purpose search services that are based on broad-brush assumptions, which are typically derived from mass-market statistics about the information needs of individuals. Also, the current technologies attempt to personalize searching by collecting and maintaining personal data about users in central locations. Personal data stored in this way is vulnerable to unauthorized use. The current technology provides search results based upon the personal data and the mass-market statistics. More specifically, the current technology relies upon linguistics and semantics, using algorithms that attempt to match search terms to documents by construing meaning from context.
Current technology incompletely indexes the data or documents that are to be searched. General-purpose search engines typically use the same basic approach to building an index entry for every document they include in their search universe. However, different engines use different assumptions and compromises in building their indexes. The assumptions determine what is left out of the index in order to keep the size of the index small. Typical search engines include a list of stop words, that is, words that are very common in the documents being indexed. Stop words are not indexed. Typical stop words include most pronouns, articles, and prepositions, as well as other high-frequency words. For example, in a database of patent documents, the word ‘patent’ may be a stop word.
The use of stop words is problematic for two reasons. The first reason is that a stop word may have more than one meaning, with one meaning being very common and another meaning being a suitable search term. In keeping with the above patent example, a document discussing ‘patent leather shoes’ would not have the word ‘patent’ indexed. Thus, a user searching for such a document would not be readily able to find it. The second reason is that functional words, e.g., articles, pronouns, and prepositions, form the structure of language. By using these functional words as stop words, search engines cannot apply any kind of grammatical analysis to the index. Current search engines may try to parse phrases to maintain some context by defining a tree that links nouns and verbs together. However, current linguistics programs that use such natural language processing (NLP) parsing are only about 65% accurate.
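The first problem can be illustrated with a minimal Python sketch of an inverted index that skips stop words. The stop-word list, the two sample documents, and the function names `build_index` and `search` are assumptions made for illustration only; they do not describe any particular search engine.

```python
from collections import defaultdict

# Hypothetical stop-word list for a patent database (illustrative only).
STOP_WORDS = {"a", "an", "the", "of", "for", "in", "on", "it", "patent"}


def build_index(docs, stop_words=STOP_WORDS):
    """Map each non-stop word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            word = word.strip(".,;:'\"")
            # Stop words are dropped here, so they never reach the index.
            if word and word not in stop_words:
                index[word].add(doc_id)
    return index


def search(index, term):
    """Return the sorted ids of documents indexed under the given term."""
    return sorted(index.get(term.lower(), set()))


# Hypothetical sample documents.
DOCS = {
    1: "A patent for a search engine",
    2: "Care of patent leather shoes",
}
INDEX = build_index(DOCS)

print(search(INDEX, "patent"))   # no hits: 'patent' was never indexed
print(search(INDEX, "leather"))  # document 2 is still reachable via 'leather'
```

In this sketch a search for ‘patent’ returns nothing, even though document 2 uses the word in its less common sense, and the dropped functional words leave no trace from which grammatical structure could be recovered.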
Current indexing techniques also include indexing a metadata tag associated with a document rather than the document itself. The metadata tag typically comprises information such as document type, title, author, date, metadata, XML objects, other specific context information, etc. Because the metadata tag describes the document without containing its text, forming an index from the metadata tag rather than the document greatly limits the accuracy of searches.
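The limitation of metadata-only indexing can be sketched as follows. The sample document, its metadata fields, and the helper names `tokens`, `metadata_terms`, and `full_text_terms` are hypothetical and serve only to contrast the two sets of indexable terms.

```python
def tokens(text):
    """Split text into a set of lowercase words with punctuation trimmed."""
    return {w.strip(".,;:").lower() for w in text.split() if w.strip(".,;:")}


# Hypothetical document with a metadata tag and a body.
DOCUMENT = {
    "metadata": {
        "type": "report",
        "title": "Quarterly summary",
        "author": "J. Smith",
    },
    "body": "Sales of patent leather shoes rose sharply in the third quarter.",
}


def metadata_terms(doc):
    """Terms available when only the metadata tag is indexed."""
    terms = set()
    for value in doc["metadata"].values():
        terms |= tokens(value)
    return terms


def full_text_terms(doc):
    """Terms available when the document body is indexed as well."""
    return metadata_terms(doc) | tokens(doc["body"])


print("shoes" in metadata_terms(DOCUMENT))   # False
print("shoes" in full_text_terms(DOCUMENT))  # True
```

Under these assumptions, a search for ‘shoes’ cannot reach the document through the metadata index, although the word appears in the body.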
Another current indexing technique is to build a taxonomy of the database to be searched. A taxonomy is a hierarchy or decomposition of the documents that relates them to each other. In other words, a taxonomy parses elements of a group into subgroups that are mutually exclusive, unambiguous, and, as a whole, include all possibilities. For example, the accepted biological taxonomy of living things is kingdom, phylum, class, order, family, genus, species. One problem with taxonomies, especially in technology, is that a taxonomy typically requires between 6 months and 18 months to complete for a typical database. Consequently, the taxonomy is obsolete or out-of-date when completed. Also, the hierarchy of the taxonomy limits the searching of the database by requiring searches to conform to the taxonomy, which reduces the accuracy of a search.
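The way a taxonomy constrains searching can be sketched with a small nested hierarchy in which each document is filed under exactly one leaf. The category names, the sample documents, and the functions `docs_under` and `search_in_branch` are assumptions for illustration, not a description of any real classification scheme.

```python
# Hypothetical taxonomy: each document id appears under exactly one leaf,
# so the subgroups are mutually exclusive.
TAXONOMY = {
    "technology": {
        "software": {"search engines": [1, 2]},
        "hardware": {"storage": [3]},
    },
}

# Hypothetical document texts keyed by id.
DOCS = {
    1: "ranking web search results",
    2: "building an index",
    3: "disk index structures for fast search",
}


def docs_under(node):
    """Collect every document id filed beneath a taxonomy node."""
    if isinstance(node, list):
        return list(node)
    found = []
    for child in node.values():
        found.extend(docs_under(child))
    return found


def search_in_branch(path, term):
    """Search only the documents filed under the given category path."""
    node = TAXONOMY
    for name in path:
        node = node[name]
    return [d for d in docs_under(node) if term in DOCS[d].split()]


print(search_in_branch(["technology", "software"], "index"))  # [2]
print(search_in_branch(["technology"], "index"))              # [2, 3]
```

A search confined to the software branch misses document 3, which mentions ‘index’ but is filed under hardware. This is one way in which conforming to the hierarchy can reduce accuracy.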
When a user enters a list of words to initiate a search, these search engines attempt to achieve the “best match” between the search term and the index of the documents. The results are displayed to the user as a ranked list. Different search engines use different techniques to rank the results. One common manner is to rank the results based on the popularity of each hit in the result list. Sites or documents that are used more often would rank higher than those used less often. Another manner is to rank the results based on cites or links, whereby a document that is linked or cited more often in other documents would be ranked higher than a document with fewer links or cites. A further manner is ranking by opinion, where documents or sites that are subjectively rated as influential would be ranked higher than those that are not. A still further manner is by payment, where sites that have paid fees to the search engine are ranked higher than those that have not.
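The four ranking manners described above can be compared in a short sketch that sorts the same hits by different keys. The `Hit` fields, the sample values, and the `rank` helper are hypothetical; real search engines combine many more signals.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    name: str
    visits: int         # popularity: how often the site or document is used
    inbound_links: int  # number of links or cites from other documents
    rating: float       # subjective opinion score
    paid: bool          # whether a fee was paid to the search engine


def rank(hits, key):
    """Return hit names ordered from highest to lowest by the given key."""
    return [h.name for h in sorted(hits, key=key, reverse=True)]


# Hypothetical result list for a single query.
HITS = [
    Hit("a", visits=900, inbound_links=3, rating=4.8, paid=False),
    Hit("b", visits=100, inbound_links=40, rating=2.0, paid=False),
    Hit("c", visits=50, inbound_links=5, rating=3.0, paid=True),
]

by_popularity = rank(HITS, key=lambda h: h.visits)
by_links = rank(HITS, key=lambda h: h.inbound_links)
by_opinion = rank(HITS, key=lambda h: h.rating)
by_payment = rank(HITS, key=lambda h: h.paid)

print(by_popularity)  # ['a', 'b', 'c']
print(by_links)       # ['b', 'c', 'a']
print(by_opinion)     # ['a', 'c', 'b']
print(by_payment)     # ['c', 'a', 'b']
```

With these sample values, each manner of ranking puts the same three hits in a different order, and none of the orderings depends on how well a hit matches the user's actual need.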