The present invention relates to computer search mechanisms, and more particularly to computer searching mechanisms that search indexes.
Search engines are remote access programs that enable users to search for documents from a body of information (e.g., a database of documents or the Internet). Typically, a search engine searches a database for specific key words and retrieves a list of documents that contain the key words. Search engines can use algorithms to create indexes such that, ideally, only meaningful results are returned for each query. The indexes are arrangements or outlines of topics listed in a rational order.
There are multiple query styles commonly used by search engines. For example, topic-based queries (e.g., "Mexico"), topic/subtopic queries (e.g., "Mexico Cancun"), and Boolean queries (e.g., "Mexico OR Cancun") are commonly used query styles. Savvy users can build their own Boolean queries and can also use quotation marks to build literal strings with spaces. Search engines often interpret these query styles ineffectively, retrieving documents that are too broad for the user's purpose or irrelevant to the user. In addition, the topic/subtopic query can be unpredictable because some search engines will do an "AND" search, and some search engines will do an "OR" search.
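The unpredictability of the topic/subtopic query can be illustrated with a minimal sketch. The corpus `DOCS` and the function `search` are hypothetical names introduced only for this illustration; they do not describe any particular search engine.

```python
# Hypothetical toy corpus: document id -> text (illustration only).
DOCS = {
    1: "Mexico travel guide",
    2: "Cancun beaches and hotels",
    3: "Cancun is a resort city in Mexico",
}


def search(query, mode):
    """Return sorted ids of documents matching the query words.

    mode "AND" requires every query word to appear in the document;
    any other mode is treated as "OR" (at least one word appears).
    """
    terms = [t.lower() for t in query.split()]
    combine = all if mode == "AND" else any
    return sorted(
        doc_id
        for doc_id, text in DOCS.items()
        if combine(t in text.lower().split() for t in terms)
    )


and_hits = search("Mexico Cancun", "AND")  # only documents containing both words
or_hits = search("Mexico Cancun", "OR")    # documents containing either word
```

Under these assumptions the same two-word query returns one document with the "AND" interpretation and all three with the "OR" interpretation, which is the inconsistency described above.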
There is a need to better search and retrieve relevant documents using the above query styles. In addition, there is a need to better search and retrieve relevant documents using more complicated query styles, such as sentence-based queries (e.g., "I need info on the history of Mexico"), question queries (e.g., "Who is the president of Mexico?"), and essay question queries (e.g., "What are the significant events that led to the formation of Mexico?"). Search engines do not effectively understand these query styles because search engines limit searches by the words literally appearing in the query (e.g., "what," "the," "that"). Natural language processors (NLPs) have been helpful because they can help identify key words in these types of queries. However, NLPs are not able to prioritize the key words. In addition, if all important key words cannot be matched, there is no thoughtful mechanism for searching a reduced or simpler form of the query.
There is an additional need in the prior art to more effectively search for results for content queries. Requests for a specific type of content (e.g., "I want pictures of Mexico") require interpreting the query in two ways. First, the topic (i.e., "Mexico") must be identified. Second, the type of content desired (i.e., "pictures") must be identified. Content types can include pictures, maps, news magazine articles, and sounds, and can be described in any number of ways by users.
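The two-way interpretation of a content query can be sketched as follows. The synonym table, the stop-word list, and the function `interpret_content_query` are assumptions made for illustration; a practical system would rely on a far richer vocabulary and on an NLP rather than on simple word lists.

```python
# Hypothetical synonym table: the many ways users may name a content type.
CONTENT_TYPE_SYNONYMS = {
    "pictures": {"picture", "pictures", "photo", "photos", "image", "images"},
    "maps": {"map", "maps"},
    "sounds": {"sound", "sounds", "audio"},
}

# Hypothetical list of words that carry neither topic nor content type.
STOP_WORDS = {"i", "want", "need", "of", "the", "a", "on", "for"}


def interpret_content_query(query):
    """Split a query into (topic, content_type); content_type may be None."""
    words = [w.strip("?.,!").lower() for w in query.split()]
    content_type = None
    topic_words = []
    for word in words:
        if not word:
            continue
        matched = next(
            (ctype for ctype, names in CONTENT_TYPE_SYNONYMS.items() if word in names),
            None,
        )
        if matched is not None and content_type is None:
            content_type = matched      # first content-type word wins
        elif word not in STOP_WORDS:
            topic_words.append(word)    # everything else is treated as topic
    return " ".join(topic_words), content_type


topic, content_type = interpret_content_query("I want pictures of Mexico")
```

In this sketch the query "I want pictures of Mexico" and the query "photos of Mexico" resolve to the same topic and the same content type, even though the users described the content differently.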
Once a query is understood, there is a need to more effectively search a body of information, often in the form of a database. In the prior art, the body of information may be a full text database consisting of all the target content or a key word database associated with the key words of the target content. Searching a full text database or a key word database often produces results that are too numerous, irrelevant, and disorderly to be useful without extensive post-search processing. In addition, searching a key word database is limited by the number of key words that exist and their unstructured nature. To find a match, queries must match a key word literally or match the key words found using NLPs. Users must anticipate the limited set of key words under which the content is listed.
In order to help users search databases, some search engines have allowed users to navigate an outline or hierarchical index to find the specific information they want. Although this option is useful, the outlines and hierarchical indexes have been complicated and have defied current user expectations that they should be able to ask a question and get relevant answers.
In light of the above limitations, there is a need for a search engine that better understands multiple query styles. Once the query is understood, there is a need for a search engine that more effectively searches a body of information. There is also a need for a search engine that presents the matched information in a way that is easily understood by the user, and ranks and sorts the matches according to their relevancy.
The present invention can solve the above problems by providing a search engine to better match user requests for information. The search engine allows users to search and retrieve information from a body of information, such as a database. It can lead users with general or specific queries to general or specific content in the body of information. Users can be directed to general information, such as the start of a long article, or to specific content within that article. An article outline and related articles can also be navigated. An effective process can search multiple query styles and can find relevant matches. It can analyze the user's query to determine its most important and less-important elements. Users can form their queries in an ad-hoc, free-form manner and still get relevant results. Queries can also be processed in a way that allows for quick results and an efficient use of server resources.
This novel treatment of hierarchical index data can be combined with an NLP to provide more accurate and detailed access to indexed content. For example, the body of information to be searched can be compiled in such a way that searches can be limited to relevant information. User queries can be analyzed in a way that determines the most-important and least-important elements by means of prioritized clustered tokens. Tokens can consist of a word or multiple words recognized as one entity. The NLP can recognize the important tokens in the query. Clustered tokens can be created by adding to each token a family of related or alternative words and phrases, called word clusters. The clustered tokens can be summarized and combined in a single content catalog of indexes, called a lookup table. Prioritized clustered tokens can be created by prioritizing the clustered tokens according to priority rules that utilize the NLP to identify the importance of key words.
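The construction of prioritized clustered tokens can be sketched in a few lines. This is a minimal illustration under stated assumptions: a stop-word filter stands in for the NLP, and the tables `WORD_CLUSTERS` and `PRIORITY`, the class `ClusteredToken`, and the function `prioritized_clustered_tokens` are hypothetical names, not the priority rules of the invention.

```python
from dataclasses import dataclass

# Stand-in for the NLP: words that are not treated as important tokens.
STOP_WORDS = {"i", "need", "info", "on", "the", "of", "what", "that", "is", "a"}

# Hypothetical word clusters: related or alternative words for each token.
WORD_CLUSTERS = {
    "mexico": ["mexican"],
    "history": ["past", "origins"],
}

# Hypothetical priority rule: a lower number means a more important token.
PRIORITY = {"mexico": 0, "history": 1}


@dataclass
class ClusteredToken:
    token: str      # the token recognized in the query
    cluster: list   # the token followed by its word cluster
    priority: int   # lower is more important


def prioritized_clustered_tokens(query):
    """Tokenize, attach word clusters, and order by priority (most important first)."""
    words = [w.strip("?.,!").lower() for w in query.split()]
    tokens = [w for w in words if w and w not in STOP_WORDS]
    clustered = [
        ClusteredToken(t, [t] + WORD_CLUSTERS.get(t, []), PRIORITY.get(t, len(PRIORITY)))
        for t in tokens
    ]
    # sorted() is stable, so tokens of equal priority keep their query order.
    return sorted(clustered, key=lambda c: c.priority)


result = prioritized_clustered_tokens("I need info on the history of Mexico")
```

Under these assumptions the sentence-based query reduces to two clustered tokens, with the topic token ordered ahead of the subtopic token.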
Where matches for all important words of the query cannot be found, less important prioritized clustered tokens can be cut from the query, and the query search can then be repeated using the more important prioritized clustered tokens. The matched information can be ranked and sorted according to relevancy by taking advantage of the knowledge of which prioritized clustered tokens are the most important. A tight feedback loop can enable designers to understand what users want and monitor on-going changes in user information needs.
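The repeated search with a reduced query can be sketched as a simple loop. The index layout (a mapping from document id to the set of words in that document) and the function `search_with_fallback` are assumptions made for this illustration only.

```python
def search_with_fallback(clusters, index):
    """Search with all clustered tokens, cutting the least important until a match.

    clusters: list of word clusters (each a list of alternative words),
              ordered from most important to least important.
    index:    hypothetical mapping of document id -> set of words in the document.
    Returns (sorted matching ids, number of clustered tokens actually used).
    """
    remaining = list(clusters)
    while remaining:
        matches = [
            doc_id
            for doc_id, words in index.items()
            # A document matches when every remaining cluster is
            # satisfied by at least one of its alternative words.
            if all(any(alt in words for alt in cluster) for cluster in remaining)
        ]
        if matches:
            return sorted(matches), len(remaining)
        remaining.pop()  # cut the least important clustered token and retry
    return [], 0


INDEX = {
    "a": {"mexico", "geography"},
    "b": {"mexican", "history"},
}
QUERY = [["mexico", "mexican"], ["history"], ["president"]]
hits, used = search_with_fallback(QUERY, INDEX)
```

In this sketch no document satisfies all three clustered tokens, so the least important one is cut and the search is repeated with the remaining two. The number of tokens used is returned alongside the matches because it could serve as one input to relevancy ranking.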
The present invention can include three main segments: the Index Databases, the Run Time Search Component Object Module ("Search COM"), and the Active Server Page User Interface ("ASP UI"). The Index Databases can include a searchable database containing indexes from a plurality of information sources. The Search COM can be a search component that searches for search terms in the queries. The ASP UI can receive search terms from a user of the computer system.
The Index Databases can include a ContentBuild Database, which collects the various indexes and puts them in a searchable database. There can be numerous indexes or fields in the ContentBuild Database. The ContentBuild Database can include a Full Text Index that is used for performing full text searches. The ContentBuild Database can also include a WordWheel, which is a lookup table. The lookup table consists of rows and columns of data. The lookup table is examined either horizontally or vertically, and the data that is sought is retrieved.
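A lookup table of this kind can be pictured with a minimal sketch. The column names, the sample rows, and the function `lookup` are hypothetical and are not the actual WordWheel layout.

```python
# Hypothetical columns and rows of a small lookup table (illustration only).
WORDWHEEL_COLUMNS = ("entry", "article", "offset")
WORDWHEEL_ROWS = [
    ("mexico", "Mexico", 0),
    ("mexico, history of", "Mexico", 3),
    ("cancun", "Cancun", 0),
]


def lookup(entry, column):
    """Find the row whose entry matches, then read the requested column."""
    col = WORDWHEEL_COLUMNS.index(column)   # select the column
    key = entry.strip().lower()
    for row in WORDWHEEL_ROWS:              # scan down the rows
        if row[0] == key:
            return row[col]
    return None                             # no row with that entry


article = lookup("Mexico, history of", "article")
offset = lookup("Mexico, history of", "offset")
```

The row is selected by its entry and the sought value is read from the chosen column, which is one simple reading of examining the table along rows and columns.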
The Search COM can include the ResultsList, the Exact Match Search, the NLP, and the Full Text Search. The ResultsList is a results database that can hold all the matches or results from the search. The Exact Match Search can search for an exact match to the query. The NLP can be used for syntactic and semantic analysis of English sentences. The Full Text Search can be responsible for doing a search given a query and returning a weighted set of results.
The user can enter the query in a browser that sends the request to the Web server, where the ASP UI retrieves it. The query can then be processed into prioritized clustered tokens using the NLP and token priority rules. The ResultsList can then be emptied. The Exact Match Search can next be performed in the WordWheel using the original query (not the tokens) to determine whether the original query exactly matches any entries in the WordWheel. The Full Text Search can next be performed in the ContentBuild Database using the prioritized clustered tokens. The matches can then be sorted on offsets (a scoring criterion that recognizes the explicit hierarchy of index entries) and the matches can be moved to the ResultsList. The matches can be displayed in the ResultsList in a prioritized order on the ASP UI.
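The sequence of steps above can be sketched end to end. This is an illustration under stated assumptions rather than the actual Search COM: the data layouts, the function `run_search`, and the convention that a smaller offset ranks higher in the hierarchy are all hypothetical.

```python
def run_search(query, tokens, wordwheel, content):
    """Sketch of the search flow: exact match, full text search, sort on offsets.

    query:     the original query string as typed by the user
    tokens:    tokens produced from the query (most important first)
    wordwheel: hypothetical mapping of entry -> (offset, title)
    content:   hypothetical list of (title, offset, text) records
    """
    results_list = []                        # the ResultsList starts empty
    matches = set()

    # Exact Match Search uses the original query, not the tokens.
    exact = wordwheel.get(query.strip().lower())
    if exact is not None:
        matches.add(exact)

    # Full Text Search uses the tokens.
    for title, offset, text in content:
        words = text.lower().split()
        if all(tok in words for tok in tokens):
            matches.add((offset, title))     # the set removes duplicates

    # Sort on offsets (assumed: smaller offset = higher in the hierarchy),
    # then move the matches to the ResultsList.
    results_list.extend(sorted(matches))
    return results_list


WORDWHEEL = {"mexico": (0, "Mexico")}
CONTENT = [
    ("Cancun", 5, "cancun resort in mexico"),
    ("Mexico: History", 2, "history of mexico"),
    ("Mexico", 0, "mexico overview"),
]
results = run_search("Mexico", ["mexico"], WORDWHEEL, CONTENT)
```

In this sketch the main article is found by both the exact match and the full text search but appears only once, and the remaining matches follow in offset order.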