1. Field of the Invention
The present invention is directed toward the field of information search and retrieval systems, and more particularly to processing a free format query to determine conjunctives between terms of the free format search query.
2. Art Background
An information retrieval system attempts to match user queries (i.e., the user's statement of information needs) to information available to the system. In general, the effectiveness of information retrieval systems may be evaluated in terms of many different criteria, including execution efficiency, storage efficiency, and retrieval effectiveness. The most common measures used are "recall" and "precision." Recall is defined as the ratio of the number of relevant documents retrieved for a given query over the number of relevant documents for that query available in the repository of information. Precision is defined as the ratio of the number of relevant documents retrieved over the total number of documents retrieved. Both recall and precision are measured with values ranging between zero and one. An ideal information retrieval system has both recall and precision values equal to one.
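The two measures defined above can be illustrated with a minimal sketch. The function name, the document identifiers, and the convention of returning zero when a denominator is empty are assumptions made for illustration and are not taken from the specification.

```python
def recall_and_precision(retrieved, relevant):
    """Return (recall, precision) for a single query.

    retrieved: identifiers of the documents the system returned.
    relevant:  identifiers of all relevant documents in the repository.
    """
    retrieved = set(retrieved)
    relevant = set(relevant)
    hits = retrieved & relevant  # relevant documents that were retrieved

    # Assumption: report 0.0 when a denominator would be zero.
    recall = len(hits) / len(relevant) if relevant else 0.0
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    return recall, precision


# Hypothetical query: four documents retrieved, three relevant overall.
recall, precision = recall_and_precision(
    retrieved=["d1", "d2", "d3", "d4"],
    relevant=["d2", "d4", "d7"],
)
```

In this hypothetical case two of the three relevant documents are retrieved, so recall is 2/3, and two of the four retrieved documents are relevant, so precision is 1/2.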
A user, in order to locate information in the system, specifies the type of information sought in the form of a query. The query consists of one or more terms that the user believes best express the information sought. For example, if a user seeks information regarding "the outbreak of Hepatitis in North America", the user may formulate the query "Hepatitis and North America." Typically, a formal query language requires the user to express the idea regarding the information sought within rigid parameters. A formal query language, such as the structured query language (SQL), sets forth parameters and format requirements for the query. For example, information retrieval systems typically use Boolean operators to specify the connections between two or more words in the query. If the user inputs the query "Hepatitis and North America" and the word "and" is interpreted as a Boolean AND operation, then the information retrieval system retrieves all documents that contain subject matter on both Hepatitis and North America. Using a formal query language permits a one-to-one correspondence between the user's expression of the query and the interpretation of the query by the information retrieval system. Although formal query languages reduce the ambiguity between the user's expression and the interpretation by the system, they are rigid and require the user to learn the semantics of the query language. Accordingly, it is desirable to permit a user of an information retrieval system to submit a query in any form.
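The Boolean AND interpretation described above may be sketched as an intersection over a simple subject index. The index structure, the helper names `build_index` and `boolean_and`, and the sample documents are hypothetical and serve only to illustrate the behavior.

```python
from collections import defaultdict


def build_index(documents):
    """Map each subject (lower-cased) to the set of documents covering it."""
    index = defaultdict(set)
    for doc_id, subjects in documents.items():
        for subject in subjects:
            index[subject.lower()].add(doc_id)
    return index


def boolean_and(index, *terms):
    """Return the documents that cover every one of the given terms."""
    postings = [index.get(term.lower(), set()) for term in terms]
    return set.intersection(*postings) if postings else set()


# Hypothetical repository: document id -> subjects the document covers.
documents = {
    "doc1": ["Hepatitis", "North America"],
    "doc2": ["Hepatitis", "Europe"],
    "doc3": ["Influenza", "North America"],
}
index = build_index(documents)
result = boolean_and(index, "Hepatitis", "North America")
```

Under these assumptions only `doc1` covers both subjects, so it is the only document returned for the query "Hepatitis and North America."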
Generally, a query that expresses an idea to retrieve information that is not in a specific query language format is known as a free format query. Free format queries are also known as natural language queries. Free format queries contain an expression in a form of general human discourse. The user intends the query to be interpreted through a contextual semantic interpretation in the same way words are interpreted in general human discourse. For example, in a free format query, the user may formulate the query "Hepatitis in North America" to locate documents including subject matter on both Hepatitis and North America. In normal human discourse, although the word "and" was not used, it is apparent that the user seeks information on subject matter containing both Hepatitis AND North America. Even when "and conjunctions" and "or conjunctions" are used, humans do not use them in the same way as programming languages. For example, the user may express, as a free format query, the expression "red and green balls." This query example introduces an ambiguity as to whether the user seeks information regarding "red balls and green balls" or whether the user seeks information on balls that are both red and green. As is explained fully below, the present invention provides a link to bridge the gap between a contextual semantic interpretation of a free format query and a computer programming language interpretation of a formal query language.
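The ambiguity in "red and green balls" can be made concrete by evaluating both readings against a small collection. The collection and variable names below are hypothetical, and the sketch only contrasts the two interpretations; it does not show how the invention chooses between them.

```python
# Hypothetical collection: item id -> set of colors the ball has.
balls = {
    "b1": {"red"},
    "b2": {"green"},
    "b3": {"red", "green"},
    "b4": {"blue"},
}

wanted = {"red", "green"}

# Reading 1: "red balls and green balls", i.e. red OR green.
either_color = {b for b, colors in balls.items() if colors & wanted}

# Reading 2: balls that are both red AND green.
both_colors = {b for b, colors in balls.items() if wanted <= colors}
```

Here the first reading matches `b1`, `b2`, and `b3`, while the second matches only `b3`, so the choice of logical connector changes the result set.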
A search and retrieval system pre-processes an input query to map a contextual semantic interpretation, expressed by the user of the input query, to a Boolean logic interpretation for processing in the search and retrieval system. The search and retrieval system includes a knowledge base that comprises a plurality of categories. Subsets of the categories are each assigned to one of a plurality of groups. In one embodiment, the groups are based on dimensional categories, such that each dimensional category represents a discrete and independent concept from other dimensional categories in the knowledge base.
To pre-process the query, the search and retrieval system receives an input query comprising a plurality of terms, and processes the terms of the input query to identify value terms that comprise a content carrying capacity. The knowledge base is referenced to identify a group for each value term. A processed input query is generated by inserting an AND logical connector between two value terms if the two respective value terms are in different groups and by inserting an OR logical connector between two value terms if the two respective value terms are in the same group.
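The connector-insertion rule described above may be sketched as follows. The group table, the group names, and the function `connect_value_terms` are hypothetical. The sketch assumes that value terms have already been identified and that every value term has a group in the knowledge base; it applies the rule to adjacent value terms only, since operator grouping and precedence are not specified in this section.

```python
# Hypothetical knowledge base lookup: value term -> group.
KNOWLEDGE_BASE_GROUPS = {
    "hepatitis": "health",
    "influenza": "health",
    "north america": "geography",
    "europe": "geography",
}


def connect_value_terms(value_terms, groups=KNOWLEDGE_BASE_GROUPS):
    """Join value terms with AND (different groups) or OR (same group)."""
    if not value_terms:
        return ""
    parts = [value_terms[0]]
    for previous, current in zip(value_terms, value_terms[1:]):
        # Assumption: a missing term raises KeyError rather than being guessed.
        same_group = groups[previous.lower()] == groups[current.lower()]
        parts.append("OR" if same_group else "AND")
        parts.append(current)
    return " ".join(parts)


different_groups = connect_value_terms(["Hepatitis", "North America"])
same_group = connect_value_terms(["Hepatitis", "Influenza"])
```

With this hypothetical table, "Hepatitis" and "North America" fall in different groups and are joined by AND, whereas "Hepatitis" and "Influenza" share a group and are joined by OR.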
In one embodiment, the search and retrieval system includes a lexicon. The lexicon stores a plurality of terms and phrases, including information about the terms. During input query pre-processing, the lexicon is referenced to identify query terms as one of the phrases stored. If found, the phrase is processed as a single value term. In another embodiment, the user input query is processed to replace, where appropriate, prepositions and conjunctions. For this embodiment, the lexicon also identifies terms as AND preposition terms, AND conjunction terms, OR conjunction terms, and NOT conjunction terms. During query pre-processing, the lexicon is referenced to identify an input query term as an AND preposition term, an AND conjunction term, an OR conjunction term, or a NOT conjunction term. An AND logical Boolean connector is generated in lieu of the input query term if the input query term comprises an AND preposition term or an AND conjunction term. An OR logical Boolean connector is generated in lieu of the input query term if the input query term comprises an OR conjunction term, and a NOT logical Boolean connector is generated in lieu of the input query term if the input query term comprises a NOT conjunction term.
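The lexicon-driven steps of this embodiment may be sketched as a single pass over the query. The lexicon entries, the phrase list, and the function `preprocess` are hypothetical; in particular, which words the lexicon classifies as AND preposition terms or NOT conjunction terms is an assumption, and the sketch handles only two-word phrases for brevity.

```python
# Hypothetical lexicon: connector term -> logical Boolean connector.
CONNECTOR_TERMS = {
    "in": "AND",   # assumed AND preposition term
    "and": "AND",  # assumed AND conjunction term
    "or": "OR",    # assumed OR conjunction term
    "not": "NOT",  # assumed NOT conjunction term
}

# Hypothetical phrases stored in the lexicon (two words each).
PHRASES = {("north", "america"), ("south", "america")}


def preprocess(query):
    """Return a token list with phrases merged and connectors replaced."""
    words = query.lower().split()
    tokens = []
    i = 0
    while i < len(words):
        pair = tuple(words[i:i + 2])
        if pair in PHRASES:
            # A stored phrase is treated as a single value term.
            tokens.append(" ".join(pair))
            i += 2
        elif words[i] in CONNECTOR_TERMS:
            # Generate the logical connector in lieu of the query term.
            tokens.append(CONNECTOR_TERMS[words[i]])
            i += 1
        else:
            tokens.append(words[i])
            i += 1
    return tokens


tokens = preprocess("Hepatitis in North America")
```

Under these assumptions the query "Hepatitis in North America" yields the tokens `hepatitis`, `AND`, and `north america`, with the phrase kept together as one value term.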