The present invention relates to the field of information retrieval based on user input and more particularly to a system and method of retrieving, evaluating, and ranking data from a data set based upon a natural language query from the user.
Search engines are known in the art and are used for retrieving information from databases based on user-supplied input. However, for large information systems, current search tools fail to provide adequate solutions to the more complex problems which users often face. For example, many known search engines restrict users to providing key search terms which may be connected by logical connectors such as “and,” “or,” and “not.” This is often inadequate for complex user problems which are best expressed in a natural language domain. Using only keywords or boolean operations often results in a failure of the search engine to recognize the proper context of the search. This may lead to the retrieval of a large amount of information from the database that is often not very closely related to the user problem. Because known search engines do not sufficiently process the complexity of user input, it is often very difficult to obtain relevant, helpful documentation for a given complex problem using current online help facilities.
Another problem with current technologies used in document retrieval is that the user may find certain documents retrieved in a search more valuable than others, yet the user is not able to explicitly express a criterion of importance. In dealing with complex contexts, it is often difficult to specify a relevance criterion exactly and explicitly.
It is an object of the present invention to provide a search machine employing a generic or non-context specific approach for a tool that is capable of searching for information contained in documents of a database on the basis of a problem specification entirely stated in natural language.
It is a further object of the present invention to provide a search machine that is not restricted to a specific environment, such as database retrieval, but may also be used in various contexts, such as, for example, context-sensitive online help in complex working and information environments, retrieval of relevant information in tutor and advisory systems, decision support for the organization of information databases, and information agents which search to build up, organize and maintain new information databases.
A further object of the present invention is to provide a search machine that can locate the most relevant parts of text within a document. Thus, the search machine may present the most relevant part of the document to the user, or provide necessary information to indicate to the user which part of the document is the most relevant.
It is a further object of the present invention to provide a search machine that attaches significance values to words or word stems of a database and a query and uses the significance values to compare the query to the database.
Yet another object of the present invention is to provide a search machine that uses data relating to the frequency of occurrence of words or word stems in a document to determine a document's relevance to a query.
In the present invention, a search machine is provided that can accept a query that may be stated in the form of natural language. The search machine can reduce the natural language query into a vector of word stems and also can reduce the documents to be searched into vectors of word stems, where a word stem is a word or part of a word from which various forms of a word are derived. The search machine may analyze the vectors of word stems determining such factors as the frequency with which word stems occur in the query and database documents, the significance of the word stems appearing in the documents, and other comparison information between the query vector and the database document vectors, in order to determine the suitability of a database document to serve as a solution to the query.
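The reduction of a natural language query into a vector of word stems can be illustrated with a minimal sketch. It assumes a toy suffix-stripping stemmer and a vector that simply records how often each stem occurs; the names `SUFFIXES`, `stem`, and `stem_vector` are hypothetical and do not come from the invention as described, and a practical stemmer would use a much fuller rule set.

```python
import re
from collections import Counter

# Hypothetical suffix list (an assumption for illustration only).
SUFFIXES = ("ing", "ed", "es", "s")

def stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def stem_vector(text):
    """Map text to a Counter of word stems and their occurrence counts."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(stem(w) for w in words)

query_vector = stem_vector("Printing fails when the printer prints documents")
```

Here "Printing" and "prints" both reduce to the stem "print", so that stem is counted twice in `query_vector`, while "printer" remains a separate stem under this crude rule set.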
The search machine according to the present invention for retrieving information from a database based on a query of a user may include a lexicon generator for deriving a lexicon database of word stems from the documents of the database and from the query. It may further include an evaluation component that includes a document vectorizer for creating representation vectors for the documents of the database and a query representation vector for the query using the lexicon database. The document representation vectors contain data on the word stems located in the database and the query representation vector contains data on the word stems located in the query. The evaluation component may further include a vector rule base and a vector evaluator. The vector evaluator may derive a relevance value for a document representation vector relative to the query representation vector according to the vector rule base and output information from the database that relates to the query. The search machine may also include a fine-tuner for modifying the vector rule base. The user can provide external feedback about the information retrieved from the database, and the fine-tuner may use that feedback to modify the vector rule base.
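The interplay between the vector evaluator, the vector rule base, and the fine-tuner can be sketched as follows. The text above does not specify the form of the rules or how feedback changes them, so this sketch assumes, purely for illustration, that the rule base is a table of per-stem weights and that feedback nudges the weights of stems shared by the query and the judged document. The names `evaluate`, `fine_tune`, and `step` are hypothetical.

```python
from collections import Counter

def evaluate(doc_vec, query_vec, rule_base):
    """Weighted overlap of stems shared by document and query.

    Stems without an entry in the rule base get an assumed weight of 1.0.
    """
    return sum(
        rule_base.get(stem, 1.0) * min(count, doc_vec[stem])
        for stem, count in query_vec.items()
    )

def fine_tune(rule_base, doc_vec, query_vec, helpful, step=0.1):
    """Return a new rule base with shared-stem weights nudged by feedback."""
    tuned = dict(rule_base)
    for stem in query_vec:
        if doc_vec[stem] > 0:
            delta = step if helpful else -step
            tuned[stem] = max(0.0, tuned.get(stem, 1.0) + delta)
    return tuned

query_vec = Counter({"printer": 1, "jam": 1})
doc_vec = Counter({"printer": 2, "jam": 1, "paper": 1})
rules = {}
before = evaluate(doc_vec, query_vec, rules)
rules = fine_tune(rules, doc_vec, query_vec, helpful=True)
after = evaluate(doc_vec, query_vec, rules)
```

In this sketch, positive feedback raises the relevance value that the same document receives for the same query on the next evaluation, and negative feedback lowers it.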
The present invention is also directed to a method of retrieving documents from a database corresponding to a query that includes the steps of (i) deriving a lexical database of word stems from the documents of the database and the query; (ii) creating a representation vector corresponding to each document of the database and a query representation vector corresponding to the query, each representation vector containing information about the word stems of the lexical database that are contained in the document to which the representation vector corresponds, the query representation vector containing information about the word stems that are contained in the query document; (iii) evaluating each representation vector relative to the query representation vector using vector evaluation rules, for example, evaluating the similarity of the elements contained in vectors; (iv) creating output reflecting the evaluation of the representation vectors; and (v) presenting the output.
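Steps (i) through (v) can be sketched end to end under some stated assumptions. The stemmer here merely drops a trailing "s", the representation vectors hold raw stem counts, and cosine similarity stands in for the vector evaluation rules, which the text above leaves open. All function names are hypothetical.

```python
import math
import re
from collections import Counter

def stems(text):
    """Tokenize and crudely stem by dropping a trailing 's' (assumption)."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w[:-1] if w.endswith("s") and len(w) > 3 else w for w in words]

def build_lexicon(documents, query):
    """Step (i): collect every stem seen in the documents and the query."""
    lexicon = set(stems(query))
    for doc in documents:
        lexicon.update(stems(doc))
    return sorted(lexicon)

def vectorize(text, lexicon):
    """Step (ii): count occurrences of each lexicon stem in the text."""
    counts = Counter(stems(text))
    return [counts[s] for s in lexicon]

def cosine(a, b):
    """Step (iii): one possible evaluation rule, cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def rank(documents, query):
    """Steps (iv)-(v): score each document and return them best first."""
    lexicon = build_lexicon(documents, query)
    q = vectorize(query, lexicon)
    scored = [(cosine(vectorize(d, lexicon), q), d) for d in documents]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)

docs = [
    "How to install printer drivers",
    "Printer reports paper jams when printing",
    "Configuring network settings",
]
results = rank(docs, "printer paper jam")
```

With these sample documents, the one about paper jams shares three stems with the query and ranks first, while the network document shares none and receives a score of zero.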