1. Field of the Invention
This invention is concerned with an information retrieval apparatus and method which allow the user to retrieve information from databases, such as document databases, containing a plurality of documents.
2. Background of the Invention
Information retrieval systems serve to retrieve those documents that are relevant to the information needs of a user. With the explosive growth of the use of databases and the Internet, information retrieval increasingly fails to enable efficient retrieval of available information. The problem lies at both ends of the system. At one end, there is the ever-increasing number of documents that vary widely in content, format and quality. At the other end, there is a huge number of unknown users with extremely diverse needs, skills, and educational, cultural, and language backgrounds. Conventional search methods and apparatus are, however, not sophisticated enough to provide satisfactory solutions. The search capabilities of conventional search methods and apparatus are designed either for high recall and the “average user”, or for searches of high precision. Both approaches may fail to retrieve the desired information, although it is available within a database.
In general, the relevant information contained in the documents is constructed and extracted according to a normalized representation. This representation is abstracted away from its original linguistic form. Database queries of a user are generally subjected to a processing in order to expand the scope of the query and/or to interpret the query syntax. The extracted query information is then matched against the stored representations in order to retrieve specific information contained in the documents.
Those documents which are the most similar to the query are output as retrieved documents.
Different methods exist to find those documents relevant to the query. Statistical methods count the number of times each word of the query appears in each document. Documents in a database are ranked according to the obtained count values. If the query contains too few words, for example fewer than two or three, the counts may prove insufficient to find the documents relevant to the request.
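The statistical approach described above can be illustrated with a minimal sketch. The function name, the whitespace tokenization and the sample documents are assumptions made for illustration only; they are not taken from any particular conventional system.

```python
from collections import Counter

def rank_by_term_count(query, documents):
    """Rank document ids by how often the query words occur in each document."""
    query_words = query.lower().split()
    scores = {}
    for doc_id, text in documents.items():
        counts = Counter(text.lower().split())
        # Sum the occurrences of every query word in this document.
        scores[doc_id] = sum(counts[word] for word in query_words)
    # Highest count first; ties are broken by document id for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

# Hypothetical sample data.
docs = {
    "d1": "the cat sat on the mat",
    "d2": "cat food and cat toys",
    "d3": "dogs chase balls",
}
ranking = rank_by_term_count("cat toys", docs)
print(ranking)
```

With a one-word query such as "cat", the ranking rests on a single count per document, which illustrates why very short queries give such a method little to work with.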
Other approaches use a refined document preprocessing which is based on a deep parsing procedure applying a complex grammatical analysis to the documents to extract an entire sentence dependency structure. Such approaches generally require a huge computational effort without providing satisfactory results. As complex sentences are difficult to analyze, even a complete dependency analysis may only return several possible dependency structures for a single sentence. Other information retrieval systems expand the scope of a query by taking semantic relations of words into account. It has turned out that such an approach does not return better results.
For evaluating retrieval performance of information retrieval systems, two criteria are used, namely the “calling rate” and the “precision”. These criteria are based on a subjective view of the relevance of retrieved information. The “calling rate” or “recall” and the “precision” are defined as follows.
The calling rate or recall is the ratio of the number of pertinent documents retrieved to the total number of pertinent documents stored in the database. The precision is the ratio of the number of pertinent documents retrieved to the number of all documents retrieved. There is usually a trade-off between these two criteria. In information retrieval, it is desirable that both criteria approach the maximum value of one.
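As a worked illustration of the two definitions, the following sketch computes both ratios for a small example. The function name and the document identifiers are hypothetical.

```python
def recall_and_precision(retrieved, relevant):
    """Return (recall, precision) for a list of retrieved document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # pertinent documents that were retrieved
    recall = hits / len(relevant) if relevant else 0.0
    precision = hits / len(retrieved) if retrieved else 0.0
    return recall, precision

# Four documents are retrieved; three pertinent documents exist in the database.
recall, precision = recall_and_precision(
    retrieved=["d1", "d2", "d3", "d4"],
    relevant=["d1", "d2", "d5"],
)
print(recall, precision)
```

Here two of the three pertinent documents are retrieved (recall of two thirds), and two of the four retrieved documents are pertinent (precision of one half). Retrieving more documents could raise the recall while lowering the precision, which is the trade-off mentioned above.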
Most traditional information retrieval systems are optimized for longer queries and perform worse for short, more realistic queries. According to surveys made on the Internet, the average request comprises only a few words (mostly fewer than five words).
The present invention has been made in consideration of the above situation, and it is the primary object of the present invention to provide an improved method and an improved apparatus that retrieve information from a database.
It is a further object of the invention to provide a method and an apparatus for information retrieval that improve the ranking of retrieved documents.
It is another object of the invention to provide a method and an apparatus for information retrieval that push the most salient documents to the top of a list of retrieved documents.
It is still another object of the invention to provide a method and an apparatus that increase the proportion of relevant documents retrieved from a document database.
It is still another object of the present invention to provide a method and an apparatus that retrieve information from a database with a higher precision.
It is yet another object of the invention to provide a method and an apparatus that increase effectiveness of information retrieval.
These and other objects of the present invention may become apparent hereafter.
To achieve these objects, the present invention provides a method and an apparatus that combine the use of syntactic constructions with an enlargement of terms for documents and queries to improve precision and calling rate for information retrieval. The method for document retrieval of the present invention relates to databases comprising internal representations of documents wherein the internal representations include syntactic relations between terms of sentences of the documents and a semantic lattice for the terms of the documents in the database, the semantic lattice specifying semantic relations between the terms. The method comprises the step of extracting syntactic relations between terms of the query and creating an internal representation of the query based on the terms of the query and the extracted syntactic relations between the terms of the query. Further, the method appends new terms to the semantic lattice if the query includes terms not included in the semantic lattice in the database. The query is projected onto the documents in the database by comparing the internal representation and terms of the query to the internal representations and terms of the documents, using the semantic lattice for comparing the terms, and a similarity is computed between the query and each document. The documents in the database are ranked according to their computed similarities, and the documents are output as retrieved documents according to the established rank order.
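The sequence of steps described above may be sketched as follows. This is an illustrative simplification under stated assumptions, not the claimed implementation: the lattice is reduced to a mapping from each term to a single more general term, internal representations are plain dictionaries of terms and (relation, head, dependent) triples, and the similarity is simply the number of query terms and relations that find a counterpart in the document. All names and sample data are hypothetical.

```python
# Hypothetical lattice: each term maps to its more general parent term.
lattice = {"poodle": "dog", "dog": "animal", "cat": "animal", "animal": None}

def add_missing_terms(terms, lattice):
    """Append query terms unknown to the lattice (assumed to have no parent)."""
    for term in terms:
        lattice.setdefault(term, None)

def ancestors(term, lattice):
    """Return the term followed by all of its more general terms."""
    chain = []
    while term is not None:
        chain.append(term)
        term = lattice.get(term)
    return chain

def terms_match(query_term, doc_term, lattice):
    """A document term matches if it equals or specializes the query term."""
    return query_term in ancestors(doc_term, lattice)

def similarity(query, doc, lattice):
    """Count query terms and relations that project onto the document."""
    term_hits = sum(
        any(terms_match(q, d, lattice) for d in doc["terms"])
        for q in query["terms"]
    )
    relation_hits = sum(
        any(
            rel == d_rel
            and terms_match(head, d_head, lattice)
            and terms_match(dep, d_dep, lattice)
            for d_rel, d_head, d_dep in doc["relations"]
        )
        for rel, head, dep in query["relations"]
    )
    return term_hits + relation_hits

def retrieve(query, documents, lattice):
    """Extend the lattice, score every document, return ids in rank order."""
    add_missing_terms(query["terms"], lattice)
    scored = [(similarity(query, doc, lattice), doc_id)
              for doc_id, doc in documents.items()]
    return [doc_id for _, doc_id in sorted(scored, key=lambda s: (-s[0], s[1]))]

documents = {
    "d1": {"terms": ["poodle", "bark"], "relations": [("subject", "bark", "poodle")]},
    "d2": {"terms": ["dog", "owner", "bark"], "relations": [("subject", "bark", "owner")]},
    "d3": {"terms": ["cat", "sleep"], "relations": [("subject", "sleep", "cat")]},
}
query = {"terms": ["dog", "bark"], "relations": [("subject", "bark", "dog")]}
result = retrieve(query, documents, lattice)
print(result)
```

In this example the document about a barking poodle ranks above the document that merely contains the words "dog" and "bark", because only in the former does the syntactic relation of the query find a counterpart.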
According to a second aspect of the present invention, there is provided an apparatus for retrieving documents from a database. The database comprises internal representations of documents wherein the internal representations include syntactic relations between terms of sentences of the documents and a semantic lattice for the terms of the documents in the database, the semantic lattice specifying the semantic relations between the terms. The apparatus comprises a query input unit, a query processing unit, a semantic lattice management unit, a matching unit and a presentation unit. The query input unit receives a query and provides the query to the query processing unit. The query processing unit creates an internal representation of the query based on the terms of the query and syntactic relations between the terms of the query. The semantic lattice management unit appends new terms to the semantic lattice if the query includes terms not included in the semantic lattice in the database. The matching unit projects the query onto each of the documents in the database by comparing the internal representation of the query to the internal representation of the documents using the semantic lattice for comparing the terms. The matching unit further computes a similarity between the query and each document. The presentation unit ranks the documents in the database according to the computed similarities and outputs documents as retrieved documents according to the established rank order.
Furthermore, the present invention provides a computer program product, for use in a computer system, for performing a retrieval of documents from a database. The database comprises internal representations of documents wherein the internal representations include syntactic relations between terms of sentences of the documents and a semantic lattice for the terms of the documents in the database, the semantic lattice specifying semantic relations between the terms. The computer program product performs steps of receiving a database query, extracting syntactic relations between terms of the query, creating an internal representation of the query based on the terms of the query and the extracted syntactic relations between the terms of the query and appending new terms to the semantic lattice if the query includes terms not included in the semantic lattice in the database. Further, the computer program product performs the steps of projecting the query onto each of the documents in the database by comparing the internal representations and terms of the query to the internal representation and terms of the documents using the semantic lattice for comparing the terms and computing similarities between the query and the documents. Finally, the documents in the database are ranked according to the computed similarities, and documents are output as retrieved documents according to the obtained rank order.
In preferred embodiments, further improvements can be achieved by additionally taking weighting factors into account when computing a similarity between a document and a query wherein the weighting factor depends on the semantic and syntactic similarity between a term of the query and a term of a document.
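One conceivable form of such a weighting factor is sketched below. The decay per generalization step, the penalty for a differing syntactic role, and the product used to combine them are assumptions chosen for illustration; the text above does not prescribe particular values or a particular formula.

```python
def semantic_weight(query_term, doc_term, lattice, decay=0.5):
    """1.0 for identical terms, multiplied by `decay` per generalization step."""
    steps, term = 0, doc_term
    while term is not None:
        if term == query_term:
            return decay ** steps
        term = lattice.get(term)
        steps += 1
    return 0.0  # the document term is not a specialization of the query term

def syntactic_weight(query_role, doc_role, mismatch=0.5):
    """Full weight when both terms play the same syntactic role."""
    return 1.0 if query_role == doc_role else mismatch

def weighted_similarity(query_terms, doc_terms, lattice):
    """Sum, over query terms, the best combined weight found in the document."""
    total = 0.0
    for q_term, q_role in query_terms:
        total += max(
            (semantic_weight(q_term, d_term, lattice)
             * syntactic_weight(q_role, d_role)
             for d_term, d_role in doc_terms),
            default=0.0,
        )
    return total

# Hypothetical lattice and (term, syntactic role) pairs.
lattice = {"poodle": "dog", "dog": "animal", "animal": None}
query_terms = [("dog", "subject"), ("bark", "verb")]
doc_terms = [("poodle", "subject"), ("bark", "verb")]
score = weighted_similarity(query_terms, doc_terms, lattice)
print(score)
```

In this sketch an exact term in the same role contributes fully, while a more specific term or a term in a different role contributes a reduced amount.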
In another preferred embodiment, there is provided an internal representation of document information in the form of a conceptual graph wherein each node of the graph is either a term or a syntactic relation. The mechanism to compare graphs, the projection, is not based on a statistical method but on the combination of a lattice of concepts and the matching of the nodes of one graph against the nodes of the other, according to their structure. A statistical method can be used to extract a first broad set of documents, which are then ranked according to the result of the projection of the query on each of those documents.
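The two-stage arrangement described above, a broad statistical selection followed by ranking through projection, may be sketched as follows. The class and function names, the literal term overlap used for the first stage, the shortlist size and the sample graphs are all assumptions for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class ConceptualGraph:
    terms: set = field(default_factory=set)
    # Relation nodes as (relation name, source term, target term).
    relations: set = field(default_factory=set)

def generalizes(general, specific, lattice):
    """True if `specific` equals `general` or lies below it in the lattice."""
    while specific is not None:
        if specific == general:
            return True
        specific = lattice.get(specific)
    return False

def projection_score(query, doc, lattice):
    """Number of query nodes (terms and relations) that map onto document nodes."""
    term_hits = sum(
        any(generalizes(q, d, lattice) for d in doc.terms) for q in query.terms
    )
    relation_hits = sum(
        any(r == dr and generalizes(s, ds, lattice) and generalizes(t, dt, lattice)
            for dr, ds, dt in doc.relations)
        for r, s, t in query.relations
    )
    return term_hits + relation_hits

def two_stage_retrieve(query, graphs, lattice, shortlist_size=2):
    # Stage 1: cheap statistical selection by literal term overlap.
    overlap = {doc_id: len(query.terms & g.terms) for doc_id, g in graphs.items()}
    shortlist = sorted(overlap, key=lambda d: (-overlap[d], d))[:shortlist_size]
    # Stage 2: rank the shortlist by projecting the query onto each graph.
    return sorted(
        shortlist,
        key=lambda d: (-projection_score(query, graphs[d], lattice), d),
    )

lattice = {"poodle": "dog", "dog": "animal"}
graphs = {
    "d1": ConceptualGraph({"dog", "bark", "owner"}, {("subject", "bark", "owner")}),
    "d2": ConceptualGraph({"poodle", "bark"}, {("subject", "bark", "poodle")}),
    "d3": ConceptualGraph({"cat", "sleep"}, {("subject", "sleep", "cat")}),
}
query = ConceptualGraph({"dog", "bark"}, {("subject", "bark", "dog")})
ranked = two_stage_retrieve(query, graphs, lattice)
print(ranked)
```

In this example the first stage favors "d1" because it shares more literal terms with the query, whereas the projection in the second stage moves "d2" to the top because its structure matches the query.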