A critical problem for users of large databases is searching the ever-growing volume of text they contain. It has been estimated that in excess of 90% of all desired intelligence information is available in documents residing in accessible databases. For this information to be useful, a user must be able to locate the specific documents that relate to a specific query. Existing information retrieval systems use inefficient techniques to return relevant documents and generally miss highly relevant documents associated with the user's query. For example, conventional information retrieval systems use Boolean logic-based query execution techniques in which keywords are connected by logical or proximity operators. A logical operator includes or excludes particular terms in a search. A proximity operator merely requires that a combination of keywords appear within a specified number of words of one another. A Boolean search is therefore limited to documents in the database that contain the keyword combination, and such a system merely returns a list of the documents that include one of the keyword combinations. Significantly, a Boolean search system provides no quantifiable measure of how similar the returned documents are to the query (other than the fact that the Boolean combination occurs in the document).
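The behavior of logical and proximity operators can be illustrated with a minimal Python sketch. The helper names (`tokenize`, `matches_and`, `matches_near`), the sample documents, and the four-word window are hypothetical and chosen only for illustration; they are not taken from any particular retrieval system.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def matches_and(tokens, keywords):
    """Logical AND: every keyword must occur somewhere in the document."""
    present = set(tokens)
    return all(keyword in present for keyword in keywords)

def matches_near(tokens, first, second, window):
    """Proximity: `first` and `second` occur within `window` words of each other."""
    pos_first = [i for i, t in enumerate(tokens) if t == first]
    pos_second = [i for i, t in enumerate(tokens) if t == second]
    return any(abs(i - j) <= window for i in pos_first for j in pos_second)

# Hypothetical three-document database.
docs = {
    "d1": "The dog sat on the mat near the hut.",
    "d2": "A dog barked. Much later, far away, someone bought a mat.",
    "d3": "The cat wore a hat.",
}

# Each search yields only an unranked list of matching documents.
and_hits = [name for name, text in docs.items()
            if matches_and(tokenize(text), ["dog", "mat"])]
near_hits = [name for name, text in docs.items()
             if matches_near(tokenize(text), "dog", "mat", 4)]
```

In this sketch a document either matches or does not; nothing in the result indicates how closely a matching document relates to the query.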
The ability to provide a quantifiable measure of similarity between a query and individual documents is important in two ways. First, a user must be able to determine whether the set of documents returned is meaningful at all. Second, the user must be able to separate the meaningful documents from returned documents that do not relate in any significant way to the query, even though they may include some combination of the keywords. That is, users must be able to rank the returned documents in relation to the query. Because Boolean-type searches do not provide a quantifiable measure of similarity between the query and the documents in the database, they fail to give the user significant information about the search and about the documents in the database.
In response to the shortcomings of Boolean-type search systems, vector space-type search systems have been developed. In a vector space-type search system, a score related to a query is computed for each document in the database. In general, the similarity score between a query $Q$ and a document $D$ can be computed by evaluating the shared and disjoint features of the query and the document over an orthogonal space of $T$ terms that originate from the database. For example, the similarity score can be computed by the following formula:

$$S(Q_i, D_j) = \frac{\sum_{k=1}^{T} q_{ik}\, d_{jk}}{\sqrt{\sum_{k=1}^{T} q_{ik}^{2} \cdot \sum_{k=1}^{T} d_{jk}^{2}}}$$

where $q_{ik}$ refers to the terms in the query $Q_i$ and $d_{jk}$ refers to the terms in the document $D_j$, each taken over the $T$ terms of the term space.
This scoring technique can be illustrated on the following sample set of queries and documents:
______________________________________
Terms   $Q_1$   $Q_2$   $D_1$   $D_2$
______________________________________
dog     1       1       2       --
cat     1       --      --      1
hat     1       --      1       --
bat     1       --      --      --
mat     1       1       --      --
hut     --      1       2       --
cut     --      1       --      2
luck    --      --      3       --
buck    --      --      1       --
muck    --      --      --      3
______________________________________
In this table, the Terms column lists the union of all the terms contained in the two queries $Q_1$ and $Q_2$ and the two documents $D_1$ and $D_2$. The entries under $D_1$ and $D_2$ give the frequency of those terms as they appear in the documents. The entries under $Q_1$ and $Q_2$ give the frequency of the terms as they appear in the queries. The similarity score of the query $Q_1$ to document $D_1$ is computed as:

$$S(Q_1, D_1) = \frac{(1)(2) + (1)(1)}{\sqrt{(1^2 + 1^2 + 1^2 + 1^2 + 1^2)(2^2 + 1^2 + 2^2 + 3^2 + 1^2)}} = \frac{3}{\sqrt{5 \cdot 19}} \approx 0.31$$

Similarly, the similarity of the query $Q_1$ to document $D_2$ is computed as $S(Q_1, D_2) = 0.12$. As this example shows, the similarity score of query $Q_1$ to document $D_1$ is higher than the similarity score of query $Q_1$ to document $D_2$. The similarity score therefore provides a relative measure of how closely each document relates to the query. Since these relative scores can be determined for all documents in a database, the user can empirically determine when scores are meaningful and, as a result, derive significant information from documents that are ranked by their score.
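The worked example can be checked with a short Python sketch. It assumes the score is the cosine measure (the dot product of the two frequency vectors divided by the product of their lengths), which reproduces the 0.31 and 0.12 quoted for the sample table; the function name `cosine_score` and the dictionary layout are illustrative choices.

```python
import math

# Term frequencies from the sample table; absent terms ("--") are omitted.
Q1 = {"dog": 1, "cat": 1, "hat": 1, "bat": 1, "mat": 1}
D1 = {"dog": 2, "hat": 1, "hut": 2, "luck": 3, "buck": 1}
D2 = {"cat": 1, "cut": 2, "muck": 3}

def cosine_score(query, doc):
    """Assumed scoring rule: dot product over the product of vector lengths."""
    dot = sum(freq * doc.get(term, 0) for term, freq in query.items())
    norm_q = math.sqrt(sum(f * f for f in query.values()))
    norm_d = math.sqrt(sum(f * f for f in doc.values()))
    if norm_q == 0 or norm_d == 0:
        return 0.0  # an empty vector shares nothing with anything
    return dot / (norm_q * norm_d)

scores = {"D1": cosine_score(Q1, D1), "D2": cosine_score(Q1, D2)}

# Rank documents from most to least similar to the query.
ranking = sorted(scores, key=scores.get, reverse=True)
```

Here `scores["D1"]` rounds to 0.31 and `scores["D2"]` rounds to 0.12, so `ranking` places D1 ahead of D2.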
In order to derive significant information from the rank and score of particular documents, the number of terms in the term space over which the computations are performed must be comparable between documents. In the example given above, the score computed for each document varies according to the number of terms over which the score is calculated. The score of $Q_1$ to $D_1$ in this instance is 0.31 and is based on using all of the terms (the global term space) in the document database. If only the terms in the union of the queries $Q_1$ and $Q_2$ are used as the term space, the score is 0.45. This simple example illustrates that the term spaces over which documents are compared must be of similar size.
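The dependence of the score on the term space can be reproduced with the following Python sketch. It assumes a cosine-style score evaluated only over the terms in a chosen term space, which yields the 0.31 and 0.45 figures given above; the function name `score` and the variable names are illustrative.

```python
import math

# Term frequencies from the sample table; absent terms ("--") are omitted.
Q1 = {"dog": 1, "cat": 1, "hat": 1, "bat": 1, "mat": 1}
Q2 = {"dog": 1, "mat": 1, "hut": 1, "cut": 1}
D1 = {"dog": 2, "hat": 1, "hut": 2, "luck": 3, "buck": 1}
D2 = {"cat": 1, "cut": 2, "muck": 3}

def score(query, doc, term_space):
    """Assumed cosine-style score, restricted to the terms in `term_space`."""
    q = [query.get(term, 0) for term in term_space]
    d = [doc.get(term, 0) for term in term_space]
    dot = sum(a * b for a, b in zip(q, d))
    norm = math.sqrt(sum(a * a for a in q) * sum(b * b for b in d))
    return dot / norm if norm else 0.0

# Global term space: every term in the table (10 terms).
global_space = sorted(set(Q1) | set(Q2) | set(D1) | set(D2))
# Reduced term space: only the terms that occur in the two queries (7 terms).
query_space = sorted(set(Q1) | set(Q2))

global_score = score(Q1, D1, global_space)
query_score = score(Q1, D1, query_space)
```

Dropping "luck", "buck", and "muck" from the term space shortens the document vector for D1 while leaving the shared terms unchanged, which is why the same query and document score higher over the smaller space.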
Generally, scoring a document of controlled size on the co-occurrence of terms between a query and the document is more effective than scoring a large document. In fact, controlling document size even increases the effectiveness of searches based on Boolean combinations of keywords. However, databases do not generally store documents of comparable size. In the more general case, a database contains documents that range from one page or less to thousands of pages or more. As a result, the document analysis technique must control for this wide range of document sizes. Conventional vector space analysis techniques do not control for document size because they operate only on whole documents. A refinement to conventional vector space analysis is to normalize the scoring process by arbitrarily breaking up the documents into a set of mutually exclusive subdocuments and analyzing the subdocuments as documents. Subdocument size is controlled in this refined analysis by merely selecting a number of words per subdocument and counting the absolute number of words in the document to create each subdocument boundary. Alternatively, subdocument boundaries can be created by first preprocessing the document to remove meaningless words such as "at" and "the" and then counting the remaining words. By definition, the size of these subdocuments does not vary. As a result, the scoring errors due to term space differences are reduced.
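A minimal Python sketch of word-count subdocuments follows. The function name `fixed_size_subdocuments`, the small stop-word list, and the choice to drop the stop words from the output (rather than only skip them while counting) are assumptions made for illustration. In this sketch the final subdocument may be shorter than the others when the word count is not an exact multiple of the chosen size.

```python
# Illustrative stop-word list; a real system would use a much larger one.
STOP_WORDS = {"a", "an", "and", "at", "in", "of", "on", "the", "to"}

def fixed_size_subdocuments(text, words_per_subdoc, remove_stop_words=False):
    """Split `text` into consecutive, non-overlapping runs of words."""
    words = text.split()
    if remove_stop_words:
        # Compare without surrounding punctuation or letter case.
        words = [w for w in words
                 if w.lower().strip(".,;:!?\"'()") not in STOP_WORDS]
    return [words[i:i + words_per_subdoc]
            for i in range(0, len(words), words_per_subdoc)]

text = "The dog sat on the mat. The cat wore the hat at the hut."

plain = fixed_size_subdocuments(text, 4)
filtered = fixed_size_subdocuments(text, 4, remove_stop_words=True)

plain_sizes = [len(sub) for sub in plain]        # sizes counting every word
filtered_sizes = [len(sub) for sub in filtered]  # sizes after stop-word removal
```

With four words per subdocument, the 14-word sample splits into sizes 4, 4, 4, and 2; after stop-word removal the 7 remaining words split into sizes 4 and 3.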
The problem with creating subdocuments in this way is that arbitrarily breaking up documents by the number of words may introduce a different type of scoring error. Since adjacent text in a document often contains information related to the same idea, the context of a phrase in one section of text may be lost when it is arbitrarily separated from adjacent text. To avoid separating areas of text that are related, the subdocuments can instead be created based on the paragraphs in the document, because paragraphs generally reflect the judgment of the author concerning which areas of text are related. The problem with creating subdocuments based on paragraphs is that the size of paragraphs varies greatly. As discussed above, this variation in size affects the scoring of the subdocuments.
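The size variation of paragraph-based subdocuments can be seen in a short Python sketch. It assumes paragraphs are separated by blank lines, which is only one possible convention; the function name `paragraph_subdocuments` and the sample text are hypothetical.

```python
import re

def paragraph_subdocuments(text):
    """Treat each blank-line-separated paragraph as one subdocument."""
    paragraphs = re.split(r"\n\s*\n", text)
    return [p.split() for p in paragraphs if p.strip()]

text = (
    "Dogs bark loudly.\n"
    "\n"
    "Cats, by contrast, tend to sit quietly on a mat near the hut for hours."
)

subdocs = paragraph_subdocuments(text)
sizes = [len(sub) for sub in subdocs]  # word count of each subdocument
```

The two paragraphs keep related text together, but their sizes (3 and 15 words) differ by a factor of five, so the subdocuments are no longer scored over comparable amounts of text.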