This invention relates to the field of digital libraries. Specifically, it discloses a system and method for identifying useless documents in a document hit list assembled after performing a search among documents stored in a digital library collection, such that these documents can be filtered and eliminated from the document hit list.
The task of finding important and relevant documents in an online document collection is becoming increasingly difficult as documents proliferate. Several techniques have been developed within document retrieval systems to assist users in focusing or directing their queries more effectively, such as the Prompted Query Refinement technique described by Cooper et al. in “Lexical Navigation: Visually Prompted Query Expansion and Refinement,” Proceedings of DIGLIB97, Philadelphia, Pa., July, 1997; and by Cooper et al. in “OBIWAN-A Visual Interface for Prompted Query Refinement,” Proceedings of HICSS-31, Kona, Hi., 1998. These references, and all other references cited in this specification, are herein incorporated by reference in their entirety. However, even after a query has been refined, the problem of having to read too many documents still remains.
To ease the daunting task of reading too many documents, techniques have been developed for producing rapid displays of the most salient sentences in a document, as described by Neff et al. in “Document Summarization for Active Markup,” Proceedings of the 32nd Hawaii International Conference on System Sciences, Wailea, Hi., January, 1999; and by Neff et al. in “A Knowledge Management Prototype,” Proceedings of NLDB99, Klagenfurt, Austria, 1999. Using these techniques, users can choose to read or browse through only those documents returned by a search engine which are important to the area they are investigating. However, even with these summarization techniques, document retrieval systems are still not able to predict which documents will be most useful to the user.
Other techniques for solving document retrieval problems entail having the user interact with the document retrieval system. For example, one technique described in the literature entails, in a multi-window document interface, having a user drag terms into search windows and see relationships between terms in a graphical environment. Further, Schatz et al. in “Interactive Term Suggestion for Users of Digital Libraries,” ACM Digital Library Conference, 1996, describe a multi-window interface that offers users access to a variety of published thesauruses and computed term co-occurrence data. However, these techniques are prone to user errors (e.g., the user selects a term which is not pertinent to his investigation to further refine the search) and are time-consuming, since user intervention is required. Accordingly, these prior art document retrieval techniques and other known techniques are not capable of filtering document hit lists, such that documents having limited utility, even though they may match many of the search terms fairly accurately, can be removed or downgraded in terms of their ranking, in order to present the most useful documents to the user. Hence, an object of this invention is a system and method for identifying useless documents in a document hit list, such that these documents can be filtered and eliminated from the document hit list.
The present invention is a system and method for identifying useless or insignificant documents in a document hit list assembled from documents stored in one or more document collection database memories. A search engine is used to compose the document hit list based on a query presented by a user. A text extraction algorithm run by a processor is then used to process the documents identified by the document hit list to produce a table of terms and their corresponding collection-level importance ranking, called the IQ or Information Quotient. The text extraction algorithm also produces a table of the most important terms per document. The documents are scanned independently as well, producing a table of documents with filenames and lengths.
A summarizing text algorithm is also run by a processor against the documents of the document hit list to produce a table of terms having a high tf*idf (term frequency times inverse document frequency) value for each document. All of the tables are stored in a relational database, which allows the system of the present invention to generate a table of terms per document ranked by decreasing IQ. To determine whether a document is useful or useless, the table of terms and IQs, the table of most important terms per document, the table of documents with filenames and lengths, and the table of high tf*idf values are examined. A document is found to be useless if either of the following two conditions is true: (i) the document has a document length of less than 2,000 bytes, or (ii) all three of the following hold: the document has fewer than five terms with an IQ greater than 60, the document has fewer than six appearances of terms having a tf*idf value of greater than 2.2, and the document has a document length of less than 40,000 bytes. The document length parameter may vary depending on the document format.
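By way of illustration only, the two-condition test described above can be sketched in Python as follows. The names `DocumentStats`, `is_useless`, and `filter_hit_list` are hypothetical and do not appear in this specification. The sketch also assumes that the per-document figures have already been read out of the relational tables, and that each appearance of a term is represented by one tf*idf value in a list, which is one possible reading of "appearances of terms". The threshold constants are the values stated above.

```python
from dataclasses import dataclass
from typing import List, Sequence

# Thresholds taken from the text; the constant names are illustrative.
MIN_LENGTH_BYTES = 2_000          # condition (i)
MAX_LENGTH_BYTES = 40_000         # condition (ii), length part
IQ_THRESHOLD = 60                 # a term counts only if its IQ is greater than this
MIN_HIGH_IQ_TERMS = 5
TFIDF_THRESHOLD = 2.2             # an appearance counts only if tf*idf is greater than this
MIN_HIGH_TFIDF_APPEARANCES = 6


@dataclass
class DocumentStats:
    """Hypothetical per-document record assembled from the tables."""
    filename: str
    length_bytes: int
    term_iqs: Sequence[float]           # IQ of each important term in the document
    tfidf_appearances: Sequence[float]  # assumed: one tf*idf value per term appearance


def is_useless(doc: DocumentStats) -> bool:
    """Apply conditions (i) and (ii) to a single document."""
    # Condition (i): very short documents are useless outright.
    if doc.length_bytes < MIN_LENGTH_BYTES:
        return True
    # Condition (ii): all three sub-conditions must hold together.
    high_iq_terms = sum(1 for iq in doc.term_iqs if iq > IQ_THRESHOLD)
    high_tfidf = sum(1 for v in doc.tfidf_appearances if v > TFIDF_THRESHOLD)
    return (
        high_iq_terms < MIN_HIGH_IQ_TERMS
        and high_tfidf < MIN_HIGH_TFIDF_APPEARANCES
        and doc.length_bytes < MAX_LENGTH_BYTES
    )


def filter_hit_list(hits: Sequence[DocumentStats]) -> List[DocumentStats]:
    """Return the hit list with useless documents removed, order preserved."""
    return [doc for doc in hits if not is_useless(doc)]
```

Under this sketch, a long document of 40,000 bytes or more is never discarded by condition (ii), however few salient terms it contains. Because the document length parameter may vary with the document format, the two length constants would in practice be chosen per format rather than fixed.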