1. Field of the Invention
The present invention relates to the field of computer systems. More specifically, the present invention relates to information retrieval (IR) technology, in particular to creating metadata for searching over multiple filtering criteria such as both text and topic criteria.
2. Background Information
Modern computer technology allows databases to incorporate ever greater amounts of information. In order to take full advantage of these advances, methods must be developed to allow a user to quickly, easily and inexpensively identify, retrieve, and order information in a database. Effective IR requires that the search be inexpensive and accessible and that the query results be presented in a manner that facilitates searching.
Conventional IR methods for text-based documents rely on large, detailed representations of document sets. Documents are represented by an index file that is derived from the terms of the documents through tokenization, stopping, stemming, elimination of capitalization, and inversion. In stopping, common words are eliminated from the document token stream. Tokens which are to be stopped are the most common words in a given language, such as “a” and “the.” Stemming strips tokens of certain suffixes such as “ing”, “ation” and indications of plurality. Thus “Work”, “working” and “works” are represented as “work.” Each term in such a full text index (“FTI”) serves as an index to the documents in which it appears.
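By way of illustration only, the conventional indexing steps described above (tokenization, elimination of capitalization, stopping, stemming, and inversion) can be sketched as follows. The stop list, the suffix list, the minimum stem length, and the function names are assumptions chosen for this example and are not taken from any particular conventional system.

```python
import re
from collections import defaultdict

# Assumed, deliberately tiny stop list and suffix list for illustration.
STOP_WORDS = {"a", "the", "of", "and"}
SUFFIXES = ("ation", "ing", "s")


def tokenize(text):
    """Split text into alphabetic tokens."""
    return re.findall(r"[A-Za-z]+", text)


def stem(token):
    """Strip the first matching suffix, keeping a stem of at least 3 letters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def index_terms(text):
    """Tokenize, lowercase, stop, and stem, in that order."""
    lowered = [token.lower() for token in tokenize(text)]
    return [stem(token) for token in lowered if token not in STOP_WORDS]


def build_index(documents):
    """Inversion: map each term to the set of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in index_terms(text):
            index[term].add(doc_id)
    return index


documents = {1: "Work on the index", 2: "She is working", 3: "It works"}
index = build_index(documents)
```

In this sketch “Work”, “working” and “works” all resolve to the single index term “work”, so a query for any of them reaches all three documents, while the stopped word “the” never enters the index.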
A user searches FTIs by creating term-based queries for documents that include specified keywords. The searches may include term position information. Some methods return all documents that contain the specified terms and fit the specified term location criteria. Other methods calculate a similarity function between the terms in a query and the terms in each document. Such methods may include a document in a search result as being relevant, even if the document does not fit all the query criteria, as long as the similarity value is greater than a threshold.
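The similarity-based approach can be illustrated with a minimal sketch. Cosine similarity over raw term counts is assumed here as one common choice of similarity function; the text above does not name a specific function, and the document contents and threshold values are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(query_terms, doc_terms):
    """Cosine of the angle between two term-count vectors."""
    q, d = Counter(query_terms), Counter(doc_terms)
    dot = sum(q[term] * d[term] for term in q)
    norm = math.sqrt(sum(v * v for v in q.values())) * math.sqrt(
        sum(v * v for v in d.values())
    )
    return dot / norm if norm else 0.0


def search(query_terms, docs, threshold):
    """Return ids of documents whose similarity exceeds the threshold."""
    scored = (
        (doc_id, cosine_similarity(query_terms, terms))
        for doc_id, terms in docs.items()
    )
    return sorted(doc_id for doc_id, score in scored if score > threshold)


# Hypothetical, already tokenized documents.
docs = {
    "d1": ["index", "search", "cost"],
    "d2": ["index", "storage"],
    "d3": ["graph", "topic"],
}
query = ["index", "search"]
results = search(query, docs, threshold=0.4)
```

With a threshold of 0.4, document d2 is returned even though it lacks the query term “search”, which is the behavior described above: a document need not fit every query criterion so long as its similarity value exceeds the threshold.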
Certain FTIs preserve information on the location of terms within documents. This allows users to specify adjacency criteria when searching the document set; i.e., to specify that documents matching a query include instances of terms which are adjacent or in the same sentence, for example.
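A minimal sketch of such a position-preserving index follows. It assumes whitespace tokenization and treats two terms as adjacent when the second immediately follows the first; the function names and sample documents are hypothetical.

```python
from collections import defaultdict


def build_positional_index(documents):
    """Map term -> document id -> list of token positions."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in documents.items():
        for position, term in enumerate(text.lower().split()):
            index[term][doc_id].append(position)
    return index


def adjacent(index, first, second):
    """Return ids of documents in which `second` immediately follows `first`."""
    hits = set()
    for doc_id, positions in index.get(first, {}).items():
        following = set(index.get(second, {}).get(doc_id, []))
        if any(position + 1 in following for position in positions):
            hits.add(doc_id)
    return hits


documents = {
    1: "federal bureau of investigation",
    2: "bureau of federal statistics",
}
positional_index = build_positional_index(documents)
```

Both documents contain “federal” and “bureau”, but only document 1 satisfies the adjacency criterion for the pair in that order. Storing a position for every occurrence is also what drives the storage cost discussed next.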
Such FTI methods require large amounts of storage space. Despite the use of stemming and stopping, virtually every word in the document set must be represented in the index with information on the location of each occurrence of the term in each document in the document set. An FTI may be 50-300% of the size of the document set itself. Generation and maintenance of an index typically requires dedicated computers having processing and storage capacities whose cost is beyond the reach both of those maintaining and those accessing the database. Such indexed document sets are typically available only through services, such as Lexis®/Nexis® and Dialog®, and the available indexes are limited to those document sets for which the costs can be justified.
Because such indexes are costly to generate and take up a large amount of storage space, searching on these indexes is typically performed at a site remote to the user but near the document set. This is because the transmission of the indexes to a user and their storage by a user is impractical. In addition, some FTIs contain enough information to reconstruct the original document set, which may be proprietary. Search performance therefore depends on data transmission performance and on the availability and workload of remote processors.
Conventional IR methods have limitations in addition to their resource requirements. By the use of stopping, stemming and elimination of capitalization, these methods eliminate information useful to searching. This information is eliminated in order to genericize terms entered as queries and to lower the storage costs of the indexes. While these methods allow for searching based on phrases comprising more than one token, these phrases may not include information eliminated by stopping, stemming and elimination of capitalization.
Conventional IR methods often require a user to enter an exact representation of a phrase and all its variants (i.e. synonyms) in each search query. This is time consuming for the user, and since a user will typically not have the time to contemplate the existence of such variants, documents containing variants of a phrase may not be found. Furthermore, due to the loss of information as a result of stopping, stemming and capitalization elimination, compound terms (i.e. phrases) are not able to be fully defined. Few conventional IR methods allow a definition of a compound term or of the variants of a term to be created prior to any search or other use of that term. For example, conventional IR methods will not allow for the equivalence of “Federal Bureau of Investigation”, “FBI” and “Federal Bureau” to be defined before indexing.
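To make the missing capability concrete, the following sketch shows one way that variants of a compound term could be declared before indexing. It is an illustration under stated assumptions, not the method of the present invention: the variant table, the choice of “FBI” as the canonical form, and the longest-match rule are all hypothetical. Matching is performed on the original tokens, so capitalization and the stop word “of” are preserved.

```python
# Hypothetical variant table, defined before any document is indexed.
VARIANTS = {
    "Federal Bureau of Investigation": "FBI",
    "Federal Bureau": "FBI",
    "FBI": "FBI",
}


def canonicalize(text, variants=VARIANTS):
    """Replace each known variant with its canonical term, longest match first.

    Tokens are split on whitespace only, so attached punctuation would
    prevent a match in this simplified sketch.
    """
    tokens = text.split()
    max_len = max(len(variant.split()) for variant in variants)
    output = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate phrase starting at position i first.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i : i + n])
            if phrase in variants:
                output.append(variants[phrase])
                i += n
                break
        else:
            output.append(tokens[i])
            i += 1
    return output


terms = canonicalize("The Federal Bureau of Investigation and the Federal Bureau")
```

Because the match is case sensitive and keeps stop words, the lowercase phrase “federal bureau” in ordinary prose is left untouched, a distinction that is lost once stopping and elimination of capitalization have been applied.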
Conventional IR methods conduct searching over the text of a document set, using combinations of terms as queries. Conventional IR methods allow for searching and categorization by topic (an area of subject matter or any other categorization); however, such methods require that the topics be defined after the documents are indexed.
Some search methods include pre-defined topic definitions as well as term specifications. However, such topic definitions typically contain terms which are added to a text search query, where the terms are selected to gather documents relevant to the topic. The topic itself is not evaluated relative to the documents.
Because of the resource requirements of conventional IR methods, and because of their limitations when using topics, it is difficult to integrate these methods with graphical searching and graphical query result representation.
Therefore, there is a need for a less expensive and more resource-efficient, yet effective, method to search a set of documents. There is a need to perform such a search on a processor which is local to the user and which is remote from the document set. There is a need for metadata providing an efficient and effective search method which allows users to search across different filtering criteria. There is a need for metadata which may allow for graphical searching and graphical query result representation on a local, user processor. There is no method of creating metadata allowing for searching based on phrases which include information normally eliminated by stopping, stemming, and elimination of capitalization or searching based on variants of phrases or terms.
Thus, an improved method for creating, distributing and using document set representations for searching is desirable, and as will be disclosed in more detail below, the present invention provides the desired method as well as other desirable results, which will be readily apparent to those skilled in the art, upon reading the detailed description to follow.
A method and system are disclosed for creating compact integrated metadata representing a set of documents. Each document comprises a set of terms. The metadata comprises a set of topic profiles, each topic profile defining a relationship between a topic and the documents, a set of document surrogates, and a list of terms which may distinguish among documents. Each document surrogate describes a subset of terms occurring in the document and thus permits a document to be searched for by term as well as topic.
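As a rough sketch only, the three components of the metadata named above could be held in a structure such as the following. The field names, the use of a numeric relevance score to express the relationship between a topic and a document, and the search method with its min_relevance parameter are assumptions made for illustration; the summary above does not specify a representation.

```python
from dataclasses import dataclass


@dataclass
class TopicProfile:
    """Relationship between one topic and the documents (assumed: a score)."""
    topic: str
    relevance: dict  # document id -> assumed relevance score in [0, 1]


@dataclass
class DocumentSurrogate:
    """Subset of the terms occurring in one document."""
    doc_id: str
    terms: frozenset


@dataclass
class Metadata:
    topic_profiles: dict          # topic name -> TopicProfile
    surrogates: dict              # document id -> DocumentSurrogate
    distinguishing_terms: frozenset

    def search(self, term, topic, min_relevance=0.5):
        """Return ids of documents matching both the term and the topic."""
        profile = self.topic_profiles.get(topic)
        if profile is None:
            return []
        return sorted(
            doc_id
            for doc_id, surrogate in self.surrogates.items()
            if term in surrogate.terms
            and profile.relevance.get(doc_id, 0.0) >= min_relevance
        )


# Hypothetical sample data.
metadata = Metadata(
    topic_profiles={"law": TopicProfile("law", {"d1": 0.9, "d2": 0.2})},
    surrogates={
        "d1": DocumentSurrogate("d1", frozenset({"fbi", "court"})),
        "d2": DocumentSurrogate("d2", frozenset({"fbi"})),
        "d3": DocumentSurrogate("d3", frozenset({"court"})),
    },
    distinguishing_terms=frozenset({"fbi", "court"}),
)
matches = metadata.search("fbi", "law")
```

Here documents d1 and d2 both contain the term “fbi”, but only d1 is sufficiently related to the topic “law”, so a combined term and topic query returns d1 alone.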