1. Field of the Invention
The present invention relates to the field of computer systems. More specifically, the present invention relates to information retrieval (IR) technology, in particular to searching over multiple filtering criteria such as both text and topic criteria.
2. Background Information
Modern computer technology allows databases to incorporate ever greater amounts of information. In order to take full advantage of these advances, methods must be developed to allow a user to quickly, easily and inexpensively identify, retrieve, and order information in a database. Effective IR requires that the search be inexpensive and accessible and that the query results be presented in a manner that facilitates searching.
Conventional IR methods for text-based documents rely on large, detailed representations of document sets. Documents are represented by an index file that is derived from the terms of the documents through tokenization, stopping, stemming, elimination of capitalization, and inversion. In stopping, common words are eliminated from the document token stream. Tokens which are to be stopped are the most common words in a given language, such as "a" and "the." Stemming strips tokens of certain suffixes such as "ing", "ation" and indications of plurality. Thus "Work", "working" and "works" are represented as "work." Each term in such a full text index ("FTI") serves as an index to the documents in which it appears.
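The conventional indexing steps described above can be illustrated with a minimal sketch. The stop list, the suffix list, and the function names below are assumptions chosen for illustration only; practical systems use far larger stop lists and more careful stemming rules.

```python
import re
from collections import defaultdict

# Illustrative (hypothetical) stop list and suffix list.
STOP_WORDS = {"a", "an", "the", "of", "and", "is"}
SUFFIXES = ("ation", "ing", "s")

def stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def index_terms(text):
    """Tokenize, eliminate capitalization, stop, and stem a document's text."""
    tokens = re.findall(r"[A-Za-z]+", text)
    lowered = [t.lower() for t in tokens]
    return [stem(t) for t in lowered if t not in STOP_WORDS]

def build_full_text_index(documents):
    """Invert {doc_id: text} into {term: set of doc_ids}."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in index_terms(text):
            index[term].add(doc_id)
    return dict(index)

docs = {1: "Work is the working of works", 2: "The indexing of a document"}
fti = build_full_text_index(docs)
```

In this sketch "Work", "working" and "works" all collapse to the single index term "work", which also shows how capitalization and suffix information is lost once the index is built.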
A user searches FTIs by creating term-based queries for documents that include specified keywords. The searches may include term position information. Some methods return all documents that contain the specified terms and satisfy the specified term location criteria. Other methods calculate a similarity function between the terms in a query and the terms in each document. Such methods may include a document in a search result as being relevant, even if the document does not fit all the query criteria, as long as the similarity value is greater than a threshold.
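The difference between the two kinds of search can be sketched as follows. The similarity measure used here (the fraction of distinct query terms found in a document) and the threshold of 0.5 are assumptions chosen for simplicity; the text does not name a particular similarity function.

```python
def similarity(query_terms, doc_terms):
    """Fraction of distinct query terms that occur in the document."""
    query = set(query_terms)
    if not query:
        return 0.0
    return len(query & set(doc_terms)) / len(query)

def boolean_search(query_terms, docs):
    """Return ids of documents containing every query term."""
    return [d for d, terms in docs.items() if set(query_terms) <= set(terms)]

def ranked_search(query_terms, docs, threshold=0.5):
    """Return (doc_id, score) pairs scoring above the threshold, best first."""
    scored = [(d, similarity(query_terms, terms)) for d, terms in docs.items()]
    hits = [(d, s) for d, s in scored if s > threshold]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)

docs = {
    "d1": ["federal", "bureau", "investigation"],
    "d2": ["federal", "bureau"],
    "d3": ["weather", "report"],
}
query = ["federal", "bureau", "investigation"]
exact_hits = boolean_search(query, docs)
ranked_hits = ranked_search(query, docs)
```

Here the exact-match search returns only "d1", while the similarity-based search also returns "d2", which contains two of the three query terms.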
Certain FTIs preserve information on the location of terms within documents. This allows users to specify adjacency criteria when searching the document set; i.e., to specify that documents matching a query include instances of terms which are adjacent or are in the same sentence, for example.
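A minimal sketch of such a position-preserving index follows, assuming documents that have already been tokenized. The function names are hypothetical, and only strict adjacency (one term immediately following another) is shown; same-sentence criteria would need sentence boundaries to be recorded as well.

```python
from collections import defaultdict

def build_positional_index(documents):
    """Map each term to {doc_id: [positions]} for tokenized documents."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, tokens in documents.items():
        for position, term in enumerate(tokens):
            index[term][doc_id].append(position)
    return index

def adjacent_search(index, first, second):
    """Return ids of documents where `second` immediately follows `first`."""
    results = []
    first_postings = index.get(first, {})
    second_postings = index.get(second, {})
    for doc_id, positions in first_postings.items():
        following = set(second_postings.get(doc_id, []))
        if any(p + 1 in following for p in positions):
            results.append(doc_id)
    return sorted(results)

docs = {
    1: ["full", "text", "index"],
    2: ["text", "of", "full", "length"],
}
positional = build_positional_index(docs)
```

Both documents contain "full" and "text", but only document 1 satisfies the adjacency criterion. Storing one position per occurrence is also what drives the storage cost of such an index.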
Such FTI methods require large amounts of storage space. Despite the use of stemming and stopping, virtually every word in the document set must be represented in the index with information on the location of each occurrence of the term in each document in the document set. An FTI may be 50-300% of the size of the document set itself. Generation and maintenance of an index often requires dedicated computers having processing and storage capacities whose cost is beyond the reach both of those maintaining and those accessing the database. Such indexed document sets are typically available only through services, such as Lexis®/Nexis® and Dialog®, and the available indexes are limited to those document sets for which the costs can be justified.
Because such indexes are costly to generate and take up a large amount of storage space, searching on these indexes is typically performed at a site remote to the user but near the document set. This is because the transmission of the indexes to a user and their storage by a user is impractical. In addition, some FTIs contain enough information to reconstruct the original document set, which may be proprietary. Search performance is therefore dependent on data transmission performance and on the availability and workload of remote processors.
Conventional IR methods have limitations in addition to their resource requirements. By the use of stopping, stemming and elimination of capitalization, these methods eliminate information useful to searching. This information is eliminated in order to genericize terms entered as queries and to lower the storage costs of the indexes. While these methods allow for searching based on phrases comprising more than one token, these phrases may not include information eliminated by stopping, stemming and elimination of capitalization.
Conventional IR methods often require a user to enter an exact representation of a phrase and all its variants (i.e., synonyms) in each search query. This is time consuming for the user, and since a user will typically not have the time to contemplate the existence of such variants, documents containing variants of a phrase may not be found. Furthermore, due to the loss of information as a result of stopping, stemming and capitalization elimination, compound terms (i.e., phrases) cannot be fully defined. Few conventional IR methods allow a definition of a compound term or of the variants of a term to be created prior to indexing using that term. For example, conventional IR methods will not allow for the equivalence of "Federal Bureau of Investigation", "FBI" and "Federal Bureau" to be defined before indexing.
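To make concrete what defining variants before indexing could look like, the following is a minimal sketch. The variant table, the choice of "FBI" as the canonical form, and the function name are all assumptions for illustration and are not taken from any particular IR system. Matching is case-sensitive and keeps stop words such as "of", since that is exactly the information that stopping and capitalization elimination would discard.

```python
import re

# Hypothetical variant table: each variant maps to one canonical term.
VARIANTS = {
    "Federal Bureau of Investigation": "FBI",
    "Federal Bureau": "FBI",
    "FBI": "FBI",
}

def canonicalize(text, variants=VARIANTS):
    """Replace every variant with its canonical term, longest variant first."""
    ordered = sorted(variants, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(v) for v in ordered))
    return pattern.sub(lambda match: variants[match.group(0)], text)

sentence = "The Federal Bureau of Investigation and the Federal Bureau met the FBI"
canonical = canonicalize(sentence)
```

Trying the longest variant first prevents "Federal Bureau" from matching inside "Federal Bureau of Investigation" and leaving a stray "of Investigation" behind.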
Conventional IR methods conduct searching over the text of a document set, using combinations of terms as queries. Conventional IR methods allow for searching and categorization by topic (an area of subject matter or any other categorization); however such methods require that the topics be defined after the documents are indexed.
Some search methods include pre-defined topic definitions as well as term specifications. However, such relevancy determinations typically contain terms which are added to a text search query, where the terms are selected to gather documents relevant to the topic. The topic itself is not evaluated relative to the documents.
Because of the resource requirements of conventional IR methods, and because of their limitations when using topics, it is difficult to integrate these methods with graphical searching and graphical query result representation.
Current IR methods do not easily allow for a document index to be filtered prior to use. Thus the full index must usually be accessed by a user, who may be interested in only a small part of the index, and who may not wish to support the resource requirements of the full index.
Current search methods do not allow a user to search using different processors having different capabilities, or to store the state of a search for later use. When a user searches using conventional methods, the search domain, that is, the set of documents over which the user searches (or the set of references to these documents), is not adjustable at the client level. In an effort to adjust the number of documents returned and narrow a search over a series of iterations, the user often enters an entirely new search for every iteration, replicating information from a previous search. Storing the state of a user search (for example, a set of documents to be searched) eliminates this problem. While current commercial search engines allow for a search state to be maintained, this state is maintained at the server processor, which must devote large amounts of storage resources to maintain process states for the numerous users serviced by the server processor. A user must communicate with a server processor to choose between search states, and thus is limited by communication delays and server processor workload delays. A need exists for a search method which stores a search state locally to the user.
Therefore, there is a need for a less expensive and more resource-efficient, yet effective, method to search a set of documents. There is a need to perform such a search on a processor which is local to the user and which is remote from the document set. There is a need for an efficient and effective search method which allows users to search across different filtering criteria. There is a need for a search method which may allow for graphical searching and graphical query result representation on a local, user processor. There is no search method allowing for searching based on phrases which include information normally eliminated by stopping, stemming, and elimination of capitalization, or for searching based on variants of phrases or terms. There is no search method for combining the capabilities of different processors, for allowing a search state to be saved on a client processor and used at a later time, or for easily pre-filtering a search index.
Thus, an improved method for using document set representations for searching is desirable, and as will be disclosed in more detail below, the present invention provides the desired method as well as other desirable results, which will be readily apparent to those skilled in the art, upon reading the detailed description to follow.
A method and system are disclosed for searching a set of documents using compact integrated metadata. Each document comprises a set of terms. The metadata comprises a set of topic profiles, each topic profile defining a relationship between a topic and the documents, a set of distinguishing terms for searching the documents by their component terms, and a set of document surrogates for allowing the documents to be searched by topic or by term. The method and system create references from each topic profile to document surrogates relevant to the corresponding topic, and create a set of references from each distinguishing term to document surrogates containing that term. The method and system accept a query and search on the documents using the metadata. The method and system provide the ability to filter the metadata before presentation to the user and to integrate searching on a client processor with searching on a server processor. The method and system provide the ability to maintain a set of search states, each search state describing a subset of the documents to be searched, and a set of filters, which are the queries which resulted in the subset of documents.
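The relationships among the metadata components summarized above can be sketched as follows. This is an illustrative simplification under stated assumptions, not the disclosed implementation: every class, field, and method name is hypothetical, and a topic profile is reduced here to a plain set of references to relevant document surrogates.

```python
from dataclasses import dataclass, field

@dataclass
class DocumentSurrogate:
    """Compact stand-in for a document (hypothetical fields)."""
    doc_id: str
    title: str

@dataclass
class Metadata:
    surrogates: dict = field(default_factory=dict)  # doc_id -> DocumentSurrogate
    topic_refs: dict = field(default_factory=dict)  # topic -> set of doc_ids
    term_refs: dict = field(default_factory=dict)   # term -> set of doc_ids

    def add_document(self, surrogate, topics, terms):
        """Register a surrogate and create topic and term references to it."""
        self.surrogates[surrogate.doc_id] = surrogate
        for topic in topics:
            self.topic_refs.setdefault(topic, set()).add(surrogate.doc_id)
        for term in terms:
            self.term_refs.setdefault(term, set()).add(surrogate.doc_id)

@dataclass
class SearchState:
    """A subset of documents plus the filters (queries) that produced it."""
    doc_ids: set
    filters: list = field(default_factory=list)

    def refine(self, metadata, kind, value):
        """Return a new state narrowed by a topic or term filter."""
        refs = metadata.topic_refs if kind == "topic" else metadata.term_refs
        matched = refs.get(value, set())
        return SearchState(self.doc_ids & matched, self.filters + [(kind, value)])

meta = Metadata()
meta.add_document(DocumentSurrogate("d1", "Merger news"), ["finance"], ["merger", "bank"])
meta.add_document(DocumentSurrogate("d2", "Bank robbery"), ["crime"], ["bank", "robbery"])
meta.add_document(DocumentSurrogate("d3", "Bank earnings"), ["finance"], ["bank", "earnings"])

start = SearchState(set(meta.surrogates))
by_term = start.refine(meta, "term", "bank")
by_both = by_term.refine(meta, "topic", "finance")
```

Because each refinement returns a new state rather than modifying the old one, earlier states such as `by_term` remain available for later use, which corresponds to maintaining a set of search states on the client.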