Not applicable.
Not applicable.
The present invention relates to document retrieval systems, and more particularly, to search systems suitable for locating documents on an internet or intranet.
The earliest retrieval systems were mainframe computers that contained the full text of thousands of documents and that were accessed from time sharing terminals. The earliest such systems, developed in the early 1960s, took a list of words and linearly searched through a tape library of the documents searching directly for those that contained the specified words. By the mid to late 1960s, more sophisticated systems first developed word indices or concordances of the searchable words in the set of documents (excluding nonsearchable words such as of, the, and). The concordance contained, for each word, the document numbers of all the documents that contained the word. In some systems, this document number was accompanied by the number of times the word appeared in the corresponding document to serve as a crude measure of the relevance of each word to each document. Such systems simply required the requestor to type in a list of words, and the system then computed and assigned a relevance to each document, retrieving and displaying the documents to the requestor in relevance order. An example of such a system was the QuicLaw system developed by Hugh Lawford at Queen's University in Canada with support from IBM Canada. Phrase searches on that system were done by examining the documents and scanning them for phrases after they had been retrieved, and accordingly phrase searches were slow.
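The concordance with occurrence counts described above can be illustrated with a minimal sketch. The function names, the stop list, and the scoring rule (summing the occurrence counts of the query words) are assumptions made for illustration; the source does not specify how the early systems computed relevance.

```python
from collections import Counter, defaultdict

# Hypothetical stop list; the text names only "of", "the", "and" as examples.
NONSEARCHABLE = {"of", "the", "and"}

def build_concordance(documents):
    """Map each searchable word to {document number: occurrence count}."""
    concordance = defaultdict(dict)
    for doc_number, text in documents.items():
        words = [w for w in text.lower().split() if w not in NONSEARCHABLE]
        for word, count in Counter(words).items():
            concordance[word][doc_number] = count
    return concordance

def rank(concordance, query_words):
    """Order documents by the summed counts of the query words (assumed rule)."""
    scores = Counter()
    for word in query_words:
        for doc_number, count in concordance.get(word.lower(), {}).items():
            scores[doc_number] += count
    # Highest score first; ties are broken by document number.
    return sorted(scores, key=lambda d: (-scores[d], d))

documents = {
    1: "the law of contract and the law of tort",
    2: "contract law",
    3: "the history of tort",
}
concordance = build_concordance(documents)
ranked = rank(concordance, ["law", "contract"])
```

In this sketch, document 3 contains neither query word and is therefore not retrieved at all, while documents 1 and 2 are returned in order of their crude relevance score.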
Other systems, such as Mead Data Central's LEXIS system developed by Jerome Rubin and Edward Gotsman and others, included in their concordances an entry for each and every word, which included, along with the document number (of the document that contained the word), a document segment number (identifying the segment of the document in which the word appeared) and also a word position number (identifying where, within the segment, the word appeared relative to other words). West Group's WESTLAW system, developed a few years later by William Voedisch and others, improved upon this by including in the concordance entry for each word a paragraph number (indicating where the word appeared within the segment), a sentence number (indicating where the word appeared within the paragraph), and a word position number (indicating where the word appeared within the sentence). These two systems, which are still in use today, both permit the logical connectors or operators AND, OR, AND NOT, w/seg (within the same segment), w/p (within the same paragraph), w/s (within the same sentence), w/4 (within 4 words of each other), and pre/4 (preceding by 4 words) to be used to write out formal, complex search requests. Parentheses permit one to control the order of execution of these logical operations. Another class of systems, and in particular the Dialog system which is still in use today, grew out of the early NASA RECON system that assigned names to previously-performed searches so that those searches could be incorporated by reference into later-performed searches.
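A positional concordance of this kind, together with the w/n and pre/n proximity operators, might be sketched as follows. This is a simplified illustration, not the LEXIS or WESTLAW implementation: it records a single word position per document rather than separate segment, paragraph, and sentence numbers, it counts every word when measuring distance, and all names are hypothetical.

```python
from collections import defaultdict

def build_positional_index(documents):
    """Map each word to {document number: [word positions]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_number, text in documents.items():
        for position, word in enumerate(text.lower().split()):
            index[word][doc_number].append(position)
    return index

def within(index, first, second, n, ordered=False):
    """Return documents where the two words lie within n words of each other.

    With ordered=True, `first` must precede `second`, as with pre/n.
    """
    hits = []
    first_postings = index.get(first, {})
    second_postings = index.get(second, {})
    for doc_number in sorted(first_postings.keys() & second_postings.keys()):
        for p in first_postings[doc_number]:
            if any((0 < q - p <= n) if ordered else (0 < abs(q - p) <= n)
                   for q in second_postings[doc_number]):
                hits.append(doc_number)
                break
    return hits

documents = {
    1: "breach of contract requires damages",
    2: "damages follow a material breach",
    3: "breach was alleged but after long argument no damages were awarded",
}
index = build_positional_index(documents)
near = within(index, "breach", "damages", 4)                    # like: breach w/4 damages
before = within(index, "breach", "damages", 4, ordered=True)    # like: breach pre/4 damages
```

Because the positions are stored in the concordance, proximity can be tested without rescanning the document text, which is what made such operators practical compared with post-retrieval phrase scanning.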
Professional librarians and legal researchers use all three of these systems regularly. However, these professionals must train for many weeks or months to learn how to formulate complex queries containing parentheses and logical operators. Lay searchers cannot use these powerful systems with the same degree of success because they are not trained in the proper use of operators and parentheses and do not know how to formulate search queries.
These systems also have other undesirable properties. When asked to search for multiple words and phrases conjoined by OR, these systems tend to recall far too many unwanted documents; their precision is poor. Precision can be improved by the addition of AND operators and word proximity operators to a search request, but then relevant documents tend to be missed, and accordingly the recall rate of these systems suffers.
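The trade-off between precision and recall can be made concrete with the standard definitions: precision is the fraction of retrieved documents that are relevant, and recall is the fraction of relevant documents that are retrieved. The document numbers in the following sketch are invented purely for illustration.

```python
def precision_and_recall(retrieved, relevant):
    """Return (precision, recall) for a set of retrieved document numbers."""
    retrieved, relevant = set(retrieved), set(relevant)
    true_hits = retrieved & relevant
    precision = len(true_hits) / len(retrieved) if retrieved else 0.0
    recall = len(true_hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Illustrative data only: documents 1 to 4 are the relevant ones.
relevant = {1, 2, 3, 4}
broad_or_query = {1, 2, 3, 4, 5, 6, 7, 8}   # finds everything, plus noise
narrow_and_query = {1, 2}                   # no noise, but misses documents

or_result = precision_and_recall(broad_or_query, relevant)
and_result = precision_and_recall(narrow_and_query, relevant)
```

In this invented example the broad OR query has perfect recall but only half of what it retrieves is relevant, while the narrow AND query retrieves nothing irrelevant but misses half of the relevant documents.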
To enable untrained searchers to use these systems, various artificial intelligence schemes have been developed which, like the early QuicLaw system, simply permit a requestor to type in a list of words or a sentence, and then rank and produce the documents. These systems produce variable results and are not particularly reliable. Some ask the requestor to select a particularly relevant document, and then, using the words which that document contains, these systems attempt to find similar documents, again with rather mixed results.
The WESTLAW system also contains some formal indexing of its documents, with each document assigned to a topic and, within each topic, to a key number that corresponds to a position within an outline of the topic. But this indexing can only be used when each document has been hand-indexed by a skilled indexer. New documents added to the WESTLAW system must also be manually indexed. Other systems provide each document with a segment or field that contains words and/or phrases that help to identify and characterize the document, but again this indexing must be done manually, and the retrieval systems treat these words and phrases in the same manner as they do other words and phrases in the document.
With the development of the Internet, web crawlers have been developed that search the web, creating what amount to concordances of thousands of web pages, indexing documents by their URLs (uniform resource locators or web addresses) as well as by the words and phrases that they contain and also by index terms optionally placed into a special field of each document by the document's authors. But these search engines retrieve thousands of documents containing a word or phrase and do not assist one in sorting through all the documents that are captured. In other words, their precision is poor. And the introduction of the AND operator to these systems causes their recall to suffer.
All of these systems suffer from an even more fundamental defect: They do not teach the requestor how to search other than to the extent that the requestor accidentally encounters new words and phrases while browsing. They also do not suggest, nor automate, the application and the use of indexing to the extent that indexing is available. They do not query the requestor, offering the requestor alternative ways to proceed. They do not automatically index new documents that have not previously been indexed manually.
Accordingly, it is a primary object of the present invention to provide a document retrieval system, suitable for searching for documents on the Internet or on an intranet, that is indexed, that is extensible without additional manual indexing, and that accepts broadly formulated queries from a requestor, and that then enters into a dialogue with the requestor to refine and focus the search, using precise indexing to improve considerably the precision of searching, minimizing browse time and false hits, without suffering a corresponding reduction in the relevant document recall rate.
Briefly summarized, the present invention is an interactive document retrieval system that is designed to search for documents after receiving a search query from a requestor. It contains a knowledge database that contains at least one data structure which relates document word patterns to topics. This knowledge database can be derived from an indexed collection of documents.
The present invention utilizes a query processor that, in response to the receipt of a search query from a requestor, searches for and tries to capture documents containing at least one term that is related to the search query. If any documents are captured, the processor analyzes the captured documents to determine their word patterns, and it then categorizes the captured documents by comparing each document's word pattern to the word patterns in the database. When a document's word pattern is similar to a word pattern in the database, the processor assigns to that document the similar word pattern's related topic. In this manner, each document is assigned to one or several topics. Next, a list of the topics assigned to the categorized documents is presented to the requestor, and the requestor is asked to designate at least one topic from the list as a topic that is relevant to the requestor's search. Finally, the requestor is granted access to the subset of the captured and categorized documents to which topics designated by the requestor have been assigned.
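The sequence of steps performed by the query processor can be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: a word pattern is modeled here as the set of adjacent searchable-word pairs in a document, two patterns are treated as similar when they share at least min_shared pairs, and every function name, the stop list, and the sample knowledge database are hypothetical.

```python
NONSEARCHABLE = {"of", "the", "and", "a", "in"}  # hypothetical stop list

def word_pattern(text):
    """Model a word pattern as the set of adjacent searchable-word pairs (assumption)."""
    words = [w for w in text.lower().split() if w not in NONSEARCHABLE]
    return set(zip(words, words[1:]))

def capture(documents, query):
    """Capture documents containing at least one term of the query."""
    terms = set(query.lower().split()) - NONSEARCHABLE
    return {n: t for n, t in documents.items() if terms & set(t.lower().split())}

def categorize(captured, knowledge_db, min_shared=1):
    """Assign each document every topic whose stored pattern is similar to its own."""
    assigned = {}
    for doc_number, text in captured.items():
        pattern = word_pattern(text)
        assigned[doc_number] = {
            topic for topic, stored in knowledge_db.items()
            if len(pattern & stored) >= min_shared
        }
    return assigned

def topic_list(assigned):
    """List the topics to present to the requestor."""
    return sorted(set().union(*assigned.values()))

def select(assigned, designated):
    """Return the documents assigned to any topic the requestor designated."""
    return sorted(n for n, topics in assigned.items() if topics & set(designated))

knowledge_db = {  # topic -> word pattern, invented for illustration
    "automobiles": {("jaguar", "engine"), ("sports", "car")},
    "wildlife": {("jaguar", "habitat"), ("big", "cat")},
}
documents = {
    1: "the jaguar engine powers a sports car",
    2: "jaguar habitat in the rain forest",
    3: "engine repair manual",
}
captured = capture(documents, "jaguar")
assigned = categorize(captured, knowledge_db)
topics = topic_list(assigned)
chosen = select(assigned, ["wildlife"])
```

In this sketch a broad one-word query captures two documents on unrelated subjects, and the requestor's choice of a topic from the presented list narrows the result to the single document of interest without the requestor having to write a more complex query.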
The system may rely on a server connected to the Internet or to an intranet, and the requestor may access the system from a PC equipped with a web browser. To save time, queries once processed are saved along with the list of documents retrieved by those queries and the topics to which they are assigned. Periodic update and maintenance searches are performed to keep the system up-to-date, and analysis and categorization performed during update and maintenance is saved to speed the performance of searches later on.
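The saving of processed queries can be pictured as a simple cache placed in front of the search step. The class below is a hypothetical sketch: the names QueryCache, lookup, and refresh, and the shape of the saved result, are assumptions and do not come from the source.

```python
class QueryCache:
    """Save each processed query with its retrieved documents and assigned topics."""

    def __init__(self, search_fn):
        self._search_fn = search_fn   # performs the full search and categorization
        self._saved = {}

    def lookup(self, query):
        """Return the saved result for a query, running the search only once."""
        key = " ".join(query.lower().split())   # normalize case and spacing
        if key not in self._saved:
            self._saved[key] = self._search_fn(key)
        return self._saved[key]

    def refresh(self):
        """Periodic update: re-run every saved query to keep results current."""
        for key in list(self._saved):
            self._saved[key] = self._search_fn(key)

calls = []

def fake_search(query):
    # Stand-in for the real search; records each invocation.
    calls.append(query)
    return {"documents": [1, 2], "topics": {1: ["automobiles"], 2: ["wildlife"]}}

cache = QueryCache(fake_search)
first = cache.lookup("Jaguar  engine")
second = cache.lookup("jaguar engine")   # served from the saved result
calls_before_refresh = len(calls)
cache.refresh()
calls_after_refresh = len(calls)
```

The second lookup is answered from the saved result, and the search is run again only when the periodic refresh is performed.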
The system may be set up initially and trained by having it analyze a set of documents that have been manually indexed, saving in a word combination table within the knowledge database a record of the word patterns of these documents and relating these word patterns to the topics assigned to each document. The word patterns may be adjacent pairs of searchable words (not including non-searchable words such as of, and, etc.) where one of the words in each such pairing occurs frequently within the document.
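The training step might be sketched as follows. The sketch assumes that a word "occurs frequently" when it appears at least min_count times in the document, a threshold that the text does not specify; the function names, the stop list, and the sample documents are likewise hypothetical.

```python
from collections import Counter, defaultdict

NONSEARCHABLE = {"of", "and", "the", "a", "in"}  # hypothetical stop list

def frequent_word_pairs(text, min_count=2):
    """Adjacent searchable-word pairs in which at least one word is frequent."""
    words = [w for w in text.lower().split() if w not in NONSEARCHABLE]
    counts = Counter(words)
    return {
        (a, b) for a, b in zip(words, words[1:])
        if counts[a] >= min_count or counts[b] >= min_count
    }

def train(indexed_documents, min_count=2):
    """Build a word combination table: word pair -> topics of documents containing it."""
    table = defaultdict(set)
    for text, topics in indexed_documents:
        for pair in frequent_word_pairs(text, min_count):
            table[pair].update(topics)
    return table

# Manually indexed training documents, invented for illustration.
indexed_documents = [
    ("patent claims and patent law", ["intellectual property"]),
    ("tax law of the estate and tax returns", ["taxation"]),
]
table = train(indexed_documents)
```

In the second sample document the pair ("law", "estate") is left out of the table because neither word is frequent in that document, while pairs containing the frequent word "tax" are recorded against the topic assigned by the indexer.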