The present invention relates to computer databases. More particularly, the present invention relates to the indexing and retrieval of information from computer databases.
FIG. 1 shows a system 100 and its environment for indexing and retrieving information from a database 105. FIG. 1 is a high-level diagram and is labelled "prior art." However, if selected modules within FIG. 1 are implemented according to the present invention, FIG. 1 would not be prior art.
According to FIG. 1, an indexing engine 101 uses documents 103 from a database 105 to determine an index structure 107. A retrieval engine 109 accepts a query 111 from a user 113 via a user interface and uses the index structure 107 to determine the identity or identities 115 of one or more documents from the database 105 that are responsive to the query 111. As indicated by dashed lines, the identities 115 of documents may include links 117 to the database 105 for producing the identified documents 119 to the user 113. Either the identities 115 of the documents or some or all of the documents 119 may be considered to be the output of the retrieval engine 109, depending on the particular implementation.
Databases
A database is a collection of information. A database can be viewed as a collection of records, or documents. Thus, a database is also referred to as a document database or a database of documents.
A query as used herein is an expression of a user's need for information from the database or about the database. Given a query, a document retrieval system attempts to identify and/or provide one or more documents from the database responsive to the query. Generally, this means that the document retrieval system will navigate or search through an index structure to find document(s) relevant to the query.
An index structure is any type of data structure with information about the contents of documents in the database. The documents of a database may themselves constitute an index structure for the database.
An example of a computer database is an electronically-readable library of files containing, e.g., text, graphics, audio, and/or video, etc. Each file in the library is a document in the database.
A specific example of a computer database is the Internet or the portion of the Internet called the World Wide Web ("the Web"). On the Web, every file or service or Web page which can be accessed can be referred to as a document. The Web can be thought of as a globe-spanning, fast-growing collection, or database, of millions of linked documents. These documents are created and added to the Web by individuals and individual organizations with virtually no rules or restrictions as to content or organization of information. Consequently, the task of locating relevant information on the Web is a difficult one.
As computer databases, e.g., the Web, proliferate, there is a growing need for powerful, efficient, and versatile document indexing and retrieval methods and systems.
Indexing
Document indexing refers to the creation of an index structure for a database for use during retrieval. One approach to document indexing is simply to let the documents themselves be the index structure, so that the documents themselves can be searched (e.g., text-searched) during retrieval. This approach has the advantage of being simple. However, because every query requires a search through the full contents of every document, this approach is grossly inefficient for even moderately large databases.
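By way of illustration only, the following minimal sketch shows what searching the documents themselves may look like. The function name `scan_documents`, the document identifiers, and the sample texts are hypothetical and are not taken from any particular system. Note that each query visits every document, so the cost grows with the total size of the database.

```python
def scan_documents(documents, keyword):
    """Return ids of documents whose text contains the keyword.

    Matching is a case-insensitive substring test, so "engine" would
    also match "engines". Every document is read on every query.
    """
    needle = keyword.lower()
    return [doc_id for doc_id, text in documents.items()
            if needle in text.lower()]


# Hypothetical three-document database.
documents = {
    "d1": "Indexing engines build index structures.",
    "d2": "A retrieval engine answers queries.",
    "d3": "Queries are expressed by users.",
}

print(scan_documents(documents, "queries"))  # ['d2', 'd3']
```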
Another approach to document indexing is to form an inverted file of words that appear in the database with pointers for each such word to the document(s) which contain the word. During retrieval, the document retrieval system searches the index for words of interest and identifies and/or provides the document(s) which contain them.
Optionally, an index structure such as an inverted file index structure may omit words, e.g., "the" or "and," that are unlikely to be useful as retrieval keywords. Such words omitted from an index structure are well-known and are commonly referred to as "stop words."
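By way of illustration only, an inverted file with stop-word omission may be sketched as follows. The stop-word list, the tokenization rule, and the names `STOP_WORDS` and `build_inverted_index` are assumptions chosen for this example. Real systems typically use longer stop-word lists and more careful tokenization.

```python
import re
from collections import defaultdict

# Illustrative stop-word list (an assumption, not a standard list).
STOP_WORDS = {"the", "and", "a", "of", "to"}


def build_inverted_index(documents, stop_words=STOP_WORDS):
    """Map each non-stop word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        # Lowercase the text and split it into alphanumeric tokens.
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            if word not in stop_words:
                index[word].add(doc_id)
    return index


# Hypothetical two-document database.
documents = {
    "d1": "The cat and the dog",
    "d2": "The dog chased a ball",
}

index = build_inverted_index(documents)
print(sorted(index["dog"]))  # ['d1', 'd2']
print("the" in index)        # False
```

During retrieval, looking up a word in `index` yields the containing documents directly, without reading the documents themselves.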
Inverted-file index structures can be automatically generated and are adequate for many purposes. However, the storage and computational resources needed to hold and process such index structures are typically quite substantial for large databases because of the need to index all or substantially all (non-stop) words found in the database.
Yet another approach to document indexing is to have human workers categorize documents to produce index structures such as hierarchical directories. While this approach is useful and appropriate for certain document retrieval purposes, it is labor-intensive and not practical for indexing large databases such as the Web.
Retrieval
Document retrieval refers to the identifying or providing of one or more documents from a database responsive to a query. One common type of query includes a list of retrieval keywords. Because document retrieval typically involves some type of search, it is sometimes referred to as document search.
Document retrieval schemes are typically evaluated using two scores, Recall and Precision. Recall typically refers to the proportion of relevant documents in a database that are successfully retrieved. For example, if ten documents in a database of 10,000 documents are relevant to a user's information need, and a document retrieval scheme successfully retrieves six of these ten documents, then the Recall is sixty percent.
Precision typically refers to the proportion of retrieved documents that are actually relevant. For example, if a document retrieval scheme retrieves fifty documents, but only six of them are relevant to a user's information need, then the Precision is twelve percent.
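The two scores can be computed from sets of document identifiers, as in the following minimal sketch. The function names and the integer identifiers are hypothetical. Combining the two numerical examples above into a single retrieval result (fifty documents retrieved, six of which are among the ten relevant ones) is an assumption made for illustration.

```python
def recall(retrieved, relevant):
    """Fraction of the relevant documents that were retrieved."""
    if not relevant:
        return 0.0
    return len(retrieved & relevant) / len(relevant)


def precision(retrieved, relevant):
    """Fraction of the retrieved documents that are relevant."""
    if not retrieved:
        return 0.0
    return len(retrieved & relevant) / len(retrieved)


# Ten relevant documents, identified here by the integers 0..9.
relevant = set(range(10))
# Fifty retrieved documents: six relevant ones plus 44 irrelevant ones.
retrieved = set(range(6)) | set(range(100, 144))

print(f"Recall:    {recall(retrieved, relevant):.0%}")     # Recall:    60%
print(f"Precision: {precision(retrieved, relevant):.0%}")  # Precision: 12%
```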
One approach to document retrieval is Boolean search of an index structure. Under this approach, a user may use operators such as AND, OR, and NOT with retrieval keywords in a query. A drawback of this approach is that it typically requires a large index structure, such as an inverted file. Another drawback is that, while this approach typically produces high Recall, it also produces low Precision.
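By way of illustration only, Boolean operators can be evaluated as set operations over an inverted file, as in the following sketch. The index contents and the function names are hypothetical. The NOT operator is assumed here to require the set of all document identifiers in the database.

```python
def op_and(index, a, b):
    """Documents containing both keywords."""
    return index.get(a, set()) & index.get(b, set())


def op_or(index, a, b):
    """Documents containing either keyword."""
    return index.get(a, set()) | index.get(b, set())


def op_not(index, all_docs, a):
    """Documents that do not contain the keyword."""
    return all_docs - index.get(a, set())


# Hypothetical inverted file: keyword -> ids of documents containing it.
index = {
    "database": {"d1", "d2", "d3"},
    "index": {"d1", "d3"},
    "web": {"d2", "d3"},
}
all_docs = {"d1", "d2", "d3", "d4"}

print(sorted(op_and(index, "database", "index")))  # ['d1', 'd3']
print(sorted(op_or(index, "index", "web")))        # ['d1', 'd2', 'd3']
# database AND NOT web
print(sorted(index["database"] & op_not(index, all_docs, "web")))  # ['d1']
```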
Refinements to the Boolean search exist. Typically, these refinements attempt to improve Recall and/or Precision. Refinements of the Boolean search include vector-based search and automatic query expansion.
With vector-based search, a retrieval engine considers for each retrieved document the frequency of appearance of each retrieval keyword in the document relative to the frequency of appearance of the keyword in all documents in the database. The relative frequencies of appearance for each retrieved document define a point (i.e., vector) in a vector space having dimensional axes that each correspond to one of the retrieval keywords in the query. Similar documents tend to cluster near each other in the vector space. Using this geometric relationship, the user can more efficiently examine retrieved documents. A drawback of vector-based search is that, while it can improve Precision, it cannot improve Recall. Another drawback with vector-based search is that it continues to require large index structures.
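The following sketch shows one possible reading of the relative-frequency vectors described above. Each component is assumed to be the count of a keyword in a document divided by the count of that keyword across all documents. The cosine measure used to compare vectors, the tokenization rule, and all names are assumptions chosen for illustration. Practical systems commonly use other weightings.

```python
import math
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def keyword_vectors(documents, keywords):
    """One vector per document, with one axis per query keyword.

    Component = (count of keyword in document) /
                (count of keyword in all documents).
    """
    counts = {doc_id: [tokenize(text).count(k) for k in keywords]
              for doc_id, text in documents.items()}
    totals = [sum(row[i] for row in counts.values())
              for i in range(len(keywords))]
    return {doc_id: [c / t if t else 0.0 for c, t in zip(row, totals)]
            for doc_id, row in counts.items()}


def cosine(u, v):
    """Cosine of the angle between two vectors (0.0 if either is zero)."""
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return sum(x * y for x, y in zip(u, v)) / norm if norm else 0.0


# Hypothetical documents and a two-keyword query.
documents = {
    "d1": "index index search",
    "d2": "index search search",
    "d3": "search",
}
vectors = keyword_vectors(documents, ["index", "search"])

print(vectors["d3"])  # [0.0, 0.25]
# d2 lies closer to d3 than d1 does in this vector space.
print(cosine(vectors["d2"], vectors["d3"]) > cosine(vectors["d1"], vectors["d3"]))  # True
```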
With automatic query expansion, a retrieval engine automatically adds new keywords to a user's query based on a predefined thesaurus or a predefined word co-occurrence list. A drawback with this refinement is that, while it can improve Recall, it typically worsens Precision. Furthermore, building a thesaurus of terms can be a very difficult manual task.
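By way of illustration only, query expansion from a predefined thesaurus may be sketched as follows. The thesaurus entries and the function name `expand_query` are hypothetical. Each added keyword widens the set of matching documents, which is consistent with the Recall gain and the Precision loss noted above.

```python
def expand_query(keywords, thesaurus):
    """Return the original keywords followed by related thesaurus terms.

    Original keywords keep their order; duplicates are not added.
    """
    expanded = list(keywords)
    for keyword in keywords:
        for related in thesaurus.get(keyword, []):
            if related not in expanded:
                expanded.append(related)
    return expanded


# Hypothetical predefined thesaurus.
thesaurus = {
    "car": ["automobile", "vehicle"],
    "fast": ["quick"],
}

print(expand_query(["fast", "car"], thesaurus))
# ['fast', 'car', 'quick', 'automobile', 'vehicle']
```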
In light of the above discussion, it is seen that there is a need for improved indexing and retrieval of documents for computer document databases. In particular, methods and systems are needed which require reduced storage for index structures and improve Recall and Precision even when applied to large databases such as the Internet or the Web.