Full-text searching of unstructured and semi-structured data is becoming more and more important in the world of computing. For many years, the information-retrieval community has had to deal with the storage of documents and with the retrieval of documents based on one or more keywords. Since the burgeoning of the World Wide Web, and the feasibility of storing documents on-line, retrieval of documents based on keywords has become a thorny problem. A number of software solutions have been developed, which have attempted to address some of these problems.
A large portion of digitally stored information is presently stored in the form of unstructured textual data, both in plain text files and in formatted documents. Although the bulk of this textual data is stored in file systems, there are advantages to storing such data in relational databases. By doing so, the advantages of a database, including high-performance access, query capability, simple application-based user interfaces for end users, and secure remote access, are made available.
Relational Databases
Database management systems (DBMSs) such as SQL Server are widely used to search structured data. It is impractical, however, to search unstructured data (e.g., text documents) the same way structured data is searched because doing so is too expensive.
For example, to retrieve information from structured data in a database, a user typically provides a query written in a query language such as SQL. The query specifies the structured information to be retrieved (the search term or terms), the field in which the search term is to be found, and the manner in which the retrieved information is to be manipulated or evaluated to provide a useful result, typically by means of a relational operator or a function. To process the query, the database system typically converts the query into a relational expression that describes algebraically the result specified by the query. The relational expression is used to produce an execution plan, which describes particular steps to be taken by a computer to produce the requested result. Because the search term and the field where the search term is sought are specified, such results can be returned quickly. Indexes based on key fields (e.g., an index based on name or social security number for a personnel database) routinely assist in efficient searching.
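The structured case described above can be illustrated with a minimal sketch. It uses Python's built-in sqlite3 module as a stand-in for a DBMS such as SQL Server; the personnel table, the idx_personnel_name index, the sample rows and the helper function names are all hypothetical and chosen only for illustration.

```python
import sqlite3


def make_personnel_db():
    """Create an in-memory personnel table with an index on the name field."""
    connection = sqlite3.connect(":memory:")
    connection.execute("CREATE TABLE personnel (name TEXT, ssn TEXT)")
    # Index on a key field, so a lookup by name need not scan every row.
    connection.execute("CREATE INDEX idx_personnel_name ON personnel (name)")
    connection.executemany(
        "INSERT INTO personnel VALUES (?, ?)",
        [("Adams", "000-00-0001"), ("Lee", "000-00-0002"), ("Ortiz", "000-00-0003")],
    )
    return connection


def find_by_name(connection, name):
    """The query names both the search term and the field to be searched."""
    cursor = connection.execute(
        "SELECT name, ssn FROM personnel WHERE name = ? ORDER BY ssn", (name,)
    )
    return cursor.fetchall()


def explain_find_by_name(connection, name):
    """Return the textual steps of the execution plan chosen for the query."""
    rows = connection.execute(
        "EXPLAIN QUERY PLAN SELECT name, ssn FROM personnel WHERE name = ?", (name,)
    ).fetchall()
    # The last column of each plan row holds the human-readable description.
    return [row[-1] for row in rows]
```

Calling explain_find_by_name on such a database would typically report that the name index is used, although the exact wording of the plan varies between SQLite versions.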
A similarly-conducted search for the same search term in unstructured data would require a word-by-word search of the entire text database and is unworkable.
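By way of illustration only, a word-by-word search of the kind just described might look like the following sketch. The function name and the representation of the text database as a mapping from document identifier to text are assumptions made for this example.

```python
def scan_for_term(documents, term):
    """Find documents containing `term` by examining every word of every document.

    `documents` maps a document identifier to its text. Returns the matching
    identifiers together with the number of words that had to be examined.
    """
    term = term.lower()
    hits = []
    words_examined = 0
    for doc_id, text in documents.items():
        found = False
        for word in text.lower().split():
            words_examined += 1
            if word == term:
                found = True
        if found:
            hits.append(doc_id)
    return hits, words_examined
```

The number of words examined equals the total number of words in the collection for every query, which is why this approach does not scale to large bodies of text.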
Typically, today, an inverted index for searching documents is created by building a custom data structure external to the database system before a search query is entered. These solutions usually involve pulling data out of the database via bridges or gateways and storing the data as files in the file system so that textual indexing can be applied. Some systems actually store index data in a database but use an external engine to build and query the index. This approach does not provide a seamless way for a user to combine a textual query with a regular structured relational query and limits the extent to which a query can be optimized.
Typically, a full-text index is organized as a tree whose internal nodes represent keywords and whose external nodes contain document identifiers and occurrences. When a search is performed, the keyword(s) are looked up in the index and the documents containing the keyword(s) are retrieved. Naturally, whenever the collection of documents changes, a new index must be built or the existing index must be updated.
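The relationship between keywords, document identifiers and occurrences can be sketched as follows. This is a simplified illustration, not the structure of any particular system: it substitutes a hash-based dictionary for the tree described above, treats an occurrence as a word position, and uses hypothetical function names.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each keyword to the documents containing it and the word positions.

    `documents` maps a document identifier to its text. The result has the
    shape {keyword: {doc_id: [position, ...]}}.
    """
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in documents.items():
        for position, word in enumerate(text.lower().split()):
            index[word][doc_id].append(position)
    return index


def lookup(index, keyword):
    """Return {doc_id: [positions]} for a keyword, or an empty dict if absent."""
    postings = index.get(keyword.lower(), {})
    return {doc_id: list(positions) for doc_id, positions in postings.items()}
```

With such a structure, a keyword search consults only the entry for that keyword rather than reading every document. The index must still be rebuilt or updated when the documents change.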
Although full-text searching is frequently a capability of database management systems, the implementation of full-text search is typically unable to take advantage of the features of the database management system. That is, relational database management systems generally are unable to accommodate full-text searching of documents within the structure of the database. Typically, the index created to search the document database is not itself part of the database system (i.e., it is separate from the database's index system). Because the index created is not part of the database system, certain limitations arise and certain highly advantageous aspects of database systems do not apply to typical full-text search systems.
Limitations associated with a full-text search system that relies on an external index include the following:
Integration with existing database search technologies like Microsoft's SQL SERVER is fairly complex and difficult because the index is a custom index, and typically has its own transactional mechanism and storage mechanism. A significant amount of custom code, therefore, is needed for indexing, querying and administration.
Enhancements to existing or newly added systems that require a change in persistent storage format are difficult because changes in the storage management code of the custom index are required.
Implementation of scaling features such as the distribution of workload and files among multiple resources, including clustering, etc., requires a significant amount of development.
Replication, i.e., keeping distributed databases synchronized by copying the entire database or subsets of the database to other servers in the network, is typically of the unsophisticated “full copy and propagate” form with very loose integrity semantics. A more efficient form of replication would require a significant amount of development.
Incorporation of database features such as query caching, keyword buffering, data partitioning, etc. is more difficult since any such work frequently impacts the core engine code and sometimes impacts persistent store layout.
Upgrading from one file structure to another is a difficult development task.
A significant amount of code must be maintained to perform a function which is very similar to a function already performed by, for example, a cluster index associated with a relational database system such as SQL Server.
Query optimization cannot be tightly integrated.
Similarly, some of the advantages of database management systems are not applicable to a full-text search system based on a custom index. For example, most database systems have excellent facilities for data recovery in the event of database degradation. These data recovery systems, however, do not work for the index file because the index file is not a DBMS data store. Hence, data corruption can be a frequent problem with a file system index file. If there is a hardware malfunction, it is very difficult to efficiently reach a point where the documents database and the documents index are in sync because the two different systems have different recovery protocols.
Backup and restore mechanisms for the index file generally do not have the advanced features typically available for database files, as discussed above.
Scalability issues exist for the index file. Scalability refers to partitioning one logical table into multiple physical tables on the same machine or on different machines in order to accommodate very large collections of data. For example, instead of storing a large database on a single resource, it is frequently desirable to split or partition the database across a number of resources. Database data stores generally maintain data in tables that can reside locally on a single data store or can be distributed among several data stores in a distributed database environment.
Advantages of partitioning include a reduced processing load on any single resource, faster access to data and, if a particular machine experiences a hardware failure, the loss of only part of the data. Partitioning, however, is typically not available for a file system index file, because partitioning a file system file requires a separate infrastructure. Thus, typically the index file, although frequently very large, cannot be partitioned, so a single resource must be dedicated to the index.
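One common way of splitting a logical table across several resources is to assign each row to a partition by hashing a key field. The sketch below assumes that scheme purely for illustration; the text above does not prescribe a partitioning method, and the function names and row layout are hypothetical.

```python
import zlib


def partition_for(key, partition_count):
    """Choose a partition for a key using a stable (process-independent) hash."""
    return zlib.crc32(str(key).encode("utf-8")) % partition_count


def partition_rows(rows, key_field, partition_count):
    """Split one logical table (a list of dict rows) into several physical tables."""
    partitions = [[] for _ in range(partition_count)]
    for row in rows:
        partitions[partition_for(row[key_field], partition_count)].append(row)
    return partitions
```

Because the partition for a key can be recomputed at query time, a lookup by key needs to consult only one of the physical tables, and the failure of one resource affects only the rows assigned to it.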
Hence, a need exists in the art to provide a full-text searching system wherein the index is built upon standard database technology.
Most of the methods of building text indexes based on keyword, document identifier and occurrence lists share the mechanism of building compressed inverted lists and merging the inverted lists to build merged indexes. For example, when a database is searched, data is typically scanned and then indexed. Each time a crawler finishes crawling a batch of data, an indexer may build an inverted list of keywords with data identifiers and the occurrences of the keyword(s) in the data (an index). Typically the index is persisted. Frequently several (or many) indexes are built per data set because typically the body of data is quite large. Indexes are then merged together. During the merge of an old index into a new index, typically a table lookup must be done for every data identifier in the older index to see if the data has been changed or deleted since the older index was built. For example, if a particular data item was present in the older index but is deleted or changed later, the information about the data from the old index is not included in the new index. Typically, for performance reasons, this table is stored in memory. It would be helpful if the number of table lookups could be reduced, especially if the need for an in-memory data structure for the lookup table could be reduced or eliminated.
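The merge step described above can be sketched as follows. This is an illustrative simplification under stated assumptions: the inverted lists are uncompressed dictionaries of the shape {keyword: {data_id: [occurrences]}}, the lookup table is an in-memory set of identifiers changed or deleted since the older index was built, and the function name is hypothetical.

```python
def merge_indexes(old_index, new_index, changed_or_deleted):
    """Merge an older inverted index into a newer one.

    Entries from the older index are dropped when their data identifier
    appears in `changed_or_deleted`. Returns the merged index and the number
    of lookups made against that table.
    """
    merged = {}
    lookups = 0
    for keyword, postings in old_index.items():
        for data_id, occurrences in postings.items():
            # One table lookup for every data identifier in the older index.
            lookups += 1
            if data_id in changed_or_deleted:
                continue  # stale information is not carried forward
            merged.setdefault(keyword, {})[data_id] = list(occurrences)
    for keyword, postings in new_index.items():
        for data_id, occurrences in postings.items():
            merged.setdefault(keyword, {})[data_id] = list(occurrences)
    return merged, lookups
```

The lookup count grows with the size of the older index regardless of how few items actually changed, which illustrates the cost that the passage above identifies.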