Full-text searching of unstructured and semi-structured data is becoming increasingly popular and significant in the computing world. For many years, the information-retrieval community has had to deal with the storage of documents and with the retrieval of documents based on one or more keywords. With the growth of the Internet and the feasibility of storing documents on-line, retrieval of documents based on keywords has become a complex problem. A myriad of software solutions have been developed in an attempt to address this problem.
Conventional search engines provide a mechanism for searching unstructured as well as semi-structured data; however, they are all nonspecific, and their search algorithms as well as schemas are hard coded. Many of the most popular search engines, such as Google® and Yahoo®, are targeted toward processing generic queries over an almost infinite domain: the Internet. The search and ranking algorithms employed by such search engines are static and unchanging with respect to received queries. Hence, these search engines will utilize the same algorithms regardless of whether the majority of searches correspond to specialized areas or scenarios such as medicine, law, or e-business. The relevance of returned results could be dramatically increased if the query algorithms were targeted at a particular domain of interest. Conventionally, however, query algorithms are hard coded into search engines and securely protected as company trade secrets. Accordingly, if an individual or entity would like to add extra information or features to a conventional search engine, for instance to target a particular domain, they would need to attempt to build an auxiliary application external to the search engine, which most likely would not produce the desired results. Alternatively, an individual or entity could attempt to build their own search engine or solicit a software company to do it for them. Unfortunately, either choice would most likely be prohibitively expensive in terms of both time and money.
A large portion of digitally stored information is presently stored in the form of unstructured textual data, both in plain text files and in formatted documents. Although the bulk of this textual data is stored in file systems, there are advantages to storing such data in databases (e.g., relational, multidimensional). Doing so makes available the advantages of a database, including high-performance access, query capability, metadata-based queries, simple application-based user interfaces for end users, and secure remote access.
Database management systems (DBMSs) such as SQL Server are widely used to search structured data. It is impractical, however, to search unstructured data (e.g., text documents) the same way structured data is searched, in part because doing so is too expensive. For example, in order to retrieve information from structured data in a database, a user typically provides a query (written in a query language such as SQL). The query specifies the structured information to be retrieved (the search term or terms), the field in which the search term is to be found, and the manner in which the retrieved information is to be manipulated or evaluated in order to provide a useful result, that manner typically being expressed as a relational operator or a function. To process the query, the database system typically converts the query into a relational expression that describes algebraically the result specified by the query. The relational expression is used to produce an execution plan, which describes particular steps to be taken by a computer to produce the requested result. Because the search term and the field where the search term is sought are specified, such results can be returned quickly. Moreover, indexes based on key fields (e.g., an index based on name or social security number for a personnel database) routinely assist in efficient searching.
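By way of illustration only, the structured search described above can be sketched as follows. The sketch uses Python's built-in sqlite3 module as a stand-in for a database management system such as SQL Server, and the table name, column names, index name, and sample rows are all hypothetical. It shows a query that names both the search term and the field to be searched, together with the execution plan the database system produces for that query.

```python
import sqlite3


def build_personnel_db():
    """Create a small in-memory personnel table (hypothetical schema and data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE personnel (id INTEGER PRIMARY KEY, name TEXT, dept TEXT)"
    )
    conn.executemany(
        "INSERT INTO personnel (name, dept) VALUES (?, ?)",
        [
            ("Ada Lovelace", "Engineering"),
            ("Grace Hopper", "Research"),
            ("Alan Turing", "Research"),
        ],
    )
    # Index on a key field, analogous to the name index mentioned in the text.
    conn.execute("CREATE INDEX idx_personnel_name ON personnel (name)")
    return conn


def find_by_name(conn, name):
    """Structured query: the search term and the field are both specified."""
    cursor = conn.execute(
        "SELECT name, dept FROM personnel WHERE name = ?", (name,)
    )
    return cursor.fetchall()


def explain_plan(conn, name):
    """Return the human-readable steps of the execution plan for the query."""
    cursor = conn.execute(
        "EXPLAIN QUERY PLAN SELECT name, dept FROM personnel WHERE name = ?",
        (name,),
    )
    # The last column of each plan row holds the textual description.
    return [row[-1] for row in cursor.fetchall()]


if __name__ == "__main__":
    db = build_personnel_db()
    print(find_by_name(db, "Ada Lovelace"))
    for step in explain_plan(db, "Ada Lovelace"):
        print(step)
```

In this sketch the reported plan refers to the index on the name field rather than to a scan of the whole table, which is the reason such structured lookups can be answered quickly. The exact wording of the plan differs between SQLite versions.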
A similarly conducted search for the same search term in unstructured data would require a word-by-word search of the entire text database, which is simply unworkable and impractical. Conventional solutions to this problem typically involve building, before a search query is entered, an inverted index for searching documents as a custom data structure external to the database system. These solutions usually involve pulling data out of the database via bridges or gateways and storing the data as files in the file system so that textual indexing can be applied. Some other conventional systems actually store index data in a database but use an external engine to build and query the index. This approach does not provide a seamless way for a user to combine a textual query with a regular structured relational query, and it limits the extent to which a query can be optimized.
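For illustration, a minimal inverted index of the kind referred to above can be sketched as follows. This is a simplified sketch under stated assumptions and not the implementation of any particular conventional system: the function names, the lower-casing tokenizer, and the sample documents are all hypothetical. The index is built once, before any query is entered, and maps each term to the set of documents containing it, so that a keyword query is answered by set intersection rather than by a word-by-word scan of every document.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lower-case alphanumeric terms (a deliberately simple tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(documents):
    """Map each term to the set of document ids in which it occurs.

    `documents` is a dict of {doc_id: text}. The index is built ahead of
    time, before any query is entered.
    """
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every term of the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    # Look up the posting set of each term; unknown terms match nothing.
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)


if __name__ == "__main__":
    docs = {
        1: "Full-text search of unstructured data",
        2: "Structured data is searched with SQL queries",
        3: "An inverted index speeds up full-text search",
    }
    idx = build_inverted_index(docs)
    print(sorted(search(idx, "full text search")))
    print(sorted(search(idx, "data")))
```

Because this structure lives outside the database system, a query against it cannot be planned together with a structured relational query, which is the limitation noted above.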
Although full-text searching is frequently a capability of database management systems, the conventional implementation of full-text search is unable to take advantage of the features of the database management system. Database management systems are generally unable to accommodate full-text searching of documents within the structure of the database because the full-text capabilities are only loosely coupled therewith. For instance, the index created to search a document database is typically not itself part of the database system (i.e., it is separate from the database's index system). Because the index created is not part of the database system, certain limitations arise and certain highly advantageous aspects of database systems do not apply to typical full-text search systems.
Accordingly, there is a need in the art for a full-text search system that can employ separately provided index schemas and ranking algorithms to efficiently generate relevant results for targeted domains. Furthermore, there is a need for a full-text search system that can be tightly integrated with a database management system to, inter alia, leverage the highly optimized and advantageous features thereof.