1. Field of the Invention
The present invention relates to database systems and, more particularly, to a method and apparatus for optimizing queries accessing both structured data and text data.
2. Description of the Related Art
There is a growing number of text information sources available today. These text information sources range from traditional library information systems such as Library of Congress LOCIS and Stanford University FOLIO, to on-line services such as Dialog, and eventually to comprehensive digital libraries. Many applications and end users need to combine the retrieval of text information from these sources with structured data retrieved from other database systems. For example, a hospital information system may permit physicians to access a patient's medical record, progress notes, medical literature and drug formularies which are stored in one or more database systems.
Although structured database systems provide powerful query languages for querying structured data, such database systems are not well suited for storing or querying text information. For example, Structured Query Language (SQL) provides only rather primitive string matching operations. Text retrieval systems, on the other hand, use indexing techniques and processing algorithms that are specialized for querying text information, but have limited query languages. In particular, text retrieval systems do not support join-like operations for combining data from multiple sources.
Extensible database systems are known to provide "tight" integration of structured data and text. See e.g., M. Carey and L. Haas, "Extensible Database Management Systems," ACM SIGMOD Record, December 1990. These systems provide hooks in the data model, query language, and implementation architecture that allow new data types (e.g., text, maps, images) to be added.
However, it is important to distinguish between "tight" and "loose" integration. Tight integration assumes that the new data types are to be stored and processed within the database system, i.e., that the access methods and evaluation methods are visible to the query processor (and may in fact be modified to match the query processing strategies). See e.g., Bertino et al., "Query Processing in a Multimedia Document System," ACM Transactions on Office Information Systems, 6(1), 1988; W. Lee and D. Woelk, "Integration of Text Search with Orion," IEEE Data Engineering Bulletin, 13(1), 1990; and C. A. Lynch and M. Stonebraker, "Extended User-defined Indexing with Application to Textual Databases," VLDB, 1988. This previous work on tight integration considered query processing and optimization for queries on only textual data. On the other hand, loose integration assumes that the new data types are to be managed by external data managers. Once these specialized data managers have been registered with the database system, queries can include operations (sometimes called "foreign functions") over the new data types and operations that span multiple data types. Since the problem overcome by the invention relates to accessing external data sources, loose integration of a database system with external text sources is the only option available.
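By way of illustration only, loose integration through a foreign function can be sketched as follows. The sketch assumes SQLite's application-defined functions as a stand-in for the foreign function mechanism; the `EXTERNAL_DOCS` mapping, the `doc_contains` function, and the `prescription` table are hypothetical names introduced for the example and do not appear in the works cited above.

```python
import sqlite3

# Hypothetical external text source (document id -> text). Under loose
# integration this data stays outside the database system.
EXTERNAL_DOCS = {
    1: "aspirin reduces the risk of heart attack",
    2: "new formulary entry for ibuprofen",
}


def doc_contains(doc_id, term):
    """Hypothetical foreign function: does the external document mention term?"""
    text = EXTERNAL_DOCS.get(doc_id, "")
    return int(term.lower() in text.lower().split())


conn = sqlite3.connect(":memory:")
# Register the external operation so that queries may call it by name.
conn.create_function("doc_contains", 2, doc_contains)

conn.execute("CREATE TABLE prescription (patient TEXT, drug TEXT, doc_id INTEGER)")
conn.executemany(
    "INSERT INTO prescription VALUES (?, ?, ?)",
    [("Smith", "aspirin", 1), ("Jones", "warfarin", 2)],
)

# A single query spanning structured data (the table) and text (the documents).
rows = conn.execute(
    "SELECT patient FROM prescription "
    "WHERE doc_contains(doc_id, drug) ORDER BY patient"
).fetchall()
print(rows)
```

In this sketch the database system sees only the name and arity of `doc_contains`; how the text is stored and searched remains hidden from the query processor, which is the defining property of loose integration.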
There has also been previous work on general frameworks for extensible query optimization. See e.g., H. Pirahesh et al., "Extensible/Rule Based Query Optimization in Starburst," SIGMOD, 1992; S. Chaudhuri and K. Shim, "Query Optimization in the Presence of Foreign Functions," VLDB, 1994; A. Kemper et al., "A Blackboard Architecture for Query Optimization in Object Bases," VLDB, 1993; and G. Mitchell et al., "Control of an Extensible Query Optimizer," VLDB, 1993. These previous works do not address the problem of query processing for operations that span different data types or the impact of these methods on query optimization.
Further, query processing and optimization in the presence of foreign functions were examined in LDL (D. Chimenti et al., "Towards an Open Architecture for LDL," VLDB, 1989) and Papyrus (T. Connors et al., "The Papyrus Integrated Data Server," Proceedings of the First International Conference on Parallel and Distributed Systems, Miami Beach, Fla., Dec. 1991). In these works, however, the only join method used was tuple substitution and its variants. Tuple substitution corresponds to a nested loop join, with the relation as the outer operand and the document set as the inner operand. Because it involves repeated invocations of the external data manager, tuple substitution will in many cases be prohibitively expensive.
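The cost of tuple substitution can be illustrated with a minimal sketch. The `TextSystem` class, its `search` method, and the sample relation are hypothetical stand-ins for an external text retrieval system and a database relation; they are assumptions made for the example, not the interfaces of LDL or Papyrus.

```python
class TextSystem:
    """Hypothetical external text retrieval system that counts its invocations."""

    def __init__(self, docs):
        self.docs = docs      # document id -> text
        self.calls = 0        # number of searches issued so far

    def search(self, term):
        """Return the ids of documents containing term as a whole word."""
        self.calls += 1
        return [doc_id for doc_id, text in self.docs.items() if term in text.split()]


def tuple_substitution_join(relation, column, text_system):
    """Nested loop join: relation is the outer operand, document set the inner."""
    results = []
    for row in relation:                                  # outer loop
        for doc_id in text_system.search(row[column]):    # one external call per row
            results.append((row, doc_id))
    return results


relation = [
    {"patient": "Smith", "drug": "aspirin"},
    {"patient": "Jones", "drug": "warfarin"},
    {"patient": "Lee", "drug": "aspirin"},
]
text_system = TextSystem({
    10: "aspirin lowers fever",
    11: "warfarin interacts with aspirin",
})

joined = tuple_substitution_join(relation, "drug", text_system)
print(len(joined), "joined pairs,", text_system.calls, "external calls")
```

In this sketch the external system is invoked once for every tuple of the relation, including a second search for the repeated value "aspirin", so the number of external calls grows with the size of the relation.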
There is recent work on optimizing queries over structured documents. See e.g., S. Abiteboul et al., "Querying and Updating the File," VLDB, 1993; V. Christophides, "From Structured Documents to Novel Query Facilities," SIGMOD, 1994; and M. P. Consens and T. Milo, "Optimizing Queries on Files," SIGMOD, 1994. The focus of this work was to reduce the amount of data retrieved from the text system for queries with complex selection conditions. This work did not address or provide a solution to the problem of integration between a database system and a text retrieval system.
Thus, there is a need to loosely integrate the capabilities of both structured database systems and text retrieval systems to support uniform ad hoc queries over structured data and text.