The increased availability of computer systems and the ability to connect the computer systems using various networks such as intranets and the Internet, for example, has made vast repositories of information available to a large number of people. In many instances, having such a large amount of information at one's fingertips greatly enhances productivity.
But these advances in information accessibility and processing have created other challenges, e.g., how to search and manage such a large collection of information, especially when the information is stored in various formats and repositories. Many new tools have been developed to deal with the ever-expanding volume of information that is now available for consumption in an electronic form.
For example, referring to FIG. 1, conventional record-based data storage 100 is typically organized around the concepts of tables, columns and rows. A table is defined by a series of columns, each having certain characteristics (i.e., corresponding to a particular category of data, e.g., a name or date), and the data is stored as rows in the tables. For example, a database of customer-related data may include a CUSTOMER table 110, having columns for CUST_ID (customer identifier), L_NAME (last name), and F_NAME (first name). Another table (the PURCHASES table 115) may include data related to purchases made by the customers, and be defined as having columns named CUST_ID, PUR_ID (purchase identifier) and PUR_DATE (purchase date). Furthermore, each purchase may include more than one product, so a PRODUCT table 120 may be defined as having a PUR_ID column and a PRO_ID (product identifier) column. Such an arrangement allows for multiple purchases (each may have more than one product) to be recorded for an individual customer without having large amounts of redundant data (e.g., having to store data in the LAST NAME for every purchase). As an example, customer number 00001 (Ann Smith) made three purchases (PUR_IDs 9901, 9902 and 9903). Further, one of those purchases (9901) included two products, AAAA and BBBB. However, a simple inquiry into one table (e.g., the PURCHASES table 115) may not provide all the desired information for a report because certain descriptives such as customer names and product names are stored in other tables.
To accommodate queries and other data transactions that require data from more than one table, certain columns may be designated as foreign keys. A foreign key is a referential constraint between two tables that identifies a column (or a set of columns) in one table (typically referred to as the “referencing” table) that refers to a column or set of columns in another table (the “referenced” table). Using the example above, the CUST_ID column serves as a foreign key from the CUSTOMER table 110 to the PURCHASES table 115, and the PUR_ID column serves as a foreign key to the PRODUCT table 120. Therefore, a request to retrieve a listing of all products purchased by customer 00001 with the customer's name and date of purchase may be formulated as follows:
Select CUSTOMER.F_NAME, CUSTOMER.L_NAME,PURCHASES.PUR_DATE, PRODUCT.PRO_ID fromCUSTOMER, PURCHASES, PRODUCT whereCUSTOMER.CUST_ID = PURCHASES.CUST_ID,PURCHASES.PUR_ID = PRODUCT.PUR_ID, CUST_ID =‘00001’by taking advantage of the foreign keys from each table. Such an approach works well for applications that utilize a database interface for information retrieval. However, with the increased popularity and simple user interface of search engines, the desire to use conventional database retrieval techniques has waned.
Unfortunately, database design techniques that are aimed at reducing data redundancy and enforcing data normalization rules typically do not support full-text indexing and querying of text documents as do modern search engines.
For example, the World Wide Web (“WWW” or “web”) can provide access to a vast amount of information, and specialized search tools, known as “search engines” (e.g., Google, Yahoo, and MSN Search) have achieved great success in facilitating searching of static text documents. Conventional web-based search engines, however, are not designed for use in an enterprise environment because data can be stored in many different forms, using various localized repositories and databases. While a data repository on the Internet or an intranet may contain record-based data relevant to a search query, the search engine may not be capable of indexing and/or accessing the data. A similar problem may be encountered with other forms of content such as word-processing documents, graphical or image files, MP3 clips, interactive blogs, and other data that may change in real time.
Conventional methods of executing a query referencing multiple tables in a search engine tend to fall into one of two categories: (i) denormalization, in which the joined tables must be combined at index time, or (ii) subdivision, where the query is divided into two or more table queries which are processed independently, and the results combined in a post-processing phase. Denormalization has several drawbacks, primarily the increase in the size of the index, because tables with multiple foreign keys can expand by orders of magnitude after denormalization. The post-processing approach involves extracting a large volume of data from the index (typically the entire contents of one or more tables) and then winnowing the data down based on the join constraints. This is also an inefficient use of resources.