The invention relates to computer database systems and more specifically to distributed computer database systems.
Organizations routinely collect large amounts of data on their customers, products, operations and business activities. Insights buried in this data can contribute to marketing, reduction of operating costs and strategic decision-making. For example, if there is a strong correlation between the customers who buy one product and those who buy another product, then those customers who have bought just one of them might be good prospects for buying the other product.
Analytical processing of data is primarily done using statistical methods to extract correlations and other patterns in the data. This kind of processing has been variously called data mining, knowledge discovery and knowledge extraction. A search for a specific pattern or kind of pattern in a large collection of data will be called a pattern query.
Large enterprises typically maintain many databases, many of which are transactional databases. The requirements of these databases are often in conflict with the requirements of data mining. Transactional databases are updated using small transactions operating in real time. Data mining, on the other hand, uses large pattern queries that do not have to take place in real time. To resolve this conflict, it is now common for data from a variety of sources to be downloaded to a centralized resource called a data warehouse.
The downloading and centralizing of data from diverse, often disparate sources requires a number of tasks. The data must be extracted from the sources, transformed to a common, integrated data model, cleansed to eliminate or correct erroneous or inaccurate data and integrated into the central warehouse, which constitutes yet another database in which all the data is stored. In addition, one must ensure that every instance of every business entity, such as a customer, product or employee, has been correctly identified. This is known as the problem of referential integrity. All of these are difficult tasks, especially ensuring referential integrity when the data is being downloaded from databases that identify the business entities in slightly different ways. Current technology downloads data to the data warehouse as an activity independent of data mining. In contrast with data mining, for which there is a large research literature and many commercial products, data warehousing does not have a strong theoretical basis and has few good commercial products.
Because data warehouses integrate many diverse data sources, it is necessary to specify an integrated data model for the data warehouse as well as a data mapping that extracts, transforms and cleanses data from each data source. It is known in the art that richer data models, such as object-oriented data models, are better suited for defining such an integrated data model and for defining the data mappings, than more limited data models, such as the relational model. Yet most data warehouses still use a flat record structure such as the relational model. Relational databases have a very limited data structure, so that synthesizing more complex data structures is awkward and error-prone. Some of the kinds of data that are poorly suited to storage in a relational database include: textual data in general, hypertext documents in particular, images, sound, multimedia objects and multi-valued attributes. Relational databases are also poorly suited for representing records that have a very large number of potential attributes, only a few of which are used by any given record.
An object database consists typically of a collection of data or information objects. Each information object is identified uniquely by an object identifier (OID). Each information object can have features, and some features can have associated values. Information objects can also contain or refer to other information objects.
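The information object model just described can be sketched as follows. This is a minimal illustration in Python only, not the data model of any particular object database; the class name `InformationObject`, its field names and the helper `all_oids` are assumptions introduced for this example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Union

FeatureValue = Optional[Union[str, int, float]]


@dataclass
class InformationObject:
    """Hypothetical information object: a unique OID plus features."""
    oid: str
    # A feature may carry a value, or None when it has no associated value.
    features: Dict[str, FeatureValue] = field(default_factory=dict)
    # Objects contained in this object.
    contained: List["InformationObject"] = field(default_factory=list)
    # OIDs of objects this object merely refers to.
    references: List[str] = field(default_factory=list)


def all_oids(obj: InformationObject) -> List[str]:
    """Return the OID of obj followed by the OIDs of all contained objects."""
    oids = [obj.oid]
    for child in obj.contained:
        oids.extend(all_oids(child))
    return oids


# Illustrative data: a customer object containing one order object.
customer = InformationObject(
    oid="cust-001",
    features={"name": "Acme Ltd", "preferred": None},
)
order = InformationObject(
    oid="order-17",
    features={"total": 250.0},
    references=["cust-001"],
)
customer.contained.append(order)
```

In this sketch, containment is modeled by nesting objects, while a reference is modeled by storing only the OID of the target object.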
To assist in finding information in a database, including the warehousing database, special search structures are employed called indexes. Large databases require correspondingly large index structures to maintain pointers to the stored data. Such an index structure can be larger than the database itself. Current technology requires a separate index for each attribute or feature. This technology can be extended to allow for indexing a small number of attributes or features in a single index structure, but this technology does not function well when there are hundreds or thousands of attributes. Furthermore, there is considerable overhead associated with maintaining an index structure. This limits the number of attributes or features that can be indexed, so the ones that are supported must be chosen carefully. For transactional databases, the workload is usually well understood, so it is possible to choose the indexes so as to optimize the performance of the database. For a data warehouse, there is usually no well defined workload, so it is much more difficult to choose which attributes to index.
Further information can be had regarding the foregoing concepts with reference to the following publications:
1 L. Aiello, J. Doyle, and S. Shapiro, editors. Proc. Fifth Intern. Conf. on Principles of Knowledge Representation and Reasoning. Morgan Kaufmann Publishers, San Mateo, Calif., 1996.
2 K. Baclawski. Distributed computer database system and method, December 1997. U.S. Pat. No. 5,694,593. Assigned to Northeastern University, Boston, Mass.
3 A. Del Bimbo, editor. The Ninth International Conference on Image Analysis and Processing, volume 1311. Springer, September 1997.
4 N. Fridman Noy. Knowledge Representation for Intelligent Information Retrieval in Experimental Sciences. PhD thesis, College of Computer Science, Northeastern University, Boston, Mass., 1997.
5 M. Hurwicz. Take your data to the cleaners. Byte Magazine, January 1997.
6 Y. Ohta. Knowledge-Based Interpretation of Outdoor Natural Color Scenes. Pitman, Boston, Mass., 1985.
7 A. Tversky. Features of similarity. Psychological Review, 84(4):327-352, July 1977.
8 S. Weiss and N. Indurkhya. Predictive Data Mining: A Practical Guide. Morgan Kaufmann Publishers, Inc., San Francisco, Calif., 1998.
9 J.-L. Weldon and A. Joch. Data warehouse building blocks. Byte Magazine, January 1997.
The disclosures of the publications referenced in this "Background of the Invention" are incorporated herein by reference.
It would be desirable to provide improved systems for data warehousing and data mining that overcome many of the performance and other problems and limitations of current systems.
The present invention combines the two activities of data warehousing and data mining, thereby improving the basis and support for data warehousing. The term knowledge extraction will be used herein for the integration of the data warehousing and data mining activities.
The invention resides in an information retrieval apparatus and method for processing a query from a user for retrieval of information from the data warehouse. The apparatus includes a mechanism for locating a number of features and feature fragments in an index database; an evaluating mechanism for identifying a number of sub-queries of a number of levels contained in the query and recursively evaluating the sub-queries using each of the located features and feature fragments; and a mechanism for collecting and storing a number of results of the recursive evaluation of the query and sub-queries pursuant to computing an overall result of the query.
As used herein, "evaluation" is a process by which a response to a query is generated, characterized by retrieval of information, information location specifiers, or data regarding the information, which match criteria set forth in the query. Recursive evaluation is a type of query evaluation in which new queries, called sub-queries, are generated from the query and evaluated. The sub-queries so generated can be regarded as nodes in a query tree, with the original query as a base node, and each sub-query having a corresponding level within the tree defined by its relationship with predecessor queries from which it was generated. All of the sub-queries, i.e., predecessor queries and child queries, are evaluated recursively, and the results collected, stored, and provided to the user in response to the query.
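Recursive evaluation over a query tree can be illustrated with the following minimal sketch. It runs on a single machine against a toy in-memory index and is not the disclosed apparatus; the `Query` class, the `INDEX` contents, the `intersect` computation and the `trace` list are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Set, Tuple

# Hypothetical toy index: feature -> OIDs of objects having that feature.
INDEX: Dict[str, Set[str]] = {
    "bought:printer": {"c1", "c2", "c3"},
    "bought:ink": {"c2", "c3", "c4"},
    "region:east": {"c1", "c2"},
}

Combine = Callable[[Set[str], List[Set[str]]], Set[str]]


@dataclass
class Query:
    """A query node: a feature, a computation, and zero or more sub-queries."""
    feature: str
    combine: Combine
    sub_queries: List["Query"] = field(default_factory=list)


def intersect(local: Set[str], children: List[Set[str]]) -> Set[str]:
    """Example computation: keep OIDs found locally and by every sub-query."""
    return set.intersection(local, *children)


def evaluate(
    query: Query,
    level: int = 0,
    trace: Optional[List[Tuple[int, str, List[str]]]] = None,
) -> Set[str]:
    """Evaluate a query by first evaluating its sub-queries recursively."""
    local = INDEX.get(query.feature, set())
    child_results = [evaluate(sq, level + 1, trace) for sq in query.sub_queries]
    result = query.combine(local, child_results)
    if trace is not None:
        # Collect the result of every node together with its tree level.
        trace.append((level, query.feature, sorted(result)))
    return result


# Base node (level 0) with one sub-query (level 1), which has its own (level 2).
root = Query(
    "bought:printer",
    intersect,
    [Query("bought:ink", intersect, [Query("region:east", intersect)])],
)
```

Calling `evaluate(root, trace=[])` walks the tree depth first, so the deepest sub-query is resolved before its predecessors, and the collected trace records one entry per level.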
The invention can eliminate the need in conventional retrieval systems for providing a new, separate, centralized replica within the data warehouse of the data in the diverse external databases. The invention can thus avoid the problems of replication of such data in conventional systems, in which the data may become stale or is subject to errors arising during replication for warehousing. Instead, the data warehouse can contain an index database, which stores entries providing data regarding the information stored in the external databases, such as information location specifiers for that data within those databases, relational information and statistics. The invention can also provide a robust, versatile indexing system. The index of the invention supports, e.g., indexing of sparse records that have large numbers of potential attributes, only a few of which are used in a particular record. The present invention also supports, e.g., indexing of very large numbers of attributes in a substantially uniform data structure, making it much easier to determine the workload characteristics necessary for achieving high performance.
More specifically, according to an aspect of the invention, a distributed computer database system includes one or more front end computers and one or more computer nodes interconnected by a network into a data warehouse and data mining engine, which indexes objects including images, sound and video streams, as well as plain and structured text. An object from an external database is downloaded from the network by a node, termed the warehousing node. The warehousing node extracts some features from the object, fragments each of the extracted features into a number of feature fragments, and hashes these feature fragments. Each hashed feature fragment is transmitted to one node on the network, called an index node. Each node on the network that receives a hashed feature fragment uses the hashed feature fragment of the object to perform a search on its respective partition of the index database. The results of the searches of the local databases are gathered by the warehousing node. The warehousing node uses these results to determine whether the object has already been indexed in the data warehouse. The warehousing node then extracts the features from the object, fragments the features, and hashes these feature fragments. Each hashed feature fragment is transmitted to one node on the network. Each node on the network that receives a hashed feature fragment uses the hashed feature fragment of the object to store the feature in its respective partition of the index database.
The query can be, for example, a pattern query. A pattern query is a search for a pattern in the data. A pattern query from a user is transmitted to one of the front end computers which forwards the pattern query to one of the index nodes, termed the home node, of the data mining engine. The home node decomposes the pattern query into one or more sub-queries, each sub-query being stored in memory and including an object feature and a computer-executable program implementing a method, e.g., a computation. The computation may involve additional sub-queries. The home node fragments each of the sub-query features into one or more sub-query feature fragments and then hashes the feature fragments. Each sub-query feature fragment is transmitted to one node on the network, according to the hashed feature fragment. Each node on the network that receives a sub-query uses the hashed feature fragment of the sub-query to perform a search on its respective partition of the index database, and the accessed data is used by the computation of the sub-query. If the computation of a sub-query contains additional sub-queries (and it may contain zero, one or more sub-queries), then the additional sub-queries are evaluated recursively, and the data obtained by the recursive evaluation is used by the computation of the sub-query. The results of the searches of the local index databases and the results of any recursive evaluations are gathered by the home node. The results of the pattern query are determined by the home node and returned to the user.
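The fragment, hash and transmit steps performed by the warehousing node can be sketched as follows. This is an in-process simulation under stated assumptions rather than the disclosed implementation: fragments are taken to be overlapping character trigrams, SHA-256 stands in for the unspecified hash function, the index node is selected by taking the hash modulo the node count, and the names `IndexNode`, `index_object` and `lookup` are hypothetical.

```python
import hashlib
from collections import defaultdict
from typing import Dict, List, Set

NUM_NODES = 4       # assumed number of index nodes
FRAGMENT_SIZE = 3   # assumed fragment length (character trigrams)


def fragment(feature: str) -> List[str]:
    """Split a textual feature into overlapping fixed-length fragments."""
    text = feature.lower()
    if len(text) <= FRAGMENT_SIZE:
        return [text]
    return [text[i:i + FRAGMENT_SIZE] for i in range(len(text) - FRAGMENT_SIZE + 1)]


def hash_fragment(frag: str) -> int:
    """Hash a fragment to a 64-bit integer (SHA-256 is an assumed choice)."""
    return int.from_bytes(hashlib.sha256(frag.encode("utf-8")).digest()[:8], "big")


def node_for(hashed: int) -> int:
    """Pick the index node responsible for a hashed fragment (assumed rule)."""
    return hashed % NUM_NODES


class IndexNode:
    """One partition of the index database: hashed fragment -> OIDs."""

    def __init__(self) -> None:
        self.partition: Dict[int, Set[str]] = defaultdict(set)

    def store(self, hashed: int, oid: str) -> None:
        self.partition[hashed].add(oid)

    def search(self, hashed: int) -> Set[str]:
        return set(self.partition.get(hashed, set()))


nodes = [IndexNode() for _ in range(NUM_NODES)]


def index_object(oid: str, features: List[str]) -> None:
    """Fragment and hash each feature, then store it on the selected node."""
    for feature in features:
        for frag in fragment(feature):
            hashed = hash_fragment(frag)
            nodes[node_for(hashed)].store(hashed, oid)


def lookup(feature: str) -> Dict[str, int]:
    """Search the partitions and count matching fragments per OID."""
    hits: Dict[str, int] = defaultdict(int)
    for frag in fragment(feature):
        hashed = hash_fragment(frag)
        for oid in nodes[node_for(hashed)].search(hashed):
            hits[oid] += 1
    return dict(hits)
```

Because each hashed fragment maps to exactly one node, a search for a fragment touches only the partition that could hold it, and the per-OID counts returned by `lookup` are the kind of gathered results from which the warehousing node could decide whether an object is already indexed.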
In another aspect of the invention, a distributed computer database system includes one or more front end computers and one or more computer nodes interconnected by a network to operate as a knowledge extraction engine, which supports both the data warehousing activity and the data mining activity.
First consider the data warehousing activity. The downloading of objects from another database to the warehouse is performed by a warehousing node. For an object downloaded from another database, the warehousing node first determines whether the object might already be represented in the data warehouse due to a download from another database. If this might be the case, the warehousing node extracts one or more of the features of the object, fragments each of the object features into a number of feature fragments, and then hashes each of these feature fragments. A portion of each hashed feature fragment is used by the warehousing node as an addressing index by which the warehousing node transmits the hashed object feature to an index node on the network. Each index node on the network that receives a hashed object feature fragment uses the hashed object feature fragment to perform a search on its respective index database. Nodes finding data corresponding to the hashed object feature return the OIDs of the warehoused objects possessing this feature fragment. Such OIDs are then gathered by the warehousing node and a similarity function is computed. The similarity function is used to determine whether the object is already stored in the data warehouse. If the object is determined to be represented in the data warehouse, then the OID of the warehoused object is used for the downloaded object. If it is not already represented, then a unique OID is chosen for the object. The warehousing node then extracts features of the object, fragments them, and then hashes these feature fragments. A portion of each hashed feature fragment is used by the warehousing node as an addressing index by which the warehousing node transmits the hashed object feature fragment to an index node on the network where the feature is stored in the data warehouse.
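The decision of whether a downloaded object is already represented in the warehouse can be illustrated with a simple similarity function. The text does not specify the function, so this sketch assumes one: the fraction of the object's hashed feature fragments for which an index node returned a given OID, compared against an assumed threshold. The names `similarity_scores`, `resolve_oid` and `SIMILARITY_THRESHOLD` are hypothetical.

```python
import uuid
from collections import Counter
from typing import Dict, List, Set, Tuple

SIMILARITY_THRESHOLD = 0.8  # assumed cut-off; not given in the text


def similarity_scores(responses: List[Set[str]]) -> Dict[str, float]:
    """Score each candidate OID by the fraction of fragments it matched.

    responses[i] holds the OIDs returned by the index node that was sent
    the i-th hashed feature fragment of the downloaded object.
    """
    if not responses:
        return {}
    counts = Counter(oid for oids in responses for oid in oids)
    total = len(responses)
    return {oid: n / total for oid, n in counts.items()}


def resolve_oid(responses: List[Set[str]]) -> Tuple[str, bool]:
    """Return (oid, is_new): reuse a warehoused OID or choose a fresh one."""
    scores = similarity_scores(responses)
    if scores:
        # Highest score wins; ties are broken by OID to stay deterministic.
        best_oid, best_score = max(scores.items(), key=lambda kv: (kv[1], kv[0]))
        if best_score >= SIMILARITY_THRESHOLD:
            return best_oid, False
    return uuid.uuid4().hex, True
```

For example, if every fragment of a downloaded object matches warehoused object `w-1`, `resolve_oid` reuses `w-1`; if only one fragment in four matches any warehoused object, a new OID is chosen. Other similarity measures could be substituted without changing the surrounding flow.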
Next consider the data mining activity. A user wishing to evaluate a query, such as to search for a pattern in the data, transmits a query to one of the front end computers which in turn forwards the query to one of the index nodes of the network. The node receiving the query, termed the home node of the data warehouse, decomposes the query into one or more sub-queries. A sub-query includes a feature and a computer-executable program implementing a method, e.g., a computation, which may include additional sub-queries. The home node stores the sub-queries, fragments the features of each sub-query into one or more sub-query feature fragments, and then hashes each of the feature fragments of the sub-queries. A portion of each hashed feature fragment is used by the home node as an addressing index by which the home node transmits the sub-query to a node on the network. Each index node on the network that receives a sub-query uses the hashed sub-query feature to perform a search on its respective index database. Nodes finding data corresponding to the hashed sub-query feature fragment perform the computation specified in the sub-query. If the computation does not contain any additional sub-queries, then the results of the computation are returned to the home node. If the computation does contain additional sub-queries, then the node takes the role of the home node with respect to the sub-queries contained in the computation. In particular, the node hashes the feature fragments of the contained sub-queries and transmits the sub-queries to other nodes. This process continues recursively until the computation is complete and the final results are returned to the original home node. Upon receiving the results of the computation, the home node performs any remaining data aggregation specified by the original pattern query and transmits the information to the front end node. The front end node formats the response to the user, and transmits the formatted response to the user.