1. Field of the Invention
This invention relates in general to a computer-implemented encoding system, and more particularly, to an encoding system for concept or group hierarchies.
2. Description of Related Art
Databases are computerized information storage and retrieval systems. A Relational Database Management System (RDBMS) is a database management system (DBMS) which uses relational techniques for storing and retrieving data. Relational databases are organized into tables which consist of rows and columns of data. The rows are formally called tuples. A database will typically have many tables and each table will typically have multiple tuples and multiple columns. The tables are typically stored on random access storage devices (RASD) such as magnetic or optical disk drives for semi-permanent storage.
RDBMS software using a Structured Query Language (SQL) interface is well known in the art. The SQL interface has evolved into a standard language for RDBMS software and has been adopted as such by both the American National Standards Institute (ANSI) and the International Organization for Standardization (ISO). The SQL interface allows users to formulate relational operations on the tables either interactively, in batch files, or embedded in host languages, such as C and COBOL. SQL allows the user to manipulate the data.
The definitions for SQL provide that an RDBMS should respond to a particular query with a particular set of data given a specified database content, but the method that the RDBMS uses to actually find the required information in the tables on the disk drives is left up to the RDBMS. Typically, there will be more than one method that can be used by the RDBMS to access the required data. The RDBMS will optimize the method used to find the data requested in a query in order to minimize the computer time used and, therefore, the cost of executing the query.
It is common in Internet text mining and Business Intelligence applications for atomic data in a relational database to be related by one or more hierarchies of concepts or groups. A "concept" or "group" is a generalization of one or more keywords or parts of parsed text. A text search query against a large corpus (i.e., collection of data) may be modeled as a search for various concepts or groups.
For both Internet text mining and Business Intelligence applications, the computation of aggregate functions for each individual concept or group is a basic, often repeated, operation. The aggregate grouping operator is used to rank one collection versus another. A. Klug, Access Path in the "Abe" Statistical Query Facility, Proc. ACM SIGMOD, 1982, pp. 161-172, which is incorporated by reference herein, teaches that special treatment should be given to aggregation of groups. Efficiency for the aggregate grouping operator was addressed in D. J. Haderle and E. J. Lynch, Evaluation of Column Function on Grouped Data During Data Ordering, IBM Technical Disclosure Bulletin, Mar. 10, 1990, pp. 385-386, which is incorporated by reference herein.
The following example is from the Business Intelligence domain. One skilled in the art would recognize that there are equivalent examples in the domain of Internet text mining. Example concepts or groups are (sales by) STATE(s), (sales by) CITY(s), etc. Concepts and groups have members, and, typically, members are related by a hierarchy. For example, for the group STATE(s), one natural hierarchy is based on geography (e.g., East Coast states, Southwest states, Midwest states, Western states, etc.).
Members can be related by more than one hierarchy. For example, in a department store, individual products are related hierarchically by product categories (e.g., housewares, hardware, etc.), as described in R. A. Kevin, D. Kalyanaraman, and D. J. Howard, Product Hierarchy and Brand Strategy Influence on the Order of Entry Effect for Consumer Packaged Goods, Journal of Product Innovation Management, Vol. 13, No. 1, Jan. 1996, which is incorporated by reference herein. In addition, the products are related by the employees of the department store who are responsible for buying them, commonly referred to as buyers, and by the management structure of these employees.
When members of the hierarchy are represented just by their names (e.g., "12 oz. Tide") in a table of transaction records (e.g., a DETAIL table), conventional systems employ one of two methods to compute the aggregations.
In the first method, a sequence of joins is performed between the DETAIL table and other tables that carry a map of the hierarchy. The number of joins is determined by the maximum depth of the hierarchy. After the sequence of joins, this method performs aggregations. This method is described in more detail in M. Wang and B. Iyer, Efficient Roll-Up and Drill-Down Analysis in Relational Databases, SIGMOD Data Mining Workshop, 1997, which is incorporated by reference herein. The first method does not clearly address the issue of input/output ("I/O") efficiency and is also subject to the known "quirkiness" of commercial RDBMS optimizers, which sometimes causes the RDBMS to execute a query sub-optimally.
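By way of illustration only, the first method may be sketched as follows. The sketch assumes a hypothetical three-level hierarchy (product, category, department); the table names, column names, and data values other than "12 oz. Tide", housewares, and hardware are assumptions made for the example and do not appear in the cited references. Because the assumed hierarchy has a depth of two above the product level, two joins precede the aggregation.

```python
import sqlite3

# Hypothetical schema: DETAIL holds transactions keyed by product name only;
# two separate tables carry the map of the hierarchy.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE detail (product TEXT, amount INTEGER);
    CREATE TABLE product_category (product TEXT, category TEXT);
    CREATE TABLE category_department (category TEXT, department TEXT);

    INSERT INTO detail VALUES
        ('12 oz. Tide', 400), ('12 oz. Tide', 400),
        ('Dish rack', 1200), ('Claw hammer', 900), ('Garden hose', 1500);
    INSERT INTO product_category VALUES
        ('12 oz. Tide', 'housewares'), ('Dish rack', 'housewares'),
        ('Claw hammer', 'hardware'), ('Garden hose', 'garden');
    INSERT INTO category_department VALUES
        ('housewares', 'home goods'),
        ('hardware', 'home improvement'),
        ('garden', 'home improvement');
""")


def rollup_by_department(connection):
    """Join DETAIL to each hierarchy level, then aggregate at the top level."""
    query = """
        SELECT cd.department, SUM(d.amount)
        FROM detail AS d
        JOIN product_category AS pc ON d.product = pc.product       -- join 1
        JOIN category_department AS cd ON pc.category = cd.category -- join 2
        GROUP BY cd.department
        ORDER BY cd.department
    """
    return connection.execute(query).fetchall()


totals = rollup_by_department(conn)
print(totals)
```

Each additional level in the hierarchy would add one more join to the query, and the order in which the joins are executed is left to the optimizer of the RDBMS.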
In the second method, the names of the ancestors of each member are stored along with the member in the DETAIL table. This method requires a large amount of space for storing the names of the ancestors. Therefore, the second method is wasteful of space.
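By way of illustration only, the second method may be sketched as follows. The sketch assumes a hypothetical DETAIL table widened with one column per ancestor level (category and department); the table name, column names, and data values other than "12 oz. Tide", housewares, and hardware are assumptions made for the example. The aggregation needs no join, but every transaction record carries the full ancestor names.

```python
import sqlite3

# Hypothetical widened DETAIL table: ancestor names are stored in every row.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE detail_wide (
        product TEXT, category TEXT, department TEXT, amount INTEGER
    );
    INSERT INTO detail_wide VALUES
        ('12 oz. Tide', 'housewares', 'home goods', 400),
        ('12 oz. Tide', 'housewares', 'home goods', 400),
        ('Dish rack', 'housewares', 'home goods', 1200),
        ('Claw hammer', 'hardware', 'home improvement', 900),
        ('Garden hose', 'garden', 'home improvement', 1500);
""")

# Aggregation at any level is a single GROUP BY with no joins.
totals = conn.execute("""
    SELECT department, SUM(amount)
    FROM detail_wide
    GROUP BY department
    ORDER BY department
""").fetchall()

# Rough measure of the space overhead: characters spent on ancestor names
# versus characters spent on the member names themselves.
ancestor_chars, member_chars = conn.execute("""
    SELECT SUM(LENGTH(category) + LENGTH(department)), SUM(LENGTH(product))
    FROM detail_wide
""").fetchone()

print(totals)
print(ancestor_chars, member_chars)
```

In this small example the ancestor names occupy more characters than the member names, and the overhead grows with both the number of transaction records and the depth of the hierarchy.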
There is a need in the art for an improved method of computing aggregations.