The present invention generally relates to data processing and more particularly to a system and method for performing complex data queries.
Databases are computerized information storage and retrieval systems. A relational database management system is a computer database management system (DBMS) that uses relational techniques for storing and retrieving data. The most prevalent type of database is the relational database, a tabular database in which data is defined so that it can be reorganized and accessed in a number of different ways. Regardless of the particular architecture, in a DBMS, a requesting entity (e.g., an application, the operating system or a user) demands access to a specified database by issuing a database access request. Such requests may include, for instance, simple catalog lookup requests or transactions and combinations of transactions that operate to read, change and add specified records in the database. These requests are made using high-level query languages such as the Structured Query Language (SQL). SQL is a standardized language for manipulating data in a relational database.
Illustratively, SQL is used to compose queries that retrieve information from a database and to update information in a database. Commercial databases include products such as International Business Machines' (IBM) DB2, Microsoft's SQL Server, and database products from Oracle, Sybase, and Computer Associates. The term “query” denotes a set of commands used to retrieve or update data by specifying columns, tables and the various relationships between them relevant to the query. Queries take the form of a command language allowing programmers and application programs to select, insert, update, add, modify, and locate data in a relational database.
One issue faced by data mining and database query applications, in general, is their close relationship with a given database schema (e.g., a relational database schema). This relationship makes it difficult to support an application as changes are made to the corresponding underlying database schema. Further, the migration of the application to alternative underlying data representations is inhibited. In today's environment, the foregoing disadvantages are largely due to the reliance applications have on SQL, which presumes that a relational model is used to represent information being queried. Furthermore, a given SQL query is dependent upon a particular relational schema since specific database tables, columns and relationships are referenced within the SQL query representation. As a result of these limitations, a number of difficulties arise.
One difficulty is that changes in the underlying relational data model require changes to the SQL foundation that the corresponding application is built upon. Therefore, an application designer must either forgo changing the underlying data model to avoid application maintenance or must change the application to reflect changes in the underlying relational model. Another difficulty is that extending an application to work with multiple relational data models requires separate versions of the application to reflect the unique SQL requirements driven by each relational schema. Yet another difficulty is evolution of the application to work with alternate data representations because SQL is designed for use with relational systems. Extending the application to support alternative data representations, such as XML, requires rewriting the application's data management layer to use additional data access methods.
A typical approach used to address the foregoing problems is software encapsulation. Software encapsulation involves using a software interface or component to encapsulate access methods to a particular underlying data representation. An example is found in the Enterprise JavaBean (EJB) specification that is a component of the Java 2 Enterprise Edition (J2EE) suite of technologies. In the case of EJB, entity beans serve to encapsulate a given set of data, exposing a set of Application Program Interfaces (APIs) that can be used to access this information. This is a highly specialized approach requiring the software to be written (in the form of new entity EJBs) whenever a new set of data is to be accessed or when a new pattern of data access is desired. The EJB model also requires a code update, application build and deployment cycle to react to reorganization of the underlying physical data model or to support alternative data representations. EJB programming also requires specialized skills; since more advanced Java programming techniques are involved. Accordingly, the EJB approach and other similar approaches are both inflexible and costly to maintain for general-purpose query applications that access an evolving physical data model.
Another approach used to address the foregoing problems is creating a data abstraction layer. A data abstraction layer sits between an application and the underlying physical data. The data abstraction layer defines a collection of logical fields that are loosely coupled to the underlying physical mechanisms storing the data. The logical fields are available to compose queries to search, retrieve, add, and modify data stored in the underlying database.
One difficulty encountered constructing an abstraction layer is representing data derived from multiple rows of data stored in an SQL table (i.e., columnar data). An aggregate data value is calculated from the rows of a query result (or a grouping of these rows). For example, an aggregate may be calculated from multiple rows returned by a query such as an average, a sum, or a slope (used to detect trends within data).
One approach to solve this limitation is to have a database administrator create individual SQL views that perform aggregation calculations and then specify these as a data source that the abstract model may query and join with other results. This solution, however, requires that a database administrator become involved in the creation of these views, and thus can become a bottleneck in having queries created. Also, because a static view performs the aggregation function, a database administrator must create a static view for each different aggregation. Stated another way, users cannot dynamically change the rows included in the aggregation. For example, one static view may provide an aggregate value defined by the average age of adult males living in a particular city. If a user wanted to perform a similar query substituting females, a new static view would have to be created. Finally, as the underlying data sources change, particularly in a distributed environment, statically created views may reference underlying data that is no longer available to respond to query request.
Accordingly, it would be useful to view aggregate values for related groupings of rows joined to non-aggregate data without the requirement of maintaining a static view for each aggregation. Further, it would be useful to allow users to apply conditions that restrict the rows included in a particular aggregation (e.g., only include columnar data in an aggregation when a contemporaneous condition is true). Also, users should be able to apply conditions to the results generated for an aggregation (e.g., return only results where an aggregate value from a group of aggregate values crosses a dynamically selected threshold value).