The present invention relates generally to the field of computers and computer systems. More particularly, the present invention relates to data handling in relational database management systems (RDBMSs) supporting structured query language (SQL) operators.
Data mining or business analytics algorithms are commonly required to handle large amounts of data stored in RDBMSs. Such algorithms may be used in social network analysis (SNA), data mining clustering, and other applications such as handling outlier detection, and computing Eigen values and vectors representing targeted (“interesting”) data mining knowledge such as principal components. The algorithms use matrix or vector mathematical forms which are not easily coded with SQL available in current RDBMSs.
Some limited matrix features exist, including matrix operations implemented either with stored procedures (SP), or user defined functions (UDF), user defined table functions (UDTF), or user defined types (UDT), for instance PIVOT( )/UNPIVOT( ) operators. However, these built-in functions are primarily used to provide reporting capabilities for the RDBMS and are not suited for more generalized matrix operations.
Conventional matrix processing techniques involve extracting data from a RDBMS into a specific application (MATLAB® or R-package for instance) or distributing it in memory over a cluster of machines specifically coded for processing the data in that way: high-performance computing (HPC) clusters or parallel-machine learning (PML) clusters are examples. After processing, the results are stored back in the original RDBMS. This is inefficient as many copies of the data are required, often copied over the network over and over again if whole matrices cannot fit into main memory of the processing device.
Examples of systems in which reporting data is exported from a RDBMS for processing are described in U.S. Pat. No. 6,421,655 and U.S. Pat. No. 7,734,652. The problem of the need to export certain forms of data from an RDBMS to enable processing is described and discussed further in U.S. Pat. No. 7,181,449 (commonly assigned with the present application) which relates to the application of a scalar function across a varying number of records within an RDBMS using the RDBMS's SQL capabilities, that is to say without pre-processing and without copying rows of data to another processor or process.
It has been proposed to manually write SQL statements for specific in-RDBMS matrix operations. However, such SQL statements merely optimize data fetching and rewriting and do not enable the RDBMS engine to optimize the overall set of matrix operations according to the matrix algebra and characteristics of the data (sparse matrices for instance). Even though some matrix operations can be written as two lines of SQL once the formatting of table data has been done in the appropriate manner, it is still a cumbersome design and development phase and not that flexible (e.g., statements have to be rewritten if matrix dimensions change). Some matrix operations are very complicated even when written as standard SQL statements and no standard programmer can support them it when they involve recursion and intricate features. For instance, a basic matrix multiplication may be defined in a short SQL sequence finally, but the sequence to achieve this involves:
1) getting the proper basic raw business information stored in tables (call detail records (CDR)s, customer information, orders, and so forth);
2) formatting the columns and records in a matrix-friendly manner (e.g., column I, column J, value);
3) writing a SQL statement that will allow the matrix operation to occur; and
4) storing the results back into a RDBMS table (re-formatting of columns/records may have to occur again).
As will be understood, in current forms of SQL, it is not always possible to rearrange the business data to match an easier matrix notation. The values may be stored in multiple columns with no indirection (for instance: CDR information). Therefore, creating a generic matrix multiplication for application to this data requires re-formatting of the data to a more matrix-friendly form before applying the SQL, and this is usually not done because it requires expert SQL skills. Crafting SQL statements that can combine multiple matrix operations is even more complicated and is often crafted as a set of multiple independent operations or independent SQL statements, potentially creating more input/output (I/O) operations and increasing latency in the processing.