In data processing it is advantageous to determine relationships between data values in large data sets. Approaches to characterizing data values include clustering and classification, in which different techniques are used to group and characterize the data (as set out, for example, in M. Ester, H.-P. Kriegel, and X. Xu. A Database Interface for Clustering in Large Spatial Databases. In Proc. of the Int'l Conf. on Knowledge Discovery & Data Mining, 1995; T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: An Efficient Data Clustering Method for Very Large Databases. In ACM SIGMOD Int'l Conf. on the Management of Data, Montreal, Canada, 1996; and M. Mehta, R. Agrawal and J. Rissanen. SLIQ: A Fast Scalable Classifier for Data Mining. In Advances in Database Technology, Int'l Conf. on Extending Database Technology (EDBT), Avignon, France, March 1996). Such techniques permit the development of a more “parsimonious” version of the data (as described in H. V. Jagadish, J. Madar, and R. T. Ng. Semantic Compression and Pattern Extraction with Fascicles. In Proc. of the Int'l Conf. on Very Large Data Bases (VLDB), pages 186–197, 1999). Data may be compressed, and data may be analyzed to reveal hidden patterns and trends in the data (data mining). Association rules and fascicles are used in the prior art to determine characteristics of a data set.
The data patterns discovered by the prior art data mining techniques are defined by a measure of similarity (data values must be identical or similar to appear together in a pattern) and some measure of degree of frequency or occurrence (a pattern is only interesting if a sufficient number of values manifest the pattern).
Where a data set has two attributes that are of interest, and the attribute values are discrete, a discrete binary matrix may be created to represent the data values in the data set with respect to those attributes. Where such a discrete binary matrix is defined, characteristics of the data may be analyzed by determining which portions of the matrix contain rectangles of homogenous values. Typically the values being determined are zero values in the binary matrix, and the rectangles being determined or discovered are termed empty rectangles.
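By way of illustration only, the representation described above can be sketched in Python as follows. The sketch assumes that each attribute has an ordered list of discrete values, that a matrix entry is 1 where the corresponding pair of values occurs in the data set and 0 otherwise, and that a rectangle is a contiguous range of rows and columns. The function names (`build_matrix`, `is_empty`, `is_maximal_empty`) are hypothetical and are not drawn from any of the cited references.

```python
def build_matrix(pairs, x_values, y_values):
    """Return M where M[i][j] is 1 if (x_values[i], y_values[j]) occurs in pairs, else 0."""
    present = set(pairs)
    return [[1 if (x, y) in present else 0 for y in y_values] for x in x_values]


def is_empty(matrix, top, left, bottom, right):
    """True if every entry in rows top..bottom and columns left..right (inclusive) is 0."""
    return all(
        matrix[i][j] == 0
        for i in range(top, bottom + 1)
        for j in range(left, right + 1)
    )


def is_maximal_empty(matrix, top, left, bottom, right):
    """True if the rectangle is empty and cannot be grown by one row or column on any side."""
    if not is_empty(matrix, top, left, bottom, right):
        return False
    rows, cols = len(matrix), len(matrix[0])
    # Try to extend by one row above or below, then one column left or right.
    if top > 0 and is_empty(matrix, top - 1, left, top - 1, right):
        return False
    if bottom < rows - 1 and is_empty(matrix, bottom + 1, left, bottom + 1, right):
        return False
    if left > 0 and is_empty(matrix, top, left - 1, bottom, left - 1):
        return False
    if right < cols - 1 and is_empty(matrix, top, right + 1, bottom, right + 1):
        return False
    return True


# Hypothetical data set: three (x, y) pairs over two small ordered domains.
example_pairs = [(1, "a"), (2, "b"), (3, "c")]
example_matrix = build_matrix(example_pairs, [1, 2, 3], ["a", "b", "c"])
```

In this hypothetical example the rectangle covering row 0 and columns 1 to 2 is a maximal empty rectangle, whereas the single entry at row 0, column 2 is empty but not maximal because it can be extended one column to the left.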
Prior art approaches to determining empty rectangles include finding or determining the location of a set of maximal empty rectangles in a binary matrix (see for example A. Naamad, W. L. Hsu, and D. T. Lee. On the maximum empty rectangle problem. Discrete Applied Mathematics, (8):267–277, 1984; M. J. Atallah and G. N. Frederickson. A note on finding a maximum empty rectangle. Discrete Applied Mathematics, (13):87–91, 1986; Bernard Chazelle, Robert L. (Scot) Drysdale III, and D. T. Lee. Computing the largest empty rectangle. SIAM J. Comput., 15(1):550–555, 1986; and M. Orlowski. A New Algorithm for the Largest Empty Rectangle Problem. Algorithmica, 5(1):65–73, 1990).
In the prior art approaches, the method for determining the maximal empty rectangles in a binary matrix requires continual access to, and modification of, a data structure that is as large as the original matrix itself. This approach does not scale well for large data sets due to the memory requirements inherent in the approach.
Another prior art approach (referred to in Orlowski, above) considers points in a real plane instead of discrete elements or entities in a binary matrix. In this method, an assumption is made that points have distinct x and y coordinates, and so the approach does not disclose determining empty rectangles where there are multiple values possible in the data set being considered.
A common application for the characterization of similarity of data values in large data sets is in relational databases. In particular, this data mining approach is useful in implementing the relational join operation for large data sets. Because the calculation of a join over large relational tables is potentially expensive in time and memory, the characterization of data in the relational tables is desirable to achieve efficiencies in the implementation of a join over such data tables.
It is therefore desirable to have a computer system for the determination of maximal homogenous rectangles in a binary matrix which is able to be carried out with efficient use of memory and disk access and which facilitates the efficient implementation of the relational join over large relational data tables.