Large databases often include millions of records or more, with each record having many attributes. Statistical operations may be performed on such databases using sampling techniques that generally involve selecting records at random from the database. The selected records may then be analyzed to generate statistics characterizing the complete set of records in the database. In order to ensure that the resulting statistics accurately characterize the database, stratified sampling techniques may be used. In stratified sampling, the database records are separated into sub-groups or “strata,” and one or more records are then randomly selected from each of the sub-groups for analysis. An example of a conventional stratified sampling technique is described in U.S. Patent Application Publication No. 2002/0198863, entitled “Stratified Sampling of Data in a Database System.”
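The basic stratified sampling procedure described above can be illustrated with a minimal Python sketch. It is not taken from the cited publication; the function name `stratified_sample`, the `key` and `per_stratum` parameters, and the toy `region` attribute are all assumptions introduced here for illustration only.

```python
import random
from collections import defaultdict

def stratified_sample(records, key, per_stratum=1, seed=None):
    """Group records into strata by key(record), then draw up to
    per_stratum records uniformly at random from each stratum.
    (Hypothetical helper, for illustration only.)"""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for rec in records:
        strata[key(rec)].append(rec)
    sample = []
    for members in strata.values():
        # A stratum may hold fewer records than requested.
        k = min(per_stratum, len(members))
        sample.extend(rng.sample(members, k))
    return sample

# Toy data: 30 records with a single three-valued attribute.
records = [{"id": i, "region": "ABC"[i % 3]} for i in range(30)]
picked = stratified_sample(records, key=lambda r: r["region"],
                           per_stratum=2, seed=0)
```

With a single low-cardinality attribute, every stratum is well populated and each one contributes to the sample.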
A problem with conventional stratified sampling techniques is that they typically attempt to separate the records into mutually exclusive sub-groups, and can therefore consider only a limited number of attributes. The number of attributes per record is generally referred to as the “dimensionality” of the database, and the conventional stratified sampling techniques are practical only in low-dimensionality situations. However, many modern databases, such as those used to track connection data in telecommunication applications, have a very high dimensionality.
Consider by way of example a database that stores N records, each with K attributes, where the k-th attribute takes $m_k$ discrete values, $1 \le k \le K$. If K is small, one can simply concatenate the attributes in order to partition the database into mutually exclusive sub-groups. The number of sub-groups in this case is given by $\prod_{k=1}^{K} m_k$. However, as K gets larger, this approach becomes impractical. For example, if $m_k = 5$ for every k and K = 10, then there are nearly $10^7$ sub-groups ($5^{10}$ = 9,765,625), many of which will contain no records or only a small number of records. In this type of high dimensionality context, conventional stratified sampling techniques are unable to provide an appropriate stratified sample for each of the K attributes. The problem is apparent in numerous information processing applications, including large scale database integration and maintenance, data mining, data warehousing, query processing, telecommunication network traffic analysis, opinion polls, etc.
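The growth in the number of sub-groups can be checked with a short Python sketch. The helper names `num_subgroups` and `occupied_subgroups`, the synthetic uniformly random records, and the choice of N = 10,000 are assumptions made here for illustration; they do not come from any particular database.

```python
import math
import random

def num_subgroups(cardinalities):
    """Number of mutually exclusive sub-groups obtained by
    concatenating all attributes: the product of the m_k."""
    return math.prod(cardinalities)

def occupied_subgroups(records):
    """Number of distinct attribute combinations that actually
    occur among the records."""
    return len({tuple(r) for r in records})

K = 10
m = [5] * K                  # m_k = 5 for every attribute
total = num_subgroups(m)     # 5**10 = 9765625

# Synthetic records (assumption): each attribute drawn uniformly.
rng = random.Random(0)
N = 10_000
records = [tuple(rng.randrange(mk) for mk in m) for _ in range(N)]
occupied = occupied_subgroups(records)
empty = total - occupied     # sub-groups containing no record
```

Because at most N sub-groups can be occupied, at least `total - N` of them are empty in this setting, which is the sparsity the example above describes.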