1. Field
Embodiments of the invention relate to computing frequency distributions for many fields in one pass in parallel.
2. Description of the Related Art
Relational DataBase Management System (RDBMS) software may use a Structured Query Language (SQL) interface. The SQL interface has evolved into a standard language for RDBMS software and has been adopted as such by both the American National Standards Institute (ANSI) and the International Organization for Standardization (ISO).
An RDBMS uses relational techniques for storing and retrieving data in a relational database. Relational databases are computerized information storage and retrieval systems. Relational databases are organized into tables that consist of rows and columns of data. Rows may also be called tuples or records. Columns may be called fields. A database typically has many tables, and each table typically has multiple records and multiple columns.
A common task in data exploration is to compute a “frequency distribution” for each field in a dataset (e.g., each column in a table). The frequency distribution for a given field is a two-column table (also referred to as a frequency distribution table), with each row of the two-column table consisting of a distinct field value in the dataset and a count of the number of occurrences of that field value. The frequency distribution can be used to answer a variety of questions about the field, such as: How many distinct field values are there for the field? Which occurs most frequently? Is there a distinct field value for every record in the dataset, which suggests that the field is a “key” field?
Table A is a frequency distribution table for the following list of colors, which are field values: Blue, Red, Red, Green, Blue, Red, Blue, Green, Red, Red, Red, Blue
TABLE A
Color    Count
Red      6
Green    2
Blue     4
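As an illustration only, the frequency distribution for the list of colors above, and the three questions posed about a field, can be sketched in Python. The variable names (`colors`, `freq`, `distinct_count`, `is_candidate_key`) are assumptions chosen for this example and do not come from the described embodiments.

```python
from collections import Counter

# Field values from the example list of colors.
colors = ["Blue", "Red", "Red", "Green", "Blue", "Red",
          "Blue", "Green", "Red", "Red", "Red", "Blue"]

# Frequency distribution: distinct field value -> number of occurrences.
freq = Counter(colors)

# How many distinct field values are there for the field?
distinct_count = len(freq)

# Which field value occurs most frequently?
most_common_value, most_common_count = freq.most_common(1)[0]

# Is there a distinct field value for every record (a possible "key" field)?
is_candidate_key = len(freq) == len(colors)

for value, count in freq.items():
    print(value, count)
```

For this list the field has three distinct values, so it is not a candidate key field.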
There are many approaches to computing a frequency distribution, and many of these approaches fall into one of two categories: a “table in memory” approach or a “sort and count” approach. With the “table in memory” approach, a frequency distribution table for a field is built in memory with one row for each distinct field value, and the count for each field value is updated directly as that field value is encountered in the dataset. With the “sort and count” approach, all of the field values are sorted, the number of occurrences of each field value is counted, and one row of the result table is created each time a new field value is encountered in the sorted stream. The “sort and count” approach uses extra disk storage to perform the sort and count.
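The two categories of approach can be contrasted with a minimal sketch. The function names `table_in_memory` and `sort_and_count` are hypothetical, and the in-memory `sorted()` call merely stands in for the disk-based sort that a real “sort and count” implementation would use on large inputs.

```python
from itertools import groupby

def table_in_memory(values):
    """Build the frequency table directly, one entry per distinct value."""
    table = {}
    for value in values:
        # Update the count as each field value is encountered.
        table[value] = table.get(value, 0) + 1
    return table

def sort_and_count(values):
    """Sort the values, then emit one row per run of equal values."""
    result = []
    # Assumption: sorted() replaces an external (disk-based) sort here.
    for value, run in groupby(sorted(values)):
        # A new row is created each time a new value starts in the stream.
        result.append((value, sum(1 for _ in run)))
    return result

colors = ["Blue", "Red", "Red", "Green", "Blue", "Red",
          "Blue", "Green", "Red", "Red", "Red", "Blue"]

print(table_in_memory(colors))
print(sort_and_count(colors))
```

Both functions produce the same counts. The first keeps every distinct value in memory at once, while the second needs only the current run of equal values once the input is sorted.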
The “table in memory” approach works well for fields with a relatively small number of distinct field values, in which case the frequency distribution table fits into available memory. The “sort and count” approach works well for fields with a large number of distinct field values, in which case the frequency distribution table exceeds the size of available memory. The number of distinct field values is often not known a priori, making the selection of one of these approaches difficult. The problem is further complicated when attempting to compute a frequency distribution for all of the fields in a record in a single pass, and when attempting to compute the frequency distributions using a parallel processor.
Thus, there is a need in the art for improved computation of frequency distribution.