The approaches described in this section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section.
A data item may be a set of associated data values that quantify various characteristics. Data items quantify characteristics of a wide variety of items, such as events, entities, places, transactions, or concepts. A data item may be stored as one or more blocks, segments, or other units of data within one or more computer-readable media. The unit(s) of data in which a data item is stored are logically mapped by one or more computing devices to a logical representation of the data item. A logical representation may be, for instance, one or more data records, table rows, log entries, text files, documents, object instances, images, videos, audio recordings, and so forth. A data item may comprise one or more values that correspond to defined characteristics, such as, for example, the columns or fields of a database table or the properties of a data file. A data item may also or instead have one or more values that yield derived characteristics. Such derived characteristics may be determined by analysis of the value(s) of a data item, and may include, for example, a metric calculated by application of a function or a classification determined by pattern recognition.
Similarly structured data items are often grouped together in collections herein referred to as data sets. For example, the rows in a table may represent distinct data items that have been grouped together because they can be described by similarly labeled columns or fields. Mechanisms for defining and organizing collections include, without limitation, tables, arrays, folders, directories, indexes, and so forth.
A useful task in data mining or data analysis is to identify values that occur above a threshold frequency for certain characteristic(s) within the items of a data set. This task is sometimes referred to as an “iceberg query” or “hot list analysis.” For example, one may wish to identify the names of lending banks that appear more than five percent of the time in a set of items that represent distinct subprime loans. As another example, one may wish to identify pairs of source Internet Protocol addresses and destination Internet Protocol addresses that appear in more than 0.2 percent of logged packets of network traffic.
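For illustration only, the task can be sketched in Python as follows. The function name `iceberg_query`, the `key` parameter, and the sample loan records are hypothetical and are not drawn from any particular system; the sketch keeps one exact count per distinct value and reports the values whose relative frequency exceeds the threshold.

```python
from collections import Counter

def iceberg_query(items, key, threshold):
    """Return the values of key(item) that occur in more than
    `threshold` (a fraction between 0 and 1) of the items."""
    items = list(items)
    # One counter per distinct value of the targeted characteristic.
    counts = Counter(key(item) for item in items)
    total = len(items)
    return {value for value, count in counts.items()
            if count > threshold * total}

# Hypothetical data set: 40 loan records, each naming a lending bank.
loans = ([{"bank": "First"}] * 30
         + [{"bank": "Second"}] * 9
         + [{"bank": "Third"}] * 1)

# Banks named in more than five percent of the loan records.
frequent_banks = iceberg_query(loans, key=lambda loan: loan["bank"],
                               threshold=0.05)
```

In this sample, "Third" appears in 1 of 40 records (2.5 percent) and so falls below the five percent threshold, while "First" and "Second" exceed it.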
Some approaches to iceberg queries or hot list analysis involve the allocation of memory buckets to count each occurrence of each distinct value (or combination of values) for the targeted characteristic(s). However, such approaches can be memory-intensive, especially when the targeted characteristic(s) may have a large number of possible distinct values. A less memory-intensive approach is described by Karp, et al., in “A Simple Algorithm for Finding Frequent Elements in Streams and Bags,” ACM Transactions on Database Systems, Volume 28, Issue 1, March 2003, the contents of which are hereby incorporated by reference for all purposes as if set forth in their entirety. However, this approach, hereinafter referred to as the Karp approach, assumes that the entire data set is analyzed serially.
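By way of a non-limiting sketch, the general idea of a bounded-counter, single-pass technique of this kind can be expressed in Python as shown below. This is a simplified illustration and should not be taken as the implementation given in the cited paper; the function name `frequent_candidates` and the parameter `theta` (the threshold frequency, expressed as a fraction) are assumptions made for the example. The sketch never retains more than roughly 1/theta counters after each item is processed, and it returns a superset of the values whose frequency exceeds theta, so a second pass over the data would be needed to confirm exact counts.

```python
def frequent_candidates(stream, theta):
    """Scan `stream` once, serially, and return candidate values that
    may occur in more than a fraction `theta` of the items.
    Every truly frequent value is guaranteed to be among the candidates."""
    limit = 1 / theta
    counters = {}
    for value in stream:
        counters[value] = counters.get(value, 0) + 1
        if len(counters) > limit:
            # Too many counters: decrement all of them and drop any
            # that reach zero, which bounds the memory in use.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return set(counters)

# "a" makes up 6 of the 10 items, which is more than 50 percent.
candidates = frequent_candidates("abacaaabca", theta=0.5)
```

Because each pass through the decrement step discards more than 1/theta occurrences in total, fewer than theta times N such steps can occur over N items, which is why a value occurring more than theta times N times cannot be eliminated. The loop also makes the serial assumption visible: the counters are updated one item at a time, in order, over the whole data set.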