Field
This disclosure relates to analyzing collections of data items.
Background
Analyzing large numbers of data items can consume excessive memory and processing power. Although a table can be used to determine how many times each unique value appears in a collection of data, the memory required to store the table, and the processing delays associated with accessing and updating it, may be excessive for a very large number of data items.
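By way of illustration only, the exact-count table described above can be sketched as follows. The function name `count_all` and the sample input are hypothetical and do not come from this disclosure. The sketch shows why memory grows with the number of unique values: the table holds one entry per unique value.

```python
from collections import Counter


def count_all(items):
    """Return an exact count for every unique value in items."""
    table = Counter()
    for item in items:
        table[item] += 1  # one table entry per unique value seen so far
    return table


counts = count_all(["a", "b", "a", "c", "a", "b"])
print(counts.most_common(2))  # the two most frequent values with their counts
print(len(counts))            # number of table entries equals number of unique values
```

For a collection with millions of unique values, a table of this kind would hold millions of entries, which is the cost the remainder of this section is concerned with.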
A table of fixed size (or of variable size not exceeding a defined maximum) may be used to determine, from a data collection, the unique values with the highest frequencies of occurrence. Such a table can maintain counts of the most frequent unique values while ignoring the less frequent ones. However, analyzing large collections of data items to determine the most frequently occurring data items (e.g., unique values) using a limited amount of memory can be a challenge. For example, a determination must be made as to which of an unknown number of unique values are to be maintained in the table.
Some conventional techniques collect a predetermined number of value-count pairs and, to keep the table below a predetermined size, periodically sort the collected pairs and discard those that fall below a threshold. Discarding pairs in this way may cause some item values that should have been considered among the most frequent to be lost, depending on where in the input stream they occurred. Such conventional techniques, although they use less memory and processing power, may therefore not be sufficiently accurate. Other conventional approaches use priority queues to keep track of an approximation of the current most frequent elements, but in many situations fail to provide a realistic lower bound and upper bound. Moreover, these conventional techniques may not yield sufficiently reliable results when extended to distributed environments.
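The following is a minimal sketch of one possible reading of the sort-and-discard technique described above, not an implementation taken from any particular conventional system. The function name `prune_counting`, its parameters `max_entries` and `keep`, and the sample stream are all assumptions made for illustration. The sketch shows how a value that is spread thinly across the input can be discarded repeatedly and end up undercounted.

```python
def prune_counting(items, max_entries, keep):
    """Count items in a bounded table (assumes keep < max_entries).

    When a new value arrives and the table is full, the value-count
    pairs are sorted by count and only the top `keep` pairs survive.
    """
    table = {}
    for item in items:
        if item not in table and len(table) >= max_entries:
            # Table is full: sort pairs by count, discard the low ones.
            ranked = sorted(table.items(), key=lambda kv: kv[1], reverse=True)
            table = dict(ranked[:keep])
        table[item] = table.get(item, 0) + 1
    return table


# "x" is the most frequent value overall, but its occurrences are spread out.
stream = ["x", "a", "a", "b", "b", "c", "c", "x",
          "d", "d", "x", "e", "e", "x"]
result = prune_counting(stream, max_entries=3, keep=2)

print("reported count for x:", result.get("x", 0))
print("true count for x:", stream.count("x"))
```

In this hypothetical stream, "x" occurs four times, more than any other value, yet each time the table is pruned its accumulated count is below the counts of values that arrived in bursts, so it is discarded and later restarts from one. The reported count for "x" is therefore 1 rather than 4, which illustrates the position-dependent loss of accuracy noted above.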