Many applications need to rank the items in a data set. For example, the data set may comprise categories of sensor readings taken from a mechanical apparatus that is to be controlled. Often what is needed is a “hot list” of data items, for example, the five most frequently occurring items. This information may then be fed back to the control system that controls the mechanical apparatus. Finding such a “hot list” exactly requires examining the entire data set, which is often infeasible in practice where the data set is very large (e.g. petabytes) or is a continuous data stream.
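By way of illustration only, the exact approach referred to above can be sketched as follows. The function name exact_hot_list and the sample readings are hypothetical and are not taken from any described embodiment. The sketch shows why every item must be examined: each item can change the counts, so no item can be skipped without risking a wrong ranking.

```python
from collections import Counter
from typing import Hashable, Iterable


def exact_hot_list(items: Iterable[Hashable], k: int = 5) -> list[tuple[Hashable, int]]:
    """Return the k most frequent items with their exact counts.

    Hypothetical helper for illustration. It makes one full pass over
    the data and keeps one counter per distinct item, so memory grows
    with the number of distinct items.
    """
    counts: Counter = Counter()
    for item in items:  # every item is visited exactly once
        counts[item] += 1
    return counts.most_common(k)


# Illustrative categories of sensor readings (made-up data).
readings = ["normal", "high", "normal", "low", "normal", "high"]
print(exact_hot_list(readings, k=2))  # [('normal', 3), ('high', 2)]
```

For a data set of petabytes, or for a stream that never ends, neither the full pass nor the per-item counters may be affordable, which is the difficulty described above.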
Data held in data centers or databases often reaches an enormous scale, and it is important to be able to query such large scale data sets efficiently with respect to both space and time. Time-efficient computation is crucial both for resolving queries quickly and for saving energy.
Where the items in the data set have values associated with them, it may be desired to find the distribution of values across all the items. For example, the values may be sensor readings taken from a manufacturing plant that is to be controlled. If a control system controlling the plant needs to carry out fine scale processing for a particular range of sensor reading values and coarse scale processing for the remaining sensor readings, then it is difficult to partition the data set quickly, accurately and efficiently. Partitioning the data set exactly requires examining the entire data set, which is not practical for large scale data sets and/or where the data is a continuous data stream.
Some previous approaches to ranking items in a data set have been based on randomized hashing schemes. However, schemes of this type require prior knowledge of all the distinct items in the data set, because this knowledge is used to construct the hash functions. For large scale data sets this knowledge is typically unavailable or impractical to obtain.
Other approaches have used random sampling techniques, and there is a desire to improve on such techniques.
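As a purely illustrative sketch of what a sampling-based approach might look like, the following uses reservoir sampling to keep a fixed-size uniform sample of a stream and ranks items within that sample. The text above does not identify which sampling techniques the earlier approaches used, so the choice of reservoir sampling, the function names and the parameter values are all assumptions made for this example.

```python
import random
from collections import Counter
from typing import Hashable, Iterable, Optional


def reservoir_sample(stream: Iterable[Hashable], size: int,
                     rng: Optional[random.Random] = None) -> list[Hashable]:
    """Keep a uniform random sample of `size` items from a stream.

    Assumed technique (reservoir sampling); memory use is bounded by
    `size` regardless of how long the stream is.
    """
    if rng is None:
        rng = random.Random()
    reservoir: list[Hashable] = []
    for seen, item in enumerate(stream, start=1):
        if len(reservoir) < size:
            reservoir.append(item)       # fill the reservoir first
        else:
            j = rng.randrange(seen)      # uniform integer in [0, seen)
            if j < size:                 # happens with probability size/seen
                reservoir[j] = item
    return reservoir


def estimated_hot_list(stream: Iterable[Hashable], k: int = 5,
                       sample_size: int = 1000,
                       rng: Optional[random.Random] = None) -> list[Hashable]:
    """Estimate the k most frequent items by ranking a random sample."""
    sample = reservoir_sample(stream, sample_size, rng)
    return [item for item, _ in Counter(sample).most_common(k)]
```

A sketch of this kind trades accuracy for bounded memory: the ranking is only an estimate, and items with similar frequencies may be ordered incorrectly when the sample is small.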
The embodiments described herein are not limited to implementations which solve any or all of the disadvantages of known ranking systems.