The present invention relates generally to data processing and, more particularly, to a method and apparatus for processing top-k queries from samples.
A network service provider may receive a request to process a top-k query. For example, a query to perform an aggregation operation over the values of an attribute of network data may be received, such as a query to determine the top 100 packet source autonomous systems, the top 100 ports, the top domain names, and so on. If all records for the network can be processed, the top-k items may be obtained by counting the frequency of each item. However, the full dataset is not observable for any network application of reasonable size. In addition, even if the network is small and the full dataset is available, exhaustive counting and analysis would be costly in network resources. Thus, top-k queries are processed from samples. However, the top frequencies in a sample are biased estimates of the actual top-k frequencies, and the bias depends on the distribution of the data.
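The bias described above can be illustrated with a minimal sketch. It is not the method of the invention. It assumes a hypothetical dataset of 100 distinct items that all occur equally often, and the function names `top_k_frequencies` and `sampled_top_k` are introduced here only for illustration. Because every item has a true frequency of 0.01, the largest frequency in any sample can never fall below 0.01 and in practice lies above it, so the sampled top frequency overestimates the true one.

```python
import random
from collections import Counter


def top_k_frequencies(items, k):
    """Return the k most frequent items with their relative frequencies."""
    counts = Counter(items)
    total = len(items)
    return [(item, count / total) for item, count in counts.most_common(k)]


def sampled_top_k(items, k, sample_size, rng):
    """Estimate top-k frequencies from a uniform sample drawn without replacement."""
    sample = rng.sample(items, sample_size)
    return top_k_frequencies(sample, k)


# Hypothetical dataset (an assumption for illustration): 100 distinct items,
# each occurring 1,000 times, so every true frequency is exactly 0.01.
data = [item for item in range(100) for _ in range(1000)]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
exact = top_k_frequencies(data, 3)
estimate = sampled_top_k(data, 3, 1000, rng)

print("exact top-3 frequencies:  ", [freq for _, freq in exact])
print("sampled top-3 frequencies:", [freq for _, freq in estimate])
```

With a heavily skewed distribution the same comparison would show a much smaller gap, which is the sense in which the bias depends on the distribution.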