1. Field of the Invention
The invention relates to a linear-time top-k sort method, and more specifically, to an algorithm that retrieves only k elements (i.e., top-k results) having the largest (or smallest) key values from a dataset in a sorted order in a time proportional to the dataset size.
2. Description of the Related Art
Recently, the wide use of the Internet and the convergence of digital technologies have caused a significant increase in the amount of data that needs to be processed by application systems with limited resources.
In particular, in Web and multimedia search systems or distributed systems that deal with a huge amount of data, it is sometimes difficult to process a large number of user queries due to the drastic increase in the volume of query results. Accordingly, efficient query (or search) processing on a vast amount of data has become a very important issue.
Hence, top-k query processing, which returns the k highest-ranked answers (or top-k results) from a dataset according to their importance, is useful for addressing the problems mentioned above.
Accordingly, top-k query processing has been studied actively in a variety of areas such as Web and multimedia search systems and distributed systems, which deal with a vast amount of data. Ilyas et al. summarize the studies in “A Survey of Top-k Query Processing Techniques in Relational Database Systems,” Ilyas, I., Beskales, G., and Soliman, M., ACM Computing Surveys, Vol. 40, No. 4, Article 11, 2008 (‘Related Art 1’).
Conventional methods of top-k query processing can also be found in “Algorithms in C++,” Sedgewick, R., Addison Wesley, 1992 (‘Related Art 2’). Although top-k results can be obtained by using the conventional sort algorithms described in ‘Related Art 2’, such as the insertion, bubble, merge, heap, and quick sort algorithms, those algorithms must sort the entire dataset according to the key values, even when only k elements are needed.
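The conventional approach described above can be illustrated with a minimal sketch. The function name `top_k_by_full_sort` is hypothetical and does not appear in the related art, and Python's built-in `sorted` is used here only as a stand-in for any of the conventional sort algorithms; the point is that all n elements are sorted before the k results are selected.

```python
def top_k_by_full_sort(items, k):
    """Return the k largest items in descending order (illustrative only)."""
    # Sort ALL n elements in descending order, even though only k are needed.
    ordered = sorted(items, reverse=True)
    # Keep the first k elements and discard the remaining n - k.
    return ordered[:k]


print(top_k_by_full_sort([7, 2, 9, 4, 1], 2))
```

The cost of this sketch is dominated by the full sort, so it does not decrease when k is much smaller than n.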
As an example, according to “Fundamentals of Data Structures in C++,” Horowitz, E., Sahni, S., and Mehta, D., W. H. Freeman and Company, 1995 (‘Related Art 3’), the lowest time complexity of existing sort algorithms for n data elements is O(n log n), and thus, top-k query processing using these algorithms must have an O(n log n) time complexity. This means that we cannot obtain top-k results by using those algorithms in linear time, and thus, we can hardly apply them to the latest applications such as Web search and sensor networks, which deal with a huge amount of data.
More specifically, for example, the time complexity of the heap sort algorithm is O(n log n) as described in ‘Related Art 3’. For n data elements, the algorithm sorts all the elements in ascending (or descending) order by the following two steps: 1) insert all the elements into an n-sized min (or max) heap structure, and 2) extract the elements from the heap structure one by one.
Since the heap sort algorithm has an O(n log n) time complexity, top-k query processing using heap sort has an O(n log n) time complexity as well. Hence, we cannot obtain top-k results in linear time, i.e., in a time proportional to the size of the dataset. Accordingly, we can hardly apply the heap sort algorithm to applications such as Web search or distributed systems, which deal with a huge amount of data.
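The two steps of the heap sort algorithm can be sketched as follows. This is an illustrative sketch only, not the implementation given in ‘Related Art 3’: the function name `heap_sort` is hypothetical, and Python's standard `heapq` module (a binary min-heap) is assumed as the heap structure.

```python
import heapq


def heap_sort(items):
    """Sort items in ascending order using an n-sized min-heap (sketch)."""
    heap = []
    # Step 1: insert all n elements into the min-heap.
    # Each insertion costs O(log n), so this step costs O(n log n).
    for item in items:
        heapq.heappush(heap, item)
    # Step 2: extract the smallest remaining element, one by one.
    # Each extraction costs O(log n), so this step also costs O(n log n).
    return [heapq.heappop(heap) for _ in range(len(heap))]


print(heap_sort([5, 1, 4, 2, 3]))
```

Because both steps operate on all n elements, the total work stays at O(n log n) regardless of how few results are ultimately required.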
In order to efficiently serve the latest applications such as Web and multimedia search systems or distributed systems that deal with a vast amount of data, it is important to find top-k results in linear time. Hence, as an efficient top-k query processing method, it is desirable to provide a linear-time top-k sort algorithm that has a time complexity linear in the number of data elements, n, instead of the existing sort algorithms. However, no method satisfying this requirement has been proposed yet.