The invention relates to the field of computer systems. More specifically, it relates to a method, device and computer program for efficiently identifying items having a high frequency of occurrence among items included in a text data stream.
Usually, to identify the frequency of occurrence of each item included in a text data stream of continuously inputted items, the number of occurrences has to be counted and stored for each item. As a result, the required memory capacity is enormous. A well-known algorithm for improving memory efficiency is lossy counting (LC). LC is an approximate calculation method in which the memory is divided into two levels, one for items with a high frequency of occurrence and one for all other items.
In the prior art, LC is used to divide the memory into two levels according to the frequency of occurrence of items included in a data stream, and to reduce memory usage by excluding from the count those items whose frequency of occurrence is below a predetermined value. By providing a memory structure with multiple levels, memory can be used efficiently when items with a high frequency of occurrence are to be identified.
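By way of illustration only, the pruning behavior of LC described above can be sketched as follows. This is a minimal sketch of the commonly described textbook form of lossy counting, not the specific two-level memory arrangement referred to in this section, which it does not model. The names lossy_count and frequent_items and the parameters epsilon (permitted error) and support (reporting threshold) are assumptions introduced for the example.

```python
import math


def lossy_count(stream, epsilon):
    """Approximate per-item counts using textbook lossy counting.

    Returns (entries, n), where entries maps each retained item to
    [count, max_undercount] and n is the number of items processed.
    All names here are illustrative assumptions.
    """
    width = math.ceil(1 / epsilon)  # number of items per bucket
    entries = {}
    n = 0
    for item in stream:
        n += 1
        bucket = math.ceil(n / width)  # identifier of the current bucket
        if item in entries:
            entries[item][0] += 1
        else:
            # A new item may have been seen and pruned in earlier buckets,
            # so its true count may exceed 1 by at most bucket - 1.
            entries[item] = [1, bucket - 1]
        if n % width == 0:
            # At each bucket boundary, exclude low-frequency items from the count.
            entries = {
                key: value
                for key, value in entries.items()
                if value[0] + value[1] > bucket
            }
    return entries, n


def frequent_items(entries, n, support, epsilon):
    """Report items whose stored count is at least (support - epsilon) * n."""
    threshold = (support - epsilon) * n
    return {key: value[0] for key, value in entries.items() if value[0] >= threshold}
```

In such a sketch, an item that appears only rarely is dropped at the next bucket boundary, so the number of stored entries stays far below the number of distinct items in the stream, at the cost of undercounting any item by at most epsilon times the number of items processed.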
However, when the LC technique is used, memory usage increases logarithmically relative to the data length of items. Therefore, when both the amount of data in a data stream and the number of items with a high frequency of occurrence are enormous, items with a high frequency of occurrence cannot be identified accurately using the LC technique due to, for example, insufficient memory capacity. Also, because the LC technique can only divide memory into two levels, it cannot fully exploit multi-level cache memory whose levels have different memory capacities and access times. Therefore, when multi-level cache memory is used, as is common in current computer systems, the LC technique does not calculate the frequency of occurrence efficiently because the performance of the multi-level cache memory cannot be fully exploited.
Therefore, improvements to the prior art are still desired to solve one or more of the above-mentioned problems.