Data analysis is a process of inspecting, cleaning, transforming, and modeling data with the goal of discovering useful information, suggesting conclusions, and supporting decision making. During data analysis, it is often necessary to estimate the number of unique values in a set of values. For example, a server of a network application often needs to estimate the number of user accounts (userId) of the network application. In such an estimate, each user account is treated as one unique value.
A conventional method for estimating the number of unique values uses a program written in a high-level language, such as Java, C++, or Python, to count the unique values. For example, a set object provided by the high-level language can be used to store the user accounts of a network application, and the number of unique values (user accounts) can then be obtained by getting the size of the set object. One example of such a program for counting the number of user accounts may include the following code (hereafter called the HashSet method):
HashSet<String> set = new HashSet<String>();
while (...) {
    String userId = xxx;
    set.add(userId);
}
return set.size();
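For illustration only, the HashSet method described above may be written as a complete program along the following lines. This is a minimal sketch rather than a definitive implementation: the class name UniqueUserCounter, the method name countUnique, and the sample user accounts are hypothetical and do not appear in the original code.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class UniqueUserCounter {

    // Counts distinct user accounts by keeping every one of them in an in-memory set.
    // (Hypothetical helper; the name and signature are assumptions for this sketch.)
    static int countUnique(Iterable<String> userIds) {
        Set<String> seen = new HashSet<>();
        for (String userId : userIds) {
            seen.add(userId); // adding an account that is already present has no effect
        }
        return seen.size();   // number of unique values
    }

    public static void main(String[] args) {
        // Hypothetical sample data standing in for the user accounts read by the loop.
        List<String> accounts = Arrays.asList("alice", "bob", "alice", "carol", "bob");
        System.out.println(countUnique(accounts)); // prints 3
    }
}
```

In this sketch the set holds one entry per distinct user account, so its size grows with the number of unique values rather than with the number of records processed.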
The conventional method for estimating the number of unique values is simple, because it requires little code and is easy to understand. However, in massive data analysis, for example, in many SNS (social network service) applications, the number of user accounts is very large. Because the set object in the above code stores every unique user account in memory, its memory consumption grows with the number of unique values. Therefore, while the program is running, the set object may consume too much memory and cause a memory overflow problem.
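The scale of the memory problem can be sketched with a back-of-the-envelope calculation. The following is only a rough illustration under stated assumptions: the per-entry cost of 100 bytes and the count of one billion user accounts are hypothetical figures chosen for the example, not measurements, and the class and method names are likewise assumptions.

```java
public class HashSetMemoryEstimate {

    // Rough estimate of the memory held by a set that stores every unique value:
    // one entry per distinct user account, each with an assumed fixed cost.
    static long estimateBytes(long distinctIds, long assumedBytesPerEntry) {
        return Math.multiplyExact(distinctIds, assumedBytesPerEntry);
    }

    public static void main(String[] args) {
        long distinctIds = 1_000_000_000L;  // hypothetical number of user accounts
        long assumedBytesPerEntry = 100L;   // assumption, not a measured value

        long bytes = estimateBytes(distinctIds, assumedBytesPerEntry);
        long gib = bytes / (1024L * 1024L * 1024L);
        System.out.println(gib + " GiB (approx.)"); // prints 93 GiB (approx.)
    }
}
```

Under these assumed figures the set alone would need on the order of tens of gibibytes, which may exceed the memory available to a single program.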