A multiset is a collection of data that, unlike an ordinary set, allows for repeated elements. For example, many of the records within a database may be duplicates of one another. Thus, while a database may include a large number of elements, only a fraction of those elements may be distinct.
As used here, the cardinality of a multiset is the number of distinct elements within the multiset. HyperLogLog (HLL) is an algorithm that estimates the cardinality of a multiset. Calculating the exact cardinality may take a significant amount of time and may require a large amount of memory, particularly for large multisets, because every distinct element seen so far must be remembered. Probabilistic cardinality estimators, such as an HLL estimator, are significantly faster and require much less memory, at the cost of obtaining only an approximation of the cardinality. The approximation, however, is generally accurate: the standard error of an HLL estimate is roughly 1.04/sqrt(m), where m is the number of subsets into which the estimator splits the data, or about 2.3% for m = 2,048.
HLL estimators work well with multisets that contain very large numbers of values. For example, an HLL estimator may be used to estimate the number of distinct searches that end users perform on an Internet search engine within a day. Loading all of the searches into memory to count them exactly would be impractical because of the amount of memory and time required. An HLL estimator instead hashes each element to a uniformly distributed pseudo-random number and retains only a small, fixed-size summary of those hash values, from which the cardinality of the supplied data is estimated.
The basis of an HLL estimator is the observation that the cardinality of a multiset of uniformly distributed random numbers can be estimated from the maximum number of leading zeros found in the binary representations of the numbers in the set. A given number begins with at least n zeros with probability 2^-n, so observing such a number suggests that roughly 2^n distinct numbers have been seen. If the maximum number of leading zeros observed is n, an estimate for the number of distinct elements in the set is therefore 2^n.
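The leading-zeros observation can be illustrated with a minimal Python sketch. It is an illustration only, not the estimator of any particular HLL implementation: the helper names (hash64, leading_zeros, estimate_cardinality), the choice of SHA-256 truncated to 64 bits as the hash, and the 64-bit width are all assumptions made for the example.

```python
import hashlib

def hash64(item: str) -> int:
    # Assumption: SHA-256 truncated to 64 bits stands in for a uniform hash.
    digest = hashlib.sha256(item.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def leading_zeros(value: int, bits: int = 64) -> int:
    # Number of zero bits before the first 1 bit in a fixed-width value.
    return bits - value.bit_length()

def estimate_cardinality(items) -> int:
    # Track the largest run of leading zeros seen in any hash value.
    max_zeros = 0
    seen_any = False
    for item in items:
        seen_any = True
        max_zeros = max(max_zeros, leading_zeros(hash64(item)))
    # An empty input has no distinct elements; otherwise estimate 2^n.
    return 2 ** max_zeros if seen_any else 0
```

Because duplicates hash to the same value, repeating elements never changes the result. A single maximum is a very noisy statistic, however, and the estimate can only ever be a power of two.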
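A single maximum is a noisy statistic, so to improve overall accuracy the multiset can be split into numerous subsets, typically by using the first few bits of each hash value to select a subset. An estimate of the cardinality for each subset may be determined, and the cardinality of the whole multiset may be estimated from the harmonic mean of all of the estimates, scaled by the number of subsets. The harmonic mean is used because it is less sensitive than the arithmetic mean to the occasional subset with an unusually large estimate.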
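The subset-and-harmonic-mean idea can be sketched as follows. This is a simplified illustration under stated assumptions, not a complete HLL implementation: the function names and the parameter p are hypothetical, SHA-256 truncated to 64 bits is an assumed hash, each subset stores the leading-zero count plus one (the convention used in the published algorithm), the bias-correction constant alpha is the one commonly quoted for 128 or more subsets, and the small-range and large-range corrections of the full algorithm are omitted.

```python
import hashlib

def hash64(item: str) -> int:
    # Assumption: SHA-256 truncated to 64 bits stands in for a uniform hash.
    digest = hashlib.sha256(item.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def hll_estimate(items, p: int = 10) -> float:
    if not 7 <= p <= 16:
        raise ValueError("p must be between 7 and 16 for this sketch")
    m = 1 << p                      # number of subsets (registers)
    registers = [0] * m
    rest_bits = 64 - p
    for item in items:
        h = hash64(item)
        index = h >> rest_bits                 # first p bits select the subset
        rest = h & ((1 << rest_bits) - 1)      # remaining bits are examined
        # Leading zeros in the remaining bits, plus one.
        rank = rest_bits - rest.bit_length() + 1
        registers[index] = max(registers[index], rank)
    # Harmonic mean of the per-subset estimates 2^register.
    harmonic_mean = m / sum(2.0 ** -r for r in registers)
    alpha = 0.7213 / (1 + 1.079 / m)           # bias correction for m >= 128
    # Each subset sees about 1/m of the elements, so scale by m.
    return alpha * m * harmonic_mean
```

With p = 10 the sketch keeps 1,024 small counters regardless of input size, and for inputs much larger than the number of subsets its estimates should typically fall within a few percent of the true distinct count. For inputs that are small relative to m, the omitted small-range correction matters and this sketch will overestimate.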