It is common to use "fingerprints" to identify data records such as bit strings and character strings. A fingerprint is generated when, for example, a polynomial or hash function is applied to a data record to produce a relatively small bit string which is strongly dependent on the content of the record.
With a good fingerprinting scheme, data records having different content will most likely produce different fingerprints. As an advantage, fingerprints provide a way of identifying data records without any sort of central management; that is, the identification arises purely from the content of the records themselves. Simple fingerprinting schemes rest on the probabilistic assumption that the input data records have some level of randomness.
More sophisticated schemes, such as Rabin fingerprints and strong universal hashing, do not assume anything about the input. In general, an arbitrary set of bit string records is first chosen for fingerprinting, and second, a function is randomly selected from some family of fingerprinting functions. Then, the selected function is applied to the set of target bit strings. See M. Rabin, "Probabilistic Algorithms in Finite Fields," SIAM Journal on Computing, Vol. 9, No. 2, pp. 273-280, 1980, and Carter et al., "Universal Classes of Hash Functions," JCSS 18, pp. 143-154, 1979. In practice, the assumption that the function is selected independently of an already fixed set of records is violated to some extent, because the function is usually chosen first, at a time when the set of bit string records is still unknown.
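The random selection of a function from a family can be illustrated with a minimal sketch. It uses a Karp-Rabin-style polynomial fingerprint over the integers modulo a prime, which is a stand-in chosen for brevity and not Rabin's finite-field construction or the Carter et al. family cited above. The names `pick_fingerprint_function` and `MERSENNE_61`, and the choice of modulus, are assumptions made for illustration.

```python
import secrets

# Assumed modulus for this sketch: the Mersenne prime 2^61 - 1.
MERSENNE_61 = (1 << 61) - 1


def pick_fingerprint_function(p=MERSENNE_61):
    """Randomly select one function f_r from the family {f_r : 1 <= r < p}."""
    r = secrets.randbelow(p - 1) + 1  # random evaluation point in [1, p - 1]

    def fingerprint(record: bytes) -> int:
        # Treat the record as the coefficients of a polynomial and evaluate
        # it at r modulo p (Horner's rule). Starting from 1 makes the
        # polynomial monic, so records of different lengths map to
        # different polynomials.
        h = 1
        for byte in record:
            h = (h * r + byte) % p
        return h

    return fingerprint


# The function is selected once, then applied to every record.
f = pick_fingerprint_function()
fp_a = f(b"first record")
fp_b = f(b"second record")
```

Two distinct records collide only if the randomly chosen point `r` happens to be a root of the difference of their polynomials, so the bound on the collision probability comes from the random choice of `r` and not from any randomness in the records.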
Fingerprints can be used in a variety of applications; see A. Broder, "Some applications of Rabin's fingerprinting method," Sequences II, Methods of Communications, Security, and Computer Science, pp. 143-152, Springer-Verlag, 1993. For example, fingerprints can be used to identify World Wide Web (Web) pages for Web search engines. For instance, the AltaVista search engine from Digital Equipment Corporation uses fingerprints to identify the millions of Web pages for which it maintains a comprehensive full-word index. Thus, when a page is located at a "new" Uniform Resource Locator (URL), a determination can be made, by comparing fingerprints of the content, whether or not the page has been previously indexed. Because a large proportion of Web pages are duplicates, this check can save a considerable amount of storage space.
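The duplicate check described above might be sketched as follows. This is an illustration under stated assumptions, not the AltaVista implementation: the class `PageIndex`, its `add` method, and the use of an eight-byte BLAKE2b digest as the fingerprinting function are all hypothetical choices.

```python
import hashlib


def fingerprint(content: bytes) -> bytes:
    # Assumption: an eight-byte BLAKE2b digest stands in for whatever
    # fingerprinting function a real search engine would use.
    return hashlib.blake2b(content, digest_size=8).digest()


class PageIndex:
    """Hypothetical index that remembers one URL per distinct page content."""

    def __init__(self):
        self._seen = {}  # fingerprint -> URL where the content was first seen

    def add(self, url: str, content: bytes):
        """Return the URL of a previously indexed copy, or None if new."""
        fp = fingerprint(content)
        if fp in self._seen:
            return self._seen[fp]  # content already indexed elsewhere
        self._seen[fp] = url
        return None


index = PageIndex()
index.add("http://a.example/", b"<html>same page</html>")
duplicate_of = index.add("http://b.example/", b"<html>same page</html>")
# duplicate_of is "http://a.example/", so the second copy need not be stored.
```

Only the fingerprints are compared, so the index never needs to retain or rescan the page content to detect a duplicate.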
However, there is a small probability that different data records produce identical fingerprints. This is called a collision. Increasing the number of bits used for a fingerprint decreases the probability that a collision will occur, but it also increases the time required to generate the fingerprint and the amount of memory required to store it. It is always possible to compare the records themselves directly, but for large records this would also be computationally expensive.
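The trade-off between fingerprint length and collision probability can be made concrete with the standard birthday approximation. The sketch below assumes an idealized scheme in which every fingerprint is uniformly and independently distributed over all values of the given bit length. That assumption, the function name `collision_probability`, and the record count of one hundred million are illustrative and not taken from the text.

```python
import math


def collision_probability(n_records: int, bits: int) -> float:
    """Approximate chance that at least two of n_records fingerprints collide.

    Birthday approximation: 1 - exp(-n(n-1) / 2^(bits+1)), valid under the
    idealized assumption of uniform, independent fingerprints.
    """
    exponent = -n_records * (n_records - 1) / 2 ** (bits + 1)
    return -math.expm1(exponent)  # expm1 keeps precision for tiny values


n = 100_000_000  # hypothetical number of records
for bits in (32, 64, 128):
    print(bits, collision_probability(n, bits))
```

Under these assumptions a 32-bit fingerprint is almost certain to collide somewhere in a collection of this size, while a 64-bit fingerprint brings the estimate down to a few parts in ten thousand.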
In order to deal with the possibility of collisions, two fingerprints can be maintained. If two data records have identical first fingerprints, then a comparison can be made on second fingerprints generated for the records using a different fingerprinting function. However, adding a second fingerprint substantially increases storage requirements. For example, for the AltaVista search engine, a second eight-byte fingerprint would require an additional 800 MB of memory, increasing the cost of the system considerably.
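A two-fingerprint comparison and its storage cost can be sketched as below. The functions `fp1`, `fp2`, `probably_same`, and `extra_storage_bytes` are hypothetical, and the two BLAKE2b variants are stand-ins for two different fingerprinting functions. The figure of one hundred million records is an inference from 800 MB divided by eight bytes, taking 1 MB as 10^6 bytes, and is not stated in the text.

```python
import hashlib


def fp1(record: bytes) -> bytes:
    # First eight-byte fingerprint (assumed stand-in function).
    return hashlib.blake2b(record, digest_size=8, person=b"first").digest()


def fp2(record: bytes) -> bytes:
    # Second eight-byte fingerprint from a differently parameterized function.
    return hashlib.blake2b(record, digest_size=8, person=b"second").digest()


def probably_same(a: bytes, b: bytes) -> bool:
    """Consult the second fingerprint only when the first ones match."""
    if fp1(a) != fp1(b):
        return False  # different first fingerprints: records differ
    return fp2(a) == fp2(b)


def extra_storage_bytes(n_records: int, fingerprint_bytes: int = 8) -> int:
    """Memory needed to keep one additional fingerprint per record."""
    return n_records * fingerprint_bytes


print(extra_storage_bytes(100_000_000))  # 800000000 bytes, i.e. 800 MB
```

In a real system the fingerprints would be stored alongside the records and compared directly. They are recomputed here only to keep the sketch self-contained.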
In order to better evaluate fingerprinting techniques, it is desirable to estimate the probability of fingerprint collisions.