A hash function is a well-defined procedure that converts a large, possibly variable-sized, amount of data into a small datum that may serve as an index into an array. Hash functions are used to speed up table lookup or data comparison tasks, such as finding items in a database, detecting duplicate or similar records in a large file, and so on.
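As a rough illustration of the table-lookup use described above, the following minimal sketch maps variable-length strings to indices of a small array. The names `bucket_index`, `table`, and `lookup`, the multiplier 31, and the table size of 8 are all assumptions chosen for illustration, not details taken from this document.

```python
def bucket_index(key: str, table_size: int) -> int:
    """Map a variable-length string to an index in range(table_size)."""
    h = 0
    for byte in key.encode("utf-8"):
        # Simple polynomial hash; the multiplier 31 is an arbitrary illustrative choice.
        h = (h * 31 + byte) % table_size
    return h


table_size = 8  # hypothetical size, chosen only for the example
table = [[] for _ in range(table_size)]  # each slot holds a list of (key, value) pairs

for key, value in [("apple", 1), ("banana", 2), ("cherry", 3)]:
    table[bucket_index(key, table_size)].append((key, value))


def lookup(key: str):
    """Return the value stored for key, or None if the key is absent."""
    # Only the one bucket selected by the hash is scanned, not the whole table.
    for stored_key, stored_value in table[bucket_index(key, table_size)]:
        if stored_key == key:
            return stored_value
    return None
```

A call such as `lookup("banana")` inspects a single bucket instead of comparing against every stored item, which is where the speedup comes from.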
One problem encountered with hash functions is collisions. A collision occurs when two different inputs to a hash function produce the same hash value. With many common hash functions, two similar (but not identical) inputs have a higher probability of colliding than two unrelated inputs. For example, the input strings “cat” and “hat”, which differ in only one character, may be more likely to collide than two arbitrary strings. In other words, for many common hash functions, hash values are not uniformly distributed over the space of possible results.
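A small example can make the collision problem concrete. The sketch below uses a deliberately weak, hypothetical hash (`additive_hash`, the sum of the byte values modulo the table size), not one of the common hash functions the text refers to, and the table sizes 5 and 64 are picked only so that the effect is visible.

```python
def additive_hash(key: str, table_size: int) -> int:
    """Deliberately weak hash: sum of the UTF-8 byte values, modulo the table size."""
    return sum(key.encode("utf-8")) % table_size


# "cat" sums to 312 and "hat" to 317; they differ by 5, so they collide modulo 5.
print(additive_hash("cat", 5), additive_hash("hat", 5))    # prints "2 2"

# Anagrams have identical byte sums, so they collide for every table size.
print(additive_hash("cat", 64), additive_hash("act", 64))  # prints "56 56"
```

Because this hash ignores character order and changes only slightly when one character changes, similar strings land in the same or nearby slots far more often than a uniform distribution would allow.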
Such an uneven distribution of hash values is problematic because similar strings frequently occur together in the data supplied to a hash function in practice. Every collision adds work to a lookup, so reducing the probability of collisions results in a more efficient, better-performing hash table. A hash function for which the probability of collision is the same for any two input strings, regardless of their content, would therefore be beneficial.
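One way to picture the desired property is a hash whose parameters are drawn at random, so that the chance of two given strings colliding depends on the random draw and not on how alike the strings are. The sketch below is only an illustration under that assumption and is not the method of this document; the names `make_keyed_hash` and `collision_rate`, the modulus `P`, and the seed are all hypothetical.

```python
import random

P = (1 << 61) - 1  # a Mersenne prime, used here as an illustrative modulus


def make_keyed_hash(table_size: int, rng: random.Random):
    """Return a string hash whose polynomial base is drawn at random from rng."""
    base = rng.randrange(1, P)

    def keyed_hash(key: str) -> int:
        acc = 0
        for byte in key.encode("utf-8"):
            # Adding 1 keeps zero bytes from vanishing in the polynomial.
            acc = (acc * base + byte + 1) % P
        return acc % table_size

    return keyed_hash


def collision_rate(a: str, b: str, table_size: int, trials: int, seed: int = 0) -> float:
    """Estimate how often a and b collide across independently drawn hash functions."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        h = make_keyed_hash(table_size, rng)
        if h(a) == h(b):
            hits += 1
    return hits / trials


similar = collision_rate("cat", "hat", table_size=16, trials=20000)
unrelated = collision_rate("cat", "zebra", table_size=16, trials=20000)
```

Under these assumptions, both estimates should come out near 1/16, so the similar pair is not expected to collide noticeably more often than the unrelated pair.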