A hash table is a data structure in computer science that implements an associative array, mapping keys that identify particular buckets to the values stored in those buckets. A hash function determines the key for any particular value, and can be any operation that generates a suitable key for a particular hash table's usage. Hash functions are generally chosen so that keys are distributed evenly over the expected set of values. A collision occurs when two values map to the same key and thus to the same hash bucket. A perfect hash function is one that selects a unique key for every value in the expected set and therefore causes no collisions. Whether collisions are a problem depends on how the hash table is used, but where collisions matter, a perfect hash function avoids the extra logic needed to handle them.
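The bucket and collision behavior described above can be illustrated with a minimal sketch. The class name `ChainedHashTable`, the byte-sum hash function, and the use of per-bucket lists (separate chaining) to handle collisions are all assumptions made for illustration; real hash tables use far better hash functions and other collision strategies are possible.

```python
class ChainedHashTable:
    """Toy hash table that resolves collisions with per-bucket lists."""

    def __init__(self, num_buckets=8):
        self.buckets = [[] for _ in range(num_buckets)]

    def _bucket_index(self, name):
        # Hypothetical, deliberately weak hash function: the sum of the
        # UTF-8 byte values, reduced to a bucket index.
        return sum(name.encode("utf-8")) % len(self.buckets)

    def put(self, name, value):
        bucket = self.buckets[self._bucket_index(name)]
        for i, (existing, _) in enumerate(bucket):
            if existing == name:
                bucket[i] = (name, value)  # overwrite an existing entry
                return
        bucket.append((name, value))

    def get(self, name):
        # Collision handling: scan every entry that shares the bucket.
        for existing, value in self.buckets[self._bucket_index(name)]:
            if existing == name:
                return value
        raise KeyError(name)


table = ChainedHashTable()
table.put("ab", 1)
table.put("ba", 2)  # same byte sum as "ab", so both share one bucket
```

Here "ab" and "ba" collide because the byte-sum function ignores character order. The scan inside `get` is the extra collision-handling logic that a perfect hash function would make unnecessary.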
Hash functions often use a technique referred to as a checksum to determine hash keys from input values. A checksum can be thought of as a summary of data: a long sequence of bytes, when run through a checksum function, will generally produce a fixed-length value (e.g., 32 bits or 64 bits) referred to as the checksum. This value can then be compared to the checksum of other data, or of the same data at a different time, to quickly detect changes to the data. The checksum function is generally chosen so that even small changes to the data will result in a different checksum value, so that changes can be detected efficiently. The most common checksum functions for detecting data changes are a group of algorithms known as cyclic redundancy checks (CRCs). Cryptographic hash algorithms, such as MD5 and SHA-1, are also commonly used.
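As a small illustration of the checksum idea, the following sketch uses the CRC-32 implementation in Python's standard `zlib` module. The helper names `checksum` and `bucket_for`, the sample byte strings, and the choice of reducing the checksum modulo the bucket count are assumptions made for this example.

```python
import zlib


def checksum(data: bytes) -> int:
    """Return the 32-bit CRC of data as an unsigned integer."""
    return zlib.crc32(data) & 0xFFFFFFFF


def bucket_for(data: bytes, num_buckets: int) -> int:
    """Reduce the fixed-length checksum to a hash bucket index."""
    return checksum(data) % num_buckets


original = b"The quick brown fox jumps over the lazy dog"
modified = b"The quick brown fox jumps over the lazy cog"  # one byte changed

# Comparing two 32-bit values is enough to notice that the data changed.
changed = checksum(original) != checksum(modified)
```

A one-byte change such as this is always detected by CRC-32, since the altered bits fall within a burst shorter than 32 bits. Two unrelated inputs can still share a checksum, so equal checksums only suggest, and do not prove, that the data is the same.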
Hash tables experience worst-case performance when large numbers of input values all have the same key and are thus hashed into the same bucket. This can be caused either by a poor choice of hash function or by maliciously chosen input that takes advantage of the predictable nature of the hash function. Non-cryptographic hash codes (such as those used in the hash table data structure) can be calculated predictably outside of an application's boundaries. A malicious attacker can use this to force many entries into the same hash table location, deliberately creating worst-case performance and causing substantial delays in processing input data that would otherwise be acceptable.
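To make the attack concrete, the sketch below shows how an attacker might precompute colliding inputs for a predictable hash function. The function `poly31_hash` is an assumed stand-in, modeled on the widely used multiply-by-31 polynomial string hash; `colliding_strings` and `comparisons_to_insert` are hypothetical helpers written for this example.

```python
from itertools import product


def poly31_hash(s: str) -> int:
    """Predictable 32-bit polynomial string hash (multiply by 31, add)."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h


def colliding_strings(n_blocks: int) -> list:
    # "Aa" and "BB" hash to the same value, so any concatenation of
    # n_blocks such blocks collides with every other one (2**n_blocks inputs).
    return ["".join(parts) for parts in product(("Aa", "BB"), repeat=n_blocks)]


def comparisons_to_insert(names) -> int:
    """Count entry comparisons when inserting into one chained table."""
    buckets, comparisons = {}, 0
    for name in names:
        chain = buckets.setdefault(poly31_hash(name), [])
        comparisons += len(chain)  # every earlier entry in the chain is scanned
        chain.append(name)
    return comparisons


attack_inputs = colliding_strings(10)  # 1024 distinct strings, one hash value
```

Because every attack string lands in the same chain, inserting n of them costs n(n-1)/2 comparisons in this model, rather than the roughly constant cost per insert expected with well-distributed keys.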
There are several approaches to dealing with malicious inputs to hash tables. One potential solution is to mix secret data, not known to the attacker, into the otherwise predictable hash function. However, existing hash functions are built for speed, for good distribution of outputs over the input data, or for a high likelihood that related inputs are spread evenly throughout the table, and they offer few good places to include extra secret data that attackers cannot work around. Another approach is to use a hash table with good worst-case performance. Unfortunately, such hash tables are more complicated and sometimes have poorer average performance.
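The idea of mixing secret data into the hash function can be sketched with a keyed hash. Note that this example swaps in a keyed cryptographic hash (BLAKE2b from Python's standard `hashlib` module), which is typically slower than the functions ordinarily used in hash tables; it illustrates the general idea under that assumption and is not a technique taken from the text. The names `PROCESS_SECRET` and `keyed_bucket_index` are hypothetical.

```python
import hashlib
import os

# Assumption: a per-process random secret, generated once at startup and
# never revealed to whoever supplies the input data.
PROCESS_SECRET = os.urandom(16)


def keyed_bucket_index(data: bytes, num_buckets: int,
                       secret: bytes = PROCESS_SECRET) -> int:
    """Map data to a bucket using a hash keyed with secret bytes."""
    digest = hashlib.blake2b(data, key=secret, digest_size=8).digest()
    return int.from_bytes(digest, "big") % num_buckets


# The same input and secret always give the same bucket, but without the
# secret an attacker cannot predict which inputs will share a bucket.
index = keyed_bucket_index(b"example input", 1024)
```

The trade-off mentioned above still applies: the keyed function resists precomputed collisions, at the cost of more work per lookup than a simple non-cryptographic hash.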