A hash function is generally a procedure or mathematical function which maps a large, possibly variable-sized amount of data into a small datum, usually a single integer that may serve as an index to an array. The values returned by a hash function are called hash values, hash codes, hash sums, or simply hashes. Hash functions are mostly used to speed up table lookup or data comparison tasks, such as finding items in a database, detecting duplicated or similar records in a large file, finding similar stretches in DNA sequences, and so on.
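By way of a non-limiting illustration, the mapping from a variable-sized key to a small array index might be sketched as follows. The class name BucketIndex, the method name indexFor, and the multiplier 31 are assumptions chosen for this example only; many other hash functions would serve equally well.

```java
final class BucketIndex {
    // Hypothetical helper: maps a string of any length to an index in [0, tableSize).
    static int indexFor(String key, int tableSize) {
        int h = 0;
        for (int i = 0; i < key.length(); i++) {
            // Polynomial accumulation; int overflow simply wraps around.
            h = 31 * h + key.charAt(i);
        }
        // floorMod keeps the result non-negative even if h wrapped to a negative value.
        return Math.floorMod(h, tableSize);
    }

    public static void main(String[] args) {
        System.out.println(indexFor("hash function", 16));
    }
}
```

The returned value may then serve directly as an index into an array of tableSize entries.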
Typically, a hash function may map two or more keys to the same hash value; such an occurrence is called a collision. In many applications, it is desirable to minimize the occurrence of collisions, which means that the hash function should map the keys to the hash values as evenly as possible. As can be appreciated, collisions generally increase the average lookup cost due to the time required to resolve them. In some applications, there is a distinguished set of inputs that are fixed or slowly varying, such as keywords in a programming language or the names of US states. In those applications, reducing collisions on the distinguished set of inputs can be especially beneficial given the expected frequency of occurrence of those inputs. Depending on the application, other properties may be required as well, including low cost, determinism, uniformity, or compactness.
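A collision can be observed directly with Java's built-in String.hashCode method, for which the strings "Aa" and "BB" produce the same value. The sketch below counts how many keys of a fixed set land in an already occupied slot of a table; the class name CollisionCount and the method name countCollisions are hypothetical and are used only for illustration.

```java
import java.util.HashSet;
import java.util.Set;

final class CollisionCount {
    // Counts keys whose slot is already taken by an earlier key in the same set.
    static int countCollisions(String[] keys, int tableSize) {
        Set<Integer> usedSlots = new HashSet<>();
        int collisions = 0;
        for (String key : keys) {
            int slot = Math.floorMod(key.hashCode(), tableSize);
            if (!usedSlots.add(slot)) {
                collisions++; // add() returns false when the slot was already present
            }
        }
        return collisions;
    }

    public static void main(String[] args) {
        System.out.println(countCollisions(new String[] {"Aa", "BB"}, 64));
    }
}
```

For a distinguished set of inputs, a count of zero for some table size indicates that the function happens to be collision-free on that set.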
A hash function that is injective over a distinguished set, that is, one that maps each element of the distinguished set to a different hash value, is said to be “perfect.” In other words, a perfect hash function for a set S maps distinct elements of S to distinct hash values, such as distinct integers, with no collisions. A perfect hash function with values in a limited range may be used for efficient lookup operations by placing keys from S (or other associated values) in a table indexed by the output of the function.
As noted above, hash functions may have uses in a variety of table lookup operations. For illustration purposes, a particular application for utilizing hash functions is described herein which involves a construct of programming languages. Many computer programming languages include a language construct that may be viewed as a multi-way branch in which the value of a run-time variable or expression may be compared with members of a set of constants. A branch selection is made based on the result of the comparisons. Such constructs are often known as “switch statements.” The description below provides an introduction and background information on these switch statements.
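Before turning to switch statements, the notion of a perfect hash function described above can be made concrete with a small sketch. The distinguished set chosen here (three-letter weekday abbreviations), the hash (sum of the three character codes modulo 15), and all class and method names are assumptions made purely for illustration; they are not taken from any particular implementation. The static initializer verifies that no two keys share a slot.

```java
import java.util.Arrays;

final class WeekdayPerfectHash {
    // Hypothetical distinguished set S.
    private static final String[] KEYS = {"Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"};
    private static final int TABLE_SIZE = 15;
    // Maps each table slot to an index into KEYS, or -1 if the slot is unused.
    private static final int[] SLOT_TO_DAY = new int[TABLE_SIZE];

    static {
        Arrays.fill(SLOT_TO_DAY, -1);
        for (int day = 0; day < KEYS.length; day++) {
            int slot = hash(KEYS[day]);
            if (SLOT_TO_DAY[slot] != -1) {
                throw new IllegalStateException("hash is not perfect for this key set");
            }
            SLOT_TO_DAY[slot] = day;
        }
    }

    // Assumes the key has at least three characters.
    static int hash(String key) {
        return (key.charAt(0) + key.charAt(1) + key.charAt(2)) % TABLE_SIZE;
    }

    // Returns 0..6 for a member of S, or -1 for any other string.
    static int lookup(String key) {
        if (key.length() != 3) {
            return -1;
        }
        int day = SLOT_TO_DAY[hash(key)];
        // A single comparison suffices because each slot holds at most one key.
        return (day >= 0 && KEYS[day].equals(key)) ? day : -1;
    }

    public static void main(String[] args) {
        System.out.println(lookup("Wed"));
    }
}
```

Because the function is injective on S, a lookup costs one hash computation, one table access, and one string comparison, regardless of which member of S is presented.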
In typical switch statements, the run-time variable or expression, sometimes referred to as the “control variable” or “input control variable,” may be tested against a set of case labels (or “case values”). If the value of the control variable matches a case label, the program will execute a block of code associated with that case label. If the value of the control variable does not match that case label, the next case label may be examined, and the process repeats until a match is found or until the given set of case labels is exhausted. In some languages, a “default” case label may be used to handle situations where the control variable does not match any of the case labels. Further, the method of terminating a block of code associated with a case label may vary by programming language. Typically, a “break” keyword is used to signal the end of a block, thereby causing the program execution to exit the multi-way branch construct. If no “break” keyword is present at the end of a block of code, in many languages the program execution “falls through” to the code associated with the next case label in the construct, as if its value also matched the control variable. In other languages, “fall throughs” are not permitted and a “break” keyword is implicit and does not need to appear in the source code.
One example programming language that includes the aforementioned multi-way branches is the Java programming language. The keyword used in Java for this type of construct is “switch.” To provide a context for the features presented herein, an example of the use of the “switch” statement available in the Java programming language is described below.
A Java switch statement works with data types including the byte, short, char, and int primitive data types. The switch statement also works with enumerated types and a few special classes that “wrap” certain primitive types. The following example program declares an integer variable named “month” whose value represents a month of the year. The program assigns the name of the month to a string variable “str” based on the value of the integer variable “month”, using a switch statement as follows:
    int month = 8;
    String str;
    switch (month) {
        case 1: str = "January"; break;
        case 2: str = "February"; break;
        case 3: str = "March"; break;
        case 4: str = "April"; break;
        case 5: str = "May"; break;
        case 6: str = "June"; break;
        case 7: str = "July"; break;
        case 8: str = "August"; break;
        case 9: str = "September"; break;
        case 10: str = "October"; break;
        case 11: str = "November"; break;
        case 12: str = "December"; break;
        default: str = "Invalid month."; break;
    }
In this case, the variable str is set to “August” since the integer “month” is set to equal 8.
The switch statement above could also be implemented with if-then-else statements:
    int month = 8;
    String str;
    if (month == 1) {
        str = "January";
    } else if (month == 2) {
        str = "February";
    }
    . . . // and so on
Deciding whether to use if-then-else statements or a switch statement may be based on several factors, including readability, compile time requirements, execution time requirements, memory requirements, or other factors. It is noted that if-then-else statements and switch statements may be expressed in terms of one another. Generally, the if-then-else construct is more powerful since it may be used to compare multiple variables at once and compare a variable against a range of values. However, a switch construct is more readable when only one variable is being compared against a restricted set of values.
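The difference in expressiveness noted above can be sketched with a small example in which a month number is mapped to a calendar quarter. The if-then-else form tests ranges directly, whereas the switch form must enumerate each case label. The class and method names are hypothetical and serve only to illustrate the comparison.

```java
final class QuarterExample {
    // if-then-else: each branch compares the variable against a range.
    static int quarterWithIf(int month) {
        if (month >= 1 && month <= 3) {
            return 1;
        } else if (month >= 4 && month <= 6) {
            return 2;
        } else if (month >= 7 && month <= 9) {
            return 3;
        } else if (month >= 10 && month <= 12) {
            return 4;
        }
        return -1; // not a valid month
    }

    // switch: every value in each range appears as its own case label.
    static int quarterWithSwitch(int month) {
        switch (month) {
            case 1: case 2: case 3:
                return 1;
            case 4: case 5: case 6:
                return 2;
            case 7: case 8: case 9:
                return 3;
            case 10: case 11: case 12:
                return 4;
            default:
                return -1;
        }
    }

    public static void main(String[] args) {
        System.out.println(quarterWithIf(8) + " " + quarterWithSwitch(8));
    }
}
```

Both methods return the same result for every input; the choice between them is a matter of the factors listed above.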
As noted above, the break statements are used because without them, case statements fall through. That is, without an explicit break, control of the program will flow sequentially through subsequent case statements. The following program illustrates why it might be useful to have case statements fall through:
    int month = 2;
    int year = 2000;
    int numDays = 0;
    switch (month) {
        case 1:
        case 3:
        case 5:
        case 7:
        case 8:
        case 10:
        case 12:
            numDays = 31;
            break;
        case 4:
        case 6:
        case 9:
        case 11:
            numDays = 30;
            break;
        case 2:
            // Leap year: divisible by 4 but not by 100, or divisible by 400.
            numDays = (((year % 4 == 0) && !(year % 100 == 0)) || (year % 400 == 0)) ? 29 : 28;
            break;
        default:
            numDays = -1;
            break;
    }
In this example, since the integer variable “month” is set to 2 and the integer variable “year” is set to 2000, which is divisible by 400, the variable numDays is assigned the value 29.
If the range of case labels is relatively small and has only a few gaps (i.e., the case labels form a dense set), compilers may implement the switch statement as a branch table or an array of indexed function pointers rather than a lengthy series of conditional instructions. As can be appreciated, using such methods for case labels that form a sparse set could result in relatively inefficient programs.
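The branch-table approach can be emulated at the source level as a rough sketch. Here the assumed case labels are 10, 11, 12, and 14, so the table spans five slots and the single gap (13) is routed to the default handler. The Handler interface, the label values, and the helper tableSlots are all hypothetical; an actual compiler would emit a table of jump targets rather than Java objects.

```java
final class BranchTableSketch {
    // Stand-in for a code address or function pointer.
    private interface Handler {
        String run();
    }

    private static final int MIN_LABEL = 10;
    private static final Handler DEFAULT = () -> "default";

    // One slot per value in [10, 14]; the gap at 13 points to the default handler.
    private static final Handler[] TABLE = {
        () -> "ten",
        () -> "eleven",
        () -> "twelve",
        DEFAULT,
        () -> "fourteen"
    };

    // One subtraction, one bounds check, one indexed call; no chain of comparisons.
    static String dispatch(int value) {
        int index = value - MIN_LABEL;
        if (index < 0 || index >= TABLE.length) {
            return DEFAULT.run();
        }
        return TABLE[index].run();
    }

    // Number of slots a table covering every value from the smallest to the largest label needs.
    static long tableSlots(int[] labels) {
        int min = labels[0];
        int max = labels[0];
        for (int label : labels) {
            min = Math.min(min, label);
            max = Math.max(max, label);
        }
        return (long) max - min + 1;
    }

    public static void main(String[] args) {
        System.out.println(dispatch(12));
        System.out.println(tableSlots(new int[] {1, 1000, 1000000}));
    }
}
```

The helper tableSlots suggests why sparse labels are a poor fit: three labels such as 1, 1000, and 1000000 would call for a table of one million slots, nearly all of them pointing to the default handler.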