Embodiments of the invention relate to rejecting rows when scanning a collision chain.
In a hashed table organization, one key (formed by one or more columns of a database table) is typically defined as a hash key, and this hash key is used to quickly locate rows that have equal hash key values.
For example, the hash key is like an index key, and the hash key can be a single column (e.g. account number) or multiple columns (e.g. last name, first name, middle name). The hash key has an associated hash key value, and, from the hash key value, a hash value is computed, which may appear to be random number (e.g. an 8 byte integer). A hash function is used to map any particular hash key value to a hash value.
Once a hash value is known, the hash value modulo “a number of pages” is used to determine a page number (target page) from a set of pages in a database. Prime numbers may be used to reduce collisions. Typically, a hash function (e.g., an algorithm) is applied to the bytes of the hash key to generate a quasi-random numeric hash value. The hash value is then used to identify a page that a particular hash key value is located on. Typically the page is identified by using modulo arithmetic (i.e., dividing the hash value by the number of pages, and then using the remainder as the page number, which can be written as “hash_value mod n”, where n is the number of pages to hash into). Implementations often also hash to a particular row, or collision chain of rows, within the page using the same approach. Often, there is a group of rows that need to be verified to determine whether each of the rows matches the hash key value (i.e., matches the one or more column values forming the hash key).
Once the hash page is identified, the hash value modulo a “number of collision chains or number of rows” is used to determine where the row goes (i.e. a target collision chain or a target row).
Thus, with a hashed table, given a hash key, a hash value is determined and used to identify the page and location within the page of one or more rows with that particular hash key value. The following illustrates the use of the hash key to obtain a hash value, which is used to identify a page and a collision chain on the page.
Hash Key→Hash Value→(page, collision chain)
It is possible for a lot of different hash keys to produce the same hash value, and it is possible for different hash values to result in the same page number and/or collision chain. For this reason, any given hash collision chain can contain rows with many different hash key values as well as hash values.
In particular, once on a target page, different hash keys may map to the same row or collision chain, and this may be described as a hash collision. Once at the collision chain, the row that truly matches the hash key value is identified. In such cases, the hash value modulo the number of collision chains in the page may be used to identify a particular collision chain in the page. The number of collision chains in a page can vary from one (in which case every row on the page is considered a hash collision) to many.
That is, many hash key values can result in the same hash value, and many different hash values can ‘map’ to the same page and the same collision chain. For example, if there are 16 data pages, hash values of 1 and 17 would both map to the same page, and, if there were 8 collision chains in a page, hash values of 1 and 17 would also end up on the same page. Prime numbers for the number of pages/collision chains may be used to help reduce the number of collisions. Such use of a hash key and a hash value behaves somewhat randomly. For example, if there are 1 million hash pages, there is about a 1 in a million chance that any two rows will be on the same page, and, if there are 30 collision chains within that page, there is about a 1/30 chance that any two rows on a particular page will be on the same collision chain. While looking for a row, if one is found on a collision chain, it is possible that it is the row to be located or that it is a different hash key that just happened to end up on the same collision chain.
Once at the collision chain level, the collision chain is scanned to find all occurrences of the particular hash key value to be located. This is typically done by comparing the actual columns of the hash key (i.e. comparing the account number or last/first name columns of the hash key) to the same columns of the row in the collision chain. Note that comparing the hash value alone is not sufficient because multiple hash keys can result in the same hash value. If the hash key is known to be unique, processing is done until a matching hash key is found (and only half the collision chain is scanned on average if the matching hash key is on the collision chain, and, because the hash key is unique, processing is done once the match is found). If the hash key is unique, but the particular hash key to be located does not exist, the entire collision chain is scanned to determine that the hash key does not exist. If the hash key is not unique, then the whole collision chain is scanned to find all matches.
When trying to locate a row with a particular hash key value, the hash value is computed from the hash key (i.e., from the one or more columns making up the hash key), and then the target page and collision chain are computed arithmetically using modulo or other arithmetic means. Once the collision chain is accessed, the collision chain is typically scanned to identify one or more rows on the collision chain that match the desired hash key value. When collision chains are very short (e.g., zero or one entries on the collision chain), scanning the rows on the collision chain is very fast and uses little Central Processing Unit (CPU) processing. However, as collision chains get longer, the amount of the CPU processing used to identify which rows match a particular hash key becomes more expensive.
Also, in some cases, rows are compressed in the page. In this case, the row is first decompressed to determine whether the hash key of that row matches the target hash key. In the event that a query is looking for a row that does not exist, it is necessary to scan all entries on the collision chain just to determine that the row does not exist. Thus, CPU processing is used to find the correct row on the collision chain.