The inventors are aware that large lists or sequences of data elements are commonly generated for many applications. In recent years, the use and popularity of encryption codes have made the generation of large sequences of unique numbers an important objective. Commonly, the generation of lists of unique numbers is accomplished using random number generation algorithms or other related processes. However, as is known to those of ordinary skill in the art, such algorithms do not reliably generate completely random data sets having no duplicate values. Each data set must therefore be meticulously checked for the presence and frequency of duplicate values.
On its surface, the task of checking for the presence of duplicate values does not appear too daunting. However, the data elements themselves are becoming larger and larger, which makes such checking an increasingly time-intensive process. This is especially the case when checking the 256-bit and larger data elements coming into common usage. When this increasing data element size is coupled with the fact that data sequences comprising millions or even hundreds of millions of data elements (or more) are now being used, the task of finding duplicates becomes much more difficult and time consuming. In fact, using present methods and technologies, searching such lists to determine if duplicate values are present is a massive undertaking. Networked computing systems can take as long as a month to identify duplicate data elements in a data set of 100 million data elements. Even when relatively fast programming languages are used (e.g., C++, Assembly, and the like), such duplicate value searches can take many days.
Among the present methods in use for detecting duplicate values is a single match sorting algorithm. This method begins with the first data element in the data set and compares it with every other element in the data set. If there is no match, the data element is identified as unique. The next data element is then searched in a similar fashion. In data sets of many millions of data elements this can take days or even weeks. In other words, the process can be so time consuming as to be completely prohibitive. Another present approach requires that each data element be read and sorted into a “bin”. Bins having more than one data element contain duplicate data elements. In such an approach every data element must be completely sorted and then put in a bin. This is also a very time consuming process, especially when large data elements are used (128-bit, 256-bit, and larger data elements). The process is made even more time consuming by the fact that even the fastest and most powerful computers in usage today use 64-bit logic, which can only slowly process larger word sizes (e.g., 128-bit words and larger). These restrictions are more burdensome still for a typical computer, which uses 32-bit word sizes. Consequently, both of these common sorting approaches are slow and inefficient for sorting large data sets having large size data elements.
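For illustration only, the two conventional approaches described above might be sketched as follows in Python. This is a minimal sketch under stated assumptions, not an implementation taken from any particular prior system: the function names find_duplicates_pairwise and find_duplicates_binned are hypothetical, and the "bin" is assumed here to be keyed by the complete value of the data element.

```python
from collections import defaultdict


def find_duplicates_pairwise(elements):
    """Single-match approach: compare each element with every other element.

    Hypothetical helper for illustration. Performs n * (n - 1) comparisons
    for n elements.
    """
    duplicates = set()
    n = len(elements)
    for i in range(n):
        for j in range(n):
            # Skip comparing an element with itself.
            if i != j and elements[i] == elements[j]:
                duplicates.add(elements[i])
                break  # one match is enough to mark this element
    return duplicates


def find_duplicates_binned(elements):
    """Bin approach: place every element in a bin keyed by its full value.

    Hypothetical helper for illustration. Assumes one bin per distinct
    value; any bin holding more than one element contains duplicates.
    """
    bins = defaultdict(list)
    for value in elements:
        bins[value].append(value)
    return {value for value, members in bins.items() if len(members) > 1}


if __name__ == "__main__":
    sample = [5, 3, 5, 7, 3, 9]
    print(find_duplicates_pairwise(sample))
    print(find_duplicates_binned(sample))
```

In the worst case the pairwise sketch performs n(n - 1) comparisons, which for a data set of 100 million data elements is on the order of 10^16 comparisons. The binned sketch must read and place every complete data element, so its cost grows with the size of each data element.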
Additionally, when sequences of data elements are generated, it is important to know where in the sequence each duplicate value is. This information can, for example, help to troubleshoot the random number generation algorithms used to generate the data values. Thus, there is also a need for methods of tracking the position of duplicate data elements in a data set.
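As a simple illustration of what tracking the position of duplicate data elements means, the following Python sketch records the index of every occurrence of each value and reports only those values that occur more than once. It is an assumption-laden example, not the method of the invention: the function name duplicate_positions is hypothetical, and positions are assumed to be zero-based indices into the sequence.

```python
from collections import defaultdict


def duplicate_positions(elements):
    """Map each duplicated value to the list of positions where it occurs.

    Hypothetical helper for illustration; positions are zero-based.
    """
    positions = defaultdict(list)
    for index, value in enumerate(elements):
        positions[value].append(index)
    # Keep only values that appear at more than one position.
    return {value: found for value, found in positions.items() if len(found) > 1}


if __name__ == "__main__":
    print(duplicate_positions([5, 3, 5, 7, 3, 9]))
```

Knowing, for example, that a repeated value occurs at positions 0 and 2 of a sequence indicates how far apart the repeats fall, which is the kind of information that can help in troubleshooting the generator.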
With each new set of data encryption codes for credit cards, bank accounts, e-mail accounts, financial transaction codes, and every other manner of encrypted data, the need for large data sets with non-duplicate data elements is becoming ever more important. This increases the need to test such data sets. It is also important that such testing for duplicate data values be performed rapidly.
The inventors have recognized that there is a need for improving existing search methods. The invention described herein discloses a method and apparatus for enabling faster and more complete searches to be performed on increasingly large data sets having increasingly large data elements.