Numerous organizations, including industry and government entities, recognize that important conclusions can be drawn if massive data sets can be analyzed to identify patterns of behavior that suggest dangers to public safety or evidence illegality. These analyses often involve matching data associated with a person or thing of interest with other data associated with the same person or thing to determine that the same person or thing has been involved in multiple acts that raise safety or criminal concerns.
Yet, the quality of the analytical result arising from use of sophisticated analytical tools can be limited by the quality of data the tool utilizes. For certain types of analyses, an acceptable error rate must be literally or nearly zero for an analytical conclusion drawn from the data to be sound. Achieving this zero or near-zero error rate for datasets comprising tens or hundreds of millions of records can be problematic. Present data comparison tools are not well suited to solve these issues.
The issues discussed above are particularly acute for analyses involving data related to identifying persons or things for inquiries relating to public safety. For example, analytical tools for identifying potential safety threats generally do not have an acceptable error rate greater than zero because the cost of mistakenly identifying the presence of a safety threat (i.e., a “false positive”) or allowing a safety threat to go undetected (i.e., a “false negative”) is unacceptably high. Therefore, tools supporting public safety must correctly relate data associated with persons or things of interest with other data related to the same person or thing.
Some tools exist for accurately comparing data, but they are computationally impractical to use with datasets containing millions of records. For example, one solution to determining whether two particular objects are associated with the same person or thing of interest is to compare each element of one object to a corresponding element in the second object. Specifically, for objects containing M elements, a first element in the first object may be compared to a corresponding first element in the second object, and corresponding comparisons may be made for each of the remaining M−1 elements common to the first and second objects. If the elements within each object are collectively adequate to uniquely identify the represented person or thing with certainty, and corresponding elements within the first and second objects match, a conclusion may reasonably be drawn that the objects reflect the same person or thing. As an alternative, each object could be converted (serialized) into a single string reflecting the contents of each element to be compared. Thereafter, a string generated from one object could be compared to a string generated from another object as a form of object comparison.
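The two comparison approaches described above may be illustrated with a minimal sketch. The function names, the separator character, and the sample records are hypothetical and chosen only for illustration; the sketch assumes each object is an ordered sequence of string elements.

```python
from typing import Sequence

# Assumption: a control character that does not occur in the element data,
# so that joined strings cannot collide across element boundaries.
SEPARATOR = "\x1f"


def elements_match(first: Sequence[str], second: Sequence[str]) -> bool:
    """Compare two objects element by element (M comparisons for M elements)."""
    if len(first) != len(second):
        return False
    return all(a == b for a, b in zip(first, second))


def serialize(obj: Sequence[str]) -> str:
    """Convert an object into a single string reflecting each element."""
    return SEPARATOR.join(obj)


# Hypothetical sample records (name, date of birth, document number).
record_a = ("Jane Doe", "1980-01-01", "A1234567")
record_b = ("Jane Doe", "1980-01-01", "A1234567")
record_c = ("Jane Doe", "1980-01-02", "A1234567")

same_by_elements = elements_match(record_a, record_b)
same_by_string = serialize(record_a) == serialize(record_b)
differs_by_string = serialize(record_a) != serialize(record_c)
```

In this sketch, both approaches reach the same conclusion for a given pair of objects; the serialized form merely replaces M element comparisons with one string comparison per pair.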
For certain datasets, the above approaches may consume little memory or system resources, because the objects or their serialized strings can be stored on disk rather than in main memory. However, the above approaches may quickly become impractical with large or non-trivial datasets. As the number of objects to compare increases, the number of comparisons, and thus the processing time of the comparisons, increases quadratically; i.e., in proportion to n²/2, where n represents the number of objects to be compared. Thus, a comparison of 500 objects using a serialized approach, whose processing time may be approximated as the time to perform 125,000 string comparisons, may be computationally tractable. However, a comparison of 100 million (100M) records using that approach, whose processing time may be approximated as the time to perform 5 quadrillion (5e15) string comparisons, may be computationally intractable. Additionally, reading strings from disk rather than from memory may add further processing time.
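The comparison counts quoted above can be checked with a short calculation. The function names below are hypothetical; the sketch assumes that every object is compared once against every other object.

```python
def exact_pair_count(n: int) -> int:
    """Exact number of unordered pairs among n objects: n * (n - 1) / 2."""
    return n * (n - 1) // 2


def approximate_pair_count(n: int) -> int:
    """The n^2 / 2 approximation used in the text."""
    return n * n // 2


small = approximate_pair_count(500)          # 125,000 comparisons
large = approximate_pair_count(100_000_000)  # 5e15 comparisons
```

For large n the exact and approximate counts differ only by n/2, which is negligible relative to the total.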
Another solution for identifying matching objects within a corpus of objects is to store each object in a multimap. A multimap is an associative array that stores multiple values for each key. Importing the objects into the multimap leads to objects with the same element data being stored in a single entry of the multimap. Thus, use of a multimap associates identical objects.
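A minimal sketch of the multimap approach follows, using Python's collections.defaultdict as an in-memory multimap. The function name and the sample corpus are hypothetical, and the sketch assumes an object's elements can be used together as a hashable key.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def group_identical(objects: Sequence[Sequence[str]]) -> Dict[Tuple[str, ...], List[int]]:
    """Build a multimap from element data to the positions of matching objects."""
    multimap: Dict[Tuple[str, ...], List[int]] = defaultdict(list)
    for position, obj in enumerate(objects):
        # Objects with the same element data land in the same entry.
        multimap[tuple(obj)].append(position)
    return multimap


# Hypothetical corpus: positions 0 and 2 hold identical objects.
corpus = [
    ("Jane Doe", "1980-01-01"),
    ("John Roe", "1975-06-15"),
    ("Jane Doe", "1980-01-01"),
]

groups = group_identical(corpus)
duplicates = {key: positions for key, positions in groups.items() if len(positions) > 1}
```

Each object is inserted once, so grouping requires a single pass over the corpus, but the entire multimap resides in main memory.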
One drawback to using a multimap for object comparisons is that the multimap is typically stored in main memory, due to algorithmic considerations related to key organization within the multimap, so an object comparator must have sufficient main memory to hold a multimap comprising the entire corpus. Therefore, a multimap solution can be impractical for datasets at or above 100M objects. Each approach has similar drawbacks when applied to other object comparison problems, such as efficiently identifying unique objects within a corpus of objects and efficiently comparing a single object to all objects within a corpus of objects.
Neither solution is viable for datasets approaching or exceeding 100M objects. Yet, object datasets comprising 100M or more objects are not uncommon today. Therefore, the problems described above are quite real and a need exists for improved object comparators.