Increasingly, commercial, governmental, institutional and other entities collect vast amounts of data related to a variety of subjects, activities and pursuits. Society's appreciation for and use of information technology and management to analyze such data is now well ensconced in everyday life. For example, collected data may be examined for historical, trending, predictive, preventive, profiling, and many other useful purposes. Although the technology for collecting and storing such vast amounts of data is in place, efficient and effective technology for accessing, processing, verifying and analyzing such data, and for making decisions based on it, is presently lacking or at the least in need of improvement. There exists broad and eager anticipation for unleashing the potential associated with such vast amounts of data and expanding the power that intelligent business solutions bring to commercial, governmental, and other societal pursuits. There exists a need and desire for intelligent solutions to realize this potential.
Applications for exploiting collected data include, but are not limited to: national security; law enforcement; immigration and border control; locating missing persons and property; firearms tracking; civil and criminal investigations; person and property location and verification; governmental and agency record handling; entity searching and location; package delivery; telecommunications; consumer-related applications; credit reporting, scoring, and/or evaluating; debt collection; entity identification verification; account establishment, scoring and monitoring; fraud detection; health industry (patient record maintenance); biometric and other forms of authentication; insurance and risk management; marketing, including direct-to-consumer marketing; human resources/employment; and financial/banking industries. The applications may span an enterprise or agency or extend across multiple agencies, businesses, industries, etc.
One technique for using data to achieve a useful purpose is record linkage or matching. Record linkage generally is a process for linking, matching or associating data records and typically is used to provide insight and effective analysis of data contained in data records. Data records, which may include one or more discrete data fields containing data, may be derived from one or more sources and may be linked or matched, for example, by: matching on identifying data (e.g., social security number, tax number, employee number, telephone number, etc.); exact matching based on entity identification; and statistical matching based on one or more similar characteristics (e.g., name, geography, product type, sales data, age, gender, occupation, license data, etc.) shared by or in common with records of one or more entities.
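The distinction between matching on identifying data and statistical matching on shared characteristics can be illustrated with a minimal sketch. It is not drawn from any particular system: the field names (`ssn`, `name`, `city`), the 0.85 threshold, and the use of a simple string-similarity average as a stand-in for statistical matching are all assumptions made for illustration.

```python
from difflib import SequenceMatcher

def exact_link(a, b, key="ssn"):
    """Deterministic link: both records carry the same non-empty identifier."""
    return bool(a.get(key)) and a.get(key) == b.get(key)

def similarity(a, b, fields=("name", "city")):
    """Average string similarity (0..1) over the chosen descriptive fields."""
    scores = [
        SequenceMatcher(
            None, str(a.get(f, "")).lower(), str(b.get(f, "")).lower()
        ).ratio()
        for f in fields
    ]
    return sum(scores) / len(scores)

def link(a, b, threshold=0.85):
    """Try the identifier first, then fall back to similar characteristics."""
    if exact_link(a, b):
        return "exact"
    if similarity(a, b) >= threshold:  # threshold is an assumed cutoff
        return "statistical"
    return "none"

# Hypothetical records, invented for this example only.
r1 = {"ssn": "123-45-6789", "name": "John Smith", "city": "Boston"}
r2 = {"ssn": "123-45-6789", "name": "J. Smith", "city": "Boston"}
r3 = {"ssn": "", "name": "Jon Smith", "city": "Boston"}
r4 = {"ssn": "987-65-4321", "name": "Maria Garcia", "city": "Austin"}

for other in (r2, r3, r4):
    print(other["name"], "->", link(r1, other))
```

In this sketch a record with a missing identifier can still be associated through its name and city, which is the situation statistical matching is meant to cover; a production system would weight fields individually rather than averaging them.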
Record linkage or matching involves accessing data records, such as those commonly stored in a database or data warehouse, and performing user-definable operations on the accessed data records to harvest or assemble data sets for presentation to and use by an end user. As a prelude or adjunct to record linkage, processes such as editing, removing contradictory data, cleansing, de-duping (i.e., reducing or eliminating duplicate records), and imputing (i.e., filling in missing or erroneous data or data fields) are performed on the data records to better analyze and present the data for consumption and use by an end user. Such processing has been referred to as statistical data editing (SDE). One category of statistical processes that has been discussed, but not widely implemented, for use in performing SDE is sometimes referred to as “classical probabilistic record linkage” theory and in large part derives from the works of I. P. Fellegi, D. Holt and A. Sunter. Such models generally employ algorithms that are applied against data tables. More widely adopted general models for SDE, such as if-then-else rules, have the disadvantage of being difficult to implement in computer code and difficult to modify or update. Implementing them typically requires developers to create custom software that encodes complex if-then-else and other rules. This process is error-prone, costly, inflexible, time-intensive and generally requires customized software for each solution.
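As a rough illustration of the probabilistic record linkage approach mentioned above, the following sketch scores a pair of records in the style of the Fellegi-Sunter model: each field contributes a log-likelihood-ratio weight depending on whether it agrees, and the total is compared against two thresholds. The field names, the m and u probabilities, and the threshold values are hypothetical and chosen only to make the example run; in practice they would be estimated from the data.

```python
import math

# Assumed per-field probabilities (illustrative only):
#   m = P(field agrees | records refer to the same entity)
#   u = P(field agrees | records refer to different entities)
FIELD_PARAMS = {
    "surname": (0.95, 0.01),
    "birth_year": (0.98, 0.05),
    "zip": (0.90, 0.10),
}

def match_weight(a, b, params=FIELD_PARAMS):
    """Sum agreement and disagreement weights over the compared fields."""
    total = 0.0
    for field, (m, u) in params.items():
        va, vb = a.get(field), b.get(field)
        if va is None or vb is None:
            continue  # a missing value contributes no evidence either way
        if va == vb:
            total += math.log2(m / u)              # agreement weight (positive)
        else:
            total += math.log2((1 - m) / (1 - u))  # disagreement weight (negative)
    return total

def classify(weight, lower=0.0, upper=8.0):
    """Three-way decision; both thresholds are assumed values."""
    if weight >= upper:
        return "link"
    if weight <= lower:
        return "non-link"
    return "possible link"

# Hypothetical records, invented for this example only.
a = {"surname": "smith", "birth_year": 1970, "zip": "02139"}
same = {"surname": "smith", "birth_year": 1970, "zip": "02139"}
partial = {"surname": "smith", "birth_year": 1971, "zip": "02139"}
different = {"surname": "garcia", "birth_year": 1985, "zip": "73301"}

for other in (same, partial, different):
    w = match_weight(a, other)
    print(round(w, 2), classify(w))
```

Because the behavior is driven by a small table of parameters rather than by nested if-then-else rules, adjusting the model means editing the table and thresholds rather than rewriting control flow.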
Although record linkage may be conducted by unaided human efforts, such efforts, even for the most elementary linkage operation, are time-intensive and impractical for record sets or collections of even modest size. Also, such activity may be considered tedious and unappealing to workers and would be prohibitively expensive from an operations standpoint. Accordingly, computers are increasingly utilized to process and link records. However, the amount of collected data that must be processed has outpaced the ability of even computerized record linkage systems to process such large volumes of data quickly and efficiently enough to satisfy the needs of users. Speed of processing data records and generating useful results is critical in most applications. The veracity of data records may be the most critical factor in some applications. There is a constant balance between the speed of processing and compiling data, the level of veracity of composite data records linked and presented, and the flexibility of the processing system for user-customizable searching and reporting. Even in applications where speed of results generation is not critical, it is still desired. Most present-day record linkage systems are OLAP-, OLTP- and RDBMS-based systems using query languages such as SQL. There are many drawbacks associated with this technology, which has not effectively met or balanced the competing interests of speed, veracity and flexibility. Such systems are limited as to the complexity of the processes, such as deterministic, probabilistic and other statistical processes, that may be effectively performed on databases or data farms or warehouses.