1. Field of the Invention
The present invention relates generally to information processing environments and, more particularly, to a system providing methodology for searching for matching names.
2. Description of the Background Art
Computers are very powerful tools for storing and providing access to vast amounts of information. Computer databases are a common mechanism for storing information on computer systems while providing easy access to users. A typical database is an organized collection of related information stored as “records” having “fields” of information. As an example, a database of employees may have a record for each employee where each record contains fields designating specifics about the employee, such as name, home address, salary, and the like.
Between the actual physical database itself (i.e., the data actually stored on a storage device) and the users of the system, a database management system or DBMS is typically provided as a software cushion or layer. In essence, the DBMS shields the database user from knowing or even caring about underlying hardware-level details. Typically, all requests from users for access to the data are processed by the DBMS. For example, information may be added to or removed from data files, retrieved from or updated in such files, and so forth, all without user knowledge of the underlying system implementation. In this manner, the DBMS provides users with a conceptual view of the database that is removed from the hardware level.
DBMS systems have long since moved from a centralized mainframe environment to a decentralized or distributed environment. Today, one generally finds database systems implemented as one or more PC “client” systems, for instance, connected via a network to one or more server-based database systems (SQL database servers). Commercial examples of these “client/server” systems include Powersoft® clients connected to one or more Sybase® Adaptive Server® Enterprise database servers. Both Powersoft® and Sybase® Adaptive Server® Enterprise (formerly Sybase® SQL Server®) are available from Sybase, Inc. of Dublin, Calif. The general construction and operation of database management systems, including “client/server” relational database systems, are well known in the art. See, e.g., Date, C., “An Introduction to Database Systems, Seventh Edition”, Addison Wesley, 2000, the disclosure of which is hereby incorporated by reference.
DBMS systems are in use today in a wide range of applications, including banking, insurance, manufacturing, airline ticketing, and many others. Tremendous quantities of information are stored in DBMS systems and they are used in many “mission critical” applications. Although DBMS systems are widely used and store large amounts of information, it can be difficult in certain circumstances to accurately identify particular information of interest in a DBMS system. One particular problem is in determining whether a database contains information about a particular person.
More generally, name recognition and name matching (whether in the context of a DBMS system or otherwise) are increasingly important to both government and business users. The terrorist events of Sep. 11, 2001 and the passage of the USA Patriot Act have greatly increased the pressure on federal agencies and private organizations such as banks, airlines, and insurance companies to ensure great diligence concerning business conducted with specific individuals and organizations. For example, when presented with a document such as a passport or credit card, many organizations are now required by law to check whether the name on the document is also on a “watch list” of terrorist sympathizers and their supporters.
While it might seem simple to check given names against a list (e.g., an official watch list), there are a number of fundamental problems in accurately identifying a particular person by name. First, official lists frequently contain spelling errors, abbreviations, and other anomalies that make matching a name against the list extremely difficult. These lists also contain a mixture of business names, individual names, and aliases. In addition, many names originate from foreign countries, which adds even more complexity to the name matching process. For these reasons, among others, determining whether a given name matches a person on a watch list can be extremely difficult. In addition to the risk of failing to identify a terrorist or another name on a watch list, these complexities also create a large risk of false positives. False positives, in turn, may result in offending or denying service to a valuable customer.
For these and other reasons, name recognition and name matching are inherently difficult tasks. Exact string matching is of very limited utility as a match will not be recognized if there is any discrepancy between two names. Other existing name matching solutions are incomplete and of only limited utility in addressing the problem of identifying matching names.
Many relational database systems currently include a “soundex” function for comparing two slightly dissimilar strings by sound. These functions are based on a “Soundex” system that was originally developed a number of years ago as an index filing system for grouping similar sounding names. The initial version of the system was patented by Robert C. Russell in 1918 as U.S. Pat. No. 1,261,167. Russell's system, which became known as “soundex” or “soundexing”, used a simple phonetic algorithm to reduce a name to a four-character alphanumeric code. The first character of the code is the first letter of the last name. The remaining three characters of the code are numerals derived from the syllables of the word.
The so-called “American Soundex” system is an improvement on Russell's invention, and was used by the National Archives and Records Administration to index the 1880, 1890, 1900, 1910, and 1920 U.S. Censuses. The Soundex code begins with the first letter of the surname. Each subsequent letter (ignoring punctuation such as spaces and hyphens) is then converted to a number as provided in the following table:
Number    Letters
1         B F P V
2         C G J K Q S X Z
3         D T
4         L
5         M N
6         R
Four simple rules are then applied. First, vowels (‘A’, ‘E’, ‘I’, ‘O’, ‘U’, ‘Y’) and the letters ‘H’ and ‘W’ are not coded; they are ignored. Second, double letters are coded as one letter (e.g., “Williams” has a code of W452). Third, letters of the same code not separated by other letters are coded as one letter (e.g., “Schmidt” has a code of S530). Fourth, the code is truncated if more than four characters long or is padded by adding zeros to the end if less than four characters long (e.g., “Lee” has a code of L000). The resulting four-character code is the simplified “American Soundex” code for the name.
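By way of illustration only, the simplified coding rules described above can be sketched in Python as follows. The sketch assumes the standard Soundex letter table (in which ‘Q’ is coded as 2) and the standard convention that ‘H’ and ‘W’, like the vowels, carry no code. The function name soundex and the table name CODES are hypothetical names chosen for this example. The sketch collapses only letters that are directly adjacent in the name, which is one reading of the third rule; other implementations treat ‘H’ and ‘W’ between same-coded letters differently.

```python
# Letter-to-digit table (standard Soundex grouping; Q is assumed to be coded 2).
CODES = {}
for digit, group in {"1": "BFPV", "2": "CGJKQSXZ", "3": "DT",
                     "4": "L", "5": "MN", "6": "R"}.items():
    for letter in group:
        CODES[letter] = digit


def soundex(name: str) -> str:
    """Return a four-character simplified American Soundex code for a name."""
    # Ignore spaces, hyphens, and anything else that is not an ASCII letter.
    letters = [ch for ch in name.upper() if "A" <= ch <= "Z"]
    if not letters:
        return ""

    code = letters[0]                  # the first letter is kept as-is
    prev = CODES.get(letters[0], "")   # code of the directly preceding letter
    for ch in letters[1:]:
        digit = CODES.get(ch, "")      # "" for vowels, H, W, and Y (not coded)
        # Emit a digit only when it differs from the preceding letter's code,
        # so double letters and adjacent same-coded letters collapse to one.
        if digit and digit != prev:
            code += digit
        prev = digit

    # Pad with zeros, then truncate to exactly four characters.
    return (code + "000")[:4]


if __name__ == "__main__":
    for example in ("Williams", "Schmidt", "Lee", "Robert", "Rupert"):
        print(example, soundex(example))
```

Under these assumptions the sketch reproduces the codes given above (W452 for “Williams”, S530 for “Schmidt”, and L000 for “Lee”). It also assigns the same code, R163, to both “Robert” and “Rupert”, which shows how distinct names are grouped under a single code.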
More recently, many database vendors have implemented variations of the Soundex function for use in database systems as a mechanism for comparing slightly dissimilar strings. Although these Soundex functions enable users to locate information based on phonetic similarities, they are well known to be too coarse for reliable name matching. In addition, various database vendors have slightly different Soundex implementations.
Accordingly, there is a need for a reliable name matching solution that provides for fine-grained analysis of potentially matching names and generates useful results. The present invention provides a solution for these and other needs.