Embodiments of the invention relate to identification of related names. Also, embodiments of the invention relate generally to automatic data processing systems that search and retrieve records from a database based on matching of personal names, and to improved systems and methods for intelligently processing name comparisons.
Information about individuals is often stored in a computer. Access to that information is most readily gained by using the name of the individual involved. The nature of names, however, including their behavior and permutations, poses significant challenges to information retrieval. Names vary during one's life (e.g., through marriage or professional preparation); they take on different forms, depending on the formality of the situation (WILLIAM CARVER/BILLY CARVER); they may be spelled differently if recorded by someone other than the individual (PRICE/PRIES). To amplify the difficulties even more, naming conventions vary across cultures. It may not be appropriate to assume that the typical American name structure of single given name (first name), single middle name or initial followed by a surname (last name) applies in a database that contains names from all over the world, a situation that is usual in today's world of global technology and communication. Names from other cultures may have compound surnames or may be composed of only one name. Names written in writing systems other than Roman may be transcribed in a variety of ways into the Roman alphabet because there is no single way to represent sounds that occur in another language but do not occur in English, causing significant differences in the spelling (KIM/GHIM).
Adequate information retrieval that is based on the name must anticipate the range and kinds of variation that can occur in names, both generally and in specific cultures. Other name search or information retrieval systems are generally unable to recognize or address the full range of variation in names. Some systems assume that names are static and search only for an exact match on the name. These systems cannot accommodate even the slightest spelling variations, initials or abbreviations (JOS. Z. BROWN/JOSEPH ZACHARY BROWNE). Other systems may use techniques or keys (such as Soundex or Soundex-like keys) that permit some minor spelling differences between names (DORSHER/DOERSHER) but these techniques generally fail to cope with significant variation (DOERSHER/DOESHER) or problems posed by names from non-Anglo cultures (ABDEL RAHMAN/ABDURRAMAN). If cultural differences are recognized, it is typically through use of equivalency lists or tables. Some of the more common variants can be accommodated in this way, but retrieval is then limited to those items on the list and cannot accommodate new representations or random variation or keying errors (GOMEZ/BOMEZ).
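The behavior of Soundex-like keys described above can be illustrated with a short sketch. The following is a minimal implementation of the commonly documented American Soundex rules, offered for illustration only; the function name `soundex` and the table `CODES` are assumptions introduced here and are not taken from any of the systems discussed.

```python
# Illustrative sketch of American Soundex (not the method of any cited system).
# Consonants map to digits; vowels and Y get no code; H and W are skipped.
CODES = {}
for digit, letters in {"1": "BFPV", "2": "CGJKQSXZ", "3": "DT",
                       "4": "L", "5": "MN", "6": "R"}.items():
    for ch in letters:
        CODES[ch] = digit


def soundex(name):
    """Return the four-character Soundex key for a single name."""
    letters = [c for c in name.upper() if c.isascii() and c.isalpha()]
    if not letters:
        return ""
    key = letters[0]                     # the first letter is kept as-is
    prev = CODES.get(letters[0], "")     # its code still suppresses repeats
    for ch in letters[1:]:
        if ch in "HW":
            continue                     # H and W do not separate equal codes
        code = CODES.get(ch, "")         # vowels yield "" and reset `prev`
        if code and code != prev:
            key += code
        prev = code
        if len(key) == 4:
            break
    return key.ljust(4, "0")             # pad short keys with zeros


for a, b in [("DORSHER", "DOERSHER"), ("DOERSHER", "DOESHER"), ("GOMEZ", "BOMEZ")]:
    print(a, soundex(a), b, soundex(b), soundex(a) == soundex(b))
```

Under these rules DORSHER and DOERSHER share a key, while DOERSHER/DOESHER and GOMEZ/BOMEZ do not, which is consistent with the limitations noted above: a dropped consonant or an error in the first letter changes the key.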
For a system to reach a level of adequacy for automatic name searching, it must therefore address a diverse set of issues related to name variation. Although spelling variations can often be addressed through character-matching techniques (e.g., SMITH/SMYTH), false-positive matches can result from traditional string or character comparisons when common morphological endings, such as OVICH, occur at the end of otherwise dissimilar names (e.g., ZELENOVICH/JOVANOVICH). Transcription from foreign writing systems to the Roman writing system poses additional spelling concerns. Different character sets, dialectal variations and sounds that are not represented in Roman alphabetic form at all contribute to the possibility of multiple, and often inconsistent, representations of the same name. A single Chinese character (ideogram) can be transcribed to produce numerous Roman forms that have little or no resemblance to one another due to dialectal variations. For example, CHANG, JANG and ZHANG are different Roman representations of the exact same Chinese name, as are the names WU, MHO and ENG. Similarly, a single Arabic name can result in transcriptions as diverse as KHADHAFI, CODOFFI and QATHAFI.
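The false-positive effect of a shared morphological ending can be seen with a simple character-based measure. The sketch below uses a Dice coefficient over sets of character bigrams, chosen here purely as a representative character comparison; the function names `bigrams` and `dice` are illustrative assumptions, and this is not presented as the comparison used by any system discussed in this document.

```python
def bigrams(name):
    """Return the set of adjacent two-character sequences in a name."""
    s = name.upper()
    return {s[i:i + 2] for i in range(len(s) - 1)}


def dice(a, b):
    """Dice coefficient over bigram sets: 2*|A & B| / (|A| + |B|)."""
    ga, gb = bigrams(a), bigrams(b)
    if not ga and not gb:
        return 1.0  # two names too short to have bigrams are treated as equal
    return 2 * len(ga & gb) / (len(ga) + len(gb))


# A genuine spelling variant versus two dissimilar names sharing -NOVICH.
print("SMITH/SMYTH          ", round(dice("SMITH", "SMYTH"), 3))
print("ZELENOVICH/JOVANOVICH", round(dice("ZELENOVICH", "JOVANOVICH"), 3))
```

With this particular measure the unrelated pair ZELENOVICH/JOVANOVICH scores higher than the true variant pair SMITH/SMYTH, because most of the shared bigrams come from the common ending rather than from the distinguishing stems.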
Character-based systems may also be confronted with significant retrieval problems caused by names with the same pronunciation but with divergent spellings. WOOSTER, WORCHESTER, and WUSTER may all share at least one identical pronunciation and yet show very different spellings. When name data are shared orally, the speaker's pronunciation, the listener's hearing (or mishearing) of the name and the speaker's expectations about the spelling of the name will impact the final written representation of a name. For example, a telephone reservationist may record a caller's name with a variety of phonetically correct spellings, which may not correspond to (and may therefore not be matched to) an existing database record for that caller.
Another common cause of name variation, which creates retrieval difficulty for name search systems, is the inclusion or exclusion of name data. Depending on the data source, names may be formal such as THOMAS EDWARD WINTHROP III, or informal such as TOM WINTHROP. An ideal name search system would be capable of correlating these two names, even though only a portion of the full name is available. To predict the relationship among variant formats of names, the system must also be able to recognize what rules govern which elements can be deleted or included or changed in different cultures. MARIA DEL CARMEN BUSTOS SAENZ will become MARIA DEL CARMEN BUSTOS DE LOPEZ, if she marries JUAN ANTONIO LOPEZ GARCIA. Predicting the relationship between these names is fundamental to retrieval success.
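The married-name example above follows one traditional Spanish-language convention, in which the bride keeps her given names and paternal surname and replaces her maternal surname with DE plus the husband's paternal surname. A minimal sketch of that single rule is shown below; the `HispanicName` type and the `married_variant` function are hypothetical names introduced for illustration, the name parts are assumed to be already separated, and actual usage varies by country, period and personal preference.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HispanicName:
    """A name already split into its parts (the parsing step is assumed)."""
    given: str
    paternal_surname: str
    maternal_surname: str

    def full(self):
        return f"{self.given} {self.paternal_surname} {self.maternal_surname}"


def married_variant(bride, husband):
    """Apply the one convention illustrated in the text: keep the bride's
    given names and paternal surname, then append DE + husband's paternal
    surname in place of her maternal surname."""
    return f"{bride.given} {bride.paternal_surname} DE {husband.paternal_surname}"


bride = HispanicName("MARIA DEL CARMEN", "BUSTOS", "SAENZ")
husband = HispanicName("JUAN ANTONIO", "LOPEZ", "GARCIA")
print(bride.full())
print(married_variant(bride, husband))
```

A search system that knows such a rule can treat both printed forms as candidate representations of the same person rather than as unrelated names.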
In many name search applications, it is important to identify variant forms of a name that are considered legitimate and to link and preserve the variations; in others, it may be appropriate to establish one form of a name and to treat all other forms as errors. Even if the database is cleaned by linking variant forms and eliminating identifiable errors, users may search for names under yet more variations.
U.S. Pat. No. 5,040,218 to Vitale et al. discloses a voice synthesis system which attempts to identify the origin of a name to enhance pronunciation. The system first searches a dictionary for a name, and if the name is not found, uses grapheme and n-gram analysis to identify the name's likely origin. Similarly, U.S. Pat. No. 5,062,143 to Schmitt shows a system that identifies name origin using n-gram analysis.
U.S. Pat. No. 5,724,481 to Garberg et al. shows a method of matching proper names in a database using a phonemic representation.
U.S. Pat. No. 5,758,314 to McKenna shows an international database processing system. However, this system uses Soundex algorithms to process Unicode input for all cases, rather than providing a name searching system with culture-specific algorithms.
Design Pat. D359,480 shows an IPA-based computer keyboard, but does not disclose any use of IPA for identifying data records.
The article “Identifying Source Languages: the Case of Proper Names” by Valencia and Yvon (1997) discloses statistical models for name searching based on n-gram comparisons. The article also discloses determination of the source language and the use of different statistical models for comparisons, based on the source language.
John Hermansen, a named inventor, authored a doctoral dissertation, “Automatic Name Searching in Large Data Bases of International Names” (1985), which explores the concept of cultural differences in names. The document suggests searching using different culturally specific algorithms, but discloses only a simple n-gram based algorithm.
The assignee has developed a software program known as PC-NAS. An early version of this program was incorporated into a government computer system more than one year before the priority date of this application. This early version performed name searching using a combination of n-gram distribution and positional properties, and included a limited name regularization algorithm as part of an Arabic processing algorithm. Its architecture included sets of algorithms applicable to different cultures, but no automatic classification of the cultural origin of a name.
U.S. Pat. No. 5,485,373 to Davis et al. discloses a text searching system in which comparisons are performed on a Unicode representation, which is not a phonetic alphabet. The Davis system may vary algorithms based on the language being searched, but it has no name classifier and is not designed to search for proper names.
Other patents relating generally to computerized language analysis and processing include: U.S. Pat. No. 5,323,316 to Kadashevich et al.; U.S. Pat. No. 5,337,232 to Sakai et al.; U.S. Pat. No. 5,369,726 to Kroeker et al.; U.S. Pat. No. 5,369,727 to Nomura et al.; U.S. Pat. No. 5,371,676 to Heemels et al.; U.S. Pat. Nos. 5,375,176 and 5,425,110 to Spitz; U.S. Pat. No. 5,377,280 to Nakayama; U.S. Pat. No. 5,432,948 to Davis et al.; U.S. Pat. No. 5,434,777 to Luciw; U.S. Pat. No. 5,440,663 to Moese et al.; U.S. Pat. No. 5,457,770 to Miyazawa; U.S. Pat. No. 5,490,061 to Tolin et al.; U.S. Pat. No. 5,515,475 to Gupta et al.; U.S. Pat. No. 5,526,463 to Gillick et al.; and U.S. Pat. No. 5,548,507 to Martino et al.
None of these earlier systems provides a satisfactory system and method for multicultural name searching. Thus, the inventors believe there is a need for an improved system and method for searching name-based records and for determining the degree of similarity between two name representations.
A database is a collection of information organized in such a way that a computer program can quickly and easily select desired pieces of data. A database typically includes a number of records, and each record includes one or more fields. Each field typically stores a single piece of information.
In such databases, retrieval of records that are associated with a person typically involves use of a unique identifying value or “key”, such as an ID number. For certain retrieval tasks, a unique identifying value is not always available, and the person's name itself must be used as the identifying value or “key”.
However, personal names have several limitations inhibiting their effectiveness as identifying values for retrieval of information from a database. For example, personal names are not unique. Numerous individuals may possess names with some or even all elements in common with many other individuals. In extreme cases, the same name may be commonly used by thousands or even millions of different people. Conversely, people who are closely related sometimes exhibit significant differences in the way each spells a commonly held family name. Moreover, a specific person may be represented in many different records within a database, and that person's name may be rendered in slightly or greatly differing forms within those database records.
Additionally, names are not used consistently. Within U.S. society, as indeed in most societies around the world, individuals are permitted a certain degree of latitude in determining the form of the name they provide, orally or in writing, when providing information that is subsequently placed in a database.
Furthermore, names change over time. Names are social objects that are used to record various kinds of information, so they can be modified in various ways as time passes, in order to reflect changes in the social or personal status of the bearer. In many Western societies, for example, names may change over time in order to reflect changes in marital status, educational or professional achievements, or even gender affiliation.
Yet another drawback of using personal names as a database key is that names are not consistently captured. Because it is more difficult to validate the spelling of names than it is to validate the spelling of most other words in a particular language, name information in a database is correspondingly subject to a greater incidence of spelling and keying errors.
Amplifying the difficulties associated with using personal names as identifiers, naming conventions tend to vary across cultures. It may not be appropriate to assume that the typical American name structure of single given name (first name), single middle name or initial followed by a surname (last name) applies to a database that contains names from all over the world. For instance, names from other cultures may have compound surnames or may be composed of only one name.
Moreover, between languages/cultures and within a single language/culture, names may have different forms and variations. Several variations of the same name may refer to a single person or entity. For example, a name may be spelled differently based on the language in which it is written, with different spellings referring to a single person. In addition, a person's name and its prefixes/suffixes may change in patterned, predictable ways as the result of an event, such as marriage, widowhood, or graduation from professional school. Similarly, typing errors or other sources of noise may create a variation on a name that is intended to refer to the same person as the original name. Rather than treating each variation of a name as referring to a distinct person or entity, it may be advantageous to match variations of a name that may all refer to the same person.
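As one illustration of how a variation caused by a typing error might be quantified, the sketch below computes the Levenshtein edit distance, i.e., the minimum number of single-character insertions, deletions and substitutions needed to turn one string into another. This is a generic, well-known technique used here only as an example; it is not presented as the matching method of any embodiment, and the function name `levenshtein` is an assumption introduced for illustration.

```python
def levenshtein(a, b):
    """Minimum number of single-character edits turning string a into b."""
    prev = list(range(len(b) + 1))           # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                           # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,         # delete ca
                            curr[j - 1] + 1,     # insert cb
                            prev[j - 1] + cost)) # substitute or match
        prev = curr
    return prev[-1]


for a, b in [("GOMEZ", "BOMEZ"), ("SMITH", "SMYTH"), ("KIM", "GHIM")]:
    print(a, b, levenshtein(a, b))
```

A small distance suggests that two spellings may be keying variants of one name, although distance alone cannot distinguish a keying error from two genuinely different names that happen to be spelled similarly.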