Embodiments of the invention relate to processing names in general, including parsing personal names that are representative of multiple cultures.
Also, embodiments of the invention relate generally to automatic data processing systems that search and retrieve records from a database based on matching of personal names, and to improved systems and methods for intelligently processing name comparisons.
Information about individuals is often stored in a computer. Access to that information is most readily gained by using the name of the individual involved. The nature of names, however, including their behavior and permutations, poses significant challenges to information retrieval. Names vary during one's life (e.g., through marriage or professional preparation); they take on different forms, depending on the formality of the situation (WILLIAM CARVER/BILLY CARVER); they may be spelled differently if recorded by someone other than the individual (PRICE/PRIES). To amplify the difficulties even more, naming conventions vary across cultures. It may not be appropriate to assume that the typical American name structure of single given name (first name), single middle name or initial followed by a surname (last name) applies in a database that contains names from all over the world, a situation that is usual in today's world of global technology and communication. Names from other cultures may have compound surnames or may be composed of only one name. Names written in writing systems other than Roman may be transcribed in a variety of ways into the Roman alphabet because there is no single way to represent sounds that occur in another language but do not occur in English, causing significant differences in the spelling (KIM/GHIM).
Adequate information retrieval that is based on the name must anticipate the range and kinds of variation that can occur in names, both generally and in specific cultures. Other name search or information retrieval systems are generally unable to recognize or address the full range of variation in names. Some systems assume that names are static and search only for an exact match on the name. These systems cannot accommodate even the slightest spelling variations, initials or abbreviations (JOS. Z. BROWN/JOSEPH ZACHARY BROWNE). Other systems may use techniques or keys (such as Soundex or Soundex-like keys) that permit some minor spelling differences between names (DORSHER/DOERSHER) but these techniques generally fail to cope with significant variation (DOERSHER/DOESHER) or problems posed by names from non-Anglo cultures (ABDEL RAHMAN/ABDURRAMAN). If cultural differences are recognized, it is typically through use of equivalency lists or tables. Some of the more common variants can be accommodated in this way, but retrieval is then limited to those items on the list and cannot accommodate new representations or random variation or keying errors (GOMEZ/BOMEZ).
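The behavior of Soundex-like keys on the examples above can be illustrated with a minimal sketch of the commonly described American Soundex rules. The function name `soundex` and the table `CODES` are illustrative assumptions; the sketch is not taken from any of the systems discussed here.

```python
# Minimal sketch of American Soundex (illustrative; names are assumptions).
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(name):
    """Return a four-character Soundex key, or "" if name has no letters."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    key = [letters[0]]                 # the first letter is kept as-is
    prev = CODES.get(letters[0], "")   # its code still suppresses duplicates
    for ch in letters[1:]:
        if ch in "HW":
            continue                   # H and W do not separate equal codes
        code = CODES.get(ch, "")       # vowels and uncoded letters map to ""
        if code and code != prev:
            key.append(code)
        prev = code                    # a vowel resets prev, so a repeat counts
    return ("".join(key) + "000")[:4]  # pad or truncate to four characters


for a, b in (("DORSHER", "DOERSHER"), ("DOERSHER", "DOESHER"),
             ("GOMEZ", "BOMEZ")):
    print(a, soundex(a), b, soundex(b), soundex(a) == soundex(b))
```

Under these rules DORSHER and DOERSHER share a key, while the dropped R in DOESHER and the changed first letter in BOMEZ each produce a different key, so an exact comparison of keys misses both pairs.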
For a system to reach a level of adequacy for automatic name searching, it must therefore address a diverse set of issues related to name variation. Although spelling variations can often be addressed through character-matching techniques (e.g., SMITH/SMYTH), false-positive matches can result from traditional string or character comparisons when common morphological endings, such as OVICH, occur at the end of otherwise dissimilar names (e.g., ZELENOVICH/JOVANOVICH). Transcription from foreign writing systems to the Roman writing system poses additional spelling concerns. Different character sets, dialectal variations and sounds that are not represented in Roman alphabetic form at all contribute to the possibility of multiple, and often inconsistent, representations of the same name. A single Chinese character (ideogram) can be transcribed to produce numerous Roman forms that have little or no resemblance to one another due to dialectal variations. For example, CHANG, JANG and ZHANG are different Roman representations of the exact same Chinese name, as are the names WU, MHO and ENG. Similarly, a single Arabic name can result in transcriptions as diverse as KHADHAFI, CODOFFI and QATHAFI.
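The false-positive effect of a shared ending can be illustrated with a small sketch. It uses a Dice coefficient over character bigrams as a stand-in for "traditional string or character comparisons"; that choice, the function names, and the one-entry suffix list are assumptions made for illustration only.

```python
# Illustrative sketch: bigram Dice similarity and a hypothetical suffix list.
def bigrams(name):
    """Set of adjacent letter pairs in the upper-cased name."""
    s = "".join(c for c in name.upper() if c.isalpha())
    return {s[i:i + 2] for i in range(len(s) - 1)}


def dice(a, b):
    """Dice coefficient over bigram sets, from 0.0 to 1.0."""
    x, y = bigrams(a), bigrams(b)
    if not x and not y:
        return 0.0
    return 2 * len(x & y) / (len(x) + len(y))


def strip_suffix(name, suffixes=("OVICH",)):
    """Remove a known morphological ending (hypothetical list) if present."""
    upper = name.upper()
    for suffix in suffixes:
        if upper.endswith(suffix) and len(upper) > len(suffix):
            return upper[:-len(suffix)]
    return upper


print(dice("SMITH", "SMYTH"))
print(dice("ZELENOVICH", "JOVANOVICH"))
print(dice(strip_suffix("ZELENOVICH"), strip_suffix("JOVANOVICH")))
```

With this measure the dissimilar pair ZELENOVICH/JOVANOVICH scores higher than the true variant pair SMITH/SMYTH, because most of the shared bigrams come from the common ending. Once the ending is removed, the remaining stems share no bigrams at all.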
Character-based systems may also be confronted with significant retrieval problems caused by names with the same pronunciation but with divergent spellings. WOOSTER, WORCHESTER, and WUSTER may all share at least one identical pronunciation and yet show very different spellings. When name data are shared orally, the speaker's pronunciation, the listener's hearing (or mishearing) of the name and the speaker's expectations about the spelling of the name will impact the final written representation of a name. For example, a telephone reservationist may record a caller's name with a variety of phonetically correct spellings, which may not correspond to (and may therefore not be matched to) an existing database record for that caller.
Another common cause of name variation, which creates retrieval difficulty for name search systems, is the inclusion or exclusion of name data. Depending on the data source, names may be formal such as THOMAS EDWARD WINTHROP III, or informal such as TOM WINTHROP. An ideal name search system would be capable of correlating these two names, even though only a portion of the full name is available. To predict the relationship among variant formats of names, the system must also be able to recognize what rules govern which elements can be deleted or included or changed in different cultures. MARIA DEL CARMEN BUSTOS SAENZ will become MARIA DEL CARMEN BUSTOS DE LOPEZ, if she marries JUAN ANTONIO LOPEZ GARCIA. Predicting the relationship between these names is fundamental to retrieval success.
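The marriage example above can be expressed as a simple rule over pre-parsed name fields. The sketch below assumes that the given names and the paternal and maternal surnames have already been identified; the class `HispanicName` and the function `married_form` are hypothetical names used only for illustration, and the rule covers only the single convention shown in the example.

```python
# Illustrative sketch of one Hispanic married-name convention (assumed fields).
from dataclasses import dataclass
from typing import List


@dataclass
class HispanicName:
    given: List[str]   # e.g. ["MARIA", "DEL", "CARMEN"]
    paternal: str      # first surname, e.g. "BUSTOS"
    maternal: str      # second surname, e.g. "SAENZ"

    def full(self):
        return " ".join(self.given + [self.paternal, self.maternal])


def married_form(wife, husband):
    """Keep given names and paternal surname, drop the maternal surname,
    then append DE plus the husband's paternal surname."""
    return " ".join(wife.given + [wife.paternal, "DE", husband.paternal])


wife = HispanicName(["MARIA", "DEL", "CARMEN"], "BUSTOS", "SAENZ")
husband = HispanicName(["JUAN", "ANTONIO"], "LOPEZ", "GARCIA")
print(wife.full())
print(married_form(wife, husband))
```

A search system that knows this rule can treat the two printed forms as candidate variants of one person, even though their final tokens differ.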
In many name search applications, it is important to identify variant forms of a name that are considered legitimate and to link and preserve the variations; in others, it may be appropriate to establish one form of a name and to treat all other forms as errors. Even if the database is cleaned by linking variant forms and eliminating identifiable errors, users may search for names under yet more variations.
U.S. Pat. No. 5,040,218 to Vitale et al. discloses a voice synthesis system which attempts to identify the origin of a name to enhance pronunciation. The system first searches a dictionary for a name, and if the name is not found, uses grapheme and n-gram analysis to identify the name's likely origin. Similarly, U.S. Pat. No. 5,062,143 to Schmitt shows a system that identifies name origin using n-gram analysis.
U.S. Pat. No. 5,724,481 to Garberg et al. shows a method of matching proper names in a database using a phonemic representation.
U.S. Pat. No. 5,758,314 to McKenna shows an international database processing system. However, this system uses Soundex algorithms to process Unicode input for all cases, rather than providing a name searching system with culture-specific algorithms.
Design Pat. D359,480 shows a computer keyboard based on the International Phonetic Alphabet (IPA), but does not disclose any use of IPA for identifying data records.
The article “Identifying Source Languages: the Case of Proper Names” by Valencia and Yvon (1997) discloses statistical models for name searching based on n-gram comparisons. The article also discloses determination of the source language and the use of different statistical models for comparisons, based on the source language.
John Hermansen, a named inventor, authored a doctoral dissertation, “Automatic Name Searching in Large Data Bases of International Names” (1985) which explores the concept of cultural differences in names. The document suggests searching using different culturally specific algorithms, but discloses only a simple n-gram based algorithm.
The assignee has developed a software program known as PC-NAS. An early version of this program was incorporated into a government computer system more than one year before the priority date of this application. This early version performed name searching using a combination of n-gram distribution and positional properties, and included a limited name regularization algorithm as part of an Arabic processing algorithm. Its architecture included sets of algorithms applicable to different cultures, but no automatic classification of the cultural origin of a name.
U.S. Pat. No. 5,485,373 to Davis et al. discloses a text searching system in which comparisons are performed on a Unicode representation, which is not a phonetic alphabet. The Davis system may vary algorithms based on the language being searched, but it has no name classifier and is not designed to search for proper names.
Other patents relating generally to computerized language analysis and processing include: U.S. Pat. No. 5,323,316 to Kadashevich et al.; U.S. Pat. No. 5,337,232 to Sakai et al.; U.S. Pat. No. 5,369,726 to Kroeker et al.; U.S. Pat. No. 5,369,727 to Nomura et al.; U.S. Pat. No. 5,371,676 to Heemels et al.; U.S. Pat. Nos. 5,375,176 and 5,425,110 to Spitz; U.S. Pat. No. 5,377,280 to Nakayama; U.S. Pat. No. 5,432,948 to Davis et al.; U.S. Pat. No. 5,434,777 to Luciw; U.S. Pat. No. 5,440,663 to Moese et al.; U.S. Pat. No. 5,457,770 to Miyazawa; U.S. Pat. No. 5,490,061 to Tolin et al.; U.S. Pat. No. 5,515,475 to Gupta et al.; U.S. Pat. No. 5,526,463 to Gillick et al.; and U.S. Pat. No. 5,548,507 to Martino et al.
None of these earlier systems provide a satisfactory system and method for multicultural name searching. Thus, the inventors believe there is a need for an improved system and method for searching name-based records and for determining the degree of similarity between two name representations.
Culturally diverse names may be parsed differently, despite having similar syntactic characteristics. For example, in an English name that includes three tokens, the first two tokens typically represent given names, and the last token typically represents a surname. However, in names of other ethnicities, the middle token may represent a qualifier for the last token, so the first token may represent a given name, and the last two tokens may collectively represent a single surname. As another example, a given name typically precedes a surname in an English name, while a surname typically precedes a given name in an Asian name. For these and other reasons, parsing a group of names correctly and consistently can be difficult, particularly when names within the group represent multiple cultures.
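The three-token cases described above can be sketched as follows. The function `parse_three_tokens`, the `surname_first` flag, and the `QUALIFIERS` set are hypothetical; in particular, the sketch assumes the caller already knows whether the name follows a surname-first convention, which is itself part of the difficulty described.

```python
# Illustrative sketch: one syntactic shape, three possible parses.
QUALIFIERS = {"DE", "DEL", "VAN", "VON", "AL", "EL", "BIN"}  # hypothetical list


def parse_three_tokens(name, surname_first=False):
    """Split a three-token name into given-name and surname tokens."""
    tokens = name.upper().split()
    if len(tokens) != 3:
        raise ValueError("expected exactly three tokens")
    first, middle, last = tokens
    if surname_first:
        # Surname precedes the given name(s).
        return {"given": [middle, last], "surname": [first]}
    if middle in QUALIFIERS:
        # The middle token qualifies the last token; both form the surname.
        return {"given": [first], "surname": [middle, last]}
    # Default: two given names followed by a single surname.
    return {"given": [first, middle], "surname": [last]}


print(parse_three_tokens("JOHN PAUL SMITH"))
print(parse_three_tokens("CARLOS DE LEON"))
print(parse_three_tokens("KIM YOUNG SAM", surname_first=True))
```

All three inputs have the same surface form, yet each yields a different division between given name and surname, which is why a single fixed parsing rule is unlikely to serve a multicultural name set.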