1. Technical Field
The present disclosure generally relates to record linkage techniques for identifying the same entities in multiple information sources. More particularly, and without limitation, the present disclosure relates to methods and systems for normalizing names and matching records.
2. Background Information
Vast amounts of information are stored in heterogeneous distributed sources. For example, LexisNexis stores large and diverse content including non-structured data such as text-based news articles and legal cases, and structured data such as public records and attorney and judge directories. As a result, records (e.g., of organizations, persons, and addresses) that pertain to the same entity may be stored in various forms. These variations pose a problem when comparing names or linking records, because an entity whose name is stored in two different forms may be determined to be two different entities when its varying names are compared or when its varying records are linked.
For example, law firm names vary greatly in format. Some law firm names include last names of partner attorneys, such as “Finnegan, Henderson, Farabow, Garrett & Dunner, L.L.P.” Other law firm names do not include any last names but may include an area of specialty, such as “The Injury Lawyers, P.C.” In addition, different forms of the same law firm name may be used in different contexts. Often, long law firm names are shortened for convenience. For example, “Law Office of Smith, Johnson & Miller” may be shortened to “Law Office of Smith et al.”, and “Finnegan, Henderson, Farabow, Garrett & Dunner, L.L.P.” may be referred to as just “Finnegan.” Also, a law firm name that includes a middle initial of an attorney, such as “John D. Smith, Attorney at Law,” may be referenced without the middle initial “D.”
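By way of illustration only, the following minimal Python sketch shows how a direct string comparison may treat the name variants described above as different entities. The helper name exact_match and the variable names are hypothetical and are not part of the disclosure; the name strings are taken or derived from the examples above.

```python
# Illustrative sketch only: exact_match is a hypothetical helper, not the
# disclosed method. It shows why a direct comparison of name variants fails.
def exact_match(name_a: str, name_b: str) -> bool:
    """Return True only if the names are identical after trimming and case folding."""
    return name_a.strip().casefold() == name_b.strip().casefold()


# Variants of the same entities, drawn from the examples above.
full_name = "Finnegan, Henderson, Farabow, Garrett & Dunner, L.L.P."
short_name = "Finnegan"
with_initial = "John D. Smith, Attorney at Law"
without_initial = "John Smith, Attorney at Law"

# Both comparisons report "different", although each pair names one entity.
same_firm_exact = exact_match(full_name, short_name)
same_attorney_exact = exact_match(with_initial, without_initial)

# A difference in letter case alone is tolerated by this comparison.
same_case_variant = exact_match(full_name, full_name.upper())
```

In this sketch, only the case variant compares as equal; the shortened firm name and the name without the middle initial do not, which is the kind of mismatch described above.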
Due to the vast amounts of information distributed across multiple sources, there is a need and desire to resolve entity relationships and integrate distributed information together so that related information can be easily packaged and presented together. For example, in the context of legal information, an attorney's professional identification presented in a case law document may be linked with the attorney's entry in a structured authority directory that includes the attorney's biographical and employer information.
To resolve entity relationships and integrate distributed information, there is a need to develop a record matching method. Using a probabilistic classifier, such as a naive Bayes classifier, to match records may result in undesirable levels of precision error and recall error. Accordingly, there is a need for methods and systems that match records with high precision and recall.
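By way of illustration only, precision and recall for a record matching method may be computed as in the following minimal Python sketch. It uses the standard definitions (precision is the fraction of predicted matches that are correct, and recall is the fraction of true matches that are found). The function name precision_recall and the record identifiers are hypothetical and do not come from the disclosure.

```python
from typing import Set, Tuple

# A record pair is represented here as (id_in_source_a, id_in_source_b).
Pair = Tuple[str, str]


def precision_recall(predicted: Set[Pair], actual: Set[Pair]) -> Tuple[float, float]:
    """Compute precision and recall of predicted matches against true matches."""
    true_positives = len(predicted & actual)
    # Guard against division by zero when either set is empty.
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall


# Hypothetical data: pairs a matcher linked, and pairs that truly match.
predicted_matches = {("a1", "b1"), ("a2", "b2"), ("a3", "b9")}
true_matches = {("a1", "b1"), ("a2", "b2"), ("a4", "b4"), ("a5", "b5")}

precision, recall = precision_recall(predicted_matches, true_matches)
```

With this hypothetical data, two of the three predicted pairs are correct and two of the four true pairs are found, so a false link lowers precision and a missed link lowers recall.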
Furthermore, in support of record matching, names may need to be normalized before the matching process begins in order to improve comparison results. Past normalization techniques have included pure rule-based approaches and pure probability-based approaches, which may be less effective than combined approaches.
In view of the foregoing, there is a need for improved methods and systems for matching records, and methods and systems for normalizing names.