1. The Field of the Invention
The present invention relates to database entry and more particularly to an improved method of bibliographic field normalization of database entries.
2. The Relevant Technology
Many database systems contain thousands or even millions of records. Typically, one or more fields of such records are predominantly used for cataloguing or searching the records. These fields are known as bibliographic fields.
Not infrequently, a plurality of database records will have a common value for such bibliographic fields. For example, in a database of patent records, the name of an individual or a corporation, who may be an inventor and/or assignee of a patent, may be used for accessing the patent database. That same name, however, may appear on several patents having the same inventor and/or assignee.
Where, as is often the case, record entries are manually entered, it is not unusual to encounter incorrect entries. This is so even with the establishment of standard naming conventions, such as for individuals' names (for example, that the last name be followed by the given name, separated by a comma, or that the name be preceded by one of a subset of salutations, e.g., “Mr.”, “Ms.”).
Moreover, the record data may be correctly entered, but the information on the record itself may represent a latent entry error at an earlier stage, for example, a typographical error in the name of the inventor on the cover page of a granted patent.
Most database entry systems implement a human verification step whereby the verifier manually checks the records entered, or checks for a match between the record fields being entered and corresponding entries already entered in the database. This step is intended to ensure that the database is maintained in a correct form throughout and thus is suitable for searching.
However implemented, where a record contains even a small number of bibliographic fields, such a human verification process is costly and does not guarantee universal compliance with any naming conventions or 100% accuracy of data entries. Indeed, if the error is latent, that is, incorrectly entered on the document or record now being entered into the database, the verification process will have no impact.
Furthermore, the cost of such a process mandates that such verification typically is only implemented for a small subset of identified key bibliographic fields, for example, in a patent database, the name of the primary inventor and/or the assignee. Other bibliographic fields, such as co-inventor names, agents, or other parties, typically remain unverified and presumably fraught with database entry errors. Thus, to the extent that a search is conducted using such secondary bibliographic fields, the human verification task will not provide any assurances that the correct or desired records will be uncovered by the search.
As a result of the foregoing, there has been interest in developing normalization processes, which, rather than forcing the correctness of database entries, work with potentially incorrect entries and generate metrics for identifying which non-identical bibliographic fields refer to the same entity for purposes of searching the database.
Many of these processes make use of edit distance algorithms, including but not limited to the Levenshtein, Hamming and Damerau-Levenshtein algorithms, for quantifying the similarity between two words. Commonly used in fuzzy searching, such algorithms typically measure the correlation between two text strings by weighting the difference between them, with a zero weight corresponding to identical strings, a weight of one corresponding to strings that differ by a single substitution (the replacement of a single letter in a word) and so on.
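By way of illustration only, two of the metrics named above may be sketched as follows. The function names levenshtein and hamming and the sample strings are assumptions introduced for this example and are not taken from any of the systems discussed herein. The Levenshtein distance counts single-character insertions, deletions and substitutions, whereas the Hamming distance counts only position-wise substitutions and is defined only for strings of equal length.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions or
    substitutions needed to turn string a into string b."""
    # previous[j] is the distance between the prefix of a processed so far
    # (initially empty) and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from the first i characters of a to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute ca with cb, or match
            ))
        previous = current
    return previous[-1]


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires equal-length strings")
    return sum(ca != cb for ca, cb in zip(a, b))


if __name__ == "__main__":
    print(levenshtein("Smith", "Smith"))  # identical strings
    print(levenshtein("Smith", "Smyth"))  # one substitution
    print(levenshtein("Smith", "Smiht"))  # adjacent transposition
    print(hamming("Smith", "Smyth"))
```

Under the plain Levenshtein metric an adjacent transposition is weighted as two edits; the Damerau-Levenshtein variant, not shown here, treats such a transposition as a single edit.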
Using such a metric, the lower the weighting, the more likely that the strings under consideration constitute a match, that is, refer to the same bibliographic entity, which may be identified using a look-up table or dictionary.
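A minimal sketch of such a look-up is given below, assuming a small in-memory dictionary of normalized names and a fixed cut-off weight. The names edit_distance and best_match, the max_distance parameter, the case-insensitive comparison and the sample entries are hypothetical choices made for this example only.

```python
from typing import Optional


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution or match
            ))
        previous = current
    return previous[-1]


def best_match(candidate: str, dictionary: list[str],
               max_distance: int = 2) -> Optional[str]:
    """Return the dictionary entry with the lowest weighting relative to
    candidate, or None if no entry is within max_distance (assumed cut-off)."""
    # Compare case-insensitively; ties are broken alphabetically by entry.
    scored = [(edit_distance(candidate.lower(), entry.lower()), entry)
              for entry in dictionary]
    if not scored:
        return None
    distance, entry = min(scored)
    return entry if distance <= max_distance else None


if __name__ == "__main__":
    names = ["International Business Machines", "Intel Corporation",
             "Interdigital"]
    print(best_match("Intel Corporaton", names))  # misspelled candidate
    print(best_match("Acme Widgets", names))      # no sufficiently close entry
```

The cut-off reflects a trade-off: a higher value admits more misspellings as matches but also raises the risk of conflating distinct entities whose names happen to be similar.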
There are a number of prior art systems directed to methods to automatically correct textual errors in a query.
For example, U.S. Pat. No. 7,076,732, issued Jul. 11, 2006, to Nagao, and entitled “Document Processing Apparatus Having an Authoring Capability for Describing a Document Structure,” describes the use of dictionary looping to correct errors in phrasal strings. A phrasal string is a string of words that does not form a complete sentence, such as the key words entered into a search engine. The method taught by Nagao segments the entire phrasal string into substrings, rather than space-delineated words, and compares these substrings against entries in a phrasal dictionary to obtain a best match. Nagao is primarily geared to spelling correction within a search engine and is of limited applicability in normalizing bibliographic fields within a large database.
U.S. Pat. No. 6,556,991, issued Apr. 29, 2003, to Borkovsky and entitled “Item Name Normalization” groups similarly spelled candidate bibliographic fields together to form clusters in a dictionary relating to a selected normalized bibliographic field. A candidate field entered into the database is mapped to the corresponding normalized field for such cluster. Borkovsky limits the matching capabilities to consideration of a dictionary listing only. Thus, weighting of candidate records is based only on the value of the bibliographic field in question.
Trajtenberg et al., in a presentation entitled “The Names Game: Using Inventors Patent Data in Economic Research” at the NBER and CEPR Conference at Tel Aviv University in 2004, online: <www-siepr.stanford.edu/programs/SST_Seminars/Seminar_Stanford—1.ppt> (“Trajtenberg No. 1”), and in a paper entitled “The ‘Names Game’: Harnessing Inventors' Patent Data for Economic Research,” National Bureau of Economic Research, Working Paper 12479 (August 2006), online: National Bureau of Economic Research <www.nber.org/papers/w12479> (“Trajtenberg No. 2”), describe a method to obtain data useful in economic research from patent information and, more specifically, from inventor information. Record fields corresponding to the inventor are normalized during searches by matching a candidate to the query bibliographic field by using a related field, for example, matching patent number and inventor name field pairs. Trajtenberg Nos. 1 and 2, however, use pair-wise matching techniques to match pairs of these related fields, and do not consider more than one related field or any potential related records in the database related to the bibliographic field in question.
It would therefore be advantageous to devise an improved automated bibliographic field normalization approach that minimizes the use of humans to verify the accuracy of the data input of records into the database.