1. Field
The present disclosure relates to searching and matching data, and more particularly, to searching and matching data containing non-phonetic, semantic, or ideogrammatic content.
2. Description of the Related Art
Efficient database access and searching capabilities are important for effective utilization of data in databases. Crucial to this objective is the ability to retrieve the correct data efficiently, finding a match without having to search through every data element stored in the reference universe.
Searching and matching systems are known, and provide useful ways to retrieve relevant information from a database for a variety of uses. For example, in the credit industry, credit history information on a given business entity being considered for credit is typically processed through a commercially available database. A user may input the name of a business entity into a processor connected to the database, which then locates that given entity in the database and retrieves its credit history information. Other examples include applications where a user may wish to integrate information from among disparate sources to get a common view of a customer or supplier.
An exemplary method and system for searching and matching input data with stored data is disclosed in U.S. patent application Ser. No. 10/702,114, published as U.S. Patent Application Publication No. 2004/0220918 A1, which is incorporated herein in its entirety by reference. The basic approach includes three sequentially performed processes, which are shown in FIG. 1:
1. Cleansing, Parsing and Standardization. This process includes a) identification of key components of inquiry data; b) normalization of name, address and city data; and c) standardization of address data.
2. Candidate retrieval. This includes a) selecting keys based on data provided in inquiry, b) optimizing keys to improve retrieval quality and speed, and c) gathering best possible match candidates from a reference database.
3. Evaluation and Decisioning. This step involves evaluating matches according to consistent standards, using consistent, reproducible match quality feedback to translate otherwise subjective decisions into objective criteria. Such criteria include matchgrade patterns, which reflect individual attribute decisioning, and a confidence code, which stratifies the overall results into groupings of similar quality, among other benefits. These treatments enable autodecisioning.
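The three sequential processes above can be illustrated with a minimal sketch. It is not the implementation of the incorporated application: the `Record` type, the whitespace and case normalization, the three-character retrieval key, and the equality-count score are all hypothetical simplifications chosen only to show how the stages hand data to one another.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    name: str
    city: str


def cleanse(raw: dict) -> Record:
    """Stage 1: identify key components and normalize them (toy normalization)."""
    return Record(
        name=" ".join(raw.get("name", "").split()).upper(),
        city=" ".join(raw.get("city", "").split()).upper(),
    )


def retrieval_key(record: Record) -> str:
    """Hypothetical retrieval key: the first three characters of the name."""
    return record.name[:3]


def build_index(records: list[Record]) -> dict[str, list[Record]]:
    """Group reference records by key so retrieval avoids a full scan."""
    index: dict[str, list[Record]] = {}
    for record in records:
        index.setdefault(retrieval_key(record), []).append(record)
    return index


def retrieve_candidates(inquiry: Record, index: dict[str, list[Record]]) -> list[Record]:
    """Stage 2: gather candidates that share the inquiry's key."""
    return index.get(retrieval_key(inquiry), [])


def evaluate(inquiry: Record, candidates: list[Record]) -> list[tuple[int, Record]]:
    """Stage 3: score candidates (here, the number of exactly equal attributes)."""
    scored = [
        (int(c.name == inquiry.name) + int(c.city == inquiry.city), c)
        for c in candidates
    ]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


reference = [
    cleanse({"name": "Acme Trading", "city": "Osaka"}),
    cleanse({"name": "Acme Tools", "city": "Tokyo"}),
    cleanse({"name": "Zenith", "city": "Osaka"}),
]
index = build_index(reference)
inquiry = cleanse({"name": "  acme   trading ", "city": "osaka"})
results = evaluate(inquiry, retrieve_candidates(inquiry, index))
```

In this toy run only the two records sharing the key "ACM" are evaluated, and the normalized "ACME TRADING" record in Osaka ranks first.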
Prior Asian match feedback was limited to strata in which match inquiry results were categorized as A, B or C. This level of feedback is less than ideal because it cannot differentiate among individual results within the A or B level. The number of matches within each category, particularly the B category, can be significant, and there is no way to differentiate among them without manual intervention by a native language speaker.
Resolving A, B and C matches is possible but costly, because it is a labor-intensive process that requires human interaction to verify matches.
The match categories were described as follows. An “A” match indicates a high likelihood of a match, but could contain matches to duplicates or false matches. A “B” match indicates a possible match, but one that would require manual study to resolve. A “C” match indicates a probable mismatch, which may also be due to deficiencies in inquiry data.
The issue with autodecisioning in the above-mentioned environment is a lack of granularity. Absent further feedback on the quality of the matches, a user has no way to choose the best matches from among the many “B” matches. Even among the “A” matches, confidence cannot be improved short of manually reviewing each match.
A diagram of the prior art matching system is shown in FIG. 2.
In the present system, to further differentiate among inquiry results having different levels of matching, the high level match feedback is made more granular and mapped to a corresponding confidence code. Target confidence codes (“CC”) are preferably chosen at the conservative end of the range. Subsequent tuning enhances the distribution of this mapping. An example of this mapping is shown in FIG. 3.
At a confidence code of 7 or above, many marketing customers set their systems to autodecision, accepting these matches without human intervention. Not all confidence code 7 matches will be perfect matches, so it is preferable to consider the autodecisioning threshold carefully. Conversely, many good matches with confidence codes lower than 7 would be ignored at this threshold. A confidence code of 7 is therefore the conservative end of the quality threshold, particularly for matches in complex languages such as Japanese.
A confidence code of 5 or 6 indicates that there are still available “good” matches, especially where input data is sparse. Results in this confidence code range often require careful inspection to confirm. In the example of Japanese characters, this is due to the inherent complexity of the native language and the multiple writing systems used. Some false matches may also exist due to duplication.
A confidence code of 4 is usually the lowest confidence code that many processes will even consider displaying. These matches are “unlikely” to be correct matches, and generally shouldn't be used unless the inquiry data is very sparse or other mitigating circumstances can be cited.
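The confidence code bands described above can be summarized in a short sketch. The thresholds (7 and above, 5 to 6, and 4) come from the description; the function name `triage`, the action labels, and the decision to suppress codes below 4 are assumptions made for illustration, and a real deployment would tune them to its own business rules.

```python
def triage(confidence_code: int) -> str:
    """Map a confidence code to a handling action (labels are hypothetical)."""
    if confidence_code >= 7:
        # Conservative autodecisioning threshold: accept without human review.
        return "auto-accept"
    if confidence_code >= 5:
        # Good matches may remain, but they often need careful inspection.
        return "review"
    if confidence_code == 4:
        # Usually the lowest code worth displaying; unlikely to be correct.
        return "display-only"
    # Assumption: anything below 4 is not shown at all.
    return "suppress"


actions = {code: triage(code) for code in (8, 7, 6, 5, 4, 3)}
```

Keeping the thresholds in one function makes it easy to see the effect of moving the autodecisioning cut-off, which the description cautions should be chosen carefully.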
However, as the above examples show, although a set of data such as the identification and contact information of a business matches closely enough to be considered a “7 or above” confidence code match, that does not mean that the matched data is completely accurate. Likewise, “5 to 6” confidence code matches do not all have the same level of matching. Accuracy is best defined in the context of a particular business application.
Matchgrade patterns demonstrate different levels of individual attribute matching. An “A” symbol in the matchgrade results indicates a high confidence match in that data attribute between the customer information and the matched record. A “B” indicates similarity, but not to the level of similarity indicated by “A.” An “F” symbol indicates that the customer data and the matched record contain different data for a given attribute. A “Z” indicates that the customer information, the database record, or both do not include any information for a given field. Evaluations are based not only on a character-by-character comparison, but also on semantic meaning, tone, lexemic variation, and other factors. Furthermore, these assignments are made not at the inquiry level overall, but on an individual attribute level to increase granularity and enable autodecisioning.
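A per-attribute matchgrade string of the kind described above might be assembled as in the following sketch. The symbols A, B, F and Z follow the description; everything else is an assumption. In particular, `SequenceMatcher` from Python's standard `difflib` module is only a character-level stand-in for the evaluation, which according to the description also weighs semantic meaning, tone, and lexemic variation, and the 0.95 and 0.6 cut-offs and the attribute names are hypothetical.

```python
from difflib import SequenceMatcher


def grade_attribute(inquiry_value, reference_value, same=0.95, similar=0.6):
    """Grade one attribute as A, B, F or Z (thresholds are hypothetical)."""
    if not inquiry_value or not reference_value:
        return "Z"  # one or both sides have no data for this attribute
    # Stand-in similarity: character-level only, unlike the described evaluation.
    ratio = SequenceMatcher(None, inquiry_value, reference_value).ratio()
    if ratio >= same:
        return "A"  # high confidence match
    if ratio >= similar:
        return "B"  # similar, but not to the level of "A"
    return "F"  # the two sides hold different data


def matchgrade(inquiry, reference, attributes=("name", "city", "prefecture")):
    """Concatenate per-attribute grades into a matchgrade string."""
    return "".join(
        grade_attribute(inquiry.get(a), reference.get(a)) for a in attributes
    )


pattern = matchgrade(
    {"name": "東京商事", "city": "Osaka", "prefecture": ""},
    {"name": "東京商会", "city": "Osaka", "prefecture": "Osaka"},
)
```

Here the names share three of four characters and grade as similar, the cities are identical, and the inquiry has no prefecture, so `pattern` is "BAZ". A character-level score cannot tell whether the differing ideogram changes the meaning, which is the gap the attribute-level semantic evaluation is meant to close.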
A confidence code may then be assigned to each different matchgrade string to allow stratification of results. Each of the component processes described above is further broken down into functional areas as shown in FIG. 4.
Using enhanced feedback, a user may enable business rules. One such rule, for example, subdivides “5-6” confidence code matches by accepting those with a perfect name and city, ordering a lookup on those with the correct prefecture (municipality or province) but a missing city, and disregarding those with a low quality match on the name. As a result, the feedback enables automated decisioning.
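The example business rule above can be written out as a small decision function. This is a sketch under stated assumptions: the grades are passed as a dictionary keyed by the hypothetical attribute names `name`, `city` and `prefecture`, a "perfect" match is read as grade A, a "low quality" name match is read as grade F, the name check is applied first, and any "5-6" match not covered by the rule falls back to manual review.

```python
def decide(confidence_code: int, grades: dict) -> str:
    """Apply the example rule to one match (labels and ordering are assumptions)."""
    if confidence_code >= 7:
        return "accept"
    if confidence_code < 5:
        return "reject"

    # Confidence code 5 or 6: subdivide using the per-attribute grades.
    name = grades.get("name", "Z")
    city = grades.get("city", "Z")
    prefecture = grades.get("prefecture", "Z")

    if name == "F":
        return "disregard"  # low quality match on the name
    if name == "A" and city == "A":
        return "accept"  # perfect name and city
    if prefecture == "A" and city == "Z":
        return "lookup"  # correct prefecture but missing city
    return "manual-review"  # assumption: everything else goes to a person


outcomes = [
    decide(6, {"name": "A", "city": "A", "prefecture": "A"}),
    decide(5, {"name": "B", "city": "Z", "prefecture": "A"}),
    decide(5, {"name": "F", "city": "A", "prefecture": "A"}),
]
```

With these inputs `outcomes` is `["accept", "lookup", "disregard"]`, so three matches that share the same confidence band receive three different automated treatments.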
Additional challenges are posed to matching in databases where the process cannot rely upon distinctions provided by writing systems that use phonetic alphabets, such as those of English, French, and Greek. In languages such as Chinese and Japanese, writing systems embody semantic meaning and are constructed from ideograms, which present a unique challenge to searching and matching. Additionally, countries using these writing systems often freely integrate other writing systems that are phonetic to allow for the presentation of foreign words or new words. The challenge for evaluation in ideogrammatic writing systems is the semantic nature of the writing. Traditional methods for scoring based solely on orthography would be sorely inadequate to discern meaning at a level sufficient to differentiate “similar” from “same”, which is at the heart of the inventive matchgrade processes.
Thus, there is a need to improve on existing search and match systems and methods, particularly by providing additional criteria for evaluating the quality of a match result in non-phonetic writing systems. There is also a need for a system and method for differentiating among machine matches, without costly human intervention, in data that is presented wholly or partially in an ideogrammatic context, thereby allowing for consistency and scalability. There is also a need for a system and method for fully-automated searching and matching that deals with the challenges of non-phonetic, ideogrammatic writing systems.