For records that each consist of a set of values, a name identification function matches records to judge whether they are identical, similar, or relevant to one another. In the name identification function, for example, a set of records to be identified by name is referred to as a name identification source, and the set of records against which they are matched is referred to as a name identification target. FIG. 24 is a diagram illustrating the name identification function. As illustrated in FIG. 24, in a name identification process that realizes the name identification function, a record that is identical, similar, or relevant to a record of the name identification source is detected from the name identification target, and the detection result is output as a result of the name identification process.
Relating to a name identification function for customer information, a technique is disclosed which refines matching data by searching for customer information stored in a name identification database (DB) based on customer data acquired by arranging address information and name information in order and compares the matching data with the customer data. According to such a technique, the degree of matching is judged based on a function for comparing the refined matching data and the customer data serving as a name identification source, and, in a case where the customer data that is compared is judged as customer data of a new customer in accordance with the degree of matching, the customer data is newly registered in the name identification DB serving as a name identification target.
First, a conventional name identification function will be described with reference to FIGS. 25 to 29. FIG. 25 is a diagram illustrating the operation of the name identification function. As illustrated in FIG. 25, in a name identification process that realizes the name identification function, each record J1 of the name identification source is matched with records M (M1 to Mn) of the name identification target so that name identification is performed.
In the name identification process, the values of each item to be identified (referred to as a “name identification item”) in the record J1 of the name identification source and in a record M1 of the name identification target are matched by applying the evaluation function that is defined for that name identification item. Here, it is assumed that the name identification items include a name, an address, and a date of birth, and, in the name identification process, matching is performed by applying the evaluation function fa( ) to the name, fb( ) to the address, and fc( ) to the date of birth. Then, the evaluation value of each name identification item that is derived as a result of the matching is weighted in accordance with the name identification item, and the weighted values are added together, whereby a total evaluation value is derived. In addition, in the name identification process, total evaluation values are likewise derived for all the remaining records M2 to Mn of the name identification target with respect to the record J1 of the name identification source. In the name identification process, a name identification candidate set that includes the total evaluation values for the sets of the record J1 of the name identification source and the records M1 to Mn of the name identification target is thus generated.
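The weighted sum described above can be written compactly as follows. The symbols E (total evaluation value) and w_a, w_b, w_c (weighting factors for the name, the address, and the date of birth) are notation assumed here for illustration; only fa( ), fb( ), and fc( ) are named in the description.

```latex
E(J_1, M_k) =
    w_a \, f_a\bigl(J_1.\mathrm{name},\, M_k.\mathrm{name}\bigr)
  + w_b \, f_b\bigl(J_1.\mathrm{address},\, M_k.\mathrm{address}\bigr)
  + w_c \, f_c\bigl(J_1.\mathrm{birth},\, M_k.\mathrm{birth}\bigr),
\qquad k = 1, \dots, n
```

The name identification candidate set for the record J1 then corresponds to the n values E(J1, M1) to E(J1, Mn).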
Then, in the name identification process, name identification is performed for the sets of records that belong to the name identification candidate set based on thresholds defined in advance. For example, in the name identification process, a set of records that are judged to be completely the same as each other is automatically judged as “White”, a set of records that are judged to be different is automatically judged as “Black”, and the results are output as identification results. In addition, in the name identification process, a set of records that is difficult to judge automatically is judged as “Gray” and is output to a candidate list. Then, a staff member makes the judgment on each set output to the candidate list. In addition, the name identification definitions that need to be set by a staff member include a selection of name identification items, a selection of evaluation functions, and settings of weighting factors and thresholds.
Next, the sequence of the name identification process will be described with reference to FIGS. 28 and 29. FIG. 28 is a flowchart illustrating the name identification function, and FIG. 29 is a flowchart illustrating the sequence of a matching process.
First, in the name identification process, the operating environment is set by reading a name identification definition in Step S100, and records of the name identification source (hereinafter, referred to as “name identification source records”), which are the records to be identified, are sequentially chosen from the name identification source in Step S101. Then, in the name identification process, records of the name identification target (hereinafter, referred to as “name identification target records”), which are the matching counterparts, are sequentially chosen from the name identification target for each name identification source record in Step S102. Here, when the name identification source record is changed to another, the process returns to the start point of the name identification target, and the name identification target records are again chosen sequentially.
Next, in the name identification process, a matching process of the name identification source record and the name identification target record is performed in Step S103. Then, in the name identification process, a matching result is stored in the name identification candidate set in Step S104. In addition, the matching result includes a total evaluation value.
Subsequently, in the name identification process, it is judged whether or not there is a remaining name identification target record in the name identification target in Step S105. In a case where it is judged that there is a name identification target record remaining (Yes in Step S105), in the name identification process, the process is returned to Step S102 so as to extract the remaining name identification target records.
On the other hand, in a case where it is judged that there is no remaining name identification target record (No in Step S105), in the name identification process, a judgment is made for each total evaluation value stored in the name identification candidate set by using thresholds, and the judgment results are output in Step S106. For example, in the name identification process, in a case where the total evaluation value is equal to or larger than an upper-position threshold, it is judged that the matched set of the name identification source record and the name identification target record is a set of records that are the same as each other, and “White” is judged for this set. In addition, in the name identification process, in a case where the total evaluation value is smaller than the upper-position threshold and equal to or larger than a lower-position threshold, it is judged that the matched set of the name identification source record and the name identification target record is difficult to judge automatically, and “Gray” is judged for this set. On the other hand, in the name identification process, in a case where the total evaluation value is smaller than the lower-position threshold, it is judged that the matched set of the name identification source record and the name identification target record is a set of records that are different from each other, and “Black” is judged for this set. Then, in the name identification process, the judgment results other than “Black” are output as results. Since a set of records that is judged as “Black” can be inferred from the judgment results to be any set that is judged as neither “White” nor “Gray”, the judgment result of “Black” does not need to be output. In addition, there is a case where the output is divided into “White” and “Gray”, and the “Gray” output is referred to as a “candidate list”, which means judgment candidates that need to be judged by a staff member.
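The threshold judgment of Step S106 can be sketched as follows. This is a minimal illustration, not the implementation of any particular system; the function names `judge` and `report` and the tuple layout of the candidate set are assumptions introduced here.

```python
from typing import Iterable, List, Tuple


def judge(total: float, upper: float, lower: float) -> str:
    """Classify one total evaluation value against the two thresholds."""
    if total >= upper:
        return "White"   # judged to be the same record
    if total >= lower:
        return "Gray"    # difficult to judge automatically
    return "Black"       # judged to be different records


def report(candidates: Iterable[Tuple[str, str, float]],
           upper: float, lower: float) -> List[Tuple[str, str, str]]:
    """Return judgment results, omitting "Black" as described in the text.

    Each candidate is assumed to be (source_id, target_id, total_value).
    """
    results = []
    for source_id, target_id, total in candidates:
        label = judge(total, upper, lower)
        if label != "Black":
            results.append((source_id, target_id, label))
    return results
```

A value exactly equal to a threshold falls into the higher class, matching the "equal to or larger than" wording of the judgment conditions.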
In the description and diagrams below, the upper-position threshold is abbreviated as an “upper threshold”, and the lower-position threshold is abbreviated as a “lower threshold”.
Next, in the name identification process, it is judged whether or not there is a remaining name identification source record in the name identification source in Step S107. In a case where it is judged that there is a remaining name identification source record in the name identification source (Yes in Step S107), the name identification process proceeds to Step S101 so as to extract the remaining name identification source records one by one. On the other hand, in a case where it is judged that there is no remaining name identification source record in the name identification source (No in Step S107), the name identification process ends.
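The round-robin control flow of Steps S101 to S107 can be sketched as two nested loops. The function name `identify`, the representation of a record as a dictionary with an `"id"` key, and the `match` and `judge` callables passed in as parameters are all assumptions made for this illustration.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Record = Dict[str, object]


def identify(source: Sequence[Record],
             target: Sequence[Record],
             match: Callable[[Record, Record], float],
             judge: Callable[[float], str]) -> List[Tuple[object, object, str]]:
    """Match every source record against every target record."""
    results = []
    for src in source:                          # S101: next source record
        candidate_set = []
        # S102: iterating a sequence restarts from the first target record
        # each time the source record changes.
        for tgt in target:
            total = match(src, tgt)             # S103: matching process
            candidate_set.append((src, tgt, total))  # S104: store result
        # S105: the inner loop ends when no target record remains.
        for s, t, total in candidate_set:       # S106: threshold judgment
            label = judge(total)
            if label != "Black":                # "Black" is not output
                results.append((s["id"], t["id"], label))
    return results                              # S107: no source record remains
```

The number of calls to `match` is the product of the two set sizes, which is the cost that the later part of this section identifies as the problem for large record sets.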
Next, the sequence of the matching process of Step S103 illustrated in FIG. 28 will be described with reference to FIG. 29. FIG. 29 is a flowchart illustrating the sequence of the matching process. The matching process is a process of performing matching so as to derive a total evaluation value for each set of the name identification source record and the name identification target record.
First, in the name identification process, the name identification items defined in the name identification definition are sequentially selected in Step S110. Here, it is assumed that each name identification item is a pair of items to be compared, consisting of an item of the name identification source and an item of the name identification target, and is defined in the name identification definition in advance. Next, in the name identification process, the values corresponding to the selected name identification item are respectively extracted from the name identification source record and the name identification target record in Step S111, and an evaluation value is calculated by applying an evaluation function to the two extracted values in Step S112. In addition, the evaluation function is a function that is defined in advance for the name identification item and is assumed to be defined in the name identification definition.
Subsequently, in the name identification process, it is judged whether or not there is a remaining name identification item in Step S113. In a case where it is judged that there is a remaining name identification item (Yes in Step S113), the name identification process proceeds to Step S110 so as to apply an evaluation function to the remaining name identification item.
On the other hand, in a case where it is judged that there is no remaining name identification item (No in Step S113), in the name identification process, the evaluation value of each name identification item is weighted for each name identification item, and the evaluation values resulting from the weighting are added up in Step S114. Then, in the name identification process, the value of the result of the addition is outputted as a total evaluation value for the target set of records in Step S115, and the matching process for one set ends.
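The matching process of Steps S110 to S115 can be sketched as follows. The function name `match_records` and the tuple layout of an item specification are assumptions made for this illustration; the evaluation function is passed in as a parameter rather than fixed.

```python
from typing import Callable, Dict, List, Tuple

# (source item name, target item name, evaluation function, weighting factor)
ItemSpec = Tuple[str, str, Callable[[str, str], float], float]


def match_records(source_record: Dict[str, str],
                  target_record: Dict[str, str],
                  items: List[ItemSpec]) -> float:
    """Return the total evaluation value for one pair of records."""
    evaluations = []
    for src_item, tgt_item, evaluate, weight in items:   # S110: select item
        src_value = source_record[src_item]              # S111: extract values
        tgt_value = target_record[tgt_item]
        value = evaluate(src_value, tgt_value)           # S112: evaluate
        evaluations.append((value, weight))
    # S113: the loop ends when no name identification item remains.
    weighted = [value * weight for value, weight in evaluations]  # S114
    return sum(weighted)                                 # S115: total value
```

For example, with an exact-match evaluation function, a weighting factor of 0.3 for the name, and 0.2 for the zip code, two records that agree on the name but differ in the zip code receive a total evaluation value of 0.3.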
Next, a detailed example of the name identification process will be described with reference to FIGS. 26 and 27. FIG. 26 is a diagram illustrating an example of the data structure of name identification definitions, FIG. 26(A) illustrates the contents of the name identification definitions, and FIG. 26(B) illustrates a detailed example of the name identification definitions. FIG. 27 is a diagram illustrating a detailed example of the name identification.
As illustrated in FIG. 26(A), in the name identification definition, a name identification method d1, a name identification source designation d2, a name identification target designation d3, a name identification item designation d4, and a threshold d5 are defined in association with one another. In the name identification method d1, a method of identifying names is designated. For example, as a method of identifying names, there is a “self name identification” in which a single set of records is the target, name identification is performed between the records within the set in a round-robin manner, and duplicate records are eliminated by detecting records that are the same as each other. In the self name identification, since the name identification source and the name identification target are the same set, the structures (items of the record) thereof are the same. In addition, as another method of identifying names, there is a “different party name identification” in which the name identification source and the name identification target are different sets, name identification is performed on each combination of a name identification source record and a name identification target record, records that match each other between the name identification source and the name identification target are detected, and the corresponding records are associated with each other. In the different party name identification, since the name identification source and the name identification target are different sets, generally, the structures (items of records) thereof are different from each other. In the name identification source designation d2, access information of the name identification source such as a database name and items of a record of the name identification source are designated.
In the name identification target designation d3, access information of the name identification target such as a database name and items of a record of the name identification target are designated. In the name identification item designation d4, the name identification items are designated as a combination of items of the name identification source and items of the name identification target, and an evaluation function and a weighting factor that are applied to each name identification item are designated. In addition, in the threshold d5, an upper threshold used for judging “White” and a lower threshold used for judging “Black” are designated.
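The definition elements d1 to d5 might be represented as a small data structure like the one below. The class and field names are hypothetical, and the instance is populated only with the values that this section gives for the example of FIG. 26(B); the evaluation functions and weighting factors of the address and the date of birth are not stated here and are therefore left out.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SetDesignation:                 # d2 or d3
    access: str                       # access information, e.g. a table name
    items: List[str]                  # items of a record


@dataclass
class ItemDesignation:                # one entry of d4
    source_item: str
    target_item: str
    evaluation_function: str
    weight: float


@dataclass
class NameIdentificationDefinition:
    method: str                       # d1
    source: SetDesignation            # d2
    target: Optional[SetDesignation]  # d3; None for self name identification
    items: List[ItemDesignation]      # d4
    upper_threshold: float            # d5: used for judging "White"
    lower_threshold: float            # d5: used for judging "Black"

    def effective_target(self) -> SetDesignation:
        """In self name identification the target is the source itself."""
        return self.target if self.target is not None else self.source


definition = NameIdentificationDefinition(
    method="self name identification",
    source=SetDesignation(
        "customer table",
        ["ID", "name", "zip code", "address", "date of birth"]),
    target=None,
    items=[
        ItemDesignation("name", "name", "edit distance", 0.3),
        ItemDesignation("zip code", "zip code", "complete matching", 0.2),
        # address and date of birth: function and weight not stated here
    ],
    upper_threshold=0.72,
    lower_threshold=0.26,
)
```

Leaving `target` unset for the self name identification mirrors the statement that a definition of d3 is not necessary in that case.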
As illustrated in FIG. 26(B), for example, in the name identification method d1, the “self name identification” is designated. In the access information of the name identification source designation d2, a “customer table” is designated, and, in the record information of the name identification source designation d2, items including an identification (ID), a name, a zip code, an address, and a date of birth are designated. In addition, in a case where the name identification method is the “self name identification”, the name identification target designation d3 is the same as the information of the name identification source, and a definition thereof is not necessary. In the name identification item designation d4, the name identification items are designated in the form of name: name, zip code: zip code, address: address, and date of birth: date of birth. The reason for this is that the name identification item is designated with a set of an item of the name identification source and an item of the name identification target, and in a case where the name identification method is the “self name identification”, the record configurations are the same, and thus, generally, the same item names are designated as the set. For each name identification item, an evaluation function and a weighting factor to be applied are designated. For example, in a case where the name identification item is “name:name”, “edit distance” is designated as the evaluation function, and 0.3 is designated as the weighting factor. On the other hand, in a case where the name identification item is “zip code: zip code”, “complete matching” is designated as the evaluation function, and 0.2 is designated as the weighting factor. In the threshold d5, 0.72 is designated as the upper threshold, and 0.26 is designated as the lower threshold. Hereinafter, a name identification item in which the same item names are paired will be represented as one item name. 
For example, “name identification item name: name” is represented as “name identification item name”. Here, the “edit distance” is an evaluation function that expresses, as a distance, the minimum number of edits needed to transform the value of the name identification target into the value of the name identification source when the values of the name identification items of the name identification source and the name identification target are matched. For example, in a case where no transformation is necessary, 1.0 is returned, and, in a case where the entire value needs to be transformed, 0 is returned. On the other hand, in a case where only part of the value needs to be transformed, a value in the range of 0 to 1.0 is returned that decreases as the number of transformations increases. The “complete matching” is an evaluation function that represents whether or not two values are completely the same as each other in the matching of the values of the name identification items of the name identification source and the name identification target. In a case where the two values are completely the same as each other, 1.0 is returned; otherwise, 0 is returned. In addition, the evaluation function is not limited thereto, and there are other evaluation functions such as an “N-gram” that evaluates the degree to which sequences of N adjacent characters in the value of the name identification source are included in the value of the name identification target.
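The three evaluation functions can be sketched as follows. The description does not specify how the edit count is mapped to the range of 0 to 1.0 or how the N-gram degree is computed, so the normalization by the longer string length and the fraction-of-N-grams score below are assumptions chosen to satisfy the stated behavior; the function names are likewise hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def edit_distance_score(source: str, target: str) -> float:
    """1.0 when no edit is needed, decreasing toward 0 as edits increase."""
    longest = max(len(source), len(target))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(target, source) / longest


def complete_matching(source: str, target: str) -> float:
    """1.0 when the two values are exactly the same, otherwise 0."""
    return 1.0 if source == target else 0.0


def ngram_score(source: str, target: str, n: int = 2) -> float:
    """Fraction of the source's N-grams that occur in the target."""
    grams = [source[i:i + n] for i in range(len(source) - n + 1)]
    if not grams:                     # source shorter than n characters
        return complete_matching(source, target)
    return sum(g in target for g in grams) / len(grams)
```

All three functions return a value in the range of 0 to 1.0, so they can be combined with weighting factors in the same weighted sum.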
FIG. 27 illustrates the intermediate states and the result of a name identification process for one record M1 of the name identification source and each record of the name identification target, as a part of the name identification process defined in FIG. 26. In the customer table M of the name identification target, for example, two million records are stored. In the name identification process, each one of the records is used as a name identification target record and matched with the record M1 of the name identification source. For example, in the name identification process, as an intermediate result of the matching, for each set of the record M1 of the name identification source and the records M1 to M6 of the name identification target, a result of applying the evaluation function, a weighting result, and a total evaluation value are output in association with one another. Then, in the name identification process, after the matching, for each set of the record M1 of the name identification source and the records M1 to M6 of the name identification target, the judgment on the name identification is made, and the judgment results are output.
However, there is a problem in that it is difficult to grasp all the name identification results clearly in a large-scale name identification process. In other words, in a conventional name identification process, since the records of the name identification source and the name identification target are matched in a round-robin manner, a considerable storage capacity is needed to store the matching results of all the combinations, and an enormous amount of time is required for the analysis thereof. In addition, the processing time that is required for the matching in the name identification process is enormous. For example, in a case where the self name identification is designated as the name identification method, and the name identification source and the name identification target respectively have two million records, 2 million records × 2 million records = four trillion combinations (sets) are matched. Here, assuming that the data capacity of the matching result of one set is 50 bytes, the data capacity of the matching results of all the sets is 200 terabytes (TB), and an enormous amount of time is required for the analysis thereof. Accordingly, it is not practical to store the matching results of all the sets and to analyze and visualize them. Consequently, it is difficult to grasp the matching results of all the sets clearly.
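The storage estimate above can be checked with a few lines of arithmetic. The variable names are chosen here for illustration, and the conversion assumes decimal terabytes (1 TB = 10^12 bytes).

```python
records = 2_000_000            # records in source and in target
bytes_per_result = 50          # assumed size of one matching result

pairs = records * records                  # round-robin combinations
total_bytes = pairs * bytes_per_result
total_tb = total_bytes / 10**12            # decimal terabytes

print(f"{pairs:,} sets, {total_tb:.0f} TB")
```

The quadratic growth is the point: doubling the number of records quadruples both the number of sets to match and the storage needed for their results.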
In addition, in order to perform the name identification process appropriately, the settings are made based on experience and are adjusted through feedback of the processing results, and the name identification results and the matching results need to be clearly understood for the feedback to be effective.