In recent years, supervised learning has been used in a variety of fields. Supervised learning is a learning scheme in which a machine learning device learns labeled data as supervised data and then predicts the labels of test data. A support vector machine (SVM) is a known machine learning device for supervised learning.
For example, in the medical field, a technique applied to the diagnosis of medical images is disclosed. In this technique, supervised data serving as support vectors (SVs) is acquired from a supervised data group for a category of abnormal shading and a category of normal shading. Then, an identification function that maximizes a margin area is calculated based on the supervised data serving as the SVs, and an SVM is built. At this time, it is judged whether or not each piece of supervised data serving as an SV is an appropriate SV. In a case where supervised data is judged to be an inappropriate SV, that supervised data is removed, and the SVs and the identification function are calculated again.
For records each configured by a set of values, there is a name identification function that combines the records and judges the identity, the similarity, and the relevance between them. In the name identification function, for example, a set of records to be identified is referred to as a name identification source, and a set of records against which the identification is performed is referred to as a name identification target. FIG. 9 is a diagram illustrating the name identification function. As illustrated in FIG. 9, in a name identification process that realizes the name identification function, a record that is the same as, similar to, or relevant to a record of the name identification source is detected from the name identification target, and the detection result is output as a result of the name identification process. Relating to the name identification function, there is a technique for name identification that uses supervised learning.
Patent Document 1: Japanese Laid-open Patent Publication No. 2005-198970
First, a conventional name identification function will be described with reference to FIGS. 10 to 12. FIG. 10 is a diagram illustrating the operation of the name identification function. As illustrated in FIG. 10, in a name identification process that realizes the name identification function, each record J1 of the name identification source is collated with records M (M1 to Mn) of the name identification target so that name identification is performed.
In the name identification process, the values of each item to be identified (referred to as a “name identification item”) in the record J1 of the name identification source and a record M1 of the name identification target are collated by applying an evaluation function that is defined for each name identification item. Here, it is assumed that the name identification items are a name, an address, and a date of birth, and that, in the name identification process, the matching is made by applying the evaluation function fa( ) to the name, fb( ) to the address, and fc( ) to the date of birth. Then, the evaluation value of each name identification item that is derived as a result of the matching is weighted in accordance with the name identification item, and the weighted values are added together, whereby a total evaluation value is derived. In addition, in the name identification process, total evaluation values are likewise derived for all the remaining records M2 to Mn of the name identification target with respect to the record J1 of the name identification source. As a result, a name identification candidate set that includes the total evaluation values for the sets of the record J1 of the name identification source and the records M1 to Mn of the name identification target is generated.
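The weighting and summation described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the text names fa( ), fb( ), and fc( ) without giving their bodies, so a placeholder exact-match function stands in for all three, and the weights 0.5, 0.3, and 0.2 as well as the names `ITEMS`, `total_evaluation`, and `candidate_set` are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Record = Dict[str, str]
EvalFn = Callable[[str, str], float]


def exact(a: str, b: str) -> float:
    """Placeholder evaluation function: 1.0 on an exact match, else 0.0."""
    return 1.0 if a == b else 0.0


# Hypothetical stand-ins for fa(), fb(), fc() and hypothetical weights.
ITEMS: Dict[str, Tuple[EvalFn, float]] = {
    "name": (exact, 0.5),
    "address": (exact, 0.3),
    "date_of_birth": (exact, 0.2),
}


def total_evaluation(source: Record, target: Record) -> float:
    """Weight each per-item evaluation value and add the results together."""
    return sum(
        weight * fn(source[item], target[item])
        for item, (fn, weight) in ITEMS.items()
    )


def candidate_set(source: Record, targets: List[Record]) -> List[Tuple[int, float]]:
    """Pair the index of every target record with its total evaluation value."""
    return [(i, total_evaluation(source, t)) for i, t in enumerate(targets)]
```

Calling `candidate_set(j1, [m1, ..., mn])` for a source record J1 would yield one total evaluation value per target record, which corresponds to the name identification candidate set for J1.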
Then, in the name identification process, the name identification is judged for the sets of records that belong to the name identification candidate set based on thresholds defined in advance. For example, a set of records that are judged to match each other completely is automatically judged as “White”, a set of records that are judged not to match at all is automatically judged as “Black”, and these results are output as identification results. In addition, a set of records that is difficult to judge automatically is judged as “Gray” and is output to a candidate list. The judgment of each set output to the candidate list is then left to a staff member. The name identification definitions that need to be set by a staff member are the selection of name identification items, the selection of evaluation functions, and the setting of weighting factors and thresholds.
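The threshold-based judgment can be sketched as below. The text does not state whether a total evaluation value exactly equal to a threshold counts as “White” or “Black”; treating both boundaries as inclusive is an assumption made here, and the function names `judge` and `split_candidates` are hypothetical.

```python
from typing import Dict, Hashable, Iterable, List, Tuple


def judge(total: float, upper: float, lower: float) -> str:
    """Classify one total evaluation value against the two thresholds.

    Assumption: both boundaries are inclusive.
    """
    if total >= upper:
        return "White"
    if total <= lower:
        return "Black"
    return "Gray"


def split_candidates(
    candidates: Iterable[Tuple[Hashable, float]], upper: float, lower: float
) -> Tuple[Dict[Hashable, str], List[Hashable]]:
    """Separate automatic results from the candidate list left to staff."""
    results: Dict[Hashable, str] = {}
    candidate_list: List[Hashable] = []
    for pair_id, total in candidates:
        verdict = judge(total, upper, lower)
        if verdict == "Gray":
            candidate_list.append(pair_id)  # needs a manual judgment
        else:
            results[pair_id] = verdict      # judged automatically
    return results, candidate_list
```

The second return value plays the role of the candidate list: only the sets in it require a judgment by a staff member.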
Next, a detailed example of the name identification process will be described with reference to FIGS. 11 and 12. FIG. 11 is a diagram illustrating an example of the data structure of name identification definitions. FIG. 11(A) illustrates the contents of the name identification definitions, and FIG. 11(B) illustrates a detailed example of the name identification definitions. FIG. 12 is a diagram illustrating a detailed example of the name identification.
As illustrated in FIG. 11(A), in the name identification definition, a name identification method d1, a name identification source designation d2, a name identification target designation d3, a name identification item designation d4, and a threshold d5 are defined in association with one another. In the name identification method d1, a method of identifying names is designated. One such method is “self-name identification”, in which a single record set is the target, name identification is performed between the records within that set in a round-robin manner, and duplicate records are eliminated by detecting records that match each other. In the self-name identification, since the name identification source and the name identification target are the same set, their structures (items of the record) are the same. Another method is “different party name identification”, in which the name identification source and the name identification target are different sets, name identification is performed on each combination of a name identification source record and a name identification target record, records that match each other between the two sets are detected, and the corresponding records are associated with each other. In the different party name identification, since the name identification source and the name identification target are different sets, their structures (items of records) generally differ from each other. In the name identification source designation d2, access information of the name identification source, such as a database name, and items of a record of the name identification source are designated.
In the name identification target designation d3, access information of the name identification target, such as a database name, and items of a record of the name identification target are designated. In the name identification item designation d4, the name identification items are designated as combinations of items of the name identification source and items of the name identification target, and an evaluation function and a weighting factor that are applied to each name identification item are designated. In addition, in the threshold d5, an upper threshold used for judging “White” and a lower threshold used for judging “Black” are designated.
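One way to picture how d1 to d5 are associated is as a single nested data structure. The following is a minimal sketch, not the layout of FIG. 11: all class and field names are hypothetical, and falling back to the source designation when no target is given is an assumption that follows from the source and target being the same set in the self-name identification.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RecordSetDesignation:
    """Access information and record items of a record set (d2 or d3)."""
    access_info: str   # e.g. a database or table name
    items: List[str]   # item names of a record


@dataclass
class ItemDesignation:
    """One name identification item of d4."""
    source_item: str
    target_item: str
    evaluation_function: str  # name of the evaluation function to apply
    weight: float             # weighting factor


@dataclass
class NameIdentificationDefinition:
    method: str                              # d1
    source: RecordSetDesignation             # d2
    target: Optional[RecordSetDesignation]   # d3
    item_designations: List[ItemDesignation]  # d4
    upper_threshold: float                   # d5, used for judging "White"
    lower_threshold: float                   # d5, used for judging "Black"

    def resolved_target(self) -> RecordSetDesignation:
        """Return d3, or d2 when source and target are the same set."""
        return self.target if self.target is not None else self.source
```

Keeping the evaluation function as a name rather than a callable is a design choice made for this sketch so that a definition could be stored as plain data.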
As illustrated in FIG. 11(B), for example, in the name identification method d1, the “self-name identification” is designated. In the access information of the name identification source designation d2, a “customer table” is designated, and, in the record information of the name identification source designation d2, items of an identification (ID), a name, a zip code, an address, and a date of birth are designated. In a case where the name identification method is the “self-name identification”, the name identification target designation d3 is the same as the information of the name identification source, and thus does not need to be defined. In the name identification item designation d4, the name identification items are designated as name: name, zip code: zip code, address: address, and date of birth: date of birth. The reason for this is that a name identification item is designated as a pair of an item of the name identification source and an item of the name identification target, and, in a case where the name identification method is the “self-name identification”, the record configurations are the same, so that generally the same item names are paired. For each name identification item, an evaluation function and a weighting factor to be applied are designated. For example, for the name identification item “name: name”, “edit distance” is designated as the evaluation function, and 0.3 is designated as the weighting factor. For the name identification item “zip code: zip code”, “complete matching” is designated as the evaluation function, and 0.2 is designated as the weighting factor. In the threshold d5, 0.72 is designated as the upper threshold, and 0.26 is designated as the lower threshold. Hereinafter, a name identification item in which the same item names are paired will be represented by the single item name.
For example, “name identification item name: name” is represented as “name identification item name”. Here, the “edit distance” is an evaluation function that expresses, as a distance, the minimum number of edits needed to transform the value of the name identification target into the value of the name identification source, for a combination of values of a name identification item of the name identification source and the name identification target. For example, in a case where no transformation is necessary, 1.0 is returned, and, in a case where the entire value has to be transformed, 0 is returned. In a case where only part of the value has to be transformed, a value in the range of 0 to 1.0 is returned in accordance with the number of transformations. The “complete matching” is an evaluation function that represents whether or not the two values completely match each other for a combination of the values of a name identification item of the name identification source and the name identification target. In a case where the two values completely match each other, 1.0 is returned; otherwise, 0 is returned. The evaluation function is not limited to these; for example, there is also an “N-gram” function that evaluates the degree to which sequences of N adjacent characters in the value of the name identification source are included in the value of the name identification target.
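The three evaluation functions can be sketched as follows. The text fixes only the end points of the edit-distance score (1.0 when no transformation is needed, 0 when everything must be transformed), so dividing the Levenshtein distance by the length of the longer value is an assumption. Likewise, scoring the N-gram function as the fraction of source N-grams found in the target is one plausible reading, not a definition taken from the text.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or
    substitutions needed to turn a into b (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def edit_distance_score(source: str, target: str) -> float:
    """1.0 for identical values, 0.0 when every character must change.

    Assumption: the distance is normalized by the longer length.
    """
    longest = max(len(source), len(target))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(target, source) / longest


def complete_matching(source: str, target: str) -> float:
    """1.0 if the two values match completely, otherwise 0.0."""
    return 1.0 if source == target else 0.0


def ngram_score(source: str, target: str, n: int = 2) -> float:
    """Fraction of the source's N-grams that occur in the target (assumed)."""
    grams = [source[i:i + n] for i in range(len(source) - n + 1)]
    if not grams:  # source is shorter than n characters
        return complete_matching(source, target)
    return sum(1 for g in grams if g in target) / len(grams)
```

All three functions return a value between 0 and 1.0, so their results can be multiplied by a weighting factor and added together in the same way.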
FIG. 12 illustrates intermediate results and a final result of a name identification process for one record M1 of the name identification source and each record of the name identification target, as a part of the name identification process defined in FIG. 11. In the customer table M of the name identification target, for example, two million records are stored. In the name identification process, each of these records is used as a name identification target and collated with the record M1 of the name identification source. For example, as an intermediate result of the matching, for each set of the record M1 of the name identification source and records M1 to M6 of the name identification target, a result of applying the evaluation function, a weighting result, and a total evaluation value are output in association with one another. Then, after the matching, the judgment on the name identification is made for each set of the record M1 of the name identification source and the records M1 to M6 of the name identification target, and the judgment results are output.
Next, the name identification function performed by a machine learning unit corresponding to a machine learning device will be described with reference to FIG. 13. FIG. 13 is a diagram illustrating name identification that is performed by the machine learning unit. As illustrated in FIG. 13, a machine learning unit that realizes supervised learning is provided in the name identification process that realizes the name identification function. The machine learning unit acquires training data, which is supervised data representing examples of record pairs with a positive judgment result, and learns the judgment criteria used in the name identification process from the acquired training data. These judgment criteria are used as the weighting of each name identification item and as the thresholds applied to the judgment of a name identification target record.
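To make the idea of learning judgment criteria concrete, the following sketch uses a simple perceptron as a stand-in learner. This is not the learning method of the text, which does not specify one. It assumes that each training example is a vector of per-item evaluation values together with a match or non-match label, and the names `score` and `train` are hypothetical. The learned weights correspond to the per-item weighting, and the negated bias acts as a single decision threshold.

```python
from typing import List, Sequence, Tuple

Example = Tuple[Sequence[float], bool]  # (per-item evaluation values, is a match)


def score(weights: Sequence[float], bias: float, values: Sequence[float]) -> float:
    """Weighted sum of the per-item evaluation values plus the bias."""
    return sum(w * v for w, v in zip(weights, values)) + bias


def train(examples: List[Example], epochs: int = 100,
          rate: float = 0.1) -> Tuple[List[float], float]:
    """Perceptron training: adjust weights and bias on every misjudged pair."""
    weights = [0.0] * len(examples[0][0])
    bias = 0.0
    for _ in range(epochs):
        mistakes = 0
        for values, is_match in examples:
            predicted = score(weights, bias, values) > 0.0
            if predicted != is_match:
                step = rate if is_match else -rate
                weights = [w + step * v for w, v in zip(weights, values)]
                bias += step
                mistakes += 1
        if mistakes == 0:  # every training pair is judged correctly
            break
    return weights, bias
```

A pair is judged a match when `score(weights, bias, values)` is positive, that is, when the weighted sum exceeds `-bias`. A practical system would more likely rely on an established learner such as an SVM and would derive separate upper and lower thresholds.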
Then, in the name identification process, a record of the name identification source is combined with a record of the name identification target, a judgment of the name identification is made by using the judgment criteria acquired by the machine learning device, and the judgment result is output. At this time, a set for which the name identification is difficult to judge automatically is output to the candidate list so as to be handed over to a staff member for judgment. Then, for each set output to the candidate list, training data is fed back as appropriate in accordance with the judgment made by the staff member, so that the name identification process realizes a high-accuracy judgment through supervised learning.
However, in a name identification process performed by a conventional machine learning device, there is no unit that verifies the validity of training data or detects contradictions among pieces of training data, and accordingly, it is difficult to generate appropriate training data and to feed the training data back appropriately.
In addition, in the name identification process performed by a conventional machine learning device, since there is no unit that evaluates incorrect learning, it is difficult to prevent the quality of the judgment result of the name identification from being degraded. In other words, in a case where the machine learning device performs incorrect learning based on incorrect training data (erroneous learning), or performs incorrect learning based on biased training data because a large amount of biased training data has been added (over-learning), it is difficult to detect the degradation in the correctness of the judgment result of the name identification. As a result, it is difficult to prevent the quality of the judgment result of the name identification from being degraded.