Typically, when a new database is generated by integrating a plurality of databases, a name identification operation is performed. In the name identification operation, resemblance is determined among the data items in the databases to be integrated; resembling data items are extracted; and common data items are decided from the extracted data items.
As a specific example, degrees of resemblance are calculated among data items in the sets of master data stored in a plurality of systems such as a bookkeeping system, a customer system, and a delivery system. Then, the data items having a high degree of resemblance are extracted.
As a method of determining the resemblance among data items, either the visual judgment of an operator can be used or automated estimation can be implemented. For example, when the visual judgment of an operator is used, the operator identifies resembling data items by referring to the explanation of data items given in the design specification of each database. Accordingly, the operator determines that, for example, a data item “order receiving entity” stored in a database A resembles a data item “valued customer” stored in a database B.
When automated estimation is implemented, degrees of resemblance are calculated while sampling the data items, and resembling data items are identified according to the degrees of resemblance. For example, the degrees of resemblance are calculated using the string lengths of attribute values associated with the data items, or using the frequency of appearance of special strings appearing in the data items or in the attribute values, or using the number of partial strings in common.
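The three kinds of measure mentioned above can be illustrated with a minimal sketch. It assumes that each data item is represented by a sample of its attribute values as strings; the function names, the use of non-alphanumeric characters as "special strings", and the use of fixed-length n-grams as "partial strings" are assumptions made for illustration and are not taken from the cited literature.

```python
from collections import Counter


def mean_length(values):
    """Average string length of the sampled attribute values (0.0 if empty)."""
    return sum(len(v) for v in values) / len(values) if values else 0.0


def length_similarity(a_values, b_values):
    """Ratio of the shorter mean length to the longer one, in [0, 1]."""
    la, lb = mean_length(a_values), mean_length(b_values)
    longest = max(la, lb)
    return min(la, lb) / longest if longest > 0 else 1.0


def special_char_profile(values):
    """Count characters that are neither alphanumeric nor whitespace."""
    return Counter(
        ch for v in values for ch in v
        if not ch.isalnum() and not ch.isspace()
    )


def special_char_similarity(a_values, b_values):
    """Overlap of the special-character counts (multiset Jaccard index)."""
    pa, pb = special_char_profile(a_values), special_char_profile(b_values)
    total = sum((pa | pb).values())
    return sum((pa & pb).values()) / total if total else 1.0


def ngrams(values, n=3):
    """Set of all length-n partial strings found in the sampled values."""
    return {v[i:i + n] for v in values for i in range(len(v) - n + 1)}


def common_substring_similarity(a_values, b_values, n=3):
    """Share of partial strings that the two samples have in common."""
    ga, gb = ngrams(a_values, n), ngrams(b_values, n)
    union = ga | gb
    return len(ga & gb) / len(union) if union else 0.0


# Hypothetical samples of attribute values from two data items.
phone_a = ["03-1234-5678", "06-9876-5432"]
phone_b = ["03-1111-2222"]
print(length_similarity(phone_a, phone_b))
print(special_char_similarity(phone_a, phone_b))
print(common_substring_similarity(phone_a, phone_b))
```

How such scores would be weighted and combined into a single degree of resemblance is not specified here and would be a separate design decision.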
Patent Literature 1: Japanese Laid-open Patent Publication No. 2003-271656
Patent Literature 2: Japanese Laid-open Patent Publication No. 11-143902
Patent Literature 3: Japanese Laid-open Patent Publication No. 06-325091
Patent Literature 4: Japanese Laid-open Patent Publication No. 2001-067378
However, even when the conventional technology is used, determining the resembling candidates may take a long processing time, depending on the number of data items or on the result of calculating the degrees of resemblance among the data items.
For example, when the visual judgment is used, since the data items stored across databases are generally named in a varied manner, it is often difficult to determine the resemblance only by referring to the data item names written in the design specification. Moreover, in legacy systems, the design specifications of databases are often not maintained; or, even if the design specifications are maintained, updates may not be reflected therein. In such cases, the operator has to determine the resemblance by checking the attribute values of data items, that is, by checking the data itself. That task takes an immense amount of time when the number of data items is large.
During automated estimation, the degree of resemblance is calculated for every combination of data items. Hence, if there are a large number of data items or if the resembling items are not narrowed down in advance, the calculation takes a long time. For example, if there are 100,000 data items, then the resemblance needs to be calculated about 100,000 × 100,000 / 2 = 5,000,000,000 (5 billion) times, which is not realistic.
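The arithmetic above can be checked with a short sketch. The exact number of unordered pairs among n items is n(n − 1)/2, which for large n is close to the n × n / 2 approximation used in the text. The throughput of 10,000 comparisons per second is a purely hypothetical figure chosen for illustration; the function names are likewise assumptions.

```python
def pair_count_approx(n):
    """Approximation used in the text: n * n / 2."""
    return n * n // 2


def pair_count_exact(n):
    """Exact number of unordered pairs of distinct items: n * (n - 1) / 2."""
    return n * (n - 1) // 2


def estimated_hours(n, comparisons_per_second):
    """Hours needed to compare all pairs at an assumed, constant throughput."""
    return pair_count_exact(n) / comparisons_per_second / 3600


n_items = 100_000
print(pair_count_approx(n_items))        # 5000000000
print(pair_count_exact(n_items))         # 4999950000
# Assumed throughput of 10,000 comparisons per second (hypothetical).
print(estimated_hours(n_items, 10_000))  # roughly 139 hours
```

Under that assumed throughput, comparing all pairs would take close to six days, and doubling the number of data items roughly quadruples the time.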
Moreover, during automated estimation, data items having a high degree of resemblance are extracted as data items that are likely to be integrated. Conversely, data items having a low degree of resemblance are automatically left out of consideration. Hence, for data items that have a low degree of resemblance but are still targets for integration because of a high degree of relevancy, the method of using visual judgment eventually needs to be implemented.