In recent years, technologies such as Web services, SOA (service-oriented architecture), EAI (enterprise application integration), and ESB (enterprise service bus) have been developed as fundamental technologies for operating multiple computer systems in a linked manner. These technologies enable data to be passed between systems by unifying data formats using XML (extensible markup language) or the like. However, unifying data formats alone is not sufficient to link systems, because the data representation methods remain inconsistent.
Four terms “attribute,” “attribute name,” “attribute value,” and “schema” are defined as follows. An “attribute” refers to a concept that represents a characteristic of an object or event of the real world when representing the object or event as data. For example, in generating data about a human, “namae (meaning “name” in Japanese)” or “seibetsu (meaning “sex” in Japanese),” which is a characteristic of the human, is an attribute. An “attribute name” is a term that is used to represent what is represented by an attribute. That is, an “attribute name” is a term serving as the name of an attribute. For example, in using a name or sex as an attribute in order to represent a human as data, a term “namae” or term “seibetsu” is an attribute name. An “attribute value” is the value of a characteristic specified by an attribute (the value of a characteristic that is an attribute). For example, “Tarou Tanaka” for the attribute “namae” and “male” for the attribute “seibetsu” are attribute values.
A “schema” contains multiple attributes that define an object or event of the real world and that have defined attribute names. A schema also defines the description formats of attribute values. An example of a schema defining a “human” is one that contains four attributes having the attribute names “name,” “address,” “seibetsu,” and “age,” together with rules for describing their attribute values, such as “seibetsu has the value ‘otoko (meaning “male” in Japanese)’ or the value ‘onna (meaning “female” in Japanese)’.”
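The definitions above can be illustrated with a minimal Python sketch. The class names `Attribute` and `Schema`, the `allowed_values` field, and the `validate` method are hypothetical names introduced only for illustration; the sketch assumes that a description rule can be expressed as a set of permitted values.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Optional, Tuple


@dataclass(frozen=True)
class Attribute:
    """An attribute with its attribute name and an optional value rule."""
    name: str
    # Assumption: a description rule is modeled as a set of permitted values;
    # None means that any value is accepted.
    allowed_values: Optional[FrozenSet[str]] = None


@dataclass(frozen=True)
class Schema:
    """A schema: a collection of named attributes with value rules."""
    attributes: Tuple[Attribute, ...]

    def validate(self, record: Dict[str, str]) -> bool:
        """Return True if the record uses exactly the schema's attribute
        names and every attribute value obeys its rule."""
        if set(record) != {a.name for a in self.attributes}:
            return False
        return all(
            a.allowed_values is None or record[a.name] in a.allowed_values
            for a in self.attributes
        )


# The "human" schema from the text: four attributes, one with a value rule.
HUMAN = Schema((
    Attribute("name"),
    Attribute("address"),
    Attribute("seibetsu", frozenset({"otoko", "onna"})),
    Attribute("age"),
))
```

A record such as `{"name": "Tarou Tanaka", "address": "Osaka", "seibetsu": "otoko", "age": "22"}` satisfies this schema, whereas the same record with `"seibetsu": "m"` does not, because “m” violates the description rule.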
The inconsistencies in data representation methods mean that the schema used to represent data varies among the creators of the data. One example is a case where a piece of personal information data that has the attribute name “namae” in one service is handled under the attribute name “name” in another service. Another example is one where an attribute value representing a date is described as “2006/09/06” in one piece of data and as “Sep. 6 (2006)” in a different format in another piece of data. Yet another example is one where the rule for describing attribute names and attribute values varies among databases or data warehouses, as in a case where information handled as two attributes, “hourly wage” and “working hours,” in one database is handled as one attribute, “salary,” in another data warehouse.
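Two of these inconsistencies can be sketched in Python as follows. The function names `normalize_date` and `salary` are hypothetical. The sketch assumes that only the two date formats quoted above occur, that month abbreviations are parsed under the default (English) locale, and that “salary” is the product of “hourly wage” and “working hours,” which the text implies but does not state.

```python
from datetime import datetime

# Assumption: only the two date formats quoted in the text need handling.
KNOWN_DATE_FORMATS = ("%Y/%m/%d", "%b. %d (%Y)")


def normalize_date(text: str) -> str:
    """Rewrite a date written in any known format as YYYY/MM/DD."""
    for fmt in KNOWN_DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).strftime("%Y/%m/%d")
        except ValueError:
            continue  # not this format; try the next one
    raise ValueError(f"unrecognized date format: {text!r}")


def salary(hourly_wage: float, working_hours: float) -> float:
    """Merge two source attributes into one target attribute.

    Assumption: salary = hourly wage x working hours.
    """
    return hourly_wage * working_hours
```

Under these assumptions, `normalize_date("Sep. 6 (2006)")` and `normalize_date("2006/09/06")` both yield the same string, so the two pieces of data become comparable.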
As seen, in a case where multiple computer systems are operating in a company, each system defines data using a specific schema. Accordingly, data indicating the same fact often has different attribute names among the systems or the rule for describing attribute values often varies thereamong. For this reason, in order to transfer data from a system A to a system B and link the systems, there is a need to find a method for converting data represented using the schema of the system A into data represented using the schema of the system B. In this specification, such a problem will be referred to as a “schema integration problem.”
For example, Patent Document 1 describes a system for solving schema integration problems. The system described in Patent Document 1 converts multiple schemas into abstract schema graphs and converts all or some of the abstract schema graphs, according to a predetermined rule. Also, the system matches class names or attribute names defined by the abstract graphs with each other, generates integrated abstract schema graphs, based on the matching results, and converts the generated integrated abstract schema graphs into integrated schemas.
Actual schema integration problems are often caused by differences between attribute value description rules. FIG. 1 is a diagram showing an example of schemas that have different attribute value description formats. FIG. 1 shows an example where a conversion-source schema (source schema) S, which includes pieces of employee information that require conversion, is converted into an intended schema (target schema) T. In the source schema S shown in FIG. 1, the attribute value of the attribute having the attribute name “family name” is a family name, and the attribute value of the attribute having the attribute name “first name” is a first name. In the target schema T, on the other hand, the attribute value of the attribute having the attribute name “namae” is a full name. Accordingly, the character strings must be connected: the attribute value of the attribute having the attribute name “family name” in the source schema S is connected with the attribute value of the attribute having the attribute name “first name,” and the connected character string is used as the attribute value of the attribute having the attribute name “namae.” Also, in the source schema S, the attribute value of the attribute having the attribute name “seibetsu” is “dansei” (meaning “male” in Japanese) or “josei” (meaning “female” in Japanese), whereas in the target schema T, the attribute value of the attribute having the attribute name “seibetsu” is “m” or “f.” Accordingly, conversions such as a conversion from “dansei” into “m” must be performed. Similarly, with respect to the other attributes illustrated in FIG. 1, there is a need to perform various conversions, such as one where a prefecture name is extracted from an address and converted into a region name such as “Kinki,” and one where an age such as “22” is converted into an age group such as “20s.” As seen, if the schemas have different attribute value description formats, it is necessary not only to find correspondences between attributes but also to convert attribute values between the corresponding attributes.
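The conversions described for FIG. 1 can be sketched in Python as follows. FIG. 1 itself is not reproduced here, so the source attribute names “address” and “age,” the target attribute names “region” and “age group,” the contents of the lookup tables, the space used to join the names, and the address layout are all assumptions made only for illustration.

```python
from typing import Dict

# Hypothetical lookup tables; the entries are illustrative only.
SEX_CODES = {"dansei": "m", "josei": "f"}
PREFECTURE_TO_REGION = {"Osaka": "Kinki", "Kyoto": "Kinki", "Tokyo": "Kanto"}


def extract_prefecture(address: str) -> str:
    """Return the first known prefecture name found in the address."""
    for prefecture in PREFECTURE_TO_REGION:
        if prefecture in address:
            return prefecture
    raise ValueError(f"no known prefecture in address: {address!r}")


def to_age_group(age: int) -> str:
    """Convert an age such as 22 into an age group such as '20s'."""
    return f"{(age // 10) * 10}s"


def convert_record(src: Dict[str, str]) -> Dict[str, str]:
    """Convert one record of source schema S into target schema T."""
    return {
        # Connect the family name and the first name into a full name.
        "namae": f"{src['family name']} {src['first name']}",
        # Re-encode the attribute value ("dansei" -> "m", "josei" -> "f").
        "seibetsu": SEX_CODES[src["seibetsu"]],
        # Extract the prefecture, then map it to a region name.
        "region": PREFECTURE_TO_REGION[extract_prefecture(src["address"])],
        # Bucket the age into an age group.
        "age group": to_age_group(int(src["age"])),
    }
```

Each target attribute here requires a different kind of conversion (string connection, code lookup, extraction followed by lookup, and numeric bucketing), which is why finding attribute correspondences alone does not solve the problem.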
Hereafter, an attribute having an attribute name “A” may be referred to as ‘an attribute of an attribute name “A.”’
Non-Patent Document 1 describes an example of a schema integration system that converts data automatically in cases where the schemas have different attribute value description rules. The system described in Non-Patent Document 1 includes multiple searcher modules for performing fundamental data conversions (character string connection, four fundamental operations, etc.), an evaluator module that determines whether the result of a data conversion performed by each searcher module is correct, and a search unit that searches for a proper conversion method in a searching manner using the searchers and evaluator.
The system described in Non-Patent Document 1 operates as follows. When receiving a pair of data sets (tentatively referred to as “S” and “T”) represented by different schemas, the search unit searches for a set of searchers that convert attribute values in S into the schema of T. This search is performed as follows. First, the system generates a pair consisting of attributes in S (tentatively referred to as “s1”) and attributes in T (tentatively referred to as “t1”) (step 1). Next, the system tries to perform a data conversion on the attribute value set of s1 using any of the searchers (step 2). The system then uses the evaluator to compare the values obtained from the conversion with the set of attribute values of t1 and checks whether these values are similar to each other (step 3). If the values are highly similar, the system determines that it has found a correspondence between these attributes as well as a conversion method, outputs the searcher module used in the attempt as an inter-attribute conversion method, and ends the search. If the similarity between the post-conversion values of the attributes s1 and the attribute values of the target attributes t1 is low, the system returns to step 1, generates another pair of attributes, and performs the same steps. In the remaining cases, where the similarity is neither high nor low, the system performs data conversions using each searcher again with the conversion results of s1 as inputs, and performs step 3 again. As seen, the system described in Non-Patent Document 1 tries data conversions in a searching manner until it obtains correct conversions, so as to obtain a conversion method.
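The procedure of steps 1 to 3 can be sketched in Python as follows. This is a simplified illustration and not the implementation of Non-Patent Document 1: the Jaccard similarity used as the evaluator, the greedy choice of the best searcher at each round, the thresholds `high` and `low`, the limit `max_depth`, and all function names are assumptions.

```python
from typing import Callable, Dict, List, Optional, Tuple

Values = List[str]
Searcher = Callable[[Values], Values]


def similarity(converted: Values, target: Values) -> float:
    """Evaluator (assumed): Jaccard similarity of the two value sets."""
    a, b = set(converted), set(target)
    return len(a & b) / len(a | b) if (a | b) else 0.0


def search(
    source: Dict[str, Values],
    target: Dict[str, Values],
    searchers: Dict[str, Searcher],
    high: float = 0.8,
    low: float = 0.1,
    max_depth: int = 3,
) -> Optional[Tuple[str, str, List[str]]]:
    """Return (source attribute, target attribute, searcher names applied),
    or None if no conversion is found."""
    for s_name, s_values in source.items():        # step 1: generate a pair
        for t_name, t_values in target.items():
            values, path = s_values, []
            for _ in range(max_depth):
                # Step 2: try every searcher on the current values.
                candidates = [(n, fn(values)) for n, fn in searchers.items()]
                # Step 3: evaluate each result and keep the best (greedy).
                name, values = max(
                    candidates, key=lambda c: similarity(c[1], t_values)
                )
                path = path + [name]
                score = similarity(values, t_values)
                if score >= high:   # highly similar: conversion found
                    return s_name, t_name, path
                if score < low:     # low similarity: try another pair
                    break
                # Otherwise: convert again, using the results as inputs.
    return None


# Two elementary searchers, chosen only for illustration.
SEARCHERS: Dict[str, Searcher] = {
    "strip": lambda vs: [v.strip() for v in vs],
    "lower": lambda vs: [v.lower() for v in vs],
}
```

For example, with `source = {"raw_name": [" Tanaka", "suzuki ", "SATO"]}` and `target = {"age": ["22", "35", "41"], "name": ["tanaka", "suzuki", "sato"]}`, the pair with “age” is abandoned after the first round because no converted value matches, whereas the pair with “name” reaches an intermediate similarity after one searcher and a full match after a second one. The example also shows the cost of the approach: every candidate pair may require several rounds of trial conversions.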
[Patent Document 1]
    Japanese Unexamined Patent Application Publication No. 2003-162533 (paragraphs 0065 to 0069, FIG. 3)
[Non-Patent Document 1]
    Robin Dhamankar, Yoonkyong Lee, AnHai Doan, Alon Halevy, Pedro Domingos, “iMAP: Discovering Complex Semantic Matches between Database Schemas,” Proceedings of ACM SIGMOD, pp. 383 to 394, 2004.