The present invention relates to a method of comparing data sequences, and in particular to a method of comparing data sequences that allows for database schema information to be matched.
Many large organisations and businesses have a large number of databases and information sources that contain data relating to their customers, business relationships, technical data, operational data, financial performance etc. It is conventional to use data warehouses to bring together disparate and siloed data sources through ETL (Extract, Transform and Load) processes, so that techniques such as data mining can be used to establish correlations and causal connections between the different metrics and operations that are represented by the different data types. There is also a growing interest in business intelligence, in which operational data that is collected by a business operation can be used to infer or predict the tactical or strategic performance of the business. Typically, each of the different databases or information sources will have been developed for a different reason and thus will have a different data schema, which will hinder a simple correlation of the data held in the different databases or information sources. In current implementations, schema matching is typically performed manually, supported by a graphical user interface. This approach has a number of limitations: manually specifying schema matches is tedious, time-consuming, error-prone, and expensive. This is a growing problem given the rapidly increasing number of data sources and E-businesses to be integrated. Moreover, as systems handle more complicated databases and applications, their data schemas become larger, which increases the number of matches to be performed. Schema matching techniques are also of interest in the fields of schema integration, E-commerce, semantic query processing and the semantic web.
To give an example of a typical problem, a database may contain a column which stores the names of all employees in the company, in the order of surname followed by first name. The surname and first name are typically separated by a comma, a space or a tab. The same database may contain another column in a table which stores employee names in the order of initial followed by surname. These two forms of storing employee names are literally different but semantically similar. However, matching based on schema metadata often cannot discover that both columns hold people's names. This kind of similarity is also important for joining data across data sources, as they often do not contain foreign keys between tables. For example, the columns of one table may contain each employee's ID number and billing address, while another table may contain each employee's postal address and contact telephone number. By finding the similarity between each employee's billing address in the first table and each employee's postal address in the second table, these two tables could be joined so that more information can be maintained consistently. Although there are methodologies for designing consistent databases from scratch (for example, using Normal Forms), given the large number of separate databases and many years of legacy data, data inconsistency is a pervasive problem within the data warehouses of large organisations.
Current approaches include discovering similarities between different schemas manually and automatically (or semi-automatically) finding potential matches at the schema level or at the instance level. There are some disadvantages associated with these approaches.
For the manual approach, the obvious disadvantage is that this approach is time-consuming, sometimes to the extent that it falls outside the time frame allowed in any project and for daily maintenance. As a result, there is a very significant cost involved in hiring sufficient database experts to undertake the required work.
There are schema level approaches that use database schemas for schema matching. Schemas are descriptions of the structures of databases. In this approach, only schema information is exploited for schema matching. Such information includes properties of schema elements, e.g. name, description, data type, relationship type, constraints; and schema structures (see, by way of example, L Palopoli et al, “Semi-automatic, semantic discovery of properties from database schemas”, Proc Int. Database Engineering and Applications Symp. (IDEAS), IEEE Comput, pp. 244-253, 1998, L Palopoli et al, “An automatic technique for detecting type conflicts in database schemas”, Proc 7th Int Conf On Information and Knowledge Management (CIKM), pp. 306-313, 1998, AH Doan et al, “Learning source descriptions for data integration”, Proc of WebDB Workshop, pp. 81-92, 2000, S Bergamaschi et al, “Semantic integration of semistructured and structured data sources”, ACM SIGMOD Record 28(1):54-59, 1999, S Castano et al, “Global viewing of heterogeneous data sources”, IEEE Trans Data Knowl Eng 13(2):277-297, 2001). Auxiliary information, such as dictionaries, thesauri, matching rules provided by the users and previously matched schemas, is also used for schema matching.
However, due to the abstractness of the description, sometimes there is not adequate information about the data to enable the relevance of columns to be judged. For example, at a schema level, the data type, length limit and column names are sometimes not enough to judge the difference between the data in different table columns. “John Smith” and “Smith, John” are very similar at the schema level. However, although “broadband modem” and “John Smith” are not similar in terms of semantics, they are also similar at the schema level, because both can be expressed as the string type, the data lengths are similar and the column name can be “names” for both product names and people's names. Therefore, due to the limitations discussed above, there is a need for automatically finding similar patterns in databases by using other available information as a complement, rather than by observing schemas only at an abstracted level.
The instance level approach can gain better insight into the characteristics of the actual schema elements when compared with schema level approaches, because more information is available which is not limited to the abstract data description. This is especially useful when little or no schema information is given. Typical examples of these methods rely on techniques such as linguistic characterisation or constraint-based characterisation. Judging using linguistic characteristics is of most use when applied to text elements. For example, if many of the instances of a schema element can be found in a name database, then this schema element is probably more closely related to schemas which have “name”, or similar, as the name of a schema element than to those which do not.
Constraint-based characterisation is of most relevance when applied to numerical value ranges, averages or character patterns. For example, normal UK telephone numbers begin with ‘0’ and have 11 digits in total; therefore, a judging rule which has a length limit of 11 and has ‘0’ as the first character can be generated as a constraint.
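By way of illustration only, the telephone number constraint described above might be sketched as follows. The function names `matches_uk_phone_constraint` and `column_match_fraction` are hypothetical and do not appear in any of the cited approaches; the sketch assumes that instances are stored as plain strings without spaces or punctuation.

```python
def matches_uk_phone_constraint(value: str) -> bool:
    """Return True if the value has 11 digits in total and begins with '0'.

    Hypothetical helper: assumes the instance is a plain string with no
    spaces, brackets or country-code prefix.
    """
    return (
        len(value) == 11
        and value.isascii()
        and value.isdigit()
        and value.startswith("0")
    )


def column_match_fraction(values: list[str]) -> float:
    """Return the fraction of instances in a column that satisfy the constraint."""
    if not values:
        return 0.0
    return sum(matches_uk_phone_constraint(v) for v in values) / len(values)
```

A column for which a high fraction of instances satisfies the constraint could then be treated as a candidate telephone number column; the threshold for "high" is left open here.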
Various approaches have been proposed to perform instance matching or classification, such as methods using rules, neural networks, and machine learning technologies (see, for example, J Berlin & M Motro, “Autoplex: automated discovery of content for virtual databases”, Proc 9th Int Conf On Cooperative Information Systems (CoopIS), Lecture Notes in Computer Science, vol. 2172, 2001, pp. 108-122, AH Doan et al, “Learning source descriptions for data integration”, Proc of WebDB Workshop, pp. 81-92, 2000, AH Doan, P Domingos, A Halevy, “Reconciling schemas of disparate data sources: a machine learning approach”, in Proc of SIGMOD, 2001, W Li & C Clifton, “Semantic integration in heterogeneous databases using neural networks”, Proc 20th Int Conf On Very Large Data Bases, pp. 1-12, 1994, Wen-Syan Li, Chris Clifton: “SEMINT: A tool for identifying attribute correspondences in heterogeneous databases using neural networks”, Data Knowl. Eng. 33(1): 49-84 (2000) & Wen-Syan Li, Chris Clifton, Shu-Yao Liu: “Database Integration Using Neural Networks: Implementation and Experiences”, Knowl. Inf. Syst. 2(1): 73-96 (2000)).
In these approaches, machine learning techniques are used. However, these machine learning techniques require training before they can recognise and match similar patterns. The training process is often time-consuming and requires a large number of training samples to be built so that the patterns can be recognised.
According to a first aspect of the present invention there is provided a method of determining one or more patterns in a sequence of symbols, wherein the method comprises the steps of: a) analysing the sequence of symbols, such that any patterns of two symbols are replaced by a rule if: i) the rule does not comprise a repeated combination of two symbols; and ii) the rule occurs more than once in the sequence of symbols.
The method may comprise the further step of: b) adding a further symbol to the sequence of symbols; and then repeating step a). In one embodiment of the present invention step a) is repeated until no further patterns of symbols may be replaced by a rule.
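The general idea of replacing repeated pairs of symbols with rules until no further replacement is possible can be illustrated with the following minimal sketch. It is a simplified, batch-style illustration and not the method as claimed: it processes the whole sequence at once rather than adding one symbol at a time as in step b), and it picks the most frequent repeated pair on each pass, which is an assumption made for the example. The function names and the rule labels `R1`, `R2`, ... are hypothetical, and the input symbols are assumed not to use those labels.

```python
from collections import Counter


def count_pairs(seq):
    """Count non-overlapping occurrences of each adjacent pair of symbols."""
    counts = Counter()
    last_start = {}
    for i in range(len(seq) - 1):
        pair = (seq[i], seq[i + 1])
        if last_start.get(pair) == i - 1:
            continue  # overlaps the previous occurrence, e.g. "aaa"
        counts[pair] += 1
        last_start[pair] = i
    return counts


def replace_pair(seq, pair, name):
    """Replace each left-to-right occurrence of pair with the rule name."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(name)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def infer_rules(symbols):
    """Replace repeated pairs with rules until no pair occurs more than once."""
    seq, rules = list(symbols), {}
    while True:
        counts = count_pairs(seq)
        repeated = [p for p, c in counts.items() if c > 1]
        if not repeated:
            return seq, rules
        pair = max(repeated, key=lambda p: counts[p])  # assumption: most frequent first
        name = f"R{len(rules) + 1}"  # hypothetical rule label
        rules[name] = pair
        seq = replace_pair(seq, pair, name)


def expand(seq, rules):
    """Expand rule names back into the original symbols."""
    out = []
    for s in seq:
        out.extend(expand(rules[s], rules) if s in rules else [s])
    return out
```

For example, `infer_rules("abcabcabc")` yields a shortened sequence together with a dictionary of rules, and `expand` recovers the original symbols from them, so no information is lost by the replacement.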
This method has advantages over the known approaches to these types of problem. Significantly, a method according to the present invention could be implemented automatically as a method for monitoring databases and discovering similarity between columns without much effort from human database experts. This will give significant savings in development and maintenance time and cost. When compared with database schema comparison methods, a method according to the present invention utilises data instances, so it can more accurately reflect the patterns expressed in the data, avoiding the inaccurate matching typically found in schema matching methods due to the inadequate amount of schema information available. Rather than requiring identical strings to match exactly, a method according to the present invention compares patterns identified in the data. This provides a fuzzy matching feature, as the data records need not be arranged in the same order, need not be present in the same quantity, and need not be exactly the same. An indication of similarity is given by a closeness value calculated by comparing the rules generated from both database attributes.
According to a second aspect of the present invention there is provided a method of determining the similarity between a first data series and a second data series, wherein the first data series and the second data series have been generated from a respective first sequence of symbols and second sequence of symbols, and a similarity value is generated which indicates the similarity between the set of rules comprising the first data series and the set of rules comprising the second data series.
The similarity value may have a value of 0%, indicating that there are no rules present in the first data series that are present in the second data series. The similarity value may have a value of 100%, indicating that i) the first data series comprises the same rules as those present in the second data series; and ii) each rule present in the first data series is present the same number of times in the first data series as in the second data series.
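The two boundary cases described above can be illustrated with a minimal sketch. The formula used here, a multiset overlap ratio, is an assumption chosen only because it reproduces the 0% and 100% cases; this section does not specify how intermediate values are calculated. The function name `rule_similarity` is hypothetical, and each rule is assumed to be represented by a hashable value such as a string or tuple.

```python
from collections import Counter


def rule_similarity(rules_a, rules_b):
    """Return a similarity percentage between two collections of rules.

    Assumed measure: shared rule occurrences divided by the larger count
    of each rule, summed over all rules (a multiset overlap ratio).
    """
    a, b = Counter(rules_a), Counter(rules_b)
    shared = sum((a & b).values())  # per-rule minimum of the two counts
    total = sum((a | b).values())   # per-rule maximum of the two counts
    if total == 0:
        return 0.0  # assumption: two empty rule sets are treated as 0% similar
    return 100.0 * shared / total
```

With this measure, two data series containing the same rules the same number of times score 100% regardless of the order of the rules, and two data series with no rules in common score 0%.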
According to a third aspect of the present invention there is provided a computer program product, comprising computer executable code for performing a method as described above.
According to a fourth aspect of the present invention there is provided an apparatus configured to perform a method as described above.