1. Technical Field
The present invention relates to the field of data mining and establishing patterns in data.
2. Discussion of the Related Art
Enterprises store significant quantities of data as information assets. However, this data is often in the form of free text and is of poor quality. In order to increase the quality and usefulness of the data, the data is standardized by employing rule based data standardization systems in which domain experts manually code rules for handling important and prevalent patterns.
A lexicon may be composed for establishing patterns in text data. Consider, for example, a fictitious noisy record such as “256 B Smith Towers HL Road Somecity 45”. This record may be represented with the following expression referred to as the following pattern: (^++R+SC^), where “^” is a marker representing a number (e.g., “256” and “45”), “+” is a marker representing unknown text (e.g., “B Smith” and “HL”), and “R”, “S” and “C” are markers representing a building (e.g., “Towers”), a street (e.g., “Road”) and a city (e.g., “Somecity”). The text data is typically represented in a manner such as this in order to identify various semantic entities and also to identify and correct mistakes (also referred to as standardization of text) or missing text. For example the above text is segmented into various components such as door number (256 B), building name (SMITH), and building type (TOWERS), StreetName(HL), Street type(ROAD), CITY (SOMECITY) and PIN (45). To identify such segments from the text data as above one has to identify the important sub-patterns from the input text which represent a single semantic element. For example, the sub-pattern “^+” identifies the door number, “+R” represents the building information of which first half represents the building name and the second half represents the building type. Similarly, other sub-patterns for Street information, city and pin are “+S”, “C”, and “^” respectively.
Finding patterns in text can be laborious and time consuming, particularly for noisy or highly specialized data sets such as the previous example. In particular, domain experts must hand craft the pattern rules, and this can be a very time consuming and costly process. Finding such patterns can also be subjective to the persons determining the patterns.