The present invention relates to data cleansing, and in particular, to a system and method of data cleansing using rule-based formatting.
Extract, transform, and load (ETL) refers to a set of processes that may be performed as part of managing databases. In the extract component, a subset of desired data may be extracted from various data sources. In the transform component, the extracted data may be converted into a suitable state. Finally, in the load component, the transformed data may be transferred to a target data source such as another database, a data mart, or a data warehouse. Thus, ETL allows data extracted from various data sources to be converted into a desired format and transferred to another data source.
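The three ETL components described above can be illustrated with a minimal sketch. It assumes in-memory lists of records stand in for the data sources and the target; the function names `extract`, `transform`, and `load`, the field names, and the sample records are hypothetical and chosen only for illustration.

```python
def extract(sources, wanted_fields):
    """Extract: pull only the desired fields from each record of each source."""
    for source in sources:
        for record in source:
            yield {field: record.get(field) for field in wanted_fields}


def transform(records):
    """Transform: bring string values into a consistent state (trimmed, title case)."""
    for record in records:
        yield {
            key: value.strip().title() if isinstance(value, str) else value
            for key, value in record.items()
        }


def load(records, target):
    """Load: transfer the transformed records into the target store."""
    target.extend(records)
    return target


# Two hypothetical data sources with inconsistent formatting.
sources = [
    [{"name": " alice ", "age": 30, "extra": 1}],
    [{"name": "BOB", "age": 25}],
]

# A plain list stands in for the target database or data warehouse.
warehouse = load(transform(extract(sources, ["name", "age"])), [])
```

In this sketch the `extra` field is dropped during extraction, and the two name values end up in the same format in the target.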
Data cleansing may be performed in the transform component of ETL. Data cleansing may include detecting incorrect data, which may then be corrected or removed, and formatting the data. The detection of incorrect data may be accomplished by tokenizing the data and parsing the tokens according to predetermined rules (i.e., rule-based parsing). When formatting the output data, it may be desirable to control how the parsed components are ordered and what strings delimit them. However, when a rule-based parsing technique is used to parse the data, it may be difficult to control how the parsed components are ordered or what strings delimit them.
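The tokenizing, rule-based parsing, and formatting steps discussed above can be sketched as follows. This is only an illustration under stated assumptions, not the claimed method: the rule table `RULES`, the labels `TITLE`, `NUMBER`, and `WORD`, and the helper names are all hypothetical. It shows the kind of control that may be desired at the formatting step, namely the order of the parsed components and the string that delimits them.

```python
import re

# Hypothetical classification rules: (label, pattern) pairs tried in order.
RULES = [
    ("TITLE", re.compile(r"^(mr|mrs|ms|dr)\.?$", re.IGNORECASE)),
    ("NUMBER", re.compile(r"^\d+$")),
    ("WORD", re.compile(r"^[A-Za-z]+$")),
]


def tokenize(text):
    """Split the input into tokens on whitespace, treating commas as separators."""
    return text.replace(",", " ").split()


def classify(token):
    """Return the label of the first rule whose pattern matches the token."""
    for label, pattern in RULES:
        if pattern.match(token):
            return label
    return "UNKNOWN"


def parse(text):
    """Rule-based parsing: pair every token with the label its rule assigns."""
    return [(classify(token), token) for token in tokenize(text)]


def format_output(parsed, order, delimiter):
    """Emit parsed components grouped by label in the requested order."""
    parts = []
    for label in order:
        parts.extend(token for lab, token in parsed if lab == label)
    return delimiter.join(parts)


parsed = parse("smith, dr 42")
formatted = format_output(parsed, ["TITLE", "WORD", "NUMBER"], " | ")
```

Here the caller supplies both the component order and the delimiter, so the output layout does not depend on the order in which the tokens appeared in the input.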
Thus, there is a need for improved data cleansing that allows control over the formatting of output data when rule-based parsing is used. The present invention solves these and other problems by providing a system and method of data cleansing using rule-based formatting.