Large data sets representable in tabular form may have dozens, or even hundreds, of columns and millions or billions (or more) of rows of data. Often, the generation of the data set involves receiving information that was entered into many different computers by a many people (e.g., thousands, tens of thousands, or millions). Not surprisingly, such data sets include errors, for example, misspelled words, erroneous white spaces, incorrect placement of punctuation, incorrect data types, duplicate information (e.g., double entry), and inconsistent information.
At least due to the sheer size of the data set, manually correcting all of the errors in the data set is effectively an impossible task. Even if the errors could be identified, manually correcting all the errors would be extremely time-consuming, and overburden resources for data sets containing hundreds of millions of rows of data. However, if the data is not corrected, thousands of errors could occur during subsequent processing of the data, which would slow down or stop the processing, and it could also result discarding important data due to its inconsistencies. In addition, even if one large data set is corrected, second large data set will have similar problems and manual corrections would again be required to each incorrect data field.