Companies and other organizations commonly maintain large quantities of data in different databases and in different formats (even within the same database). Data mastering—compiling diverse data into a single, unified database while eliminating duplication and errors—has become increasingly important in ensuring data quality and accessibility. A number of software-based tools are available for automated data mastering, such as Zoomix ONE, produced by Zoomix Data Mastering Ltd. (Jerusalem, Israel).
One of the challenges of data mastering is to convert unorganized text into orderly sets of attributes and values. For example, records in an enterprise database corresponding to different products may each contain a list of terms describing the corresponding product without indicating the product attribute that each term identifies. (For instance, a “description” field of the record may contain the terms “lamp,” “large” and “yellow,” listed as free text, without specifying that these terms are values of the respective attributes “product type,” “size” and “color.”) Normalization of the records, i.e., associating the terms with a common, standardized set of attributes, is important in enabling applications to search the records efficiently, as well as in data cleansing (detection and correction of data errors) and categorization.
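The lamp example can be made concrete with a minimal Python sketch. It assumes a small, hand-built lookup table from terms to attributes. The table `TERM_TO_ATTRIBUTE` and the function `normalize_description` are hypothetical names introduced only for illustration; they are not taken from Zoomix ONE or any other tool, and a real data-mastering system would need far more than a fixed dictionary.

```python
# Hypothetical lookup table: the attribute that each known term is a value of.
TERM_TO_ATTRIBUTE = {
    "lamp": "product type",
    "large": "size",
    "yellow": "color",
}


def normalize_description(description):
    """Turn a free-text description into attribute/value pairs.

    Returns a tuple (attributes, unrecognized): a dict mapping each
    attribute to the term found for it, and a list of terms that the
    lookup table does not cover.
    """
    attributes = {}
    unrecognized = []
    # Treat commas as separators, lowercase the text, and split on whitespace.
    for term in description.lower().replace(",", " ").split():
        attribute = TERM_TO_ATTRIBUTE.get(term)
        if attribute is None:
            unrecognized.append(term)
        else:
            attributes[attribute] = term
    return attributes, unrecognized


attributes, unrecognized = normalize_description("lamp, large, yellow")
print(attributes)    # {'product type': 'lamp', 'size': 'large', 'color': 'yellow'}
print(unrecognized)  # []
```

Once each record is reduced to the same set of attributes, a query such as "all products whose color is yellow" becomes a lookup on one field rather than a text search. Terms missing from the table are returned separately so that they can be reviewed rather than silently dropped.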