After more than two decades of electronic data automation and improved ability to capture data from a variety of communication channels and media, even small enterprises find that they regularly process terabytes of data. Moreover, mining, analyzing, and processing that data have become extremely complex.
Updating, mining, analyzing, reporting on, and accessing enterprise information can be problematic because of the sheer volume of the information and because the information is often dispersed over a variety of different file systems, databases, and applications. In fact, the data and processing can be geographically dispersed across the globe. When processing against the data, communication may need to reach every node, or only select nodes that are dispersed over the network.
Collecting, indexing, and managing data from a variety of sources and in a variety of formats is challenging for any enterprise because the data fields in one source may differ from those in another source, or several fields in one source may correspond to a single field in another. To deal with this, enterprises often spend considerable time and resources manually analyzing the data sources and then converting those sources into a normalized format.
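As a minimal sketch of what such a conversion can involve, the following Python example renames the fields of two sources to one shared schema. The source names (`crm`, `billing`), the field names, and the rule of joining several source fields into one normalized field are all assumptions made for illustration; none of them comes from a particular system.

```python
# Hypothetical field mappings: every source and field name here is
# invented for illustration only.
FIELD_MAPS = {
    "crm": {"cust_name": "name", "e_mail": "email"},
    "billing": {"first": "name", "last": "name", "EmailAddress": "email"},
}


def normalize(source, record):
    """Rename a source record's fields to the shared schema.

    When several source fields map to one normalized field, their
    values are joined with a space in the order they appear.
    """
    mapping = FIELD_MAPS[source]
    normalized = {}
    for field, value in record.items():
        target = mapping.get(field)
        if target is None:
            continue  # unmapped fields are dropped in this sketch
        if target in normalized:
            normalized[target] = f"{normalized[target]} {value}"
        else:
            normalized[target] = value
    return normalized


crm_row = normalize(
    "crm", {"cust_name": "Ada Lovelace", "e_mail": "ada@example.com"}
)
billing_row = normalize(
    "billing",
    {"first": "Ada", "last": "Lovelace", "EmailAddress": "ada@example.com"},
)
```

Here both rows reduce to the same normalized record even though one source stores the name in a single field and the other splits it across two. In practice, building and maintaining mapping tables like `FIELD_MAPS` by hand is the manual effort described above.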
Even when an enterprise does the above work, the managed data may still not be associated with comprehensive records that avoid duplication. Such duplication can affect the accuracy of the data and of the results obtained by mining the data. Some enterprises employ additional resources to ensure that data duplication is detected and corrected. These resources may work full time cleaning the data that an enterprise receives and processes each day.
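To illustrate the kind of cleaning involved, the following Python sketch flags records that collapse to the same comparison key. The record layout (`name`, `email`) and the matching rule (lowercasing and collapsing whitespace) are assumptions chosen for the example; real duplicate detection usually needs fuzzier matching than this.

```python
def dedupe_key(record):
    """Build a comparison key from a record.

    Lowercasing and whitespace collapsing are assumed matching rules,
    not rules taken from any particular system.
    """
    name = " ".join(record.get("name", "").lower().split())
    email = record.get("email", "").strip().lower()
    return (name, email)


def find_duplicates(records):
    """Return (unique_records, duplicates), keeping the first record per key."""
    seen = {}
    duplicates = []
    for record in records:
        key = dedupe_key(record)
        if key in seen:
            duplicates.append(record)
        else:
            seen[key] = record
    return list(seen.values()), duplicates


records = [
    {"name": "Ada Lovelace", "email": "ada@example.com"},
    {"name": "ada  lovelace", "email": "ADA@example.com "},
    {"name": "Alan Turing", "email": "alan@example.com"},
]
unique, duplicates = find_duplicates(records)
```

In this sketch the second record is reported as a duplicate of the first because the two differ only in letter case and spacing. Exact-key matching of this kind misses misspellings and partial records, which is one reason duplicate cleanup often remains a manual task.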