Data duplication and proliferation within enterprises are a problem. Typically, enterprises manage data with a heterogeneous infrastructure of different data management systems (e.g., Relational Database Management Systems, Document/Content Management Systems, file servers, and Network-Attached Storage). These systems often require data to be replicated and stored as several copies because different applications require data to conform to different schemas. A schema describes the structure of the data in a database; for example, a schema can contain tables, which in turn contain named columns. Additionally, data may be duplicated to improve performance and reliability within a system. However, all of this duplication contributes to a proliferation of data over time. As a result, it has become increasingly difficult to provide a business-level overview of the data present in information technology (IT) infrastructures.
Currently, enterprises are not able to clearly determine what data assets exist and how that data is relevant to the business. This is in large part because duplicated data may acquire different names across an enterprise, depending upon the application that uses the data and/or the particular schema, and therefore the particular names, that the application requires. Consequently, it is difficult to reconcile data across an enterprise and provide a consolidated view of data assets.
Data classification may be used to categorize stored data within an enterprise. Data classification has traditionally been performed by individuals, which makes it a very time-consuming, error-prone, and expensive process. Furthermore, with enterprises having many, possibly hundreds of, databases, this task becomes impossible for individuals to carry out. At the same time, data labeling and classification are becoming increasingly important as enterprises attempt to understand the data they have and to comply with internal procedures as well as legislative requirements for protecting data.
The need to classify data continues to grow as reports of data losses circulate throughout enterprises and communities. Additionally, the need to protect certain classes of data containing private information increases as more unauthorized persons break through external firewalls and encounter little to no internal security measures. This is particularly relevant as businesses increase business process outsourcing and “offshore” storage of data, practices that require detailed audit trails to be kept.
Accordingly, there exists a need in the art to overcome the deficiencies and limitations described hereinabove.