A modern organization typically maintains a data storage system to store and deliver records concerning various significant business aspects of the organization. Stored records may include data on customers (or patients), contracts, deliveries, supplies, employees, manufacturing, or the like. An organization's data storage system usually uses a table-based storage mechanism, such as relational databases, client/server applications built on top of relational databases (e.g., Siebel, SAP, or the like), object-oriented databases, object-relational databases, document stores and file systems that store table-formatted data (e.g., CSV files, Excel spreadsheet files, or the like), password systems, single-sign-on systems, or the like.
Table-based storage systems typically run on a computer connected to a local area network (LAN). This computer is usually made accessible to the Internet via a firewall, router, or other packet-switching device. Although the connectivity of a table-based storage system to the network provides for more efficient utilization of the information it maintains, it also poses security problems due to the highly sensitive nature of this information. In particular, because access to the contents of the table-based storage system is essential to the job function of many employees in the organization, there are many points at which this information may be stolen or accidentally distributed. Theft of information represents a significant business risk in terms of both the value of the intellectual property and the legal liabilities related to regulatory compliance. Various search mechanisms have been used to detect theft of sensitive information, such as relational database search techniques, information retrieval techniques, file shingling techniques, and Internet content filtering techniques, which are known by those of ordinary skill in the art.
Also, as the volume of data continues to grow within organizations, data security personnel have little or no visibility into where sensitive information, such as confidential data, is stored across the enterprise. There are three fundamental challenges surrounding stored data: 1) quickly finding exposed confidential data wherever it is stored, 2) understanding who has unauthorized access to that data, and 3) fixing the exposed confidential data automatically or manually.
Sensitive information, such as confidential data, can be stored in a variety of data repositories maintained by different application systems such as Oracle® Relational Database Management System (RDBMS), Structured Query Language (SQL) servers, Lotus Domino® servers, Microsoft® Exchange servers, Microsoft® SharePoint servers, etc. These different application systems store the confidential data as structured data in a data repository (also referred to as a data store) according to different data schemas. For example, in a relational database, the data schema defines the tables, the fields in each table, and the relationships between fields and tables.
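By way of illustration only, the following sketch shows what a relational data schema of this kind may look like and how its tables, fields, and relationships can be read back from the database. It assumes an in-memory SQLite database accessed through Python's standard sqlite3 module. The table names (customers, contracts), the field names, and the helper describe_schema are hypothetical and are not taken from any particular application system named above.

```python
import sqlite3


def describe_schema(conn):
    """Return, per table, its field names and its references to other tables.

    Hypothetical helper for illustration; relies on SQLite's PRAGMA statements.
    """
    schema = {}
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
    for table in tables:
        # PRAGMA table_info rows: (cid, name, type, notnull, dflt_value, pk)
        fields = [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]
        # PRAGMA foreign_key_list rows: (id, seq, table, from, to, ...)
        references = [(row[3], row[2], row[4])
                      for row in conn.execute(f"PRAGMA foreign_key_list({table})")]
        schema[table] = {"fields": fields, "references": references}
    return schema


# Assumed example schema: two tables related through customer_id.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL,
    ssn         TEXT
);
CREATE TABLE contracts (
    contract_id INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customers(customer_id),
    amount      REAL
);
""")

schema = describe_schema(conn)
for table, info in schema.items():
    print(table, info["fields"], info["references"])
```

In this sketch, the schema is the only place that records that a field named ssn exists in the customers table, which is why a tool that inspects repository contents would ordinarily need access to it.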
In order to prevent malicious and unintentional data breaches, commercial and government regulations often impose restrictions on how confidential data may be stored, the format of confidential data, and who can access that confidential data. In order to comply with these regulations, companies create policies to govern how confidential data is stored in the various applications, in what format the confidential data is stored, and who can access that confidential data. However, since the confidential data is stored according to various types of data schemas, the policies cannot be generally applied to all of the different types of data repositories, but need to be customized and individually applied to each of the different types of data repositories. Also, since the confidential data is stored according to a specific data schema, conventional systems require specific knowledge of which data schema a particular data repository uses to store the data before they can search or index the content of that data repository for confidential data.
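The schema dependence described above can be sketched as follows, purely as an assumed example. Two hypothetical repositories hold the same kind of confidential value under different table and field names, so a search written for one schema fails against the other. The repository contents, the names employees, clients, ssn, and tax_id, the pattern SSN_PATTERN, and the helper find_ssns are all illustrative assumptions, again using Python's standard sqlite3 and re modules.

```python
import re
import sqlite3

# Assumed policy pattern: values shaped like a U.S. Social Security number.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def find_ssns(conn, table, field):
    """Scan one known field of one known table for SSN-shaped values.

    The caller must already know the table and field names, i.e. the schema.
    """
    hits = []
    for (value,) in conn.execute(f"SELECT {field} FROM {table}"):
        if value is not None:
            hits.extend(SSN_PATTERN.findall(str(value)))
    return hits


# Hypothetical repository 1: an HR system.
hr = sqlite3.connect(":memory:")
hr.execute("CREATE TABLE employees (name TEXT, ssn TEXT)")
hr.execute("INSERT INTO employees VALUES ('A. Example', '123-45-6789')")

# Hypothetical repository 2: a CRM system with a different schema.
crm = sqlite3.connect(":memory:")
crm.execute("CREATE TABLE clients (full_name TEXT, tax_id TEXT)")
crm.execute("INSERT INTO clients VALUES ('B. Example', '987-65-4321')")

print(find_ssns(hr, "employees", "ssn"))    # search written for the HR schema
print(find_ssns(crm, "clients", "tax_id"))  # must be rewritten for the CRM schema

try:
    # Reusing the HR-specific search against the CRM repository fails.
    find_ssns(crm, "employees", "ssn")
except sqlite3.OperationalError as err:
    print("schema mismatch:", err)
```

The same pattern is applied in both cases; only the schema-specific table and field names differ, which is the part that must be customized for each type of data repository.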
Various conventional methods and systems, such as e-Discovery tools and enterprise search tools, can scan and search different types of application data repositories for text data. The enterprise search tools typically do not apply policies to data at all, but are primarily used to retrieve documents based on simple legacy keyword searches or highly sophisticated conceptual querying. The e-Discovery tools typically allow an organization to search, identify, cull, collect, and process electronically stored information (ESI) across the organization, and then export the ESI to an attorney review platform. The e-Discovery tools can process documents according to specified policies to help manage and identify data for legal, regulatory, and investigative matters. However, the e-Discovery tools require specific knowledge of the particular data schema used by each of the data repositories in order to search, index, or retrieve the data. In addition, the e-Discovery tools are not directed at applying policies to stored data in order to detect policy violations in that data, but rather at searching, indexing, or retrieving specific data for legal, regulatory, and investigative review on another platform.