The Information Age, also known as the Digital Age or Computer Age, is characterized by the ability to generate, process, transfer, and share information in a negligible amount of time. The Information Age is also defined by the concealment of sensitive or confidential information, which can be protected from unauthorized access or public leakage.
In business environments, such as the healthcare or financial industries, sensitive or confidential information can be distributed based on a user's privilege to access and view the information. For example, information that is sensitive or confidential to a business service, including names, addresses, and social security numbers, can be contained in structured relational databases, in unstructured content repositories, or in both. Structured information is information that is already organized in fields, such as “data”, “title”, “subject”, “unit price”, “quantity”, “total price”, or “commission percentage”.
Further, structured information can be stored in a record of a relational database table. In addition, when information is structured in a relational database table, for example, in columns and rows similar to a spreadsheet, it is usually relatively easy to search the structured information in the relational database. On the other hand, unstructured information refers to information that does not have a pre-defined data model, does not fit well into relational tables, or both. Examples of unstructured information may include books, journals, documents, metadata, health records, audio, video, files, and unstructured text such as the body of an e-mail message, Web page, or word processor document. Further, while the main content being conveyed in unstructured information does not have a defined structure, it generally comes packaged in objects (e.g., in files or documents) that themselves have structure. Such objects are thus a mix of structured and unstructured information, but collectively this is still referred to as unstructured information. For example, an HTML web page is tagged, but HTML mark-up is typically designed solely for rendering. It does not capture the meaning or function of tagged elements in ways that support automated processing of the information content of the page.
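The contrast between searching structured and unstructured information can be illustrated with a minimal sketch. The table name `customers`, its columns, the sample record, and both helper functions are hypothetical and are used here only for illustration.

```python
import sqlite3


def find_in_structured(conn, ssn):
    """Look up a name by social security number using a field-level query."""
    row = conn.execute(
        "SELECT name FROM customers WHERE ssn = ?", (ssn,)
    ).fetchone()
    return row[0] if row else None


def find_in_unstructured(text, ssn):
    """Free text has no fields to query, so fall back to a substring search."""
    return ssn in text


# Hypothetical structured record (made-up values).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (name TEXT, ssn TEXT)")
conn.execute("INSERT INTO customers VALUES (?, ?)", ("Jane Doe", "000-12-3456"))

# Hypothetical unstructured content carrying the same sensitive value.
email_body = "Please update the file for Jane Doe (SSN 000-12-3456) today."

name = find_in_structured(conn, "000-12-3456")
found = find_in_unstructured(email_body, "000-12-3456")
```

In the structured case the query names the field that holds the value. In the unstructured case the search only reports that the characters occur somewhere in the text, with no indication of what they mean.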
Thus, the context of unstructured information results in irregularities and ambiguities that make it difficult for relational database engines to understand the unstructured information. Further, sensitive information present in unstructured content repositories that is also present in structured relational databases cannot be easily protected from leakage to a user who might not have the access or privilege to view the information. Therefore, sensitive information present in unstructured content repositories needs to be redacted or sanitized to prevent information leakage from the unstructured document, while also taking into consideration that the information may be in a structured relational database.
Current solutions that attempt to address these problems are typically focused on redaction of documents based on manually defined static dictionaries. For example, in Chad Cumby, Rayid Ghani, “A Machine Learning Based System for Semi-Automatically Redacting Documents” (2011), Proceedings of the 23rd Annual Conference on Innovative Applications of Artificial Intelligence (IAAI), the authors attempt to improve document redaction by semi-automatically redacting information in documents using machine learning techniques and standard natural language processing (NLP) algorithms. Further, specific current solutions involve redaction of information based on dictionaries of protected entities, i.e., explicit values. For example, commonly owned U.S. Pat. No. 7,831,571 B2 describes redaction of documents based on exploitation of a database of entities to identify pre-defined terms to be removed from the document.
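Redaction based on a static dictionary of explicit values can be sketched as follows. This is a simplified illustration only and is not the method of either cited reference; the function name `redact_with_dictionary`, the mask string, and the sample terms are assumptions introduced for the example.

```python
import re


def redact_with_dictionary(text, protected_terms, mask="[REDACTED]"):
    """Replace every occurrence of a protected term with a mask.

    protected_terms is a manually defined, static collection of explicit
    values (hypothetical here), matched case-insensitively.
    """
    if not protected_terms:
        return text
    # Try longer terms first so that a term is not shadowed by its own prefix.
    ordered = sorted(protected_terms, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(t) for t in ordered), re.IGNORECASE)
    return pattern.sub(mask, text)


# Hypothetical dictionary of protected entities.
dictionary = {"Jane Doe", "12 Elm St"}

redacted = redact_with_dictionary("Jane Doe lives at 12 Elm St.", dictionary)
missed = redact_with_dictionary("J. Doe called about the account.", dictionary)
```

Because the dictionary holds only explicit values, the variant "J. Doe" in the second call is left untouched, which illustrates a limitation of purely static dictionaries.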