The disclosed technology relates to the fields of cryptography and document processing.
There are a number of commercial products for supporting legal discovery. Some products use natural language processing to cluster or categorize documents and to detect cumulative or duplicate documents. These products identify entities within a document. In some products, a user then manually selects which entities are to be redacted from the document. Other products can use rules to help redact identified entities and other personal or sensitive information. While these products reduce the time required to produce documents, they still require that, for each discovery request, the data gatekeeper process the documents that contain sensitive information and redact the information for which the requesting entity is not authorized.
Content processing technologies exist to facilitate content indexing and duplicate identification. Technology also exists to redact, or remove, content from documents. The goal of these technologies is to index content, facilitate content search and thus to facilitate removing the searched-for content from the documents.
The existing technology does not allow “in-document” redaction. Either a paper copy or an image of a paper copy is provided that has the sensitive information blocked out, or an electronic document is redacted by deleting the sensitive information from the file. One resulting problem is that, because multiple parties have different access rights and because those access rights change over time, the document owner must carefully control what is redacted for each party based on that party's access rights. Due to the sheer manual labor and bookkeeping involved, mistakes are made. What is needed is a way for documents that contain sensitive information to be provided only once, together with a simple but secure method of revealing the content of a document based on the access rights given to the party.
Another problem that needs to be addressed is that of mistakenly delivering a partially redacted document to the wrong party (for example, through a post office mistake or a mailroom error). Yet another problem is that of determining which documents in a document collection, or which portions of a document, contain specific sensitive information.
It would be advantageous to provide a technology that would allow reversible redaction of electronic documents.
In accordance with the disclosure herein, a computer controlled method, apparatus and computer program product therefor, generates a selectively encrypted data unit from an unencrypted data unit. The method includes: accessing a list of attributes related to the unencrypted data unit; accessing the unencrypted data unit, the unencrypted data unit comprising a sequence of data; identifying sensitive information within the sequence of data associated with one or more of the list of attributes; selecting a protection key, the protection key responsive to a random number; computing a plurality of auxiliary values responsive to the one or more of the list of attributes and the random number; encrypting the sensitive information with the protection key to create an encrypted version of the sensitive information, the encrypted version associated with the plurality of auxiliary values; linking an attribute vector with the encrypted version, the attribute vector responsive to the one or more of the list of attributes associated with the encrypted version; and storing, as the selectively encrypted data unit, data from the unencrypted data unit and the encrypted version of the sensitive information.
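The steps recited above can be illustrated with a minimal Python sketch. It is not the disclosed implementation. The attribute list, the regular-expression detectors, the per-attribute keys, the `protect` and `reveal` functions, and the HMAC-based masking used in place of a production cipher are all assumptions introduced only for illustration. In particular, the disclosure does not say how the auxiliary values are computed. Here they are assumed to be a nonce and a copy of the random number masked under the key of the associated attribute.

```python
import hashlib
import hmac
import re
import secrets

# Hypothetical attribute list and detectors; a real system might use NLP or rules.
ATTRIBUTES = ["name", "ssn"]
PATTERNS = {
    "name": re.compile(r"Alice Smith"),
    "ssn": re.compile(r"\d{3}-\d{2}-\d{4}"),
}


def _mask(key: bytes, label: bytes, length: int) -> bytes:
    """Derive `length` pseudorandom bytes (HMAC-SHA256 in counter mode).

    Toy stand-in for a vetted cipher; it provides no integrity protection.
    """
    out = b""
    counter = 0
    while len(out) < length:
        out += hmac.new(key, label + counter.to_bytes(4, "big"), hashlib.sha256).digest()
        counter += 1
    return out[:length]


def _xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def protect(text: str, attribute_keys: dict) -> list:
    """Build a selectively encrypted data unit: plain strings plus protected records."""
    # Identify sensitive information associated with the listed attributes.
    spans = sorted(
        (m.start(), m.end(), attr)
        for attr, pattern in PATTERNS.items()
        for m in pattern.finditer(text)
    )
    unit, pos = [], 0
    for start, end, attr in spans:
        if start < pos:  # skip spans that overlap one already protected
            continue
        unit.append(text[pos:start])
        r = secrets.token_bytes(32)                 # the random number
        protection_key = hashlib.sha256(r).digest() # key responsive to r
        plaintext = text[start:end].encode("utf-8")
        ciphertext = _xor(plaintext, _mask(protection_key, b"data", len(plaintext)))
        # Assumed auxiliary values: a nonce and r masked under the attribute's key.
        nonce = secrets.token_bytes(16)
        wrapped_r = _xor(r, _mask(attribute_keys[attr], nonce, len(r)))
        unit.append({
            "attribute_vector": [int(a == attr) for a in ATTRIBUTES],
            "auxiliary": (nonce, wrapped_r),
            "ciphertext": ciphertext,
        })
        pos = end
    unit.append(text[pos:])
    return unit


def reveal(unit: list, granted_keys: dict) -> str:
    """Reverse the redaction only for attributes whose keys the party holds."""
    parts = []
    for segment in unit:
        if isinstance(segment, str):
            parts.append(segment)
            continue
        attr = ATTRIBUTES[segment["attribute_vector"].index(1)]
        if attr not in granted_keys:
            parts.append("[REDACTED]")
            continue
        nonce, wrapped_r = segment["auxiliary"]
        r = _xor(wrapped_r, _mask(granted_keys[attr], nonce, len(wrapped_r)))
        protection_key = hashlib.sha256(r).digest()
        ciphertext = segment["ciphertext"]
        parts.append(_xor(ciphertext, _mask(protection_key, b"data", len(ciphertext))).decode("utf-8"))
    return "".join(parts)
```

In this sketch the same stored unit can be handed to every party. A party holding only the hypothetical `name` key recovers names but sees `[REDACTED]` in place of each protected number, which is the reversible, access-dependent redaction the summary describes.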