Recent data suggests that nearly eighty-five percent of all data is found in computing files and that this data is growing annually at around sixty percent. One reason for the growth is that regulatory compliance acts, statutes, and the like (e.g., Sarbanes-Oxley, HIPAA, PCI) force companies to keep file data in an accessible state for extended periods of time. However, block-level operations in computers are too low-level to apply any meaningful interpretation to this stored data beyond taking snapshots and block de-duplication. While other business intelligence products have been introduced to provide capabilities greater than block-level operations, they have generally been limited to structured database analysis. They are much less meaningful when acting upon data stored in unstructured environments.
Unfortunately, entities the world over have paid enormous sums of money to create and store their data, but cannot find much of it later when it is arranged haphazardly or less than intuitively. Not only would locating this information bring back value, but being able to observe patterns in it might also prove valuable despite its usefulness being presently unknown. However, entities cannot expend so much time and effort in finding this data that the cost outweighs its usefulness. Notwithstanding this, there are still other scenarios, such as government compliance, litigation, audits, etc., that dictate certain data/information be found and produced, regardless of the cost in time, money and effort. Thus, a clear need is identified in the art to better find, organize and identify digital data, especially data left in unstructured states.
In search engine technology, large amounts of unrelated and unstructured digital data can be quickly gathered. However, most engines do little to organize the data other than give a hierarchical presentation. Also, when an engine finds duplicate versions of data, it offers few or no options for eliminating the replication or for migrating/relocating redundancies. Thus, a further need in the art exists to overcome the drawbacks of search engines.
When it comes to large amounts of data, whether structured or not, compression techniques have been devised to preserve storage capacity, reduce bandwidth during transmission, etc. Modern compression algorithms, however, simply exist to reduce large blocks of data into smaller blocks according to their advertised compression ratios. As is known, some do it without data loss (“lossless”) while others do it with some loss (“lossy”). None do it, unfortunately, with a view toward recognizing similarities in the data itself.
From biology, it is known that highly similar species have highly similar DNA strings. In the computing context, consider two word processing files relating to stored baseball statistics. In a first file, words might appear for a baseball batter, such as “batting average,” “on base percentage,” and “slugging percentage,” while a second file might have words for a baseball pitcher, such as “strikeouts,” “walks,” and “earned runs.” Conversely, a third file wholly unrelated to baseball, statistics or sports may have words such as “environmental protection,” “furniture,” or whatever comes to mind. It would be exceptionally useful if, during times of compression or upon later manipulation by an algorithm, “mapping” could recognize the similarity in subject matter in the first two files, although they are not exact to one another, and provide options to a user. Appreciating that the “words” in the example files are represented in the computing context as binary bits (1's or 0's), which occurs by converting the English alphabet into a series of 1's and 0's through application of ASCII encoding techniques, it would be further useful if the compression algorithm could first recognize the similarity in subject matter of the first two files at the level of raw bit data. The reason for this is that not all files have words; some might instead represent pictures (e.g., .jpeg) or spreadsheets of numbers.
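By way of illustration only, the conversion of a word into raw bits through ASCII encoding can be sketched as follows. The function name `to_bits` is hypothetical and is not drawn from any product or from the method described in this document; the sketch merely shows the bit-level representation upon which any similarity recognition would have to operate.

```python
def to_bits(text: str) -> str:
    """Return the ASCII encoding of `text` as groups of eight 1's and 0's."""
    # Each character becomes one byte; each byte is shown as 8 binary digits.
    return " ".join(format(byte, "08b") for byte in text.encode("ascii"))


# A word from the hypothetical pitcher file, as the computer stores it.
print(to_bits("walks"))
# 01110111 01100001 01101100 01101011 01110011
```

At this level, a word, a pixel of a picture, and a spreadsheet value are all simply runs of bits, which is why recognition at the level of raw bit data would not depend on the file containing words.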
Appreciating that certain products already exist in the above-identified market space, clarity on the need in the art is as follows. One, present-day “keyword matching” is limited to a select set of words that have been pulled from a document into an index for matching to the same exact words elsewhere. Two, “Grep” is a modern-day technique that searches one or more input files for lines containing an identical match to a specified pattern. Three, “Beyond Compare,” and similar algorithms, are line-by-line comparisons of multiple documents that highlight differences between them. Four, block-level data de-duplication has no application in compliance contexts, data relocation, or business intelligence.
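The limitation of exact matching and of line-by-line comparison can be illustrated with a minimal sketch. The helper names and the in-memory “files” below are assumptions made for illustration; the sketch is a simplified stand-in and does not reproduce the actual implementation of Grep, Beyond Compare, or any keyword index.

```python
from typing import List, Tuple


def grep_lines(pattern: str, lines: List[str]) -> List[str]:
    """Return the lines containing `pattern` exactly (a simplified grep-like search)."""
    return [line for line in lines if pattern in line]


def differing_lines(first: List[str], second: List[str]) -> List[Tuple[str, str]]:
    """Return the pairs of same-position lines that differ (a simplified line-by-line compare)."""
    return [(a, b) for a, b in zip(first, second) if a != b]


# Hypothetical contents of the two baseball files discussed earlier.
batter_file = ["batting average", "on base percentage", "slugging percentage"]
pitcher_file = ["strikeouts", "walks", "earned runs"]

# Exact matching finds the phrase only where it literally appears.
hits_in_batter = grep_lines("batting average", batter_file)
hits_in_pitcher = grep_lines("batting average", pitcher_file)

# Line-by-line comparison reports every line as different, although
# both files concern the same subject matter.
differences = differing_lines(batter_file, pitcher_file)
```

In this sketch the search finds nothing in the pitcher file and the comparison flags all three lines, so neither approach reveals that the two files are related in subject matter.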
The need in the art, on the other hand, is to serve advanced notions of identifying new business intelligence, conducting operations on completely unstructured or haphazard data and organizing it, providing new useful options to users, providing new user views, providing new encryption products, and identifying highly similar data, to name a few. As a byproduct, solving this need will create new opportunities in minimizing transmission bandwidth and storage capacity, among other things. Naturally, any improvements along such lines should contemplate good engineering practices, such as stability, ease of implementation, unobtrusiveness, etc.
Regarding the use of keys to encrypt and decrypt information, it is known to have “dual control” to guard access to important assets. For example, the concept of a two-key safe deposit box at a local banking institution is well understood. The bank retains one key, and the customer retains the other. To access the contents held in the box, both keys must be present and utilized. Not only does this protect the bank against allegations of improper access, it protects the box renter since a stolen key does not allow unilateral access. As another example, “dual control” typifies the two-person rule for handling launch codes for a nuclear missile. Only when two parties bring the correct keys to the lockbox can the box be opened and the codes revealed.
Regarding the keys themselves, encryption and decryption procedures require the use of secret information, e.g., the “key.” Common methods of controlling access to information in data security include single secret key encryption, sequential double encryption, and public key encryption. These methods provide less secure means of access control than can be realized using the two-key method to be described in this document.
In symmetric encryption with a single secret key, the same key is used for both encryption and decryption. Single key encryption methods provide data security, but do not provide the advantages of “dual control.”
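As a minimal sketch of the single-key property, consider a toy cipher that XORs each byte of the data with a repeating key. The cipher, the function name `xor_cipher`, and the key value are assumptions chosen only for brevity; the toy is not secure and is not the encryption of any particular product.

```python
from itertools import cycle


def xor_cipher(data: bytes, key: bytes) -> bytes:
    """Toy symmetric cipher: XOR each data byte with a repeating key (not secure)."""
    return bytes(b ^ k for b, k in zip(data, cycle(key)))


secret_key = b"k3y"  # hypothetical single secret key
ciphertext = xor_cipher(b"earned runs", secret_key)

# The very same key, applied the very same way, reverses the encryption.
plaintext = xor_cipher(ciphertext, secret_key)
```

Whoever holds the one key can both encrypt and decrypt unilaterally, which is why such a scheme offers no “dual control.”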
Sequential double encryption methods use two keys, but they do not require “dual control” because their keys are not used simultaneously. Furthermore, sequential double encryption requires that the order in which the keys are applied be known and preserved. The two-key decryption method described herein requires that both keys be simultaneously available at decryption time. There is no sequential use, so there is no ordering issue.
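The ordering issue of sequential double encryption can be illustrated with two toy layers that do not commute: byte-wise addition modulo 256 and byte-wise XOR. Both layers, the helper names, and the key values are assumptions made for illustration only; the sketch depicts the sequential approach and not the two-key method of this document.

```python
from itertools import cycle


def add_layer(data: bytes, key: bytes) -> bytes:
    """First toy layer: add the repeating key to each byte, modulo 256."""
    return bytes((b + k) % 256 for b, k in zip(data, cycle(key)))


def sub_layer(data: bytes, key: bytes) -> bytes:
    """Inverse of add_layer: subtract the repeating key, modulo 256."""
    return bytes((b - k) % 256 for b, k in zip(data, cycle(key)))


def xor_layer(data: bytes, key: bytes) -> bytes:
    """Second toy layer: XOR with the repeating key (its own inverse)."""
    return bytes(b ^ k for b, k in zip(data, cycle(key)))


key_a = bytes([1, 2, 3])  # hypothetical first key
key_b = bytes([3, 5, 7])  # hypothetical second key
message = b"ABC"

# Encrypt sequentially: key_a first, then key_b.
ciphertext = xor_layer(add_layer(message, key_a), key_b)

# Decryption must undo the layers in reverse order: key_b, then key_a.
right_order = sub_layer(xor_layer(ciphertext, key_b), key_a)

# Applying the same two keys in the wrong order fails to recover the message.
wrong_order = xor_layer(sub_layer(ciphertext, key_a), key_b)
```

Here `right_order` equals the original message while `wrong_order` does not, even though both attempts used the correct keys. Note also that each key is used at a separate moment, so the two key holders never need to act together.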
In a public-key encryption method, also referred to as asymmetric encryption, each user has both a public key and a private key. Applying an algorithm to a random number generates the two keys for each user. Encryption is performed with the public key, and decryption is done with the corresponding private key. A third party can certify the ownership of key pairs. The users' private keys are kept secret, while their public keys might be distributed widely. The two-key embodiment discussed below does not lend itself to public key and private key scenarios because both of its keys are used simultaneously to decrypt the data.
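For illustration, the public and private key roles can be sketched with textbook RSA, which is one well-known public-key algorithm and is named here as an example only, since this document does not specify a particular algorithm. The small fixed primes below are assumptions chosen for readability; a real system would derive its keys from large random primes and would apply padding.

```python
# Toy textbook RSA (illustration only; far too small to be secure).
p, q = 61, 53                # small fixed primes standing in for random ones
n = p * q                    # modulus, part of both the public and private key
phi = (p - 1) * (q - 1)      # Euler's totient of n
e = 17                       # public exponent, coprime with phi
d = pow(e, -1, phi)          # private exponent: modular inverse (Python 3.8+)


def encrypt(message: int) -> int:
    """Encrypt with the public key (e, n)."""
    return pow(message, e, n)


def decrypt(ciphertext: int) -> int:
    """Decrypt with the private key (d, n)."""
    return pow(ciphertext, d, n)


message = 65                 # a message encoded as an integer smaller than n
ciphertext = encrypt(message)
recovered = decrypt(ciphertext)
```

Anyone holding the widely distributed pair (e, n) can encrypt, but only the holder of the secret exponent d can decrypt, and that holder can do so alone.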
Accordingly, a further need in the art is identified to overcome the problems of traditional keys and their usage.