1. Field of the Invention
The present invention is related to cryptography, and, more particularly, to recovery of encrypted documents where the password is not available.
2. Description of the Related Art
Password recovery is an important area of research in information security. Many file formats today provide for encryption of the file content using a password entered by a user. For example, Microsoft Word, Microsoft Excel and Adobe Acrobat, as well as many others, provide an encryption scheme that uses a key derived from a password entered by a user. Recovering the encrypted document may be necessary in any number of circumstances.
For example, the password may have been created by a former employee, who is no longer available to open the document. Alternatively, the password may have been lost or forgotten, while the document still needs to be opened by the system administrator. Yet another situation involves the system administrator testing documents and their passwords to make sure that the users are not relying on relatively simple passwords, or easily discoverable passwords that rely on information about the users themselves (for example, the user's name, the user's spouse's name, their pet's names, their names spelled backwards, etc.). These are some of the circumstances where a system administrator may need to recover the password, without assistance from whoever chose that password.
One common encryption scheme is known as “stream cipher” encryption. In such a scheme, starting with a password, which is usually a string of alphanumeric characters, a specific predefined algorithm is used to generate a key, which is an N-bit binary value. Because of export control regulations, many applications restrict their keys to no more than 40 bits. This means that the total number of possible keys is 2^40. However, it should be remembered that stream cipher algorithms are not restricted to 40 bits; 64-bit keys are also frequently used, as are 56-bit, 96-bit and 128-bit keys (at times).
In cryptography, a stream cipher is a symmetric cipher in which the plaintext digits are encrypted one at a time, and in which the transformation of successive digits varies during the encryption. The encryption of each digit is dependent on the current state. In practice, the digits are typically single bits or bytes.
In a synchronous stream cipher, a stream of pseudo-random digits is generated independently of the plaintext and ciphertext messages, and then combined with the plaintext (to encrypt) or the ciphertext (to decrypt). In the most common form, binary digits are used (bits), and the keystream is combined with the plaintext using the exclusive OR operation (XOR). This is a binary additive stream cipher.
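The binary additive combination described above can be illustrated with a minimal sketch. The function name xor_bytes and the keystream bytes are assumptions chosen for illustration only; in an actual cipher the keystream comes from the cipher's pseudo-random generator rather than a fixed constant.

```python
def xor_bytes(data: bytes, keystream: bytes) -> bytes:
    """Combine data with a keystream of at least equal length using XOR."""
    if len(keystream) < len(data):
        raise ValueError("keystream must be at least as long as the data")
    return bytes(d ^ k for d, k in zip(data, keystream))


# Illustrative values only (assumed): a real keystream is produced by the cipher.
plaintext = b"ATTACK"
keystream = bytes([0x3A, 0x91, 0x5C, 0x07, 0xE2, 0x48])

ciphertext = xor_bytes(plaintext, keystream)   # encryption
recovered = xor_bytes(ciphertext, keystream)   # decryption is the same operation
print(ciphertext.hex(), recovered)
```

Because XOR is its own inverse, applying the same keystream a second time restores the plaintext, which is why encryption and decryption share one routine.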
RC4 is a common example of a stream cipher used in the encryption of many documents. Its description was anonymously posted on the Internet in 1994. RC4 generates a pseudorandom stream of bits (a “keystream”) which, for encryption, is combined with the plaintext using XOR; decryption is performed the same way. To generate the keystream, the cipher makes use of a secret internal state that consists of two parts:
A permutation of all 256 possible bytes (denoted “S” below).
Two 8-bit index-pointers (denoted “i” and “j”).
The permutation is initialized with a variable length key, typically between 40 and 256 bits, using a key-scheduling algorithm. Once this has been completed, the stream of bits is generated using the pseudo-random generation algorithm (PRGA).
The key-scheduling algorithm is used to initialize the permutation in the array “S”. “L” is defined as the number of bytes in the key and can be in the range 1≤L≤256, typically between 5 and 16, corresponding to a key length of 40-128 bits. First, the array “S” is initialized to the identity permutation. S is then processed for 256 iterations in a manner similar to the main PRGA, while also mixing in bytes of the key:
for i from 0 to 255
    S[i] := i
j := 0
for i from 0 to 255
    j := (j + S[i] + key[i mod keylength]) mod 256
    swap(S[i], S[j])
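The key-scheduling pseudocode above, together with the PRGA, can be sketched in runnable form as follows. The PRGA loop is not spelled out in this description; the version shown is the commonly published form of RC4, and the function names rc4_ksa, rc4_keystream and rc4_crypt are assumptions made for this illustration.

```python
def rc4_ksa(key: bytes) -> list:
    """Key-scheduling algorithm: build the permutation S from the key."""
    if not 1 <= len(key) <= 256:
        raise ValueError("key length must be between 1 and 256 bytes")
    S = list(range(256))          # identity permutation
    j = 0
    for i in range(256):
        j = (j + S[i] + key[i % len(key)]) % 256
        S[i], S[j] = S[j], S[i]   # swap(S[i], S[j])
    return S


def rc4_keystream(key: bytes, n: int) -> bytes:
    """PRGA (commonly published form): produce n keystream bytes."""
    S = rc4_ksa(key)
    i = j = 0
    out = bytearray()
    for _ in range(n):
        i = (i + 1) % 256
        j = (j + S[i]) % 256
        S[i], S[j] = S[j], S[i]
        out.append(S[(S[i] + S[j]) % 256])
    return bytes(out)


def rc4_crypt(key: bytes, data: bytes) -> bytes:
    """Encrypt or decrypt: XOR the data with the keystream."""
    keystream = rc4_keystream(key, len(data))
    return bytes(d ^ k for d, k in zip(data, keystream))


ciphertext = rc4_crypt(b"Key", b"Plaintext")
print(ciphertext.hex())                   # bbf316e8d940af0ad3 (widely cited test vector)
print(rc4_crypt(b"Key", ciphertext))      # b'Plaintext'
```

The sketch is intended only to make the state and its update concrete; it is not hardened for production use.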
A number of conventional methods are known for recovering passwords. One conventional method involves using certain heuristics based on known information about the user. This can include such things as the user's first name, last name, spouse's or children's names, parents' names, Social Security numbers, names spelled backwards, pets' names, city names, addresses, etc. Since the amount of such information is relatively small (and, in any event, much smaller than the theoretical number of keys, 2^40), a user's reliance on such obvious passwords results in a very quick password recovery.
Where such heuristic approaches fail, the next step might involve using a dictionary. For example, the English language contains approximately 20,000 words, including variations on the spelling, conjugations of words, singular/plural, etc. The applications that encrypt their files, such as MS Word or Adobe Acrobat, provide functions for checking the validity of the password. In other words, it is not necessary to invoke the entire application in an attempt to open the document itself; it is possible to simply call that function with each candidate password as an argument and receive success or failure as the return value.
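The heuristic and dictionary stages described above can be sketched as a single loop over candidate sources. Everything named here is hypothetical: make_checker stands in for an application's password-validation function (modeled, as an assumption, by a SHA-256 comparison), and heuristic_candidates and recover are illustrative helpers, not part of any real product's interface.

```python
import hashlib
from typing import Callable, Iterable, Iterator, Optional


def make_checker(secret: str) -> Callable[[str], bool]:
    """Hypothetical stand-in for an application's password-validation function."""
    digest = hashlib.sha256(secret.encode("utf-8")).digest()
    return lambda candidate: (
        hashlib.sha256(candidate.encode("utf-8")).digest() == digest
    )


def heuristic_candidates(facts: Iterable[str]) -> Iterator[str]:
    """Derive simple guesses from known facts about the user."""
    for fact in facts:
        yield fact           # as given
        yield fact.lower()   # lower-cased
        yield fact[::-1]     # spelled backwards


def recover(check: Callable[[str], bool],
            *sources: Iterable[str]) -> Optional[str]:
    """Try each source in order; return the first candidate accepted by check."""
    for source in sources:
        for candidate in source:
            if check(candidate):
                return candidate
    return None


facts = ["Alice", "Rex"]            # assumed user information
wordlist = ["apple", "river"]       # assumed (tiny) dictionary
check = make_checker("xeR")         # the pet's name spelled backwards
print(recover(check, heuristic_candidates(facts), wordlist))
```

Ordering the sources from smallest to largest reflects the sequence in the text: the cheap user-specific guesses are exhausted before the dictionary is consulted.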
The use of such built-in password checking functions (especially as stand-alone code) has been optimized to a point where there are no additional savings to be gained from this approach. If the word selected by the user as a password is in the dictionary, then it can be predicted with a reasonable degree of certainty how many processor instructions per password need to be executed and, from that, the average time to find the valid password.
In the event that the dictionary approach fails, combinations of letters and numbers can be tried, for example, “aaa,” “aab,” “aac,” etc. However, for six-character or eight-character (or longer) passwords, such an approach becomes prohibitively time-consuming. In the case of an eight-character English-language password, where upper and lower case letters are treated as different symbols and the ten digits are included, the total number of possible passwords is (2×26+10)^8, or roughly 2.2×10^14. This is a very large number, and, given current technology, testing all of them will take a very long time. Languages with more than 26 letters would result in more possible passwords.
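The growth of the password space can be checked with a short calculation. The helper name keyspace is an assumption made for this illustration; the alphabet is the one described above (upper and lower case English letters plus ten digits).

```python
import string


def keyspace(alphabet_size: int, length: int) -> int:
    """Number of distinct passwords of exactly `length` symbols."""
    return alphabet_size ** length


alphabet = string.ascii_lowercase + string.ascii_uppercase + string.digits
assert len(alphabet) == 2 * 26 + 10   # 62 symbols

for length in (3, 6, 8):
    print(length, keyspace(len(alphabet), length))

# Compare the eight-character password space with a 40-bit key space.
ratio = keyspace(len(alphabet), 8) / 2 ** 40
print(f"8-character passwords / 40-bit keys = {ratio:.1f}")
```

For eight characters the count is 218,340,105,584,896, which is roughly 200 times larger than the number of 40-bit keys.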
An alternative brute force approach is to focus not on the passwords, but directly on the keys. Microsoft Word, Adobe Acrobat and other similar programs have a function that allows one to test whether the key submitted to it is the valid one. By going through all the possible keys, of which, in the 40-bit scheme, there are 2^40, eventually a correct key will be found. The stream cipher encryption algorithms guarantee that for any possible password, there is at least one valid 40-bit key (and possibly more than one). This is a scheme that is, in fact, frequently used. Given current technology, on a typical desktop, in 2006, with a 40-bit key, it would take approximately two weeks, on average, to find the valid key.
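A key-directed search of this kind can be sketched on a deliberately reduced key space. The following is an illustration under stated assumptions, not the validation routine of any actual application: key_is_valid is a hypothetical stand-in that compares a decrypted prefix against known plaintext, the key space is cut to 12 bits so that the loop finishes quickly, and the rate estimate assumes that "two weeks on average" corresponds to testing half of the 2^40 keys.

```python
from typing import Optional


def rc4_keystream(key: bytes, n: int) -> bytes:
    """RC4 in its commonly published form: KSA followed by n PRGA steps."""
    S = list(range(256))
    j = 0
    for i in range(256):
        j = (j + S[i] + key[i % len(key)]) % 256
        S[i], S[j] = S[j], S[i]
    i = j = 0
    out = bytearray()
    for _ in range(n):
        i = (i + 1) % 256
        j = (j + S[i]) % 256
        S[i], S[j] = S[j], S[i]
        out.append(S[(S[i] + S[j]) % 256])
    return bytes(out)


def key_is_valid(key: bytes, ciphertext: bytes, known_prefix: bytes) -> bool:
    """Hypothetical key check: does decryption reproduce the known prefix?"""
    keystream = rc4_keystream(key, len(known_prefix))
    return bytes(c ^ k for c, k in zip(ciphertext, keystream)) == known_prefix


def search_keys(ciphertext: bytes, known_prefix: bytes,
                key_bits: int) -> Optional[bytes]:
    """Try every key in a key_bits-bit space, in numeric order."""
    n_bytes = (key_bits + 7) // 8
    for value in range(2 ** key_bits):
        key = value.to_bytes(n_bytes, "big")
        if key_is_valid(key, ciphertext, known_prefix):
            return key
    return None


# Toy setup (assumed values): a 12-bit key and a known 12-byte prefix.
secret_key = (0x9A7).to_bytes(2, "big")
prefix = b"known-header"
ciphertext = bytes(p ^ k for p, k in
                   zip(prefix, rc4_keystream(secret_key, len(prefix))))

found = search_keys(ciphertext, prefix, key_bits=12)
print(found.hex() if found else None)

# Rate implied by "two weeks on average" for 40-bit keys (half the space).
keys_per_second = 2 ** 39 / (14 * 24 * 3600)
print(f"about {keys_per_second:,.0f} keys per second")
```

Each additional key bit doubles the loop count, which is why the same search that is instantaneous at 12 bits occupies a desktop machine for weeks at 40 bits.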
This has significant implications for the field of password protection and document recovery. A task that requires a dedicated computer for one or two weeks is not one that will be undertaken routinely. Solutions that attempt to use parallel processing, such as using multiple computers to solve this problem, where each computer only works on some portion of the total task, are possible but relatively expensive. Furthermore, such solutions (e.g., server farms, parallel processing supercomputers, etc.) are clearly outside the realm of individual users or small businesses, but are, in practice, restricted to the corporate environment, primarily due to cost considerations.
Another alternative was proposed three years ago, based on the 1980 work of Martin Hellman, see http://lasecwww.ephfl.ch/pub/lasec/doc/Oech03.pdf. This approach involves what is sometimes referred to as the “Rainbow Table.” A Rainbow Table is a special type of lookup table that is constructed by placing a plaintext password entry in a chain of keys and ciphertexts, generated by a one-way function, such as a hash. The end result is a table that offers a statistically high probability of revealing a password within a short period of time, generally less than a minute. The success probability of the table depends on the parameters used to generate it. These include the character set used, password length, chain length, and table count.
Success probability is defined as the probability that the plaintext can be found for a given ciphertext. In the case of passwords, the password is the plaintext, and the hash of the password is the ciphertext, so the success probability is the probability that the original password can be recovered from the password hash. Tables are specific to the hash function they were created for, e.g., MD5 tables can only crack MD5 hashes. Rainbow Tables for a variety of character sets and hashing algorithms have been developed, including LM hash, MD5, SHA1, etc.
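The chain construction and the notion of success probability can be made concrete with a toy table. This is a minimal sketch under stated assumptions, not the construction used by any particular tool: the password space is limited to three-digit strings, MD5 is used because it is one of the hashes named above, and the names H, R, walk, build_table and lookup, as well as the chain length and chain count, are chosen purely for illustration.

```python
import hashlib
from typing import Dict, List, Optional, Tuple

DIGITS = 3                 # toy password space: "000" .. "999"
SPACE = 10 ** DIGITS
CHAIN_LEN = 20             # assumed chain length


def H(password: str) -> bytes:
    """One-way function: MD5 of the password."""
    return hashlib.md5(password.encode("ascii")).digest()


def R(digest: bytes, column: int) -> str:
    """Reduction function: map a hash back into the password space.
    It depends on the column, which is what distinguishes a rainbow chain."""
    n = (int.from_bytes(digest[:8], "big") + column) % SPACE
    return f"{n:0{DIGITS}d}"


def walk(start: str) -> Tuple[List[str], str]:
    """Return the passwords visited by a chain and the chain's endpoint."""
    visited, password = [], start
    for column in range(CHAIN_LEN):
        visited.append(password)
        password = R(H(password), column)
    return visited, password


def build_table(n_chains: int) -> Dict[str, List[str]]:
    """Store only endpoints, each mapped to the start points that reach it."""
    table: Dict[str, List[str]] = {}
    for i in range(n_chains):
        start = f"{i * (SPACE // n_chains):0{DIGITS}d}"
        _, end = walk(start)
        table.setdefault(end, []).append(start)
    return table


def lookup(table: Dict[str, List[str]], target: bytes) -> Optional[str]:
    """Try each column in which the target hash could have occurred."""
    for column in range(CHAIN_LEN - 1, -1, -1):
        password = R(target, column)
        for c in range(column + 1, CHAIN_LEN):
            password = R(H(password), c)
        for start in table.get(password, []):
            for candidate in walk(start)[0]:   # regenerate the chain
                if H(candidate) == target:
                    return candidate
    return None


table = build_table(50)
print(lookup(table, H("000")))   # "000" starts a chain, so it is recovered

# Empirical success probability over the whole toy password space.
hits = sum(lookup(table, H(f"{n:0{DIGITS}d}")) is not None for n in range(SPACE))
print(f"success probability: {hits / SPACE:.1%}")
```

Passwords that never appear in any stored chain cannot be recovered, so the measured success probability falls below 100%; raising the chain count or chain length increases coverage at the cost of storage or lookup time.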
The Rainbow Table approach has two major disadvantages. One disadvantage is that it does not give a 100% guarantee of finding the password, if hashes are used. Typically, the Rainbow Table approach gives a 99%, or 99.5%, or 99.9% probability of finding the password. The “missing” 1% or 0.1% is of more than theoretical significance in many cryptographic applications. Another problem with the Rainbow Table approach is that it is primarily used on Windows-encrypted documents, which use LM-hashes of Windows passwords, which is equivalent to 29-bit keys. In other words, there are only 2^29 possible passwords that need to be tested. This is far fewer than the 2^40 possible keys in the Microsoft Word or Adobe Acrobat documents. Attempting to use the Rainbow Table approach with 40-bit keys would substantially increase the time required (which is approximately 30 seconds, depending on the hardware, for 29-bit keys).
Accordingly, there is a need in the art for a fast system and method for recovering keys for encrypted documents.