Two trends dominate technology today. The first is the ever-increasing amount of data being stored. As of 2013, the limit on the size of data sets that can feasibly be processed in a reasonable amount of time is on the order of the exabyte (EB), and the technological per-capita capacity to store information has roughly doubled every 40 months since the early 1980s. The large amount of data being stored and analyzed is often referred to as “Big Data,” as discussed in [12]. The second is the flexible and widespread nature of data storage devices. Modern storage is loosely coupled and often spread over a large geographical area. Because of these trends, securing and deduplicating big data for transmission and storage is more important than ever.
The challenge presented is to encapsulate data in a manner that maintains four major properties: data confidentiality, integrity, authentication, and compression. Confidentiality, or encoding, guarantees that the data can only be read and interpreted by an authorized party who possesses the key to that data. Integrity and authentication guarantee that, when decoded, the data is in exactly the same form as it was before encoding. Finally, compression, more commonly referred to in this context as data deduplication, reduces the size of a data set by removing redundancy.
There are well-established and government-endorsed methods for providing confidentiality and authentication, such as the Advanced Encryption Standard (AES) as defined in [1] and the approved block cipher modes of operation for implementing AES as defined in [2]. There are also well-established methods for providing data deduplication, such as those taught by George Henry Forman et al. in U.S. Pat. No. 7,200,604 or by Haustein et al. in US2010/0235332 and demonstrated in [10].
Even though methods for these technologies are well defined and established, there is room for significant improvement in the performance of the write operation. The invention presented will specifically address performance when the process of encoding data via symmetric-key algorithms such as AES and the process of data deduplication are combined and operate concurrently on subsequent blocks. This is a common occurrence since compression before encoding saves space in data storage and bandwidth in data transmission.
The invention is a write once read many virtual disk that allows the combination of both methods without the reduction in performance that would normally be observed due to processor contention between the encoding and deduplication operations. Write once read many virtual disks can be used as a mechanism of encapsulation that provides data confidentiality, integrity, authentication, and compression. This type of virtual disk can either be transmitted over a network or placed in long term storage.
The performance problem we will address is the write throughput when adding data to a virtual disk. The invention is a virtual disk that is designed to be written once and read as many times as needed based on the application. In computer science this type of disk is often referred to as write once read many or WORM. These types of file systems are extensively studied and there has been much work in optimizing write and read operations on non-encrypted data as demonstrated in [13].
In trying to speed up write performance the invention takes two computational bottlenecks into consideration. The first is the rate of compression as dictated by the deduplication algorithm and the second is the rate of encoding, which is dictated by the symmetric-key algorithm being utilized to provide confidentiality, integrity, and authentication.
When deduplicating or compressing data for storage, a cryptographic hash function is used, and it can generate significant processor load at high data rates as demonstrated in [11]. When encoding data, the cryptographic functions that compose a symmetric-key algorithm can generate significant processor load as well. When both types of cryptographic functions are computed concurrently, the aggregate throughput of the entire system can be cut in half or more due to processor contention.
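To illustrate where the hashing load in deduplication comes from, the following is a minimal sketch, not the method of any cited reference. It assumes fixed-size blocks and SHA-256; the names `deduplicate`, `rebuild`, `store`, and `recipe` are hypothetical and chosen only for illustration. Every block of input must pass through the hash function, which is the per-block processor cost described above.

```python
import hashlib

def deduplicate(data: bytes, block_size: int = 4096):
    """Split data into fixed-size blocks and keep each unique block once."""
    store = {}    # digest -> block contents (unique blocks only)
    recipe = []   # ordered digests needed to rebuild the original data
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        digest = hashlib.sha256(block).digest()  # hashing cost paid per block
        store.setdefault(digest, block)          # keep first copy only
        recipe.append(digest)
    return store, recipe

def rebuild(store, recipe) -> bytes:
    """Reassemble the original data from the unique blocks and the recipe."""
    return b"".join(store[d] for d in recipe)

# Three identical blocks followed by one different block.
data = b"A" * 4096 * 3 + b"B" * 4096
store, recipe = deduplicate(data)
```

In this example four blocks are hashed but only two are stored, so the hash function runs over the full input even when most of it is redundant.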
To enable a high-performance write method for a write once read many virtual disk, the invention takes advantage of two techniques. The first is the operation of a block cipher as a stream cipher through the use of counter mode, as demonstrated in [2] and [4], in order to produce a keystream. The second is the fact that a keystream can be generated before the actual data is available for storage and can therefore be precomputed, as taught by Leventhal et al. in US2007/0110225 and demonstrated by [7], [8], [9], and [13].
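The two techniques can be sketched as follows. This is an illustration under stated assumptions rather than an implementation of AES counter mode: because the Python standard library does not include AES, HMAC-SHA256 is used here as a stand-in pseudorandom function, and the names `keystream`, `xor`, `key`, and `nonce` are hypothetical. The point shown is that the keystream depends only on the key, nonce, and counter, so it can be computed before any data exists, leaving only an XOR for the write path.

```python
import hashlib
import hmac

def keystream(key: bytes, nonce: bytes, length: int, start_block: int = 0) -> bytes:
    """Counter-mode keystream: PRF(key, nonce || counter) for successive counters.

    HMAC-SHA256 stands in for the block cipher (an assumption for this sketch).
    """
    out = bytearray()
    counter = start_block
    while len(out) < length:
        msg = nonce + counter.to_bytes(8, "big")
        out += hmac.new(key, msg, hashlib.sha256).digest()
        counter += 1
    return bytes(out[:length])

def xor(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))

key, nonce = b"k" * 32, b"n" * 8

# Expensive step, performed before any data is available.
pad = keystream(key, nonce, 64)

# Cheap step, performed when the data arrives: XOR only.
plaintext = b"data that arrives later".ljust(64, b"\0")
ciphertext = xor(plaintext, pad)

# Decoding regenerates the same keystream and XORs again.
recovered = xor(ciphertext, keystream(key, nonce, 64))
```

A keystream segment must never be reused to encode different data under the same key and nonce, which fits a write-once medium where each position is written a single time.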
The invention utilizes the first technique in its format method using the empty space on a virtual disk as a cache and the second technique in its write method utilizing a precomputed keystream as a one-time pad. The details of the invention that facilitate these methods will be disclosed in the summary section.
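One possible reading of this format-then-write arrangement is sketched below. It is a toy model built on assumptions, not the disclosed method: the class `WormDisk`, its 32-byte block size, and the use of HMAC-SHA256 in place of a block cipher in counter mode are all hypothetical choices made so the example runs with the standard library alone.

```python
import hashlib
import hmac

class WormDisk:
    """Toy write-once disk: format() pre-fills free space with keystream."""
    BLOCK = 32  # bytes per block; matches the SHA-256 output size

    def __init__(self, key: bytes, blocks: int):
        self.key = key
        self.data = bytearray(blocks * self.BLOCK)
        self.written = [False] * blocks

    def _pad(self, index: int) -> bytes:
        # Stand-in for a block cipher in counter mode (assumption, not AES).
        return hmac.new(self.key, index.to_bytes(8, "big"), hashlib.sha256).digest()

    def format(self) -> None:
        # Expensive step done ahead of time: cache keystream in empty blocks.
        for i in range(len(self.written)):
            self.data[i * self.BLOCK:(i + 1) * self.BLOCK] = self._pad(i)

    def write(self, index: int, block: bytes) -> None:
        if self.written[index]:
            raise ValueError("block already written (write once)")
        if len(block) != self.BLOCK:
            raise ValueError("block must be exactly BLOCK bytes")
        start = index * self.BLOCK
        for j, b in enumerate(block):
            self.data[start + j] ^= b  # XOR plaintext into the cached pad
        self.written[index] = True

    def read(self, index: int) -> bytes:
        # Regenerate the pad for this block and XOR it back out.
        start = index * self.BLOCK
        stored = self.data[start:start + self.BLOCK]
        return bytes(s ^ p for s, p in zip(stored, self._pad(index)))

disk = WormDisk(key=b"k" * 32, blocks=4)
disk.format()
disk.write(1, b"x" * 32)
```

In this sketch the write path performs no cryptographic computation at all; that cost was paid during formatting, which leaves the processor free for the deduplication hash at write time.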
A well-known example of the first technique that is utilized in traditional data storage is the disk cache. In order to optimize disk operation, data is brought into memory (cached) if it is expected to be accessed in the near future, creating a higher burst transfer rate. In this manner the speed difference between primary and secondary storage (which can be substantial, since a hard drive has mechanical components) can be partially masked, yielding higher throughput.
A well-known example of the second technique is in the area of securing interprocessor communication in a tightly coupled multiprocessor system. This is accomplished by generating small keystream buffers that reside in processor caches and are used as a one-time pad, reducing the latency introduced by the cryptographic operations of symmetric-key algorithms, as demonstrated in [5], [7], [8], and [13].
The invention addresses the need for confidentiality, integrity, authentication, and compression for the transmission and storage of big data by providing a write once read many virtual disk with performance characteristics that exceed other modern day solutions.