1. Technical Field
The present invention relates generally to data compression and decompression and, in particular, to a system and method for compressing and decompressing data based on an actual or expected throughput (bandwidth) of a system that employs data compression. Additionally, the present invention relates to the subsequent storage, retrieval, and management of information in data storage devices utilizing compression and/or accelerated data storage and retrieval bandwidth.
2. Description of the Related Art
There are a variety of data compression algorithms that are currently available, both well-defined and novel. Many compression algorithms define one or more parameters that can be varied, either dynamically or a priori, to change the performance characteristics of the algorithm. For example, with a typical dictionary-based compression algorithm such as Lempel-Ziv, the size of the dictionary can affect the performance of the algorithm. Indeed, a large dictionary may be employed to yield very good compression ratios, but the algorithm may take a long time to execute. If speed is more important than compression ratio, then the algorithm can be limited by selecting a smaller dictionary, thereby obtaining a much faster compression time, but at the possible cost of a lower compression ratio. The desired performance of a compression algorithm, and of the system in which the data compression is employed, will vary depending on the application.
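The effect of dictionary size on compression ratio can be illustrated with a minimal sketch. It assumes Python's standard zlib module as a stand-in for a Lempel-Ziv coder, with the `wbits` parameter setting the history window (the dictionary) to 2**wbits bytes. The helper name `deflate_size` and the test input are hypothetical and chosen only for illustration.

```python
import random
import zlib


def deflate_size(data: bytes, wbits: int, level: int = 6) -> int:
    """Return the compressed size of data for a history window of 2**wbits bytes."""
    compressor = zlib.compressobj(level, zlib.DEFLATED, wbits)
    return len(compressor.compress(data) + compressor.flush())


# Hypothetical input: one 4 KiB block of pseudo-random bytes repeated 16 times.
# The repeats sit 4096 bytes apart, so only a window larger than that can find them.
rng = random.Random(0)
block = bytes(rng.getrandbits(8) for _ in range(4096))
data = block * 16

small_window = deflate_size(data, wbits=9)   # 512-byte window
large_window = deflate_size(data, wbits=15)  # 32 KiB window

print(f"512 B window : {len(data) / small_window:.2f}:1")
print(f"32 KiB window: {len(data) / large_window:.2f}:1")
```

On this input the larger window yields a far higher ratio because it can reach back to the previous copy of the block. The sketch does not measure execution time, which depends on the host system.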
Thus, one challenge in employing data compression for a given application or system is selecting one or more optimal compression algorithms from the variety of available algorithms. Indeed, the desired balance between speed and efficiency is typically a significant factor that is considered in determining which algorithm to employ for a given set of data. Algorithms that compress particularly well usually take longer to execute whereas algorithms that execute quickly usually do not compress particularly well.
Accordingly, a system and method that would provide dynamic modification of compression system parameters so as to provide an optimal balance between execution speed of the algorithm (compression rate) and the resulting compression ratio, is highly desirable.
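As a purely illustrative sketch of what dynamic modification of a compression parameter could look like, the following adjusts the zlib compression level from one block to the next according to measured throughput. The function names, the 1.5x headroom factor, and the choice of zlib are all assumptions made for this example; it is not presented as the method of the invention.

```python
import time
import zlib


def next_level(level: int, measured_mb_s: float, target_mb_s: float) -> int:
    """Lower the level when compression is too slow, raise it when there is headroom."""
    if measured_mb_s < target_mb_s:
        return max(1, level - 1)      # too slow: favor speed
    if measured_mb_s > 1.5 * target_mb_s:  # 1.5 is an arbitrary headroom factor
        return min(9, level + 1)      # spare throughput: favor compression ratio
    return level


def compress_adaptively(blocks: list[bytes], target_mb_s: float, level: int = 6) -> list[bytes]:
    """Compress each block, retuning the level after every block."""
    compressed = []
    for block in blocks:
        start = time.perf_counter()
        compressed.append(zlib.compress(block, level))
        elapsed = max(time.perf_counter() - start, 1e-9)  # guard against a zero reading
        level = next_level(level, len(block) / 1e6 / elapsed, target_mb_s)
    return compressed


blocks = [b"example payload " * 4096 for _ in range(4)]
output = compress_adaptively(blocks, target_mb_s=40.0)
```

Because every block is a complete zlib stream, a decoder does not need to know which level was used for any given block.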
Yet another problem within the current art is data storage and retrieval bandwidth limitations. Modern computers utilize a hierarchy of memory devices. In order to achieve maximum performance levels, modern processors utilize onboard memory and onboard cache to obtain high bandwidth access to both program and data. Limitations in process technologies currently prohibit placing a sufficient quantity of onboard memory for most applications. Thus, in order to offer sufficient memory for the operating system(s), application programs, and user data, computers often use various forms of popular off-processor high speed memory including static random access memory (SRAM), synchronous dynamic random access memory (SDRAM), and synchronous burst static RAM (SBSRAM). Due to the prohibitive cost of high-speed random access memory, coupled with its power volatility, a third lower level of the hierarchy exists for non-volatile mass storage devices. While mass storage devices offer increased capacity and fairly economical data storage, their data storage and retrieval bandwidth is often much less in relation to the other elements of a computing system.
Computer systems represent information in a variety of manners. Discrete information such as text and numbers is easily represented in digital data. This type of data representation is known as symbolic digital data. Symbolic digital data is thus an absolute representation of data such as a letter, figure, character, mark, machine code, or drawing.
Continuous information such as speech, music, audio, images and video, frequently exists in the natural world as analog information. As is well known to those skilled in the art, recent advances in very large scale integration (VLSI) digital computer technology have enabled both discrete and analog information to be represented with digital data. Continuous information represented as digital data is often referred to as diffuse data. Diffuse digital data is thus a representation of data that is of low information density and is typically not easily recognizable to humans in its native form.
Modern computers utilize digital data representation because of its inherent advantages. For example, digital data is more readily processed, stored, and transmitted due to its inherently high noise immunity. In addition, the inclusion of redundancy in digital data representation enables error detection and/or correction. Error detection and/or correction capabilities are dependent upon the amount and type of data redundancy, available error detection and correction processing, and extent of data corruption.
One outcome of digital data representation is the continuing need for increased capacity in data processing, storage, and transmittal. This is especially true for diffuse data where increases in fidelity and resolution create exponentially greater quantities of data. Data compression is widely used to reduce the amount of data required to process, transmit, or store a given quantity of information. In general, there are two types of data compression techniques that may be utilized either separately or jointly to encode/decode data: lossless and lossy data compression.
Over the last decade, computer processor performance has improved by at least a factor of 50. During this same period, magnetic disk storage has only improved by a factor of 5. Thus, one additional problem with the existing art is that memory storage devices severely limit the performance of consumer, entertainment, office, workstation, server, and mainframe computers for all disk and memory intensive operations.
For example, magnetic disk mass storage devices currently employed in a variety of home, business, and scientific computing applications suffer from significant seek-time access delays along with profound read/write data rate limitations. Currently the fastest available 15,000 rpm disk drives support a data rate of only 40.0 megabytes per second (MB/sec). This is in stark contrast to the modern Personal Computer's Peripheral Component Interconnect (PCI) Bus's input/output capability of 512 MB/sec and internal local bus capability of 1600 MB/sec.
Another problem within the current art is that emergent high performance disk interface standards such as the Small Computer Systems Interface (SCSI-3), iSCSI, Fibre Channel, AT Attachment UltraDMA/100+, Serial Storage Architecture, and Universal Serial Bus offer only higher data transfer rates through intermediate data buffering in random access memory. These interconnect strategies do not address the fundamental problem that all modern magnetic disk storage devices for the personal computer marketplace are still limited by the same typical physical media restriction. In practice, faster disk access data rates are only achieved by the high cost solution of simultaneously accessing multiple disk drives with a technique known within the art as data striping and redundant array of independent disks (RAID).
RAID systems often afford the user the benefit of increased data bandwidth for data storage and retrieval. By simultaneously accessing two or more disk drives, data bandwidth may be increased at a maximum rate that is linear and directly proportional to the number of disks employed. Thus another problem with modern data storage systems utilizing RAID systems is that a linear increase in data bandwidth requires a proportional number of added disk storage devices.
Another problem with most modern mass storage devices is their inherent unreliability. Many modern mass storage devices utilize rotating assemblies and other types of electromechanical components that possess failure rates one or more orders of magnitude higher than equivalent solid state devices. RAID systems employ data redundancy distributed across multiple disks to enhance data storage and retrieval reliability. In the simplest case, data may be explicitly repeated on multiple places on a single disk drive, or on multiple places on two or more independent disk drives. More complex techniques are also employed that support various trade-offs between data bandwidth and data reliability.
Standard types of RAID systems currently available include RAID Levels 0, 1, and 5. The configuration selected depends on the goals to be achieved. Specifically data reliability, data validation, data storage/retrieval bandwidth, and cost all play a role in defining the appropriate RAID data storage solution. RAID level 0 entails pure data striping across multiple disk drives. This increases data bandwidth at best linearly with the number of disk drives utilized. Data reliability and validation capability are decreased. A failure of a single drive results in a complete loss of all data. Thus another problem with RAID systems is that low cost improved bandwidth requires a significant decrease in reliability.
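The RAID Level 0 trade-off described above can be made concrete with a short sketch. The 40 MB/sec figure is taken from the disk data rate cited earlier. The per-disk survival probability of 0.97, the assumption of independent failures, and the function names are hypothetical.

```python
def raid0_bandwidth(per_disk_mb_s: float, disks: int) -> float:
    """Best-case aggregate bandwidth, linear in the number of striped disks."""
    return per_disk_mb_s * disks


def raid0_survival(per_disk_survival: float, disks: int) -> float:
    """Probability that no disk fails, assuming independent failures."""
    return per_disk_survival ** disks


def stripe(data: bytes, disks: int, stripe_size: int) -> list[bytes]:
    """Distribute data round-robin across the disks in stripe_size chunks."""
    out = [bytearray() for _ in range(disks)]
    for i in range(0, len(data), stripe_size):
        out[(i // stripe_size) % disks] += data[i:i + stripe_size]
    return [bytes(d) for d in out]


for n in (1, 2, 4, 8):
    print(n, raid0_bandwidth(40.0, n), round(raid0_survival(0.97, n), 3))
```

Bandwidth grows at best linearly with the number of drives, while the probability that the array survives intact shrinks geometrically, since the loss of any one drive loses the whole data set.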
RAID Level 1 utilizes disk mirroring where data is duplicated on an independent disk subsystem. Validation of data amongst the two independent drives is possible if the data is simultaneously accessed on both disks and subsequently compared. This tends to decrease data bandwidth from even that of a single comparable disk drive. In systems that offer hot swap capability, the failed drive is removed and a replacement drive is inserted. The data on the failed drive is then copied in the background while the entire system continues to operate in a performance degraded but fully operational mode. Once the data rebuild is complete, normal operation resumes. Hence, another problem with RAID systems is the high cost of increased reliability and associated decrease in performance.
RAID Level 5 employs disk data striping and parity error detection to increase both data bandwidth and reliability simultaneously. A minimum of three disk drives is required for this technique. In the event of a single disk drive failure, that drive may be rebuilt from parity and other data encoded on the remaining disk drives. In systems that offer hot swap capability, the failed drive is removed and a replacement drive is inserted. The data on the failed drive is then rebuilt in the background while the entire system continues to operate in a performance degraded but fully operational mode. Once the data rebuild is complete, normal operation resumes.
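Rebuilding a failed drive from parity can be sketched as follows, assuming simple byte-wise XOR parity over a single stripe on the minimum three-drive configuration. Practical RAID Level 5 implementations also rotate the parity block across drives from stripe to stripe, which this sketch omits. The function name and block contents are hypothetical.

```python
from functools import reduce


def xor_blocks(blocks: list[bytes]) -> bytes:
    """Byte-wise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One stripe on three drives: two data blocks and one parity block.
drive0 = b"first data block"
drive1 = b"other data block"
parity = xor_blocks([drive0, drive1])

# Simulate the loss of drive1 and rebuild it from the surviving drive plus parity.
rebuilt = xor_blocks([drive0, parity])
print(rebuilt == drive1)
```

The rebuild works because XOR is its own inverse: combining the parity with every surviving data block cancels their contributions and leaves the missing block.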
Thus another problem with redundant modern mass storage devices is the degradation of data bandwidth when a storage device fails. Additional problems with bandwidth limitations and reliability similarly occur within the art by all other forms of sequential, pseudo-random, and random access mass storage devices. Typically mass storage devices include magnetic and optical tape, magnetic and optical disks, and various solid-state mass storage devices. It should be noted that the present invention applies to all forms and manners of memory devices including storage devices utilizing magnetic, optical, neural and chemical techniques or any combination thereof.
Yet another problem within the current art is the application and use of various data compression techniques. It is well known within the current art that data compression provides several unique benefits. First, data compression can reduce the time to transmit data by more efficiently utilizing low bandwidth data links. Second, data compression economizes on data storage and allows more information to be stored for a fixed memory size by representing information more efficiently.
For purposes of discussion, data compression is canonically divided into lossy and lossless techniques. Lossy data compression techniques provide for an inexact representation of the original uncompressed data such that the decoded (or reconstructed) data differs from the original unencoded/uncompressed data. Lossy data compression is also known as irreversible or noisy compression. Negentropy is defined as the quantity of information in a given set of data. Thus, one obvious advantage of lossy data compression is that the compression ratios can be larger than that dictated by the negentropy limit, all at the expense of information content. Many lossy data compression techniques seek to exploit various traits within the human senses to eliminate otherwise imperceptible data. For example, lossy data compression of visual imagery might seek to delete information content in excess of the display resolution or contrast ratio of the target display device.
On the other hand, lossless data compression techniques provide an exact representation of the original uncompressed data. Simply stated, the decoded (or reconstructed) data is identical to the original unencoded/uncompressed data. Lossless data compression is also known as reversible or noiseless compression. Thus, lossless data compression has, as its current limit, a minimum representation defined by the entropy of a given data set.
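The entropy limit can be estimated numerically. The sketch below computes the zeroth-order Shannon entropy of a byte sequence, which treats each byte as an independent symbol. That model is an assumption: it bounds only coders that work on the same model, and data with inter-symbol structure can be coded below this figure by a coder that models that structure. The function names are hypothetical.

```python
import math
from collections import Counter


def entropy_bits_per_byte(data: bytes) -> float:
    """Zeroth-order Shannon entropy of the byte distribution, in bits per byte."""
    total = len(data)
    return -sum((n / total) * math.log2(n / total) for n in Counter(data).values())


def entropy_bound_bytes(data: bytes) -> float:
    """Smallest size in bytes for a coder that treats bytes as independent symbols."""
    return entropy_bits_per_byte(data) * len(data) / 8


sample = b"abababababababab"
print(entropy_bits_per_byte(sample), entropy_bound_bytes(sample))
```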
A rich and highly diverse set of lossless data compression and decompression algorithms exists within the current art. These range from the simplest "ad hoc" approaches to highly sophisticated formalized techniques that span the sciences of information theory, statistics, and artificial intelligence. One fundamental problem with almost all modern approaches is the trade-off between the compression ratio achieved and the encoding and decoding speed. As previously stated, the current theoretical limit for data compression is the entropy limit of the data set to be encoded. However, in practice, many factors actually limit the compression ratio achieved. Most modern compression algorithms are highly content dependent. Content dependency extends beyond the actual statistics of individual elements and often includes a variety of other factors including their spatial location within the data set.
Of popular compression techniques, arithmetic coding possesses the highest degree of algorithmic effectiveness and, as expected, is the slowest to execute. This is followed in turn by dictionary compression, Huffman coding, and run-length coding, with respectively decreasing execution times. What is not apparent from these algorithms, and what is also one major deficiency within the current art, is knowledge of their algorithmic efficiency. More specifically, given a compression ratio that is within the effectiveness of multiple algorithms, the question arises as to their corresponding efficiency.
Within the current art there also presently exists a strong inverse relationship between achieving the maximum (current) theoretical compression ratio, which we define as algorithmic effectiveness, and requisite processing time. For a given single algorithm the effectiveness over a broad class of data sets including text, graphics, databases, and executable object code is highly dependent upon the processing effort applied. Given a baseline data set, processor operating speed and target architecture, along with its associated supporting memory and peripheral set, we define algorithmic efficiency as the time required to achieve a given compression ratio. Algorithmic efficiency assumes that a given algorithm is implemented in an optimum object code representation executing from the optimum places in memory. This is almost never achieved in practice due to limitations within modern optimizing software compilers. It should be further noted that an optimum algorithmic implementation for a given input data set may not be optimum for a different data set. Much work remains in developing a comprehensive set of metrics for measuring data compression algorithmic performance, however for present purposes the previously defined terms of algorithmic effectiveness and efficiency should suffice.
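The two terms defined above can be measured empirically. The sketch below records, for each of several codecs, the compression ratio achieved (a proxy for effectiveness) and the time taken to achieve it (a proxy for efficiency). The codecs are those in the Python standard library and serve only as stand-ins; they are not the algorithms ranked earlier, and the timings depend entirely on the host system.

```python
import bz2
import lzma
import time
import zlib

# Stand-in codecs from the standard library, keyed by name.
CODECS = {"zlib": zlib.compress, "bz2": bz2.compress, "lzma": lzma.compress}


def measure(data: bytes) -> dict[str, tuple[float, float]]:
    """Map each codec name to (compression ratio, seconds taken)."""
    results = {}
    for name, compress in CODECS.items():
        start = time.perf_counter()
        size = len(compress(data))
        results[name] = (len(data) / size, time.perf_counter() - start)
    return results


results = measure(b"a baseline data set with repetitive content " * 2000)
for name, (ratio, seconds) in results.items():
    print(f"{name}: {ratio:.1f}:1 in {seconds * 1000:.2f} ms")
```

Comparing codecs at a fixed target ratio, rather than at their default settings, would be closer to the definition of efficiency given above; that refinement is left out for brevity.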
Various solutions to this problem of optimizing algorithmic implementation are found in U.S. Pat. Nos. 6,195,024 and 6,309,424, issued on Feb. 27, 2001 and Oct. 30, 2001, respectively, to James Fallon, both of which are entitled "Content Independent Data Compression Method and System," and are incorporated herein by reference. These patents describe data compression methods that provide content-independent data compression, wherein an optimal compression ratio for an encoded stream can be achieved regardless of the data content of the input data stream. As more fully described in the above incorporated patents, a data compression protocol comprises applying an input data stream to each of a plurality of different encoders to, in effect, generate a plurality of encoded data streams. The plurality of encoders are preferably selected based on their ability to effectively encode different types of input data. The final compressed data stream is generated by selectively combining blocks of the compressed streams output from the plurality of encoders based on one or more factors such as the optimal compression ratios obtained by the plurality of encoders. The resulting compressed output stream can achieve the greatest possible compression, preferably in real-time, regardless of the data content.
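The general idea of applying several encoders to each block and keeping the best result can be sketched as follows. This is a loose illustration only and does not reproduce the protocol of the incorporated patents. The one-byte tag format and the choice of standard-library codecs are assumptions made for the example.

```python
import bz2
import lzma
import zlib

# Hypothetical one-byte tags identifying which encoder produced a block.
# Tag 0 stores the block unmodified, for data that no encoder can shrink.
ENCODERS = {0: lambda b: b, 1: zlib.compress, 2: bz2.compress, 3: lzma.compress}
DECODERS = {0: lambda b: b, 1: zlib.decompress, 2: bz2.decompress, 3: lzma.decompress}


def encode_block(block: bytes) -> bytes:
    """Run every encoder on the block and keep the smallest output, prefixed by its tag."""
    candidates = ((tag, encode(block)) for tag, encode in ENCODERS.items())
    tag, best = min(candidates, key=lambda pair: len(pair[1]))
    return bytes([tag]) + best


def decode_block(encoded: bytes) -> bytes:
    """Dispatch on the tag byte to the matching decoder."""
    return DECODERS[encoded[0]](encoded[1:])


text_block = b"the quick brown fox jumps over the lazy dog " * 200
flat_block = bytes(range(256))  # every byte value once: nothing to exploit

encoded_text = encode_block(text_block)
encoded_flat = encode_block(flat_block)
```

The pass-through entry guarantees that the output is never more than one byte larger than the input, whatever the content.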
Yet another problem within the current art relates to data management and the use of existing file management systems. Present computer operating systems utilize file management systems to store and retrieve information in a uniform, easily identifiable, format. Files are collections of executable programs and/or various data objects. Files occur in a wide variety of lengths and must be stored within a data storage device. Most storage devices, and in particular, mass storage devices, work most efficiently with specific quantities of data. For example, modern magnetic disks are often divided into cylinders, heads and sectors. This breakout arises from legacy electro-mechanical considerations, with the format of an individual sector often some binary multiple of bytes (512, 1024, . . . ). A fixed or variable quantity of sectors is housed on an individual track. The number of sectors permitted on a single track is limited by the number of reliable flux reversals that can be encoded on the storage media per linear inch, often referred to as linear bit density. In disk drives with multiple heads and disk media, a single cylinder is comprised of multiple tracks.
A file allocation table is often used to organize both used and unused space on a mass storage device. Since a file often comprises more than one sector of data, and individual sectors or contiguous strings of sectors may be widely dispersed over multiple tracks and cylinders, a file allocation table provides a methodology of retrieving a file or portion thereof. File allocation tables are usually comprised of strings of pointers or indices that identify where various portions of a file are stored.
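A file allocation table of the kind described can be modeled as a list in which each entry points to the next sector of the same file. The sentinel values and names below are hypothetical and do not correspond to any particular on-disk format.

```python
END_OF_CHAIN = -1  # hypothetical marker for the last sector of a file
FREE = -2          # hypothetical marker for an unused sector


def file_sectors(fat: list[int], first_sector: int) -> list[int]:
    """Follow the pointer chain from a file's first sector to its last."""
    sectors = []
    current = first_sector
    while current != END_OF_CHAIN:
        sectors.append(current)
        current = fat[current]
    return sectors


def free_sectors(fat: list[int]) -> list[int]:
    """List every sector that is not allocated to a file."""
    return [index for index, entry in enumerate(fat) if entry == FREE]


# A file stored in sectors 2 -> 5 -> 3; sectors 0, 1, and 4 are unused.
fat = [FREE, FREE, 5, END_OF_CHAIN, FREE, 3]
print(file_sectors(fat, 2), free_sectors(fat))
```

The same table therefore tracks both used and unused space, and a file's sectors need not be contiguous on the media.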
In order to provide greater flexibility in the management of disk storage at the media side of the interface, logical block addresses have been substituted for legacy cylinder, head, sector addressing. This permits the individual disk to optimize its mapping from the logical address space to the physical sectors on the disk drive. Advantages with this technique include faster disk accesses by allowing the disk manufacturer greater flexibility in managing data interleaves and other high-speed access techniques. In addition, the replacement of bad media sectors can take place at the physical level and need not be the concern of the file allocation table or host computer. Furthermore, these bad sector replacement maps are definable on a disk by disk basis.
Practical limitations in the size of the data required to both represent and process an individual data block address, along with the size of individual data blocks, govern the type of file allocation tables currently in use. For example, a 4096 byte logical block size (8 sectors) may be employed with 32 bit logical block addresses. This yields an addressable data space of 17.59 Terabytes. Smaller logical blocks permit more efficient use of disk space. Larger logical blocks support a larger addressable data space. Thus, one limitation within the current art is that disk file allocation tables and associated file management systems are a compromise between efficient data storage, access speed, and addressable data space.
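For reference, the conventional linear translation between cylinder, head, sector coordinates and a logical block address is sketched below. A drive is free to use a different internal mapping, which is the flexibility noted above. The geometry of 16 heads and 63 sectors per track used in the example is hypothetical.

```python
def chs_to_lba(cylinder: int, head: int, sector: int,
               heads_per_cylinder: int, sectors_per_track: int) -> int:
    """Conventional mapping; sectors are numbered from 1, cylinders and heads from 0."""
    return (cylinder * heads_per_cylinder + head) * sectors_per_track + (sector - 1)


def lba_to_chs(lba: int, heads_per_cylinder: int,
               sectors_per_track: int) -> tuple[int, int, int]:
    """Inverse of chs_to_lba for the same geometry."""
    cylinder, remainder = divmod(lba, heads_per_cylinder * sectors_per_track)
    head, sector_index = divmod(remainder, sectors_per_track)
    return cylinder, head, sector_index + 1


lba = chs_to_lba(2, 3, 4, heads_per_cylinder=16, sectors_per_track=63)
print(lba, lba_to_chs(lba, 16, 63))
```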
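The arithmetic behind the 17.59 Terabyte figure, and the space lost to larger blocks, can be checked with a short sketch. The function names and the 1000-byte example file are hypothetical.

```python
def addressable_bytes(block_bytes: int, address_bits: int) -> int:
    """Total data space reachable with the given block size and address width."""
    return block_bytes * 2 ** address_bits


def slack_bytes(file_bytes: int, block_bytes: int) -> int:
    """Bytes left unused in the last block allocated to a file."""
    return -file_bytes % block_bytes


space = addressable_bytes(4096, 32)
print(f"{space / 1e12:.2f} TB addressable")  # decimal terabytes, as in the text
print(slack_bytes(1000, 4096), slack_bytes(1000, 512))
```

With 4096 byte blocks and 32 bit addresses the space is 2**44 bytes. A 1000 byte file wastes 3096 bytes of its single 4096 byte block, but only 24 bytes when stored in 512 byte blocks, which is the compromise between storage efficiency and addressable space described above.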
Data in a computer has various levels of information content. Even within a single file, many data types and formats are utilized. Each data representation has specific meaning and each may hold differing quantities of information. Within the current art, computers process data in a native, uncompressed, format. Thus compressed data must often be decompressed prior to performing various data processing functions or operations. Modern file systems have been designed to work with data in its native format. Thus another significant problem within the current art is that file systems are not able to randomly access compressed data in an efficient manner.
Further aggravating this problem is the fact that when data is decompressed, processed and recompressed it may not fit back into its original disk space, causing disk fragmentation or complex disk space reallocation requirements. Several solutions exist within the current art including file by file and block structured compressed data management.
In file by file compression, each file is compressed when stored on disk and decompressed when retrieved. For very small files this technique is often adequate; however, for larger files the compression and decompression times are too slow, resulting in inadequate system level performance. In addition, the ability to randomly access data within a specific file is lost. The one advantage to file by file compression techniques is that they are easy to develop and are compatible with existing file systems. Thus file by file compressed data management is not an adequate solution.
Block structured disk compression operates by compressing and decompressing data in blocks. Block sizes are often fixed, but may be variable. A single file is usually comprised of multiple blocks; however, a file may be so small as to fit within a single block. Blocks are grouped together and stored in one or more disk sectors as a group of blocks (GOB). A group of blocks is compressed and decompressed as a unit; thus there exist practical limitations on the size of GOBs. Most compression algorithms achieve a higher level of algorithmic effectiveness when operating on larger quantities of data. Restated, the larger the quantity of data processed with a uniform information density, the higher the compression ratio achieved. If GOBs are small, compression ratios are low and processing time is short. Conversely, when GOBs are large, compression ratios are higher and processing time is longer. Large GOBs tend to perform in a manner analogous to file by file compression. The two obvious benefits to block structured disk compression are pseudo-random data access and reduced data compression/decompression processing time.
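Pseudo-random access under block structured compression can be sketched as follows, assuming each GOB holds a fixed quantity of uncompressed data and is compressed independently with zlib. The function names, the sizes, and the in-memory list standing in for disk sectors are hypothetical.

```python
import zlib


def compress_gobs(data: bytes, gob_size: int) -> list[bytes]:
    """Compress each gob_size slice of the data as an independent unit."""
    return [zlib.compress(data[i:i + gob_size]) for i in range(0, len(data), gob_size)]


def read_range(gobs: list[bytes], gob_size: int, offset: int, length: int) -> bytes:
    """Read length bytes at offset, decompressing only the GOBs that overlap the range."""
    if length <= 0:
        return b""
    first = offset // gob_size
    last = (offset + length - 1) // gob_size
    chunk = b"".join(zlib.decompress(gob) for gob in gobs[first:last + 1])
    start = offset - first * gob_size
    return chunk[start:start + length]


data = bytes(range(256)) * 64  # 16 KiB whose only redundancy spans 256-byte periods
gobs = compress_gobs(data, 4096)
print(read_range(gobs, 4096, 5000, 8))

# Effect of GOB size on total compressed size for the same data.
small_total = sum(len(g) for g in compress_gobs(data, 256))
large_total = sum(len(g) for g in compress_gobs(data, 16384))
print(small_total, large_total)
```

A read touches only the GOBs that overlap the requested range, which is the pseudo-random access benefit. On this input the 256 byte GOBs compress poorly because the redundancy lies between GOBs rather than within them, which illustrates why larger GOBs tend to give higher compression ratios.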
Several problems exist within the current art for the management of compressed blocks. One method for storage of compressed files on disk is by contiguously storing all GOBs corresponding to a single file. However as files are processed within the computers, files may grow or shrink in size. Inefficient disk storage results when a substantial file size reduction occurs. Conversely when a file grows substantially, the additional space required to store the data may not be available contiguously. The result of this process is substantial disk fragmentation and slower access times.
An alternate method is to map compressed GOBs into the next logical free space on the disk. One problem with this method is that average file access times are substantially increased due to the random data storage. Peak access delays may be reduced since the statistics behave with a more uniform white spectral density; however, this is not guaranteed.
A further layer of complexity is encountered when compressed information is to be managed on more than one data storage device. Competing requirements of data access bandwidth, data reliability/redundancy, and efficiency of storage space are encountered.
These and other limitations within the current art are solved with the present invention.