The present invention relates to memory systems, and more particularly to an integrated compression/decompression circuit embedded on industry standard memory modules where such modules operate to improve performance of a computing system by the storage of compressed data in the system memory and/or on the nonvolatile memory subsystem.
System memory modules and architectures have remained relatively unchanged for many years. While memory density has increased and the cost per storage bit has decreased over time, there has not been a significant improvement to the effective operation of the memory subsystem using non-memory devices located within such memory subsystems. The majority of computing systems presently use industry standard in-line modules. These modules house multiple DRAM memory devices for easy upgrade, configuration, and improved density per area.
Software-implemented compression and decompression technologies have also been used to reduce the size of data stored on the disk subsystem or in system memory. Current compressed data storage implementations use the system's CPU executing a software program to compress information for storage on disk. However, a software solution typically uses too many CPU compute cycles to operate both compression and decompression in the present application(s). This compute cycle problem increases as applications increase in size and complexity. In addition, there has been no general-purpose use of compression and decompression for in-memory system data. Prior art systems have been specific to certain data types. Thus, software compression has been used, but this technique limits CPU performance and has restricted use to certain data types.
Similar problems exist for programs that require multiple applications of software threads to operate in parallel. Software compression does not address heavily loaded or multi-threaded applications, which require high CPU throughput. Other hardware compression solutions have not focused on “in-memory” data (data which reside in the active portion of the memory and software hierarchy). These solutions have typically been I/O data compression devices located away from the system memory or memory subsystem. In addition, the usage of hardware compression has been restricted to slow, serial input and output devices usually located at the I/O subsystem.
Mainframe computers have used data compression for acceleration and reduction of storage space for years. These systems require high dollar compression modules located away from the system memory and do not compress in-memory data in the same memory subsystem for improved performance. Such high dollar compression subsystems use multiple separate engines running in parallel to achieve compression speeds at super computer rates. Multiple separate, serial compression and decompression engines running in parallel are cost prohibitive for general use servers, workstations, desktops, or mobile units. Lower cost semiconductor devices have been developed that use compression hardware as well. The main difference is that these devices do not operate fast enough to run at memory speed and thus lack the necessary performance for in-memory data. Such compression hardware devices are limited to serial operation at compression rates that work for slow I/O devices such as tape backup units. The problem with such I/O compression devices, other than tape backup units, is that portions of the data to compress are often too small of a block size to effectively see the benefits of compression. This is especially true in disk and network subsystems. To operate hardware compression on in-memory data at memory bus speeds requires over an order of magnitude more speed than present day state-of-the-art compression hardware.
Prior Art Computer System Architecture
FIG. 1 illustrates a block diagram example of a prior art computer hardware and software operating system hierarchy of present day computing systems. The prior art memory and data storage hierarchy comprises the CPU subsystem 100, the main memory subsystem 200, and the disk subsystem 300. The CPU subsystem 100 comprises the L1 cache memory 120 and L2 cache memory 130 coupled to the CPU 110 and the CPU's local bus 135. The CPU subsystem 100 is coupled to the main memory subsystem 200 through the CPU local bus 135. The main memory subsystem 200 is also coupled to the disk subsystem 300. The main memory subsystem 200 comprises the memory controller 210, for controlling the main system memory banks, active pages of memory 220, inactive pages of memory 230, and a dynamically defined page fault boundary 232. The page fault boundary 232 is dynamically controlled by the virtual memory manager software 620 to optimize the balance between active and inactive pages in the system memory and “stale” pages stored on disk. The memory subsystem 200 is coupled to the I/O, or disk subsystem 300, by the I/O peripheral bus interface 235, which may be one of multiple bus standards or server/workstation proprietary I/O bus interfaces, e.g., the PCI bus. For purpose of illustration, the I/O disk subsystem 300 comprises the disk controller 310, the optional disk cache memory 320, and the actual physical hard disk or disk array 330 which is used to store nonvolatile/non-active pages. In alternate embodiments, multiple subsections of CPU 100, memory 200, and disk 300 subsystems may be used for larger capacity and/or faster operation.
The prior art drawing of FIG. 1 also illustrates the software operating system 600. The typical operating system (OS) comprises multiple blocks. FIG. 1 shows a few of the relevant OS blocks, including the virtual memory manager (VMM) 620, file system 640, and disk drivers 660.
The operation of prior art systems for storage and retrieval of active and non-active pages from either the system memory or the disk is now described for reference. Again referring to the prior art system of FIG. 1, the VMM 620 is responsible for allocation of active pages and reallocation of inactive pages. The VMM 620 defines page fault boundaries 232 separating the active pages 220 and the inactive pages 230 located in both the system memory subsystem 200 and disk subsystem 300. An active page may be defined as an area or page of memory, typically 4096 bytes, which is actively used by the CPU during application execution. Active pages reside between or within system memory or CPU cache memory. An inactive page may be defined as an area or page of memory, typically 4096 bytes, which is not directly accessed by the CPU for application execution. Inactive pages may reside in the system memory, or may be stored locally or on networks on storage media such as disks. The page fault boundary 232 is dynamically allocated during run time operation to provide the best performance and operation as defined by many industry standard algorithms such as the LRU/LFU lazy replacement algorithm for page swapping to disk. As applications grow, consuming more system memory than the actual available memory space, the page fault boundaries 232 are redefined to store more inactive pages 230 in the disk subsystem 300 or across networks. Thus, the VMM 620 is responsible for the placement of the page fault boundary 232 and the determination of active pages 220 and inactive pages 230, which reside in memory and on the disk subsystem 300.
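As an illustration of the page replacement behavior described above, the following is a minimal sketch of least-recently-used tracking only. The class name, method name, and the fixed limit on active pages are hypothetical; the source names the LRU/LFU family of algorithms without specifying an implementation, and a real VMM moves the page fault boundary dynamically rather than using a fixed limit.

```python
from collections import OrderedDict


class LruPageTracker:
    """Hypothetical tracker: keeps active pages, oldest access first."""

    def __init__(self, max_active_pages):
        # Assumption: a fixed limit stands in for the dynamic page fault boundary.
        self.max_active_pages = max_active_pages
        self.active = OrderedDict()  # page number -> None, least recently used first

    def touch(self, page):
        """Record an access; return the page that becomes inactive, or None."""
        if page in self.active:
            self.active.move_to_end(page)  # now the most recently used page
            return None
        self.active[page] = None
        if len(self.active) > self.max_active_pages:
            evicted, _ = self.active.popitem(last=False)  # least recently used
            return evicted
        return None
```

With a limit of two active pages, touching pages 1, 2, 1, and then 3 marks page 2 inactive, because page 1 was used more recently.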
The file system software 640, among other tasks, and along with the disk drivers 660, is responsible for the effective movement of inactive pages between the memory subsystem 200 and the disk subsystem 300. The file system software 640 may have an interface which is called by the VMM 620 software for the task of data movement to and from the computer disk and network subsystems. The file system 640 software maintains file allocation tables and bookkeeping to locate inactive pages that have been written to disk. In order for the file system to operate, the file system calls the software disk drivers 660 for DMA control of data movement and physical disk control. Instructions are programmed into the disk controller 310 of the disk subsystem 300 by the file system 640 software. Thus, when application data exceeds the available system memory space, the VMM 620 allocates and reallocates active and inactive pages for best operation of application data and instructs the file system 640 to instruct the disk driver 660 to carry out the DMA operation and page movement tasks.
For the purpose of this disclosure, it is helpful to understand the relative read and write time requirements for CPU read and write operation to or from each of the subsystems 100, 200, and 300. For example, for the CPU subsystem 100, a read and write operation to or from the L1 120 or L2 130 cache memory is on the order of tens of nanoseconds. A CPU 110 read/write from/to the memory subsystem 200 is on the order of hundreds of nanoseconds. A CPU read or write and/or a memory controller DMA read or write to the disk subsystem 300 is on the order of milliseconds. To move a page (typically 4096 bytes) from the inactive page 230 area to the active page 220 by the CPU 110 typically requires 3 μs for the page fault software plus 7 μs for the data move, or 10 μs of overhead. For the DMA controller, typically located in the memory controller 210, to read or DMA a page from disk cache 320 requires about 1 ms, while movement of a page to physical disk requires about 10 ms. Thus, the data transfer time from disk subsystem 300 to memory subsystem 200 is about three orders of magnitude longer than from memory subsystem 200 to CPU subsystem 100 L1/L2 cache 120/130 memory. This represents an area of desired improvement. In addition, the speed of CPU reads/writes to and from the memory subsystem 200 is also an area of desired improvement.
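The arithmetic behind the "three orders of magnitude" comparison can be checked with the figures quoted above. This is a worked calculation only; the constant names are illustrative, and the values are the typical figures stated in the text.

```python
import math

# Typical figures quoted in the text, in microseconds.
PAGE_FAULT_SOFTWARE_US = 3
DATA_MOVE_US = 7
DISK_CACHE_READ_US = 1_000      # about 1 ms
PHYSICAL_DISK_US = 10_000       # about 10 ms

# Overhead of moving one page from the inactive area to the active area.
in_memory_move_us = PAGE_FAULT_SOFTWARE_US + DATA_MOVE_US

# How much longer a physical disk transfer takes than the in-memory move.
disk_to_memory_ratio = PHYSICAL_DISK_US / in_memory_move_us
orders_of_magnitude = math.log10(disk_to_memory_ratio)

# The same comparison for a page served from the disk cache.
disk_cache_ratio = DISK_CACHE_READ_US / in_memory_move_us
```

The physical disk case gives a ratio of 1000, which is the three orders of magnitude cited; a page served from the disk cache is about 100 times slower than the in-memory move.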
Certain prior art systems utilize multiple compression and decompression devices to achieve faster compression rates for I/O data sent and stored on disk. No prior art currently exists which uses in-line memory compression technology at the memory interface or on memory modules to achieve improved system performance. Therefore, a new system and method is desired to improve overall memory performance, including a reduction in the effective page swap time overhead as seen in present day computing systems. The present invention addresses these problems in a unique and novel hardware and software architecture.
One embodiment of the present invention discloses a system and process to initialize, operate, and shut down, through a combination of hardware and software procedures, an integrated circuit embedded on industry standard memory modules where such modules operate to improve performance of a computing system by the storage of compressed data in the system memory instead of data storage on the disk subsystem. The preferred embodiment of the present invention relates to computer system architectures, and more particularly to Compression Enabled Dual In-line Memory Modules (C-DIMM), which include an integrated chip mounted on DIMM, SODIMM, and SIMM or RIMM memory modules. It may include multiple DRAM memory types including SDRAM, DR-DRAM, and DDR-DRAM. It may also include memory subsystems that do not use industry standard in-line modules, but alternatively couple memory in a plurality of means to other system components. In addition, alternate embodiments of the present invention may be embedded into memory controllers and CPUs or into I/O subsystems and use the process of the present invention to improve system performance. The present invention increases effective memory density for all of the memory located within the memory subsystem. In addition, the invention increases performance without additional cost for in-line memory modules, disk-cache memory, disk storage devices, and network communications operations.
One embodiment of the present invention comprises a compression/decompression integrated circuit or chip, mounted on an industry-standard memory interface module such as a DIMM, SODIMM, SIMM, or RIMM module, or embedded into the memory subsystem with other discrete components. The embodiment may also comprise the software methods and procedures required for enabling the operation of the integrated circuit within standard operating system environments. In addition, the embodiment includes the method of transparent memory module operation prior to the activation of the integrated circuit. The integrated circuit may contain novel high rate parallel compression and decompression technology. The compression and decompression technology may provide lossless and/or lossy compression and decompression of data. In alternate embodiments, the integrated circuit may contain other algorithms such as encryption and decryption or other co-processing circuits. The system of the preferred embodiment mounts the compression/decompression chip, which may be referred to as the Compactor chip, onto an industry-standard or de facto standard memory module (for example, DIMM, RIMM, SODIMM, or SIMM). In alternate embodiments, the Compactor chip may be located in multiple areas of the computing device, including the core logic memory controller, CPU, Peripheral Component Interconnect (PCI) bus, or any other input/output bus coupled either directly or indirectly via additional control integrated circuits to the system memory subsystem. For purpose of this disclosure, the system of the preferred embodiment is referred to as the C-DIMM or Compression enabled Dual-Inline Memory Module.
As seen in prior art, the operation of the operating system's Virtual Memory Manager (VMM) continuously tags pages for future storage (typically 4096 bytes per page) from the system memory to the nonvolatile memory in order to open up additional memory space for higher priority tasks based on the software application's or driver's request. As used herein, nonvolatile memory may include, but is not limited to: hard disks, removable storage such as diskettes, and solid state memory such as flash memory. In addition, “swap-space” is used in both the system memory and on nonvolatile memory to make the memory allocation and de-allocation operation run smoothly. Stale page swap-space operation between system memory and disk controlled by the Virtual Memory Manager (VMM) typically follows the industry standard LRU/LFU operation as documented in multiple technology papers.
The present system includes the novel introduction of a compressed cache (CC) located within the system memory subsystem, or alternatively located elsewhere in the computer system. The CC may be allocated as a portion of the memory subsystem or alternatively may be separate memory used exclusively for the CC. In the preferred present embodiment, allocation of the system memory for the CC is initiated by the C-DIMM driver software requesting the system's operating system software to allocate the CC. The CC is memory mapped into the main system memory. Thus, the CC holds compressed pages of data in memory under the direction of the C-DIMM installable file system filters and the Compactor chip software driver.
In the preferred present embodiment, a compressed cache may be allocated for use with one or more cacheable objects such as devices, partitions, sector ranges, file systems, files, request types, process IDs, etc. Thus, one compressed cache may be used for one or more cacheable objects. Alternatively, two or more compressed caches may be allocated for use with one or more cacheable objects. Each compressed cache may be managed separately. A method may be provided to allow users of a system implementing the Compressed Cache architecture to configure the compressed caches associated with one or more objects on the system. Preferably, a computer program with a Graphical User Interface (GUI) is provided to allow the users of the system to assign and configure compressed caches for one or more cacheable objects.
In one embodiment, operation may proceed as follows:
First, the VMM software, working within the operating system software, may tag pages in pageable system memory that are stale as “inactive pages” which get scheduled for later placement onto the nonvolatile memory (for example, hard disk) and network storage. The stale pages may be program data, and thus destined to be written to a swap space area on the nonvolatile memory, or file data, and thus destined to be written to one or more files on the nonvolatile memory.
Second, Compressed Cache Manager (CCM) software operating on the computer system's CPU may receive a stale page transfer I/O request, and may then instruct C-DIMM device driver (CDD) software operating on the CPU to compress and store the stale page to the CC, typically located in system memory. In one embodiment, the CCM may pass a structure to the CDD comprising the location of the stale page in system memory and the destination location in the pre-allocated compressed cache for the page after compression. The structure may also comprise the original I/O request. In one embodiment, the Compactor Chip may have input buffer memory to receive the uncompressed page and output buffer memory to store the compressed page. In one embodiment, the input and output buffer memory may be comprised on the Compactor Chip. In another embodiment, the input and output buffer memory may be allocated in system memory. The CDD may write the stale page to the input buffer of the Compactor Chip. The Compactor Chip may then compress the stale page, preferably using a parallel compression algorithm, and store the compressed page in the output buffer. The CDD may then read the compressed page from the output buffer and write it to the destination location in the compressed cache. In an alternate embodiment, the CDD may pass the location of the stale page in system memory and the destination location for the compressed page in the compressed cache to the Compactor Chip, and the Compactor Chip may read the uncompressed page directly from system memory, compress the page, and store the compressed page directly to the compressed cache. After the compressed page is stored in the compressed cache, the operating system VMM and file system software think the stale page is stored on disk. However, no disk transfer has occurred, as would occur in prior art operating system operation.
Alternatively to being implemented as software executed on a CPU as stated above, the CCM may be implemented in hardware. In one embodiment, the CCM may be implemented as embedded hardware logic on the Compactor Chip.
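The second step can be sketched in software as follows. This is an illustration under stated assumptions, not the disclosed implementation: `zlib` stands in for the Compactor Chip's parallel compression algorithm, the compressed cache is modeled as a dictionary keyed by the page's intended disk address, and the names `CompressedCache`, `compress_on_chip`, and `store_stale_page` are hypothetical.

```python
import zlib

PAGE_SIZE = 4096  # typical page size given in the text


class CompressedCache:
    """Hypothetical model of the CC: intended disk address -> compressed page."""

    def __init__(self):
        self.pages = {}


def compress_on_chip(page):
    """Stand-in for the Compactor Chip; zlib replaces the hardware algorithm."""
    return zlib.compress(page)


def store_stale_page(cc, disk_address, page):
    """Compress a stale page and keep it in the CC instead of writing it to disk.

    Returns the compressed size in bytes.
    """
    if len(page) != PAGE_SIZE:
        raise ValueError("expected exactly one page of data")
    compressed = compress_on_chip(page)
    cc.pages[disk_address] = compressed  # no disk transfer takes place
    return len(compressed)
```

From the point of view of the caller, the page has been retired to `disk_address`; only the cache knows that the data is still held in memory in compressed form.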
Third, the C-DIMM file filter software, termed the Compressed Cache Manager (CCM), in order to make additional space within the CC, may either ask the file system to allocate more pages to the CC, or alternatively may move compressed pages from the CC to the disk. In the preferred embodiment, the monitoring of the CC to determine if additional space is needed, or if space is underutilized in the CC and thus may be freed, may be performed by a background task. The background task may also generate requests to the operating system to allocate or free memory for the CC, and may also initiate the transfer of compressed pages to the disk.
The transfer of compressed pages from the CC to disk may utilize novel Compressed Disk Manager (CDM) software executing on the CPU. The CDM may manage one or more Compressed Page Allocation Tables (CPATs) on partitions of nonvolatile memory pre-allocated as compressed partitions. A compressed partition of nonvolatile memory may be referred to as a CPAT cache. A CPAT and compressed partition may work similarly to a standard File Allocation Table (FAT) or file system (e.g., NTFS) used in present computer systems for the locating of pages that have been previously retired to disk. The CPAT represents a secondary File Allocation Table or FAT2. The CDM is responsible for managing the CPAT and for the translation of addresses between the compressed pages of the CC and the actual “physical” sectors on compressed partitions of nonvolatile memory, as well as translating addresses from non-compressed system memory and the compressed partition of nonvolatile memory when required. In one embodiment, the CDM may receive an I/O request to transfer one or more compressed pages from the CC to nonvolatile memory. In one embodiment, the I/O request may be generated by a background task that monitors compressed pages in the CC and generates write to disk I/O requests for compressed pages when the compressed pages need to be removed from the CC. In another embodiment, the CCM may generate the I/O request. The CDM may then read the compressed pages from the CC and write the compressed pages to a compressed partition on the nonvolatile memory. In one embodiment, the CDM may also receive page transfer I/O requests from the VMM and, if the pages were not compressed by the CCM, may interface with the CDD to compress and store the pages in the compressed partition using a method substantially similar to that described in step two for the CCM compressing and storing pages to the CC.
Fourth, the VMM may generate read I/O requests to read from the disk subsystem one or more pages previously paged out of system memory. Compressed Cache Manager (CCM) software operating on the computer system's CPU may receive a read I/O request for a page, and may then examine the compressed cache to see if the requested page is in the compressed cache.
Fifth, if the CCM determines that the compressed page is in the compressed cache, the CCM may instruct C-DIMM device driver (CDD) software operating on the CPU to decompress the page from the CC and move the decompressed page to the system memory. In one embodiment, the CCM may pass a structure to the CDD comprising the location of the compressed page in the CC and the destination location in the system memory for the page after decompression. The structure may also comprise the original I/O request. In one embodiment, the Compactor Chip may have input buffer memory to receive the compressed page and output buffer memory to store the decompressed page. In one embodiment, the input and output buffer memory may be comprised on the Compactor Chip. In another embodiment, the input and output buffer memory may be allocated in system memory. The CDD may write the compressed page to the input buffer of the Compactor Chip. The Compactor Chip may then decompress the page, preferably using a parallel decompression algorithm, and store the decompressed page in the output buffer. The CDD may then read the decompressed page from the output buffer and write it to the destination location in the system memory. In an alternate embodiment, the CDD may pass the location of the compressed page in the CC and the destination location for the decompressed page in the system memory to the Compactor Chip, and the Compactor Chip may read the compressed page directly from the CC, decompress the page, and store the decompressed page directly to the system memory. Once decompression is complete, the C-DIMM device driver may indicate to the VMM that the requested page is now in system memory and ready for use by application software.
Sixth, if the CCM determines that the requested page is not in the compressed cache, then the CCM may generate a read I/O request to read the page from nonvolatile storage. The read I/O request may include the source location of the requested page on nonvolatile storage and the destination compressed cache location for the requested page. The CDM may receive the read I/O request and examine the CPATs of one or more compressed partitions on nonvolatile memory to see if the requested page is stored in a compressed partition. If the page is stored in a compressed partition, the CDM may translate (via the CPAT) the source location from the I/O request, retrieve the compressed page from the compressed partition on nonvolatile storage, and write the compressed page to the destination location in the compressed cache. The CCM may then proceed with the decompression of the requested page as described in the fifth step. In one embodiment, the CDM may also receive a read I/O request directly from the VMM, search for the requested page in the compressed partitions as described above, and, if the page is located in a compressed partition, may interface with the CDD to decompress and store the requested page to the destination location in system memory using a method substantially similar to that described in step five for the CCM decompressing and storing pages to the system memory.
In an alternate embodiment, in order to retrieve a compressed page from the disk subsystem, the CCM may directly call the CDM for FAT2 address translation in order to obtain the disk partition and sector address and to read the compressed page from the disk subsystem into the CC. The decompression process into an active page may then be performed as described in step five.
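The address translation role of a CPAT can be illustrated with a small table that maps an uncompressed page address to the sectors holding its compressed form. Everything here is an assumption made for illustration: the 512-byte sector size, the append-only allocation policy, and the names `Cpat`, `CpatEntry`, `allocate`, and `translate` do not come from the source, which does not describe the table layout.

```python
from dataclasses import dataclass

SECTOR_SIZE = 512  # assumption: the source does not state a sector size


@dataclass
class CpatEntry:
    first_sector: int       # first sector of the compressed page on the partition
    compressed_length: int  # compressed size in bytes


class Cpat:
    """Hypothetical Compressed Page Allocation Table for one compressed partition."""

    def __init__(self):
        self.entries = {}          # uncompressed page address -> CpatEntry
        self.next_free_sector = 0  # assumption: simple append-only allocation

    def allocate(self, page_address, compressed_length):
        """Reserve whole sectors for a compressed page and record the mapping."""
        sectors_needed = -(-compressed_length // SECTOR_SIZE)  # ceiling division
        entry = CpatEntry(self.next_free_sector, compressed_length)
        self.entries[page_address] = entry
        self.next_free_sector += sectors_needed
        return entry

    def translate(self, page_address):
        """Return the entry for a page, or None if it is not on this partition."""
        return self.entries.get(page_address)
```

Because compressed pages vary in size, the table has to record a length as well as a starting sector, which is the main difference from a lookup over fixed-size uncompressed pages.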
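The read path of the fourth through sixth steps can be summarized as a lookup chain: the compressed cache first, then the compressed partition, then an ordinary disk read. The sketch below assumes dictionaries for the compressed cache and the compressed partition, `zlib` in place of the Compactor Chip, and a caller-supplied `read_uncompressed` callback for pages that were never compressed; all of these names are hypothetical.

```python
import zlib


def read_page(page_address, compressed_cache, compressed_partition, read_uncompressed):
    """Return the uncompressed page for a read I/O request.

    compressed_cache and compressed_partition map page addresses to compressed
    bytes; read_uncompressed is a fallback for pages stored uncompressed on disk.
    """
    compressed = compressed_cache.get(page_address)          # fourth step: check the CC
    if compressed is None:
        compressed = compressed_partition.get(page_address)  # sixth step: check the partition
        if compressed is None:
            return read_uncompressed(page_address)           # ordinary disk read
        compressed_cache[page_address] = compressed          # stage the page in the CC
    return zlib.decompress(compressed)                       # fifth step: decompress
```

A hit in the compressed cache never touches the disk, which is where the performance gain described in the text comes from.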
These six process steps represent over three orders of magnitude improvement in allocation and transmission of VMM requested pages to the active area of the system memory. For example, page transfers between the CC and “active” portions of the system memory are on the order of 15 μs per page, while pages requested in a conventional system from the disk subsystem to the active area of system memory require around 10 ms per page. Transferring compressed data rather than uncompressed data to and from nonvolatile storage such as a disk subsystem also represents a significant improvement in performance due to the decrease in transfer times between system memory and the nonvolatile storage.
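One way to see how the per-page figures above affect overall behavior is a simple weighted average over compressed cache hits and misses. The hit rate is a hypothetical parameter introduced here for illustration, and the model ignores the extra decompression time for pages that come from disk.

```python
# Typical per-page figures quoted in the text, in microseconds.
CC_PAGE_TRANSFER_US = 15
DISK_PAGE_TRANSFER_US = 10_000


def average_page_fetch_us(cc_hit_rate):
    """Average page fetch time for a given fraction of requests served by the CC."""
    if not 0.0 <= cc_hit_rate <= 1.0:
        raise ValueError("hit rate must be between 0 and 1")
    return (cc_hit_rate * CC_PAGE_TRANSFER_US
            + (1.0 - cc_hit_rate) * DISK_PAGE_TRANSFER_US)
```

Under this model the average is dominated by disk accesses unless most requested pages are found in the compressed cache, so the benefit depends on how many stale pages the cache can hold.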
Thus, reduction of the time required to activate pages results in improved efficiency and reduced cost. In addition, secondary benefits result in more effective disk space, peripheral bus bandwidth, and reduced transmission time across LANs and WANs when the remote server also has C-DIMM or equivalent technology installed.
In summary, the capabilities of the present invention remove system bottlenecks, allowing more effective data transfer and data storage density. By keeping compressed pages in memory and moving fewer pages to the disk subsystem for temporary storage, the system can fit more application data in the memory subsystem and reduce execution and data movement time significantly. In addition, multiple compactor chips of multiple variety and function can be installed for additional performance enhancements or processing of tasks. A single C-DIMM can effectively achieve performance similar to doubling of the memory resident in the memory subsystem. This represents a significant improvement in the cost and operation of present day workstations, data servers and other computing devices that are memory intensive. Thus the compression enabled memory module (C-DIMM) or alternate embodiment of the compactor chip technology, along with the process of moving data through software control filters, is a significant advance over the operation of current software based compression technology running from a specific CPU application program.
Inventions
“In-memory” compression is best applied to any system where the addition of memory improves performance when operating standard or non-standard applications. By application of the present invention for in-memory compression, system disk request rates are decreased, increasing effective application speed and thus establishing a new price per number of operations which a computing system can achieve.
The present invention includes parallel data compression and decompression logic, designed for the reduction of data bandwidth and storage requirements and for compressing and decompressing data at a high rate. The compression/decompression logic may be referred to as a “Compactor Chip.” The Compactor Chip may be included in any of various devices, including, but not limited to: a memory controller; memory modules such as a DIMM; a processor or CPU; peripheral devices, such as a network interface card, modem, ISDN terminal adapter, ATM adapter, etc.; and network devices, such as routers, hubs, switches, bridges, etc., among others.
In the present invention, the parallel data compression engine and method, preferably implemented on the Compactor Chip, operates to perform parallel compression of data. In one embodiment, the method first involves receiving uncompressed data, wherein the uncompressed data comprises a plurality of symbols. The method also may maintain a history table comprising entries, wherein each entry comprises at least one symbol. The method may operate to compare a plurality of symbols with entries in the history table in a parallel fashion, wherein this comparison produces compare results. The method may then determine match information for each of the plurality of symbols based on the compare results. The step of determining match information may involve determining zero or more matches of the plurality of symbols with each entry in the history table. The method then outputs compressed data in response to the match information.
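The compare step described above can be modeled in software as a matrix of comparisons between every history entry and every incoming symbol. This is a simplified sketch: the nested loops are sequential, whereas the hardware is described as performing the comparisons in parallel, and the function names and the (index, length) form of the match information are assumptions rather than the disclosed encoding.

```python
def compare_results(history, symbols):
    """Compare every history entry with every input symbol.

    Returns a matrix indexed as results[history_index][symbol_index].
    In hardware these comparisons would all happen at once.
    """
    return [[entry == symbol for symbol in symbols] for entry in history]


def longest_match(history, symbols):
    """Derive match information from the compare results.

    Returns (history_index, length) for the longest run of history entries that
    matches the start of symbols, or None if the first symbol has no match.
    """
    results = compare_results(history, symbols)
    best = None
    for start in range(len(history)):
        length = 0
        while (length < len(symbols)
               and start + length < len(history)
               and results[start + length][length]):
            length += 1
        if length and (best is None or length > best[1]):
            best = (start, length)
    return best
```

A compressor built on this would emit a reference to the matched history position in place of the matched symbols and emit unmatched symbols as literals.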
In the present invention, the parallel decompression engine and method, preferably implemented on the Compactor Chip, may decompress input compressed data in one or more decompression cycles, with a plurality of codes (tokens) typically being decompressed in each cycle in parallel. The parallel decompression engine may include an input for receiving compressed data, a history table (also referred to as a history window), and a plurality of decoders for examining and decoding a plurality of codes (tokens) from the compressed data in parallel in a series of decompression cycles. A code or token may represent one or more compressed symbols or one uncompressed symbol. The parallel decompression engine may also include preliminary select generation logic for generating a plurality of preliminary selects in parallel. A preliminary select may point to an uncompressed symbol in the history window, an uncompressed symbol from a token in the current decompression cycle, or a symbol being decompressed in the current decompression cycle. The parallel decompression engine may also include final select generation logic for resolving preliminary selects and generating a plurality of final selects in parallel. Each of the plurality of final selects points either to an uncompressed symbol in the history window or to an uncompressed symbol from a token in the current decompression cycle. The parallel decompression engine may also include uncompressed data output logic for generating the uncompressed data from the uncompressed symbols pointed to by the plurality of final selects, and for storing the symbols decompressed in this cycle in the history window. The decompression engine may also include an output for outputting the uncompressed data produced in the decompression cycles.
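The relationship between preliminary and final selects can be illustrated with a sequential software model of one decompression cycle. The token encoding used here, `("lit", symbol)` for an uncompressed symbol and `("copy", index, length)` for a run starting at an absolute history index, is hypothetical, as is the function name; the hardware described above generates and resolves the selects in parallel rather than in loops.

```python
def decompress_cycle(history, tokens):
    """Decode one cycle of tokens; return the new symbols and extend history.

    Tokens are ("lit", symbol) or ("copy", index, length), a hypothetical encoding
    in which index counts from the start of the history window.
    """
    # Preliminary selects: one per output symbol of this cycle.
    prelim = []
    for token in tokens:
        if token[0] == "lit":
            prelim.append(("token", token[1]))       # symbol carried by the token
        else:
            _, index, length = token
            for k in range(length):
                src = index + k
                if src < len(history):
                    prelim.append(("history", src))  # symbol already in the window
                else:
                    # Symbol that is itself being decompressed in this cycle.
                    prelim.append(("output", src - len(history)))

    # Final selects: resolve references to symbols produced in this same cycle.
    final = []
    for select in prelim:
        while select[0] == "output":
            select = final[select[1]]
        final.append(select)

    output = [history[s[1]] if s[0] == "history" else s[1] for s in final]
    history.extend(output)  # store this cycle's symbols in the history window
    return output
```

After resolution every final select is either a history reference or a token symbol, matching the two cases the text allows for final selects.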