Semiconductor technologies have advanced rapidly over the past decades. However, disk storage (e.g., hard drives) has not kept pace with computer main memories (e.g., DRAM) in terms of access speed. Although the storage capacity of magnetic disks has increased dramatically, their mechanical nature continues to limit disk access speed. The result is a widening speed gap between main memories and disk storage in a computer system. Main memories, in turn, have failed to keep up with the speed of processors, resulting in another speed gap between main memories and processors. Because of these speed gaps, a processor has to wait a substantial amount of time for data access operations to complete. This waiting time has become a serious penalty on the performance of computer systems.
Caching is a primary mechanism for reducing access latency. A memory storage subsystem can include one or more layers of memory cache to bridge the performance gap between processors and main memories. For example, many computer systems nowadays have at least three levels of memory caches between a processor and a main memory. In a similar way, modern file systems generally use large non-volatile caches to speed up storage drive access. In recent developments, solid state disks based on non-volatile memory technology are deployed between main memories and storage drives (e.g., hard drives), acting as a new layer of disk cache. Non-volatile random access memory (NVRAM) is the general name used to describe any type of random access memory that does not lose its information when power is turned off. The non-volatile cache handles the data most frequently written to or retrieved from storage and can also effectively increase the capacity of the drive. The non-volatile cache can be an integrated part of a hard disk, external to a hard disk but contained in the housing of the hard disk, or entirely external to a hard disk.
The cache technology increases the performance of the data storage system and enhances the overall system responsiveness. The additional layer of disk cache can store duplicates of frequently used data from a main storage drive, thereby dramatically reducing the number of times a system needs to spend power and time finding small pieces of data scattered across the main storage drive. The solid state disk also enables the system to store boot and resume information in a cache.
A caching mechanism is most efficient when the cache is occupied with data that are accessed frequently. In an adverse situation, after a long series of sequential accesses to one-time-use-only (cold) data blocks, many frequently accessed data blocks may be evicted from the cache immediately, leaving the cold blocks occupying the cache for an unfavorable amount of time and thus wasting memory resources. A solid state disk acting as a disk cache may suffer the same negative impact in such a situation.
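The eviction effect described above can be illustrated with a minimal sketch. It assumes a least-recently-used (LRU) replacement policy, which the text does not specify, and the class name `LRUCache`, the cache capacity, and the block numbers are all hypothetical values chosen only for illustration.

```python
from collections import OrderedDict


class LRUCache:
    """Tiny LRU cache keyed by block number (illustrative assumption only)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()  # oldest entry first, newest last

    def access(self, block):
        """Return True on a hit; on a miss, cache the block and evict the LRU one if full."""
        if block in self.blocks:
            self.blocks.move_to_end(block)  # mark as most recently used
            return True
        self.blocks[block] = None
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)  # evict least recently used block
        return False


def hit_ratio(cache, trace):
    """Fraction of accesses in the trace that hit the cache."""
    hits = sum(cache.access(block) for block in trace)
    return hits / len(trace)


hot = list(range(8))  # frequently accessed blocks (hypothetical)
cache = LRUCache(capacity=8)
for block in hot:
    cache.access(block)  # warm the cache with the hot blocks

warm_ratio = hit_ratio(cache, hot * 4)  # hot blocks are served from the cache

# A one-time sequential scan over cold blocks, twice the cache size.
for block in range(1000, 1016):
    cache.access(block)

hot_left = sum(block in cache.blocks for block in hot)
after_scan_ratio = hit_ratio(cache, hot)  # hot blocks must be fetched again

print(warm_ratio, hot_left, after_scan_ratio)
```

In this sketch the scan reads each cold block exactly once, yet it displaces every hot block, so the next pass over the hot blocks misses on every access.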
Data access activities to or from a storage drive that involve storing a large data workload in a cache and then discarding that workload without reusing it are often known as cache thrashing. Such data access activities often originate from applications referred to as cache thrashing applications. A large cache capacity can quickly become irrelevant if the I/O requests coming from cache thrashing applications exceed the size of the disk cache. To increase the performance of a cache, it is important to retain the data that are frequently accessed and to remove data that will not be required in the near future (e.g., data that are only required once). It would be greatly useful to identify applications that exhibit cache thrashing behavior and subsequently prevent the data requested by these applications from being stored in the disk cache, thereby increasing the efficiency of cache usage.
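As a hedged sketch of the last point, the following assumes that a cache thrashing process has already been identified by some other means and shows only the effect of not caching its requests. The class `BypassLRUCache`, the `bypass_pids` parameter, the process identifiers, and the LRU policy are hypothetical and are not taken from the text.

```python
from collections import OrderedDict


class BypassLRUCache:
    """LRU disk cache that skips insertion for flagged processes (illustrative only)."""

    def __init__(self, capacity, bypass_pids=()):
        self.capacity = capacity
        self.blocks = OrderedDict()
        self.bypass_pids = set(bypass_pids)  # assumed to be identified elsewhere

    def read(self, pid, block):
        """Return True on a cache hit, False if the block came from the drive."""
        if block in self.blocks:
            self.blocks.move_to_end(block)
            return True
        if pid in self.bypass_pids:
            return False  # served from the drive, deliberately not cached
        self.blocks[block] = None
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)  # evict least recently used block
        return False


APP_PID, SCANNER_PID = 1, 2  # hypothetical process identifiers
HOT_BLOCKS = (10, 11, 12, 13)

cache = BypassLRUCache(capacity=4, bypass_pids={SCANNER_PID})
for block in HOT_BLOCKS:
    cache.read(APP_PID, block)  # ordinary application fills the cache

for block in range(500, 600):
    cache.read(SCANNER_PID, block)  # large one-time workload bypasses the cache

surviving = [block for block in HOT_BLOCKS if block in cache.blocks]
print(surviving)
```

Because the flagged process never inserts into the cache, the application's blocks remain cached even though the one-time workload is far larger than the cache.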
A data mining process (DMP) herein refers to a computer process that issues disk requests (or other memory access requests) to random locations in a data storage (e.g., storage drive, main memory) and whose requested results are rarely reused. The workload requested by the process is large relative to the capacity of the caching mechanism in place, so reusable data are evicted from the cache if the results of the disk requests are cached. Such computer processes exist in, but are not limited to, applications such as computer virus scan applications, file indexing applications, data mining applications, disk scanning applications, and file streaming applications. For example, a virus scanner process randomly scans all the files on a storage drive. The disk requests from the virus scanner process, if cached, would cause the disk cache, which stores the most recently used data, to be flushed with one-time-use-only data while the scanning process runs. Some sequential streaming applications can be detected by examining the addresses of a series of disk requests. If the series of disk requests accesses consecutive locations on a storage drive, the process is most likely a sequential streaming application. A data mining process is not easily detectable using the same technique, as the process may issue disk requests to random locations in a computer storage.
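The address-based check for sequential streaming mentioned above might look like the following minimal sketch. The function name `looks_sequential`, the `(start_block, length)` request format, and the `min_run` threshold are assumptions made for illustration; the text does not prescribe any of them.

```python
def looks_sequential(requests, min_run=4):
    """Return True if at least `min_run` requests in a row are address-consecutive.

    `requests` is a list of (start_block, length) tuples in arrival order;
    a request is consecutive if it starts where the previous one ended.
    """
    run = 1
    for (prev_start, prev_len), (start, _length) in zip(requests, requests[1:]):
        if start == prev_start + prev_len:
            run += 1
            if run >= min_run:
                return True
        else:
            run = 1  # the chain of consecutive requests is broken
    return False


# Hypothetical request traces.
streaming = [(0, 8), (8, 8), (16, 8), (24, 8), (32, 8)]
scattered = [(912, 8), (40, 8), (7730, 8), (256, 8), (5001, 8)]

print(looks_sequential(streaming))  # True
print(looks_sequential(scattered))  # False
```

The second trace shows the limitation noted above: a process that reads random locations never forms a consecutive run, so this check alone does not flag it even when its data are used only once.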