A new industry is developing around data storage that is not database related. Applications in this industry often involve vast amounts of continually changing data. One such application involves storing data from a stock market. This application poses a particular problem because stock data is both voluminous and frequently accessed, while the frequency with which the data is updated varies widely. Some stocks are traded frequently and have continually changing value data, resulting in a large number of reads and frequent writes. Most other stocks, however, trade relatively infrequently; their value data typically does not change much but is still subject to a significant number of read operations. Because of the volatility of the data and the immediacy of the need for it, large amounts of data must be accessible at all times. One way to address this need is to retain all of the data in volatile local memory, typically in a volatile data structure referred to as a cache.
Various memory-caching solutions store objects in memory based on a key value. This is similar in concept to a database, but these solutions generally hold objects for quick access and are not generally used for persistence. With key-based access, a data set is associated with a key value, and the key value must be provided in order to retrieve the data set, in much the same manner as an index.
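Key-based access of this kind can be illustrated with a minimal sketch. The class name `KeyedCache` and its methods are hypothetical and chosen only for illustration; the sketch assumes a single process and uses the standard `ConcurrentHashMap` as the backing structure, which is one possible choice among many.

```java
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical illustration: a minimal in-memory cache with key-based access.
// Nothing is persisted; all entries are lost when the process exits.
final class KeyedCache<K, V> {
    private final ConcurrentMap<K, V> entries = new ConcurrentHashMap<>();

    // Associates a data set (the value) with a key, replacing any earlier value.
    void put(K key, V value) {
        entries.put(key, value);
    }

    // The key must be supplied to retrieve the data set; an unknown key yields empty.
    Optional<V> get(K key) {
        return Optional.ofNullable(entries.get(key));
    }

    int size() {
        return entries.size();
    }
}
```

For example, a stock quote could be stored with `cache.put("ACME", 101.25)` and read back with `cache.get("ACME")`, where the ticker symbol `"ACME"` is an invented key.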
One problem that arises for these solutions is scalability in volatile memory. For example, in a 32-bit environment such as a 32-bit JAVA® programming environment, there is typically at most about 2 GB of addressable memory available to any process. When the amount of data that needs to be stored is greater than 2 GB, a more expensive and complex 64-bit architecture is often required. Alternatively, data can be partitioned across multiple processes by key, e.g., so that data associated with different groups of keys is accessible in different processes.
There are inefficiencies, however, in these conventional approaches. For instance, with a 64-bit architecture, replicating several gigabytes or terabytes of data from one 64-bit address space to another can take a significant amount of time, which can complicate high-availability environments. Garbage collection (attempting to reclaim memory used by objects that will never again be accessed by the application) can also cause problems in very large caches, because resources must be spent locating objects that are no longer reachable in a massive heap.
Partitioning by key reduces the amount of data any individual partition must store, but at the cost of transactional complexity. In moderately complex applications, data must be accessed and updated across partitions, resulting in two-phase transaction protocols across these partitions, which can be slow and blocking.
Traditional partitioning processes apply a hash function or hash algorithm to the key of each keyed data set, and then replicate the data based on the hash function for availability. A problem associated with partitioning is the need to access multiple processes when a transaction accesses multiple keys, which takes time and slows data access. Additionally, some applications cannot be partitioned or do not partition well.
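The multi-partition access problem can be made concrete with a small sketch of hash partitioning. This is an illustration under stated assumptions, not a description of any particular product: the class `HashPartitionedStore` and its methods are hypothetical, each "process" is simulated by an in-memory map, and `String.hashCode()` reduced with `Math.floorMod` stands in for whatever hash function a real system would use.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration: keyed data spread over partitions by hashing the key.
// Each map stands in for a separate process; no real inter-process calls are made.
final class HashPartitionedStore {
    private final List<Map<String, String>> partitions = new ArrayList<>();

    HashPartitionedStore(int partitionCount) {
        if (partitionCount <= 0) {
            throw new IllegalArgumentException("partitionCount must be positive");
        }
        for (int i = 0; i < partitionCount; i++) {
            partitions.add(new HashMap<>());
        }
    }

    // floorMod keeps the index non-negative even when hashCode() is negative.
    int partitionFor(String key) {
        return Math.floorMod(key.hashCode(), partitions.size());
    }

    void put(String key, String value) {
        partitions.get(partitionFor(key)).put(key, value);
    }

    String get(String key) {
        return partitions.get(partitionFor(key)).get(key);
    }

    // Reports which partitions a multi-key transaction would have to contact.
    Set<Integer> partitionsTouched(List<String> keys) {
        Set<Integer> touched = new TreeSet<>();
        for (String key : keys) {
            touched.add(partitionFor(key));
        }
        return touched;
    }
}
```

When `partitionsTouched` returns more than one index for the keys of a single transaction, that transaction would need to coordinate across several processes, which is the overhead described above.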
Another problem is availability. The data must be kept redundantly so that software, hardware, or network failures can be masked. This is generally accomplished through data replication from a primary process to one or more replica processes, resulting in a complete copy of all the data in the partition. This becomes problematic when the data grows so large that replication or recovery takes too long.
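The cost of full-copy replication can be sketched as follows. The class `ReplicatedPartition` and its `failOver` method are hypothetical, the primary and replica are simulated by two in-memory maps in one process, and synchronous write-through is assumed purely for simplicity. The point of the sketch is only that rebuilding a replica copies every entry, so the work grows with the size of the partition.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration: a partition kept redundantly as a primary and one replica.
final class ReplicatedPartition {
    private Map<String, String> primary = new HashMap<>();
    private Map<String, String> replica = new HashMap<>();

    // Assumed synchronous replication: every write is applied to both copies.
    void put(String key, String value) {
        primary.put(key, value);
        replica.put(key, value);
    }

    String get(String key) {
        return primary.get(key);
    }

    // Simulated failure of the primary: the replica is promoted, and a new
    // replica is built by copying the complete data set. Returns the number
    // of entries copied, which is proportional to the partition size.
    int failOver() {
        primary = replica;
        replica = new HashMap<>(primary);
        return replica.size();
    }
}
```

In this sketch the failure is masked because reads continue to succeed after `failOver()`, but the recovery step touches the whole data set.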
Each of the above-mentioned solutions has a common problem: each requires either replicating data or partitioning it and communicating with multiple partitions. Either way, considerable time is added to the solution, whether by replicating extremely large amounts of data or by accessing multiple processes to retrieve multiple sets of keyed data. Accordingly, there is a need in the art for an improved way of storing and accessing large amounts of keyed data in volatile memory without adding significant time for replication or complexity of access.