In the field of data management, maximizing both scalability and performance at the same time is a challenging problem. Scalability and performance are often conflicting objectives because improvements in one typically come at the expense of the other. In a multicomputer environment, where processes are physically distributed, the problem is exacerbated by network latency and communication bandwidth between nodes. The problem is even more daunting if nodes must join the distributed system in an ad-hoc fashion without service interruption.
Prior solutions provide high scalability at the expense of performance, or vice versa. Prior distributed data management systems, for example, though highly scalable, are designed to maintain invariance over several data objects at once and are thus encumbered by the need for transactional scope management and by the need for distributed global locking.
In particular, Database Management Systems (DBMSs) guarantee that data modifications are strictly serializable and thus require expensive transaction management overhead and distributed locking to ensure data correctness. The need for such transaction overhead and locking greatly reduces concurrent data access and limits performance.
Prior Scalable Distributed Data Structures (SDDS) solutions offer dynamic data scalability in a multicomputer environment, but encounter vexing performance problems that limit operational utility. Like distributed DBMS solutions, existing SDDS solutions inevitably encounter performance bottlenecks when accessed by a plurality of concurrent users. Though SDDS solutions can load balance data uniformly across multiple computer nodes, access to data on a particular node can be blocked undesirably by concurrent requests.
SDDS solutions also encounter performance limitations when managing data with complex shapes or of large size. Data sets composed of complex relationships form deep object graphs that incur expensive serialization costs. Compared to primitive data types, the computational cost of serializing and deserializing a complex object graph is significant. As a step in the process of data transfer, the impact of slow serialization on overall performance can be profound.
Moreover, prior SDDS solutions virtualize data access by resolving client requests from server nodes that contain actual data. If a requested object is managed by a server node that is different from the client node, a network data transfer must occur to move the object from the server node to the client node. Because large objects consume significant network bandwidth and result in undesirable transfer latency, SDDS solutions inevitably encounter performance bottlenecks because they must repeatedly drag large objects across the network for every remote request.
The present invention provides equivalent representations of complex data types that result in compressed byte arrays. These compressed data representations are stored and only reified back to their original format as needed. Accordingly, the invention provides data translation and passivation that not only reduce the storage footprint but also speed data transfer. The invention provides caching and synchronization of data sets without the expensive node-to-node data transfers that are commonly used. The invention provides scalable data structures, concurrency, efficient serialization and passivation, and data caching that enable applications to store and retrieve data in a manner that is optimal for use in a distributed environment where high speed service delivery and graceful scalability are critical.
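The idea of storing a compressed byte array and reifying it only on demand can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the disclosed implementation: the class name `PassivatedValue`, the use of `pickle` for serialization, and the use of `zlib` for compression are all choices made here for the example.

```python
import pickle
import zlib


class PassivatedValue:
    """Holds a value as a compressed byte array until it is needed.

    Hypothetical helper for illustration; pickle and zlib are assumed
    stand-ins for whatever translation and compression a system uses.
    """

    def __init__(self, value):
        # Passivate: serialize the object graph, then compress the bytes.
        self._blob = zlib.compress(pickle.dumps(value))

    def size(self):
        # Number of bytes actually held in storage.
        return len(self._blob)

    def reify(self):
        # Reify: decompress and deserialize back to the original form.
        return pickle.loads(zlib.decompress(self._blob))


# A nested record standing in for a "complex data type".
record = {
    "customer": "example",
    "orders": [{"sku": f"A-{i}", "qty": i % 5} for i in range(500)],
}

stored = PassivatedValue(record)
raw_size = len(pickle.dumps(record))   # size before compression
restored = stored.reify()              # original form, rebuilt on demand
```

In this sketch only the compact form is kept between accesses, so the same bytes could serve both as the stored representation and as the payload for a transfer.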
In one implementation, the invention includes a distributed data management system with multiple virtual machine nodes operating on multiple computers that are in communication with each other over a computer network. Each virtual machine node includes at least one data store or “bucket” for receiving data. A digital hash map data structure is stored in a computer readable medium of at least one of the multiple computers to configure the multiple virtual machine nodes and buckets to provide concurrent, non-blocking access to data in the buckets, the digital hash map data structure including a mapping between the virtual machine nodes and the buckets. The distributed data management system employs dynamic scalability in which one or more buckets from a virtual machine node reaching a memory capacity threshold are transferred to another virtual machine node that is below its memory capacity threshold.
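The mapping between nodes and buckets, and the transfer of buckets away from a node that reaches its threshold, can be sketched in a few lines of Python. This is a single-process illustration only, with no networking and no concurrency control. The class `BucketDirectory`, the node names, the use of entry counts as a stand-in for memory capacity, and the policy of moving the fullest bucket to the least loaded node are all assumptions made for the example.

```python
class BucketDirectory:
    """Hash-map directory from buckets to owning nodes (hypothetical sketch)."""

    def __init__(self, node_ids, bucket_count, capacity):
        self.node_ids = list(node_ids)
        # Assumed proxy for a memory threshold: max entries per node.
        self.capacity = capacity
        self.buckets = [{} for _ in range(bucket_count)]
        # Every bucket starts on the first node; other nodes start empty.
        self.owner = {b: self.node_ids[0] for b in range(bucket_count)}

    def _bucket_of(self, key):
        # Built-in hash() is only stable within one process; a real
        # distributed system would need a hash that all nodes agree on.
        return hash(key) % len(self.buckets)

    def load(self, node_id):
        # Total entries held in the buckets this node owns.
        return sum(len(self.buckets[b])
                   for b, n in self.owner.items() if n == node_id)

    def put(self, key, value):
        b = self._bucket_of(key)
        self.buckets[b][key] = value
        self._rebalance(self.owner[b])

    def get(self, key):
        # The key-to-bucket mapping never changes; only ownership does.
        return self.buckets[self._bucket_of(key)].get(key)

    def _rebalance(self, node_id):
        # Move whole buckets off a node while it is over its threshold.
        while self.load(node_id) > self.capacity:
            target = min(self.node_ids, key=self.load)
            movable = [b for b, n in self.owner.items()
                       if n == node_id and self.buckets[b]]
            if target == node_id or not movable:
                return
            bucket = max(movable, key=lambda b: len(self.buckets[b]))
            # Only transfer if the target stays within its own threshold.
            if self.load(target) + len(self.buckets[bucket]) > self.capacity:
                return
            self.owner[bucket] = target


directory = BucketDirectory(["vm-a", "vm-b"], bucket_count=8, capacity=25)
for key in range(40):
    directory.put(key, f"value-{key}")
```

Because a transfer changes only the directory entry for a bucket, lookups by key resolve to the same bucket before and after the move. In this sketch that is what keeps data reachable while ownership changes.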
The present invention eliminates the need for the transaction management of conventional systems and maximizes concurrency through a distributed data structure that enables concurrent access to all data at all times, even as the data structure is growing. With its non-blocking approach to data management, the disclosed invention overcomes the concurrency limitations of prior systems while offering the dynamic scalability benefits of SDDS.
Creating a data management system that scales automatically to any size, enables concurrent user access, and guarantees high performance in a multicomputer environment is a daunting task. The method and system implemented herein solve the ubiquitous data management problem of high performance, concurrent access to data in a distributed environment, wherein the amount of data may grow dynamically and be modified at any time.
Additional objects and advantages of the present invention will be apparent from the detailed description of the preferred embodiment thereof, which proceeds with reference to the accompanying drawings.