The present disclosure relates generally to cloud computing and more particularly to a scalable distributed data management system utilizing load-balancing techniques including data distribution and distributed indexing to leverage a cloud computing system.
Cloud computing services can provide computational capacity, data access, networking/routing and storage services via a large pool of shared resources operated by a cloud computing provider. Because the computing resources are delivered over a network, cloud computing is location-independent computing: all resources are provided to end-users on demand, and control of the physical resources is separated from control of the computing resources.
Cloud computing is a model for enabling access to a shared collection of computing resources: networks for transfer, servers for storage, and applications or services for completing work. More specifically, the term “cloud computing” describes a consumption and delivery model for IT services based on the Internet, and it typically involves over-the-Internet provisioning of dynamically scalable and often virtualized resources. This frequently takes the form of web-based tools or applications that users can access and use through a web browser as if they were programs installed locally on their own computers. Details are abstracted from consumers, who no longer need expertise in, or control over, the technology infrastructure “in the cloud” that supports them. Most cloud computing infrastructures consist of services delivered through common centers and built on servers. Clouds often appear as single points of access for consumers' computing needs, and do not require end-user knowledge of the physical location and configuration of the system that delivers the services.
The utility model of cloud computing is useful because many of the computers in place in data centers today are underutilized in computing power and networking bandwidth. People may briefly need a large amount of computing capacity to complete a computation, for example, but may not need the computing power once the computation is done. The cloud computing utility model provides computing resources on an on-demand basis with the flexibility to redistribute resources automatically or with little intervention.
The flexibility of the cloud lends itself to a number of solutions for storing, retrieving, and analyzing large datasets. Relational database management systems (RDBMS) are data management systems designed to handle large amounts of interrelated data. RDBMS organize related data into tables optimized for rapid data access while maintaining core requirements of atomicity (the requirement that a transaction either be entirely successful or have all changes made by its partially successful execution reverted), consistency (the requirement that transactions must not violate specified database consistency checks), isolation (the requirement that no transaction can interfere with another transaction), and durability (the requirement that committed transactions be written to a permanent location instead of, for example, a buffer). Many RDBMS protocols support splitting large amounts of data over multiple computing nodes. In a horizontally distributed environment, transaction data may be stored among multiple nodes, whereas in a vertically distributed environment, the data may be replicated at multiple nodes. As can be seen, the task of achieving reasonable query performance in a distributed network while maintaining atomicity, consistency, isolation and durability is non-trivial. Challenges inherent in the task have necessitated tradeoffs that make distributed RDBMS suit some applications better than others.
Due in part to these tradeoffs, a number of NoSQL-type systems have emerged. These systems soften the requirements of a relational database in exchange for increased performance. Abandoning certain RDBMS tenets can pay dividends, particularly in a distributed environment. For example, a NoSQL system may employ an eventual consistency model to improve data transactions. Under an eventual consistency model, transaction results will propagate to appropriate data locations eventually as opposed to arriving at a guaranteed time. Propagating results and synchronizing data require considerable overhead, and deprioritizing certain writes can relieve the burden on the hardware, including the storage elements and the supporting network. It can also improve query response time.
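By way of illustration only, the eventual consistency model described above may be sketched as follows. This is a minimal, single-process sketch and not a description of any particular NoSQL system; the names `Replica`, `EventuallyConsistentStore`, `write`, `read`, and `propagate` are hypothetical and are introduced solely for this example. It assumes a single writer and an in-memory queue standing in for the network.

```python
from collections import deque


class Replica:
    """A hypothetical data location holding its own copy of the data."""

    def __init__(self, name):
        self.name = name
        self.data = {}


class EventuallyConsistentStore:
    """Illustrative store: writes land on one replica at once and
    reach the remaining replicas only when propagate() is called."""

    def __init__(self, replica_names):
        self.replicas = [Replica(n) for n in replica_names]
        # Deferred updates as (replica_index, key, value), in write order.
        self.pending = deque()

    def write(self, key, value):
        # Apply to the first replica immediately so the caller is not
        # blocked on synchronizing every copy.
        self.replicas[0].data[key] = value
        # Defer delivery to the other replicas (the deprioritized writes).
        for index in range(1, len(self.replicas)):
            self.pending.append((index, key, value))

    def read(self, key, replica_index=0):
        # A read may return stale data (or None) until propagation occurs.
        return self.replicas[replica_index].data.get(key)

    def propagate(self, max_updates=None):
        # Deliver deferred updates in order; returns the number applied.
        applied = 0
        while self.pending and (max_updates is None or applied < max_updates):
            index, key, value = self.pending.popleft()
            self.replicas[index].data[key] = value
            applied += 1
        return applied
```

In this sketch, a read directed at the second or third replica immediately after a write may not reflect that write; once `propagate()` has drained the queue, all replicas return the same value. The synchronization cost is thereby moved off the write path, which is the tradeoff the eventual consistency model makes.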
Based on the intended use and associated design considerations, NoSQL systems utilize a variety of different mechanisms to distribute data over a set of compute nodes. These mechanisms lead to partitioning rules such as a minimum level of granularity when partitioning data between computing systems. On the other hand, cloud computing is uniquely suited to rapid and dynamic creation, reconfiguration, and destruction of computing “systems.” Data management architectures with greater flexibility and capable of efficient balancing and scaling can better leverage the ephemeral resources available within the cloud. Accordingly, it is desirable to provide a better-functioning data management system capable of maximizing cloud-computing resources while providing improved query efficiency and data capacity.