One of the key promises of cloud computing is to enable applications to adapt dynamically to growing workloads by elastically increasing the number of servers, an approach often referred to as scaling out. Elastic scale-out (and scale-in) is especially important for web applications, since workloads from web users are often unpredictable and can change dramatically over relatively short periods of time. This elastic scaling approach, however, is known to have a key bottleneck: the database. While the application itself can scale out by using an increasing number of application servers in parallel, the database does not scale as easily.
Relational database management systems (RDBMSs) have long offered data partitioning and replication in an attempt to scale on parallel computing architectures as well. However, their scalability is too limited for highly scalable web applications. Caching has also been applied extensively to web applications. However, its applicability is limited for recent web applications because (1) the dynamic content to be served is highly personalized, resulting in a low cache hit rate, and (2) applications are more write-heavy in order to incorporate users' input into the system (e.g., Web 2.0 applications).
Software architects typically use database sharding techniques to build scalable web applications on RDBMSs. Instead of letting a single RDBMS manage data partitioning and replication on its own, they use multiple independent RDBMSs, each managing its own database node. One key motivation for this approach is that the full, global ACID properties enforced by RDBMSs are often considered overkill for scale-out applications. With this approach, however, the application becomes responsible for managing partitioning and replication, with consistency and load balancing in mind. Furthermore, the application has to manage re-partitioning to adapt to changes in data and workload over time.
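To make the re-partitioning burden concrete, the following minimal sketch assumes a naive application-level routing scheme in which a key is hashed modulo the number of database nodes. The function name `shard_for`, the `user:<id>` key format, and the choice of SHA-1 are illustrative assumptions, not part of any particular system. Under this assumed scheme, changing the number of nodes changes the placement of most keys, and the application itself has to migrate them.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Route a key to a shard index by hashing it modulo the shard count."""
    digest = hashlib.sha1(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


# Hypothetical key set used only for illustration.
keys = [f"user:{i}" for i in range(1000)]

# Placement with four independent database nodes.
before = {k: shard_for(k, 4) for k in keys}

# Adding a fifth node changes the modulus, so the application must
# re-partition: every key whose shard index changed has to be migrated.
after = {k: shard_for(k, 5) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
print(f"{len(moved)} of {len(keys)} keys change shards")
```

With a roughly uniform hash, a key keeps its shard only when its hash value gives the same remainder modulo 4 and modulo 5, so most keys would be expected to move in this sketch.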
Key-value stores, such as Amazon Dynamo, provide an alternative to such RDBMS-based sharding approaches. A key-value store typically employs consistent hashing to partition a key space over distributed nodes, so that adding or removing nodes causes only limited re-partitioning. Unlike database sharding, the key-value store frees engineers from custom work on load balancing, data replication, and fail-over. In practical terms, the major benefit is the low operational cost needed to achieve scalability and availability. This benefit has made it possible for key-value stores to be offered as a service.
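The limited re-partitioning property of consistent hashing can be illustrated with a toy hash ring. This is a sketch under stated assumptions rather than the scheme of any specific store: the class name `HashRing`, the use of MD5 to place points on the ring, and the 64 virtual nodes per physical node are all choices made here for illustration.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    """Map a string to a point on the ring (MD5 is an arbitrary choice here)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Toy consistent-hash ring with virtual nodes (illustrative only)."""

    def __init__(self, nodes=(), vnodes=64):
        self.vnodes = vnodes
        self._points = []  # sorted ring positions
        self._owner = {}   # ring position -> node name
        for node in nodes:
            self.add_node(node)

    def add_node(self, node: str) -> None:
        """Place `vnodes` points for this node on the ring."""
        for i in range(self.vnodes):
            point = _hash(f"{node}#{i}")
            self._owner[point] = node
            bisect.insort(self._points, point)

    def remove_node(self, node: str) -> None:
        """Drop all ring points that belong to this node."""
        self._points = [p for p in self._points if self._owner[p] != node]
        self._owner = {p: n for p, n in self._owner.items() if n != node}

    def lookup(self, key: str) -> str:
        """Return the node owning the first ring point at or after hash(key)."""
        idx = bisect.bisect_left(self._points, _hash(key)) % len(self._points)
        return self._owner[self._points[idx]]


ring = HashRing(["node-a", "node-b", "node-c"])
keys = [f"user:{i}" for i in range(1000)]  # hypothetical keys
before = {k: ring.lookup(k) for k in keys}

ring.add_node("node-d")
after = {k: ring.lookup(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
print(f"{len(moved)} of {len(keys)} keys moved after adding node-d")
```

In this sketch, a key changes owner only if one of the new node's points falls between the key's hash and its previous owner's point, so every moved key lands on the new node and all other placements are untouched.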
However, unless the application handles very simple data items, the key-value store pushes a further burden onto the application logic to access and manipulate the data. Although several systems support more complex structured data, the query and data manipulation interfaces they provide are not as powerful as the relational model offered by traditional RDBMSs. To make matters worse, their access APIs are proprietary, which raises portability concerns (often referred to as the vendor lock-in issue).
One of the underlying problems in these recent approaches is the lack of proper abstraction. The key-value store approach gives up not only ACID properties but also the key benefit of the relational data model: physical data independence. In order to achieve high performance, the application developer needs to carefully design the organization of data on storage, and to write application logic that accesses this data in a way specific to that organization. For example, it is common practice for application developers to aggressively apply denormalization in order to place multiple semantically related entities into a single key-value object. This improves efficiency and enables atomic access to the related data. Such a design decision, even though it is made at the physical-organization level, requires rewriting the application logic to take that specific design into account, thus losing the advantage of physical data independence.
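The coupling between physical layout and application logic can be sketched as follows. The example is hypothetical throughout: a plain dictionary stands in for the key-value store, and the entity names (`user`, `posts`), the `user:<id>` key format, and the helper functions are assumptions made for illustration only.

```python
import json

# Hypothetical in-memory stand-in for a key-value store: one opaque value per key.
store = {}

# Normalized, relational-style view: two separate entity sets.
user = {"user_id": 42, "name": "alice"}
posts = [
    {"post_id": 1, "user_id": 42, "text": "hello"},
    {"post_id": 2, "user_id": 42, "text": "world"},
]


def put_user_with_posts(user: dict, posts: list) -> None:
    """Denormalize: embed the user's posts inside the user's own object."""
    doc = dict(user)
    doc["posts"] = [
        {"post_id": p["post_id"], "text": p["text"]}
        for p in posts
        if p["user_id"] == user["user_id"]
    ]
    # A single write stores the user and all related posts together.
    store[f"user:{user['user_id']}"] = json.dumps(doc)


def add_post(user_id: int, post_id: int, text: str) -> None:
    """Application logic that knows posts live inside the user object."""
    key = f"user:{user_id}"
    doc = json.loads(store[key])
    doc["posts"].append({"post_id": post_id, "text": text})
    store[key] = json.dumps(doc)


def posts_of(user_id: int) -> list:
    """Read path tied to the same layout: one get, then unpack."""
    return [p["text"] for p in json.loads(store[f"user:{user_id}"])["posts"]]


put_user_with_posts(user, posts)
add_post(42, 3, "again")
print(posts_of(42))
```

Both `add_post` and `posts_of` encode the decision to embed posts in the user object. If posts were later moved to their own keys, each of these functions would have to be rewritten, whereas a relational query over the normalized entities would not depend on such a layout choice.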