Data management in modern business applications is one of the most challenging topics in today's software industry. Data not only drives today's business but also provides the foundation for developing novel business ideas and business cases. Data management in all its different flavors has become a core asset for every organization, and it has gained significant attention at senior management level as the core tool to drive and develop the current business. On the system side, data management scenarios have become extremely complex and difficult to manage. An efficient, flexible, robust, and cost-effective data management layer is the core of a number of different application scenarios essential in today's business environments.
Initially, classical enterprise resource planning (ERP) systems were implemented as the information processing backbone that handles such application scenarios. From the database system perspective, the online transactional processing (OLTP) workload of ERP systems typically requires handling thousands of concurrent users and transactions with a high update load and very selective point queries. On the other hand, data warehouse systems, usually considered the counterpart to OLTP, either run aggregation queries over a huge volume of data or compute statistical models for the analysis of artifacts stored in the database. In addition, applications such as real-time analysis to identify anomalies in data streams, or ETL and information integration tasks, add to the huge variety of different and in some cases extremely challenging requirements for a data management layer in the context of modern business applications.
Some have postulated that traditional database management systems are no longer able to represent the holistic answer with respect to the variety of different requirements, and that specialized systems will emerge for specific problems. Large data management solutions are now usually viewed as a zoo of different systems with different capabilities for different application scenarios. For example, classic row-stores still dominate the OLTP domain. Maintaining a 1:1 relationship between the logical entity and its physical representation in a record seems obvious for entity-based interaction models. Column-organized data structures have gained more and more attention in the analytical domain because they avoid reading columns that a query does not reference and achieve significantly better data compression rates. Key-value stores are making inroads into commercial data management solutions, not only to cope with "big data" volumes but also to provide a platform for procedural code to be executed in parallel. In addition, distributed file systems that provide a cheap storage mechanism and a flexible degree of parallelism for cloud-like elasticity have made key-value stores a first-class citizen in the data management arena. The plethora of systems is completed by triple stores, which cope with schema-flexible data and graph-based organization. Since the schema comes with the data, such a system provides efficient means to exploit explicitly modeled relationships between entities, run analytical graph algorithms, and serve as a repository for weakly typed entities in general.
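The contrast between row-organized and column-organized storage described above can be made concrete with a minimal Python sketch. It is an illustration only, not the layout of any particular system: the toy table, the names `ROWS`, `COLUMN_NAMES`, `point_lookup`, `total_amount`, and `dictionary_encode` are all hypothetical, and dictionary encoding is assumed here merely as one common example of column compression.

```python
# Hypothetical toy table with the columns (order_id, customer, amount).
ROWS = [
    (1, "alice", 120.0),
    (2, "bob", 75.5),
    (3, "alice", 30.0),
    (4, "carol", 75.5),
]
COLUMN_NAMES = ("order_id", "customer", "amount")


def to_columns(rows, names):
    """Pivot row-organized records into one list per column."""
    return {name: [row[i] for row in rows] for i, name in enumerate(names)}


def point_lookup(rows, order_id):
    """Row layout: a selective point query returns the whole entity,
    which is stored contiguously as a single record."""
    for row in rows:
        if row[0] == order_id:
            return row
    return None


def total_amount(columns):
    """Column layout: the aggregate touches only the 'amount' column
    and never reads 'order_id' or 'customer'."""
    return sum(columns["amount"])


def dictionary_encode(values):
    """Assumed compression scheme: keep each distinct value once and
    replace the column by small integer codes into that dictionary."""
    dictionary = sorted(set(values))
    code_of = {value: code for code, value in enumerate(dictionary)}
    return dictionary, [code_of[value] for value in values]


COLUMNS = to_columns(ROWS, COLUMN_NAMES)
```

In this sketch the point lookup favors the row layout because one record holds the complete entity, while the aggregation favors the column layout because it scans a single homogeneous list, which is also the kind of data that compresses well.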
Although specialized systems may be considered a smart move as a first, performance-focused step, the plethora of systems creates tremendous complexity when linking different systems, running data replication and propagation jobs, or orchestrating query scenarios over multiple systems. Additionally, setting up and maintaining such an environment is not only complex and error-prone but also comes with a significantly higher total cost of ownership (TCO). From a high-level perspective, the following observations about the motivations underlying the current situation can be made:
Usage perspective: SQL is no longer considered the only appropriate interaction model for modern business applications. Users are either completely shielded by an application layer or would like to interact directly with their database. In the first case, there is a need to optimally support an application layer with a tight coupling mechanism. In the second case, there is a need for scripting languages with built-in database features for specific application domains. There is also a need for comprehensive support of domain-specific and proprietary query languages, as well as a huge demand for mechanisms that enable the user to directly address parallelism from a programming perspective.
Cost awareness: There is a clear demand to provide a lower TCO solution for the complete data management stack ranging from hardware to setup costs to operational and maintenance costs by offering a consolidated solution for different types of workloads and usage patterns.
Performance: Performance is continually identified as the main reason to use specialized systems. The challenge is to provide a flexible solution with the ability to use specialized operators or data structures whenever possible and needed.
Different workload characteristics, ranging from high-volume transaction processing via support of read-mostly analytical DWH workloads to high-update scenarios of the stream processing domain, do not fully justify going for the zoo of specialized systems. Our past experience of handling business applications leads us to support the hypothesis that what is needed instead is specialized collections of operators.
In addition to pure data processing performance, the lack of an appropriate coupling mechanism between the application layer and the data management layer has been identified as one of the main deficits of state-of-the-art systems. Further, individual systems with separate life cycles and administration set-ups are more difficult to set up and manage, while a single closed system is usually too limiting. What is needed is a flexible data management platform with common service primitives on the one hand and individual query execution runtime environments on the other hand.