The invention relates to a computer-implemented method for operating a database, in particular for updating a binary status of a tuple in the database.
Unless otherwise indicated herein, the approaches described in this section are not prior art to the claims in this application and are not admitted to be prior art by inclusion in this section.
Relational database systems have been the backbone of business applications for more than 20 years. They promised to provide companies with a management information system covering the core applications, including financials, sales, order fulfillment, manufacturing, and human resources, which run from planning, through business processes, to individually defined analytics. However, this goal has not been achieved. The more complex business requirements became, the more the focus was put on the so-called transactional processing part and designed the database structures accordingly. These systems are called OLTP systems (Online Transactional Processing). Analytical and financial planning applications were increasingly moved out to separate systems for more flexibility and better performance. These systems are called OLAP systems (Online Analytical Processing). In reality, parts of the planning process were even moved off to specialized applications mainly around spreadsheets.
Both systems, OLTP and OLAP, are based on the relational theory but with different technical approaches [W. H. Inmon. Building the Data Warehouse, 3rd Edition. John Wiley & Sons, Inc., New York, N.Y., USA, 2002]. For OLTP systems, tuples are arranged in rows which are stored in blocks. The blocks reside on disk and are cached in main memory in the database server. Sophisticated indexing allows fast access to single tuples, however accesses get increasingly slower as the number of requested tuples increases. For OLAP systems on the other hand, data are often organized in star schemas, where a popular optimization is to compress attributes (columns) with the help of dictionaries. After the conversion of attributes into integers, processing becomes faster. More recently, the use of column store databases for analytics has become quite popular. Dictionary compression on the database level and reading only the columns necessary to process a query speed up query processing significantly in the column store case.
The introduction of so-called data warehouses must be considered to be a compromise. The flexibility and speed gained had to be paid for with additional management of the extraction and loading of data, and controlling of the redundancy. For many years, the discussion seemed to be closed and enterprise data was split into OLTP and OLAP [C. D. French. “One Size Fits All” Database Architectures Do Not Work for DDS. In Proceedings of the 1995 ACM SIGMOD International Conference on Management of Data, San Jose, Calif., May 22-25, 1995 [1], pages 449-450]. The OLTP is the necessary prerequisite for the OLAP. However only with the OLAP can companies understand their business and come to conclusions about how to steer and change course. When planned data and actual data are matched, business becomes transparent and decisions can be made. While centralized warehouses also handle the integration of data from many sources, it is still desirable to have OLTP and OLAP capabilities in one system which could make both components more valuable to their users.
The last 20 years, Moore's law enabled the enterprise system to grow both in functionality and volume [G. E. Moore. Cramming More Components Onto Integrated Circuits. Electronics, 38(8), 1965]. When the processor speed hit the 3 GHz level (2003) and further progress seemed to be distant, two developments helped out: unprecedented growth of main memory and massive parallelism through blade computing and multi-core CPUs [G. Koch. Discovering Multi-Core: Extending the Benefits of Moore's Law. Technology@Intel, (7), 2005]. While main memory was always welcome for e.g. caching and a large number of CPUs could be used for application servers, the databases for OLTP where not ideally suited for massive parallelism but stayed on SMP (symmetric multi processing) servers. The reasons were temporary locking of data storage segments for updates and the potential of deadlocks while updating multiple tables in parallel transactions. This is the main reason why for example R/3 from SAP ran all update transactions in a single thread and relied heavily on row level locking and super fast communication between parallel database processes (SMP). Some of the shortcomings could be overcome later by a better application design, but the separation of OLTP and OLAP remained unchallenged.
Early tests with in-memory databases of the relational type based on row storage did not show significant advantages over leading RDBMSs with equivalent memory for caching. Here the alternative idea to use column store databases for OLTP was born. Column storage was successfully used for many years in OLAP and really surged when main memory became abundant [M. Stonebraker, D. J. Abadi, A. Batkin, X. Chen, M. Cherniack, M. Ferreira, E. Lau, A. Lin, S. Madden, E. J. O'Neil, P. E. O'Neil, A. Rasin, N. Tran, and S. B. Zdonik. C-Store: A Column-oriented DBMS. In Proceedings of the 31st International Conference on Very Large Data Bases, Trondheim, Norway, Aug. 30-Sep. 2, 2005, pages 553-564. ACM, 2005] [P. Boncz. Monet: A Next-Generation DBMS Kernel for Query-Intensive Applications. 2002. PhD Thesis, Universiteit van Amsterdam, Amsterdam, The Netherlands].
Column Storage is Best Suited for Modern CPUs
Modern CPUs with multi-core architecture provide an enormous amount of computing power. Blades with 8 CPUs and 16 cores per CPU will populate next-generation blade servers. That gives us 128 computing units with up to approximately 500 GB of main memory. To optimize the use of these computing devices we have to understand memory hierarchies, cache sizes and how to enable parallel processing within one program [P. A. Boncz, S. Manegold, and M. L. Kersten. Database Architecture Optimized for the New Bottleneck: Memory Access. In VLDB'99, Proceedings of 25th International Conference on Very Large Data Bases, Sep. 7-10, 1999, Edinburgh, Scotland, UK, pages 54-65. Morgan Kaufmann, 1999]. Considering the memory situation first, enterprise applications are to a large extent memory bound, that means the program execution time is proportional to the amount of memory accessed for read/write or being moved.
As an example, a full table scan of SAP's accounting document line items table is considered, which has 160 attributes, in order to calculate a total value over all tuples. The inventors used the accounting data of a German brewery of 5 years, the number of tuples in this table was 34 million. In the underlying row database, 1 million tuples of this particular table consume about 1 GB of space. The size of the table was thus 35 GB. The equivalent column store table size was only 8 GB because of the more efficient vertical compression along columns. Considering that in real world applications only 10% of the attributes of a single table are typically used in one SQL-statement (see FIG. 1), that means for the column store at most 800 MB of data have to be accessed to calculate the total values [S. Aulbach, T. Grust, D. Jacobs, A. Kemper, and J. Rittinger. Multi-Tenant Databases for Software as s Service: Schema-Mapping Techniques. In Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2008, Vancouver, BC, Canada, Jun. 10-12, 2008, pages 1195-1206. ACM, 2008]. FIG. 2 shows (schematically) that the row storage with horizontal compression cannot compete, if processing is set-oriented and requires column operations. Even with the appropriate index the amount of data accessed is orders of magnitude higher.
According to the inventors' analyses of real systems with customer data, most applications in enterprise computing are actually based on set processing and not single tuple access. Thus, the benefit of having data arranged in a column store is substantial. In addition to this, most of the calculations can be executed on row level using the compressed, integer format. A performance gain of a factor 100-1000 in comparison to the same calculation executed on non-compressed data formats at the application level can be achieved. The application layer has to work with minimal projections in local SQL statements and avoid using more generic SQL statements in subroutines to support the reduction in memory access.
On top of these benefits comes the introduction of parallel processing. According to Hennessy in [J. L. Hennessy and D. A. Patterson. Computer Architecture—A Quantitative Approach. Morgan Kaufmann, fourth edition, 2007], the difficulty of creating parallel processing programs is to break up a program into equal-sized pieces, which then can be processed in parallel without much synchronization. The select operation through one or more columns is exactly what we are looking for. This operation can indeed be split easily into equal parts and distributed to multiple cores. The standard operations of OLAP engines and any other formal application logic e.g. calculation of due dates, currency conversion, working days for a given date interval etc. can be handled by stored procedures operating on the integer values of the compressed columns.
All calculations on the tuple level will automatically be parallelized, since they are completely independent of each other. The first level of an aggregation will be executed synchronously for each qualified tuple. The synchronization between the core processes is minimal. Further aggregation along given hierarchies take place as a second step on the accumulated data. The same applies to sorting by attributes or sequencing by time.
Even if only a few tuples qualify through the selected statement, the introduction of indices is not necessary because the scanning speed is so enormous, especially if parallel processing on multiple cores is active. On current CPUs, we can expect to process 1 MB per ms and with parallel processing on 16 cores>10 MB per ms. To put this into context, to look for a single dimension compressed in 4 bytes, we can scan 2.5 million tuples for qualification in 1 ms. With this speed in mind, it is not even necessary to provide a primary key index for most of the tables but the full column scan can be used instead. Column storage is so well suited for modern CPUs that the full scope of the relational algebra can be used without shortcomings in performance. It is important to note that every attribute now represents a potential index. There are no restrictions any more for the applications to focus on certain navigation paths. The delegation of most of the calculations to the database layer cleans up the application layer and leads to a better separation of concerns. This will result in a higher quality of programs and allow a better lifecycle with ongoing development. The hard disk is used only for transaction logging and snapshots for fast recovery. In fact, disk has become yesterday's tape [J. Gray. Tape is Dead. Disk is Tape. Flash is Disk, RAM Locality is King. Storage Guru Gong Show, Redmon, Wash., 2006].