A data warehouse (DW) may store an organization's data and may provide reporting and analysis facilities. Businesses and governments alike may rely heavily on DW technology in order to glean information from vast amounts of data and to make strategic decisions. The importance of data analytics in this context is substantial, and some state that the recent (2008-2009) financial meltdown may be due in part to insufficient visibility into the true value of mortgage-backed securities, i.e., a problem of poor data analysis.
The DW may be implemented using specialized database systems, relational databases, object-relational databases and/or other types of databases. When an execution plan for each query in the data warehouse is generated independently, and when the queries are executed independently, a contention of resources in a computer system may result. This contention of resources in the data warehouse may cause a database and the computer system to perform inefficiently. The inefficient performance of the computer system may cause unacceptable delays in receiving a response from the database, thereby restricting an ability of users to run multiple queries concurrently.
Today's real-world applications may require support for many concurrent users. Moreover, it may be desired to limit the query latency increase caused by going from one query to several concurrent ones. A DW client may specifically state that increasing concurrency from 1 query to 40 should not increase the latency of any given query by more than 6 times. Data warehousing environments of large organizations may routinely support hundreds of concurrent queries.
A general-purpose DW system may have a limited capability to meet these goals. Adding a new query today may have unpredictable effects and/or predictably negative ones. For instance, when going from 1 to 256 concurrent queries, the query response time in a widely used commercial DBMS may increase by an order of magnitude and PostgreSQL's may increase by two orders of magnitude. When queries take hours or days to complete, the system may be less able to provide real-time analysis and, depending on the isolation level, queries may need to operate on hours-old and/or days-old data. This situation may lead to "workload fear": users of the DW may be prohibited from submitting ad-hoc queries and may execute only sanctioned reports. Hence, in order to achieve better scalability, organizations may break their data warehouse into smaller data marts, perform aggressive summarization, and batch query tasks.
These measures, however, may delay the availability of answers, severely restrict the types of queries that may be run (and consequently the richness of the information that may be extracted), and contribute to increased maintenance costs. In effect, the available data and computation resources may end up being used inefficiently, which may prevent the organization from taking full advantage of its investment. Workload fear may act as a barrier to deploying novel applications that use the data in imaginative ways.
This phenomenon may not be due to faulty designs, but rather to the fact that some existing database systems may have been designed for a particular case. Workloads and data volumes, as well as hardware architectures, may differ from that particular case. Conventional database systems may employ the query-at-a-time model, where each query may be mapped to a distinct physical plan. This model may introduce contention when several queries execute concurrently, as the physical plans compete in a mutually-unaware fashion for access to the underlying I/O and computational resources. As a result, concurrent queries may result in random I/O; and when the DW holds 1 petabyte, even a query that touches only 0.01% of the highly-indexed database may still retrieve on the order of 10^9 tuples, thus potentially performing a crippling number of random I/O operations. Performing more random I/O operations may result in a slower database system.
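The arithmetic above can be illustrated with a brief back-of-the-envelope sketch. The ~100-byte average tuple size and ~5 ms cost per random I/O operation are illustrative assumptions introduced here, not figures from this description:

```python
# Back-of-the-envelope estimate of the random-I/O cost described above.
# Assumed values (not from the description): tuples average ~100 bytes,
# and one random I/O (seek + small read) costs ~5 ms.

DW_SIZE_BYTES = 10**15   # 1 petabyte
SELECTIVITY = 0.0001     # the query touches 0.01% of the database
TUPLE_BYTES = 100        # assumed average tuple size
SEEK_SECONDS = 0.005     # assumed cost of one random I/O operation

bytes_touched = DW_SIZE_BYTES * SELECTIVITY
tuples_touched = bytes_touched / TUPLE_BYTES          # on the order of 10^9
worst_case_seconds = tuples_touched * SEEK_SECONDS    # if every fetch is a random I/O

print(f"tuples touched: {tuples_touched:.0e}")
print(f"worst-case random-I/O time: {worst_case_seconds / 86400:.0f} days")
```

Under these assumptions, a query touching only 0.01% of a 1-petabyte DW still fetches roughly 10^9 tuples, and if each fetch were a separate random I/O the disk time alone would run to weeks, which is why mutually-unaware physical plans generating random I/O can be so crippling.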