The subject matter described herein generally relates to providing advanced analytics for business intelligence applications that work with massive volumes of data, such as data on the scale of hundreds of terabytes to petabytes.
Existing data warehouses and the solutions built up around them are increasingly unable to provide reasonable response times to data management requests due to the expanding volume of data the warehouses maintain. This is especially true in certain industries, such as telecommunications, where millions or even billions of new data records may be added to the data handling systems each day.
To improve response times, earlier architectures partitioned the rows of a table across multiple machines with separate disks, enabling parallel I/O scans of large tables. Basic relational query operators such as selection, join, grouping, and aggregation were reimplemented to run in parallel via similar partitioning schemes: each node in the cluster performs the same operations, but the data flowing through the cluster is automatically partitioned so that each node can work on its share of the operator independently. Finally, these architectures allowed multiple relational operators to execute at the same time, enabling pipeline parallelism in which an operator producing a data stream runs in parallel with the operator consuming it. However, data consumers used these architectures primarily for reporting and billing. Business intelligence was derived only by, for example, allowing analysts to issue ad hoc cubing queries to browse along dimensions of interest. As such, advanced analytics applications did not utilize the data warehouse even though many potential use cases for such analytics existed.
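By way of illustration only, the partitioned execution of a grouping and aggregation operator described above might be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation of any particular data warehouse: the function names (`partition_rows`, `local_group_sum`, `parallel_group_sum`), the row layout, and the use of threads in a single process as stand-ins for cluster nodes are all hypothetical. Pipeline parallelism between operators is not shown.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def partition_rows(rows, key, num_nodes):
    """Hash-partition rows so that all rows sharing a key land on the same node."""
    partitions = [[] for _ in range(num_nodes)]
    for row in rows:
        partitions[hash(row[key]) % num_nodes].append(row)
    return partitions


def local_group_sum(partition, key, value):
    """The operator every node runs: identical code, different slice of the data."""
    totals = defaultdict(int)
    for row in partition:
        totals[row[key]] += row[value]
    return dict(totals)


def parallel_group_sum(rows, key, value, num_nodes=4):
    """Group rows by `key` and sum `value`, one worker per partition."""
    partitions = partition_rows(rows, key, num_nodes)
    # Threads stand in for separate machines (an assumption of this sketch).
    with ThreadPoolExecutor(max_workers=num_nodes) as pool:
        partials = list(
            pool.map(lambda part: local_group_sum(part, key, value), partitions)
        )
    # Hash partitioning on the grouping key makes the partial results
    # disjoint, so merging them is a plain union.
    merged = {}
    for partial in partials:
        merged.update(partial)
    return merged
```

Because the rows are partitioned on the grouping key itself, no node needs to exchange data with another while the operator runs; the only coordination is the final union of the per-node results. For example, with hypothetical call records of the form `{"region": "east", "bytes": 10}`, calling `parallel_group_sum(rows, "region", "bytes")` returns one total per region.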