Interactive analytics refers to the real-time manipulation of data to answer ad hoc queries concerning that data. Because the analysis is interactive, the queries require low-latency (e.g., sub-second) response times. For example, a wireless telecommunication provider may offer voice and data services via a cellular network; the provider may wish to analyze, in real time, the effects of new services, operations, or products on the network. If a new type of mobile handset becomes available, for instance, the provider may want to know (i) how much data traffic in a particular area was generated by the devices in the past hour (or other amount of time) or (ii) how much data traffic in the area was generated by a particular application running on the devices in the past hour. The answers to these queries must be available very quickly to be useful.
To be most useful, the allowable queries need to be highly flexible. For example, the size of the queried areas may vary from towns to cities, states, and regions; the devices queried may vary from particular makes or models to product lines; and the data transmitted to or from the devices may vary from all data to only data sent by a particular application or application type.
In many situations the data is not in a convenient format to answer these queries quickly or efficiently. The raw data may be, for example, low-level, high-volume, and streaming in nature. Low-level data, as the term is used herein, means that the data is not organized into high-level categories such as “the amount of data traffic in the Washington, D.C. area in the past hour”, but rather is made available in highly disaggregated form. The disaggregated form could be, for example, a collection of data records each having a record number and a flat list of items of data (e.g., user ID, device ID, application ID, data bytes transferred, and/or region ID). Each record may correspond to, for example, a single transfer of some number of bytes of data traffic to a particular application (e.g., a web browser) running on a particular device (as identified by its device ID) operated by a particular user at a specific moment in time. Note that the device type may not be given directly, but instead may be derived from the device ID. Furthermore, the records may be physically distributed (i.e., records for different regions may be generated in different locations).
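As a rough illustration of such a disaggregated record, the following Python sketch models a single transfer record and derives the device type from the device ID. The field names, the prefix-based lookup table, and the sample values are hypothetical; nothing above specifies how a device type is actually encoded in a device ID.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TransferRecord:
    """One low-level record: a single transfer of bytes for one application."""
    record_number: int
    user_id: str
    device_id: str
    application_id: str
    bytes_transferred: int
    region_id: str


# Hypothetical mapping from a device-ID prefix to a device type.
DEVICE_TYPE_BY_PREFIX = {
    "A1": "IPHONE 4",
    "B2": "OTHER HANDSET",
}


def device_type(record: TransferRecord) -> str:
    """Derive the device type, which is not stored in the record itself."""
    return DEVICE_TYPE_BY_PREFIX.get(record.device_id[:2], "UNKNOWN")


record = TransferRecord(
    record_number=1,
    user_id="user-17",
    device_id="A1-000123",
    application_id="web-browser",
    bytes_transferred=2048,
    region_id="washington-dc",
)
print(device_type(record))
```

Note that the record carries no high-level category such as "traffic in the Washington, D.C. area in the past hour"; any such figure has to be computed by aggregating many records of this kind.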
The data is also high-volume: a large network operator may have hundreds of millions of customers, for example, many of whom are using their devices at any given time, so that the aggregate number of records being generated is very large. A few billion records may be generated per hour, for example. Finally, the data is streaming in nature. That is, it is generated more-or-less continuously in time.
One challenge in managing and analyzing such data is to give an analyst (or other submitter of a query) sub-second answers to a wide range of flexible queries when those answers must be computed from low-level, high-volume, streaming records. The manipulation of data needed to answer these queries may be performed using a data model called online analytical processing (“OLAP”) cubes. Conceptually, data in an OLAP cube is organized into a set of independent dimensions; the cube may have any number of dimensions. In the foregoing example, the dimensions could be device-type, region, and application, because the example queries concern devices, regions, and applications. The items stored in the cube are referred to as measures. In the present case, the measure is the number of bytes transferred in the past hour (“data traffic”). Finally, specific instances of a dimension are referred to as labels. In the example, the labels might include “IPHONE 4” (for the device-type dimension) and “Washington, D.C.” (for the region dimension).
FIG. 1 shows an illustration of this data layout in an OLAP cube 100. The cube has a first dimension “device-type,” a second dimension “application,” and a third dimension “region.” To answer a query, the cube returns a sum of the measures in the cells corresponding to the intersection of the appropriate layers of the cube. For example, to determine the number of bytes transferred in the past hour by web browsers running on IPHONE 4s in the Washington, D.C. area, the system retrieves a single cell 102 that lies at the intersection of the “Washington, D.C.” slice 104 of the region dimension, the “IPHONE 4” slice 106 of the device-type dimension, and the “web browser” slice 108 of the application dimension. Such operations may be done with sub-second response time. In practice, however, such data is not usually stored in a cube organization, but rather in a database that is suited for storing and performing queries on data with this logical structure. Although OLAP cubes may satisfy the need for sub-second response time for a flexible set of ad hoc queries, the data typically handled by such systems is not low-level, high-volume, streaming data.
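The cube model described above can be sketched in a few lines of Python. This is a minimal illustration only, assuming a cube held in an in-memory dictionary keyed by (device-type, application, region) labels; the function names and the sample byte counts are hypothetical, and the linear scan used here would not by itself deliver sub-second answers on large data.

```python
from collections import defaultdict


def build_cube(records):
    """Aggregate (device_type, application, region, nbytes) rows into cells."""
    cube = defaultdict(int)
    for device_type, application, region, nbytes in records:
        cube[(device_type, application, region)] += nbytes
    return cube


def query(cube, device_type=None, application=None, region=None):
    """Sum the measures of all cells matching the given labels.

    A dimension left as None is unconstrained, so the query sums over
    every label of that dimension.
    """
    wanted = (device_type, application, region)
    return sum(
        measure
        for labels, measure in cube.items()
        if all(w is None or w == label for w, label in zip(wanted, labels))
    )


records = [
    ("IPHONE 4", "web browser", "Washington, D.C.", 500),
    ("IPHONE 4", "web browser", "Washington, D.C.", 300),
    ("IPHONE 4", "email", "Washington, D.C.", 200),
    ("IPHONE 4", "web browser", "Boston", 100),
]
cube = build_cube(records)

# One cell: the intersection of three slices.
single_cell = query(cube, "IPHONE 4", "web browser", "Washington, D.C.")
# A whole slice intersection: all applications on IPHONE 4s in the region.
all_apps = query(cube, device_type="IPHONE 4", region="Washington, D.C.")
print(single_cell, all_apps)
```

Fixing all three labels selects a single cell, as in the cell 102 example; leaving a dimension unconstrained sums across that dimension, which is how the more general queries are answered.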
Another technology relevant to interactive data analytics on large streaming data sets is cluster computing. A compute cluster is a set of general-purpose computing systems (i.e., nodes), each having directly attached local storage (e.g., a hard disk), connected by a local area network (e.g., a single, high-speed switch). Using such clusters, it is possible to process large amounts of data in relatively short time intervals. The basic principle of cluster computing is divide and conquer. To process a large data file, the file is divided up and a portion is placed on the local storage of each compute node. Each node then processes its local data, optionally communicating with one or more of the other compute nodes. Although cluster computing has the potential to process large amounts of data, it is not well suited to processing streaming data. Cluster computing assumes that data is resident on local storage and does not provide any particular advantage when the data is instead made available incrementally over time (i.e., streamed).
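The divide-and-conquer principle can be illustrated with a small single-machine simulation. In this sketch, worker threads stand in for compute nodes and list slices stand in for each node's local storage; the function names, the number of nodes, and the sample records are assumptions made for illustration, not a description of any particular cluster framework.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def partition(records, num_nodes):
    """Divide the full data set into one portion per (simulated) node."""
    return [records[i::num_nodes] for i in range(num_nodes)]


def process_local(local_records):
    """Work done by one node on its own portion: bytes per region."""
    totals = Counter()
    for region, nbytes in local_records:
        totals[region] += nbytes
    return totals


def cluster_total(records, num_nodes=3):
    """Partition, process each portion independently, then combine."""
    portions = partition(records, num_nodes)
    with ThreadPoolExecutor(max_workers=num_nodes) as pool:
        partials = list(pool.map(process_local, portions))
    combined = Counter()
    for partial in partials:
        combined.update(partial)  # Counter.update adds counts together
    return combined


records = [("dc", 100), ("boston", 50), ("dc", 25), ("boston", 5), ("dc", 1)]
totals = cluster_total(records)
print(dict(totals))
```

The sketch also makes the limitation visible: `partition` needs the complete list of records before any node can start, which is exactly the assumption that does not hold when records arrive incrementally.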
Thus, a need exists for a way to query, in real time, large, constantly changing, low-level data sets.