Data warehousing technology collects and processes data from a group of data sources. For instance, a business-related organization may use data warehousing technology to harvest transactional information from a group of systems throughout the organization. Administrators of the organization may use the collected data to generate various reports. In one use, the reports provide insight into the operation of the organization.
In a typical data warehousing operation, raw data from a group of data sources is forwarded to a centralized collection point using client-server interaction. A series of processing stations then process the collected data. In other words, the traditional approach funnels a large amount of collected data into a processing pipeline; within the pipeline, the stations operate on the data one after another (for example, by performing various types of transformation operations). The data warehousing technology also aggregates the collected data to yield aggregated results. These operations may result in the storage of a large amount of data. Only after the above-described operations have been performed may a user generate reports based on the aggregated results or the raw collected data (or both).
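The traditional flow described above can be sketched in a few lines of Python. This is only an illustrative sketch, not an implementation taken from any particular warehousing product: the function names (collect, transform, aggregate, run_pipeline), the record fields (host, cpu), and the sample values are all hypothetical. It shows raw records being funneled to one collection point, passed through each processing station in turn, and aggregated, with the report available only after every stage has finished.

```python
from collections import defaultdict
from statistics import mean


def collect(sources):
    """Funnel raw records from every data source into one central list."""
    collected = []
    for source in sources:
        collected.extend(source)
    return collected


def transform(records):
    """A hypothetical processing station: normalize host names and types."""
    return [{"host": r["host"].lower(), "cpu": float(r["cpu"])} for r in records]


def aggregate(records):
    """Aggregate the processed records: mean CPU reading per host."""
    by_host = defaultdict(list)
    for r in records:
        by_host[r["host"]].append(r["cpu"])
    return {host: mean(values) for host, values in by_host.items()}


def run_pipeline(sources, stations):
    """Run collection, then each station in succession, then aggregation."""
    data = collect(sources)
    for station in stations:
        data = station(data)
    return aggregate(data)


# Hypothetical sample data from two data sources.
sources = [
    [{"host": "WEB1", "cpu": "10"}, {"host": "WEB1", "cpu": "30"}],
    [{"host": "db1", "cpu": "50"}],
]

# The report exists only once the whole pipeline has completed.
report = run_pipeline(sources, [transform])
print(report)
```

Because every stage consumes the full output of the stage before it, the time to the first report grows with the total volume of collected data.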
The above approach continues to work well in many environments. However, some environments include data sources that generate an enormous amount of data. One example of such an environment is a network-accessible service that includes a group of computer servers, network devices, etc. that generate performance data, log data, and other information. For instance, each such device may make several performance measurements each second. Further, a large-scale operation may include many thousands of devices. As can be appreciated, this type of operation generates a significant amount of data.
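To give a rough sense of scale, the following back-of-the-envelope calculation multiplies out the quantities mentioned above. The specific figures (10,000 devices, 5 measurements per device per second) are assumptions chosen purely for illustration and are not taken from any measured deployment; the function name is likewise hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def measurements_per_day(devices, per_device_per_second):
    """Total measurements produced per day across all devices."""
    return devices * per_device_per_second * SECONDS_PER_DAY


# Assumed, illustrative figures only.
total = measurements_per_day(devices=10_000, per_device_per_second=5)
print(f"{total:,} measurements per day")
```

Under these assumed figures the operation produces on the order of billions of measurements per day, before any log data is counted.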
In these environments, the traditional approach may experience bottlenecks in transporting, processing, and storing the large amount of data. As a result, processing the collected data is a time-consuming task. As a further consequence, an administrator may need to wait a significant amount of time, potentially more than an hour, before he or she can investigate the state of the computer servers that are being monitored. This renders any reports available through this technology untimely, and thus of questionable use.
The above-described approach may have one or more additional drawbacks. The approach may be relatively complex and expensive. The approach may also be relatively inflexible and not readily scalable. The approach may also provide only a small subset of the data provided by the data source (due to its difficulty in processing such a large amount of data). The approach may also provide poor (e.g., unpredictable) performance. Further, the approach may provide insufficient tools for allowing a user to investigate the data processing equipment being monitored. The traditional approach may have yet additional drawbacks.