When massive amounts of data are available, the cost of building a statistical model to characterize the data can be significant, if not prohibitive. Model accuracy and model-building cost are competing concerns. That is, while a larger data set may yield a more accurate model than a smaller subset of the data, analysis tends to become increasingly inefficient and expensive as the data set grows. Because of the computational complexity associated with analyzing large data sets, a common practice is to build a model on the basis of a sample of the data. The resulting model can then be used, for example, as a guide for formulating predictions.
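As a minimal illustration of modeling from a sample rather than from the full data set, the following sketch estimates the mean of a large collection from a simple random sample and reports a rough standard error. The function name `estimate_mean_from_sample`, the synthetic `population`, and the sample size are assumptions chosen for illustration only; they do not come from any particular system.

```python
import random
import statistics


def estimate_mean_from_sample(data, sample_size, seed=0):
    """Estimate the mean of `data` from a simple random sample.

    Hypothetical helper for illustration. The sample is drawn without
    replacement; the standard error ignores the finite-population
    correction, so it is only a rough guide.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    sample = rng.sample(data, sample_size)
    mean = statistics.fmean(sample)
    std_err = statistics.stdev(sample) / sample_size ** 0.5
    return mean, std_err


# Synthetic stand-in for a large data set; its true mean is 49.5.
population = [i % 100 for i in range(100_000)]

mean, std_err = estimate_mean_from_sample(population, 1_000)
print(f"sample mean = {mean:.2f} +/- {std_err:.2f}")
```

Here only 1 percent of the records are examined, which is the trade-off described above: a cheaper analysis at the price of some uncertainty in the estimate.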
With the spread of the Internet comes a substantially increased flow of information. The Internet allows easy dissemination of all types of information with minimal barriers to entry (typically, one needs only access to a computer to create and/or send data). Thus, the number of large information databases has drastically increased as the Internet has grown. It is also frequently difficult to ascertain the size of a particular database. One has only to enter the search term “dog” on a search engine and look at the total number of returned entries to appreciate the vastness of the stored knowledge that the Internet provides almost instantaneously. That single search can return millions of “hits,” each linking to additional megabytes of information, a truly overwhelming amount.
However, despite the enormity of the information, actual systems of hardware and software must read, interpret, store, and receive/transmit this information in order for it to be available for a user to find. These systems must be built and scaled to handle such data volumes efficiently. It is often necessary, for example, to use multiple computers to handle a single task because of the size of the task. For instance, an Internet web site can have multiple servers providing access to the site so that there is enough bandwidth and users do not wait for extended periods of time to download web pages.
While providing increased resources helps compensate for tremendous data throughput, it also greatly increases the complexity of determining the resources necessary to meet those demands. For example, if users complain about slow response times for a web site, a second server can be incorporated to speed up the web site response time. This seems simple until the example expands to include a popular web site supported by 100 servers with billions of hits per month. The amount of statistical data for the web site throughput is now too large to peruse in its entirety each month, and it is also distributed across 100 different servers. Determining statistics such as peak loading, average loading, types of users, duration of use, and/or access path becomes a daunting task. However, without this information, it would be impossible to determine whether 10 more servers would be sufficient to ease traffic concerns or whether 50 more servers are required. Thus, this type of data is extremely valuable for web sites and for other situations where extreme amounts of data must be processed quickly and efficiently to glean important information that would otherwise be unobtainable with current technologies.
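To make the distributed-statistics problem concrete, the sketch below shows one simple way that per-server load summaries could be merged into site-wide figures without moving the raw data. The `LoadSummary` class, the `summarize` and `combine` helpers, and the sample readings are all hypothetical, and load readings are assumed to be non-negative.

```python
from dataclasses import dataclass


@dataclass
class LoadSummary:
    """Compact, mergeable summary of load readings (hypothetical type)."""
    samples: int
    total: float
    peak: float

    def merge(self, other):
        # Counts and totals add; the peak is the larger of the two peaks.
        return LoadSummary(
            self.samples + other.samples,
            self.total + other.total,
            max(self.peak, other.peak),
        )

    @property
    def average(self):
        return self.total / self.samples if self.samples else 0.0


def summarize(loads):
    """Reduce one server's raw readings to a LoadSummary."""
    loads = list(loads)
    return LoadSummary(len(loads), float(sum(loads)), max(loads, default=0.0))


def combine(summaries):
    """Merge per-server summaries into a single overall summary."""
    result = LoadSummary(0, 0.0, 0.0)  # identity element for non-negative loads
    for summary in summaries:
        result = result.merge(summary)
    return result


# Made-up readings from three servers, for illustration only.
per_server = [[10, 20, 30], [40, 50], [5]]
overall = combine(summarize(readings) for readings in per_server)
print(overall.samples, overall.peak, round(overall.average, 2))
```

Note that the merged peak is the highest reading seen on any single server; a site-wide peak of combined load would require time-aligned readings, which this sketch does not attempt.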