Generally, as companies and government agencies collect more and more information, the storage and retrieval of that information becomes a greater problem. Often companies may store data on the order of petabytes (PBs) or larger. Colloquially, information systems that store these very large amounts of data may be referred to as “Big Data”.
Typically, massive high-volume data storage introduces significant information-management obstacles, which “Big Data” solutions are meant to address. Often such obstacles include one or more of: high-volume, high-speed insertion of data into the database; support for petabytes of stored data; purging strategies that can keep pace with the insertion speed; mutating schemas or data formats that cause expensive data migrations; and/or queries that treat every field as an equally viable criterion. It is understood that the above are merely a few illustrative examples to which the disclosed subject matter is not limited.
Frequently, different solutions applied to the Big Data sector specialize in different aspects of these problems. However, these solutions generally suffer from a common problem: responding quickly enough to queries as the data grows. In one example, the time a user waits between making a query request to the database and receiving the first data record returned as a result of the query (a.k.a. the time-to-first-result) degrades as the data set grows larger.
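As a purely illustrative sketch, the time-to-first-result described above might be measured as shown below. The names `time_to_first_result` and `full_scan` are hypothetical and do not correspond to any particular database or to the disclosed subject matter. The sketch assumes a naive query that scans every record in order, which is only one of many possible query strategies.

```python
import time
from typing import Callable, Iterable, Iterator, Optional, Tuple, TypeVar

T = TypeVar("T")


def full_scan(records: Iterable[T], predicate: Callable[[T], bool]) -> Iterator[T]:
    """Hypothetical naive query: visit every record in order, yield the matches."""
    for record in records:
        if predicate(record):
            yield record


def time_to_first_result(
    run_query: Callable[[], Iterator[T]],
) -> Tuple[float, Optional[T]]:
    """Return (seconds waited, first record), or (seconds, None) if no record matched."""
    start = time.perf_counter()
    results = run_query()
    first = next(results, None)  # blocks until the first record is produced
    elapsed = time.perf_counter() - start
    return elapsed, first


# Illustrative worst case: the only matching record is the last one scanned,
# so the wait for the first result grows with the size of the data set.
for size in (10_000, 1_000_000):
    target = size - 1
    elapsed, first = time_to_first_result(
        lambda: full_scan(range(size), lambda r: r == target)
    )
    print(f"{size:>9} records: first result {first} after {elapsed:.4f} s")
```

In this sketch the wait is measured only up to the first returned record, not until the query finishes, which is the distinction the time-to-first-result metric draws.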
For example, a company's web site (e.g., Salesforce.com, etc.) may produce about 1 terabyte (TB) of performance data per day, and the company may expect the data acquisition rate to accelerate significantly year-over-year as it grows its business. This company may desire to retain 15 months of that data in order to maintain visibility on annual and semi-annual operational events/patterns or for other reasons. However, the company may also desire to access a piece of data within 30 seconds of its insertion into the larger database or data set. Moreover, the company may desire that queries be able to return useful results in 30 seconds or less, even though the queries might span the entire data set (e.g., >1 TB, >1 PB, hundreds of millions of records, etc.). It is understood that the above are merely a few illustrative examples.
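The scale implied by the figures in this example can be approximated with simple arithmetic, sketched below. The sketch assumes decimal terabytes (10^12 bytes) and 30-day months. The 50% annual growth rate is a hypothetical value chosen only for illustration and does not come from the example above.

```python
SECONDS_PER_DAY = 24 * 60 * 60
BYTES_PER_TB = 10**12  # assumption: decimal terabytes


def retained_volume_tb(
    daily_tb: float, retention_days: int, annual_growth: float = 0.0
) -> float:
    """Total volume held when each day's data is kept for `retention_days`.

    The daily rate starts at `daily_tb` and compounds by `annual_growth`
    per 365 days (0.0 means a constant rate).
    """
    return sum(
        daily_tb * (1.0 + annual_growth) ** (day / 365)
        for day in range(retention_days)
    )


def mean_ingest_mb_per_s(daily_tb: float) -> float:
    """Average sustained insertion rate, in MB/s, needed to absorb `daily_tb` per day."""
    return daily_tb * BYTES_PER_TB / SECONDS_PER_DAY / 10**6


retention_days = 15 * 30  # assumption: 15 months of 30 days each

print(f"constant rate: {retained_volume_tb(1.0, retention_days):.0f} TB retained")
print(
    "with hypothetical 50% annual growth: "
    f"{retained_volume_tb(1.0, retention_days, annual_growth=0.5):.0f} TB retained"
)
print(f"mean ingest rate: {mean_ingest_mb_per_s(1.0):.1f} MB/s")
```

Under these assumptions, a constant 1 TB per day over 450 days amounts to 450 TB of retained data, and any positive growth rate pushes the total higher, which is why a query spanning the entire data set may need to cover hundreds of terabytes or more.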