This invention relates to the fields of computer systems and data processing. More particularly, a system, method, and apparatus are provided for organizing, joining, and then performing calculations on massive sets of data.
Computing systems that host communication services, news sources, social networking sites, retail sales, and/or other services process large amounts of data. Different datasets may be assembled for different applications, different application features, or for other purposes, but may be inter-related. As a simple example, an organization that sells a product may maintain one dataset comprising communications (e.g., electronic mail messages) sent by all of its salespeople, and another dataset correlating those salespeople with the clients they service. To obtain a report indicating how often each salesperson communicates with his or her clients, for example, the two datasets would typically be joined in their entirety and then processed in some manner.
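By way of illustration only, the conventional join-then-process approach described above may be sketched as follows. The record layouts, the sample data, and the function names (join_on_salesperson, count_client_contacts) are hypothetical and are assumed solely for this example; they are not drawn from any particular system.

```python
from collections import Counter

# Hypothetical dataset 1: one (salesperson, recipient) record per message sent.
COMMUNICATIONS = [
    ("alice", "acme"),
    ("alice", "acme"),
    ("alice", "globex"),
    ("bob", "initech"),
    ("bob", "acme"),  # bob does not service acme
]

# Hypothetical dataset 2: one (salesperson, client) record per serviced client.
ASSIGNMENTS = [
    ("alice", "acme"),
    ("alice", "globex"),
    ("bob", "initech"),
]


def join_on_salesperson(communications, assignments):
    """Join the two entire datasets on the salesperson key (a simple hash join).

    The intermediate result holds one row for every pairing of a message
    with a client of its sender, so it can be far larger than either input.
    """
    clients_of = {}
    for salesperson, client in assignments:
        clients_of.setdefault(salesperson, []).append(client)
    return [
        (sender, recipient, client)
        for sender, recipient in communications
        for client in clients_of.get(sender, [])
    ]


def count_client_contacts(joined):
    """Process the joined rows: count messages each salesperson sent to each
    of his or her own clients."""
    return Counter(
        (sender, client)
        for sender, recipient, client in joined
        if recipient == client
    )


joined = join_on_salesperson(COMMUNICATIONS, ASSIGNMENTS)
report = count_client_contacts(joined)
```

In this sketch the five communication records and three assignment records produce eight intermediate rows before any counting is done, which illustrates on a small scale how the joined collection can outgrow the datasets from which it is built.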
Some organizations, however, need to correlate, analyze, or otherwise process tens or hundreds of millions of records, or more. One example is an organization that operates a social networking site or a popular communication application and that assembles voluminous data regarding its members' activities. Joining datasets within this type of environment could yield an intermediate collection of data amounting to tens or hundreds of terabytes. Generating this huge data collection and performing queries or other processing to extract desired information could therefore take a significant amount of time (e.g., many hours). The delay may be so long, in fact, that the resulting information is obsolete by the time it is produced.