Current statistical languages give developers of analytics, statistical programs, and other applications, as well as other users such as consumers of analytics (hereinafter referred to generically as “users”), the ability to input and analyze a wide variety of data. However, the operations typically supported by such languages operate on data “in memory.” In fourth-generation languages (4GLs), such as S-PLUS, the data is processed in an object-oriented fashion that is intuitive but sometimes space-consuming, because an object is created in memory to hold the data as it is being manipulated by the statistical operations. That is, these operations can succeed only as long as the data to be processed fits in memory all at once. Thus, the operations can handle only as much data (or as large a data set) as the memory supports. Memory includes, for example, both available RAM and virtual memory, in environments where virtual memory is supported. (Virtual memory is typically supported by a swap space on an external disk drive, which is accessed as if it were RAM.) Note that, as used here, the terms “data,” “set of data,” and “data set” are used interchangeably and indicate one or more items that are being processed together.
For example, if the task at hand is to use statistical models to analyze a data set, then the entire data set is read into memory as an object so that statistical operations can be applied to the data efficiently. Typically, modeling such data sets requires several copies of the data to be created, for example 3-10 copies, while the analytic is being performed. In such cases, the amount of data that can be handled at once is limited by the amount of memory readily available.
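By way of illustration only, the effect of such copies on the usable data set size can be sketched as simple arithmetic. The following Python sketch is not part of any statistical language discussed here; the function names `peak_memory_bytes` and `max_data_bytes` and the 16 GiB memory figure are hypothetical, and the sketch assumes that every copy is a full copy of the data set.

```python
GIB = 1024 ** 3  # bytes in one gibibyte


def peak_memory_bytes(data_bytes: int, copies: int) -> int:
    """Estimate peak memory use when `copies` full copies of the data are held at once."""
    if data_bytes < 0 or copies < 1:
        raise ValueError("data_bytes must be >= 0 and copies must be >= 1")
    return data_bytes * copies


def max_data_bytes(memory_bytes: int, copies: int) -> int:
    """Estimate the largest data set that fits when `copies` full copies are required."""
    if memory_bytes < 0 or copies < 1:
        raise ValueError("memory_bytes must be >= 0 and copies must be >= 1")
    return memory_bytes // copies


if __name__ == "__main__":
    memory = 16 * GIB  # assumed amount of readily available memory
    for copies in (3, 10):  # the range of copies mentioned in the text
        limit = max_data_bytes(memory, copies)
        print(f"{copies} copies: largest data set is about {limit / GIB:.2f} GiB")
```

Under these assumptions, the largest data set that can be modeled is only a fraction of the available memory, and that fraction shrinks as the number of copies grows.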
Although some modern computer systems can accommodate a large virtual memory space, especially 64-bit computing devices, statistical operations that operate on large data sets may use the virtual memory in such a way (e.g., with random access of data) that the program “thrashes,” causing memory pages to be continuously swapped in and out. Thrashing in this manner severely degrades performance, rendering the use of virtual memory for such analytic tasks impracticable.
In addition, in some instances programs created using the statistical language that were operable at some point, for example, during prototyping, may suddenly stop working as the data grows beyond the capacity of the memory, such as when the programs are placed in production. Sometimes such issues remain undetected until the program is placed in the field. To attempt to solve such data set problems, the ultimate consumer of the program may end up sampling or otherwise aggregating the data set into smaller portions that can be appropriately analyzed or operated upon. Sub-setting the data in this manner may, in some scenarios, generate incorrect results, or at least inject a degree of error where it is not desired. In addition, not all applications of a particular statistical scenario permit the data to be subset at all; in those cases, the program is simply inoperable.
Moreover, customers increasingly wish to process data sets ranging from hundreds of megabytes up to tens or even hundreds of gigabytes in size. Traditional in-memory models cannot support such data sizes, which are growing faster than the memory capacity of the computer systems used to process such data.