The field of the invention is specialized computing systems and methods for data intensive computing. More specifically, the present invention relates to systems and methods for memory and data management within a data intensive computer architecture capable of facilitating multi-petabyte data analysis.
The traditional process of scientific discovery consists of systematic observation, experimentation, measurement and data collection, leading to the creation of a theory that explains past observations and predicts the results of future experiments. In virtually every field of science, technological progress has led to the construction of high-throughput measurement instruments, such as telescopes, high-energy particle accelerators, gene sequencing machines, and the like. These high-throughput measurement instruments generate very large data sets from observation of complex physical systems. On the other hand, theoretical description of complex physical phenomena starts with a set of basic laws, typically expressed as partial differential equations, whose consequences are investigated with the help of simulation experiments using computational models. The volume of data produced by computer simulations has been increasing even more rapidly than the size of empirical measurement data sets.
The availability of large experimental datasets coupled with the potential to analyze them computationally is changing the nature of science. In many cases, however, the ability to acquire experimental data outpaces the ability to process such data, leading to the so-called “data deluge.” This data deluge is the outcome of three converging trends: the proliferation and availability of high throughput instruments, the increasing size and availability of mass storage devices to store the measurements, and the continuing evolution of processor design according to Moore's law.
As discussed above, the volume of data produced by computer simulations, used in virtually all scientific disciplines today, is increasing at an even faster rate than experimental data. The reason for this volumetric increase is, in many cases, the desire to preserve intermediate simulation steps for future reuse because the intermediate steps represent substantial computational investments. The sheer volume of these datasets is only one of the challenges that scientists must confront. For example, data analyses in some disciplines, such as environmental sciences, must span thousands of distinct datasets with incompatible formats and inconsistent metadata. Overall, dataset sizes follow a power law distribution and challenges abound at both extremes of this distribution.
While improvements in computer hardware have enabled this data explosion, the traditional computer architecture and the components used within it have failed to keep pace. Referring to FIG. 1, a traditional computer network 100 is illustrated. The computer network 100 is formed by a plurality of workstations 102 that are connected to one or more servers 104. The workstations 102 may vary in the performance of their hardware systems, and the servers 104 generally include higher-performance and substantially more robust hardware systems than the workstations 102. However, regardless of the underlying performance of the hardware, such as clock speeds, memory size, and the like, the underlying hardware architecture of computer systems, both servers and workstations, and, by extension, the software architecture designed to operate these hardware architectures are generally similar and have been generally stagnant.
Specifically, referring to FIG. 1, a general hardware architecture 106 is illustrated that is representative of the hardware architecture employed in both workstations 102 and servers 104. The hardware architecture 106 includes one or more CPUs 108. Trends in CPU designs have evolved over the years and have included increasing clock speeds, increasing density per chip, increasing chips per die, and an increasing number of processors that together form the CPU 108. While each advancement has improved performance, computer hardware architecture since the earliest designs has maintained this design of a CPU 108, which may include one or more caches 110.
The CPU 108 is generally connected through a bridge 112 to memory 114 and, in some cases, an additional non-local cache 116. While memory and cache design evolutions have been great over the years, they generally adhere to the concept of placing one or more levels of comparatively fast random access memory in close proximity (access proximity) to the CPU 108. In more recent times, dedicated graphics processing units (GPUs) 118 have been adapted from processors utilized simply to drive a display 120 into secondary, specialized processors that the CPU 108 can utilize to offload tasks fitting the specialized capabilities of the GPU 118, such as transcoding operations and many others. In any case, the general computer architecture 106, regardless of workstation or server, regardless of whether the computer architecture is from the 1990s or more recent times, provides a CPU 108 and memory 114 and may be supplemented by secondary processing and memory components, such as a GPU 118 and various caches 110, 116 dedicated to particular situations. In this regard, the above-described components may be conceptualized as a CPU/memory sub-system 122.
The computer architecture 106 also includes a bus or multiple buses 124 that connect the above-described CPU/memory sub-system 122 to other, slower components of the computer architecture 106. For example, the buses 124 may provide connections to a universal serial bus (USB) hub or controller 126 and/or dedicated, bus-connected I/O devices 128. Of course, I/O connections may vary substantially; however, in all cases, the bus 124 provides connections to one or more hard drives 130. These hard drives 130 may take many forms and, more recently, include hardware advances such as solid-state drives, but are uniformly present in workstations or personal computers 102 and servers 104. This is because all traditional notions of computer architecture can be conceptualized as, at a minimum, including a CPU/memory sub-system 122 and a mass-storage sub-system 132.
Despite great advances in the individual components of the traditional hardware architecture and the software designed to operate on the traditional hardware architecture, traditional computer architectures 106 include substantial limitations. For example, CPU performance has been doubling every 18 months, following Moore's Law. The capacity of disk drives is doubling at a similar rate, somewhat slower than the original Kryder's Law prediction, driven by higher density platters. On the other hand, disks' rotational speed has changed little over the last ten years. The result of this divergence is that, while sequential I/O speeds increase with density, random I/O speeds have changed only moderately. Due to the increasing difference between the sequential and random I/O speeds of disks, only sequential disk access is practical at this scale. More specifically, if a 100-terabyte computational problem requires mostly random access patterns, it cannot be completed in a practical amount of time.
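The gap between sequential and random disk access described above can be illustrated with a short back-of-the-envelope calculation. The following Python sketch is illustrative only; the drive throughput, the random-read rate, and the read size are assumed round numbers that do not appear in the foregoing description and are not measurements of any particular device.

```python
# Illustrative drive parameters. These are assumptions chosen for the
# example, not figures taken from the description or from any device.
SEQ_BYTES_PER_SEC = 150e6      # assumed sequential throughput: 150 MB/s
RANDOM_READS_PER_SEC = 100     # assumed seek-bound random reads per second
RANDOM_READ_BYTES = 8 * 1024   # assumed size of each random read: 8 KiB

DATASET_BYTES = 100e12         # the 100-terabyte problem from the text
SECONDS_PER_DAY = 86_400


def sequential_scan_days(total_bytes: float) -> float:
    """Days needed to stream the whole dataset once from a single drive."""
    return total_bytes / SEQ_BYTES_PER_SEC / SECONDS_PER_DAY


def random_access_days(total_bytes: float) -> float:
    """Days needed to touch the whole dataset through small random reads."""
    reads = total_bytes / RANDOM_READ_BYTES
    return reads / RANDOM_READS_PER_SEC / SECONDS_PER_DAY


seq_days = sequential_scan_days(DATASET_BYTES)
rand_days = random_access_days(DATASET_BYTES)

print(f"sequential scan: {seq_days:.1f} days")
print(f"random access:   {rand_days:.0f} days")
print(f"slowdown factor: {rand_days / seq_days:.0f}x")
```

Under these assumed parameters, a single sequential pass takes on the order of a week per drive, whereas covering the same data through small random reads takes on the order of years. Different assumed parameters shift the exact figures but not the direction of the comparison.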
These and many other limitations of the traditional notions of computer architecture are substantially compounded when the computer systems are formed into a network. As described above, traditional computer architectures, generally, provide a substantial computing system at each point that is, generally, capable of autonomous operation; that is, each system includes a CPU/memory sub-system 122 and mass-storage sub-system 132. Accordingly, general notions of computer systems operating in a network and sharing network resources are limited by network speeds, even in a data center, that are unable to keep up with the doubling of the data sizes. In other words, with petabytes (“PB”) of data, the data cannot be moved to where the computing occurs with sufficient speed.
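The difficulty of moving petabytes of data to where the computing occurs can likewise be illustrated numerically. The following Python sketch is a minimal example under stated assumptions: the link speeds are hypothetical round numbers, and the calculation assumes an ideal link with no protocol overhead or contention, so real transfers would be slower.

```python
PETABYTE_BYTES = 1e15
SECONDS_PER_DAY = 86_400


def transfer_days(total_bytes: float, link_gbps: float) -> float:
    """Days to move total_bytes over an ideal link of link_gbps gigabits/s.

    Assumes the link is fully utilized with no overhead, which is an
    optimistic simplification.
    """
    bytes_per_sec = link_gbps * 1e9 / 8  # gigabits per second -> bytes per second
    return total_bytes / bytes_per_sec / SECONDS_PER_DAY


# Hypothetical link speeds, chosen only for illustration.
for gbps in (1, 10, 40):
    days = transfer_days(PETABYTE_BYTES, gbps)
    print(f"1 PB over {gbps:>2} Gbit/s: {days:.1f} days")
```

Even under these optimistic assumptions, moving a single petabyte takes days to months, which is consistent with the observation that the data cannot practically be brought to the computation.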
The typical analysis pipeline of a data-intensive scientific problem starts with a low-level data access pattern during which outliers are filtered out, aggregates are collected, or a subset of the data is selected based on custom criteria. The more CPU-intensive parts of the analysis happen during subsequent passes. Such analyses are currently implemented in academic Beowulf clusters that combine compute-intensive but storage-poor servers with network attached storage. These clusters can handle problems of a few tens of terabytes, but they do not scale above a hundred terabytes because, for example, they may be constrained by the very high costs of PB-scale enterprise storage systems. Furthermore, as these traditional systems grow to meet subsequent data needs, the power and space requirements for such systems exceed what is available to individual scientists and small research groups. In addition, existing supercomputers are not well suited for data-intensive computations, at least in part because they maximize CPU cycles but lack I/O bandwidth to the mass storage layer. Most supercomputers also lack disk space adequate to store PB-size datasets over multi-month periods.
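The first, low-level pass of the analysis pipeline described above (filtering outliers and collecting aggregates) can be sketched as a single streaming scan. The following Python example is a minimal illustration; the function names, the sample values, and the outlier bounds are hypothetical and are not drawn from any particular analysis.

```python
from typing import Iterable, Iterator, Tuple


def filter_outliers(records: Iterable[float], low: float, high: float) -> Iterator[float]:
    """Stream records once, yielding only values inside [low, high]."""
    for value in records:
        if low <= value <= high:
            yield value


def summarize(values: Iterable[float]) -> Tuple[int, float]:
    """Collect a simple aggregate (count and mean) in one pass."""
    count, total = 0, 0.0
    for value in values:
        count += 1
        total += value
    mean = total / count if count else float("nan")
    return count, mean


# Hypothetical measurements; 1000.0 and -500.0 stand in for outliers.
raw = [1.0, 2.0, 1000.0, 3.0, -500.0]
count, mean = summarize(filter_outliers(raw, low=0.0, high=10.0))
print(count, mean)
```

Because both steps consume the data as a stream, the pass is bound by how fast records can be read rather than by CPU work, which is why the I/O bandwidth to the storage layer dominates this stage.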
Furthermore, commercial cloud computing platforms are not a feasible solution. For example, in recent years, when mass storage has become a limitation, remote or “cloud” storage has been sought as a solution. However, when considering massive and data-intensive computer operations on the scale of PBs or factors of PBs, the data movement and access fees are excessive compared to purchasing physical disks, the I/O performance offered is substantially lower, and the amount of disk space provided is inadequate.
The above description of memory and storage architecture and hierarchy has remained essentially consistent for decades. For example, M. J. Murdocca and V. P. Heuring have accurately summarized this hierarchy from slow and inexpensive to fast and expensive in the following order: off-line/on-line storage, secondary storage, main memory, caches, registers. That is, this hierarchy and the results of this hierarchy have remained consistent in Murdocca et al.'s 1999 book that discussed this hierarchy, “Principles of Computer Architecture,” in Murdocca et al.'s 2007 book that discussed this hierarchy, “Computer Architecture and Organization: An Integrated Approach,” and continuing through more recent presentations on the subjects. However, as described above, traditional paradigms of computer architecture and the software constructs that operate on these traditional paradigms of computer architecture are incapable of handling data-intensive computing on the scale of PBs or factors of PBs.
Coupled with these shortcomings of traditional notions of computer hardware architecture are traditional notions of computer software architecture. Though not always correlated directly, many operational and design components of traditional computer software architectures are predicated on systemic features of traditional computer hardware and software. For example, the operating system of traditional computer systems is designed to balance programmer productivity with implementation efficiency.
Referring to FIG. 2, the above-described computer hardware architecture 106 is coupled with an operating system 200. The operating system 200 can be represented as a plurality of sub-systems. On one end of the operating system 200 are a series of sub-systems, such as a file system 202 and device drivers 204, that are tightly designed with the computer hardware architecture 106 in mind. On the other end of the operating system 200 are a series of sub-systems, such as a user interface 206 and application program interfaces (APIs) 208, that are more tightly coupled to or provide the fundamental basis on which applications 210 can perform in concert with the operating system 200. At the heart of the operating system 200 is the kernel 212.
Though a simplification of the multitude of sub-systems included with the operating system 200 and even within the sub-systems illustrated in FIG. 2, the representation illustrates the influence of traditional computer hardware notions on operating system design and, further, illustrates that the combined standards of computer architecture and operating system substantially dictate the fundamentals of operation for applications 210. That is, traditional notions of computer programming are built upon the consistency of the above-described computer hardware and operating system architectures. For example, high-level programming languages hide the computer's memory hierarchy and system architecture, while the operating system provides highly optimized services for all application developers. The only means of permanently storing data is by writing it in a file, and the abstract programming model of sequential file access is efficiently implemented in the operating system. The operating system typically does not include services for handling high-level programming objects, such as arrays or graphs. When there is a need to store such objects for subsequent computation, the programmer must make use of the file system with serialization/deserialization of these objects.
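The reliance on the file system for persisting high-level objects can be illustrated with a brief example. The following Python sketch is one possible illustration using the standard-library pickle module; the object contents, the helper names, and the file name are hypothetical. It shows that an in-memory graph and array must be flattened to a byte stream, written to a file, and rebuilt on reading.

```python
import os
import pickle
import tempfile

# Hypothetical in-memory objects: a small graph as an adjacency mapping
# and an array of sample values.
graph = {"a": ["b", "c"], "b": ["c"], "c": []}
samples = [0.5, 1.25, 2.0]


def save_objects(path: str, objects: dict) -> None:
    """Serialize the objects to a byte stream and write it to a file."""
    with open(path, "wb") as f:
        pickle.dump(objects, f)


def load_objects(path: str) -> dict:
    """Read the byte stream back and rebuild the in-memory objects."""
    with open(path, "rb") as f:
        return pickle.load(f)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "state.bin")  # hypothetical file name
    save_objects(path, {"graph": graph, "samples": samples})
    restored = load_objects(path)

print(restored["graph"] == graph and restored["samples"] == samples)
```

The operating system sees only an opaque sequence of bytes in the file; the structure of the graph and the array exists solely in the application, which must reconstruct it in full before any computation can resume.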
Tied to these notions of high-level programming languages and traditional notions of computer operating systems are the fundamentals of traditional compilers. Because traditional compilers are focused on the creation of executable programs, traditional compilers are necessarily tied back to the fundamentals of the operating system 200 and the hardware architecture 106 described above. Unfortunately, as described above, these systems, both hardware and software, present substantial impediments to realizing data-intensive computing systems and methods on the order of PB or factors of PB.
It would therefore be desirable to provide a system and method for storing, transferring, and analyzing large sets of data, for example, at a minimum of the petabyte range.