The field of the invention is general computer architecture and, more particularly, systems and methods for data intensive computing. More specifically, the present invention is a computer architecture for a system capable of executing programs performing multi-petabyte data analysis.
The traditional process of scientific discovery consists of systematic observation, experimentation, measurement and data collection, leading to the creation of a theory that explains past observations and predicts the results of future experiments. In virtually every field of science, technological progress has led to the construction of high-throughput measurement instruments, such as telescopes, high-energy particle accelerators, gene sequencing machines, and the like. These high throughput measurement instruments generate very large data sets from observation of complex physical systems. On the other hand, theoretical description of complex physical phenomena starts with a set of basic laws, typically expressed as partial differential equations, whose consequences are investigated with the help of simulation experiments using computational models.
The availability of large experimental and simulation datasets is changing the way science is practiced. Comparison between theory and observation has become a difficult problem that often requires analysis of very large data sets. In many instances special-purpose hardware and software have been developed. Such solutions are expensive and time consuming. Increasingly there is a need for a general-purpose computational capability with petascale datasets.
This demand exists not only in science but also in business, where companies rely, with increasing regularity, on gaining competitive or strategic business information from very large datasets. For example, Ian Ayres, in his book, Super Crunchers: Why Thinking-by-Numbers Is the New Way to Be Smart, describes how businesses, from airlines to supermarkets, are looking to process massive datasets with increasing regularity. Of course, as noted above, hardware and software capable of processing such datasets are not generally available. Accordingly, as Ian Ayres describes, these businesses are forced to rely upon specialized businesses that have developed or can develop the special-purpose hardware and software, and the expertise to operate this special-purpose hardware and software, necessary to process very large datasets. As more businesses deploy petascale databases, there will be increasing demand for general-purpose computational capability to mine these data sets for useful information.
While improvements in computer hardware have enabled this data explosion, the traditional computer architecture and components used within the traditional computer and computer network have failed to keep pace. Referring to FIG. 1, a traditional computer network 100 is illustrated. The computer network 100 is formed by one or more workstations 102 that are connected to one or more servers 104. The workstations 102 may have more or less high-performance hardware systems and the servers 104 generally include more high-performance and substantially robust hardware systems than the workstations 102. However, regardless of the underlying performance of the hardware, such as clock speeds, memory size, and the like, the underlying hardware architecture of computer systems, both servers and workstations, and, by extension, the software architecture designed to operate these hardware architectures is generally similar and has been generally stagnant.
Specifically, referring to FIG. 1, a general hardware architecture 106 is illustrated that is representative of the hardware architecture employed in both workstations 102 and servers 104. The hardware architecture 106 includes one or more CPUs 108, which may include one or more local caches 110, which form a first layer in a memory hierarchy. The remaining layers in the memory hierarchy are accessible to the CPU 108 through a bridge 112. In particular, the other layers in the memory hierarchy include a random access memory (RAM) 114 and, in some cases, an additional non-local cache 116. While memory and cache designs have evolved greatly over the years, they generally adhere to the concept of placing one or more levels of comparatively fast random access memory in close proximity (access proximity) to the CPU 108. In any case, traditional computer memory hierarchies include one or more local caches 110, RAM 114, and one or more non-local caches 116.
In more recent times, dedicated graphics processing units (GPUs) 118 have been adapted from processors utilized to simply drive a display 120 to a secondary, specialized processor that the CPU 108 can utilize to offload tasks fitting the specialized capabilities of the GPU 118, such as transcoding operations and many others. In any case, the traditional computer architecture 106, regardless of workstation or server, regardless of whether the computer architecture is from the 1990s or more recent times, provides a CPU 108 and memory 114 and may be supplemented by secondary processing and memory components, such as a GPU 118 and various caches 110, 116 dedicated to particular situations. In this regard, the above-described components may be conceptualized as a CPU/memory sub-system 122.
The computer architecture 106 also includes a bus or multiple buses 124 that connect the above-described CPU/memory sub-system 122 to other, slower components of the computer architecture 106. For example, the buses 124 may provide connections to a universal serial bus (USB) hub or controller 126 and/or dedicated, bus-connected I/O devices 128. Of course, I/O connections may vary substantially; however, in all cases, the bus 124 provides connections to one or more hard drives 130. These hard drives 130 may take many forms and, more recently, include hardware advances such as solid-state drives, but are uniformly present in workstations or personal computers 102 and servers 104. This is because all traditional notions of computer architecture can be conceptualized as, at a minimum, including a CPU/memory sub-system 122 and a mass-storage sub-system 132.
Presently, hundreds of hard disks are needed to store a petabyte (PB) dataset. As a result, while the task of sorting a one-terabyte data set can be efficiently performed on a traditional computer by a program consisting of just a few lines of code, sorting a petabyte data set requires developing special-purpose software that will depend on the hardware characteristics of the storage system used. More generally, computational tasks of O(N log(N)) complexity, where N is the number of items in the data set, are feasible because for all practical data sets log(N) < 80. Nevertheless, in addition to the complexity of the required hardware, presently such computations require very costly and time-consuming development of special-purpose complex software.
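By way of illustration only, the arithmetic behind the two figures above may be sketched as follows. The sketch assumes that the logarithm is taken base 2, that a petabyte is 10^15 bytes, and that each hard disk holds 2 terabytes; these values, along with the function names log2_items and disks_needed, are hypothetical choices made for this example and are not taken from the description.

```python
import math

PETABYTE = 10**15          # assumed decimal definition of one petabyte, in bytes
ASSUMED_DISK = 2 * 10**12  # assumed per-disk capacity (2 TB); illustrative only


def log2_items(n_items: int) -> float:
    """Return log2(N), the per-item factor in an O(N log N) task (base 2 assumed)."""
    return math.log2(n_items)


def disks_needed(dataset_bytes: int, disk_bytes: int) -> int:
    """Return the smallest number of disks whose combined capacity holds the dataset."""
    return -(-dataset_bytes // disk_bytes)  # integer ceiling division


if __name__ == "__main__":
    # Even with one item per byte of a petabyte, log2(N) stays just under 50.
    print(f"log2(1e15 items) = {log2_items(PETABYTE):.1f}")
    # The bound log(N) < 80 would only be reached at N = 2**80 (about 1.2e24 items).
    print(f"N at log2(N) = 80: {2**80:.2e}")
    # Under the assumed 2 TB capacity, one petabyte spans 500 disks.
    print(f"disks for 1 PB: {disks_needed(PETABYTE, ASSUMED_DISK)}")
```

Under these assumptions, the logarithmic factor remains a small constant for any practically storable data set, so the difficulty of petabyte-scale computation lies in spreading the N items across hundreds of storage devices rather than in the log(N) term.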
Existing supercomputers are not well suited for data intensive computations, at least in part because they maximize CPU cycles but lack I/O bandwidth to the mass storage layer. Most supercomputers also lack disk space adequate to store PB-size datasets over multi-month periods.
Furthermore, commercial cloud-computing platforms do not provide a feasible solution. For example, when mass storage becomes a limitation, remote or cloud storage has recently been sought as a solution. However, when considering massive and data-intensive computer operations on the scale of PBs or multiples of PBs, the data movement and access fees are excessive compared to purchasing physical disks, the I/O performance offered is substantially lower, and the amount of disk space provided is inadequate.
It would therefore be desirable to provide a general-purpose programmable computer to perform computations with large datasets, for example, datasets in at least the petabyte range.