In many industries, information technology platforms for technical computing have become key differentiators and drivers of business growth. For companies in these industries (including the biotechnology, pharmaceutical, geophysical, automotive, and aerospace industries, among others), a new class of business-critical technical computing problems has arisen. These problems are complex, with solutions often involving multiple computational steps with large quantities of data passing from one step to the next. In addition, the individual computational steps themselves are often computer-intensive and very time-consuming.
The life science industry offers some of the best examples of this new class of applications. In pharmaceutical R&D, for example, users often query dozens of different databases using a variety of search and analysis tools, moving data from one analysis to the next in “pipelined” dataflows. These dataflows, which are both computationally intensive and logistically complex, are essential parts of the R&D process, because without them, it would be impossible to adequately investigate the large amount of data coming out of modern, highly automated laboratories.
Pipelined dataflows are also becoming part of the computing infrastructure that supports clinical medicine. Major medical centers are beginning to use complex heterogeneous databases and applications for analyzing clinical, genetic, and demographic information to make final treatment decisions. Such infrastructure is becoming a differentiator for major medical institutions.
The new class of technical computing problems described above poses a number of substantial computational challenges. First, these problems require huge amounts of CPU horsepower and other computational resources. For a number of years, the platforms of choice for large computations have been clusters or local-area networks of small-to-moderate-size commodity machines. Such clusters have rapidly replaced traditional large multiprocessing supercomputers for most high-performance computing tasks, due largely to the substantial cost-performance advantages they offer. However, individual clusters are insufficient to meet the voracious demands of complex technical computing dataflows. This has led to a new type of high-performance computing platform based on the concept of a “computing grid” (analogous to a utility grid) made up of multiple interconnected clusters. Computing grids have a wealth of resources (CPUs, storage, communication, etc.) that may be applied to technical computations, but management of such diverse collections of resources is difficult, and effective software solutions are only now beginning to appear.
A second major challenge arises because solutions to the new class of problems must access and integrate massive amounts of data from large numbers of distinct sources. Algorithms for technical computation traditionally assume that all required data is stored locally and that the costs of data access are almost always dominated by the costs of computation. For many problems of the type considered here, however, the data may reside at numerous sites and may be stored in very diverse formats. For such problems, the costs of accessing the data and integrating it into the computational process may easily outstrip the costs of the computations themselves. This challenge is motivating the development of new data management tools capable of efficiently accessing and integrating such widely distributed data.
Finally, a third major challenge relates to the way that users manage the execution of the kind of large, complex dataflows required to solve mission-critical technical computing problems. These dataflows typically include a number of computer-intensive steps, and it is usually necessary to perform complex transformations as data is transferred between steps. Traditionally, running such dataflows has required substantial amounts of time and effort from a skilled user capable of starting and monitoring each computational step, dealing with potential processing errors, transforming the data between steps, and moving data among the different computers where computations may take place. Such a manual process is a significant impediment to the effective use of complex dataflows, since it wastes valuable time of skilled users and its tedium increases the likelihood of errors. The need to overcome such operational hurdles is driving the development of application management tools to automate and accelerate dataflow processing.
Application integration/interoperability and acceleration are essential requirements for solving this new class of technical computing problems. Application accelerators and dataflow systems are the two main tools used to address these requirements. Application accelerators are designed to improve the performance of individual, stand-alone applications. Most accelerators have evolved from tools such as MPI, PVM, or Linda that were created originally to accelerate applications using special parallel programming techniques designed to exploit static sets of “worker” machines. Today, the most common application accelerators are more flexible, allowing them to exploit a set of widely distributed workers that may evolve dynamically throughout the course of a computation. However, particularly among the so-called “peer-to-peer” application accelerators touted for use on computing grids, there are often substantial tradeoffs for this flexibility that may make the accelerators unsuitable for large classes of applications. The tradeoffs include limitations on file transfers to or from the worker machines; restricted or prohibited communications among the workers; constraints on the specific combinations of hardware, operating systems, and programming languages that are permitted; and restrictions or inefficiency due to security, encryption, or other requirements. In addition, in many cases, the use of application accelerators by end users may be severely limited by the fact that most of the accelerators require modification of application source code in order to deal with the transfer of data to and from the workers.
Dataflow systems take an entirely different approach to acceleration, focusing not on individual applications, but on complex pipelines of applications called dataflows that may be thought of visually as flowcharts (including logic and loops), where the flowchart boxes correspond to specific applications or data accesses. Such dataflows are common in many industries, and they are ubiquitous in the life sciences. Almost all members of the new class of mission-critical technical computing problems are, in fact, solved by dataflows, not individual applications, since the solutions require accessing data from numerous sources, applying multiple types of analysis, and making significant logistic decisions during the computational process. The key issues in dataflow systems are application integration/interoperability (including data conversion and data flow among the applications in the dataflows) and performance improvement by means of sophisticated application-specific scheduling. The best dataflow systems are able to address these issues without access to the source code of individual applications used in the dataflows; this broadens the applicability of such systems substantially as compared with the application accelerators discussed above.
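The flowchart view of a dataflow can be illustrated with a minimal Python sketch. It is not taken from any particular dataflow system: the step names, the `STEPS` and `DEPENDS_ON` tables, and the `run_dataflow` helper are all hypothetical, each "box" is a plain function standing in for an application or data access, and the sketch covers only a simple acyclic pipeline without the logic and loops that real dataflows may contain.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


# Hypothetical flowchart boxes. Each receives a dict of its predecessors' results.
def fetch(_inputs):
    return [3, 1, 2]  # stands in for a data access


def sort_step(inputs):
    return sorted(inputs["fetch"])  # stands in for an analysis application


def summarize(inputs):
    values = inputs["sort"]
    return {"min": values[0], "max": values[-1]}


STEPS = {"fetch": fetch, "sort": sort_step, "summarize": summarize}

# Flowchart arrows: each step maps to the set of steps it depends on.
DEPENDS_ON = {"fetch": set(), "sort": {"fetch"}, "summarize": {"sort"}}


def run_dataflow(steps, depends_on):
    """Run every step after its predecessors, passing their results along."""
    results = {}
    for name in TopologicalSorter(depends_on).static_order():
        inputs = {dep: results[dep] for dep in depends_on[name]}
        results[name] = steps[name](inputs)
    return results


if __name__ == "__main__":
    print(run_dataflow(STEPS, DEPENDS_ON)["summarize"])
```

Because the steps are described only by their names and dependencies, a scheduler of this kind needs no access to the internals of the individual steps, which mirrors the point made above about source code.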
Traditionally, users have run dataflows using one of two approaches. The simpler one is completely manual: a user starts up the program for each step by hand, and reformats and transfers the data between steps either by hand or by using simple scripts or programs. The only real advantage of the manual method is that it is relatively easy to use and can cater to a wide variety of situations. For example, one step may require visiting a web site, filling out a form, clicking a button or two on the screen, and cutting/pasting data from the output screen into a file or the input screen for the next step. Another step may require logging into a remote machine, transferring some files, running a command-line program on the remote machine, and then transferring the result files back to the user's machine. None of this is automated, but at least the procedures are straightforward enough that most users can perform them.
The manual approach has many drawbacks, of course. Dataflow execution is very time-consuming and error-prone, and the user must pay constant attention to ensure correct results. The traditional alternative has been to implement “automated” dataflows by writing complex scripts using a standard scripting language. Once the script is written, a user can execute the dataflow from a command line by running the script and providing whatever specific parameters and files may be required. Shell scripting languages in various operating systems are widely used for dataflow development. The Perl scripting language is a common choice for OS-independent scripts, but there are a number of others such as Python, Jython, and even Java that are more modern and may well be better choices depending on the types of data manipulations required between the computer-intensive steps of the dataflow.
Regardless of the choice of scripting language, however, script creation is effectively the same as programming. The developer uses an editor to create a dataflow script that invokes each of the computer-intensive steps as independent programs. The invocations often take place on the machine where the script executes (leading to a sequential computation), but scripting languages may make it possible (though not necessarily easy) to invoke the programs on remote machines. In between these program invocations, the developer inserts whatever code is required to handle errors that might arise, perform the data manipulations required to convert the data from the output format of one step to the input format of the next, and move data around among multiple machines (if different steps run on different machines). The data operations themselves may be coded in the scripting language, or they may be implemented by invoking a separate program to filter the data in some way, but they are rarely designed to be reused in other dataflows. Correct dataflow operation is entirely the responsibility of the developer, and it is unusual to encounter dataflow scripts that operate correctly outside the few environments that happened to be important to the developer when the dataflow was created.
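A scripted dataflow of the kind described above might look like the following minimal Python sketch. Everything in it is an assumption made for illustration: `STEP1` and `STEP2` are hypothetical one-line stand-ins for real computer-intensive programs, and the CSV and space-separated formats are invented. It shows the three responsibilities that fall on the script author: invoking each step as an independent program, handling errors, and converting data between steps.

```python
import subprocess
import sys

# Hypothetical stand-ins for two independent programs. A real script would
# invoke installed tools here; these one-liners only keep the sketch runnable.
STEP1 = [sys.executable, "-c", "print('id,score'); print('a,3'); print('b,7')"]
STEP2 = [sys.executable, "-c",
         "import sys; print(sum(int(l.split()[1]) for l in sys.stdin))"]


def run_step(cmd, stdin_text=""):
    """Invoke one step as an independent program and return its stdout."""
    result = subprocess.run(cmd, input=stdin_text, capture_output=True, text=True)
    if result.returncode != 0:
        # Error handling is entirely the script author's responsibility.
        raise RuntimeError(f"step failed: {result.stderr.strip()}")
    return result.stdout


def csv_to_whitespace(text):
    """Convert step 1's CSV output (with header) to step 2's space-separated input."""
    rows = text.strip().splitlines()[1:]  # drop the header line
    return "\n".join(row.replace(",", " ") for row in rows) + "\n"


def run_dataflow():
    """Run both steps sequentially on the local machine."""
    raw = run_step(STEP1)
    converted = csv_to_whitespace(raw)
    return run_step(STEP2, converted).strip()


if __name__ == "__main__":
    print(run_dataflow())
```

Even in this small sketch, the conversion function is tied to the exact formats of these two steps, which suggests why such data operations are rarely reusable in other dataflows.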