The present invention relates to object-oriented, parallel computer languages, both script and visual, together with compiler construction, for writing programs to be executed on fully parallel (or multi-processor) architectures, virtually parallel systems, and single-processor multitasking computer systems. The invention also relates to the architecture and synchronization of multi-processor hardware.
Fundamentally, two drastically opposing methods were known to construct computer architectures and programs: control-flow and data-flow. In the control-flow method, programs have the shape of a series of instructions to be executed in strict sequence; in the data-flow method, execution occurs when the set of data needed for processing becomes available.
Control-flow is the method widely used by mainstream computing, while the data-flow method has been unable to make its way into mainstream computing; its application is currently limited to rare custom-built hardware and, sometimes, to serving as the top-level conceptual model for some multi-user and real-time software.
The data-flow method is naturally concurrent, but even with this appeal it was unable to overcome its other severe problems, chief among them the fact that data-flow as a low-level software design method does not translate well into use with common computer algorithms. Numerous data-flow architectures were researched, and they were excellent in providing high-speed computing in a number of specific applications. However, even the very limited applications of data-flow computing produced numerous problems with the concept.
Parallelism in data-flow relies on splitting the problem to be solved (input token in data-flow terminology) into many sub-tokens traveling parallel paths. The results of computations performed on sub-tokens traveling parallel paths would then have to gradually merge back together to produce the final result. A key problem with data-flow architecture concerned allowing unrestricted feeding of tokens (input data) into data-flow system and letting them take the shortest rather than predefined path (so called “dynamic data-flow”). Such unrestricted data-flow processing would result in mismatching result tokens arriving at destination. To solve this problem, numerous methods of tagging and instance numbering of dynamic data-flow tokens were proposed.
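The token-tagging idea described above can be illustrated with a minimal sketch. It is written in Python purely for illustration and does not reproduce any particular prior-art tagging scheme; the class name TaggedAddNode, the port names "left" and "right", and the integer tags are all hypothetical. The sketch shows a two-input node that fires only when both operands carrying the same tag have arrived, so that tokens arriving out of order are not mismatched.

```python
from collections import defaultdict


class TaggedAddNode:
    """Hypothetical two-input node: fires only when both operands
    carrying the same tag (instance number) have arrived."""

    def __init__(self):
        # tag -> {port name: value} for operands still waiting for a partner
        self.waiting = defaultdict(dict)

    def receive(self, tag, port, value):
        slots = self.waiting[tag]
        slots[port] = value
        if len(slots) == 2:
            # Both operands of this instance are present: fire the node.
            del self.waiting[tag]
            return tag, slots["left"] + slots["right"]
        return None


node = TaggedAddNode()
results = []
# Tokens of instance 0 and instance 1 arrive interleaved and out of order.
arrivals = [(0, "left", 1), (1, "left", 10), (1, "right", 20), (0, "right", 2)]
for tag, port, value in arrivals:
    fired = node.receive(tag, port, value)
    if fired is not None:
        results.append(fired)
```

Without the tag, the operand 10 of instance 1 could be paired with the operand 2 of instance 0; with the tag, each result is computed only from operands belonging to the same instance.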
Key problem with concurrency in control-flow environment centers around simultaneously executing processes sharing the same data and with separate control sequences cooperating with one another. Rudimentary non-structural synchronization tools (critical sections, semaphores, signals) for dealing with these issues have been known for a very long time. Many mass produced processors are equipped with special instructions that allow exclusive/safe access to memory shared by two or more processors. These instructions (called interlocked memory access instructions) allow easy implementation of the rudimentary synchronization tools and are used by all operating systems that support multi-processor use.
Programs using the rudimentary tools are, however, fairly hard to construct and are prone to conceptual problems that are hard to see and correct and that lead to deadlocks. A deadlock is a situation where two or more separate processes all hang (are forever suspended) waiting for resources reserved by other suspended processes.
For these reasons, many structural, object-oriented methods of concurrent process synchronization have been proposed and implemented. For example:
“Monitors” implemented in Concurrent Pascal define sections of program as the only elements to be accessed by more than one process.
“Rendezvous sections” implemented in Ada provide instructions that allow two separate processes meet at points specified in both of them.
“Named channels” of Occam and similar messaging methods (Concurrent Object-Oriented C) provide special constructs through which to send and receive data between processes running in parallel.
“Shared variables” of QPC++ allow exchanging inter-process information through special data bound to semaphores.
“Separate” designation for routines and data of SCOOP/Eiffel allows specifying routines to be executed as a separate process and data to be used exclusively by one process. The method seems very appealing but fails to address many of the problems. A further mechanism, the “require” block within a separate procedure, allows specifying conditions that must be met for the separate procedure to execute, as a multi-tasking extension of the “design by contract” concept.
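As an illustration of the first of the structural methods listed above, the following minimal sketch emulates the monitor idea in Python rather than in Concurrent Pascal. The class name CounterMonitor and the helper worker are hypothetical. The shared data is private to the object, and every access goes through methods that hold a single lock, so the methods are the only program sections that more than one thread may enter.

```python
import threading


class CounterMonitor:
    """Hypothetical monitor-style object: shared data is reachable
    only through methods that hold the object's lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self._value = 0

    def increment(self):
        with self._lock:          # only one thread at a time runs this section
            self._value += 1

    def value(self):
        with self._lock:
            return self._value


def worker(monitor, count):
    for _ in range(count):
        monitor.increment()


monitor = CounterMonitor()
threads = [threading.Thread(target=worker, args=(monitor, 1000)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
total = monitor.value()
```

The sketch addresses only mutual exclusion around one object, which is consistent with the observation that such constructs serve a limited subset of concurrent programming needs.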
None of the above methods has been widely accepted in mainstream computing, and concurrent programming is still an exception rather than a rule. In spite of the tremendous need for parallel programming support, the most popular languages offer either only rudimentary, non-object-oriented parallel programming support or none at all. In particular, in spite of numerous attempts, the C++ standard committees have failed to agree on universal support for parallel programming in C++. None of the proposed methods was able to get enough support to be accepted as the basis for a standard of parallel programming in C++.
In such a situation, programmers were often forced to find their own ways to implement some parallel programming in C++ and other widely used languages. Many innovations were made in order to do some parallel programming using C++. For example, U.S. Pat. No. 5,999,987 and corresponding European Patent EP0667575 propose a way in which limited parallel programming is implemented without language extensions by using special “placeholder” constructs to allow initiation of asynchronous/parallel/distributed tasks and continuation of processing. The placeholder construct later allows retrieval of the results of the asynchronous call and re-synchronization of the calling process with it.
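The general placeholder pattern can be sketched as follows. This is not the construct of the cited patents; it is an analogous illustration using the future objects of the Python standard library, and the function slow_square and the variable names are hypothetical. The caller starts an asynchronous task, receives a placeholder, continues its own processing, and later re-synchronizes by retrieving the result through the placeholder.

```python
from concurrent.futures import ThreadPoolExecutor


def slow_square(x):
    # Stand-in for a task that is worth running asynchronously.
    return x * x


with ThreadPoolExecutor(max_workers=2) as pool:
    # Initiate the asynchronous task; the returned future is the placeholder.
    placeholder = pool.submit(slow_square, 7)
    # The calling process continues with its own work in the meantime.
    local = sum(range(10))
    # Retrieve the result; this blocks until the asynchronous task is done.
    remote = placeholder.result()

answer = local + remote
```

The pattern covers the case of one caller waiting for one result; it does not by itself address data shared between several tasks running at the same time.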
All these prior art methods either elegantly solve only one small subset of concurrent programming needs or propose a concept that is very costly to implement in practice. Monitors, Rendezvous Sections, Named Channels, and Separate Objects all appear to be incomplete solutions, serving only a few needs. SCOOP/Eiffel Require blocks, on the other hand, while conceptually appealing, are costly/impractical to implement because they specify an expression which must be met for a concurrent procedure to begin executing. This requires some method to be able to reevaluate the expression each time the source conditions might have changed to merit starting the execution of object containing the “require” block.
Purely control-flow programs result in cumbersome behavior and many unacceptable features. They cannot naturally share resources, cannot easily cooperate with each other. Due to these problems and pervasive lack of universal multi-thread and multi-process parallel programming method, various workarounds were designed to eliminate some of the bad characteristics of the control-flow programming model. These included message-based operating system interface and some visual programming.
The method of parallel computation that is still most commonly used in practice is pipelining. Pipelining is a decades-old method that originated in the days of mainframe computers and is about connecting a number of otherwise independent executable programs or tasks through a special type of data buffer (referred to as a pipe) written on one end and read from the other. The entire communication of a task with other tasks is through the pipes. A pipelining task runs a loop in which it reads the input data from its input pipe (or pipes), processes it, and outputs the results to its output pipe (or pipes). In spite of this method being so old, it is still widely used and worked on. For example, U.S. Pat. No. 6,199,093 (Yokoya) shows a processor allocating method for a multi-processor computer dedicated to the pipelining model of computing. In that invention, the compiler produces a fixed number of tasks and generates a table showing the amount of data sent between every combination of two tasks. That table is then used at program load time to assign each task to a processing node. Because no means of communication is supported other than data sent by one task and received by another, only the pure pipelining model is supported. The problem is that tasks need not be restricted in this way: other means of communication exist, such as memory sharing, messaging, and fast networking that implements memory sharing.
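The pipelining task loop described above can be sketched as follows. This is a minimal illustration and not the method of the cited patent; bounded queues from the Python standard library stand in for the pipes, and the names stage, DONE, source, middle, and sink are hypothetical. Each task reads from its input pipe, processes the item, and writes to its output pipe, with no other communication between tasks.

```python
import queue
import threading

DONE = object()  # hypothetical end-of-stream marker passed down the pipes


def stage(func, inbox, outbox):
    """Pipelining task: read the input pipe, process, write the output pipe."""
    while True:
        item = inbox.get()
        if item is DONE:
            outbox.put(DONE)   # forward the marker so the next task also stops
            return
        outbox.put(func(item))


# Pipes are bounded buffers; a full pipe stalls the task that writes to it.
source = queue.Queue(maxsize=4)
middle = queue.Queue(maxsize=4)
sink = queue.Queue()

tasks = [
    threading.Thread(target=stage, args=(lambda x: x + 1, source, middle)),
    threading.Thread(target=stage, args=(lambda x: x * 2, middle, sink)),
]
for t in tasks:
    t.start()

for value in [1, 2, 3]:
    source.put(value)
source.put(DONE)

for t in tasks:
    t.join()

outputs = []
while True:
    item = sink.get()
    if item is DONE:
        break
    outputs.append(item)
```

Each task can be tested in isolation by feeding its input queue and reading its output queue, but neither task can learn anything about the other except through the pipe that connects them.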
The main advantage of pipelining is the simple layout of a processing task that relies on pipes as only means of communication. This allows testing each task independently by sending test data to input pipes and reading the output pipes. As no other interactions happen but those through the pipes, this is a complete, accurate test of entire functioning of a task. This however does come at a price of having to decompose parallel processing into tasks that by definition do not know anything but what they receive through the pipes. This is not always sufficient for some processing problems. The other problem with pipelining is necessity to allocate pipes for every needed connection, each of them being able to store a number of items of data for pipelining to work. This forces the processing nodes to either be quite large, thus needing few pipes, or results in a large number of tasks connected with large number of pipes that now have to somehow be managed when they reach their capacity or when they stall other tasks when they are not large enough.
An additional advantage of pipelining stems from the fact that it can serve as a high-level design model. The idea is to have some method to create black boxes for running some processing on data (each of them with some input and output pipes) and then have them connected together with a Graphical User Interface (GUI) into a parallel processing model. For example, U.S. Pat. No. 5,999,729 (Tabloski, Jr. et al.) is about having C++ programmers produce some “predefined components”, each with some input/output ports working with data buffers (pipes), and then having non-programmers connect them together within a graphical interface program. Some of those components allow splitting data into several paths, each path then running some processing tasks on data. The outputs are then combined in components dedicated to putting data together into a single output. This model works very well when a single item of data is all that is needed at every processing node to produce the output result. That gives pure pipelining, which in the case of Tabloski means using a “splitter component” to feed data to several paths of a “serial component” replicated with a “replicator component”. When that rule does not hold, and the tasks connected to the splitter have to know each other's data or have to communicate anything else with each other to process data, all that the development system can be used for is connecting the splitter component to a “parallel component” that is connected to a custom-written “parallel module”. That custom parallel module works as several tasks that communicate with each other in a platform-specific and application-specific way in order to be able to produce results. The fact that the pipelining model requires writing custom hard-coded modules for all non-trivial processing clearly shows serious limitations of that model of parallel processing.
The present invention is about completely abandoning pipelining as the underlying method of parallel communication and of parallel processing. Pipelining will however be discussed some more in the detailed description of present invention to point out advantages that the present invention offers over pipelining.
A more demanding application will show the problems inherent in pipelining even better. Consider the problem of having to perform 3D shading on a large 3D model in parallel, as described in U.S. Pat. No. 7,233,331 (Kato). Shading calculations are about simulating light rays bouncing between a large number of polygons, each with different texture, color, light scattering qualities, etc. Calculating something like that in parallel requires following light pathways that split and merge and that can bounce off the same polygons at the same time. There is nothing that could be connected with pipelines in such a model, as a specific light ray calculation means a specific path followed once. Connecting polygons with pipelines to cover all possible connections would make no sense whatsoever due to the almost infinite number of combinations that would result. What is done by Kato instead is modeling every needed light pathway out of dynamically created objects (parallel tasks to run), with a specific object being calculated when it has received all the inputs that it needs, its “data-slots” being filled by means of message passing. At the core, this is not that different from pipelining. Rather than having pipes that store multiple items of data and each connect two static objects, a custom graph is created for each needed light path, where each connection is fixed to be used exactly once, making the data-slots pipes that pass only one element. What are the problems with that method? Having to create separate dynamically allocated objects for every calculation makes such a method of parallelism violate the basic rule of Object-Oriented Programming. Ideally, every polygon should simply implement some function performing the calculations of light bouncing off that polygon. The issue is not about following some dogma of Object-Oriented Programming, but about clarity of code and especially about performance.
Having to replicate parts of the 3D mesh as graph corresponding to light bouncing off polygons and to dynamically allocate memory for each step of the light pathway has got to be a massive bottleneck of the method, and is clearly indicated as such by Kato.
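The data-slot mechanism discussed above can be sketched in a few lines. This is a simplified illustration and not the implementation of the cited patent; the class name SlotTask, the slot names, and the multiplication used as the calculation are hypothetical. One such object has to be allocated for every step of every light pathway, and it is calculated only when the last of its slots has been filled by a message.

```python
class SlotTask:
    """Hypothetical dynamically created task: it is calculated once,
    when every one of its data-slots has been filled by a message."""

    def __init__(self, slot_names, compute):
        self.slots = {name: None for name in slot_names}
        self.filled = set()
        self.compute = compute
        self.result = None

    def send(self, name, value):
        if name not in self.slots:
            raise KeyError(name)
        self.slots[name] = value
        self.filled.add(name)
        if len(self.filled) == len(self.slots):
            # All inputs are present: run the calculation for this step.
            self.result = self.compute(**self.slots)
        return self.result


# One freshly allocated object per bounce of one light ray.
bounce = SlotTask(
    ["incoming", "reflectance"],
    lambda incoming, reflectance: incoming * reflectance,
)
first = bounce.send("incoming", 0.5)       # one slot still empty, nothing runs
second = bounce.send("reflectance", 0.5)   # last slot filled, calculation runs
```

The allocation of a new object for each step, visible in the sketch as the construction of bounce, is the cost that the preceding paragraphs identify as the bottleneck of this approach.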
The key goal of the present invention was to come up with a different parallel processing model, one that could utilize Object-Oriented Programming at its core, thus allowing, for example, a member function of a polygon object of a 3D mesh to calculate in parallel the light bouncing off itself, to handle multiple requests at the same time, and to split and merge such requests, all as part of the definition of the polygon object. Key to understanding the present invention is realizing that pipelining and message-passing dataflow have a certain important thing in common. Both of these methods force the creation of tasks, nodes, or objects that will perform parallel operations long before the operations are performed. This is forced by the need to know some sort of address or other identification of these objects in order to connect them with pipes or with message passing. Eliminating that need, and being able to create parallel data paths right when they are needed, opens the road to unprecedented ease of parallel programming.
In order to simplify software development, a lot of common work was shifted to supervisory programs, the operating systems. A key part of an operating system is its “kernel”, the central program that manages the key hardware resources and lets other programs use them through a common interface. At first, operating systems simply provided means for several programs to share the resources, including the processor. This meant being able to run many programs at once by constantly switching processor ownership: executing a portion of one program and then switching to the next one. Later, a messaging system was added to handle certain functions, especially the user interface in a “windowed” environment. Functions were somewhat reversed. Rather than individual programs calling the operating system to provide the user interface, the operating system would call the user programs with messages to process. This scheme solved some problems inherent in control-flow programming by a method that bears resemblance to some data-flow concepts. At least the user interface was now driven by events and new data rather than by looping while waiting for new data. These messaging features allowed a fairly good appearance of multi-tasking and data-flow. Multiple program elements such as windows could be serviced virtually simultaneously, individual programs would not waste the processor while looping for input, etc.
Messaging methods provided very good emulation of parallel processing for the most popular computer uses. They also allowed running many independent programs simultaneously that would share all the available hardware. However, this was by no means the true low-level parallelism sought after. Individual programs would most often still be single threads processing messages received through a single entry point. If actual, true parallelism/multi-tasking was desired for better performance, additional “threads” would have to be created by hand, and the rudimentary synchronization tools would again be used to allow safe sharing of data.
To simplify the software development process, numerous visual programming tools have been proposed and developed. The “flow-charting” methods, which simply represent regular script-type instructions through graphics, did not really offer any practical advantages over script programming. More advanced visual programming tools based on some of the dataflow concepts have found much wider application, particularly in instrumentation markets. Prior to the appearance of such tools, the users of computer-based instrumentation were forced to convert essentially parallel, data-flow type concepts (such as connecting sources of voltages to displays, or switches to control lights) into control-flow code that is extremely unnatural in this case.
Two kinds of such partially dataflow-based instrumentation programming tools have been developed. Some of them (like SoftWIRE) allow the user to compose their applications out of “controls”—rudimentary functional building blocks where each block's action is triggered explicitly. Asserting a control's “control-in” input triggers a control's action. Once a control has finished its processing, it triggers its “control-out” output which can be connected to the next control's “control-in” to continue such explicitly designed data-flow.
National Instruments' LabView “virtual instruments” is another such tool and is the subject of several patents. The working model here is somewhat closer to the commonly understood data-flow concept, as processing happens when a complete set of new data is available on the inputs of a node.
By emulating data-flow interface, these concepts and systems do offer the user some degree of multi-tasking or actually good appearance of it. Success of these systems shows tremendous need for parallel, non control-flow programming tools.
Internally, the emulation of data-flow in these systems is pretty straightforward. As the data gets updated in various parts of the user-designed program graph, this triggers new graph nodes to be updated, often in very remote locations. The update requests get queued and executed sequentially, but for most of these systems' applications this passes as good enough parallelism. This method is very similar to the messaging system used by operating systems for user interface.
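The internal emulation described above can be sketched as follows. This is a minimal illustration of the general technique and not the implementation of any of the named products; the graph, the node names, and the functions set_value and run are hypothetical. Updating a value queues the dependent nodes, and a single loop executes the queued update requests one after another.

```python
from collections import deque

# Hypothetical user-designed program graph.
values = {"a": 1, "b": 2, "sum": None, "double": None}
rules = {
    "sum": (("a", "b"), lambda a, b: a + b),
    "double": (("sum",), lambda s: s * 2),
}
dependents = {"a": ["sum"], "b": ["sum"], "sum": ["double"]}

pending = deque()   # the central queue of update requests
trace = []          # order in which nodes were actually executed


def set_value(name, value):
    values[name] = value
    # New data triggers update requests for every node that depends on it.
    pending.extend(dependents.get(name, []))


def run():
    # A single thread executes the queued requests strictly one at a time.
    while pending:
        node = pending.popleft()
        inputs, func = rules[node]
        trace.append(node)
        set_value(node, func(*(values[i] for i in inputs)))


set_value("a", 5)
run()
```

The nodes appear to react to new data on their own, yet every update passes through the single queue and the single loop, which is why such emulation only passes for parallelism.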
Originally, the entire data-flow emulator (which could be considered the centralized operating system in this case) would run as a single thread, which by nature eliminated all the synchronization/data sharing headaches of true parallelism. As the systems became more popular and performance demands harsher, the emulator was split into several threads handling tasks/update requests grouped by their nature (for example, user interface, instrument I/O, standard code). Later, to further meet growing performance needs, user-controlled multi-threading and synchronous multi-processing support was added. This has opened the old can of worms of the users, once again, having to create a few threads by hand and code the crude rudimentary synchronization tools (critical sections/semaphores) to avoid race conditions and corruption of data shared by several threads.
Necessity of the user having to assign work to be performed by separate threads and need to use the rudimentary synchronization tools substantially negate the true data-flow concept and all its advantages. However, the limitation of such near data-flow visual programming was not so much the visual programming concept itself (which is fairly universal), but the way it was implemented internally through control-flow, non-parallel code. A single visually-designed program could not naturally run on more than one processor and multi-processor use would result in need of explicit rudimentary control tools. Once again, lack of low-level, universal, multi-tasking at the core, quintessentially multi-processor programming method was the chief culprit here.
Prior-art visual programming tools created mainly for instrumentation market (LabView, Softwire) must be addressed here in more detail because they tend to make a very unfortunate claim that by merely being able to create parallel-wire like diagrams, full possible parallelism or data-flow processing can be described and achieved. If this claim were to be true even remotely, it would make the present invention completely unnecessary. However, this claim is either completely false or grossly imprecise which can be seen by studying actual details of implementation of these systems. First of all, the centralized supervisory software that queues and executes fired nodes that is used by these systems prevents this technique from being a universal programming method to construct say, operating systems, data bases, or device drivers. Second, contrary to often-repeated “hassle-free parallelism” claims made by these systems, the parallelism achieved there is not by any means an actual parallelism that is seen in, for example, data-flow computer where nodes are actual separate pieces of electronic hardware. Most of the time, the parallelism offered there is an illusion achieved by complex centralized supervisory software sequentially executing nodes fired at distant parts of the program graph. This is good enough for the specific application in instrumentation market but is by no means the actual parallelism sought by universal prior-art programming tools. Some two-processor parallelism was achieved there at great effort, by expansions of the centralized supervisory software, but even then the parallelism offered is not able to happen in most cases without the user modifying his graphically designed software. Third—existence of any centralized queue or supervisory software prevents full auto-scalable parallel execution on many processors from being possible.
The above points can clearly be seen in application notes describing methods to accomplish (some) multi-tasking in, for example, the prior-art LabVIEW™ system. National Instruments Application Note 114: “Using LabVIEW™ to Create Multithreaded VIs for Maximum Performance and Reliability” describes steps that are necessary to accomplish limited parallel performance with this prior-art system. To begin with, the application note concerns itself with creating two or more “virtual instruments” to be made to run in parallel. This already goes against the stated goals of actual parallel programming, where the entire code would naturally be parallel with many pieces executing in parallel, and where breaking it into several logical parts would not improve performance. On page 5, the application note describes various central “execution systems” that make execution of various elements seem parallel, and the ways to properly direct execution of a specific instrument to a proper execution system. On pages 10 through 12, it describes steps that need to be taken to prevent “race conditions” from corrupting data. The methods offered include global variables that are only changed in one place, “Functional Global Variables,” and semaphores. This brings the already discussed specter of hard-to-use, non-object-oriented “rudimentary synchronization” tools back into the fold, which further shows that this prior-art system is by no means the parallel programming tool sought after. In fact, by most definitions such prior-art systems should not be considered parallel programming tools at all, any more than, say, the standard C or C++ language could be considered as such. Just as manually coded limited parallelism is possible in C and C++ at extra effort and by using the rudimentary synchronization tools, very similar limited parallelism can be achieved in these prior-art instrumentation market tools.
Another National Instruments Application Note, 199: “LabVIEW™ and Hyper-Threading”, shows a “Primes Parallelism Example” on page 2. Stating that dataflow order forces mandatory waits for every input in a loop, a claim is made that the only way to make “dataflow” code able to execute on more than one processor is to split it into two loops, “odd” and “even”, as shown on the modified diagram on page 3. This claim is either patently false or at least very imprecise, since it uses a fairly standard “data-flow” term to mean something that has very little to do with data-flow as defined by computer science literature. Even if we assume that it was meant that LabVIEW™ implements a “static data-flow” machine where a single node cannot be fired again until it has processed the previous firing, the claim still does not make much sense. In any data-flow machine as understood by the computer science literature that coined the term, the various nodes of the data-flow machine work simultaneously. This means that if we have a data-flow graph consisting of consecutive parts A and B, as soon as A finishes work on input dataset 0, it should pass it to B and be able to start processing input dataset 1. A system that does not do that probably should not be considered a data-flow system capable of parallelism. Forcing the user to split the problem into odd and even loops to take advantage of two processors clearly shows that the LabVIEW™ prior-art system does not even begin to deal with the issues addressed by the present invention, shows conceptual limitations of the centralized supervisory node-queuing execution system used there, and proves the tremendous need for the methods of the present invention. One of the goals of the present invention was to provide universal low-level tools to allow, among other things, replicating static and dynamic data-flow algorithms executing in parallel on non-data-flow hardware.
In spite of tremendous need for it, parallel programming remains a black art which is only used where absolutely necessary. True multi-processor parallel programming is only used for very specific, chosen time-consuming applications running on very costly and relatively rare hardware.
Most computers used in mainstream computing still have one processor executing user programs. Multi-processor server/workstation type computers are available, but their application mostly relies on several separate processes sharing two processors instead of one. Rare applications that take advantage of two or more processors at once do so only for very specific time-consuming tasks, and code for this is almost always written using the non-structural rudimentary control tools or fundamentally non-object-oriented messaging systems.
The problem with the limited use of parallel architectures is not with electronics. There is absolutely no obstacle, from the standpoint of the electronics art, to building, for example, a computer where a small processor would accompany each small chunk of RAM. The problem is that we simply still do not have a universal-purpose methodology for describing desired parallelism in general and for programming such architectures with a plurality of processors in particular.
To make computing faster, a tremendous effort is made to make series of instructions of software conceptually written for a single processor somehow run in parallel. Modern processors try to pre-fetch data and code, guess forward, and cache data, all in order to partially parallelize software written as non-parallel. This results in extremely complex circuitry using a lot of energy and dissipating a lot of heat, which is the direct result of most data having to go through the “narrow throat” of a single processor and a single high-speed bus connecting the processor with memory.
Multi-processor architecture, if it could easily be programmed in natural, self-scaling fashion, would solve all these problems. It would be cheaper, consume far less energy, and there would be no physical limits on performance as processors could be added the same way the users today expand amount of RAM in their computers. Simply observing nature proves beyond any doubt that we are only beginning to understand parallel information processing. Our huge, kilowatts of energy wasting supercomputers still cannot replicate image recognition, processing, and storing capabilities of a tiny honey bee, for example.