This invention relates to executing applications having dynamically varying resource needs, and more particularly to executing the applications using integrated task and data parallelism.
Task parallelism and data parallelism are distinct programming models for describing parallel application software programs.
In the prior art, a task parallel application is typically composed of a set of cooperating processes ("tasks") that are implemented in a framework such as POSIX threads. In a task parallel application, the programmer explicitly defines the communication and synchronization functions between threads in the application. The application relies on the run-time system to schedule the execution of the threads on available processor resources, and to perform load-balancing over the resources.
In contrast, a prior art data parallel application is a single process that operates on distributed data. In a data parallel application, a compiler is usually responsible for generating efficient distributed code where communication overhead is minimized.
There is an emerging class of real-time interactive applications that require a dynamic integration of both task and data parallelism for effective implementation. One such application is described in U.S. patent application Ser. No. 08/844,444 "Method and Apparatus for Visual Sensing of Humans for Active Public Interface" filed on Apr. 8, 1997 by Waters et al., incorporated herein by reference.
There, an interactive, computerized kiosk is described that provides public access to information and entertainment. The kiosk supports natural, human-centered interaction with customers. A camera is used to sense the presence of one or more customers in front of the kiosk. The kiosk provides visual and audio feedback as long as customers are "sensed" in the kiosk's visible environment.
The location and number of customers control the "behavior" of a graphical talking "head" displayed on a monitor. The orientation of the talking head depends on the location of the customer in the kiosk area. If there is more than one customer, then the talking head will divide its attention between the customers, much like a group interaction. For example, while talking to one customer, the eyes of the talking head may momentarily shift to others to make them feel part of the kiosk interaction.
The software application program that operates the kiosk has features that are typical of an emerging class of future scalable applications. The application is both reactive and interactive. For example, the kiosk (application) responds to changes in its environment. As new customers arrive, the kiosk will change its mode of interacting.
The application is computationally demanding due to the need for real-time vision, speech, and graphics processing. The application is also highly scalable. At the task level, i.e., processing threads, the application supports a variable number of customers and functions. At the data level, multiple video and audio data streams may need to be processed.
At the hardware level, the kiosk application executes on a cluster of symmetric multi-processors (SMPs). SMPs provide a compelling platform for advanced applications such as the kiosk system. Systems that use an SMP-like architecture are economically attractive. Unfortunately, the flexibility provided by SMP clustering comes at the cost of a hierarchical communication model with multiple levels of locality. Conventional parallel programming models fail to handle one or more of these levels gracefully, making it difficult to program processor clusters effectively.
Applications such as the interactive kiosk exhibit both task and data parallelism. This is illustrated in FIG. 1 which shows a task graph 100 for a basic vision application 100 within the total set of kiosk applications. The vision application tracks multiple customers in the kiosk environment according to, for example, the color of their clothing. In FIG. 1, two basic constructs are used: the nodes represent tasks, or execution threads, and the edges or "pipes" connecting the tasks are data flows.
A camera 101 is connected to a digitizer task (D) 110. The camera 101 continuously monitors a scene in front of the kiosk. The digitizer task 110 produces a sequence of frames 111 at a predetermined rate. Each frame is composed of a plurality of picture element (pixel) values. A histogram task (H) 120 analyzes the frames to determine a predominant color of the clothing worn by customers standing in front of the kiosk. The histogram task 120 or "color tracker" produces color models 121. Concurrently, motion masks 131 are produced by a change detector task (CD) 130 that also analyzes the frames 111. The color models 121 and motion masks 131 are used by a target detector task (TD) 140 to track individuals in the scene.
Task parallelism is most obvious in the histogram task 120 and the change detection task 130, which have no data dependencies on each other. That is, these two tasks can operate on their own copies of the same frames 111 at the same time. Task parallelism is also present in the form of pipelining. For example, the digitizing task 110 and the target detection task 140 can be performed simultaneously on different frames in the sequence.
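The independence of the histogram and change detection tasks can be illustrated with a minimal sketch. The function names, the list-of-rows frame representation, and the use of a Python thread pool are assumptions made for illustration only; they are not taken from the described system.

```python
from concurrent.futures import ThreadPoolExecutor

def histogram(frame):
    """Hypothetical stand-in for task H: count occurrences of each pixel value."""
    counts = {}
    for row in frame:
        for pixel in row:
            counts[pixel] = counts.get(pixel, 0) + 1
    return counts

def change_detect(frame, previous):
    """Hypothetical stand-in for task CD: mark pixels that differ from the previous frame."""
    return [[int(a != b) for a, b in zip(row, prev_row)]
            for row, prev_row in zip(frame, previous)]

def process_frame(frame, previous):
    # H and CD have no data dependency on each other, so both can be
    # submitted at once; their results are collected afterwards.
    with ThreadPoolExecutor(max_workers=2) as pool:
        h = pool.submit(histogram, frame)
        cd = pool.submit(change_detect, frame, previous)
        return h.result(), cd.result()

previous = [[0, 0], [1, 1]]
frame = [[0, 2], [1, 2]]
color_model, motion_mask = process_frame(frame, previous)
```

The sketch shows only the structure of the task graph. In CPython, threads typically do not speed up CPU-bound pure-Python code, so a process pool or native code would be needed for an actual gain.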
Data parallelism is present in the target detection task 140 where multiple color targets (customers) can be detected in parallel. Potentially it should also be possible to exploit data parallelism in the change detection and the histogram tasks. For example, a single frame could be partitioned into a plurality of regions, and the regions, such as quadrants, could be processed in parallel.
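The quadrant partitioning mentioned above can be sketched as follows. This is an illustrative example under stated assumptions: the frame is a list of rows, the per-region work is a simple pixel count, and the helper names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def split_into_quadrants(frame):
    """Partition a 2-D frame (list of rows) into four quadrants."""
    mid_r, mid_c = len(frame) // 2, len(frame[0]) // 2
    top, bottom = frame[:mid_r], frame[mid_r:]
    return [
        [row[:mid_c] for row in top], [row[mid_c:] for row in top],
        [row[:mid_c] for row in bottom], [row[mid_c:] for row in bottom],
    ]

def count_pixels(region, value):
    """Count how many pixels in a region equal the given value."""
    return sum(row.count(value) for row in region)

def count_in_parallel(frame, value):
    # Each quadrant is processed independently; the partial counts
    # are then combined into a result for the whole frame.
    regions = split_into_quadrants(frame)
    with ThreadPoolExecutor(max_workers=4) as pool:
        partial = list(pool.map(lambda r: count_pixels(r, value), regions))
    return sum(partial)

frame = [[1, 0, 1, 1],
         [0, 0, 1, 0],
         [1, 1, 0, 0],
         [0, 1, 0, 1]]
total = count_in_parallel(frame, 1)
```

A count is used here because its partial results combine by simple addition. Operations whose results depend on neighboring regions would need overlap or extra communication at the quadrant borders.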
Applications, such as the kiosk application, are not well-supported by either the task or the data parallel model alone because the kiosk application is made up of multiple distinct tasks which each provide opportunities for data parallel processing. As a result, maximum performance is not achieved under the task parallel model, and the application as a whole does not neatly fall into a traditional data parallel model.
Effective implementation of such real-time, interactive applications requires a hybrid parallel model that integrates both task and data parallelism within a single framework. Hybrid models that integrate task and data parallelism have been proposed in the prior art. Unfortunately, previous approaches require either a static problem domain, or a highly restricted application domain such as is found in numerical linear algebra.
One prior art system describes a framework for exploiting task parallelism in dynamic multi-media applications such as the color tracker. This work is described in U.S. patent application Ser. No. 08/909,405, "Space-Time Memory" filed by Ramachandran et al. on Aug. 11, 1997. That system was designed to match the dynamic data flow and heterogeneous task requirements of multi-media applications involving, for example, concurrent video and speech processing.
In that framework, tasks are implemented as threads, and the run-time system relies on the operating system to effectively schedule processor resources. That prior art task parallel system lacks any type of mechanism for incorporating data parallelism into its framework.
A number of prior art task parallel systems do include integrated task and data parallelism; the "Orca" and "FX" systems are two examples. However, Orca falls short in that its data parallelism is not only static, but also specified explicitly in the source programs of the application. The FX system is significantly more advanced. It automatically determines optimal mappings of tasks to processors in static domains where the flow of the computation does not vary as a function of the data, but remains fairly consistent over a variety of data sets.
Unfortunately, the parallelism exhibited by multi-media applications, like the color tracker above, is often highly dynamic, because the required processing is determined by the video content, for example, the number of customers in the scene at some point in time. As a result, such applications do not derive any benefits from compiler or profile driven analysis. Profiling is a technique wherein an executing application is measured in order to tune performance.
This critique extends to a large body of work involving the use of profile data to drive compilation and resource scheduling for parallel applications. All of these systems, of which FX provides an example, perform static task scheduling based on performance profiles. Application profiling is used to measure the performance of individual tasks, and the measured performance is used as input to static scheduling decisions for system resources.
Unfortunately, this body of work does not provide a means to dynamically adjust the scheduling policy as resources needed by the application change over time. Profiling systems typically support off-line operations such as compilation and resource scheduling prior to run-time, and fall short of the on-line adaptation required by a dynamically varying class of applications.
Integration of task and data parallelism in a dynamic setting has been addressed for scientific applications involving parallel matrix computations. In this domain, the task graph has a regular structure, and models are available for characterizing the efficiency of tasks in utilizing additional system resources.
Algorithms for on-line scheduling of dynamic task sets have been proposed in this context. Unfortunately, the computational model which describes scientific matrix computation does not apply to multi-media processing where the tasks are heterogeneous and involve processing a time-varying data stream as opposed to a static matrix.
On-line adaptation of resource scheduling policies for parallel applications has also been explored in other, more limited contexts such as page migration and replication in a CC-NUMA architecture. In this prior art work, measurements of cache or translation look-aside buffer (TLB) misses are used at run-time to make decisions about migrating or replicating cache pages within the context of a cache coherent shared memory architecture. Unfortunately, this type of work is of limited scope as it depends heavily on the properties of the CC-NUMA architecture. As a result, it falls short of providing a complete framework for addressing both task and data parallelism.
Another on-line adaptation scheme based on reinforcement learning has been proposed in the context of network packet routing. That scheme makes local decisions about packet routes according to a model of delivery times that is updated as the network traffic varies. Unfortunately, there is no obvious extension of that scheme to integrated task and data parallelism in a multi-processor computer system including a variety of different and competing resources. The complex behavior of a parallel computer system cannot be fully characterized by simple local interactions of system components.
Multi-media applications, such as the kiosk application described above, have two characteristics which differentiate them from more traditional parallel scientific applications. First, multi-media applications possess a high degree of dynamism within a heterogeneous task set. For example, the color tracker depends on the contents of the frames to determine which task procedures should be applied.
These procedures can differ significantly in their computational properties, at least when compared to scientific applications. Second, the kiosk application demands real-time processing of dynamic data streams. For example, the color tracker must process the frames nearly at the rate the frames are produced by the digitizer. This mix of requirements can lead to patterns of communication that are substantially different from scientific applications in which data sets, such as matrices, are available in their entirety at run-time.
There is a need for a framework for integrated task and data parallelism that can be tailored to the dynamic real-time needs of multi-media applications. This framework should be applicable to a broad range of heterogeneous tasks.
Because it is unlikely that an exact characterization of the computational properties of a diverse task set will be available prior to run-time, the framework should also provide a mechanism for adapting the scheduling policy according to system performance. What is desired is a satisfactory solution where a scheduling policy improves over time as the computational properties of the application become apparent.
There are two requirements for this framework. The first requirement arises from the fact that the relative computational requirements of the tasks vary over time. For example, the computational requirements of the target detection task vary with the number of customers perceived by the kiosk, whereas the computational requirements of the digitizer are fixed. This implies a requirement that the relative extent of data parallelism versus task parallelism must dynamically track changes in the application.
The second requirement is the need for adaptation. The system must adapt to hardware and software modifications, the latter including both application software and system software. Similarly, the performance of the system should improve over time given fixed hardware and software.
This type of framework is conceptually different from the highly analytic approaches to scheduling in narrow application domains which have characterized the prior art on integrated task and data parallelism.
The invention provides methods and means for integrating task and data parallelism for dynamic applications where the applications include one or more tasks or processing threads.
Parallelism is achieved by replacing a particular task which needs additional system resources with the following general component tasks: a splitter, workers, and a joiner. The splitter task partitions the input data stream to the particular task into a plurality of data chunks. The worker tasks process subsets of the data chunks; each worker task is an instance of the particular task. The joiner task combines the processed data chunks to produce the output data stream.
This type of task and data parallelism is useful in situations where the data chunks are continuously generated and time-varying in complexity such as a sequence of time-ordered video frames, and the complexity of the processing depends on the video content.
In one aspect of the invention, the chunks are placed in a work queue, and the control items are placed in a control queue. The control items indicate how the joiner can combine the processed chunks to reform the output data stream. In addition, each chunk is associated with a task and data parallel strategy that indicates methods to be applied to the chunks. The methods, for example, can be copies of the particular task, or models to be applied to the data by the worker tasks while processing the chunks.
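The splitter, worker, and joiner arrangement with a work queue and a control queue can be sketched as follows. This is a minimal illustration, not the claimed implementation. It assumes a single frame in flight, chunks formed from bands of rows, a control item consisting of a frame identifier and a chunk count, and a third queue (`done_q`) for processed chunks. The per-chunk strategy is omitted, and all names are hypothetical.

```python
import queue
import threading

def splitter(frame, n_chunks, work_q, control_q, frame_id):
    """Partition one frame (a list of rows) into row-band chunks."""
    step = -(-len(frame) // n_chunks)  # ceiling division
    chunks = [frame[i:i + step] for i in range(0, len(frame), step)]
    # Control item (assumed format): how many chunks make up this frame.
    control_q.put((frame_id, len(chunks)))
    for index, chunk in enumerate(chunks):
        work_q.put((frame_id, index, chunk))

def worker(task, work_q, done_q):
    """Apply the original task to chunks until a None sentinel arrives."""
    while True:
        item = work_q.get()
        if item is None:
            break
        frame_id, index, chunk = item
        done_q.put((frame_id, index, task(chunk)))

def joiner(control_q, done_q):
    """Reassemble the processed chunks of one frame in their original order."""
    _frame_id, expected = control_q.get()
    results = {}
    while len(results) < expected:
        _fid, index, value = done_q.get()
        results[index] = value
    return [results[i] for i in range(expected)]

def run(frame, task, n_workers=2, n_chunks=4):
    work_q, control_q, done_q = queue.Queue(), queue.Queue(), queue.Queue()
    threads = [threading.Thread(target=worker, args=(task, work_q, done_q))
               for _ in range(n_workers)]
    for t in threads:
        t.start()
    splitter(frame, n_chunks, work_q, control_q, frame_id=0)
    parts = joiner(control_q, done_q)
    for _ in threads:
        work_q.put(None)  # one sentinel per worker
    for t in threads:
        t.join()
    return parts

def row_sums(chunk):
    """Example task: sum the pixel values in each row of a chunk."""
    return [sum(row) for row in chunk]

frame = [[1, 2], [3, 4], [5, 6], [7, 8]]
parts = run(frame, row_sums)
merged = [s for part in parts for s in part]
```

Because workers may finish in any order, the joiner relies on the chunk index and the count from the control item, rather than on arrival order, to rebuild the output.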