This application relates to energy-efficient computation.
Embedded learning and classification applications are computationally intensive and process large amounts of real-time data while balancing stringent Quality of Service (QoS) constraints under tight power budgets. These include applications in the areas of transportation, healthcare, robotics, aerospace and defense. For instance, one in-vehicle system monitors real-time driving events, analyzes risks, and suggests improvements to driving habits. Such a system requires real-time responsiveness while processing and analyzing continuous streams of data. Cars also continuously monitor and analyze internal sensor data in order to predict failures and reduce recalls. Another example is face, object and action detection in surveillance and store cameras; stores and advertising agencies analyze shopper behavior to gauge interest in specific products and offerings.
Such learning and classification applications are computation- and data-intensive and are generally data parallel in nature. In data centers, such workloads can rely on clusters of high-performance servers and GPUs to meet stringent performance constraints and provide dynamic scalability. For example, GPU-based implementations of learning algorithms such as Convolutional Neural Networks (CNNs) and Support Vector Machines (SVMs) have been published that meet these high-performance requirements. However, in embedded situations such as automobiles and store cameras, a CPU+GPU server compute node is too power hungry. Another approach to designing low-power servers has been to consider embedded processors.
There are several runtime proposals to map applications to heterogeneous systems, but few consider energy. OpenCL, for instance, provides a common programming platform for both CPU and GPU but burdens the programmer with distributing the application across devices. IBM's OpenMP for Cell is similar, while Intel's Merge provides a manual mapping of the application. Yet other conventional systems provide runtime support for legacy applications, improving overall application performance by deferring intermediate data transfers before scheduling to a different coprocessor. Harmony schedules the application in an automated manner by predicting the performance of the application's kernels, but it does not split a single kernel among multiple accelerators and does not focus on energy. Other conventional systems take an adaptive (dynamic) approach to improve performance on a system with one GPU and assume a linear model to decide the optimal partitioning of an application for a server system consisting of a CPU and a GPU.