Embedded devices are often built with relatively inexpensive hardware (e.g., power supplies of limited capacity, slower components that require less energy, etc.). The use of inexpensive hardware allows a manufacturer to mass produce these embedded devices economically, but the savings come at the cost of degraded device performance. For instance, in a multiprocessor system, and in particular a low-level multiprocessor platform like the Raspberry Pi 2, which is used for embedded devices, the architecture can be asymmetrical in the sense that interrupts are delivered exclusively to a primary central processing unit (CPU), rather than load-balanced across the multiple available CPUs. This can lead to overutilization of the primary CPU and underutilization of the other CPU(s) in the multiprocessor system. Furthermore, many embedded devices utilize a relatively slow direct memory access (DMA) controller. Programmers often work around this constraint by designing embedded software to perform memory transfer operations on a CPU rather than on the slower DMA controller, because the CPU can complete the memory transfer operations faster. This can place an even greater workload on the primary CPU, because the primary CPU is fully occupied for the duration of a memory transfer operation and is unavailable to receive interrupts during that time.
Current operating systems can alleviate these problems by scheduling tasks on (or transferring tasks to) an alternative CPU in a multiprocessor system, which reduces some of the burden on a primary CPU. For example, a device driver and/or an operating system (OS) component can transfer a task between CPUs by creating the task with a “producer” thread that runs on a first CPU and executing the task with a “consumer” thread that runs on a second CPU. However, there is often a significant delay between the time when the producer thread signals the consumer thread to execute the task and the time when the consumer thread “wakes up” and is actually ready to execute the task. This delay can be caused by the inherent scheduling latency of waking up the consumer thread and preparing it to execute the task. For example, the consumer thread takes time to load a stack and multiple pages from system memory in order to prepare itself to execute a task. The time it takes to perform these “load” operations can cause discernible performance degradation. Many embedded devices, such as control devices (e.g., a robotic arm) that perform time-critical operations, could benefit from a higher-performance OS.
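By way of illustration only, the producer/consumer handoff described above, and the delay between the producer's signal and the consumer being ready to run, can be sketched as follows. This is a minimal user-space sketch and not the mechanism of any particular OS or driver: the function name `measure_handoff_latency` and the parameters `num_tasks` and `settle_s` are hypothetical, the threads are ordinary Python threads that are not pinned to particular CPUs, and the measured values depend entirely on the platform and its scheduler.

```python
import threading
import time
from collections import deque


def measure_handoff_latency(num_tasks=5, settle_s=0.01):
    """Hand tasks from a producer thread to a consumer thread.

    Returns one signal-to-wakeup delay (in nanoseconds) per task.
    All names here are illustrative assumptions, not a real OS API.
    """
    tasks = deque()                # tasks created by the producer
    cond = threading.Condition()   # guards the queue and wakes the consumer
    done = threading.Semaphore(0)  # consumer tells producer a task finished
    latencies_ns = []

    def consumer():
        for _ in range(num_tasks):
            with cond:
                # Sleep until the producer has queued a task.
                while not tasks:
                    cond.wait()
                task, signaled_at = tasks.popleft()
            # Delay between the producer's signal and the consumer running.
            latencies_ns.append(time.perf_counter_ns() - signaled_at)
            task()
            done.release()

    def producer():
        for _ in range(num_tasks):
            # Give the consumer time to go back to sleep, so that each
            # handoff includes a real wake-up.
            time.sleep(settle_s)
            with cond:
                # Create the task and record when the signal is sent.
                tasks.append((lambda: None, time.perf_counter_ns()))
                cond.notify()
            # Wait until the consumer has executed the task.
            done.acquire()

    threads = [threading.Thread(target=consumer),
               threading.Thread(target=producer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return latencies_ns


if __name__ == "__main__":
    for i, ns in enumerate(measure_handoff_latency()):
        print(f"task {i}: wake-up delay {ns / 1000:.1f} microseconds")
```

Each recorded value covers only the interval between the signal and the moment the consumer resumes, which corresponds to the wake-up delay discussed above. The sketch does not model the stack and page loads themselves.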