The present invention relates to improving performance of computer code, and more specifically, to using cluster analysis to dynamically transfer execution of computer code across disparate processing unit architectures during runtime.
There are a large number of processing unit (PU) architectures capable of executing computer code. Frequently, computer code is expressly designed to be executed on a particular type of PU, such as a central processing unit (CPU) or graphics processing unit (GPU), and the code cannot be executed on other types of PU without substantial translation, which requires significant time and effort. Most code is executed on a particular type of PU (e.g., C++ code is usually executed on a CPU). At times, however, it is desirable to execute code on a type of PU other than the one on which it would ordinarily be executed. For example, code ordinarily executed on a CPU may be better executed on a GPU in some instances.
Porting such code to a second PU, however, often wastes considerable time: large portions of the code may be better executed on the original PU, so the resulting translated code is useless. Furthermore, some code is better executed on a GPU when it is run in parallel with other portions of code, but on a CPU when it is run in isolation. In many circumstances, developers may be unable to determine which processing architecture is optimal for executing a particular piece of computer code without experimentally executing that code across multiple different PU architectures. Additionally, many computer programs can be organized into blocks of code. Frequently, porting an entire block of code is unnecessary, and it would be more efficient (e.g., the program would run more quickly) if one or more smaller portions of the block were ported rather than the entire block. Similarly, at times the most efficient code is created by porting a portion of the code that includes sections from multiple logical blocks of code (e.g., blocks that are executed simultaneously or sequentially).
There is no satisfactory solution for predicting, a priori, the acceleration that can be achieved by porting all or portions of computer code to other processing unit architectures. Moreover, there is no satisfactory solution for dynamically transferring execution of computer code between disparate processing unit architectures during runtime.