In recent years, algorithms from the relatively nascent field of machine learning have been widely applied to practical problems, yielding technologies such as self-driving vehicles, improved Internet search engines, speech, audio, and visual recognition systems, human health data and genome analysis, recommendation systems, and fraud detection systems. The growth of these algorithms has been fueled in part by recent increases in the amount and variety of data produced by both humans and machines; as the volume of data available for analysis has skyrocketed, so too has interest in machine learning.
However, machine learning algorithms tend to be computationally expensive, as they can involve performing enormous numbers of non-trivial operations (e.g., floating-point multiplications) over very large amounts of data. It is therefore important to implement these algorithms as efficiently as possible: at this scale of computation, even a small inefficiency is quickly magnified.
For example, many machine learning algorithms perform linear algebra operations over huge matrices. These operations are difficult to parallelize on modern computing systems, at least in part because of potential write-to-read dependences across iterations of a loop that updates values in a matrix.
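Such a dependence can be illustrated with a minimal sketch (the update rule and names below are hypothetical, not taken from any particular implementation): each loop iteration reads a value that an earlier iteration may have written, so the iterations cannot naively run in parallel.

```python
def sequential_updates(weights, samples, lr=0.1):
    """Apply updates in order; later iterations read earlier writes."""
    for idx, grad in samples:
        # Read-after-write: this read must observe any prior write to idx.
        weights[idx] = weights[idx] - lr * grad
    return weights

# Two updates touch index 0; the second reads the first's result.
result = sequential_updates([1.0, 2.0, 3.0], [(0, 0.5), (0, 0.5), (2, 1.0)])
```

Because the two updates to index 0 chain through the same memory location, any parallel schedule must preserve that ordering (or compensate for breaking it) to reproduce the sequential answer.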
Some current approaches to performing these types of linear algebra operations employ locking techniques, while others use approximate lock-free implementations. Locking preserves the exact solution of a sequential implementation, but trades away parallelism for that guarantee: because of the locking overhead, previous approaches have shown that performance does not scale beyond two to four cores, and falls well short of linear scaling even up to four cores.
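A lock-based variant can be sketched as follows (hypothetical helper names, using Python's threading module for illustration). The single lock preserves the sequential answer for these commutative updates, but it serializes the hot path, which is why adding cores mostly adds overhead rather than speedup.

```python
import threading

def locked_updates(weights, samples, lr=0.1, n_threads=4):
    """Parallel updates guarded by one lock: the result matches a
    sequential run (these updates commute), but every read-modify-write
    is serialized, so extra cores contend rather than contribute."""
    lock = threading.Lock()

    def worker(chunk):
        for idx, grad in chunk:
            with lock:  # serializes the read-modify-write
                weights[idx] -= lr * grad

    chunks = [samples[i::n_threads] for i in range(n_threads)]
    threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return weights

result = locked_updates([1.0, 2.0, 3.0], [(0, 0.5), (0, 0.5), (2, 1.0)])
```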
The second approach, using approximate lock-free implementations, does come close to linear performance scaling, but does not reach the best solution because it fundamentally relies on approximations. Furthermore, the output deviation can be particularly high for datasets with a power-law distribution, where some features are far more common than others, which increases the chance of incorrect updates.
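The lock-free idea can be sketched in a Hogwild-style form (hypothetical helper names; a sketch only, not any particular system's implementation). With no synchronization, a concurrent write landing between a thread's read and its write is silently lost, and such losses concentrate on the "hot" indices that a power-law dataset touches far more often than others.

```python
import threading

def lockfree_updates(weights, samples, lr=0.1, n_threads=4):
    """Hogwild-style sketch: threads update shared weights with no lock.
    An update by another thread between the read and the write below is
    lost, so the result can deviate from the sequential answer."""
    def worker(chunk):
        for idx, grad in chunk:
            w = weights[idx]              # unsynchronized read ...
            weights[idx] = w - lr * grad  # ... and unsynchronized write

    chunks = [samples[i::n_threads] for i in range(n_threads)]
    threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return weights

# All updates subtract, so a lost update can only leave a value too high.
result = lockfree_updates([1.0, 2.0, 3.0], [(0, 0.5), (0, 0.5), (2, 1.0)])
```

Here index 0 is the "hot" index: two threads race on it, so its final value may reflect one lost update, while index 2, touched by only one thread, is always updated exactly.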
Accordingly, techniques for enhancing the performance of these types of algorithms, such as those having write-to-read dependences across loop iterations, are strongly desired.