Technical Field
The present invention relates generally to information processing and, in particular, to improving the runtime of CUDA Basic Linear Algebra Subroutines (CUBLAS) matrix multiplication on a Graphics Processing Unit (GPU).
Description of the Related Art
Matrix multiplication is a common operation in many computationally intensive applications. Deep neural network training is one of the most prominent such applications. Other such applications include, but are not limited to, molecular mechanics simulation, gas and fluid dynamics, weather forecasting, quantum chemistry, linear optimization, and so forth.
A Graphics Processing Unit (GPU) is a widely used platform that can accelerate matrix multiplication by a factor of ten or more compared to a Central Processing Unit (CPU). For example, NVIDIA® ships a library of CUDA Basic Linear Algebra Subroutines (CUBLAS) created specifically for NVIDIA® GPUs.
The efficiency of matrix multiplication performed on a GPU by CUBLAS library functions varies greatly with the sizes of the matrices. Hence, there is a need for improving the runtime of CUBLAS matrix multiplication on a GPU.