In machine learning platforms, cognitive applications often rely on large graph analytics. Typically, large graphs are highly sparse and represented as sparse matrices (adjacency matrix) in cognitive applications. Multiplication of these sparse matrices with sparse vectors is a very common operation in cognitive applications. Modern multi-core, multi-threaded processors incur substantial synchronization overhead in sparse-matrix sparse vector implementation.