Demand is increasing for solutions to complex linear systems. These solutions are used in various applications, such as computer vision (e.g., simultaneous localization and mapping (SLAM), robotics, and drones), machine learning, control systems, and big-data analytics. Solving these complex linear systems may include a matrix decomposition operation. Matrix decomposition operations are computationally intensive. For example, matrix decomposition complexity may be cubic, such that, for an N×N matrix, processing involves on the order of N³ computations. Such processing complexity often consumes substantial power. Matrix decomposition operations also require substantial memory bandwidth, resulting in a substantial time delay in computing the solution (e.g., large latency). The large latency may significantly affect the performance of various applications, such as by slowing camera pose estimation or SLAM calculations. In addition to the power used in the mathematical operations themselves, matrix decomposition operations require substantial energy to execute the large number of memory accesses. When implemented as a software kernel running on a general-purpose processor (e.g., a central processing unit (CPU)), the matrix decomposition operations include irregular memory access patterns (e.g., for triangular matrices) and serial operation dependencies, which further increase latency and power consumption. The high latency and high energy consumption may substantially reduce the performance of time-sensitive applications, such as augmented reality (AR) or virtual reality (VR) applications.
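By way of illustration only, the cubic growth, the triangular access pattern, and the serial dependencies noted above can be seen in a minimal software sketch. The sketch assumes Cholesky decomposition as a representative matrix decomposition; the text above does not name a specific decomposition. The function names `cholesky_with_counts` and `make_spd` are hypothetical and are introduced here solely for the example.

```python
import math


def make_spd(n):
    """Hypothetical helper: symmetric positive-definite test matrix.

    Entry (i, j) is min(i, j) + 1, whose Cholesky factor is the
    lower-triangular matrix of all ones.
    """
    return [[float(min(i, j) + 1) for j in range(n)] for i in range(n)]


def cholesky_with_counts(a):
    """Factor a = L * L^T and count inner-loop multiply-subtract steps.

    Assumes `a` is a symmetric positive-definite list of lists.
    Returns (L, ops), where L is lower triangular.
    """
    n = len(a)
    L = [[0.0] * n for _ in range(n)]
    ops = 0
    for j in range(n):
        # Column j cannot start until columns 0..j-1 are finished:
        # this is the serial dependency between steps.
        s = a[j][j]
        for k in range(j):
            s -= L[j][k] * L[j][k]
            ops += 1
        L[j][j] = math.sqrt(s)
        for i in range(j + 1, n):
            # Only the lower triangle is read, so the length of each
            # row segment changes with j (triangular access pattern).
            s = a[i][j]
            for k in range(j):
                s -= L[i][k] * L[j][k]
                ops += 1
            L[i][j] = s / L[j][j]
    return L, ops
```

In this sketch the inner-loop count is (N³ − N)/6, so a 4×4 input takes 10 multiply-subtract steps and an 8×8 input takes 84; doubling N increases the work roughly eightfold. Each column also depends on every earlier column, which limits how much of the work a general-purpose processor can overlap.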