Today, the various implementations of artificial intelligence and machine learning are driving innovation in many fields of technology. Artificial intelligence (AI) systems and artificial intelligence models (including algorithms) are defined by many system architectures and models that enable machine learning (deep learning), reasoning, inferential capacities, and large data processing capabilities of a machine (e.g., a computer and/or a computing server). These AI systems and models are often trained intensively to perform one or more specific tasks, such as natural language processing, image recognition, planning, decision-making, and the like. For example, a subset of these AI systems and models include artificial neural network models. The training of an artificial neural network model may, in many cases, require thousands of hours across the training cycle and many terabytes of training data to fine tune associated neural network algorithm(s) of the model before use.
However, once trained, a neural network model or algorithm may be deployed quickly to make inferences to accomplish specific tasks (e.g., recognizing speech from speech input data, etc.) based on relatively smaller datasets when compared to the larger training datasets used during the training cycle. The inferences made by the neural network model or algorithm based on the smaller datasets may be a prediction about what the neural network model calculates to be a correct answer or indication about a circumstance.
Still, while neural network models implementing one or more neural network algorithms may not require a same amount of compute resources, as required in a training phase, deploying a neural network model in the field continues to require significant circuitry area, energy, and compute power to classify data and infer or predict a result. For example, weighted sum calculations are commonly used in pattern matching and machine learning applications, including neural network applications. In weighted sum calculations, an integrated circuit may function to multiply a set of inputs (xi) by a set of weights (wi) and sum the results of each multiplication operation to calculate a final result (z). Typical weighted sum calculations for a machine learning application, however, include hundreds or thousands of weights which causes the weighted sum calculations to be computationally expensive to compute with traditional digital circuitry. Specifically, accessing the hundreds or thousands of weights from a digital memory requires significant computing time (i.e., increased latency) and significant energy.
Accordingly, traditional digital circuitry required for computing weighted sum computations of a neural network model or the like tend to be large to accommodate a great amount of digital memory circuitry needed for storing the millions of weights required for the neural network model. Due to the large size of the circuitry, more energy is required to enable the compute power of the many traditional computers and circuits.
Additionally, these traditional computers and circuits for implementing artificial intelligence models and, namely, neural network models may be suitable for remote computing processes, such as in distributed computing systems (e.g., the cloud), or when using many onsite computing servers and the like. However, latency problems are manifest when these remote artificial intelligence processing systems are used in computing inferences and the like for remote, edge computing devices or in field devices. That is, when these traditional remote systems seek to implement a neural network model for generating inferences to be used in remote field devices, there are unavoidable delays in receiving input data from the remote field devices because the input data must often be transmitted over a network with varying bandwidth and subsequently, inferences generated by the remote computing system must be transmitted back to the remote field devices via a same or similar network. Additionally, these traditional circuit often cannot manage the computing load (e.g., limited storage and/or limited compute) and may often rely on remote computing systems, such as the cloud, to perform computationally-intensive computations and store the computation data (e.g., raw inputs and outputs). Thus, constant and/or continuous access (e.g., 24×7 access) to the remote computing systems (e.g., the cloud) is required for continuous operation, which may not be suitable in many applications either due to costs, infrastructure limitations (e.g., limited bandwidth, low grade communication systems, etc.), and the like.
Implementing AI processing systems at the field level (e.g., locally at the remote field device) may be a proposed solution to resolve some of the latency issues. However, attempts to implement some of these traditional AI computers and systems at an edge device (e.g. remote field device) may result in a bulky system with many circuits, as mentioned above, that consumes significant amounts of energy due to the required complex architecture of the computing system used in processing data and generating inferences. Thus, such a proposal without more may not be feasible and/or sustainable with current technology.
Accordingly, there is a need for a deployable system for implementing artificial intelligence models locally in the field (e.g., local AI), and preferably to be used in edge devices, that do not result in large, bulky (edge) devices, that reduces latency, and that have necessary compute power to make predictions or inferences, in real-time or substantially real-time, while also being energy efficient.
The below-described embodiments of the present application provide such advanced and improved integrated circuits and implementation techniques capable of addressing the deficiencies of traditional systems and integrated circuit architectures for implementing AI and machine learning.