With the rapid increase of wireless sensors and the advent of the age of “Internet of Things” and “Big Data Computing”, there is a strong need for low-power machine learning systems that can reduce the amount of data being generated by intelligently processing it at the source. This not only relieves the user of making sense of all of this data but also reduces the power dissipated in transmission, allowing the sensor node to run much longer on battery. Data reduction is also a necessity for biomedical implants, where it is impossible to transmit all of the generated data wirelessly due to the bandwidth constraints of implanted transmitters.
As an example, consider brain machine interface (BMI) based neural prostheses, an emerging technology that enables a paralyzed person to control a prosthesis directly from the neural signals of the brain. As shown in FIG. 1, one or more micro-electrode arrays (MEAs) are implanted into cortical tissue of the brain to enable single-unit acquisition (SUA) or multi-unit acquisition (MUA), and the signal is recorded by a neural recording circuit. The recorded neural signal, i.e. sequences of action potentials from different neurons around the electrodes, carries information about the motor intention of the subject.
The signal is transmitted out of the subject to a computer where neural signal decoding is performed. Neural signal decoding is the process of extracting the motor intention embedded in the recorded neural signal. The output of neural signal decoding is a control signal, which is used as a command to control the prosthesis, such as a prosthetic arm. Through this process, the subject can move the prosthesis by simply thinking. The subject sees the prosthesis move (creating visual feedback to the brain) and typically also feels it move (creating sensory feedback to the brain).
Next generation neural prostheses require one or several miniaturized devices implanted into different regions of the brain cortex, featuring integration of up to a thousand electrodes, both neural recording and sensory feedback, and a wireless data and power link to reduce the risk of infection and enable long-term, daily use. The tasks of neural prostheses are also being extended from simple grasp and reach to more sophisticated daily movements of the upper limbs and to bipedal locomotion. A major concern in this vision is the power consumption of the electronic devices in the neural prosthesis. The power consumption of the implanted circuits is highly restricted in order to prevent tissue damage caused by heat dissipation from the circuits. Furthermore, implanted devices are predominantly supplied by a small battery or a wireless power link, which makes the power budget even more restricted when long-term operation of the devices is assumed. As the number of electrodes increases, the higher channel count makes meeting this power budget more challenging, calling for optimization of each functional block as well as of the system architecture.
Another issue that arises with the increasing number of electrodes is the need to transmit a large amount of recorded neural data wirelessly from the implanted circuits to devices external to the patient. This puts a very heavy burden on the implanted device. In a neural recording device with 100 electrodes, for instance, with a typical sampling rate of 25 kSa/s and a resolution of 8 bits, the wireless data rate can be as high as 20 Mb/s. Some method of data compression is therefore highly desirable. In particular, it would be desirable to include a machine learning capability for neural signal decoding on-chip in the implanted circuitry, as an effective means of data compression. For example, this might make it possible to transmit wirelessly out of the subject only the prosthesis command (e.g. which finger to move (5 choices) and in which direction (2 choices), for a total of 10 options, which can be encoded in 4 bits). Even if this is not possible, it might be feasible to wirelessly transmit only pre-processed data at a reduced data rate compared to the recorded neural data.
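The figures quoted above can be checked with a short calculation. The following Python sketch is illustrative only; the function and variable names are assumptions introduced here, not part of any described implementation. It multiplies electrode count, sampling rate and resolution to obtain the raw data rate, and computes the minimum number of bits needed to encode one of the 10 prosthesis commands.

```python
def raw_data_rate_bps(num_electrodes: int, sample_rate_hz: int, bits_per_sample: int) -> int:
    """Uncompressed data rate in bits per second for a multi-channel recorder."""
    return num_electrodes * sample_rate_hz * bits_per_sample


def bits_for_options(num_options: int) -> int:
    """Minimum number of bits needed to give each of num_options a distinct code."""
    # (n - 1).bit_length() equals ceil(log2(n)) for n >= 1, using integers only.
    return (num_options - 1).bit_length()


# Values taken from the example in the text.
raw_bps = raw_data_rate_bps(num_electrodes=100, sample_rate_hz=25_000, bits_per_sample=8)
command_bits = bits_for_options(5 * 2)  # 5 fingers x 2 directions = 10 options

print(f"raw data rate: {raw_bps / 1e6:.0f} Mb/s")
print(f"bits per decoded command: {command_bits}")
```

For comparison, a single sampling instant across all 100 electrodes already occupies 800 bits of raw data, whereas one decoded command occupies 4 bits.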
Though digital processors have benefited from transistor scaling due to Moore's law, they are inherently inefficient at performing machine learning computations that require a large number of multiply operations. Analog processing, on the other hand, provides very power-efficient solutions for elementary calculations like multiplication [7]; however, historically, analog computing has been difficult to scale to large systems for several reasons, a major one being device mismatch. As transistor dimensions have shrunk over the years, the variance in transistor properties, notably the threshold voltage, has kept increasing, making it difficult to rely on conventional simulations that ignore statistical variations. The problem is particularly severe for neuromorphic designs, where transistors are typically biased in the sub-threshold region of operation (to obtain maximal energy efficiency per operation): device currents depend exponentially on threshold voltage, so threshold-voltage variations are amplified in the currents. The usual approach has been to compensate for mismatch, either through floating gates or by other means. It is sometimes claimed that learning can compensate for mismatch, but this claim needs to be quantified, since mismatch will exist in the learning circuits as well [1].
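The exponential sensitivity of sub-threshold current to threshold voltage can be illustrated numerically. The sketch below assumes the standard first-order sub-threshold model, in which drain current is proportional to exp((V_GS - V_th) / (n * U_T)). The slope factor n = 1.5 and the 10 mV threshold shift are hypothetical values chosen for illustration, not figures from this document.

```python
import math

THERMAL_VOLTAGE_V = 0.02585  # U_T = kT/q near room temperature (about 300 K)


def subthreshold_current_ratio(delta_vth_v: float, slope_factor: float = 1.5) -> float:
    """Current of a device whose threshold is shifted by delta_vth_v, relative to nominal.

    Assumes I ~ exp((V_GS - V_th) / (n * U_T)), so a threshold increase of
    delta_vth_v scales the current by exp(-delta_vth_v / (n * U_T)).
    slope_factor (n) is an assumed, process-dependent parameter.
    """
    return math.exp(-delta_vth_v / (slope_factor * THERMAL_VOLTAGE_V))


# Hypothetical example: two nominally identical transistors whose thresholds
# differ from nominal by -10 mV and +10 mV.
low_vth = subthreshold_current_ratio(-0.010)
high_vth = subthreshold_current_ratio(+0.010)
print(f"-10 mV threshold shift: current x{low_vth:.2f}")
print(f"+10 mV threshold shift: current x{high_vth:.2f}")
print(f"spread between the two devices: x{low_vth / high_vth:.2f}")
```

Under these assumptions a threshold difference of a few tens of millivolts changes the current by tens of percent, which is why mismatch cannot be ignored in sub-threshold analog designs.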
Hence, it would be useful to develop low-power, analog-computing-based machine learning systems that can operate even with the large amount of statistical variation that is prevalent in today's semiconductor processes.
In the field of BMI, the neural decoding algorithms used are predominantly based on active filtering or statistical analysis. These highly sophisticated decoding algorithms work reasonably well in experiments but require a significant amount of computation. Therefore, state-of-the-art neural signal decoding is mainly conducted either on a software platform or on a microprocessor outside of the brain, consuming a considerable amount of power and making it impractical for long-term, daily use of the neural prosthesis. As discussed above, the next generation of neural prostheses calls for miniaturized, less power-hungry neural signal decoding that operates in real time. Integrating the neural decoding algorithm with the neural recording devices is also desirable in order to reduce the wireless data transmission rate.
Until now, very little work has addressed this problem. A low-power neural decoding architecture using analog computing has been proposed [5]. It optimizes a mapping in a training mode, by continuously feeding in recorded neural signals, and then uses the optimized mapping to generate the output in an operational mode. The architecture is largely an active filtering method involving massively parallel computing through low-power analog filters and memories. A complicated learning algorithm, a modified gradient-descent approach, is implemented on chip to minimize the error in a least-squares sense, adding to the complexity of the design. To achieve low-power operation, sub-threshold design is used to lower the biasing current, which magnifies the mismatch and robustness issues in the analog circuits. Furthermore, apart from some SPICE simulation results, no measurement results have been published to support the silicon viability of the architecture. A more recent work proposes a universal computing architecture for neural signal decoding [6]. The architecture consists of an internal part, integrated with the implanted neural recording device, and an external part. The internal part pre-processes the neural signal at each time step by performing binary classification for a series of possible states according to a set of pre-defined rules. Only the classification decision vector is transmitted to the external device, reducing the data rate by a factor of 10000. The transmitted data is further processed by the external device in a non-causal manner to select the most probable state. The computational load is distributed unevenly between the two parts: the internal part performs only light-weight logic but reduces the data rate effectively, while the external part, which has a looser power constraint, completes the more complicated computation required by the algorithm. The architecture is claimed to be universal, capable of implementing various decoding algorithms.
An example using a pattern matching algorithm is implemented in a field programmable gate array (FPGA) and verified in a rodent experiment. The power consumption of the FPGA for implementing this example is estimated to be 537 μW.
Custom hardware implementations of neural networks have many advantages over generic processor based ones in terms of speed, power and area. In the past, the Support Vector Machine (SVM) algorithm has been implemented in single-chip VLSI by many groups for various applications [7]-[14]. For example, [7] describes a digitally synthesized SVM-based recognition system with the sequential minimal optimization (SMO) algorithm in an FPGA. The authors in [8] developed an analog circuit architecture for Gaussian-kernel support vector machines with on-chip training capability. The problem with these SVM systems is that each parameter of the network has to be tuned one by one, and sufficient memory is required for storing these parameters. As an alternative, floating-gate transistors could be used for non-volatile data storage and analog computation [12]-[15]. However, this requires a special fabrication process (typically a double-poly process) and an additional charge programming stage for the floating-gate devices. There is also a reliability issue if the weights need to be programmed frequently. Finally, the device size of a floating-gate based multiplier will be much larger than that of our minimum-sized transistor based multiplier, since the floating-gate device has to be a thick-oxide device with a minimum channel length typically larger than 350 nm.