Machine learning can roughly be characterized as the process of creating algorithms that learn a behavior from examples. One simple example is pattern classification. A series of input patterns is given to the algorithm along with a desired output (the label), and the algorithm learns to classify the patterns by producing the desired label for any given input pattern. Such a method is called supervised learning, since a human operator must provide the labels during the teaching phase. An example is the kernel-based support vector machine (SVM) algorithm. Alternatively, unsupervised “clustering” is a process of assigning labels to the input patterns without the use of a human operator. Such unsupervised methods must usually function through a statistical analysis of the input data, for example, finding the eigenvectors of the covariance matrix. One example of unsupervised clustering is the suite of k-means algorithms.
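As a rough illustration of unsupervised clustering, the following is a minimal sketch of the basic k-means iteration (Lloyd's algorithm) in plain Python. The data points, the starting centroids, and the function name kmeans are assumptions chosen for illustration; they do not come from the text above.

```python
# Minimal sketch of Lloyd's k-means on 2-D points.
# The data, starting centroids, and function name are illustrative assumptions.
import math

def kmeans(points, centroids, iterations=10):
    """Alternate between assigning points to centroids and updating centroids."""
    labels = []
    for _ in range(iterations):
        # Assignment step: label each point with the index of its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        # Update step: move each centroid to the mean of the points assigned to it.
        new_centroids = []
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centroids.append(centroids[k])  # leave an empty cluster where it was
        centroids = new_centroids
    return labels, centroids

# Two visually obvious groups; no labels are supplied by a human operator.
points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
labels, centroids = kmeans(points, centroids=[(0.0, 0.0), (1.0, 1.0)])
print(labels)
print(centroids)
```

The labels here emerge only from the geometry of the inputs, which is the sense in which the method is unsupervised.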
A few problems have continued to challenge the field of machine learning. Few, if any, standard and accepted methods exist for learning from only a few patterns or exemplars. Without sufficient examples, finding a solution that balances memorization with generalization is often difficult. The difficulty is due to the separation of the training and testing stages: the variables that encode the algorithm's learning behavior are modified during the training stage and evaluated for accuracy and generalization during the testing stage. Without sufficient examples during the training stage, it is difficult or impossible to determine the variable configuration that leads to this optimal balance. Theoretically, the mathematical technique of support-vector-maximization provides an optimal solution, provided there is sufficient training data to encompass the natural statistics of the data and presuming the statistics do not change over time; a change in the statistics over time is a problem called concept drift. The idea is that all input patterns are projected into a high-dimensional space where they are linearly separable.
A linear classifier can then be used to label the data in a binary classification task. A linear classifier can be thought of as a hyperplane in a high-dimensional space, where we call the hyperplane the decision boundary. All inputs falling on one side of the decision boundary result in a positive output, while all inputs on the other side result in a negative output. The support vectors are the input points closest to the decision boundary, and their distance from the boundary is the margin. The process of maximizing this distance is the process of support-vector-maximization. However, without sufficient examples it is of little or no use, since identifying the support vectors requires testing a number of input patterns to find which ones are closest to the decision boundary. Indeed, some thought may convince the reader that finding the point of optimal generalization is not possible with only one example, since by definition measuring generalization requires evaluating a number of exemplars.
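The decision boundary and the distance being maximized can be made concrete with a small sketch. It assumes a hyperplane written as w·x + b = 0, and the weights, offsets, and points are invented for illustration. It does not perform the optimization itself; it only compares two candidate boundaries by the distance from each to its closest input point.

```python
# Sketch of a linear classifier and the distance from the boundary to the closest points.
# The hyperplane form w.x + b = 0 and all numeric values are illustrative assumptions.
import math

def decision_value(w, b, x):
    """Signed score: positive on one side of the hyperplane, negative on the other."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def classify(w, b, x):
    """Binary label determined by which side of the decision boundary x falls on."""
    return 1 if decision_value(w, b, x) >= 0 else -1

def closest_distance(w, b, points):
    """Smallest perpendicular distance from any point to the decision boundary."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(abs(decision_value(w, b, x)) / norm for x in points)

points = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (3.0, 3.0)]
w = (1.0, 1.0)

# Two candidate boundaries with the same orientation but different offsets.
labels_a = [classify(w, -3.0, x) for x in points]
labels_b = [classify(w, -2.5, x) for x in points]
dist_a = closest_distance(w, -3.0, points)
dist_b = closest_distance(w, -2.5, points)

print(labels_a, dist_a)
print(labels_b, dist_b)
```

Both boundaries label the four points identically, yet the first sits farther from its nearest points. Telling the two apart requires several examples on each side, which is why the approach offers little with only one exemplar.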
Another problem facing the field of machine learning is adaptation to non-stationary statistics, i.e., concept drift. The problem occurs when the statistics of the underlying data change over time. Any method that relies on a strict separation of training and testing is doomed to failure under drift, as whatever the algorithm has learned becomes incorrect as time moves forward. Methods for continual real-time adaptation are clearly needed, but such methods are often at odds with the training methods employed to find the initial solution.
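A toy sketch of concept drift follows. It assumes a one-dimensional, two-class problem with a threshold classifier placed midway between the class means; the data values and the size of the shift are invented for illustration and are not drawn from any real dataset.

```python
# Toy illustration of concept drift with a 1-D threshold classifier.
# The classifier, data values, and shift size are illustrative assumptions.

def fit_threshold(samples):
    """Place the threshold midway between the means of the two classes."""
    neg = [x for x, y in samples if y == -1]
    pos = [x for x, y in samples if y == 1]
    return (sum(neg) / len(neg) + sum(pos) / len(pos)) / 2

def accuracy(threshold, samples):
    """Fraction of samples whose predicted label matches the true label."""
    correct = sum(1 for x, y in samples if (1 if x >= threshold else -1) == y)
    return correct / len(samples)

# Data seen during training: negatives near 0, positives near 2.
before = [(-0.2, -1), (0.0, -1), (0.2, -1), (1.8, 1), (2.0, 1), (2.2, 1)]
# Later data: the same labels, but every input has drifted upward by 2.
after = [(x + 2.0, y) for x, y in before]

frozen = fit_threshold(before)   # learned once, never updated
refit = fit_threshold(after)     # re-learned on the drifted data

acc_frozen_before = accuracy(frozen, before)
acc_frozen_after = accuracy(frozen, after)
acc_refit_after = accuracy(refit, after)
print(acc_frozen_before, acc_frozen_after, acc_refit_after)
```

The frozen threshold is perfect on the data it was trained on and falls to chance once the inputs shift, while a threshold refit to the new data recovers. Refitting from scratch is only a stand-in here for the continual adaptation the text calls for.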
Another problem facing the field of machine learning is power consumption. Finding statistical regularities in large quantities of streaming information can be incredibly power intensive, as the problem encounters combinatorial explosions. The complexity of the task is echoed in biological nervous systems, which are essentially communication networks that self-evolve to detect and act on regularities present in the input data stream. It is estimated that there are between 2 and 4 kilometers of wires in one cubic millimeter of cortex. At 2500 cm² total area and 2 mm thick, the cortex has a volume of 500,000 cubic millimeters; taking the midpoint of 3 kilometers per cubic millimeter, that is roughly 1.5 million kilometers of wire in the human cortex, or enough wire to wrap around the earth about 37 times.
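The wire-length estimate can be checked with a few lines of arithmetic. The cortical area, thickness, and wire density are the figures quoted above; the midpoint density of 3 km per cubic millimeter and the Earth circumference of about 40,075 km are assumptions introduced here to reproduce the quoted totals.

```python
# Back-of-the-envelope check of the cortical wire-length estimate.
# Area, thickness, and density range come from the text; the midpoint density
# and the Earth's circumference are assumptions added for this check.

area_cm2 = 2500
thickness_mm = 2
wire_km_per_mm3_low, wire_km_per_mm3_high = 2, 4
earth_circumference_km = 40_075  # assumed equatorial circumference

# 1 cm^2 = 100 mm^2, so volume in mm^3 is area (mm^2) times thickness (mm).
volume_mm3 = area_cm2 * 100 * thickness_mm

wire_km_low = volume_mm3 * wire_km_per_mm3_low
wire_km_high = volume_mm3 * wire_km_per_mm3_high
wire_km_mid = (wire_km_low + wire_km_high) / 2

wraps_around_earth = wire_km_mid / earth_circumference_km

print(f"cortical volume: {volume_mm3:,} mm^3")
print(f"wire length: {wire_km_low:,} to {wire_km_high:,} km (midpoint {wire_km_mid:,.0f} km)")
print(f"wraps around the Earth: {wraps_around_earth:.1f}")
```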
For this reason, the closer one can match the distributed processors of the hardware to the structure of the underlying network being simulated, the less information must be shuttled back and forth between memory and processor, and the lower the power dissipation required for emulation. The limit of efficiency occurs when the hardware becomes the network, which occurs when memory becomes processing. We call this point physical computation, since the physical properties of the system are now “computing” the answer rather than the answer being arrived at abstractly through operations on numbers represented as binary values. Physical computation is related to, but not the same as, analog computation. For example, consider the problem of simulating the fall of a rock dropped from some height. We may go about a solution in a number of ways. First, we may derive a mathematical expression and evaluate it on a digital computer. This is digital computing. Second, we may solve a differential equation by noticing that the equations of motion are mathematically equivalent to those of some other process, for example, those of transistor physics. This is analog computing. Third, we could find a rock and drop it. This is physical computing. Relatively simple arguments can be made to show that physical computing is the only practical solution for large adaptive systems on the scale of living systems such as a brain. Digital and analog computing each suffer from the memory-processing duality, a condition which does not exist in nature and which introduces very high power dissipation for highly adaptive large-scale systems.
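The first option in the falling-rock example, digital computing, can be sketched as follows. The sketch assumes free fall from rest with no air resistance and g = 9.81 m/s²; the drop height and function names are invented for illustration. It evaluates the closed-form expression and, for comparison, steps the equations of motion numerically.

```python
# Digital-computing version of the falling-rock example.
# Assumes free fall from rest, no air resistance, and g = 9.81 m/s^2;
# the height and function names are illustrative.
import math

G = 9.81  # gravitational acceleration in m/s^2 (assumed)

def fall_time_closed_form(height_m):
    """Solve height = g * t^2 / 2 for t."""
    return math.sqrt(2 * height_m / G)

def fall_time_numeric(height_m, dt=1e-4):
    """Step velocity and position forward in time until the rock reaches the ground."""
    t, y, v = 0.0, height_m, 0.0
    while y > 0:
        v += G * dt   # update velocity from acceleration
        y -= v * dt   # update position from velocity
        t += dt
    return t

height_m = 20.0
t_exact = fall_time_closed_form(height_m)
t_stepped = fall_time_numeric(height_m)
print(f"closed form: {t_exact:.4f} s, stepped: {t_stepped:.4f} s")
```

Both routes shuttle numbers between memory and processor many times to reproduce what a dropped rock does on its own, which is the contrast the paragraph draws.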
As an example of just how significant the power cost of computation is, consider IBM's recent cat-scale cortical simulation of 1 billion neurons and 10 trillion synapses. This effort required 147,456 CPUs and ran at 1/100th of real time. At a power consumption of 20 W per CPU, this is about 3 megawatts. If we presume perfect scaling, a real-time simulation would consume 100× more power: 300 megawatts. A human brain is approximately 20 times larger than a cat's, so a real-time simulation of a network at the scale of a human brain would consume 6 GW if done with traditional serial processors. This is 600 million times more power than a human brain actually dissipates (a ratio that implies a brain dissipation of roughly 10 W). It is worth considering that every brain in existence has evolved for just one purpose: to control an autonomous platform. An algorithm for finding regularities in large quantities of streaming information that cannot be mapped directly to physically adaptive hardware will likely not find use in mobile platforms, as its energy demands would far exceed practical power budgets.
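The scaling argument above reduces to a short chain of multiplications, sketched below. All inputs are the figures quoted in the text; the variable names are illustrative, and the final line only back-computes the brain power implied by the quoted 600-million ratio rather than asserting a measured value.

```python
# Arithmetic behind the power-scaling estimate.
# Inputs are the figures quoted in the text; variable names are illustrative.

cpus = 147_456
watts_per_cpu = 20
slowdown_vs_real_time = 100      # simulation ran at 1/100th of real time
human_to_cat_size_ratio = 20
quoted_ratio_to_brain = 600e6    # "600 million times" from the text

simulation_w = cpus * watts_per_cpu                          # as run
real_time_cat_w = simulation_w * slowdown_vs_real_time       # perfect scaling assumed
real_time_human_w = real_time_cat_w * human_to_cat_size_ratio

# Brain power implied by the quoted ratio (derived, not measured).
implied_brain_w = real_time_human_w / quoted_ratio_to_brain

print(f"simulation as run: {simulation_w / 1e6:.2f} MW")
print(f"real-time cat scale: {real_time_cat_w / 1e6:.1f} MW")
print(f"real-time human scale: {real_time_human_w / 1e9:.2f} GW")
print(f"implied brain power: {implied_brain_w:.1f} W")
```

The unrounded results are slightly below the rounded figures in the text (about 2.95 MW, 295 MW, and 5.9 GW), which does not change the conclusion.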