1. Field of the Invention
The present invention relates generally to brain dynamics and, more particularly, to methods for solving the “distal reward problem” or “credit assignment problem.”
2. Description of the Related Art
Learning the associations between cues and rewards (classical or Pavlovian conditioning) or between cues, actions, and rewards (instrumental or operant conditioning) involves reinforcement of neuronal activity by rewards or punishments. Typically, the reward comes seconds after reward-predicting cues or reward-triggering actions, creating an explanatory conundrum known in the behavioral literature as the distal reward problem and in the reinforcement learning literature as the credit assignment problem. Indeed, how does an animal know which of the many cues and actions preceding the reward should be credited for the reward? In neural terms, in which sensory cues and motor actions correspond to neuronal firings, how does the brain know what firing patterns, out of an unlimited repertoire of all possible firing patterns, are responsible for the reward if the firing patterns are no longer there when the reward arrives? How does the brain know which of the spikes of many neurons result in the reward if many of these neurons fire during the waiting period to the reward? Finally, how does a reinforcement signal in the form of the neuromodulator dopamine (DA) influence the right synapses at the right time, if DA is released globally to many synapses?
This problem is notoriously difficult to solve in autonomous robotics, where a robotic device must execute multiple steps before it achieves its goal and obtains a reward. An entire subfield of machine learning, known as “reinforcement learning theory,” attempts to solve this problem using artificial intelligence and dynamic programming methods.
A similar problem exists when the behavior of the robot is controlled by a simulated neural network, as in what are known in the art as brain-based devices (BBDs). Indeed, how does the simulated neural network of a BBD know what firing patterns of what neurons are responsible for the reward if (a) the firing patterns are no longer there when the reward arrives and (b) most neurons and synapses are active during the waiting period before the reward? Traditionally, this problem is solved using one of two assumptions: (1) the neural network is designed to be quiet during the waiting period before the reward, so that the last neurons to fire are the ones responsible for the reward, or (2) the firing patterns that are responsible for the reward are somehow preserved until the reward arrives, so that whatever neurons are firing at the moment of reward are the ones responsible for the reward. Neither assumption is suitable for BBDs, because BBDs are embedded in and operate in real-world environments and therefore receive inputs and produce behavior all the time, even during the waiting period before the reward.
With respect to DA modulation of synaptic plasticity, an important aspect is its enhancement of what is known as long-term potentiation (LTP) and long-term depression (LTD). For example, in the hippocampus of the brain, dopamine D1 receptor agonists enhance tetanus-induced LTP, but the enhancement effect disappears if the agonist arrives at the synapses 15-25 seconds after the tetanus. LTP in the hippocampal→prefrontal cortex pathway is enhanced by direct application of DA in vivo or by burst stimulation of the ventral tegmental area (VTA), which releases DA. Correspondingly, D1 receptor antagonists prevent the maintenance of LTP, whereas agonists promote it by blocking depotentiation, even when they are applied after the plasticity-triggering stimuli. DA has also been shown to enhance tetanus-induced LTD in layer 5 pyramidal neurons of the prefrontal cortex, and it gates corticostriatal LTP and LTD in striatal projection neurons.
Synaptic connections between neurons may be modified in accordance with what is known as the spike-timing-dependent plasticity (STDP) rule. STDP involves both LTP and LTD of synapses: firing of a presynaptic neuron immediately before firing of a postsynaptic neuron results in LTP of synaptic transmission, and the reverse order of presynaptic and postsynaptic firing results in LTD. It is reasonable to assume that the LTP and LTD components of STDP are modulated by DA in the same way as they are in the classical LTP and LTD protocols. That is, a particular order of firing induces a synaptic change (positive or negative), which is enhanced if extracellular DA is present during the critical window of a few seconds.
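By way of illustration only, the DA-modulated STDP rule described above may be sketched as follows. This is a minimal sketch under stated assumptions, not a definitive implementation: the exponential shape of the timing window, the constants A_PLUS, A_MINUS, TAU_STDP_MS, DA_WINDOW_S, and DA_GAIN, and the function names are all hypothetical and are not taken from the text. The text specifies only the sign of the change for each firing order and its enhancement when DA is present within a window of a few seconds.

```python
import math

# All constants are illustrative assumptions, not values from the text.
A_PLUS = 1.0         # assumed LTP amplitude (pre fires before post)
A_MINUS = 1.0        # assumed LTD amplitude (post fires before pre)
TAU_STDP_MS = 20.0   # assumed decay constant of the timing window, in ms
DA_WINDOW_S = 5.0    # assumed "critical window of a few seconds"
DA_GAIN = 2.0        # assumed multiplicative enhancement by DA


def stdp_change(t_pre_ms, t_post_ms):
    """Signed synaptic change for one pre/post spike pair.

    Pre before post gives a positive change (LTP); the reverse
    order gives a negative change (LTD).
    """
    dt = t_post_ms - t_pre_ms
    if dt > 0:
        return A_PLUS * math.exp(-dt / TAU_STDP_MS)
    if dt < 0:
        return -A_MINUS * math.exp(dt / TAU_STDP_MS)
    return 0.0


def da_modulated_change(t_pre_ms, t_post_ms, da_delay_s):
    """Enhance the STDP change when DA arrives within the critical window.

    da_delay_s is the time in seconds from the spike pairing to the
    arrival of DA, or None if no DA is released.
    """
    base = stdp_change(t_pre_ms, t_post_ms)
    if da_delay_s is not None and 0.0 <= da_delay_s <= DA_WINDOW_S:
        return DA_GAIN * base
    return base
```

For example, under these assumptions, da_modulated_change(0.0, 10.0, 1.0) returns an enhanced positive change because DA arrives one second after a pre-then-post pairing, whereas da_modulated_change(0.0, 10.0, 30.0) returns the unenhanced change because DA arrives after the assumed window has closed.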