Operant Conditioning
John E. R. Staddon and Yael Niv (2008), Scholarpedia, 3(9):2318. doi:10.4249/scholarpedia.2318, revision #91609
Operant conditioning (also known as instrumental conditioning) is a process by which humans and animals learn to behave in such a way as to obtain rewards and avoid punishments. It is also the name for the paradigm in experimental psychology by which such learning and action selection processes are studied.
Definitions
The behavior of all animals, from protists to humans, is guided by its consequences. The bacterium finds its way, somewhat inefficiently, up a chemical gradient; the dog begs for a bone; the politician reads the polls to guide his campaign. Operant conditioning is goal-oriented behavior like this.
These examples are instances of ontogenetic selection, that is guidance by consequences during the life of the individual. Other names for ontogenetic selection are instrumental or operant (B. F. Skinner’s term) conditioning.
Closely related to, and often thought to be a component of, operant conditioning is classical or Pavlovian conditioning. The prototypical example of Pavlovian conditioning is of course Pavlov and his dogs. In Pavlovian conditioning, the repeated pairing of a stimulus, such as Pavlov's bell, with an affectively important event, like the receipt of food, leads to the anticipatory elicitation of what is termed a conditioned response, such as salivation, when the bell is sounded. Unlike operant conditioning, in classical conditioning no response is required to get the food.
The distinction between Pavlovian and operant conditioning therefore rests on whether the animal only observes the relationships between events in the world (in Pavlovian conditioning), or whether it also has some control over their occurrence (in operant conditioning). Operationally, in the latter, outcomes such as food or shocks are contingent on the animal's behavior, whereas in the former they occur regardless of the animal's actions. However, the distinction between these two paradigms is more than technical -- in Pavlovian conditioning, changes in behavior presumably reflect innately specified reactions to the prediction of the outcomes, while operant learning is at least potentially about maximizing rewards and minimizing punishment. Consequently, Pavlovian and operant conditioning can differ in the behaviors they produce, their underlying learning processes, and the role of reinforcement in establishing conditioned behavior. The scientific study of operant conditioning is thus an inquiry into perhaps the most fundamental form of decision-making. It is this capacity to select actions that influence the environment to one's subjective benefit that marks intelligent organisms.
There is also phylogenetic selection – selection during the evolution of the species. Darwin’s natural selection is an example and behavior so evolved is often called reflexive or instinctive. Much reproductive and agonistic (aggressive/defensive) behavior is of this sort. It emerges full-blown as the animal matures and may be relatively insensitive to immediate consequences. Even humans (who should know better!) are motivated to sexual activity by immediate gratification, not the prospect of progeny, which is the evolutionary basis for it all.
The selecting consequences that guide operant conditioning are of two kinds: behavior-enhancing (reinforcers) and behavior-suppressing (punishers), the carrot and the stick, tools of parents, teachers – and rulers – since humanity began. When the dog learns a trick for which he gets a treat, he is said to be positively reinforced. If a rat learns to avoid an electric shock by pressing a lever, he is negatively reinforced. There is often ambiguity about negative reinforcement, which is sometimes confused with punishment – which is what happens when the dog learns not to get on the couch if he is smacked for it. In general, a consequence is called a reinforcer if it strengthens the behavior that led to it, and it is a punisher if it weakens that behavior.
History
The scientific study of operant conditioning dates from the beginning of the twentieth century with the work of Edward L. Thorndike in the U.S. and C. Lloyd Morgan in the U.K.
Thorndike's early experimental work as a graduate student, looking at cats escaping from puzzle boxes in William James' basement at Harvard, led to his famous "Law of Effect":
Of several responses made to the same situation, those which are accompanied or closely followed by satisfaction to the animal…will, other things being equal, be more firmly connected with the situation…; those which are accompanied or closely followed by discomfort…will have their connections with the situation weakened…The greater the satisfaction or discomfort, the greater the strengthening or weakening of the bond. (Thorndike, 1911, p. 244)
Thorndike soon gave up work with animals and became an influential educator at Columbia Teachers College. But the Law of Effect, which is a compact statement of the principle of operant reinforcement, was taken up by what became the dominant movement in American psychology in the first half of the twentieth century: Behaviorism.
The founder of behaviorism was John B. Watson at Johns Hopkins University. His successors soon split into two schools: Clark Hull at Yale and Kenneth Spence at Iowa were neo-behaviorists. They sought mathematical laws for learned behavior. For example, by looking at the performance of groups of rats learning simple tasks, such as discriminating the correct arm of a T-maze, they were led to the idea of an exponential learning curve and a learning principle of the form V(t+1) = V(t) + A(1 - V(t)), where V is response strength, A is a learning-rate parameter less than one, and t indexes successive trials (or small time steps).
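Iterated, this rule traces the familiar negatively accelerated learning curve. The sketch below is a minimal illustration with an arbitrary learning-rate value (not Hull's or Spence's actual model); it simply makes the exponential approach to asymptote explicit.

```python
# Minimal sketch of the linear-operator learning rule V(t+1) = V(t) + A*(1 - V(t)).
# The learning rate A and the number of trials are arbitrary illustrative choices.

def learning_curve(A=0.2, n_trials=15, V0=0.0):
    """Return response strength V after each of n_trials reinforced trials."""
    V, curve = V0, []
    for _ in range(n_trials):
        V = V + A * (1.0 - V)   # move a fixed fraction A of the remaining distance to asymptote
        curve.append(V)
    return curve

for t, v in enumerate(learning_curve(), start=1):
    print(f"trial {t:2d}: V = {v:.3f}")   # negatively accelerated, exponential approach to 1
```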
Soon, B. F. Skinner, at Harvard, reacted against Hullian experimental methods (group designs and statistical analysis) and theoretical emphasis, proposing instead his radical a-theoretical behaviorism. The best account of Skinner's method, approach and early findings can be found in a readable article -- "A case history in scientific method" -- that he contributed to an otherwise almost forgotten multi-volume project "Psychology: A Study of a Science" organized on positivist principles by editor Sigmund Koch. (A third major behaviorist figure, Edward Chace Tolman, on the West coast, was close to what would now be called a cognitive psychologist and stood rather above the fray.)
Skinner opposed Hullian theory and devised experimental methods that allowed learning animals to be treated much like physiological preparations. He had his own theory, but it was much less elaborate than Hull’s and (with one notable exception) he neither derived nor explicitly tested predictions from it in the usual scientific way. Skinner’s ‘theory’ was more an organizing framework than a true theory. It was nevertheless valuable because it introduced an important distinction between reflexive behavior, which Skinner termed elicited by a stimulus, and operant behavior, which he called emitted because when it first occurs (i.e., before it can be reinforced) it is not (he believed) tied to any stimulus.
The view of operant behavior as a repertoire of emitted acts, from which one is selected by reinforcement, immediately forged a link with the dominant idea in biology: Charles Darwin’s natural selection, according to which adaptation arises via selection from a population that contains many heritable variants, some more effective – more likely to reproduce – than others. Skinner and several others noted this connection, which has become the dominant view of operant conditioning. Reinforcement is the selective agent, acting via temporal contiguity (the sooner the reinforcer follows the response, the greater its effect), frequency (the more often these pairings occur, the better) and contingency (how well the target response predicts the reinforcer). It is also true that some reinforcers are innately more effective with some responses: flight is more easily conditioned as an escape response in pigeons than pecking, for example.
Contingency is easiest to describe by example. Suppose we reinforce with a food pellet every 5th occurrence of some arbitrary response such as lever pressing by a hungry lab rat. The rat presses at a certain rate, say 10 presses per minute, on average getting a food pellet twice a minute. Suppose we now give additional food pellets on a random basis, independent of the animal’s lever pressing. Will he press more, or less? The answer is less. This is an effect of weakening the contingency (Skinner’s usage) between lever pressing and food. Lever pressing is less predictive of food than it was before, because food sometimes occurs at other times.
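The effect can be quantified with the simple delta-p measure of contingency: the probability of food given a lever press minus the probability of food given no press. The numbers below are assumptions chosen to echo the example (one-second bins and an FR5-like payoff), not data from any particular experiment.

```python
# Toy contingency (delta-p) calculation: P(food | press) - P(food | no press), per 1-s bin.
# With every 5th press paid off, P(food | press) is roughly 1/5; "free" pellets arrive
# independently of pressing. All numbers are illustrative assumptions.

def delta_p(p_food_given_press=0.2, free_pellets_per_min=0.0):
    p_free = free_pellets_per_min / 60.0                      # chance of a free pellet in any second
    p_food_press = 1.0 - (1.0 - p_food_given_press) * (1.0 - p_free)
    p_food_no_press = p_free
    return p_food_press - p_food_no_press

for free in (0.0, 2.0, 10.0):
    print(f"{free:4.1f} free pellets/min: delta-p = {delta_p(free_pellets_per_min=free):.3f}")
```

The more free food is added, the smaller delta-p becomes, which is one simple way of expressing the weakened press-food contingency.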
Exactly how all this works is still not understood in full theoretical detail, but the empirical space – the effects on response strength (rate, probability, vigor) of reinforcement delay, rate and contingency – is well mapped.
Data: acquisition
What happens during operant conditioning? In Thorndike’s original experiments the experimenter simply put the animal, a cat, into a puzzle box (Figure 1) from which it could escape by making some arbitrary response such as pushing a pole or pulling a string. The experimenter intervened no further, allowing the animal to do what it would until, by chance, it made the correct response. The result was that, according to what has sometimes been called the principle of postremity, the tendency to perform the act closest in time to the reinforcement – opening of the door – is increased. This observation was the origin of Thorndike’s law of effect.
Notice that this account emphasizes the selective aspect of operant conditioning, the way the effective activity, which occurs at first 'by chance,' is strengthened or selected until, within a few trials, it becomes dominant. How learning is shaped and influenced by its consequences remains a focus of current research. Omitted is any discussion of where the successful response comes from in the first place. Why does the cat push the pole or pull the string for the first time, before it’s reinforced for anything? It is something of a historical curiosity that almost all operant-conditioning research has been focused on the strengthening effect of reinforcement and almost none on the question of origins: where the behavior comes from in the first place, the problem of behavioral variation, to pursue the Darwinian analogy.
Some light is shed on the problem of origins by Pavlovian conditioning, a procedure that has been studied experimentally even more extensively than operant conditioning. In the present context, perhaps the best example is something called autoshaping, which works like this: A hungry, experimentally naive pigeon (Figure 2) that has learned to eat from the food hopper (H) is placed in a Skinner box. Every 60 seconds or so, on average, the response key (K) lights up for 7 s. As soon as it goes off, the food hopper comes up for a second or two, allowing the bird to eat. No other behavior is required and nothing the bird does can make the food come any sooner. Nevertheless, after a few trials, the pigeon begins to show vigorous, stereotyped key-pecking behavior when the key light (called the conditioned stimulus: CS) comes on. Eventually, the pigeon is likely to peck the key even if a contingency is set up such that key-pecking causes the removal of the food. This conditioned response (CR) is an example of classical conditioning: behavior that emerges as a consequence of a contingent relationship between a stimulus, the CS, and a reinforcer – in this context termed the unconditioned stimulus (US).
Autoshaping, and a related phenomenon called superstitious behavior, has played an important role in the evolution of our understanding of operant conditioning. In the present context it illustrates one of the mechanisms of behavioral variation that generate behavior in advance of operant (i.e., consequential) reinforcement. A stimulus (like the CS) that predicts food generates, via built-in mechanisms, a repertoire that is biased towards food-getting behaviors – behaviors that in the evolution of the species have been appropriate in the neighborhood (both spatial and temporal) of food. Thus, the pigeon pecks the thing that predicts food (this is sometimes called sign-tracking), the dog salivates (or goes to his food bowl if released from the Pavlovian harness), the raccoon “washes” the token that is paired with food (instinctive drift), and so on. The usual conditioned response in classical conditioning experiments is what Skinner called a respondent, a reflexive response such as salivation, eyeblink or the galvanic skin response (GSR). But the autoshaping experiment shows that more complex, non-reflexive skeletal acts may also follow the same rules if the CS is “hot” enough, i.e., sufficiently predictive of food (brief, high-frequency, etc.).
The general principle that emerges from these experiments is that the predictive properties of the situation determine the repertoire, the set of activities from which consequential, operant, reinforcement can select. Moreover, the more predictive the situation, the more limited the repertoire might be, so that in the limit the subject may behave in a persistently maladaptive way – just so long as it gets a few reinforcers. Many of the behaviors termed instinctive drift are like this.
This finding is the basis for an old psychological principle known as the Yerkes-Dodson law: that performance on a learning task increases with "arousal" but only to a certain point. When levels of arousal become too high, performance will decrease; thus there is an optimal level of arousal for a given learning task. This bitonic relation seems to be the result of two opposed effects. On the one hand, the more predictive the situation the more vigorously the subject will behave – good. But at the same time, the subject's repertoire becomes more limited, so that for many tasks the effective response may cease to occur if the situation is too predictive and the subject too "aroused" – which is obviously bad.
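A toy calculation shows how two opposed monotonic effects can combine into an inverted-U. The particular functions for response vigor and repertoire breadth below are illustrative assumptions, not a quantitative model of arousal.

```python
import math

# Toy Yerkes-Dodson curve: performance = vigor (rises with arousal) x chance that the
# effective response is still in the (narrowing) repertoire. Both functions are arbitrary.

def performance(arousal):
    vigor = 1.0 - math.exp(-2.0 * arousal)      # grows and saturates with arousal
    repertoire = math.exp(-1.5 * arousal)       # repertoire narrows as arousal grows
    return vigor * repertoire

for a in (0.1, 0.3, 0.5, 0.8, 1.2, 2.0):
    print(f"arousal {a:.1f}: performance {performance(a):.3f}")   # rises, peaks, then falls
```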
Autoshaping was so named because it is often used instead of manual shaping by successive approximations, which is one of the ways to train an animal to perform a complex operant task. Shaping is a highly intuitive procedure that shows the limitations of our understanding of behavioral variation. The trainer begins by reinforcing the animal for something that approximates the target behavior. If we want the pigeon to turn around, we first reinforce any movement; then any movement to the left (say); then we wait for a more complete turn before giving food, and so on. But if the task is more complex than turning – if it is teaching a child to do algebra, for example – then the intermediate tasks that must be reinforced before the child masters the end goal are much less well defined. Should he do problems by rote in the hope that understanding eventually arrives? (And, if it does, why?) Should we strive for “errorless” learning, as many early teaching-machine programs did? Or should we let the pupil flounder, and learn from his mistakes? And just what is “understanding” anyway? (A few behaviorists deny there even is such a thing.)
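The logic of shaping – reinforce whatever is closest to the target, then tighten the criterion – can be sketched in a few lines. Treating "behavior" as a single number with Gaussian variation around a drifting mean is, of course, a gross simplification assumed purely for illustration.

```python
import random

# Schematic shaping by successive approximations. Emitted behavior is a number drawn
# around a drifting mean; each reinforced variant pulls the mean toward itself, and the
# criterion is then tightened slightly. All parameters are arbitrary illustrative choices.

def shape(target=1.0, n_trials=400, pull=0.3, seed=1):
    rng = random.Random(seed)
    mean, criterion = 0.0, 0.1            # where behavior starts; an initially easy criterion
    for _ in range(n_trials):
        emitted = rng.gauss(mean, 0.2)               # behavioral variation
        if emitted >= criterion:                     # close enough to the current target
            mean += pull * (emitted - mean)          # reinforcement shifts the repertoire
            criterion = min(target, criterion + 0.05)  # demand a better approximation next time
    return mean, criterion

mean, criterion = shape()
print(f"behavior now centered at {mean:.2f}; criterion reached {criterion:.2f} (target 1.0)")
```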
These examples show, I think, that understanding behavioral variation is one of the most pressing tasks for learning psychologists. If our aim is to arrive at knowledge that will help us educate our children, then the overwhelming emphasis in the history of this field on selection (reinforcement), which was once appropriate, may now be failing to address some of the most important unsolved problems. It should be said, however, that the study of operant conditioning is not aimed only at improving our education systems. The recent combination of operant conditioning with neuroscience methods for investigating the neural structures responsible for learning and the expression of behavior has contributed considerably to our current understanding of the workings of the brain. In this sense, even a partial understanding of how learning occurs once the sought-after behavior has spontaneously appeared is a formidable goal.
Data: steady state
Skinner made three seminal contributions to the way learning in animals is studied: the Skinner box (also called an operant chamber) -- a way to measure the behavior of a freely moving animal (Figure 2); the cumulative recorder -- a graphical way to record every operant response in real time; and schedules of reinforcement -- rules specifying how and when the animal must behave in order to get reinforcement. The combination of an automatic method to record behavior and a potentially infinite set of rules relating behavior, stimuli (e.g., the key color in the Skinner box – Figure 2) and reinforcement opened up schedules of reinforcement as a field of inquiry in its own right. Moreover, automation meant that the same animal could be run for many days, an hour or two a day, on the same procedure until the pattern of behavior stabilized.
The reinforcement schedules most frequently used today are ratio schedules and interval schedules. In interval schedules, the first response after an unsignaled, predetermined interval has elapsed is rewarded. The interval duration can be fixed (say, 30 seconds; FI30), drawn randomly from a distribution with a given mean, or determined by a rule -- ascending, descending or varying periodically, for example. If the generating distribution is the memoryless exponential distribution, the schedule is called a random interval (RI) schedule; otherwise it is a variable interval (VI) schedule. The first interval in an experimental session is timed from the start of the session, and subsequent intervals are timed from the previous reward.
In ratio schedules reinforcement is given after a predefined number of actions have been emitted. The required number of responses can be fixed (FR) or drawn randomly from some distribution (VR; or RR if drawn from a Geometric distribution). Schedules are often labeled by their type and the schedule parameter (the mean length of the interval or the mean ratio requirement). For instance, an RI30 schedule is a random interval schedule with the exponential waiting time having a mean of 30 seconds, and an FR5 schedule is a ratio schedule requiring a fixed number of five responses per reward.
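The basic schedule logic is compact enough to state as code. The sketch below (time in seconds, one call per response) is a bare-bones illustration rather than a replica of any laboratory control program; it implements a random-interval and a fixed-ratio schedule.

```python
import random

class RandomInterval:
    """RI schedule: the first response after an exponentially distributed wait is reinforced."""
    def __init__(self, mean_s, seed=0):
        self.mean_s = mean_s
        self.rng = random.Random(seed)
        self.armed_at = self.rng.expovariate(1.0 / mean_s)    # first interval timed from session start
    def respond(self, t):
        if t >= self.armed_at:                                 # reinforcer has been "set up"
            self.armed_at = t + self.rng.expovariate(1.0 / self.mean_s)  # next interval from this reward
            return True
        return False

class FixedRatio:
    """FR schedule: every nth response is reinforced."""
    def __init__(self, n):
        self.n, self.count = n, 0
    def respond(self, t):
        self.count += 1
        if self.count >= self.n:
            self.count = 0
            return True
        return False

# Probe both schedules with one response every 2 s for 10 minutes.
ri30, fr5 = RandomInterval(30.0), FixedRatio(5)
print("RI30 rewards:", sum(ri30.respond(t) for t in range(0, 600, 2)))   # roughly 600/30 = 20
print("FR5 rewards: ", sum(fr5.respond(t) for t in range(0, 600, 2)))    # 300 responses / 5 = 60
```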
Researchers soon found that stable or steady-state behavior under a given schedule is reversible; that is, the animal can be trained successively on a series of procedures – FR5, FI10, FI20, FR5,… – and, usually, behavior on the second exposure to FR5 will be the same as on the first. Moreover, the ease with which instances of a repeated response such as lever-pressing by a rat or key-pecking by a pigeon, can be counted (first by electromagnetic counters in the 1950s and ‘60s and then by digital computers), together with Skinner’s persuasive insistence on response probability as the basic datum for psychology, soon led to the dominance of response rate as the dependent variable of choice.
The apparently lawful relations to be found between steady-state response rates and reinforcement rates soon led to the dominance of the so-called molar approach to operant conditioning. Molar independent and dependent variables are rates, measured over intervals of a few minutes to hours (the time denominator varies). In contrast, the molecular approach – looking at behavior as it occurs in real time – has been rather neglected, even though the ability to store and analyze any quantity of anything and everything that can be recorded makes this approach much more feasible now than it was 40 years ago.
The best-known molar relationship is the matching law, first stated by Richard Herrnstein in 1961. According to this law, if two response options are concurrently available (say, right and left levers), each associated with a separate interval schedule, animals ‘match' the ratio of responding on the two levers to the ratio of their experienced reward rates. For instance, when one lever is reinforced on an RI30 schedule while the other is reinforced on an RI15 schedule, rats will press the latter lever roughly twice as fast as the former.
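In symbols, strict matching is B1/B2 = R1/R2, where B denotes response rate and R the obtained reinforcement rate. A quick check of the RI30/RI15 example follows, assuming for illustration that obtained rates approximate the programmed rates of 2 and 4 rewards per minute.

```python
# Strict matching: B1/B2 = R1/R2, i.e., the fraction of responses on option 1 equals
# R1/(R1 + R2). Programmed rates are used here as a stand-in for obtained rates.

def matching_share(r1, r2):
    """Predicted fraction of responding allocated to option 1 under strict matching."""
    return r1 / (r1 + r2)

r_ri15 = 60.0 / 15.0    # about 4 rewards/min on the RI15 lever
r_ri30 = 60.0 / 30.0    # about 2 rewards/min on the RI30 lever
print(f"predicted share of presses on the RI15 lever: {matching_share(r_ri15, r_ri30):.2f}")  # 0.67
```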
Although postulated as a general law relating response rate and reinforcement rate, it turned out that the matching relationship is actually far from being universally true. In fact, the matching relationship can be seen as a result of the negative-feedback properties of the choice situation (the concurrent variable-interval schedule) in which it is measured. Because the probability that a given response will be reinforced on a VI schedule declines the more responses are made – and increases with time away from the schedule – almost any reward-following process yields matching on concurrent VI VI schedules. Hence matching by itself tells us little about what process is actually operating and controlling behavior.
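The point can be demonstrated by simulation. Below is a sketch of one arbitrary "reward-following" rule (each collected reward nudges choice probability toward the just-rewarded side) responding once per second on two concurrent RI schedules; the parameter values are assumptions, and this is only one of many such rules that settle near matching.

```python
import random

# A simple reward-following learner on concurrent RI schedules. Each reward shifts the
# probability of choosing the rewarded side a little toward that side. Because an unvisited
# RI/VI schedule keeps "holding" its armed reinforcer, switching away eventually pays off,
# and the rule settles where response ratios roughly equal obtained reward ratios (matching).

def run(mean1=30.0, mean2=15.0, n_steps=100_000, lr=0.01, seed=0):
    rng = random.Random(seed)
    armed = [rng.expovariate(1.0 / mean1), rng.expovariate(1.0 / mean2)]  # times reinforcers are set up
    p1 = 0.5                                    # probability of choosing option 1
    responses, rewards = [0, 0], [0, 0]
    for t in range(n_steps):                    # one response per second
        choice = 0 if rng.random() < p1 else 1
        responses[choice] += 1
        if t >= armed[choice]:                  # a set-up reinforcer is collected
            rewards[choice] += 1
            mean = mean1 if choice == 0 else mean2
            armed[choice] = t + rng.expovariate(1.0 / mean)
            p1 += lr * ((1.0 if choice == 0 else 0.0) - p1)   # follow the reward
    return responses, rewards

responses, rewards = run()
print("response ratio 1:2 =", round(responses[0] / responses[1], 2))
print("reward ratio   1:2 =", round(rewards[0] / rewards[1], 2))   # the two ratios come out similar
```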
And indeed, molecular details matter. Matching behavior depends on the animal’s tendency to switch between keys. If pigeons are first trained on each choice separately and then allowed to choose, they do not match: they pick the richer schedule exclusively. Conversely, a pigeon trained from the start with two choices will match poorly or not at all (i.e., tend towards indifference, irrespective of the two VI values) unless switching is explicitly penalized with a changeover delay (CoD) that prevents immediate reinforcement of switching. The CoD may be implicit in the need to travel between two distant levers, as with foraging on distant food patches, or explicitly imposed by requiring a certain number of responses, or a minimal amount of time, to pass before the schedule on the switched-to option resumes. Moreover, the degree of matching depends to some extent on the size of this penalty.
Finally, the fact that steady-state behavior is often reversible does not mean that the animal’s state is equally reversible. The pigeon on second exposure to FR5 is not the same as on first exposure, as can readily be shown by between-group experiments where (for example) the effects of extinction of the operant response or transfer of learning to a new task are measured. Animals with little training (first exposure) behave very differently from animals with more and more varied training (second exposure). There are limits, therefore, to what can be learned simply by studying supposedly reversible steady-state behavior in individual organisms. This approach must be supplemented by between-group experiments, or by sophisticated theory that can take account of the effect on the individual animal of its own particular history. There are also well-documented limits to what can be learned about processes operating in the individual via the between-group method that necessarily requires averaging across individuals. And sophisticated theory is hard to come by. In short, there is no royal road, no algorithmic method, that shows the way to understanding how learning works.
Theory
Most theories of steady-state operant behavior are molar and are derived from the matching law. These tend to restrict themselves to descriptive accounts of experimental regularities (including mathematical accounts, such as those suggested by Peter Killeen). The reason can be traced back to B. F. Skinner, who argued all-too-persuasively against any explanation of behavior “which appeals to events taking place somewhere else, at some other level of observation, described in different terms, and measured, if at all, in different dimensions.” (Skinner, 1950, p. 193 – although Skinner was in fact very unhappy with the shift away from real-time behavior to molar averages.) There are many arguments about what Skinner actually meant by this stricture, but its effect was to limit early operant theory largely to variations on the theme of response rates as a function of reinforcement rates. A partial exception is the extensive study of interval timing, which began in the early 1960s but blossomed in the 1980s and later with the injection of cognitive/psychophysical theory. Most timing theories do refer to “events taking place somewhere else”, be they internal clocks, brain oscillations or memory processes.
Associative theories of operant conditioning, concerned with underlying associations and how they drive behavior, are not as limited by the legacy of Skinner. These theoretical treatments of operant learning are interested in the question: What associative structure underlies the box-opening sequence performed by the cat in Figure 1? One option, espoused by Thorndike (1911) and Skinner (1935), is that the cat has learned to associate this particular box with this sequence of actions. According to this view, the role of the desired outcome (the opening of the box’s door) is to “stamp in” such stimulus-response (S-R) associations. A different option, advocated by Tolman (1948) (and later demonstrated by Dickinson and colleagues), is that the cat has learned that this sequence of actions leads to the opening of the door, that is, an action-outcome (A-O) association. The critical difference between these two views is the role of the reinforcer: in the former it has a role only in learning, and once the behavior is learned it is rather independent of the outcome or its value; in the latter the outcome is directly represented in the association controlling behavior, and thus behavior should be sensitive to changes in the value of the outcome. For instance, if a dog is waiting outside the box, so that opening the door is no longer a desirable outcome for the cat, S-R theory predicts that the cat will nevertheless perform the sequence of actions that leads to the door opening, while A-O theory predicts that the cat will refrain from this behavior. Research in the last two decades has convincingly shown that both types of control structure exist. In fact, operant behavior can be subdivided into two sub-classes, goal-directed and habitual behavior, based exactly on this distinction.
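The contrast can be made concrete computationally: an S-R (habit) controller caches a strength for the response itself, while an A-O (goal-directed) controller evaluates the response through its predicted outcome and that outcome's current value. The toy sketch below (assumed structure and numbers, not a model of Thorndike's cats) shows why only the A-O controller reacts immediately when the outcome is devalued.

```python
# S-R versus A-O control, schematically. The habit (S-R) system stores a cached strength
# for the response in this situation; the goal-directed (A-O) system looks up which outcome
# the response produces and how much that outcome is worth right now. Values are illustrative.

cached_strength = {("puzzle box", "pull string"): 0.9}           # stamped in by past reinforcement
action_outcome  = {("puzzle box", "pull string"): "door opens"}  # learned action-outcome relation
outcome_value   = {"door opens": 1.0}                            # current worth of the outcome

def sr_value(state, action):
    return cached_strength[(state, action)]                      # blind to the outcome's current value

def ao_value(state, action):
    return outcome_value[action_outcome[(state, action)]]        # routed through the outcome

print("before devaluation: S-R =", sr_value("puzzle box", "pull string"),
      " A-O =", ao_value("puzzle box", "pull string"))

outcome_value["door opens"] = 0.0   # a dog now waits outside: the opening door is no longer desirable
print("after devaluation:  S-R =", sr_value("puzzle box", "pull string"),
      " A-O =", ao_value("puzzle box", "pull string"))
```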
More modern theories focus on the core problem of assignment of credit, a term from artificial intelligence that refers to the question: “A reinforcer (or a punisher) has occurred; what caused it?” Was it me (the acting individual) or something else? If it was me, what did I do? Historically, interest in assignment of credit arrived rather late on the scene. Early learning theories, based largely on data from apparatus like the T-maze and the jumping stand, were concerned with the growth of ‘response strength’ over trials in situations where neither animal nor experimenter had any doubt about what the relevant responses were. But there is a growing realization that assignment of credit is the question an operant conditioning process must answer. There are now a few theories of credit assignment (notably, those from the field of reinforcement learning). Most assume a set of pre-defined, emitted operant responses that compete in winner-take-all fashion.
Most generally, current theories of operant learning can be divided into three main types -- those that attempt to accurately describe behavior (descriptive theories), those that are concerned with how the operant learning is realized in the brain (biologically inspired theories), and those that ask what is the optimal way to solve problems like that of assigning credit to actions, and whether such optimal solutions are indeed similar to what is seen in animal behavior (normative theories). Many of the theories in recent years are computational theories, in that they are accompanied by rigorous definitions in terms of equations for acquisition and response, and can make quantitative predictions.
The computational field of reinforcement learning has provided a normative framework within which both Pavlovian and operant conditioned behavior can be understood. In this framework, optimal action selection is based on predictions of long-run future consequences, such that decision making is aimed at maximizing rewards and minimizing punishment. Neuroscientific evidence from lesion studies, pharmacological manipulations and electrophysiological recordings in behaving animals has further provided tentative links to neural structures underlying key computational constructs in these models. Most notably, much evidence suggests that the neuromodulator dopamine provides basal ganglia target structures with a reward prediction error that can influence learning and action selection, particularly in stimulus-driven instrumental behavior.
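The central computational construct here is the reward prediction error. The sketch below is a minimal tabular, actor-critic-style learner in which a single error term, delta = r - V, both updates the value of the situation and adjusts the propensity to repeat the action just emitted; the actions, payoff probabilities and parameters are illustrative assumptions, and no claim is made about the actual biological circuitry.

```python
import math
import random

# Minimal prediction-error (actor-critic flavored) learning, for illustration only.
# delta = r - V plays the role the text attributes to phasic dopamine: it trains the
# critic (value of the situation) and the actor (propensity to repeat the emitted action).

def softmax(prefs):
    m = max(prefs.values())
    exps = {a: math.exp(p - m) for a, p in prefs.items()}
    z = sum(exps.values())
    return {a: e / z for a, e in exps.items()}

def train(n_trials=2000, alpha=0.1, seed=0):
    rng = random.Random(seed)
    V = 0.0                                              # critic: predicted reward in this situation
    prefs = {"press lever": 0.0, "groom": 0.0}           # actor: action propensities
    payoff = {"press lever": 0.8, "groom": 0.1}          # assumed reward probabilities
    for _ in range(n_trials):
        probs = softmax(prefs)
        action = rng.choices(list(probs), weights=list(probs.values()))[0]
        r = 1.0 if rng.random() < payoff[action] else 0.0
        delta = r - V                                    # reward prediction error
        V += alpha * delta                               # update the prediction
        prefs[action] += alpha * delta                   # credit (or blame) the emitted action
    return V, prefs

V, prefs = train()
print("learned value of situation:", round(V, 2))
print("action propensities:", {a: round(p, 2) for a, p in prefs.items()})
```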
In all these theories, however, nothing is said about the shaping of the response itself, or response topography. Yet a pigeon pecking a response key on a ratio schedule soon develops a different topography than the one it shows on a VI schedule. Solving this problem requires a theory whose elements are neural, or are at least hypothetically linked to overt behavior. Different topographies then correspond to different patterns of such elements. The patterns in turn are selected by reinforcement. A few such theories have recently emerged.
Finally, it is interesting in this respect that even very simple animals show some kind of operant and classical conditioning. A recent study purported to show discrimination learning in the protist Paramecium, for example; and certainly a simple kind of operant behavior, if not discrimination learning, occurs even in bacteria. Thus, the essentials of operant conditioning need not depend on specific neural structures. On the other hand, neural networks are powerful computing devices, and some neurally based theories now embrace a wide range of experimental data, matching rat behavior well and rat neurophysiology, at least at a gross-anatomy level, reasonably well.
References and recommended reading
Historical
- Bindra, D. (1974) A motivational view of learning, performance and behavior modification. Psychological Review, 81, 199-213.
- Boakes, R. (1984) From Darwin to behaviourism: psychology and the minds of animals. Cambridge University Press.
- Ferster, C. B., & Skinner, B. F. (1957) Schedules of reinforcement. New York: Appleton-Century-Crofts.
- Herrnstein, R. J. (1997) The matching law: Papers in psychology and economics. Harvard University Press.
- Honig, W. K. (1966) Operant conditioning: Areas of research and application. New York: Appleton-Century-Crofts.
- Honig, W. K. & Staddon, J. E. R. (Eds.) (1977) The handbook of operant behavior. Englewood Cliffs, NJ: Prentice-Hall.
- Skinner, B. F. (1938) The behavior of organisms. New York: Appleton-Century-Crofts.
- Skinner, B. F. (1956) A case history in scientific method. American Psychologist, 11, 221-233. Also published in Psychology: A study of a science. Vol. II, S. Koch (Ed.) New York, McGraw-Hill, 1958.
- Thorndike, E. L. (1911) Animal intelligence: Experimental studies. Macmillan.
- Tolman, E. C. (1948) Cognitive maps in rats and men. Psychological Review, 55, 189-208.
Contemporary
- Balleine, B. W. (2000) Incentive processes in instrumental conditioning. In R. Mowrer & S. Klein (Eds.), Handbook of contemporary learning theories (p. 307-366). Mahwah, NJ: Lawrence Erlbaum Associates.
- Barto, A. G. (1995) Adaptive critic and the basal ganglia. In: Houk, J. C.; Davis, J. L. & Beiser, D. G. (eds.), Models of information processing in the basal ganglia, Cambridge: MIT Press, 215-232.
- Barto, A. G., Sutton, R. S., & Anderson, C.W. (1983) Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man and Cybernetics, 13, 834-846.
- Davison, M., & McCarthy, D. (1988) The matching law: A research review. Hillsdale, NJ: Erlbaum.
- Dickinson, A. (1985) Actions and habits: The development of behavioural autonomy. Philosophical Transactions of the Royal Society of London, Series B, Biological Sciences, 308(1135), 67-78.
- Dickinson, A., & Balleine, B. W. (2002) The role of learning in the operation of motivational systems. In Gallistel, C.R. (Ed.), Learning, motivation and emotion (Vol. 3, p. 497-533). New York: John Wiley & Sons.
- Killeen, P., & Sitomer, M. (2003) MPR. Behavioral Processes, 62(1-3), 49–64.
- Montague, P. R., Dayan, P., & Sejnowski, T. J. (1996) A framework for mesencephalic dopamine systems based on predictive Hebbian learning. Journal of Neuroscience, 16(5), 1936-1947. (The seminal paper connecting dopamine in the basal ganglia with operant learning.)
- Schultz, W., Dayan, P. & Montague, P. R. (1997) A neural substrate of prediction and reward. Science 275, 1593-1599.
- Staddon, J. E. R. (2001) Adaptive dynamics: The theoretical analysis of behavior. Cambridge, MA: MIT/Bradford. Pp. xiv, 1-423.
- Staddon J. E. R. (2003/1983) Adaptive Behavior and Learning. Cambridge: Cambridge University Press.
- Staddon, J. E. R. & Cerutti, D. T. (2003) Operant behavior. Annual Review of Psychology, 54:115-144.
- Sutton, R. S. & Barto, A. G. (1998) Reinforcement learning: An introduction. MIT Press.
Internal references
- Tony J. Prescott (2008) Action selection. Scholarpedia, 3(2):2705.
- Peter Redgrave (2007) Basal ganglia. Scholarpedia, 2(6):1825.
- Valentino Braitenberg (2007) Brain. Scholarpedia, 2(11):2918.
- Nestor A. Schmajuk (2008) Classical conditioning. Scholarpedia, 3(3):2316.
- Howard Eichenbaum (2008) Memory. Scholarpedia, 3(3):1747.
- Rodolfo Llinas (2008) Neuron. Scholarpedia, 3(8):1490.
- Florentin Woergoetter and Bernd Porr (2008) Reinforcement learning. Scholarpedia, 3(3):1448.
- Wolfram Schultz (2007) Reward. Scholarpedia, 2(3):1652.
- Wolfram Schultz (2007) Reward signals. Scholarpedia, 2(6):2184.