# User:Gerald Tesauro/Proposed/Td-gammon

Jump to: navigation, search
Figure 1: A sample learning curve of one of the original nets of (Tesauro, 1992), containing 10 hidden units, showing playing strength as a function of the number of self-play training games. Performance is measured by expected points per game (ppg) won or lost against a benchmark opponent (Sun Microsystems' Gammontool program).

TD-Gammon is a machine learning program developed in the early 1990s by IBM researcher Gerald Tesauro that was able to teach itself to play backgammon solely by playing against itself and learning from the results. TD-Gammon combined neural networks with reinforcement learning. Specifically, temporal difference learning was combined with with gradient-descent, i.e., error back-propagation, in order to train the weights of a multi-layer perceptron to estimate the expected outcome of a given backgammon position. Starting from random initial play, TD-Gammon's self-teaching methodology produced a surprisingly strong program: without lookahead, its positional judgement rivaled that of human experts, and when combined with shallow lookahead, it reached a level of play that surpassed even the best human players.

TD-Gammon was regarded as a breakthrough achievement in machine learning and computer game-playing research, and inspired numerous subsequent applications of reinforcement learning in domains such as elevator dispatch, job-shop scheduling, cell-phone channel assignment, assembly line optimization and production scheduling, financial trading systems and portfolio management, and call admission and routing in telecommunications networks. Additionally, the program's innovative style of play and rollout analyses sparked a revolution in concepts and strategies used by human expert backgammon players. This revolution was further accelerated with the commercial release of two world-class neural net programs, Jellyfish and Snowie, directly inspired by TD-Gammon. The new knowledge generated by these neural nets was widely disseminated, resulting in enormous improvements in the level of play in backgammon tournaments in recent years.

## Prior History

TD-Gammon built upon three different historic lines of research. The first aimed at developing self-play learning algorithms that are able to solve the temporal credit assignment problem, i.e., apportioning credit or blame to each of the moves made during a self-play game so as to improve the quality of play. This research originated in the 1950s with the work of Arthur Samuel on checkers. Samuel's work was provided a much stronger mathematical foundation through subsequent research on reinforcement learning (RL) and temporal difference learning. The most notable contribution was by Rich Sutton, who coined the term "temporal difference learning" and invented the TD($$\lambda$$) learning algorithm (Sutton, 1988) used in TD-Gammon.

The second line of research addressed the training of artificial neural networks as nonlinear function approximators for classification and regression. There was great interest and excitement in the 1980s regarding this topic, motivated most strongly by the work of Geoffrey Hinton, David Rumelhart and Ronald Williams (Hinton et al., 1987) on gradient-based error back-propagation algorithms for training feed-forward neural networks, i.e., multi-layer perceptrons.

The third line of research focused on developing evaluation functions and algorithms for computer backgammon. The first serious effort in computer backgammon was undertaken in the 1970s by Hans Berliner: his program BKG made use of a hand-tuned evaluation function, comprising a set of hand-crafted features that were developed over several years in consultation with top backgammon experts (Berliner, 1980). BKG was thought to play at a very strong intermediate to weak advanced level.

Prior to developing TD-Gammon, Tesauro had worked on supervised learning of backgammon evaluation functions, initially in collaboration with Terence Sejnowski. (Tesauro and Sejnowski, 1987) used an approach for training on a dataset of human-graded moves, in which the best move in a given position (according to the human expert) was assigned a score of +1 and various sub-optimal moves were asssigned lesser scores, ranging to a minimum value of -1 for the worst possible move. This training methodology achieved an intermediate level of play. Subsequently, (Tesauro, 1988) developed a substantially improved "comparison training" methodology which trained a neural network to score the expert-selected move higher than any other candidate move. This approach, combined with a set of hand-crafted features based on Berliner's work, led to Tesauro's Neurogammon program, which won the backgammon competition at the first International Computer Olympiad (Tesauro, 1990). Neurogammon was thought to play at a strong intermediate level and was probably a bit weaker than Berliner's BKG program.

## Self-Play Learning Methodology

The training procedure for TD-Gammon is as follows: in each self-play training game, the neural network observes a sequence of board positions starting at the opening position and ending in a terminal position characterized by one side having removed all its checkers. The sequence of board positions is obtained by rolling dice for the side to move, and using TD-Gammon's current evaluation function to select a move based on a simple 1-ply search: every legal move is scored and the highest scoring move is selected. The network begins training with random initial weights, which effectively implement a random initial move policy. Such a policy is not only exceedingly poor, it also leads to games that can easily take much longer than normal (several hundred to several thousand moves) to complete.

The board positions are fed sequentially as input vectors $$x_1 , x_2 , ..., x_f$$ to the neural network, encoded using a representation scheme that is described below. For each input pattern $$x_t$$ there is a neural network output vector $$Y_t$$ indicating the neural network's estimate of expected outcome for pattern $$x_t\ .$$ Here $$Y_t$$ is a four-component vector corresponding to the four possible outcomes of either White or Black winning either a normal win or a double-value gammon. (Due to the extreme rarity of occurrence, triple-value backgammons were not represented.) At each time step, a gradient-descent variant of the TD($$\lambda$$) algorithm is applied to change the network's weights. The formula for the weight change is as follows:

$$\tag{1} w_{t+1} - w_t = \alpha ( Y_{t+1} - Y_t ) \sum_{k=1}^t \lambda^{t-k} \nabla_w Y_k$$

where $$\alpha$$ is a small constant (commonly thought of as a "learning rate" parameter), $$w$$ is the vector of weights that parameterizes the network, and $$\nabla_w Y_k$$ is the gradient of network output with respect to weights. (Note that equation (1) expresses the weight change due to a single output unit. In cases where there are multiple output units, the right-hand side of equation (1) should be modified by summing over each individual output unit.)

The quantity $$\lambda$$ is a heuristic parameter controlling the temporal credit assignment of how an error detected at a given time step feeds back to correct previous estimates. When $$\lambda = 0\ ,$$ no feedback occurs beyond the current time step, while when $$\lambda = 1\ ,$$ the error feeds back without decay arbitrarily far in time.

Empirically, small-to-moderate values of $$\lambda$$ gave about equally good asymptotic performance, whereas the performance degraded for large values of $$\lambda$$ close to 1. In the initial experiments reported in (Tesauro, 1992) a value of $$\lambda = 0.7$$ was used. Subsequent development of TD-Gammon mostly used $$\lambda = 0\ ,$$ which has a principal merit of requiring about a factor of two less computation per time step.

At the end of each game, a final reward signal $$z$$ (containing four components as described previously) is given, based on the outcome of the game. Once again equation (1) is used to change the weights, except that the difference $$( z - Y_f )$$ is used instead of $$( Y_{t+1} - Y_t )\ .$$ The trained network's output is naturally interpreted as an estimate of expected outcome, or "equity" of the position. This interpretation is exact in cases where TD($$\lambda$$) has been proven to converge to the optimal value function.

## Input Encoding

The input representation in the initial experiments of (Tesauro, 1992) only encoded the raw board information, i.e., a list of non-negative integers ranging from 0 to 15, indicating the number of White or Black checkers at each board location. Tesauro encoded this information using 99 input units to represent White state and 99 to represent Black state (198 total). The 99 input units representing White or Black state comprise four units at each of the 24 board locations (96 total) indicating the number of checkers per location using a truncated unary encoding, plus two units indicating number of checkers borne off and on the bar (normalized by dividing by 15 and 2 respectively), and one binary unit indicating if it is the side's turn to move.

The truncated unary encoding followed common practice in neural networks of using unary encodings of integer data, and the truncation was imposed primarily to economize on the total number of input units. Letting $$n$$ denote the number of checkers at a location, three of the four units represent binary indicator functions $$(n==1,n\ge 2,n==3)\ .$$ (Note that the $$n\ge 2$$ condition corresponds to the definition of having "made" a point that cannot be occupied by the opponent.) The fourth unit is set to zero unless $$n\ge 4\ ,$$ in which case it is set to $$0.5*(n-3)\ .$$

Tesauro stressed that the above encoding is "knowledge-free" in that no knowledge of expert concepts or strategies was built in to the input representation. In further experiments, the hand-crafted features used in Neurogammon were added to the representation. These features encoded standard concepts used by human experts such as "advanced anchor," "blockade strength," "home board strength" and the probability of a "blot" (single checker) being hit. The use of such features resulted in significantly improved performance.

## Move Selection Search Algorithms

Version 1.0 of TD-Gammon made real-time move decisions using simple 1-ply search, in which every top-level move is scored by the neural net, and the highest-scoring move is selected.

Versions 2.0 and 2.1 implemented a 2-ply search algorithm works as follows: First, an initial 1-ply analysis is performed and unpromising candidates are pruned based on the 1-ply score. (This is commonly known as forward pruning.) Then, the remaining top-level candidates are expanded by an additional ply. The 1-ply expansion of the surviving candidates involves making a 1-ply move decision for each of the opponent's 21 possible dice rolls, and computing a probability-weighted average score (weighting non-doubles twice as much as doubles) for each of the resulting states.

Versions 3.0 and 3.1 performed simplified 3-ply search. This is similar to the 2-ply search described above, except that a depth-2 expansion of the top level moves is performed, rather than a depth-1 expansion. The depth-2 expansion consists first doing a depth-1 expansion of the 21 dice rolls as above, selecting a move for each dice roll, and then doing an additional depth-1 expansion of the 21 followup dice rolls. In other words, a total of 441 two-roll sequences are examined, in which a 1-ply move decision is made by each player, and the score backed up to the top-level move is the probability-weighted average score of the 441 resulting successor states. This gives a huge speed advantage over full-width minimax backup, while still producing a significant boost in move quality relative to 2-ply search.

## Doubling Algorithm

Modern backgammon as played in Western countries features the use of a doubling cube. Before rolling the dice, a player may offer to double the stakes of the game. The opponent may either accept ("take") the double, and thereby obtain exclusive right to make the next double, or may decline ("pass") the double and forfeit the current stake.

TD-Gammon's neural network, which was trained on cubeless play, was adapted to make doubling and take/pass decisions by feeding its cubeless equity estimates into theoretically-based heuristic formulae. As detailed in (Tesauro, 2002) these formulae were obtained by generalizing earlier theoretical work on optimal doubling strategies (Keeler and Spencer, 1975; Zadeh and Kobliska, 1977) that assumed no-contact races and simple wins only (no gammons or backgammons). These papers computed optimal doubling and fold points as a function of race length by interpolating between limiting cases of infinite-length races and last-roll positions.

Tesauro reformulated these works by interpolating based on volatility, i.e., standard deviation in expected equity averaging over the next two dice rolls. He also allowed for gammons by considering fluctuations in a four-component probability vector $$\vec{p}\ ,$$ whose components sum to 1 (lying in a three-dimensional unit simplex), rather than a simple scalar probability of winning $$p\ .$$ In this higher-dimensional analysis, there are optimal doubling and fold surfaces lying within the three-dimensional unit simplex. These surfaces were approximated by planes defined by intersection points with the edges of the simplex. Each intersection point corresponds to eliminating one of White's and one of Black's possible winning outcomes, leaving a binary game where either White wins $$K$$ points or Black wins $$L$$ points. There are four possible combinations of $$(K,L)\ :$$ (1,1), (1,2), (2,1) and (2,2). For each $$(K,L)$$ combination, Tesauro computed the low-volatility double and fold points using the Keeler-Spencer formalism, and the set of four doubling and fold points yielded well-defined doubling and fold planar surfaces in the interior of the simplex.

In the high-volatility limit, there are likewise well-defined doubling and fold planar surfaces, corresponding to the cubeless equity equaling zero and -0.5 respectively. For positions with intermediate volatility, Tesauro devised an interpolation scheme by converting the Zadeh-Kobliska formula based on race length to an equivalent formula based on volatility.

Having defined a low-volatility and a high-volatility doubling surface and fold surface, TD-Gammon makes doubling decisions and take-pass decisions as follows:

• 1. Use the neural net to estimate the volatility $$v$$ and the cubeless outcome vector $$\vec{p}$$ of the position.
• 2. Given $$v\ ,$$ compute the interpolated doubling, redoubling, and fold surfaces using the converted Zadeh-Kobliska formulae.
• 3. Determine which side of the interpolated surfaces $$\vec{p}$$ lies on. This determines the double, redouble, and take/pass decisions.

TD-Gammon also employed a similar calculation of a "veto" surface, beyond which the state is too good to double, and the player should play on in the hopes of winning a gammon.

## Performance Results

### Version 0.0 (1991)

The initial experiments of (Tesauro, 1992) utilizing a knowledge-free input encoding were performed in 1991 on a high-end RS/6000 workstation of that era. The largest network studied had 40 hidden units and was trained for 200,000 games, requiring two weeks of CPU time. These experiments, which were later referred to by Tesauro as "version 0.0" of TD-Gammon, showed that a surprising amount of learning took place, and that increasing the number of hidden units consistently yielded substantially stronger playing ability.

A sample curve illustrating the progress of learning is shown inFigure 1. Performance is measured by periodic benchmarking in simple 1-ply search mode of expected points per game (ppg) against a fixed opponent, Sun Microsystems' Gammontool program. The initial random strategy loses nearly every game against Gammontool, and nearly every loss is a double-value gammon. In the first few thousand training games, the network learns a number of elementary principles, such as hitting the opponent, playing safe, and building new points. More sophisticated context-sensitive concepts (e.g., slotting home board points in certain situations but not in others) emerged later, after several tens of thousands of training games. The end of learning is characterized by a long slow asymptote to peak performance, which ends up being significantly better than Gammontool.

The network with 40 hidden units reached a strong intermediate level of play approximately equal to Neurogammon. An examination of the input-to-hidden weights in this network revealed interesting spatially organized patterns of positive and negative weights, roughly corresponding to what a knowledge engineer might call useful features for game play. Thus the neural networks appeared to be capable of automatic "feature discovery," one of the long-standing goals of game learning research since the time of Samuel.

In 2007, Tesauro reran his original 1991 experiments on a 3.0GHz laptop, which does floating-point arithmetic several thousand times faster than the 1991 workstation. This allowed much longer runs of 16 million training games, achieving substantially higher performance, as shown inFigure 1. These networks were benchmarked against Pubeval, a simple linear evaluation function released by Tesauro to provide a benchmarking standard. While weak in human terms, Pubeval is about 0.2 ppg stronger than Gammontool and thus provides a better benchmarking opponent.

Figure 2: Rerun of experiments of (Tesauro, 1992) for 16 million self-play training games. Each data point represents the result of 40k games played against Pubeval. Black lines represent moving averages of performance for each network.

Figure 1 shows that performance may continue to improve even after millions of self-play training games, particularly for the larger nets. After 16 million games, the networks with 10 and 20 hidden units have clearly reached peak performance, and most likely the 40 hidden unit net is close to a plateau. However, an 80 hidden unit net (which was not originally studied) appears to be still improving after 16 million games.

A comparison of the peak observed performance (ppg vs. Pubeval) of each neural net inFigure 1 with its 1991 counterpart is given in following table. The new results are consistently more than 0.1 ppg better, and in the case of the 80 hidden unit net, may be close to reaching an expert level of play. (The score of +0.535 ppg is within a few percent of equaling TD-Gammon 2.1, which scores +0.596 ppg vs. Pubeval in its 1-ply search mode.) This suggests that the use of hand-crafted features are perhaps not as critical to achieving expert-level performance as originally believed.

Hidden Units 1991 perf 2007 perf
10 +0.033 ppg +0.192 ppg
20 +0.202 ppg +0.315 ppg
40 +0.297 ppg +0.451 ppg
80 --- +0.535 ppg

Comparison of peak benchmark performance vs. Pubeval in 1991 and 2007 experiments.

### Version 1.0 (1991)

Version 1.0 and subsequent versions of TD-Gammon included Neurogammon's hand-crafted features as part of the input encoding. TD-Gammon 1.0 contained 80 hidden units and was trained for 300,000 self-play games, requiring a month of CPU time. In late 1991, it played a total of 51 games against three world-class human grandmasters: Bill Robertie and Paul Magriel, both noted authors and highly respected former World Champions, and Malcolm Davis, the 11th-highest rated player in the world at the time. TD-Gammon achieved a very respectable net loss of 13 points, for an average loss rate of about one-quarter point per game. Bill Robertie's assessment at the time (Robertie, 1992) was that TD-Gammon had reached the level of a competent advanced player which was clearly better than any previous backgammon program.

### Version 2.0 (1992)

TD-Gammon 2.0 had much greater training experience (800k self-play games) than version 1.0. Benefiting from both hardware and software speedups, it was also able to make move decisions in real time using a 2-ply search algorithm. The program made its public debut at the 1992 World Cup of Backgammon tournament. In 38 exhibition games against top human players such as Kent Goulding, Kit Woolsey, Wilcox Snellings, former World Cup Champion Joe Sylvester, and former World Champion Joe Russell, the program had a net loss of only 7 points.

### Version 2.1 (1993)

TD-Gammon 2.1 had 80 hidden units and 1.5 million self-play training games. In a 40 game rematch with Bill Robertie in 1993, it achieved a near-parity result: Robertie trailed the entire match, and only in the very last game was he able to pull ahead for an extremely narrow 1-point victory.

Tesauro later ran an exhaustive rollout analysis of the checker plays and doubling decisions in this match, using Snowie 3.2 to perform the rollouts (Tesauro, 2002). The results are summarized in the following tables. The analysis showed that TD-Gammon 2.1 technically outplayed Bill Robertie in piece-movement decisions, although the results are fairly close. The results confirm impressions at the time that the two players were fairly evenly matched. Robertie had an edge in technical plays, while TD-Gammon had an edge in vague positional situations. It is of interest to note that TD-Gammon made significantly fewer large errors, or "blunders" that gave up a large amount of equity.

Snowie Rollouts Equity loss Avg. errors/game Avg. blunders/game
Bill Robertie -0.188 ppg 2.12 0.47
TD-Gammon 2.1 -0.163 ppg 1.67 0.20

Rollout analysis by Snowie 3.2 of the move decisions in the 1993 match between Bill Robertie and TD-Gammon 2.1. First column gives the average cumulative equity loss per game due to inferior moves. Second column gives the average number of move decisions per game classified as "errors" (inferior to the best move by at least 0.02 ppg). Third column gives the average number of move decisions per game classified as "blunders" (inferior to the best move by at least 0.08 ppg).

As for the cube decisions, the rollouts indicate that Robertie's take/pass decisions were superb, and somewhat better than TD-Gammon's. However, TD-Gammon was clearly better in double/no double decisions: several of Robertie's doubling decisions were extremely conservative and would almost certainly be regarded by any top expert as large errors.

Snowie Rollouts BR equity loss TD equity loss
Double decisions -0.081 ppg -0.013 ppg
Take/pass decisions -0.007 ppg -0.010 ppg

Rollout analysis by Snowie 3.2 (depth-11 truncated) of the cube action in the 1993 match between Bill Robertie and TD-Gammon 2.1.

### Version 3.0 (1995)

Version 3.0 used the same neural net as version 2.1, but made real-time move decisions using 3-ply search. Tesauro estimated its playing level at about 0.07 to 0.08 ppg better than version 2.1. The use of 3-ply search eliminated nearly all of the technical errors that had marred the play of earlier versions. TD-Gammon 3.0 won a 25-point match played on FIBS (First Internet Backgammon Server) against Neil Kazaross, who was rated among the world's top ten players at the time. It also narrowly lost a 20-game session against Malcolm Davis in 1997 at the AAAI Hall of Champions exhibition.

### Version 3.1 (1998)

The last version of TD-Gammon was version 3.1, which was developed in 1998 for a 100-game rematch against Malcolm Davis at the AAAI Hall of Champions. Version 3.1 had 160 hidden units and was trained for more than 6 million self-play games. It clearly outplayed Davis in piece-movement decisions, and looked as though it would hand him a crushing defeat. However, due to a suboptimally tuned doubling algorithm which had never been tested in competition, it made one doubling blunder costing 32 points in a single game (Davis redoubled to 16 and won a gammon). As a result, TD-Gammon ended up with an overall loss by 8 points.

Exhaustive rollout analysis using Snowie 3.2 (Tesauro, 2002) confirmed that TD-Gammon had a lopsided advantage over Malcolm Davis in checker plays, and a slight edge in doubling cube decisions. Results are shown in the tables below. Davis remarked after the match that in his estimation, a top human player would be an underdog against any of the top neural net programs by about a tenth of a point per game.

Snowie Rollouts Equity loss Avg. errors/game Avg. blunders/game
Malcolm Davis -0.183 ppg 1.85 0.48
TD-Gammon 3.1 -0.050 ppg 0.59 0.04

Rollout analysis by Snowie 3.2 of the move decisions in the 1998 AAAI Hall of Champions exhibition match between Malcolm Davis and TD-Gammon 3.1. Equity loss, errors and blunders are defined as in table

Snowie Rollouts MD equity loss TD equity loss
Double decisions -0.031 ppg -0.020 ppg
Take/pass decisions -0.026 ppg -0.005 ppg
TD-Gammon Rollouts MD equity loss TD equity loss
Double decisions -0.022 ppg -0.015 ppg
Take/pass decisions -0.026 ppg -0.002 ppg

Rollout analysis of the cube action in the 1998 Hall of Champions match between Malcolm Davis and TD-Gammon 3.1. First set of figures are based on Snowie 3.2 depth-11 truncated rollouts. Second set of figures are TD-Gammon 2.1 full rollouts including the doubling cube.

## Impact

• Backgammon community
• Computer games research
• Applied RL
• Theoretical RL
• Biological RL
• RL in IBM Watson's Jeopardy! strategy

The main significance of TD-Gammon was that it showed that reinforcement learning from self-play is a viable method for learning complex tasks to an extent previously unrealized by AI and machine learning researchers. Prior to TD-Gammon, there had been no significant real-world applications of reinforcement learning. As a result of TD-Gammon's success, there was much renewed interest in applying reinforcement learning in numerous real-world problem domains, and in expanding our theoretical understanding of such methods. Some of the successful applications inspired by TD-Gammon include: elevator dispatch (Crites and Barto, 1996), job-shop scheduling for the NASA Space Shuttle (Zhang and Dietterich, 1996), cell-phone channel assignment (Singh and Bertsekas, 1997), assembly line optimization and production scheduling (Schneider et al., 1998; Mahadevan and Theocharous, 1998), financial trading systems and portfolio management (Moody et al., 1998), and call admission and routing in telecommunications networks (Marbach et al., 2000).

## References

• L. V. Allis, A knowledge-based approach of Connect-Four. The game is solved: White wins. M. Sc. Thesis, Faculty of Mathematics and Computer Science, Free University of Amsterdam (1988).
• J. Baxter, A. Tridgell and L. Weaver, KnightCap: A chess program that learns by combining TD($$lambda$$) with minimax search. Proceedings of ICML-98, 28-36 (1998).
• H. Berliner, "Computer backgammon." Scientific American 243:1, 64-72 (1980).
• M. Buro, "Efficient approximation of backgammon race equities." ICCA Journal 22:3, 133--142 (1999).
• R. H. Crites and A. G. Barto, Improving elevator performance using reinforcement learning. In: D. Touretzky et al., eds., Advances in Neural Information Processing Systems 8, 1017-1023, MIT Press, 1996.
• K. Hornik, M. Stinchcombe and H. White, Multilayer feedforward networks are universal approximators. Neural Networks 2, 359-366 (1989).
• O. Jacoby and J. R. Crawford, The Backgammon Book. New York: Bantam Books (1970).
• E. B. Keeler and J. Spencer, "Optimal doubling in backgammon." Operations Research 23, 1063-1071 (1975).
• P. Magriel, Backgammon. New York: Times Books (1976).
• S. Mahadevan and G. Theocharous, Optimizing production manufacturing using reinforcement learning. Proceedings of the Eleventh International FLAIRS conference, 372-377. AAAI Press (1998).
• P. Marbach, O. Mihatsch, and J. N. Tsitsiklis, Call Admission Control and Routing in Integrated Service Networks Using Neuro-Dynamic Programming. IEEE Journal on Selected Areas in Communications 18:2, 197-208 (2000).
• J. Moody, M. Saffell, Y. Liao and L. Wu, "Reinforcement learning for trading systems and portfolios." In: A.N. Refenes, N. Burgess and J. Moody, eds., Decision Technologies for Computational Finance: Proceedings of the London Conference. Kluwer Financial Publishing (1998).
• J. B. Pollack and A. D. Blair, Co-evolution in the successful learning of backgammon strategy. Machine Learning 32, 225-240 (1998).
• B. Robertie, Advanced Backgammon (vols. 1 and 2). Arlington, MA: The Gammon Press (1991).
• B. Robertie, Carbon versus silicon: matching wits with TD-Gammon. Inside Backgammon 2:2, 14-22 (1992).
• D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning internal representation by error propagation." In D. Rumelhart and J. McClelland (Eds.), Parallel Distributed Processing, Vol. 1. Cambridge MA: MIT Press (1986).
• A. Samuel, Some studies in machine learning using the game of checkers. IBM J. of Research and Development 3, 210-229 (1959).
• J. Schaeffer, "The games computers (and people) play." In: M. Zelkowitz, ed., Advances in Computers 50, 189--266. Academic Press (2000).
• J. G. Schneider, J. A. Boyan and A. W. Moore, Value function based production scheduling. Proceedings of ICML-98 (1998).
• S. P. Singh and D. Bertsekas, Reinforcement learning for dynamic channel allocation in cellular telephone systems. In: M. C. Mozer, M. I. Jordan and T. Petsche, eds., Advances in Neural Information Processing Systems 9, 974--980. Cambridge, MA. MIT Press (1997).
• R. S. Sutton, Learning to predict by the methods of temporal differences. Machine Learning 3, 9-44 (1988).
• R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. Cambridge, MA: MIT Press (1998).
• G. Tesauro, "Neurogammon wins Computer Olympiad." Neural Computation 1, 321-323 (1989).
• G. Tesauro, Optimal doubling in multi-outcome probabilistic games. IBM Research, unpublished manuscript (1990).
• G. Tesauro, "Practical issues in temporal difference learning." Machine Learning 8, 257-277 (1992).
• G. Tesauro and G. R. Galperin, On-line policy improvement using Monte-Carlo search. In: M. C. Mozer, M. I. Jordan and T. Petsche, eds., Advances in Neural Information Processing Systems 9, 1068--1074. Cambridge, MA: MIT Press (1997).
• K. Woolsey, "Computers and rollouts." On-line article available at: www.gammonline.com (2000).
• N. Zadeh and G. Kobliska, "On optimal doubling in backgammon." Management Science 23, 853-858 (1977).
• W. Zhang and T. G. Dietterich, High-performance job-shop scheduling with a time-delay TD($$lambda$$) network. In: D. Touretzky et al., eds., Advances in Neural Information Processing Systems 8, 1024-1030, MIT Press (1996).