# User:Max Berniker/Proposed/Bayesian sensorimotor control

Bayesian sensorimotor control is an approach for analyzing and describing the integration of sensory feedback and motor behaviors through the formalism of Bayesian inference. Most often this is performed in the context of neuroscience and robotics. In the Bayesian paradigm, beliefs about the properties of both the world and one's own body are represented with probability distributions. Through interactions with our environment, we receive observations that are used to update these beliefs and inform our motor choices.

We live in an uncertain world, and each of our actions may have many possible outcomes. Yet we must cope with all these uncertainties if we are to thrive in the world. Thus our nervous system is faced with the challenge of combining many uncertain pieces of information into estimates of the properties of our bodies and the world around us, and then choosing appropriate actions. Bayesian approaches to estimation formalize the problem of how this uncertain information should be integrated, and used to control our movements. Utilizing this approach, many studies faithfully predict human sensorimotor behavior. Here we will explore and review many of these ideas and the recent evidence surrounding them.

## Introduction

A central objective of the nervous system is to sense the state of the world around us and to affect this state such that it is more favorable to us. To measure this state we use sensory feedback. Yet this information is imprecise and subject to noise. Each of our sensory modalities offers only a limited level of precision, which can vary depending on the situation. Vision is limited under dim light or extra-foveal conditions, hearing becomes unreliable for weak sounds, and proprioception drifts without calibration from vision (Deneve, 2008; Hoyer & Hyvärinen, 2003; Ma, Beck, Latham, & Pouget, 2006; Pouget, Dayan, & Zemel, 2003; Zemel, Dayan, & Pouget, 1998). Furthermore, the actions we take do not have deterministic outcomes; they too are subject to noise and uncertainty. Our motor commands possess noise, and the properties of our muscles themselves can vary from day to day due to fatigue, strength, health, etc. Therefore, to achieve its objectives, the nervous system must integrate sensory information into a cohesive whole, and then choose amongst uncertain actions. As such, we can conclude that a crucial problem the nervous system faces is how to cope with noisy information.

Efficiently managing uncertain information requires a statistical framework. Bayesian integration is the mathematical framework that calculates how uncertain information from multiple sources can be combined optimally. It results in a coherent and maximally accurate estimate derived from a set of observations. Therefore, to integrate sensory information with what we know about the world, we can use the Bayesian framework. Similarly, to use uncertain information to make choices about how to act in the world, we can use the Bayesian framework. Thus, the Bayesian framework supplies us with a principled formalism against which to test how an optimal nervous system senses its world and then acts upon it.

To illustrate these ideas, consider the act of descending a staircase. To achieve this, the nervous system must sense the stairs, and then act to transport us down them. Based on our familiarity with walking down stairs, we have strong expectations for things like the distance between steps, their height and their general shape. Often these expectations are strong enough that we feel comfortable taking stairs without even observing them, as when we descend stairs without looking at our feet, or in the dark. Normally though, we will first observe how far we need to step. Vision, however, does not provide perfect measurements. The visual system provides us with an estimate of the step’s height. Bayes’ rule defines how to combine our expectations of the step’s height with our visual sense to make an optimal estimate of the step’s height.

Taking the example further, once we have begun to take a step, we receive sensory information about our ongoing motion. We can combine this sensory information with the action we have just chosen to make an optimal estimate of where we are and where we are headed. Finally, we can even use the Bayesian framework to choose how we should step, given all our expectations and sensory estimates of the steps.

In the sections that follow, we will discuss how we can use Bayesian statistics to make sense of an uncertain world, and then act upon it, and how this uncertainty may be represented in the brain. In doing so, we will review a number of recent psychophysics and computational studies that have examined the similarities between these Bayesian predictions and people’s behavior. In particular, we discuss how Bayesian integration allows us to combine multiple pieces of information into a single distribution (section 1), how we can update this distribution over time as we continue to gain new information about it (section 2), how we can use this statistical framework to control our movements (section 3), and finally recent theories and evidence on how the brain may represent these uncertainties (section 4).

## Bayesian Information Estimation

### Combining prior knowledge with new evidence

Combining uncertain information to produce a coherent and accurate estimate of our body and the world is integral to the normal functioning of our nervous system. Sometimes the source of this uncertain information is our senses, and we must compare it against what we’d expect before it can be of benefit. As an example, consider a game of tennis. When returning a fast moving ball, it is often helpful to estimate where the ball will land while it is still in motion. Based on our familiarity with the game and our opponent, we may have strong assumptions about where the ball usually lands. For instance, the ball probably doesn’t land uniformly over the court, but instead often lands in locations that are concentrated in bounds and highly peaked near the sidelines where it is most difficult to return the ball (Fig 1B). This assumed distribution over locations forms what is referred to as a prior, a belief in the ball’s typical locations when no other information is available.

If this prior were strong enough, say if we were returning balls lobbed from a tennis ball machine, we might swing without even looking at the ball. Normally, however, we will first look at where the ball is heading. Vision does not provide noise-free information about the ball’s trajectory or velocity. Instead, our vision gives us an estimate, or what is referred to as the likelihood, of where the ball will strike the court (Fig 1A). This likelihood is the probability of having a particular sensory input for each possible location the ball may land. Bayes’ rule defines how to combine the prior and the likelihood to make an optimal estimate of the location where the ball lands (Fig 1C).

Figure 1: Bayesian integration in tennis. A) we view a tennis ball traveling towards the court floor and estimate the location it will land, a likelihood (shaded in red). B) Experience from many such instances tells us the ball usually lands close to the baseline, a prior (shaded in green). C) From these two pieces of information we produce an updated estimate of where we believe the ball will land (transparent ball), a posterior, shown with blue ellipses.

Bayes’ rule states that the probability of the ball landing at position $$x$$ given our observation $$o$$ is the product of the prior probability of the ball at $$x$$, with the likelihood of the observation $$o$$ given the position $$x$$, divided by the probability of the observation. Mathematically, this is expressed as $\tag{1} p(x|o) = \frac{p(x)p(o|x)}{p(o)}.$

The distribution produced by equation 1 is known as the posterior probability (shown graphically in Fig 3A). We can also interpret Bayes’ rule as the “optimal” means of combining a prior and a likelihood, as it produces an estimate with the minimum uncertainty.
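Equation 1 can be evaluated numerically by multiplying a prior and a likelihood over a grid of candidate positions and normalizing. The following is a minimal sketch rather than a model of any particular study: the Gaussian shapes, the numerical values, and the names `normal_pdf` and `grid_posterior` are assumptions chosen for illustration.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a Gaussian with the given mean and standard deviation."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

def grid_posterior(prior, likelihood, grid):
    """Apply equation 1 on a grid of candidate positions x.

    `prior(x)` and `likelihood(x)` are callables (hypothetical names).
    The sum of the products plays the role of p(o), up to the grid spacing.
    """
    unnormalized = [prior(x) * likelihood(x) for x in grid]
    evidence = sum(unnormalized)
    return [u / evidence for u in unnormalized]

# Assumed example: prior centered at 0 (sd 2), observation o = 3 with sensory sd 1.
grid = [-10.0 + 0.01 * i for i in range(2001)]
prior = lambda x: normal_pdf(x, 0.0, 2.0)
likelihood = lambda x: normal_pdf(3.0, x, 1.0)  # p(o | x) as a function of x

posterior = grid_posterior(prior, likelihood, grid)
post_mean = sum(x * p for x, p in zip(grid, posterior))
post_sd = math.sqrt(sum((x - post_mean) ** 2 * p for x, p in zip(grid, posterior)))
```

With these assumed values the posterior mean falls between the prior mean and the observation, closer to the more precise of the two, and the posterior standard deviation is smaller than that of either the prior or the likelihood. The same function works for non-Gaussian priors, such as one peaked near the sidelines of a court.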

Several studies, using many sensory modalities, have shown that when subjects combine preceding knowledge with new information their behavior reflects the integration of a prior and likelihood in a manner prescribed by Bayes’ rule (K. Kording, 2007). In a typical study (K. P. Kording & Wolpert, 2004) subjects will indicate their estimate of a target’s location through a motor response. On each trial, the target’s location is drawn from a normal distribution, the prior. Noisy feedback of the target’s location is then provided, the likelihood. The distribution used for the prior or the likelihood can be fixed, or vary across subjects as an experimental condition. Bayesian statistics predicts how subjects should combine the likelihood and the prior. These predictions are then compared against human performance, often revealing a high degree of similarity.

These paradigms have been applied to a wide range of topics spanning sensorimotor integration, force estimation, timing estimations, speed estimations, stance regulation, the interpretation of visual scenes and even cognitive estimates (Chater, Tenenbaum, & Yuille, 2006; D. Knill & Richards, 1996; Körding, Ku, & Wolpert, 2004; K. P. Kording & Wolpert, 2006; Miyazaki, Nozaki, & Nakajima, 2005; Miyazaki, Yamamoto, Uchida, & Kitazawa, 2006; Peterka & Loughlin, 2004; Tassinari, Hudson, & Landy, 2006; Weiss, Simoncelli, & Adelson, 2002). Together these studies demonstrate that people are adept at combining prior knowledge with new evidence in a manner predicted by Bayesian statistics.

### Combining multiple pieces of information

Figure 2: Results from an audio-visual cue combination study (reproduced from (Alais & Burr, 2004)). A) Subjects are asked to report a cue as being to the left or right of a reference cue as the displacement between the two cues is varied. The resulting data is used to construct a psychometric curve, which characterizes the probability that a subject perceives the cue correctly shifted to the left of a reference cue, as the displacement is varied. The curve quantifies a subject’s sensory uncertainty. Curves are obtained for visual cues of different noise levels, and auditory cues (green triangles). B) Based on the psychometric curves obtained for each subject, their visual and auditory precision can be inferred and used to make predictions for combining two cues. Plotted are the three subjects’ estimates of a cue’s position (dots) as the conflict between visual and auditory cues was varied (horizontal axis) with different levels of visual noise (red circles, blue triangles and black squares). The solid lines are the predictions for each subject, based on their psychometric curves. As can be easily seen, the data lies on top of the optimal Bayes’ predictions.

Oftentimes we aren't combining new information with prior knowledge, but rather trying to combine two or more different pieces of uncertain information into a single cohesive estimate. For example, we may see and feel an object at the same time (Ernst & Banks, 2002). We can then use our sense of vision and touch to estimate a common property of the object, for instance, its size or its texture. This type of task is what is commonly referred to as cue combination; two or more sensory cues are combined to form a common, or joint, estimate. Just as before, Bayesian statistics prescribe how we should combine the likelihoods to compute an optimal estimate from the posterior distribution (Fig 3B). Again, only by accurately combining these sensory cues can we optimally integrate our information and ultimately choose our movements.

Recent studies have found that people are similarly optimal when combining information from multiple senses. To study how people combine this information experimentally, we must first examine their likelihoods; that is, we must first quantify a subject’s belief in the properties of a particular stimulus. This is typically done by measuring what is known as a psychometric function, which quantifies the precision of a subject’s sensory modality. By obtaining these functions for various stimuli we can predict how subjects should combine multiple stimuli to form a new belief in a multi-sensory cue, if they are Bayes’ optimal.

As an example study, we consider how people combine visual and auditory information to estimate the position of a target. In a typical study (Alais & Burr, 2004) individual subjects are probed for their psychometric functions. This is usually done using what is referred to as a two alternative forced choice (2AFC) paradigm. For example, a subject is shown two noisy visual cues (e.g. a group of dots flashed on a screen) in succession. The subject is then asked to decide which of the two cues was further to the left. When the two cues are widely separated, the decision is easy. However, as the distance between the two cues narrows, which cue is further to the left becomes difficult to judge. As the noise in the cues increases, this judgment also becomes harder to make. After many such trials, the subject’s data is used to compute their average correct response for the various conditions. A curve is then fit to this data to characterize a subject’s probability of correctly judging the distance between two cues as the distance, or their uncertainty, is varied (Fig 2A).

Psychometric curves generally start at zero and reach one: this simply reflects that if a decision is sufficiently easy, subjects make no more mistakes. Each psychometric curve can be parameterized by two free parameters, the point of subjective equality (PSE), which is defined as the stimulus settings where subjects have a 50% response rate, and the just noticeable difference (JND) that determines the range over which the response probability goes from 25% to 75%. One equation for the curve that is often used is: $\tag{2} P(choice) = \frac{1}{2}\left( 1 + erf\left(\frac{x-PSE}{JND\sqrt{2}}\right) \right),$

where $$x$$ is some feature of the stimulus we are characterizing. A key feature of this psychometric curve is that it can be used to infer a subject’s precision in the sensory cues, the JND. That is, as their precision in distinguishing between two cues increases, the curve’s transition becomes sharper. The details concerning how to interpret psychometric functions are studied under the term signal detection theory. Similar experiments can be performed to obtain a subject’s psychometric curve for audition, and their corresponding precision with that sensory modality.
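Equation 2 is straightforward to evaluate directly. The short sketch below uses the error function from the Python standard library; the function name `psychometric` and the example values are assumptions for illustration. Note that in this parameterization the JND acts as the standard deviation of a cumulative Gaussian centered on the PSE.

```python
import math

def psychometric(x, pse, jnd):
    """Equation 2: probability of a given choice for stimulus value x."""
    return 0.5 * (1.0 + math.erf((x - pse) / (jnd * math.sqrt(2.0))))

# Assumed example: the same displacement judged with a precise (small JND)
# and an imprecise (large JND) sensory cue.
p_precise = psychometric(1.0, pse=0.0, jnd=0.5)
p_imprecise = psychometric(1.0, pse=0.0, jnd=2.0)
```

For the same displacement from the PSE, the curve with the smaller JND lies further from 50%, which is the sharper transition described above.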

The second part of this experiment then tests how subjects perform when they use both senses, and in particular, if they are close to the behavior prescribed by Bayesian integration. In this portion of the experiment, a noisy visual cue and a noisy auditory cue are given at the same time, but at different spatial locations. Now the subject’s task is to judge where the source of the cues was. When the two cues are coincident, the task is relatively simple and straightforward. As the disparity between them grows, subjects must use both cues to decide where a common source was. According to Bayes’ rule, subjects should bias their decisions towards the cue they perceive with more precision; that is, if you can see the cue more precisely than you can hear a cue, you should judge the source as close to where you think you saw it. What has been observed is that performance in these cue combination trials can be predicted well using the rules of Bayesian integration (Fig 2B). These results are further evidence that people cope with uncertain information in a statistically optimal manner.
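The bias towards the more precise cue can be made concrete for the simplest case. The sketch below assumes Gaussian likelihoods that are conditionally independent given the source, together with a flat prior; under those assumptions the optimal estimate is a weighted average with weights proportional to each cue's inverse variance. The function name `combine_cues` and the numbers are hypothetical.

```python
import math

def combine_cues(x_vis, sd_vis, x_aud, sd_aud):
    """Combine a visual and an auditory position cue (Gaussian, flat prior)."""
    var_vis, var_aud = sd_vis ** 2, sd_aud ** 2
    # Each cue is weighted by the other cue's variance, i.e. by its own precision.
    w_vis = var_aud / (var_vis + var_aud)
    estimate = w_vis * x_vis + (1.0 - w_vis) * x_aud
    combined_sd = math.sqrt(var_vis * var_aud / (var_vis + var_aud))
    return estimate, combined_sd, w_vis

# Assumed example: vision (sd 1) is more precise than audition (sd 2),
# and the two cues are in conflict by 4 units.
estimate, combined_sd, w_vis = combine_cues(0.0, 1.0, 4.0, 2.0)
```

Here the visual weight is 0.8, so the combined estimate sits much closer to the visual cue, and the combined standard deviation is smaller than that of either cue alone.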

Figure 3: A) Bayesian integration of a prior and likelihood. The prior, denoted with the green curve, represents the probability of a state, $$x$$. The likelihood, denoted with the red curve, represents the probability of observing the data, $$o$$, given $$x$$. The posterior is the probability that $$x$$ is the state, given our observation, $$o$$. B) Bayes’ rule applied to cue combinations is the mathematical analogue, only instead of using a prior and likelihood, we integrate two likelihoods. Here, for simplicity we assume the two observations are conditionally independent given the state, and a flat prior distribution over $$x$$.

In addition to auditory and visual cues, combinations of other sensory modalities have been analyzed in a good number of studies (Ernst & Banks, 2002; Ernst & Bulthoff, 2004; Geisler & Kersten, 2002; Kersten & Yuille, 2003; D. Knill & Richards, 1996; D. C. Knill, 2007; K. Kording, 2007; Rosas, Wagemans, Ernst, & Wichmann, 2005; van Ee, Adams, & Mamassian, 2003). While the sensory modalities are distinct from what we have reviewed above, the experiments are very similar. The precision of individual sensory modalities is measured, and then subjects are tested when they use the combined modalities. In all reported cases cues are combined by the subjects in a fashion that is close to the optimum prescribed by Bayesian statistics. Interestingly, recent studies in both monkeys and rodents demonstrate that the ability to optimally combine sensory information is not limited to humans (Fetsch, Turner, DeAngelis, & Angelaki, 2009; Raposo, Sheppard, Schrater, & Churchland, 2012). This is especially significant as it may suggest that the neural mechanisms underlying Bayesian behavior are conserved across species.

### Credit assignment

The rules of Bayesian statistics reviewed above prescribe how we bring new information to bear upon our beliefs. Bayesian inference also plays a profound role in resolving ambiguities in perceptual inference. This is because the forward or generative models implicit in the likelihood function - used by a Bayesian observer - can entertain many different causes of the same sensory data. The inherent ambiguity for model inversion (estimating the causes) is resolved through prior beliefs. The importance of Bayesian inference in estimating the causes of our sensations can be illustrated in many ways. For instance, suppose we hold two blocks in our hand, one sitting atop the other, and we want to estimate their individual weights. Our observation of their combined weight upon our hand is indicative of their individual weights, namely the magnitude of their sum. However, this information is not sufficient to accurately establish their individual weights. To estimate them, we need to solve a credit assignment problem; that is, how does each property (individual weight) contribute to our observation (overall weight). Bayesian statistics also prescribes an optimal solution to this problem.

Our observation of their combined weight forms a likelihood, a probability distribution of having a particular sensory observation of their combined weight, for each value of their individual weights. By combining this with a prior over the objects’ weights, perhaps based on their size (Flanagan, Bittner, & Johansson, 2008), we can compute a posterior distribution of their individual weights.

Again, this distribution will provide an optimal estimate of their individual weights, simultaneously solving the credit assignment problem: how much does each block contribute to the total weight I feel? Recent research (Berniker & Kording, 2008) addresses this type of problem by integrating information over time.
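As a rough illustration of the two-block example, the sketch below assumes independent Gaussian priors over each block's weight and Gaussian noise on the sensed total. Under those assumptions the discrepancy between the sensed and expected total is shared between the blocks in proportion to their prior variances. This is only a sketch of the static problem, not the model used in the cited study, and `split_weight` is a hypothetical name.

```python
def split_weight(total_obs, obs_sd, prior_means, prior_sds):
    """Posterior mean weight of each of two blocks given their noisy sum."""
    m1, m2 = prior_means
    v1, v2 = prior_sds[0] ** 2, prior_sds[1] ** 2
    # How much heavier (or lighter) the pair feels than expected.
    surprise = total_obs - (m1 + m2)
    denom = v1 + v2 + obs_sd ** 2
    # The block we are less certain about absorbs more of the surprise.
    return m1 + v1 / denom * surprise, m2 + v2 / denom * surprise

# Assumed example: both blocks are expected to weigh 1.0, but we are far
# less sure about the first (sd 0.5) than the second (sd 0.1).
w1, w2 = split_weight(total_obs=2.6, obs_sd=0.1,
                      prior_means=(1.0, 1.0), prior_sds=(0.5, 0.1))
```

In this example most of the unexpected extra weight is attributed to the first block, because our prior belief about it was weaker.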

### Nonlinear cue combination

Above we have discussed how cue-combination studies have provided evidence that the nervous system combines multiple pieces of information for optimal estimates. However, the brain receives a very large number of sensory cues, not all of which should be combined; many cues may simply be irrelevant. To make sense of the world then, the nervous system should only combine those sensory cues that are likely to provide information about the common sources or causes of sensory signals.

Consider the example of a ventriloquist: he/she synchronizes his/her voice with the movement of the puppet’s mouth. Due to visuo-auditory cue combination, the audience experiences the illusion of a talking puppet, or more precisely, that a voice is being emitted from the puppet’s mouth. If the ventriloquist’s voice is out of sync with the puppet’s movements, this illusion will immediately break down. In this example, the temporal proximity of cues is an overriding factor in inducing the merging of cues. Similarly, spatial proximity or disparity can also influence whether we combine different cues into a single percept. As in the visuo-auditory cue combination study reviewed above, because the light and tone appear to emanate from a similar location the two cues are readily interpreted as a single event. In contrast, if the tone is far away from the light, these two stimuli may be perceived as independent events. The nervous system combines cues that appear to originate from the same source. In other words, it seems that the nervous system estimates the structure of the world, i.e., what causes the perceived cues.

Figure 4: A) Two structural beliefs of the world. If two cues are adequately coincident, subjects perceive them as having a common cause (the green box), a phenomenon typified through ventriloquism. If the cues are disparate in time or space, subjects perceive them as having independent causes (the red box). B) Subjects’ belief in a common cause. As the spatial disparity of two cues, a light flash and a tone, is experimentally controlled, the belief in a common cause can be manipulated.

Traditional Bayesian analyses of cue combination examine tasks where the experimental cues are close to coincident, and implicitly assume that the cues are caused by the same event, or source. New studies have tested subject performance in situations where two cues are dissimilar from one another. In the simplest case, a visual cue and an auditory cue are simultaneously presented for estimating the location of an object. These two cues can either have the same cause or have different causes. In the same-cause case, these two cues should be combined to form a single estimate; in the different-cause case, these two cues should be processed separately (Fig 4A). The optimal estimate should weigh each cause with respect to its likelihood. This likelihood is a function expressing the probability of observing the cues when they arise from the same cause, or not. Not surprisingly, this likelihood depends on the spatial and temporal disparity between cues: as the disparity between the cues increases, subjects’ belief in a common cause decreases (Fig 4B). Recent studies have provided support for the predictions of Bayesian models of causal inference for cross-modality cue combination (KP Kording, Tenenbaum, & Shadmehr, 2006; Shams, Ma, & Beierholm, 2005), depth perception (D. C. Knill, 2007), and estimation of stimuli numbers (Wozny, Beierholm, & Shams, 2008). These studies highlight that the nervous system estimates state with consideration of causal structures among sensory cues in a way that is close to the statistical optimum.
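The dependence of the common-cause belief on disparity can be sketched with a simple generative model. The code below assumes that source locations are drawn from a zero-mean Gaussian prior, that each cue is the source location plus independent Gaussian noise, and that a common cause has some prior probability `p_common`. These assumptions, the parameter values, and the function names are illustrative and are not taken from any one of the cited studies.

```python
import math

def bivariate_normal_pdf(x, y, var_x, var_y, cov):
    """Density of a zero-mean bivariate Gaussian at (x, y)."""
    det = var_x * var_y - cov ** 2
    quad = (var_y * x * x - 2.0 * cov * x * y + var_x * y * y) / det
    return math.exp(-0.5 * quad) / (2.0 * math.pi * math.sqrt(det))

def prob_common_cause(x_vis, x_aud, sd_vis, sd_aud, sd_prior, p_common=0.5):
    """Posterior probability that two cues share one source."""
    var_v, var_a, var_p = sd_vis ** 2, sd_aud ** 2, sd_prior ** 2
    # One shared source: the cues covary through the source location.
    like_common = bivariate_normal_pdf(x_vis, x_aud, var_v + var_p, var_a + var_p, var_p)
    # Two independent sources from the same prior: no covariance.
    like_indep = bivariate_normal_pdf(x_vis, x_aud, var_v + var_p, var_a + var_p, 0.0)
    numerator = like_common * p_common
    return numerator / (numerator + like_indep * (1.0 - p_common))

# Assumed example: belief in a common cause as the spatial disparity grows.
disparities = [0.0, 2.0, 5.0, 10.0, 20.0]
beliefs = [prob_common_cause(-d / 2.0, d / 2.0, sd_vis=1.0, sd_aud=2.0, sd_prior=10.0)
           for d in disparities]
```

Under these assumptions the belief in a common cause is highest when the cues coincide and falls steadily as they are moved apart, which is the qualitative pattern described above.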

One central feature of the human movement system is that it adapts continuously to changes in the environment and in the body. It has been suggested that this adaptation is largely driven by movement errors. However, errors are subject to unknown causal structures. For example, if an error is caused by changes in the motor apparatus, such as strength changes due to muscle fatigue, the nervous system should adapt its estimates of the body. On the other hand, if an error is caused by random external perturbations, the nervous system should not adapt its estimates of the body, as this error is irrelevant. Given the inherent ambiguity in errors, the best choice of adaptive action should be a function of the probability of error being relevant. Bayesian models can predict the optimal adaptation to errors with unknown causal relationship. These predictions have been confirmed in a recent study on motor adaptation in a visuomotor task (Wei & Körding, 2008). It appears that in these sensorimotor tasks the nervous system not only estimates the state but also estimates the causal structure of perceptual information.

The statistical problem that subjects solve in cue combination implicitly involves an inference about the causal structure of the stimuli, e.g. did the flash of light cause the auditory tone, did the tone cause the flash, was there an unknown cause for them both, or were they simply coincidental? The problem faced by the nervous system is thus similar to those studied extensively in psychology and cognitive science that occur in the context of causal induction (Gopnik et al., 2004; Griffiths & Tenenbaum, 2005; Michotte, 1963; Sperber & Premack, 1995; Tenenbaum, Griffiths, & Kemp, 2006). Many experiments demonstrate that people interpret events in terms of cause and effect. The results presented here show that sensorimotor integration exhibits some of the same factors found in human cognition. Incorporating the causal relationships among state variables, it appears the nervous system estimates a structural view of the world. Once again, this process is necessary to optimally select actions, whether they be for motor or any behavior.

## Bayesian integration over time

In the situations we discussed above we dealt with the estimation of variables that do not change over time. Examples of such variables would be the weight of a block or the location of the source of sensed cues. However, the world is dynamic, and as such its properties are continually changing and, consequently, so should our perceptions of them. For example, we might want to continuously estimate the state of our body or our location relative to our environment. We constantly need to integrate newly obtained information with our current beliefs to inform new estimates of the world if we want to maintain meaningful and precise estimates of the variables that are relevant for our actions.

These ideas imply that Bayesian integration should take place in an ongoing, or continuous, manner. Our past observations define a belief that we have. As time passes we need to update our beliefs because we are constantly obtaining new observations, and because the world changes, our old beliefs become less relevant. For example, if we estimate the location of our hand without viewing it, knowledge of its prior position and the commands we send to our muscles will provide an uncertain belief about the hand’s new position. Looking at our hand, however, will improve our estimate of the hand’s position. As we saw in section 1, Bayes’ rule allows us to combine what we believe based on past observations with new observations. The interplay between these two attributes defines how we make estimates in a time-varying world and is the foundation of frequently used algorithms including Hidden Markov Models (HMM) and Kalman Filters.

Figure 5: A) Typical procedure for optimally estimating the world’s state in modern control theory. A model of the world, combined with a motor command is used to estimate a predicted state (the prior). Observations of the world dictate the likelihood of a particular observation given the current world state. A Kalman filter is used to make a Bayesian update of our belief in the world’s current state. B) This process of Bayesian inference repeats itself at each time step, using the posterior from one time step, as the prior for the following time step. C) Motor adaptation can be framed as an analogous update procedure. Our prior belief in muscle properties (e.g. muscle strength, labeled in green) is integrated with our observed motor errors to update estimates in our muscle properties.

This ongoing integration of information is an approach taken extensively in modern applications of control theory through the use of a Kalman filter. Kalman filtering is a procedure for using a linear model and our observations, corrupted by Gaussian noise, to continuously update our beliefs (Fig 5A). At each update the Kalman filter combines a model’s estimate of the world’s state (the prior) with a measured observation (the likelihood) to update a prediction of the world’s current state, represented by a posterior (Fig 5A). At any point of time, the posterior from the past defines the prior for the future. This formalism is well developed and used in a wide range of applications from aeronautics to humanoid robotics (Stengel, 1994; Emanuel Todorov, 2006). Indeed, even the motor control problems of two applications as disparate as controlling a jet and controlling our bodies, share many computational analogies. In both cases continuously incoming information needs to be assessed to move precisely (albeit on different time scales).

To illustrate how this Bayesian update is performed, we will consider a simple example system. We will assume the system has a single state, $$x$$, the variable we are interested in estimating given a noisy observation, $$y$$ (say the signals from our proprioceptive system). We assume this variable to be Markov; that is, the future values this variable may take are conditionally independent of all past values once we are given its current value. Said another way, knowing earlier values of the variable will not improve our ability to predict its future probability distribution. The model for this state’s transitions and observation is linear with white (Gaussian) noise, $\tag{3} x_{k+1}=ax_k+\omega_k,$ $\tag{4} y_k=cx_k+\epsilon_k$

where $$Std(\omega)=\sigma_{\omega}$$ and $$Std(\epsilon)=\sigma_{obs}$$ are the standard deviations, and the means are zero. Since the state is driven in part by noise, and we only have access to noisy observations, namely $$y$$, we cannot be certain of the value of $$x$$, but instead must estimate its value. Let us represent our belief in the state’s possible values at some time, $$k$$, as a Gaussian distribution, $$N(\mu_k, \sigma^2_p)$$. Here $$\mu_k$$ represents the current mean of our estimate for $$x_k$$ while $$\sigma_p$$ represents the statistical uncertainty as to the actual value of $$x_k$$. The challenge is to update these to time $$k+1$$ using the new data provided at that time by the observation $$y_{k+1}$$. Since the state’s dynamics are assumed to be linear (eqn 3), and the noise distributions are usefully approximated by a Gaussian, we can easily form a belief distribution (like a prior) over the state at time $$k+1$$, $\tag{5} P(x_{k+1}) = N(a \mu_k, a^2 \sigma^2_p + \sigma^2_\omega).$

This is our belief in the values the state may take, before we have made the observation $$y_{k+1}$$. The observation dynamics, also linear and Gaussian, induce their own distribution, called the likelihood. $\tag{6} P(y_{k+1}|x_{k+1}) = N(c x_{k+1}, \sigma^2_{obs}).$

This is the probability of observing $$y$$, given the state’s possible values. Now, according to Bayes’ rule (eqn 1), we can combine the prior and the likelihood to compute a posterior distribution over the state’s values at time $$k+1$$: $\tag{7} P(x_{k+1}|y_{k+1}) = P(y_{k+1}| x_{k+1})P(x_{k+1}) / P(y_{k+1}) = N(\mu_{k+1}, \sigma^2)$ with $\mu_{k+1} = a \mu_k + K(y_{k+1} - c a \mu_k),$ $\tag{8} K = c(a^2 \sigma^2_p + \sigma^2_\omega) \left[ c^2(a^2 \sigma^2_p + \sigma^2_\omega) + \sigma^2_{obs}\right]^{-1},$ and posterior variance $$\sigma^2 = (1 - Kc)(a^2 \sigma^2_p + \sigma^2_\omega)$$.

As can be seen, the mean of our posterior belief in the state is a linear combination of the prior’s and likelihood’s mean, scaled by $$K$$. This value, $$K$$, is referred to as the Kalman gain. This Bayesian update has very similar analogues when the state is multi-dimensional and continuous in time. Conveniently, since the posterior distribution at this time step is our prior belief in the state’s value at the next time step, this whole procedure can be repeated at each time step just as above (Fig 5B).
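The update for this scalar system can be written in a few lines. The sketch below follows the standard scalar Kalman filter: the predicted variance is that of $$a x_k + \omega_k$$ under the model in equation 3, namely $$a^2\sigma^2_p + \sigma^2_\omega$$, and the posterior variance is the standard result $$(1 - Kc)$$ times the predicted variance. The function name `kalman_step` and all numerical values are assumptions for illustration.

```python
def kalman_step(mu, var_p, y_next, a, c, var_omega, var_obs):
    """One predict-and-update cycle for the scalar linear-Gaussian system."""
    # Predict: belief over x_{k+1} before seeing the new observation.
    mu_prior = a * mu
    var_prior = a * a * var_p + var_omega
    # Kalman gain: how strongly the prediction error corrects the estimate.
    gain = c * var_prior / (c * c * var_prior + var_obs)
    # Update: posterior mean and variance after observing y_{k+1}.
    mu_post = mu_prior + gain * (y_next - c * mu_prior)
    var_post = (1.0 - gain * c) * var_prior
    return mu_post, var_post, gain

# Assumed example: a single update from a belief N(0, 1).
mu1, var1, gain1 = kalman_step(mu=0.0, var_p=1.0, y_next=4.0,
                               a=1.0, c=1.0, var_omega=1.0, var_obs=2.0)

# Repeating the cycle: the posterior of one step is the prior of the next.
mu, var = 0.0, 10.0
for y in [4.0] * 20:
    mu, var, gain = kalman_step(mu, var, y, a=1.0, c=1.0, var_omega=1.0, var_obs=2.0)
```

With these assumed parameters the uncertainty settles to a steady value after a handful of steps, regardless of how uncertain the initial belief was, and the gain settles with it.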

Only recently has the methodology of Kalman filtering been applied to make quantitative predictions of human movement behavior. Studies have found evidence that people estimate the state of their limb in a manner consistent with Kalman filtering (Mehta & Schaal, 2002; Wolpert, Ghahramani, & Jordan, 1995). The variables estimated need not be properties of the body, however, and studies have found evidence that human subjects estimate other time varying processes similarly in both motor (Stevenson, Fernandes, Vilares, Wei, & Kording, 2009) and decision tasks (Acuna & Schrater, 2010; Green, Benson, Kersten, & Schrater, 2010). Using this precise Bayesian formalism to examine human behavior is quickly finding many applications.

What’s more, even if the estimation is not linear and Gaussian as above, an estimate based on the nonlinear posterior distribution is still optimal. Recent work examining people’s abilities to integrate evidence over time in such circumstances has also found that humans may be Bayes-optimal (Berniker, Voss, & Kording, 2010). In a simple visuomotor task subjects were asked to estimate the location of a hidden random variable. In order to do so optimally, subjects would have to learn the underlying Gaussian distribution. In contrast with the above examples, estimating both the mean and variance of a variable is nonlinear. However, using the appropriate Bayesian model to represent this nonlinear update, it was found that the dynamics of subject behavior could be well explained (Fig 6).

Figure 6: Learning a switching prior. A) Across-subject averages for the gain, $$r$$, in the wide variance first group (mean +/- standard error). The bold black line indicates the Bayesian inference model’s predicted result based on the fit from experiment 1. B) The inferred average prior as it evolved over the experiment. C) and D) Across-subject averages for the gain, $$r$$, in the narrow variance first group and the inferred prior as it evolved.

Finally, estimating variable properties is not only important for making decisions, it is also an important feature of motor adaptation. Since the properties of our bodies change continuously throughout our lives, it is essential that we monitor these changes if we are to control our motor behaviors accurately and precisely. For example, errors in the perceived strength of our muscles will translate to movement errors. However, we can use these errors to obtain a likelihood characterizing how strong our muscles are (Fig 5C). According to Bayesian statistics we should combine this newly obtained information, our motor errors, with our prior beliefs. We then infer new and improved estimates of our muscles. Using this same approach many features of motor adaptation can be explained as the result of people’s attempts to optimally estimate motor variables (Berniker & Kording, 2008).

In summary, the findings of many studies suggest that motor adaptation across time can be understood using the predictions of Kalman filtering (and extended Kalman filtering for nonlinear models). More to the point, when people integrate information over time, they seem to do so in a fashion that is consistent with the optimal Bayesian solution. If so, then these estimates and this mechanism might be reflected in our motor behavior as well. As we shall see in the next section, there is ample evidence that they are.

## Bayesian decision theory in dynamical systems: optimal control

### The control problem

A central objective of the nervous system is to sense the state of the world around us and to affect this state such that it is more favorable to us. Effecting these changes is the task of motor control. In the previous sections we saw how the information from our senses can be used to accurately estimate both the state of our environment and our body. Using these estimates we must decide what actions to take to bring about our desired results. When our choice of actions does this, we are said to be behaving rationally. Equivalently, we could say that rational behavior is optimal, in that this behavior executes the best actions for achieving our desired results. Thus behaving rationally is equivalent to solving an optimality problem: what actions should we select to best achieve our goals?

The formal mathematics of these optimality problems are developed under the fields of decision theory and optimal control. Using these approaches, all possible actions can be assessed in terms of their ability to achieve our goals. Then the optimal action can be identified and executed. To quantify the relative merit of one action over another, we must define a cost function (also referred to as a loss function). In its most general form, the cost function assigns a value to a choice of action and the resulting outcome, within the context of our goals. In essence, it mathematically quantifies our goals and our strategy for achieving them. Therefore, by selecting actions that minimize this cost, we can best achieve our goals.

Figure 7: Optimal control in a dart-throwing example. A) According to decision theory, we should throw the dart in the direction that maximizes our score. By predicting the location the dart will land for every possible direction, we can identify the best choice. B) Under a more realistic scenario, the dart’s final location is not known exactly, but instead we have a likelihood of positions on the board associated with every direction. Now the best direction to throw the dart is that which maximizes our expected score given this likelihood.

To illustrate, consider a simplified game of darts. Assume that the possible actions uniquely specify where the dart lands on the board, defining the score, the outcome of our action. Assuming our goal is to get the highest score possible, our “cost” function is inversely related to our score; that is, we achieve the lowest cost by getting the highest score. The optimal choice of action is the one that results in the dart landing on the triple twenty (60 points, better than a bull’s eye, Fig 7A). Mathematically, we would express this optimal action as $\tag{9} a^o = \mathrm{arg\,min}_a \left[\mathrm{cost}(outcome(a), a)\right].$

Here $$a$$ is shorthand for the possible actions, and $$a^o$$ is the optimal action. The above example, while valuable in portraying the general approach to selecting optimal actions, is clearly a simplification of the true problem. As the previous sections have emphasized, motor behaviors are stochastic in nature; as such, we cannot choose a command and be certain of its outcome. Instead, we must account for the statistics of the task. If we aim at the center and throw many times we will end up with a distribution of dart positions. This distribution characterizes the likelihood of each outcome, the dart landing at a certain position on the board, given that we aimed at the center. Given this likelihood, it may not make sense to aim for the triple twenty any longer; our errors may be large enough to make many low scores likely. Instead we will use the likelihood to define a new optimization problem: what action minimizes the expected cost, that is, the average cost when weighted by the probability of the various outcomes? Mathematically, we would express this best action as $\tag{10} a^o = \mathrm{arg\,min}_a \left[\sum_{outcomes} \mathrm{cost}(outcome, a)\,p(outcome|a)\right].$

Here $$p(outcome|a)$$ is the probability of an outcome conditioned on the choice of action, $$a$$ (the likelihood). Under this approach, the best aiming point is that which yields high scores even if we make large mistakes (Fig 7B). In fact, both amateur and world-class players are known to adopt a strategy that is well predicted by this approach.
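The expected-cost rule can be made concrete with a toy, one-dimensional "board". This is only a sketch under stated assumptions: the score layout in `SCORES`, the discretized Gaussian throwing noise, the normalization that keeps every throw on the board, and all function names are hypothetical choices made for illustration, not a model of a real dartboard.

```python
import math

# Hypothetical 1-D board: a narrow high-value target flanked by low scores,
# and a wide region of moderate scores further along.
SCORES = [1, 60, 1, 1, 1, 30, 30, 30, 30, 30, 1]


def landing_probs(aim, sigma):
    """Discretized Gaussian p(outcome | aim), normalized over board positions."""
    weights = [math.exp(-0.5 * ((x - aim) / sigma) ** 2)
               for x in range(len(SCORES))]
    total = sum(weights)
    return [w / total for w in weights]


def expected_cost(aim, sigma):
    """Average cost (negative score), weighted by the landing probabilities."""
    probs = landing_probs(aim, sigma)
    return sum(-score * p for score, p in zip(SCORES, probs))


def best_aim(sigma):
    """Aim point that minimizes the expected cost for a given throwing noise."""
    return min(range(len(SCORES)), key=lambda aim: expected_cost(aim, sigma))


precise_aim = best_aim(sigma=0.3)  # accurate thrower
noisy_aim = best_aim(sigma=2.0)    # inaccurate thrower
```

With small noise the minimizer sits on the narrow 60-point target; with large noise it moves into the wide 30-point region, where errors are less costly.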

Pressing our example further, we can examine more sophisticated and realistic action selection problems. For instance, a more sensible description of dart throwing recognizes that the task is dynamic. The dart’s final position on the board depends on the ballistic trajectory it takes once it has been released from our hand. The motion of the dart up until the moment we release it is dictated by the inertial mechanics of our arm and the force generating properties of our muscles. As the state of our arm and the dart evolve in time, so too must our actions. Under these more realistic conditions our possible actions move from a range of aiming points, to a list of possible muscles to excite, and the time-history of how these muscles should be excited. Clearly, our description of the dart-throwing problem can take on greater and greater levels of detail and physical accuracy.

Under these more realistic and dynamic conditions, our costs, too, are no longer immediately evident, and instead may evolve over time, or may not even be apparent until some delayed period of time after we have acted. For instance, how hard we throw the dart will influence its path once it leaves our hands and eventually hits its target. Moreover, how hard we throw will influence how quickly we tire, influencing our later throws. How we perform during the dart game may also influence our physical and emotional well being for the remainder of the afternoon. Taking these considerations into account, one can see how choosing our optimal actions can quickly become a daunting task. Mathematically, we still express our optimal actions in essentially the same form as equation 2; however, now the actions are functions of time and state, and the costs too must be integrated across time. This time and state dependent series of actions is often referred to as a policy: a plan of action to control decisions and achieve our desired outcome.

The best possible policy dictates the action we should take at any instant, taking into account how it will influence both future states and the actions we can take at that future time. In a sense, to be optimal we need to know the future; we need to be aware of the possible future states and actions that will result from the decision we take now. In simple scenarios these future states and actions could be learned through trial and error. However, more generally the future is inferred through models of the world. These models, referred to as forward models, require a hypothesis about the nature of the world (Bayesian estimation of structure) and how it evolves (Bayesian estimation across time). Regardless of how the future states are estimated, we need to know how choosing an action based on our current state will affect future states and the costs associated with them. Mathematically, this is compactly represented by what is referred to as the value function. The value function quantifies the cost of being in any state, and acting according to a given policy for all future times. Not surprisingly, computing this value function is difficult. However, once we know this value function, we can relatively easily assess the value of being in any given state, and choose the action that moves us to a state with the best value.

Despite the varying complexity of the above examples and their solutions, we can still formulate these problems in the Bayesian framework. We must solve a statistical problem concerning the likely evolution of states conditioned on the actions we choose. Though the problem in general is onerous, there are many circumstances under which solutions can be calculated, and many algorithms for approximating solutions for the remaining circumstances. Below we discuss some of these strategies.

### Solution strategies in optimal control

Solving for the value function, an integral component of solving optimal control problems, has spawned a whole field of numerical techniques (e.g. see Bryson & Ho, 1975; Stuart & Peter, 2003). Because the value function is difficult to compute analytically, many techniques have been developed to approximate it and the accompanying optimal policy. As the problem descriptions get more complicated (e.g. as when we move from choosing where to aim a dart to choosing muscle activation patterns), the likelihood that these techniques will provide an accurate solution diminishes. As such, approximating the optimal actions is often the best one can hope for. Regardless of the difficulty in solving these problems, the same basic principles hold; we represent the necessary statistics of actions and how they impact the statistics of outcomes.

When our observations of the state of the world are relatively certain, we can attempt to approximate the value function directly. There are two widely used approaches. Value iteration is a boot-strapping algorithm, initialized with a naïve value function and repeatedly updated until the estimate is acceptably accurate. Using the current estimate of the value function, the algorithm sweeps across all world states, updating the value of being in each state given that state’s cost and the value of the best possible future state. The policy is never explicitly computed; instead, actions are chosen by evaluating the value function’s estimate. In a similar approach, policy iteration uses a current estimate of the value function to compute a policy. This policy is then used to converge on an estimate of the value function. These steps are repeated until both the policy and value function converge. Value and policy iteration techniques are particularly applicable to problems with a finite and discrete number of world states and actions. When we are less certain about the world and how our actions will influence it, we must rely on other methods. In reinforcement learning algorithms, rather than using accurate distributions of states and costs, empirical observations are used (Sutton & Barto, 1998). By observing world states and actions taken, an intermediate function is computed that can be used to estimate the optimal policy.
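The value iteration sweep described above can be sketched on a very small problem. The example below is a hypothetical, deterministic "corridor" of discrete states with a single goal state; the function names, the unit cost per step, and the stopping tolerance are assumptions chosen for illustration rather than part of any cited algorithm description.

```python
def next_state(state, action, n_states):
    """Deterministic corridor dynamics: move left or right, clipped at the walls."""
    return min(max(state + action, 0), n_states - 1)


def value_iteration(n_states, goal, step_cost=1.0, tol=1e-9):
    """Sweep over all states until the cost-to-go estimates stop changing."""
    values = [0.0] * n_states  # naive initial value function
    while True:
        max_change = 0.0
        for s in range(n_states):
            if s == goal:
                continue  # absorbing goal state: no further cost
            # Cost of this step plus the value of the best reachable next state.
            best = min(step_cost + values[next_state(s, a, n_states)]
                       for a in (-1, +1))
            max_change = max(max_change, abs(best - values[s]))
            values[s] = best
        if max_change < tol:
            return values


def greedy_action(values, state):
    """Choose the action leading to the lowest estimated cost-to-go."""
    return min((-1, +1),
               key=lambda a: values[next_state(state, a, len(values))])


values = value_iteration(n_states=5, goal=4)
policy = [greedy_action(values, s) for s in range(4)]
```

In this toy problem the converged values equal the number of steps to the goal, and the policy is read off the value function rather than stored explicitly, as described in the text.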

While the above algorithms are very successful for a large class of problems, a more general approach to solving optimal control problems is found through the Hamilton-Jacobi-Bellman equation (or simply the Bellman equation when dealing with discrete time). These equations define the necessary and sufficient conditions for optimality in terms of the value function. Under limited conditions these equations can be used to find analytical solutions to the optimal policy. For example, the widely used method of dynamic programming can be used to solve for an optimal policy when the states and actions are discrete and finite. What’s more, for linear problems with Gaussian noise, these equations can be used to derive the value function and the optimal policy, the so-called linear quadratic Gaussian (LQG) regulator problem. Under more general conditions, these equations can be used to approximate, perhaps iteratively, the value function and an optimal policy (Stengel, 1994; E. Todorov & Tassa, 2009). A wide range of numerical techniques is available to solve problems of optimal control and these techniques generally incorporate Bayesian updating.

### Optimal control as Bayesian inference

Interestingly, recent advanced statistical techniques suggest that these traditional methods for computing optimal actions may be supplemented, or even altogether abandoned, for what promises to be a truly Bayesian inference formalism. To motivate this new approach, we first point out that in the simplest of cases, it is well known that there is a correspondence between how to choose optimal actions, and how to make optimal inferences. In the linear setting, the optimal estimation problem is the mathematical dual to the optimal control problem (e.g. Stengel, 1994). The defining equations for these two problems’ solutions are identical. Practically speaking, this means the same numerical techniques can be employed to solve either problem. Theoretically, the implication is that choosing an optimal action is a similar problem to optimally estimating the world’s state.

Recent work has extended this duality to a larger class of problems. Making some broad assumptions concerning system dynamics and probability distributions of the optimal solutions, this duality has been extended to a large class of nonlinear problems (E. Todorov, 2008, E. Todorov, 2009). There is a correspondence between the distribution of states under optimal conditions, and the solution to the value function. This work further emphasizes the implicit connection between optimal actions and optimal inference.

Taking another approach, recent work has neglected a cost function altogether and reformulated the problem of optimal action selection as a problem of optimal Bayesian inference (K. Friston, 2009; K. J. Friston, Daunizeau, & Kiebel, 2009; Toussaint, 2009; Toussaint, 2010). This new work replaces the traditional cost function with a desired distribution over states. For example, under this approach we might assume we have a known distribution over the initial state, and the final observation of the state (or equivalently, we can assume it is constrained to a certain value, Fig 8). Then, we can write the probability of any given state or command conditioned on the initial state and our final observation. Through advanced inference techniques, such as message passing, we can then infer the command at each time step. These new methods, which are purely inference processes, subsume the problem of optimal action selection and optimal inference into a more general Bayesian inference procedure.

Figure 8: Optimal control as a Bayesian inference problem. Assuming a known distribution over the initial state and final observation (corresponding to some final desired state or task constraint) we can infer all the intermediate states, observations and commands. These best estimates of command values can then be used to command the system.

## Neural representations of uncertainty and the limits of optimality

So far, we have discussed how the noisy variables we observe can be used to optimally estimate properties like hidden states that vary in time, and how these estimated properties can be used in the control of our actions. All of these processes require knowledge of a variable’s uncertainty, or its underlying probability distribution. In this section we want to more generally discuss how the nervous system may represent uncertainty. We will consider a range of theories that have been put forward to describe how neurons may represent uncertainty and also discuss actual neural data that can be related to these theories. In addition, we will note many of the ways in which people apparently deviate from the optimal predictions. Regardless of whether or not these deviations argue against the Bayesian formalism, they may provide valuable insight into how the brain actually estimates and controls our actions.

### Neural representations of uncertainty

Imagine we are recording spikes from a neuron somewhere in the nervous system. How could these measured spikes, defined by their temporal history, indicate the level of uncertainty in some encoded property? For example, maybe the firing rate of one neuron transmits the value of a parameter, while the firing rate of a different neuron transmits this same parameter’s uncertainty. Maybe a neuron can encode both the value of a property and its uncertainty simultaneously. These two basic ideas have been specified in a wide range of models, and pose important unanswered questions to neurophysiology.

The idea that there is a subset of neurons that only encode uncertainty is probably most popular (Fig. 9A). This theory appears to have significant experimental support (Yu & Dayan, 2005). For example, some experiments indicate that uncertainty about rewards appears to be represented by groups of dopaminergic neurons in the Substantia Nigra, insula, orbitofrontal cortex, cingulate cortex and amygdala (Fiorillo, Tobler, & Schultz, 2003; Huettel, Song, & McCarthy, 2005; McCoy & Platt, 2005; Singer, Critchley, & Preuschoff, 2009). However, neurons in these areas appear to also be influenced by variables not related to uncertainty. Further, the experiments supporting this theory generally examine “high level”, or cognitive uncertainty, such as a subject’s uncertainty in a potential reward, and not necessarily the uncertainty caused by noise in sensory or motor information. It may be the case that the latter uncertainty is represented in a different way, or in other brain regions that are not fully understood yet. Though the overall evidence is not yet conclusive, this description for the neural representation of uncertainty is promising.

Figure 9: Possible neural representations of uncertainty. In red are firing rates (or connections) in a low uncertainty state, and in blue are the rates occurring in a high uncertainty state. A) Specific populations of neurons encode the uncertainty of a stimulus or variable. B) Tuning widths of neurons could change based on uncertainty. C) In probabilistic population coding, the firing rate of a neuron indirectly encodes uncertainty. D) Neurons could encode uncertainty with the relative timing of their spiking. E) In the sampling hypothesis, the variability of a neuron’s firing rate encodes uncertainty. F) Uncertainty could be encoded through the functional connections between neurons.

A second possibility states that the width of a neuron’s tuning curve may change with uncertainty (Anderson, 1994) and that many neurons would jointly encode probability distributions (Fig. 9B). Such an ensemble encoding makes sense given that the visual system exhibits far more neurons than inputs (Van Essen, Anderson, & Felleman, 1992). Large groups of neurons could encode probability distributions instead of point estimates. When uncertainty is high then a broad set of neurons will have low firing rates, whereas when uncertainty is low a small set of neurons have high firing rates (see Fig. 9B). Support for this theory comes from early visual physiology where spatial frequency tuning curves of neurons in the retina are larger during darkness (when there is more visual uncertainty) than during the day (Barlow, Fitzhugh, & Kuffler, 1957).

A third influential theory of the encoding of uncertainty is the so-called probabilistic population code (PPC, Fig. 9C). Note that the Poisson-like firing observed in most neurons automatically implies that the stimulus driving a neuron can be known with less uncertainty the higher the firing rate, because the standard deviation of a Poisson process increases more slowly than its mean (Ma et al., 2006). In this way, populations of neurons transmit the stimulus information while at the same time transmitting the uncertainty associated with that stimulus. Specifically, the standard versions of this theory predict that increased firing rates of neurons imply decreased levels of uncertainty. Some data in support of this theory come from studies on cue combination (Gu, Angelaki, & Deangelis, 2008; Morgan, Deangelis, & Angelaki, 2008). More support comes from the general finding that early visual activity is higher when contrast is higher and thus uncertainty is lower (Carandini & Heeger, 1994; Cheng, Hasegawa, Saleem, & Tanaka, 1994; Shapley, Kaplan, & Soodak, 1981).

Interestingly, for PPCs Bayesian theory provides the central link between neural activities and probabilities. The probability distribution implied by the PPC is the best possible Bayesian readout from the neural vector, assumed to follow a Poisson distribution. However, how the incoming signals get converted into exactly the right number of spikes is still unclear.

Another theory that has been put forward suggests that while the tuning curves do not change with uncertainty, the relative timing of signals does (Fig. 9D) (Deneve, 2008; Huan & Rao, 2010). If uncertainty is low then neurons will fire quickly in response to a stimulus and then quickly stop firing. If, on the other hand, uncertainty is high, then neurons fire less but for a longer period of time. In this way, the total number of spikes may be the same, but their relative timing changes. There is some evidence for this theory coming from studies in the visual area MT (middle temporal) that show differential temporal modulation when animals are more uncertain (Bair & Koch, 1996).

In fact, there is a deep link between the Bayesian models of temporal integration we discussed here and the idea of predictive coding (Rao and Ballard, 1997). According to this idea, higher areas try to predict the incoming data from lower areas using a mechanism similar to an extended Kalman filter. This gives rise to a coding scheme where neural activities represent the difference between the incoming signals and the prediction by a higher area – they basically represent the error term of the Kalman filter. This coding scheme can be used to explain some visual effects (Rao and Ballard, 1999) and has received some support from fMRI (e.g. Alink et al., 2010, den Ouden et al., 2010). These theories thus constitute a potential link between Kalman filtering used for perception and movement in this chapter and its use in neural coding.

Another prominent theory is the sampling hypothesis (Fiser, Berkes, Orban, & Lengyel, 2010; Hinton & Sejnowski, 1983; Hoyer & Hyvärinen, 2003). According to this theory, when uncertainty is low, neurons will have a relatively constant firing rate for a period of time, but if uncertainty is high, neurons’ firing rates will be highly variable (see Fig. 9E). As such, a variable’s value and uncertainty are encoded in the mean and variance of the firing rate respectively. Evidence for the sampling hypothesis comes from some recent experiments comparing the statistics of neuronal firing across different situations (Fiser, Chiu, & Weliky, 2004; Kenet, Bibitchkov, Tsodyks, Grinvald, & Arieli, 2003). It is also compatible with the observed contrast invariant tuning properties of neurons in the primary visual cortex (Finn, Priebe, & Ferster, 2007). There are also behavioral experiments with bi-stable percepts that can be interpreted in this framework (Fiser et al., 2010; Hoyer & Hyvärinen, 2003). However, there is no experimental evidence yet that an explicit change in the uncertainty of a stimulus has resulted in a measured change in the variability of neuronal firing rates.

As a last theory we note the possibility that uncertainty could be encoded not in the firing properties of neurons but in the connections between them (see Fig. 9F). For example, the number and strength of synapses between neurons could encode uncertainty (Wu & Amari, 2003). This type of coding makes sense for prior probabilities, as they are acquired over long periods of time and need to store information in a robust manner. Uncertainty, in this case, would thus change the way that neurons interact with one another.

As we have discussed, there is a wide range of exciting theories on how the nervous system could represent probability distributions with neurons. Experimental data, however, does not strongly support one theory over the others. More importantly, these theories are not mutually exclusive and the nervous system may use any or all of these mechanisms at the same time, or differently for different types of uncertainty. Indeed, there is no reason to believe that the brain represents uncertainty in a manner easily interpretable by humans. Ongoing research is trying to understand which, if any, encoding of uncertainty the brain employs, and how it is used to inform our decisions.

### Behavioral deviations from optimality

Up to this point we have focused on how we can compute optimal estimates and use those estimates to make rational choices and actions. However, we are all aware that our decisions don’t always seem optimal. What’s more, we frequently make the wrong decisions repeatedly. In fact, it is known that under certain conditions people are prone to make the same kind of mistakes systematically. Below we will discuss many of these phenomena, pointing out how people often deviate from what should be optimal, and some potential reasons for this sub-optimality.

Many suboptimal behaviors concern what is one of the first steps in decision-making: estimating probabilities. Under many circumstances human subjects systematically misestimate probabilities, even when given sufficient time and practice to learn. Consider what is referred to as the Conjunction Fallacy. Suppose a group of subjects is presented with the following scenario: “Linda is 36 years old, outspoken and very bright. She majored in philosophy. As a student, she was deeply concerned with issues of discrimination and social justice and also participated in anti-nuclear demonstrations.” The subjects are then asked to assess different probabilities. For example, they are asked “how likely is it that Linda is a bank teller?” and “how likely is it that Linda is a bank teller and active in the environmental movement?” Subjects often judge the second statement to be more likely. According to the probability calculus, the second probability (the joint probability) cannot be greater than the first. That is, $$P(\text{bank teller}, \text{activist}) = P(\text{bank teller})\,P(\text{activist} \mid \text{bank teller}) \le P(\text{bank teller})$$. These experiments, and many similar ones since, reveal systematic biases in subjects’ probability judgments.

Though these biases appear to reveal suboptimal computations, they are believed to be the result of how subjects interpret the question. The person described as a bank teller AND an activist seems more likely to be who we expect Linda to be than the person who is just a bank teller. Instead of correctly calculating with probabilities, it appears that subjects use a simple comparison. Interestingly, when the question was rephrased, “Of 100 people like Linda, how many do you think are bank tellers? How many are bank tellers and active in the environmental movement?” subjects answered in an unbiased fashion (Gigerenzer, Todd, Group, & NetLibrary, 1999). Thus, it may be that humans simply have strong preferences for representing questions in certain ways.

Another common error in judging probabilities is termed Base Rate Neglect. Suppose you want to identify a small population of citizens, say republicans of Chilean descent, who are only 1% of the overall population. You find that based on individual voting history, you identify citizens as republican Chileans 5% of the time when in fact they aren’t (i.e. 5% are false positives) and 100% of the time when they are of the correct demographic. When voting history suggests a new person is a republican Chilean, what is the probability that they actually are? Most people answer 0.95 (a 95% chance of being correct). Using Bayes’ rule we find that the correct answer is in fact about 17%. Subjects appear to ignore the fact that the base rate for this population is only 1%.
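The arithmetic behind this answer is a direct application of Bayes’ rule and can be checked in a few lines. The function and argument names below are hypothetical labels for the quantities in the example (base rate 1%, hit rate 100%, false-positive rate 5%).

```python
def posterior_given_positive(base_rate, hit_rate, false_positive_rate):
    """P(member | test positive), computed with Bayes' rule."""
    # Total probability of a positive result: true positives plus false positives.
    evidence = hit_rate * base_rate + false_positive_rate * (1.0 - base_rate)
    return hit_rate * base_rate / evidence


p = posterior_given_positive(base_rate=0.01, hit_rate=1.0,
                             false_positive_rate=0.05)
print(f"{p:.3f}")  # prints 0.168
```

Because 99% of the population is outside the target group, the 5% of them who are falsely flagged outnumber the true members roughly five to one, which is why the posterior is far below 95%.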

Base rate neglect is a widespread phenomenon, and is estimated to be the source of many errors in the medical industry. However, just as in the conjunction fallacy phenomenon, asking the questions in a different way helps people to avoid base rate neglect. One interpretation of base rate neglect is that subjects may assume the base rate to be reflected in other criteria. For example, subjects might assume that people given a voting test are not drawn randomly from a population.

As we’ve seen there are many ways in which human subjects show errors in judging probabilities. As a further example, consider the probability of dying from Ebola or a shark attack versus the probability of dying from a stroke or heart attack. People usually judge the former (very rare occurrences) as relatively high, and the latter (very common events) as relatively low. Given these biases, it would seem quite unlikely that people would be able to make the right (or optimal) choices in their everyday lives. Not surprisingly, there are many cases where humans appear to choose their actions suboptimally.

A common case of suboptimal decision making is what is referred to as Matching. Imagine you are given the choice of gambling with two slot machines (one-armed bandits). Suppose the first machine randomly rewards 70% of the time, and the second 30% of the time. If we intend to maximize our money, the optimal choice is to repeatedly play the first machine, which pays off 70% of the time. Strangely, when human subjects are presented with this scenario, they play the first machine 70% of the time and the second 30% of the time. Instead of choosing the best option, they match their choices to the reward distribution. This choice of behavior, which persists even when there are true monetary rewards involved, significantly lowers the overall rewards that subjects obtain. Why then, do they use this apparently suboptimal strategy? Much research has addressed this question.
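The cost of matching in this example can be quantified with a short calculation. The sketch below assumes the reward probabilities are fixed and independent from trial to trial; the function name is a hypothetical label.

```python
def expected_reward_rate(p_choose_first, p_reward_first, p_reward_second):
    """Expected fraction of rewarded trials for a given choice probability."""
    return (p_choose_first * p_reward_first
            + (1.0 - p_choose_first) * p_reward_second)


# Always playing the better machine versus matching the reward probabilities.
maximizing = expected_reward_rate(1.0, 0.7, 0.3)
matching = expected_reward_rate(0.7, 0.7, 0.3)
```

Under these assumptions, always choosing the first machine is rewarded on 70% of trials, whereas matching is rewarded on 0.7 × 0.7 + 0.3 × 0.3 = 58% of trials.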

There are several scenarios where the matching strategy would be justified. For example, if we believe it is possible to predict every reward correctly, then our choices must be distributed in the same way as the rewards, that is, matched to the reward distribution. Hence, it might be that human subjects simply have very strong prior beliefs that they can track a system with dynamics. Alternatively, if we assume subjects are trying to learn the probability distributions, or that they believe the distributions are varying, then matching is actually optimal (Acuna & Schrater, 2010). Interestingly, matching is not an unavoidable behavior. Research shows that when high rewards are at stake, subjects given extensive training and good feedback can overcome this bias (Shanks, Tunney, & McCarthy, 2002).

Just as with matching, there are many documented instances where subjects systematically make choices that depart from optimal decision-making. In the endowment effect, subjects assign a higher value to objects once they possess them (Kahneman, Knetsch, & Thaler, 1990). What's more, subjects often appear irrational in their assessments of gains and losses; e.g. the displeasure of losing $1,000 typically outweighs the pleasure of winning $1,000. This so-called loss aversion is very common and accounts for what appear to be suboptimal decisions (Tversky & Kahneman, 1991). These and many other instances of irrational behavior, including anchoring, the Allais paradox (Allais, 1953) and the sunk cost bias, are examined in the field of decision theory. Just as with matching, however, many lines of investigation aim to understand whether these phenomena are the result of using heuristics to approximate optimality, or are in fact optimal under certain conditions.

Finally, it is worth making a distinction between cognitive decisions and what can be classified as sensory or perceptual choices. As we have seen in the previous sections, sensory and perceptual choices often appear to be optimal, whereas the cognitive examples of this section seem less than optimal. There may be evolutionary reasons for this. Arguably, the nervous system has had to solve sensorimotor problems for as long as animals have existed (since before the cortex evolved). However, high-level social, cognitive and economic decision-making is evolutionarily new. Hence, it may be that poor performance is due to the limited evolutionary time we have had to develop the ability to solve these problems. Another possibility is that cognitive and economic decision-making generally happens in social situations where our gains are other people's losses and vice versa. In such situations, being predictable allows exploitation by other agents. Strangely, acting in a predictable and optimal way can then be counterproductive.

## Summary

In these sections we have reviewed a wide range of ideas on how Bayesian statistics may be used by the nervous system to integrate information we acquire through sensing the world, and then act upon it with our motor system, a process we have argued is central to the nervous system's basic functioning. In section 1 we discussed how the Bayesian formalism prescribes the optimal combination of multiple pieces of information into a cohesive whole. This information could be our prior beliefs and new information, or multiple cues or measurements of a single property. Along with this mathematical formalism we surveyed some of the many experimental studies that have examined the similarities between Bayesian predictions and people's behavior. The collective evidence from these studies suggests that people integrate information in a manner consistent with the statistical optimum, suggesting that the nervous system may have evolved to integrate information accurately.

In section 2 we discussed how the integration of information is not always a discrete act, but instead occurs repeatedly throughout life. The same basic Bayesian ideas from section 1 can be modified to update our beliefs continuously as new information becomes available. When these beliefs are described with linear dynamics and Gaussian distributions, the Bayesian update is the familiar Kalman filter. Even when information is combined nonlinearly, or with non-Gaussian distributions, Bayes' rule still prescribes the proper update of our beliefs. As in section 1, many experiments have examined the similarity between these Bayesian updates and human behavior and found strong similarities. Given how accurately people combine discrete pieces of information, it is not surprising that people's ability to integrate uncertain pieces of information over time is also close to the Bayesian optimum. This too suggests that the nervous system is equipped to update our beliefs across time while we constantly acquire new uncertain information from our senses.

Integrating the information from our noisy senses across time helps us to create accurate perceptions of the world. However, without the ability to act on these beliefs they do us little good. In section 3 we presented the connection between our beliefs and our actions. We demonstrated that the general problem of choosing actions may be modeled as an optimal control problem, and showed how this approach has been successfully applied to a number of motor behaviors. We went on to discuss how, even in relatively simple examples, the mathematics behind these control problems requires Bayesian statistics. Furthermore, the optimal control problem itself can be formalized as an inference problem, emphasizing the broad applicability of the Bayesian approach.

Finally, in section 4 we surveyed a host of recent theories and evidence on how the brain may represent uncertainty and when our ability to manipulate this uncertainty breaks down. There are several prominent computational models for how neurons might easily solve some of the statistical problems of information integration. Similarly, many ongoing studies are looking for neural correlates of uncertainty and electrophysiological evidence of multimodal cue combination in the context of uncertainty. At the same time, there are many scenarios where our ability to compute statistics and make decisions based on uncertain circumstances breaks down. These sub-optimal behaviors may be key to understanding how the brain represents and manipulates uncertainty. These many approaches to examining the nervous system promise to help clarify the connection between the mathematical descriptions of behaviors and their neural substrates.

Recognizing the inherent uncertainty in all the sensory information available to the brain, as well as the uncertainty in our motor behaviors, a great number of the brain’s tasks have been successfully described in the Bayesian framework. Additionally, experimental investigations have found strong evidence linking human behaviors to these Bayesian predictions. This two-pronged approach promises to soon provide a deeper understanding of the nervous system.

## References

Acuna, D. E., & Schrater, P. (2010). Structure learning in human sequential decision-making. PLoS Comput Biol, 6(12), e1001003. doi: 10.1371/journal.pcbi.1001003

Alais, D., & Burr, D. (2004). The ventriloquist effect results from near-optimal bimodal integration. Curr Biol, 14(3), 257-262.

Allais, M. (1953). Le comportement de l'homme rationnel devant le risque: critique des postulats et axiomes de l'école Américaine. Econometrica, 21(4), 503-546. JSTOR 1907921.

Anderson, C. H. (1994). Basic elements of biological computational systems. Int. J. Modern Phys. C, 5, 313-315.

Bair, W., & Koch, C. (1996). Temporal precision of spike trains in extrastriate cortex of the behaving macaque monkey. Neural Comput, 8(6), 1185-1202.

Barlow, H. B., Fitzhugh, R., & Kuffler, S. W. (1957). Change of organization in the receptive fields of the cat's retina during dark adaptation. J Physiol, 137(3), 338-354.

Berniker, M., & Kording, K. (2008). Estimating the sources of motor errors for adaptation and generalization. Nat Neurosci, 11(12), 1454-1461. doi: 10.1038/nn.2229

Berniker, M., Voss, M., & Kording, K. (2010). Learning priors for bayesian computations in the nervous system. PLoS One, 5(9). doi: 10.1371/journal.pone.0012686

Bryson, A. E., & Ho, Y. C. (1975). Applied Optimal Control: Optimization, Estimation, and Control: Taylor & Francis Group.

Carandini, M., & Heeger, D. J. (1994). Summation and division by neurons in primate visual cortex. Science, 264(5163), 1333-1336.

Chater, N., Tenenbaum, J. B., & Yuille, A. (2006). Probabilistic models of cognition: where next? Trends Cogn Sci, 10(7), 292-293.

Cheng, K., Hasegawa, T., Saleem, K. S., & Tanaka, K. (1994). Comparison of neuronal selectivity for stimulus speed, length, and contrast in the prestriate visual cortical areas V4 and MT of the macaque monkey. J Neurophysiol, 71(6), 2269-2280.

Deneve, S. (2008). Bayesian spiking neurons I: inference. Neural Comput, 20(1), 91-117.

Ernst, M. O., & Banks, M. S. (2002). Humans integrate visual and haptic information in a statistically optimal fashion. Nature, 415(6870), 429-433. doi: 10.1038/415429a

Ernst, M. O., & Bulthoff, H. H. (2004). Merging the senses into a robust percept. Trends Cogn Sci, 8(4), 162-169. doi: 10.1016/j.tics.2004.02.002

Fetsch, C. R., Turner, A. H., DeAngelis, G. C., & Angelaki, D. E. (2009). Dynamic reweighting of visual and vestibular cues during self-motion perception. The Journal of Neuroscience, 29(49), 15601-15612.

Finn, I. M., Priebe, N. J., & Ferster, D. (2007). The emergence of contrast-invariant orientation tuning in simple cells of cat visual cortex. Neuron, 54(1), 137-152. doi: 10.1016/j.neuron.2007.02.029

Fiorillo, C. D., Tobler, P. N., & Schultz, W. (2003). Discrete coding of reward probability and uncertainty by dopamine neurons. Science, 299(5614), 1898-1902.

Fiser, J., Berkes, P., Orban, G., & Lengyel, M. (2010). Statistically optimal perception and learning: from behavior to neural representations. Trends in Cognitive Sciences, 14(3), 119-130. doi: 10.1016/j.tics.2010.01.003

Fiser, J., Chiu, C., & Weliky, M. (2004). Small modulation of ongoing cortical dynamics by sensory input during natural vision. Nature, 431(7008), 573-578. doi: 10.1038/nature02907

Flanagan, J. R., Bittner, J. P., & Johansson, R. S. (2008). Experience Can Change Distinct Size-Weight Priors Engaged in Lifting Objects and Judging their Weights. Current Biology.

Friston, K. (2009). The free-energy principle: a rough guide to the brain? Trends Cogn Sci, 13(7), 293-301.

Friston, K. J., Daunizeau, J., & Kiebel, S. J. (2009). Reinforcement learning or active inference? PLoS One, 4(7).

Geisler, W. S., & Kersten, D. (2002). Illusions, perception and Bayes. Nat Neurosci, 5(6), 508-510.

Gigerenzer, G., Todd, P. M., & the ABC Research Group. (1999). Simple heuristics that make us smart. New York: Oxford University Press.

Gopnik, A., Glymour, C., Sobel, D. M., Schulz, L. E., Kushnir, T., & Danks, D. (2004). A theory of causal learning in children: causal maps and Bayes nets. Psychol Rev, 111(1), 3-32.

Green, C. S., Benson, C., Kersten, D., & Schrater, P. (2010). Alterations in choice behavior by manipulations of world model. Proc Natl Acad Sci U S A, 107(37), 16401-16406. doi: 10.1073/pnas.1001709107

Griffiths, T. L., & Tenenbaum, J. B. (2005). Structure and strength in causal induction. Cognit Psychol, 51(4), 334-384.

Gu, Y., Angelaki, D. E., & Deangelis, G. C. (2008). Neural correlates of multisensory cue integration in macaque MSTd. Nat Neurosci, 11(10), 1201-1210. doi: 10.1038/nn.2191

Hinton, G. E., & Sejnowski, T. J. (1983). Optimal perceptual inference. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 448-453.

Hoyer, P. O., & Hyvärinen, A. (2003). Interpreting neural response variability as Monte Carlo sampling of the posterior. Advances in Neural Information Processing Systems 15: Proceedings of the 2002 Conference.

Huang, Y., & Rao, R. P. (2010). Predictive coding. WIREs Cognitive Science.

Huettel, S. A., Song, A. W., & McCarthy, G. (2005). Decisions under uncertainty: probabilistic context influences activation of prefrontal and parietal cortices. J Neurosci, 25(13), 3304-3311. doi: 10.1523/JNEUROSCI.5070-04.2005

Kahneman, D., Knetsch, J. L., & Thaler, R. H. (1990). Experimental tests of the endowment effect and the Coase theorem. Journal of Political Economy, 98(6), 1325-1348.

Kenet, T., Bibitchkov, D., Tsodyks, M., Grinvald, A., & Arieli, A. (2003). Spontaneously emerging cortical representations of visual attributes. Nature, 425(6961), 954-956. doi: 10.1038/nature02078

Kersten, D., & Yuille, A. (2003). Bayesian models of object perception. Curr Opin Neurobiol, 13(2), 150-158. pii: S0959438803000424

Knill, D., & Richards, W. (Eds.). (1996). Perception as Bayesian Inference: Cambridge University Press.

Knill, D. C. (2007). Robust cue integration: a Bayesian model and evidence from cue-conflict studies with stereoscopic and figure cues to slant. J Vis, 7(7), 5, 1-24.

Kording, K. (2007). Decision theory: what "should" the nervous system do? Science, 318(5850), 606-610. doi: 10.1126/science.1142998

Kording, K., Tenenbaum, J., & Shadmehr, R. (2006). A generative model based approach to motor adaptation. Paper presented at the Neural Information Processing Systems, Vancouver, Canada.

Körding, K. P., Ku, S. P., & Wolpert, D. (2004). Bayesian Integration in force estimation. Journal of Neurophysiology, 92(5), 3161-3165.

Kording, K. P., & Wolpert, D. M. (2004). Bayesian integration in sensorimotor learning. Nature, 427(6971), 244-247. doi: 10.1038/nature02169

Kording, K. P., & Wolpert, D. M. (2006). Bayesian decision theory in sensorimotor control. Trends Cogn Sci.

Ma, W. J., Beck, J. M., Latham, P. E., & Pouget, A. (2006). Bayesian inference with probabilistic population codes. Nat Neurosci, 9(11), 1432-1438. doi: 10.1038/nn1790

McCoy, A. N., & Platt, M. L. (2005). Risk-sensitive neurons in macaque posterior cingulate cortex. Nature Neuroscience, 8(9), 1220-1227. doi: 10.1038/nn1523

Mehta, B., & Schaal, S. (2002). Forward models in visuomotor control. J Neurophysiol, 88(2), 942-953.

Michotte, A. (1963). The Perception of Causality: Methuen.

Miyazaki, M., Nozaki, D., & Nakajima, Y. (2005). Testing Bayesian models of human coincidence timing. J Neurophysiol, 94(1), 395-399. doi: 10.1152/jn.01168.2004

Miyazaki, M., Yamamoto, S., Uchida, S., & Kitazawa, S. (2006). Bayesian calibration of simultaneity in tactile temporal order judgment. Nat Neurosci, 9(7), 875-877.

Morgan, M. L., Deangelis, G. C., & Angelaki, D. E. (2008). Multisensory integration in macaque visual cortex depends on cue reliability. Neuron, 59(4), 662-673. doi: 10.1016/j.neuron.2008.06.024

Peterka, R. J., & Loughlin, P. J. (2004). Dynamic regulation of sensorimotor integration in human postural control. J Neurophysiol, 91(1), 410-423.

Pouget, A., Dayan, P., & Zemel, R. S. (2003). Inference and computation with population codes. Annu Rev Neurosci, 26, 381-410. doi: 10.1146/annurev.neuro.26.041002.131112

Raposo, D., Sheppard, J. P., Schrater, P. R., & Churchland, A. K. (2012). Multisensory decision-making in rats and humans. The Journal of Neuroscience, 32(11), 3726-3735.

Rosas, P., Wagemans, J., Ernst, M. O., & Wichmann, F. A. (2005). Texture and haptic cues in slant discrimination: reliability-based cue weighting without statistically optimal cue combination. Journal of the Optical Society of America A, 22(5), 801-809.

Shams, L., Ma, W. J., & Beierholm, U. (2005). Sound-induced flash illusion as an optimal percept. Neuroreport, 16(17), 1923-1927.

Shanks, D. R., Tunney, R. J., & McCarthy, J. D. (2002). A Re-examination of Probability Matching and Rational Choice. Journal of Behavioral Decision Making, 15, 233–250.

Shapley, R., Kaplan, E., & Soodak, R. (1981). Spatial summation and contrast sensitivity of X and Y cells in the lateral geniculate nucleus of the macaque. Nature, 292(5823), 543-545.

Singer, T., Critchley, H. D., & Preuschoff, K. (2009). A common role of insula in feelings, empathy and uncertainty. Trends Cogn Sci, 13(8), 334-340. doi: 10.1016/j.tics.2009.05.001

Sperber, D., & Premack, D. (1995). Causal cognition: A multidisciplinary debate. Oxford: Oxford University Press.

Stengel, R. F. (1994). Optimal Control and Estimation: Dover Publications.

Stevenson, I. H., Fernandes, H. L., Vilares, I., Wei, K., & Kording, K. P. (2009). Bayesian integration and non-linear feedback control in a full-body motor task. PLoS Comput Biol, 5(12), e1000629. doi: 10.1371/journal.pcbi.1000629

Russell, S. J., & Norvig, P. (2003). Artificial Intelligence: A Modern Approach: Pearson Education.

Sutton, R. S., & Barto, A. G. (1998). Reinforcement Learning. Cambridge, MA: MIT Press.

Tassinari, H., Hudson, T. E., & Landy, M. S. (2006). Combining priors and noisy visual cues in a rapid pointing task. J Neurosci, 26(40), 10154-10163. doi: 10.1523/JNEUROSCI.2779-06.2006

Tenenbaum, J. B., Griffiths, T. L., & Kemp, C. (2006). Theory-based Bayesian models of inductive learning and reasoning. Trends Cogn Sci, 10(7), 309-318. doi: 10.1016/j.tics.2006.05.009

Todorov, E. (2006). Optimal Control Theory. In K. Doya (Ed.), Bayesian Brain. Cambridge, MA: MIT Press.

Todorov, E. (2008). General duality between optimal control and estimation. Paper presented at the Proceedings of the 47th IEEE Conference on Decision and Control, Cancun, Mexico.

Todorov, E. (2009). Efficient computation of optimal actions. Proc Natl Acad Sci U S A, 106(28), 11478-11483. doi: 10.1073/pnas.0710743106

Todorov, E., & Tassa, Y. (2009). Iterative local dynamic programming. Paper presented at the Symposium on Adaptive Dynamic Programming and Reinforcement Learning, 2009.

Toussaint, M. (2009). Robot trajectory optimization using approximate inference. Paper presented at the 26th International Conference on Machine Learning (ICML 2009).

Toussaint, M. (2010). A Bayesian view on motor control and planning: Springer.

Tversky, A., & Kahneman, D. (1991). Loss aversion in riskless choice: A reference-dependent model. The Quarterly Journal of Economics, 106(4), 1039-1061.

van Ee, R., Adams, W. J., & Mamassian, P. (2003). Bayesian modeling of cue interaction: bistability in stereoscopic slant perception. Journal of the Optical Society of America A, 20(7), 1398-1406.

Van Essen, D. C., Anderson, C. H., & Felleman, D. J. (1992). Information processing in the primate visual system: an integrated systems perspective. Science, 255(5043), 419-423.

Wei, K., & Körding, K. P. (2008). Relevance of error: what drives motor adaptation? Journal of Neurophysiology. doi: 10.1152/jn.90545.2008

Weiss, Y., Simoncelli, E. P., & Adelson, E. H. (2002). Motion illusions as optimal percepts. Nat Neurosci, 5(6), 598-604. doi: 10.1038/nn858

Wolpert, D. M., Ghahramani, Z., & Jordan, M. I. (1995). An internal model for sensorimotor integration. Science, 269(5232), 1880-1882.

Wozny, D. R., Beierholm, U. R., & Shams, L. (2008). Human trimodal perception follows optimal statistical inference. J Vis, 8(3), 24, 1-11.

Wu, S., & Amari, S. (2003). Neural Implementation of Bayesian Inference in Population Codes. Advances in Neural Information Processing Systems, 2003(14), 1-8.

Yu, A. J., & Dayan, P. (2005). Uncertainty, neuromodulation, and attention. Neuron, 46(4), 681-692. doi: 10.1016/j.neuron.2005.04.026

Zemel, R. S., Dayan, P., & Pouget, A. (1998). Probabilistic interpretation of population codes. Neural Comput, 10(2), 403-430.