# Sampling bias

Curator: Cesare Magri

Sampling bias means that the samples of a stochastic variable that are collected to determine its distribution are selected incorrectly and, for non-random reasons, do not represent the true distribution. Let us consider a specific example: we might want to predict the outcome of a presidential election by means of an opinion poll. Asking 1000 voters about their voting intentions can give a fairly accurate prediction of the likely winner, but only if our sample of 1000 voters is 'representative' of the electorate as a whole (i.e. unbiased). If we only poll the opinion of 1000 white middle-class college students, then the views of many important parts of the electorate (ethnic minorities, elderly people, blue-collar workers) are likely to be underrepresented in the sample, and our ability to predict the outcome of the election from that sample is reduced.

In an unbiased sample, differences between the samples taken from a random variable and its true distribution, or differences between the samples of units from a population and the entire population they represent, should result only from chance. If the differences are not due to chance alone, then there is a sampling bias. Sampling bias often arises because certain values of the variable are systematically under-represented or over-represented with respect to the true distribution of the variable (as in our opinion poll example above). Because of its consistent nature, sampling bias leads to a systematic distortion of the estimate of the sampled probability distribution. This distortion cannot be eliminated by increasing the number of data samples and must be corrected for by means of appropriate techniques, some of which are discussed below. In other words, polling an additional 1000 white college students will not improve the predictive power of our opinion poll, but polling 1000 individuals chosen at random from the electoral roll would. A biased sample may also cause problems in the measurement of probability functionals (e.g., the variance or the entropy of the distribution), since any statistic computed from that sample may be consistently erroneous.
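The point that more biased data does not help can be illustrated with a small simulation. The following is only a sketch under invented assumptions: the electorate composition (30% students, 70% others) and the support rates (0.7 and 0.4) are hypothetical numbers chosen for illustration, and `simulate_poll` is not taken from any library.

```python
import random

def simulate_poll(n, biased, rng):
    """Estimated support for a candidate from n respondents.

    Hypothetical electorate: 30% students (70% support the candidate)
    and 70% non-students (40% support), so the true support is
    0.3 * 0.7 + 0.7 * 0.4 = 0.49.
    """
    votes = 0
    for _ in range(n):
        # A biased poll interviews students only; an unbiased poll
        # draws each respondent at random from the whole electorate.
        is_student = True if biased else rng.random() < 0.3
        p_support = 0.7 if is_student else 0.4
        votes += rng.random() < p_support
    return votes / n

rng = random.Random(0)
for n in (1000, 2000, 20000):
    biased_estimate = simulate_poll(n, True, rng)
    random_estimate = simulate_poll(n, False, rng)
    print(n, round(biased_estimate, 3), round(random_estimate, 3))
```

Under these assumptions the students-only estimate settles near 0.7 as `n` grows, while the random-sample estimate settles near the true value of 0.49: a larger sample narrows the scatter but leaves the systematic offset in place.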

## Causes of sampling bias

A common cause of sampling bias lies in the design of the study or in the data collection procedure, both of which may favor or disfavor collecting data from certain classes or individuals or in certain conditions. Sampling bias is also particularly prominent whenever researchers adopt sampling strategies based on judgment or convenience, in which the criterion used to select samples is somehow related to the variables of interest. For example, referring again to the opinion poll example, an academic researcher collecting opinion data may choose, because of convenience, to collect opinions mostly from college students because they happen to live nearby, and this will further bias the sampling toward the opinion prevalent in the social class living in the neighborhood.

In social and economic sciences, extracting random samples typically requires a sampling frame such as the list of the units of the whole population, or some auxiliary information on some key characteristics of the target population to be sampled. For instance, conducting a study about primary schools in a certain country requires obtaining a list of all schools in the country, from which a sample can be extracted. However, using a sampling frame does not necessarily prevent sampling bias. For example, one may fail to correctly determine the target population or use outdated and incomplete information, thereby excluding sections of the target population. Furthermore, even when the sampling frame is selected properly, sampling bias can arise from non-responsive sampling units (e.g. certain classes of subjects might be more likely to refuse to participate, or may be harder to contact). Non-responses are particularly likely to cause bias whenever the reason for non-response is related to the phenomenon under study. Figure 1 illustrates how the mismatches between sampling frame and target population, as well as non-responses, could bias the sample.

In experiments in physical and biological sciences, sampling bias often occurs when the target variable to be measured during the experiment (e.g. the energy of a physical system) is correlated with other factors (e.g. the temperature of the system) that are kept fixed or confined within a controlled range during the experiment. Consider for example the determination of the probability distribution of the speed of all cars on British roads at any time during a certain day. Speed is clearly related to location, so measuring speed only at certain types of locations may bias the sample. For instance, if all measurements are taken at busy traffic junctions in the city centre, the sampled distribution of car speeds will not be representative of Britain’s cars and will be strongly biased toward slow speeds, because it neglects cars travelling on motorways and on other fast roads.

It is important to note that a systematic distortion of a sampled distribution of a random variable can also result from factors other than sampling bias, such as a systematic error in the instruments used to collect the sample data. Consider again the example of the distribution of the speed of cars in Britain, and suppose that the experimenter has access to the simultaneous reading of the speedometers placed on every car, so that there is no sampling bias. If most speedometers are tuned to overestimate the speed, and to overestimate it more at higher speeds, then the resulting sampled distribution will be biased toward high velocities.

## Correction and reduction of sampling bias

To reduce sampling bias, the two most important steps when designing a study or an experiment are (i) to avoid judgment or convenience sampling, and (ii) to ensure that the target population is properly defined and that the sample frame matches it as closely as possible. When finite resources or efficiency reasons limit the possibility to sample the entire population, care should be taken to ensure that the excluded populations do not differ from the overall one in terms of the statistics to be measured.

In social sciences, population-representative surveys are most commonly not simple random samples, but follow more complex sample designs (Cochran 1977). For instance, in a typical household survey a sample of households is selected in two stages: in the first stage, villages or parts of cities (clusters) are selected, and in the second stage a set number of households is selected within each selected cluster. When adopting such complex sample designs it is essential to ensure that the sample frame information is used properly and that the probabilities of selection and the random selection procedure are implemented and documented at each stage of the sampling process. Such information is needed to compute unbiased estimates for the population using sampling weights (the inverse of the probability of selection), and to take the sampling design into account when computing the sampling error. In clustered sample designs of this kind the sampling error is typically larger than in a simple random sample of the same size (Cochran 1977).
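The role of sampling weights can be sketched in code. The example below is a simplified illustration, not the two-stage household design described above: it assumes a single-stage design in which two hypothetical groups ("rural" and "urban") are sampled with different, known probabilities. All population sizes, incomes and selection probabilities are invented for the example.

```python
import random

def weighted_mean(values, selection_probs):
    """Estimate a population mean with weights 1 / P(selection)."""
    weights = [1.0 / p for p in selection_probs]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)

rng = random.Random(1)

# Hypothetical population: 8000 rural and 2000 urban households,
# with different average incomes (arbitrary units).
population = [("rural", rng.gauss(20, 5)) for _ in range(8000)]
population += [("urban", rng.gauss(50, 5)) for _ in range(2000)]

# Assumed design: urban households are ten times more likely to be selected.
design = {"rural": 0.05, "urban": 0.5}
sample = [(v, design[g]) for g, v in population if rng.random() < design[g]]

values = [v for v, _ in sample]
probs = [p for _, p in sample]

true_mean = sum(v for _, v in population) / len(population)
naive = sum(values) / len(values)        # ignores the design
weighted = weighted_mean(values, probs)  # uses inverse-probability weights

print(round(true_mean, 2), round(naive, 2), round(weighted, 2))
```

In this sketch the unweighted mean is pulled toward the over-sampled urban group, while the weighted mean stays close to the population mean. This only works because the selection probabilities were recorded, which is why documenting them at each stage matters.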

Whenever the sampling frame includes units that no longer exist (e.g., because the sample frame is incorrect or outdated), it will be impossible to obtain any samples from such non-existent units. This situation does not bias the estimates, provided that such cases are not substituted using non-random methods, and that the original sampling weights are properly adjusted to take such sample frame imperfections into account. Nevertheless, sample frame imperfections clearly have cost implications, and if the sample size is reduced this also influences the size of the sampling error.

Solutions to the bias due to non-response are more elaborate, and can generally be divided into ex-ante and ex-post solutions (Groves et al. 1998). Ex-ante solutions try to prevent and minimize non-response in various ways (for instance, specific training of enumerators, or several attempts to interview the respondent). Ex-post solutions try to gather auxiliary information about non-respondents, which is then used to calculate a probability of response for different population sub-groups. The response data are then re-weighted by the inverse of that probability or, alternatively, adjusted by post-stratification and calibration.
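As a minimal sketch of the ex-post re-weighting idea, the toy example below estimates a response rate for each sub-group and weights each respondent by its inverse. The two groups, their sizes and their answers are invented for illustration, and the helper `response_weights` is hypothetical. Real adjustments usually rely on richer auxiliary information.

```python
from collections import Counter

def response_weights(sampled_groups, responded):
    """Weight for each group = 1 / (estimated response rate of that group)."""
    n_sampled = Counter(sampled_groups)
    n_responded = Counter(g for g, r in zip(sampled_groups, responded) if r)
    return {g: n_sampled[g] / n_responded[g] for g in n_responded}

# Toy sample of 100 people: 50 "young" and 50 "old".
# An answer of None marks a non-respondent; 1 means "yes", 0 means "no".
groups = ["young"] * 50 + ["old"] * 50
answers = [1] * 30 + [0] * 10 + [None] * 10 + [1] * 2 + [0] * 8 + [None] * 40
responded = [a is not None for a in answers]

weights = response_weights(groups, responded)
respondents = [(g, a) for g, a in zip(groups, answers) if a is not None]

# Unadjusted share of "yes" among respondents only.
naive = sum(a for _, a in respondents) / len(respondents)

# Share of "yes" after weighting by the inverse response rate.
adjusted = (sum(weights[g] * a for g, a in respondents)
            / sum(weights[g] for g, _ in respondents))

print(naive, adjusted)  # 0.64 0.475
```

Here the "old" group responds far less often (10 of 50) than the "young" group (40 of 50), so the unadjusted share of 0.64 is dominated by young respondents. The adjusted value of 0.475 gives each group its share of the original sample. The correction is only valid if, within each group, respondents and non-respondents answer alike.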

## Sampling bias, sampling error, bias of a probability functional, and limited sampling bias

The concept of sampling bias should not be confused with other related but distinct concepts such as “sampling error”, “bias of a probability functional” and “limited sampling bias”. The sampling error of a functional of the probability distribution (such as the variance or the entropy of the distribution) is the difference between the estimate of the probability functional computed over the sampled distribution and the correct value of the functional computed over the true distribution. The bias of a functional of a probability distribution is defined as the expected value of the sampling error. Sampling bias can lead to a bias of a probability functional. However, the two concepts are not equivalent.
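These definitions can be written compactly. The notation here is introduced only for illustration and is an assumption, not the article's own: $$F$$ is the functional, $$P$$ the true distribution, $$\hat{P}_N$$ the distribution estimated from $$N$$ samples, and $$\langle \cdot \rangle$$ the expectation over repetitions of the sampling procedure.

$$\epsilon_N = F(\hat{P}_N) - F(P), \qquad \mathrm{Bias}_N[F] = \langle \epsilon_N \rangle = \langle F(\hat{P}_N) \rangle - F(P)$$

The sampling error $$\epsilon_N$$ fluctuates from one sample to the next, whereas the bias is its average and is therefore a fixed property of the estimator, the distribution and the sample size.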

A bias can arise when measuring a nonlinear functional of the probabilities from a limited number of experimental samples even when these samples are truly randomly picked from the underlying population and there is thus no sampling bias. This bias is called “limited sampling bias”. We will give below an example of the limited sampling bias of mutual information.

## The effect of limited sampling on the determination of statistical and causal relationships

Samples of random variables are often collected during experiments whose purpose is to establish whether two variables $$X$$ and $$Y$$ are statistically inter-related. If so, observing the value of variable $$X$$ (the “explanatory variable”) might allow us to predict the likely value of variable $$Y$$ (the “response variable”). A way to quantify the amount of statistical dependence between the random variables $$X$$ and $$Y$$ is to measure mutual information $$I(X;Y)$$

$$\tag{1} I(X;Y) = \sum_{x,y} P(x,y) \, \log_2 \frac{P(x,y)}{P(x) \cdot P(y)}$$

where $$P(x)$$ denotes the probability that $$X$$ takes a particular value $$x\ .$$ In the above the sums are over all possible values $$x$$ and $$y$$ that can be taken by $$X$$ and $$Y\ ,$$ respectively. $$I(X;Y)$$ is non-negative and is zero only when $$X$$ and $$Y$$ are statistically independent, i.e. the joint probability $$P(x,y)$$ of observing $$x$$ and $$y$$ equals their product $$P(x) \cdot P(y)\ .$$ Thus, measuring $$I(X;Y)$$ represents one way to test the hypothesis that the explanatory and the response variables are related.
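For concreteness, Eq. (1) can be evaluated directly when the probabilities are replaced by observed relative frequencies (a "plug-in" estimate). The function below is a minimal sketch; its name and its input format, a list of `(x, y)` observations, are choices made for this example.

```python
import math
from collections import Counter

def mutual_information(pairs):
    """Plug-in estimate of I(X;Y) in bits from a list of (x, y) observations."""
    n = len(pairs)
    joint = Counter(pairs)               # counts of each (x, y) pair
    px = Counter(x for x, _ in pairs)    # counts of each x
    py = Counter(y for _, y in pairs)    # counts of each y
    # P(x,y) / (P(x) P(y)) = (c/n) / ((px/n) (py/n)) = c * n / (px * py)
    return sum((c / n) * math.log2(c * n / (px[x] * py[y]))
               for (x, y), c in joint.items())

# Frequencies that factorize exactly give zero information.
print(mutual_information([(0, 0), (0, 1), (1, 0), (1, 1)]))  # 0.0

# y is fully determined by a binary, equiprobable x: one bit.
print(mutual_information([(0, 0), (1, 1)]))  # 1.0
```

Only pairs that actually occur contribute to the sum, which matches the usual convention that terms with $$P(x,y) = 0$$ are taken to be zero.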

However, in practice it can be difficult to measure $$I(X;Y)$$ because the exact values of the probabilities $$P(x)$$, $$P(y)$$ and $$P(x,y)$$ are usually unknown. It may be easy in principle to estimate these probabilities from observed frequency distributions in experimental samples, but this usually leads to biased estimates of $$I(X;Y)\ ,$$ even if the samples used to estimate $$P(x)$$, $$P(y)$$ and $$P(x,y)$$ are themselves unbiased, representative samples of the underlying distributions of $$X$$ and $$Y\ .$$ This particular type of bias is called the “limited sampling bias”, and is defined as the difference between the expected value of the probability functional computed from the probability distributions estimated with $$N$$ samples, and its value computed from the true probability distributions.

Figure 2: The limited sampling bias. Simulation of an “uninformative” system whose discrete response y is distributed with a uniform distribution ranging from 1 to 10, regardless of which of two values of a putative explanatory variable x was presented. Examples of empirical response probability histograms (red solid lines) sampled from 40 and 200 observations (top and bottom row respectively) are shown in the Left and Central columns (responses to x = 1 and x = 2 respectively). The black dotted horizontal line is the true response distribution. The Right column shows (as blue histograms) the distribution (over 5000 simulations) of the mutual information values obtained with 40 (top) and 200 (bottom) observations respectively. As the number of observations increases, the limited sampling bias decreases. The dashed green vertical line in the Right column indicates the true value of the mutual information carried by the simulated system (which equals 0 bits).

By way of example, consider a hypothetical response variable $$Y$$ which is uniformly distributed in the range 1-10, and an "explanatory variable" $$X$$ which can assume values of either 1 or 2. Let us assume that these are in reality completely independent of one another, and therefore observing values of $$x$$ cannot help predict likely values of $$y\ .$$ However, an experimentalist searching for possible relationships between $$X$$ and $$Y$$ does not know this. In this case, the true conditional probability $$P(y|x)$$ is 0.1 (Figure 2A and Figure 2B, black dotted line) for all combinations of $$x$$ and $$y\ ,$$ which means that $$P(y)$$ is also 0.1; consequently, the true value of the mutual information is zero. Figure 2A and Figure 2B show experimental observation frequencies (red curves) obtained from a simulated experiment with $$N = 40$$ samples (20 samples for each value of $$x$$). In this simulated example, the samples were taken truly randomly and correctly from the underlying probability distributions, and thus there was no sampling bias. However, due to limited sampling, the estimated probabilities (red lines in Figure 2A and Figure 2B) differ markedly from 0.1 and from one another, and the mutual information estimate obtained by plugging the experimentally obtained estimates into the formula above is non-zero (0.2 bits). Repeating the simulated experiment over and over, one obtains slightly different results each time (Figure 2C): the information distribution computed from $$N = 40$$ samples is centred at 0.202 bits, not at the true value of 0 bits. This shows that the mutual information estimate suffers from limited sampling bias. The greater the number of samples, the smaller the fluctuations in the estimated probabilities, and consequently the smaller the limited sampling bias. For example, with $$N = 200$$ samples (100 samples for each value of $$x\ ;$$ Figure 2D-F), the limited sampling bias of mutual information is 0.033 bits.

Similar problems apply also to measures of causal relationships such as Granger causality and transfer entropy. Note that the limited sampling bias arises because mutual information is a nonlinear function of the probabilities. The probabilities themselves would be unaffected by limited sampling bias, because they would average to the true probabilities over many repetitions of the experiment with a finite number of data.
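The simulated experiment described above can be reproduced approximately in a few lines. This is a sketch, not the code used for Figure 2: the function names, the seed and the number of repetitions are choices made for this example, and the exact averages depend on the random numbers drawn.

```python
import math
import random
from collections import Counter

def plugin_mi(pairs):
    """Plug-in estimate of I(X;Y) in bits from (x, y) observations."""
    n = len(pairs)
    joint = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    return sum((c / n) * math.log2(c * n / (px[x] * py[y]))
               for (x, y), c in joint.items())

def mean_mi(n_per_x, n_repeats, rng):
    """Average plug-in information over repeated simulated experiments.

    y is uniform on 1..10 and independent of x (x = 1 or 2),
    so the true mutual information is 0 bits.
    """
    total = 0.0
    for _ in range(n_repeats):
        pairs = [(x, rng.randint(1, 10))
                 for x in (1, 2) for _ in range(n_per_x)]
        total += plugin_mi(pairs)
    return total / n_repeats

rng = random.Random(0)
print(mean_mi(20, 1000, rng))   # N = 40: compare with the 0.202 bits quoted above
print(mean_mi(100, 1000, rng))  # N = 200: compare with the 0.033 bits quoted above
```

Because every sample is drawn at random from the true distribution, any positive average obtained here is limited sampling bias rather than sampling bias.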

Limited sampling bias can be corrected by computing its approximated value analytically and subtracting it out, or by using prior information about the underlying probability distributions to reduce their statistical sampling fluctuations (Panzeri et al. 2007).
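One commonly used analytical approximation, given here as an assumption rather than as the specific procedure of the cited reference, is the leading-order term of the bias in powers of $$1/N$$. For the plug-in information estimate it equals $$[\sum_x (R_x - 1) - (R - 1)] / (2 N \ln 2)$$ bits, where $$R_x$$ is the number of response values with non-zero probability given $$x$$ and $$R$$ is the number with non-zero probability overall. These symbols and the function name below are introduced only for this sketch.

```python
import math

def first_order_mi_bias(n_samples, bins_per_x, bins_total):
    """Leading-order (1/N) bias of the plug-in mutual information, in bits.

    bins_per_x : list with, for each x, the number of y values that can occur
    bins_total : number of y values that can occur overall
    """
    numerator = sum(r - 1 for r in bins_per_x) - (bins_total - 1)
    return numerator / (2 * n_samples * math.log(2))

# Example of this section: two values of x, ten equiprobable values of y.
print(round(first_order_mi_bias(40, [10, 10], 10), 3))   # 0.162
print(round(first_order_mi_bias(200, [10, 10], 10), 3))  # 0.032
```

For $$N = 200$$ this is close to the 0.033 bits quoted above, whereas for $$N = 40$$ it falls short of the quoted 0.202 bits. The leading-order term is therefore a reasonable correction only when the number of samples is large compared with the number of possible responses. In real data the numbers of possible responses are themselves unknown and have to be estimated.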

## Sampling bias in neuroscience

Over recent years there has been growing interest in the effect of sampling bias and of limited sampling bias in neuroscience. An important problem in sensory neuroscience is to understand how networks of neurons represent and exchange sensory information by means of their coordinated pattern of response to stimuli. A widely used empirical approach to this problem is to record extracellularly the action potentials emitted by neurons. Extracellular electrodes are often placed in a brain location selected because action potentials can be detected there. It is recognized that this procedure may bias the sampling toward larger neurons (which emit signals that are easier to detect) and toward the most active neurons (Shoham et al. 2006). This is related to the problem of 'convenience sampling' discussed above: neuroscientists are more likely to report the behavior of those neurons that are most easily ("conveniently") observed with the methods at their disposal. Correcting this sampling bias requires recording also from smaller and less active neurons and evaluating, using various types of anatomical and functional information, the relative distributions of different types of neural populations. The implications of this sampling problem and ways to take it into account are discussed in Shoham et al. (2006).

The limited sampling bias causes problems in the determination of the causal relation between sensory stimuli and certain features of the neuronal population responses. This is because it may artificially increase the mutual information available in complex characterizations of the neuronal responses (such as those based on the precise times of action potentials) over the information available in simpler characterizations of the neuronal activity (such as those which neglect the details of the temporal structure of the neuronal response). The implications of this sampling problem and ways to correct for it are discussed in Panzeri et al. (2007).