Computational foundations for attentive processes
John K. Tsotsos and Neil D. B. Bruce (2008), Scholarpedia, 3(12):6545. doi:10.4249/scholarpedia.6545
Computational foundations for attentive processes - a formal, theoretical basis for why algorithms that address perceptual or cognitive problems require attentional mechanisms in order for their realizations to achieve human-like levels of performance.
Introduction
Although attention is a human ability we all intuitively think we understand, the computational foundations for attentive processes in the brain or in computer systems are not quite as obvious. Notions such as capacity limits pervade the attention literature but remain vague. This presentation, focusing mostly on vision and sensory perception, attempts to make these notions more concrete. Through mathematical proofs, it is possible to derive the necessity of attentive processes, and through algorithmic approximations (i.e., relaxing constraints to allow for non-ideal solutions and thus some error) and processing optimizations (i.e., changes that improve efficiency in time, memory or processing machinery) it is possible to discover realizations given either biological or computational resource constraints. Perhaps the most important conclusion is that the brain is not solving some generic perception problem, nor, by extension, some generic cognition problem. Rather, the generic problem is re-shaped through approximations so that the transformed problem becomes solvable with the amount of processing power available in the brain.
The human cognitive ability to attend has been widely researched in cognitive and perceptual psychology, neurophysiology and in computational systems. Regardless of discipline, the core issue has been identified to be Information Reduction. Humans, and many other animals as well, are faced with an immense amount of information and are limited in their ability to process all of it by the size of their brains. It is not simply that there is too much information; the problem is that each component of each stimulus can be matched to many different objects and scenes in our memory resulting in a combinatorial explosion of potential interpretations, as is caricatured in Figure 1.
The basic idea that humans can be viewed as limited-capacity information processing systems was first proposed by Broadbent in his 1958 book Perception and Communication. Attention also appears in early artificial intelligence (AI) systems, explicitly as a focus-of-attention mechanism or implicitly as a search-limiting heuristic, motivated primarily by practical concerns: computers were not powerful enough, and one had to do whatever was possible to limit the amount of processing required so that resources could be allocated to the most relevant tasks. All search methods that involve ordering or pruning of the search space reduce the quantity of data that must be processed. Information reduction is needed because the size of the full search space of a problem does not match the available computing resources and the system's performance requirements, so a brute-force search scheme is not sufficient. The classic mechanisms include branch-and-bound search, hill-climbing, best-first search, A* search, and more.
Motivation from cognitive psychology also had an important impact through a number of early systems. Anderson's 1976 ACT system was intended as a general model of cognition. ACT has a focus of attention that changes as nodes in long-term memory are activated and placed in working memory, and as other nodes are pushed out of working memory. Focus is implemented with a small working memory (a capacity limit), together with strategies for determining which productions are applicable at any time. In a very different application domain, Barstow's 1979 automatic programming system PECOS used intermediate-level grouping to focus attention on the relevant and to ignore detail. LIBRA, developed by Kant in 1979 for the efficiency analysis of code, has explicit attention and resource-management rules; these rules determine how LIBRA's own resources are to be allocated on the basis of greatest utility. A landmark in AI, the 1980 HEARSAY-II speech-understanding system of Erman, Hayes-Roth, Lesser and Reddy ranked concepts using goodness-of-fit in order to focus on the strongest and those with greatest utility. Several computer vision systems also included attention strategies to limit the region of interest processed in an image (the earliest being Kelly 1971, in the context of face outline detection) or even the window in time for video sequence input (the first being Tsotsos 1980, for heart motion analysis). There are many more examples that help make the point: the efficient matching of input, processing methods and resources has played a major role in the development of computer systems whose performance attempts to match that of humans.
Capacity limits and computational complexity
One of the most frustrating things about studying attention is that research is so often accompanied by vague discussions of capacity limits, bottlenecks, and resource limits. How can these notions be made more concrete? The area of computer science known as Computational Complexity is concerned with the theoretical issues dealing with the cost of achieving solutions to problems in terms of time, memory and processing power as a function of problem size. It thus provides the necessary theoretical foundation on which to base an answer to the capacity question.
It is reasonable to ask whether computational complexity has relevance for real problems. Stockmeyer & Chandra (1988) present a compelling argument. The most powerful computer that could conceivably be built could not be larger than the known universe, could not consist of hardware smaller than the proton, and could not transmit information faster than the speed of light. Such a computer could consist of at most \(10^{126}\) pieces of hardware. It can be proved that, regardless of the ingenuity of its design and the sophistication of its program, this ideal computer would take at least 20 billion years to solve certain mathematical problems that are known to be solvable in principle (for example, the well-known traveling salesman problem with a sufficiently large number of destinations). Since the universe is probably less than 20 billion years old, it seems safe to say that such problems defy computation [1]. There exist many real problems for which this argument applies (see Garey & Johnson, 1979, for a catalog), and they form the foundation for the theorems presented here.
With respect to neurobiology, many have considered complexity constraints in the past, but mostly with coarse counting arguments (for example, Feldman & Ballard 1982 or Tsotsos 1987). All roads lead to the same conclusion: the brain cannot fully process all stimuli in parallel within the observed response times. But this is just another way of saying there is a capacity limit; it does not constrain a solution. Arguing in this manner brings us no closer to knowing what exactly the brain is doing to solve the problem. This article takes the position that a more formal analysis of vision, at the appropriate level of abstraction, will help to reveal quantitative constraints on visual architectures and processing. First, however, it is important to address the applicability of this kind of analysis to the neurobiology of the brain.
[1] This claim, plus the entire discussion of intractability, relies on the currently unproved conjecture that the space of problems that fall into the category referred to by Stockmeyer and Chandra is not the same as the space of problems that are efficiently solvable. Formally, we assume P≠NP, and this assumption needs to be acknowledged.
Human perception/cognition and computation
Can cognition, intelligence, perception or vision be expressed in computational form? Surely, the discovery of explanatory mechanisms that yield observed performance is central to all experimental studies. In a very real sense, such explanations are presented as algorithms, step-by-step procedures for achieving some result, and algorithms are a central component of the language of computation. But could the brain actually process signals in an algorithmic manner? This non-trivial issue is important because if it could be proved that human brain processes cannot be modeled computationally (irrespective of current computer hardware), then modeling efforts would be futile. The jury is still out on this point, but substantial support has accumulated, beginning with a critical theoretical notion: decidability. A proof of decidability is sufficient to guarantee that a problem can be modeled computationally (Davis, 1958, 1965). Decidability should not be confused with tractability. An undecidable problem is one that cannot be solved for all cases by any algorithm whatsoever. A tractable problem is one for which enough resources can be found and enough time can be allocated that the problem can be solved. An intractable problem is one for which no algorithm can exist that computes all instances of it in polynomial, as opposed to exponential, time. That is, the mathematical function relating processing time or space to the size of the input is exponential in the input size; the problem exhibits exponential complexity. If the input is large, its role as an exponent leads to the need for enormous amounts of processing resources, as described by Stockmeyer and Chandra above. There are several classes of such problems with differing characteristics, and NP-Complete is one of those classes.
To show that vision is decidable, it must first be formulated as a decision problem. If, for some problem, we wish to know of each element in a countably infinite set \(A\) whether or not that element belongs to a certain set \(B\) which is a proper subset of \(A\ ,\) then that problem can be formulated as a decision problem. Such a problem is decidable if there exists a Turing Machine which computes 'yes' or 'no' for each element of \(A\) in answer to the decision question. A Turing Machine is a hypothetical computing device consisting of a finite state control, a read-write head and a two-way infinite sequence of labelled tape squares. A program provides input to the machine and is executed by the finite state control; the computations specified by the program read and write symbols on the squares of the tape.
If we look at the perceptual task definitions provided by Macmillan and Creelman (2005), we see that all psychophysical judgments are of one stimulus relative to another: the basic process is comparison. The most basic task is discrimination, the ability to tell two stimuli apart. This certainly looks very much like the decision problem defined above. Macmillan and Creelman define recognition as a discrimination task where a subject is required to say to which of two classes a given stimulus belongs. Strongly related to this, the visual match problem defined by Tsotsos (1989), namely whether an image contains an instance of a given target, is an instance of Yasuhara's Comparing Turing Machine (1971) and thus decidable.
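To make the yes/no character of such a formulation concrete, here is a minimal sketch (an illustration only, not the formal construction in Tsotsos 1989); the sliding-window comparison, the tolerance parameter and all names are assumptions introduced here:

```python
import numpy as np

def visual_match(image: np.ndarray, target: np.ndarray, tol: float = 0.1) -> bool:
    """Decision problem: does `image` contain a window matching `target`?

    The answer is only 'yes' (True) or 'no' (False), mirroring the
    decision-problem formulation of recognition as discrimination.
    """
    ih, iw = image.shape
    th, tw = target.shape
    for r in range(ih - th + 1):               # slide the target over every position
        for c in range(iw - tw + 1):
            window = image[r:r + th, c:c + tw]
            if np.max(np.abs(window - target)) <= tol:   # the comparison step
                return True                    # 'yes': a matching instance exists
    return False                               # 'no': no window matches the target
```

Note that restricting candidates to contiguous windows is already a simplification of the unbounded problem analyzed below, where arbitrary subsets of locations must be considered.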
This, however, is not a proof that all of human vision can be modeled computationally. If no sub-problem of vision could be found to be decidable, then it might be that perception as a whole is undecidable and thus cannot be computationally modeled. But what if there are other, undecidable vision sub-problems? Even if some other aspect of vision is determined to be undecidable, this does not mean that all of vision is also undecidable or that other aspects of perception cannot be modeled computationally. Hilbert's 10th problem in mathematics and the Halting Problem for Turing Machines are two examples of famous undecidable problems. The former does not imply that mathematics is impossible, and the latter does not mean that computers are impossible; computer science and mathematics both include sub-problems that are decidable as well as sub-problems that are undecidable, and these co-exist with no insurmountable difficulty.
Once a problem is known to be decidable, it may in fact be an instance of one of the many known NP-Complete problems and proof methods exist for demonstrating this. The proofs need to show that for any possible algorithm or implementation, the time complexity function of a solution is exponential in the size of the input. The analysis is asymptotic; that is, the problem is intractable only as the input size becomes very large. Small values of an exponent (i.e., small input sizes) are often quite realizable.
Some might think that, since the brain's neurons operate in parallel, parallelization of computations makes the overall process tractable. NP-complete problems are strongly conjectured not to be parallelizable, and showing a problem is NP-complete is very strong evidence that the problem is not parallelizable, i.e., not computable in polynomial time by a set of parallel machines. It is important to note that the problem may still be parallelizable on some of the possible inputs, and it may be that all sufficiently small instances are parallelizable. Although NP-completeness is definitely a sign that the problem may be hard, it does not rule out efficient algorithms for the search problems the brain must actually compute. We are thus faced with two choices: try to find those algorithms, or ask whether the brain is perhaps not even solving the original problem. We approach these questions in the context of vision (since 40% or more of the cortex is devoted to vision, it is a reasonable starting point).
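A rough way to see why parallel hardware alone does not help, assuming (as an illustration, not a figure from the cited work) on the order of \(10^{11}\) neurons acting as parallel processors: dividing exponential work across \(K\) processors gives at best
\[ \frac{2^{PM}}{K} \;=\; 2^{PM - \log_2 K}, \]
and with \(K \approx 10^{11}\) the exponent shrinks by only \(\log_2 10^{11} \approx 37\ ,\) a negligible change when \(PM\) runs into the thousands or millions.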
The computational complexity of vision
What is the generic vision problem? For our purposes we will use the following: Given a sequence of images, for each pixel determine whether it belongs to some particular object or other spatial construct, localize all objects in space, detect and localize all events in time, determine the identity of all the objects and events in the sequence, determine relationships among objects and relate all objects and events to the available world knowledge. This section will briefly summarize the steps of an argument that concludes that the generic vision problem as defined here is intractable. If a vision system needs to search through the set of all possible image locations (pixels) and compare them to each element of memory, then without any task guidance or knowledge of the characteristics of the subset it seeks, it cannot know which subset may be more likely than another. As a result, it is the powerset of all locations that gives the number of image subsets to examine, an exponential function. Thus, purely feed-forward, unconstrained visual processing seems to have an inherent exponential nature.
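To make the counting step concrete, a small sketch (with arbitrary toy numbers; nothing here is drawn from the cited analyses) shows how quickly the number of candidate location subsets, \(2^P\ ,\) grows:

```python
from itertools import combinations

def count_location_subsets(P: int) -> int:
    """Number of non-empty subsets of P image locations: 2**P - 1."""
    return 2 ** P - 1

# Even toy images produce astronomical candidate counts.
for P in (4, 16, 64, 256):
    print(f"P = {P:3d} locations -> about {count_location_subsets(P):.3e} candidate subsets")

# Explicit enumeration is feasible only for the tiniest cases:
locations = [(r, c) for r in range(2) for c in range(2)]       # a 2x2 'image', P = 4
subsets = [s for k in range(1, len(locations) + 1)
           for s in combinations(locations, k)]
print(len(subsets), "explicit subsets for a 2x2 image")          # prints 15
```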
Two abstract problems can be defined to make this more concrete. The important variables that play a role in the definitions are: the number of image locations (\(P\)), the number of object/event prototypes in memory (\(N\)) and the number of measurements made at each image location (\(M\)). The first problem is Unbounded Visual Match. This is intended to model recognition where no task guidance to optimize search is permitted. It corresponds to recognition with all top-down connections in the visual processing hierarchy removed or disabled; in other words, this is pure data-directed vision. More precisely, it asks whether there exists a subset of pixels in an image that matches a particular target. The second problem is Bounded Visual Match. This is recognition with knowledge of a target and task in advance, such that the knowledge is used to optimize the process, the bound thus being on the size of the pixel subsets considered in the search. The basic theorems, proved in (Tsotsos 1989, 1990a), are:
- Theorem 1
- Unbounded (Bottom-Up) Visual Matching (UVM) is NP-Complete, with time complexity an exponential function of \(P\) (the worst-case time complexity is \({\rm O}(N 2^{P M})\)).
- Theorem 2
- Bounded (Task-Directed) Visual Match (BVM) has time complexity linear in \(P\) (the worst-case time complexity is \({\rm O}(N P 2^M)\)).
The results have important implications. They first tell us that the pure data-directed approach to vision (and in fact to perception in any sensory modality) is computationally intractable in the general case. They also tell us that bounded visual search takes time linearly dependent on the size of the input, something that has been observed in a huge number of experiments. Even small amounts of task guidance can turn an exponential problem into one with linear time complexity.
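To convey how sharply the two bounds diverge, the sketch below plugs small, arbitrary values of \(N\ ,\) \(P\) and \(M\) into the two worst-case expressions; the numbers are purely illustrative and are not measurements of any visual system:

```python
def uvm_ops(N: int, P: int, M: int) -> float:
    """Worst-case operations for Unbounded Visual Match: O(N * 2**(P*M))."""
    return N * 2.0 ** (P * M)

def bvm_ops(N: int, P: int, M: int) -> float:
    """Worst-case operations for Bounded Visual Match: O(N * P * 2**M)."""
    return N * P * 2.0 ** M

N, M = 1000, 5                              # toy model-base size and feature count
for P in (4, 8, 16, 32):
    print(f"P={P:2d}   UVM ~ {uvm_ops(N, P, M):.2e}   BVM ~ {bvm_ops(N, P, M):.2e}")
```

The bounded column grows only linearly with \(P\ ,\) which is the qualitative pattern reported in visual search experiments when the target is known in advance.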
However, the above theorems rest on worst-case analysis; perhaps biological vision systems are designed around average- or best-case scenarios, as Lowe (1990) suggested, arguing that expected-case analysis more correctly reflects biological systems. Parodi et al. (1998) addressed this criticism by developing an algorithm for generating random instances of polyhedral scenes and examining the median-case complexity of labeling them. The key results echo the worst-case results: the unbounded median-case time complexity is exponential in the number of polyhedral junctions, while the bounded median-case time complexity is linear in the number of junctions. In other words, Lowe's suggestion did not lead to an analysis showing a more efficient solution. The key conclusion remains: the application of task knowledge guides processing and converts an exponential complexity problem into a linear one. This is one of the main forms of attentive processing.
Complexity constrains the architecture for visual processing
How can one deal with an intractable problem in practice? NP-Completeness eliminates the possibility of developing a completely optimal and general algorithm. Garey & Johnson (1979) provide a number of guidelines. The relevant guideline for this analysis is: use natural parameters to guide the search for approximation algorithms, that is, algorithms that solve the problem but not optimally, providing a solution only to within some specified error tolerance. One way of doing this is to reduce the exponential effect of the largest-valued parameters. The process was presented in (Tsotsos 1987, 1988, 1990a, 1990b, 1991).
The natural parameters of the computation for visual search are \(N\ ,\) \(P\) and \(M\ ,\) with worst-case time complexity \({\rm O}(N 2^{PM})\ .\) \(N\) is a large number, but any reduction in it leads only to linear improvement. \(P\) is also a large number, and reduction in \(P\) can lead to exponential improvement, as does reduction in \(M\ ;\) however, \(M\) is not so large a number. We can conclude that the best improvement would come from reductions in \(P\ .\) What would such changes look like? Here are several that are straightforward yet sufficient (but not necessary):
- Hierarchical organization takes search of model space from \({\rm O}(N)\) to \({\rm O}(\log_2N)\ .\)
- Search within a pyramidal representation of the image (a layered representation, each layer with decreasing spatial resolution and with bidirectional connections between adjacent layers) operates in a top-down fashion, beginning with a smaller more abstract image, and is then refined locally thus reducing \(P\ .\)
- Spatio-temporally localized receptive fields reduce the number of possible receptive fields from \(2^P\) to \({\rm O}(P^{1.5})\) (this assumes contiguous receptive fields of all possible sizes centered at all locations in the image array).
- Spatial selectivity can further reduce the \({\rm O}(P^{1.5})\) term if one selects the receptive field that is to be processed. This is not a selection of location only, but rather a selection of a local region and its size at a particular location.
- Feature selectivity can further reduce the \(M\) term, that is, which subset of all possible features actually is represented in the image or is important for the task at hand.
- Object selectivity can further reduce the \(N\) term, reflecting task-specific information.
After applying the first three constraints, the worst-case time complexity becomes \({\rm O}(P^{1.5} 2^M \log_2N)\ .\) The next three reduce \(P\ ,\) \(N\) and \(M\ ,\) in the limit all the way to 1, bringing the expression down to perhaps its lowest value. These last three are all manifestations of attentive processing.
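A rough numerical illustration of the first three reductions, using arbitrarily chosen values (roughly a one-megapixel image and a large model base; none of these numbers come from the cited analyses):

```python
import math

N = 100_000      # assumed number of object/event prototypes in memory
P = 1_000_000    # assumed number of image locations (roughly one megapixel)
M = 10           # assumed number of measurements per location

# After hierarchical model organization, pyramidal image search and
# localized receptive fields: O(P**1.5 * 2**M * log2 N)
constrained = (P ** 1.5) * (2 ** M) * math.log2(N)
print(f"constrained worst case  : about 10**{math.log10(constrained):.1f} operations")

# The unconstrained bound O(N * 2**(P*M)) is far too large to evaluate
# directly, so report only its base-10 exponent.
unconstrained_exponent = math.log10(N) + P * M * math.log10(2)
print(f"unconstrained worst case: about 10**{unconstrained_exponent:.0f} operations")
```

Attentional selection of location, feature and object then attacks the remaining \(P^{1.5}\ ,\) \(2^M\) and \(\log_2N\) factors directly.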
But how do these actions affect the generic perception problem described above? Hierarchical organization and logical map separation do not affect the nature of the vision problem. However, the other mechanisms have the following effects:
- Pyramidal abstraction affects the problem through the loss of location information and signal combination.
- Spatio-temporally localized receptive fields force the system to look at features across a receptive field instead of finer-grained combinations; arbitrary combinations of locations are thus disallowed.
- Attentional selection further limits what is processed in the location, feature and object domains. The knowledge that task-related attentive processes employ to accomplish the selection can be as generic as statistical regularities.
As a result of these, the generic problem as defined earlier has been altered. Unfortunately it is not easy to formally characterize this altered problem. But it is clear that certain aspects of the problem are lost with these approximations.
Extending to cognition and action
Much of human cognition is action-based, and actions need sensory input for their guidance. The kind of visual search described above seems ubiquitous within intelligent behavior. Simple behavior may be conceptualized as:
- localize and recognize a stimulus (with or without a pre-specified target)
- link the stimulus to an applicable action
- decide among all possible actions
- generate actuator commands.
If sensory perception is an integral component of intelligence, sensory search problems are integral sub-problems of intelligent behavior.
The first step is exactly the UVM problem. The second step can be solved by table look-up, the table size being the number of possible stimuli (say 100,000 objects, irrespective of location, as argued in Tsotsos 1995) times the number of possible behaviors, where the table entries code strength of applicability, perhaps dependent on stimulus characteristics (thus, each stimulus may be associated with more than one behavior). The third step can be done by choosing the largest applicability strength from among all the entries, and the fourth would involve a function call or another table look-up.
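A minimal sketch of steps two through four, assuming a toy table whose stimuli, behaviors and applicability strengths are all invented for illustration (none of them come from the cited work):

```python
# Toy stimulus-to-behavior applicability table (all entries invented).
# Keys: recognized stimuli; values: candidate behaviors with applicability strengths.
applicability = {
    "red_sphere":  {"grasp": 0.9, "avoid": 0.1, "track": 0.4},
    "fast_motion": {"grasp": 0.1, "avoid": 0.8, "track": 0.7},
}

def select_behavior(recognized_stimulus: str) -> str:
    """Steps 2-3: look up the applicable behaviors and choose the strongest."""
    entries = applicability[recognized_stimulus]
    return max(entries, key=entries.get)

def act(recognized_stimulus: str) -> None:
    """Step 4: dispatch the chosen behavior (a print stands in for actuator commands)."""
    behavior = select_behavior(recognized_stimulus)
    print(f"stimulus '{recognized_stimulus}' -> behavior '{behavior}'")

act("fast_motion")    # prints: stimulus 'fast_motion' -> behavior 'avoid'
```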
The problems are formalized in Tsotsos (1995) with proofs for the following:
- Theorem 3
- Unbounded Stimulus-Behavior Search is NP-hard.
- Theorem 4
- Bounded Stimulus-Behavior Search has time complexity linear in the number of image locations.
Extensions to time-varying input, active camera systems, and to the sensor planning problem in visual search have all been presented and mirror the above results (Tsotsos 1992, Ye & Tsotsos 1996). The generic, strictly data-directed versions of these problems all exhibit inherent exponential time complexity and the addition of attentive, task-specific guidance is sufficient to reduce this complexity to linear or better.
Conclusions
We have shown that visual search, and by extension any other intelligent behavior that requires sensory perception as an integral component, has exponential time complexity, with exponents too large for it to be tractably solved in any realization. By transforming the original problem into a different, simpler one through optimizations and approximations, a very low complexity solution can be obtained. Those optimizations and approximations are all different kinds of attention. Attention, thus, is a set of mechanisms that help optimize the search processes inherent in perception and cognition. What are those mechanisms? Several have already been mentioned, and the full group can be classified into three main types with several specializations within each:
- Selection: spatio-temporal region of interest; world/task/object/event model; gaze/viewpoint; best interpretation/response
- Restriction: task-relevant search space pruning; location cues; fixation points; search depth control
- Suppression: spatial/feature surround inhibition; suppression of task-irrelevant computations; inhibition of return
Details on how such mechanisms may manifest themselves in computational systems appear in many of the works cited earlier.
References
- Anderson, J (1976). Language, Memory and Thought. Erlbaum Associates, Hillsdale, NJ.
- Barstow, D R (1979). Knowledge-Based Program Construction. North-Holland, New York.
- Broadbent, D (1958). Perception and Communication. Pergamon Press, London.
- Davis, M (1958). Computability and Unsolvability. McGraw-Hill, New York.
- Davis, M (1965). The Undecidable. Raven Press, Hewlett, New York.
- Erman, L; Hayes-Roth, F; Lesser, V and Reddy, R (1980). The Hearsay-II Speech-Understanding System: Integrating Knowledge to Resolve Uncertainty. ACM Computing Surveys 12: 213-253.
- Feldman, J A and Ballard, D H (1982). Connectionist models and their properties. Cognitive Science 6: 205-254.
- Garey, M and Johnson, D (1979). Computers and Intractability: A Guide to the Theory of NP-Completeness. Freeman, San Francisco.
- Kant, E (1979). Efficiency Considerations in Program Synthesis: A Knowledge-Based Approach. Stanford University, Palo Alto, CA.
- Kelly, M (1971). Edge Detection in Pictures by Computer Using Planning. Machine Intelligence 6: 397-409.
- Lowe, D (1990). Probability Theory as an Alternative to Complexity. Behavioral and Brain Sciences 13: 451-452.
- Macmillan, N A and Creelman, C D (2005). Signal Detection Theory: A User's Guide. Routledge, New York.
- Parodi, P; Lancewicki, R; Vijh, A and Tsotsos, J K (1998). Empirically-Derived Estimates of the Complexity of Labelling Line Drawings of Polyhedral Scenes. Artificial Intelligence 105: 47-75.
- Stockmeyer, L J and Chandra, A K (1988). Intrinsically difficult problems. Scientific American Trends in Computing 1: 88-97.
- Tsotsos, J K (1980). A Framework for Visual Motion Understanding. PhD Thesis, Dept. of Computer Science, University of Toronto, Toronto.
- Tsotsos, J K (1987). A `Complexity Level' Analysis of Vision. Proceedings of International Conference on Computer Vision: Human and Machine Vision Workshop, London.
- Tsotsos, J K (1989). The Complexity of Perceptual Search Tasks. Proc. IJCAI, Detroit. 1571-1577
- Tsotsos, J K (1988). A `complexity level' analysis of immediate vision. International Journal of Computer Vision 1(4): 303-320.
- Tsotsos, J K (1990a). A Complexity Level Analysis of Vision. Behavioral and Brain Sciences 13(3): 423-455.
- Tsotsos, J K (1990b). A Little Complexity Analysis Goes a Long Way. Behavioral and Brain Sciences 13(3): 459-469.
- Tsotsos, J K (1991). Is Complexity Analysis Appropriate for Analyzing Biological Systems? Behavioral and Brain Sciences 14(4): 770-773.
- Tsotsos, J K (1992). On the relative complexity of active vs passive visual search. International Journal of Computer Vision 7(2): 127-141.
- Tsotsos, J K (1995). On Behaviorist Intelligence and the Scaling Problem. Artificial Intelligence 75: 135-160.
- Turing, A M (1936). On Computable Numbers, with an Application to the Entscheidungsproblem. Proceedings of the London Mathematical Society 2(42): 230-265.
- Turing, A M (1948 (1969 reprint)). Intelligent Machinery (reprint in Machine Intelligence 5). Meltzer, B., Michie, D. Edinburgh University Press, Edinburgh.
- Ye, Y and Tsotsos, J K (1996). 3D Sensor Planning: Its Formulation and Complexity. International Symposium on Artificial Intelligence and Mathematics, Fort Lauderdale, FL.
- Yasuhara, A (1971). Recursive Function Theory and Logic. Academic Press, New York.
Internal references
- Lawrence M. Ward (2008) Attention. Scholarpedia, 3(10):1538.
- Valentino Braitenberg (2007) Brain. Scholarpedia, 2(11):2918.
- Zhong-Lin Lu and Barbara Anne Dosher (2007) Cognitive psychology. Scholarpedia, 2(8):2769.
- Olaf Sporns (2007) Complexity. Scholarpedia, 2(10):1623.
- Raymond Klein and Jason Ivanoff (2008) Inhibition of return. Scholarpedia, 3(10):3650.
- Howard Eichenbaum (2008) Memory. Scholarpedia, 3(3):1747.
- Peter Jonas and Gyorgy Buzsaki (2007) Neural inhibition. Scholarpedia, 2(9):3286.
- Rodolfo Llinas (2008) Neuron. Scholarpedia, 3(8):1490.
- Jeremy Wolfe and Todd S. Horowitz (2008) Visual search. Scholarpedia, 3(7):3325.
Further reading
- None.