Winfried A. Fellenz and Georg Hartmann
E-Mail: {getfell, hartmann}@get.uni-paderborn.de
Universität GH Paderborn, FB Elektrotechnik, Pohlweg 47-49, 33098 Paderborn, Germany
Abstract
The segmentation of objects in a real-world scene is a prerequisite for any higher-level recognition or interpretation process. Biological visual systems exploit efficient mechanisms for object extraction that seem to be mostly data driven. We propose a network for perceptual grouping inspired by neurophysiological and psychophysical findings. It incorporates a phase diffusion process, which labels the whole image into its constituent objects and the background, followed by a selective attention stage, which sequentially extracts the objects in the scene. The image is processed by four successive stages modelled on visual cortical mechanisms. Direction-specific edge responses serve as starting points for a competitive and cooperative phase process. The resulting phase image is processed by an attention mechanism that extracts homogeneous regions using both spatial and phase information, followed by the generation of a saccadic signal.
1. Introduction
Intermediate vision is concerned with extracting objects and their attributes from a scene. It provides a representation that links low-level processes such as feature extraction to higher-level processes such as object recognition and scene interpretation. Apart from shape, motion, and depth analysis, segmenting the scene into its basic regions is the most common intermediate-level computation. Since object boundaries normally coincide with intensity discontinuities in the image, many segmentation methods rely on a low-level edge detection stage and group the extracted edges into region boundaries using edge following and linking. These edge-based methods contrast with region growing schemes, which merge initially small regions with similar intensity values into larger regions, either by comparing pixel properties in a local neighbourhood or by using a global statistic. It has been shown [21] that most segmentation methods can be unified in a common variational formulation equivalent to the Mumford-Shah energy

$$E(u, K) = \int_{\Omega \setminus K} \|\nabla u\|^2 \, dx\, dy + \lambda \int_{\Omega} (u - g)^2 \, dx\, dy + \mu\, L(K) \qquad (1)$$

where Ω is the image domain and λ, μ are positive weighting constants. The functional defines segmentation as a joint smoothing and edge detection problem: the first term imposes smoothness of the image outside the edges, the second term guarantees that the piecewise smooth image u(x,y) approximates the image g(x,y), and the third term forces the discontinuity set K to have minimal length L(K). The final edge set K marks the boundaries that separate regions with uniform properties. A similar formulation was used in [2, 17, 27] for surface reconstruction, where a flexible template is fitted to sparse and noisy data, allowing discontinuities in the data to be labelled explicitly. This labelling is introduced either by a line process [11] coupled to the regularising approximation process, or by a weak continuity constraint that allows occasional cracks in the interpolation, each charged with a penalty term in the corresponding energy function. However, a perceptually satisfying segmentation of an intensity image should also yield regions corresponding to the observed perceptual groups and objects, using the discontinuities only to indicate region boundaries. Therefore the interpolation term can be neglected, and the segmentation can be decoupled from the intensity image, allowing regions to form emergently in the phase domain that correspond to perceptual groups in the image domain.
Next, we propose a scheme that transforms the preattentive grouping and segmentation process into phase space, thereby decoupling the resulting phase image from the intensity image. The responses of an initial edge detection stage are rectified into ON and OFF channels, producing direction-selective responses at each position of the intensity image. Simple local constraints between the resulting edge maps relax the associated phase labels into homogeneous regions separated by phase discontinuities, which correspond to zero crossings of the smoothed second intensity derivative. The regions are then extracted by an attentive stage.
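To make the three terms of the functional concrete, the following Python/NumPy sketch evaluates a discretised version of the Mumford-Shah energy on a pixel grid. It is our own illustration, not part of the proposed model: the edge set K is represented by hypothetical boolean masks `cut_h` and `cut_v` marking cracks between horizontally and vertically adjacent pixels, the length L(K) is approximated by the number of cut grid edges, and `lam` and `mu` are hypothetical names for the weights of the data and length terms.

```python
import numpy as np


def mumford_shah_energy(u, g, cut_h, cut_v, lam=1.0, mu=1.0):
    """Discretised Mumford-Shah energy E(u, K) on a pixel grid.

    cut_h[i, j] is True if K separates pixel (i, j) from (i, j + 1);
    cut_v[i, j] is True if K separates pixel (i, j) from (i + 1, j).
    """
    u = np.asarray(u, dtype=float)
    g = np.asarray(g, dtype=float)
    cut_h = np.asarray(cut_h, dtype=bool)
    cut_v = np.asarray(cut_v, dtype=bool)

    dx = np.diff(u, axis=1)  # horizontal differences, shape (H, W - 1)
    dy = np.diff(u, axis=0)  # vertical differences, shape (H - 1, W)

    # Smoothness term: squared differences only between neighbours not cut by K.
    smooth = np.sum(dx[~cut_h] ** 2) + np.sum(dy[~cut_v] ** 2)
    # Data term: the piecewise smooth u must approximate the image g.
    data = lam * np.sum((u - g) ** 2)
    # Length term: every cut grid edge contributes unit length to L(K).
    length = mu * (np.count_nonzero(cut_h) + np.count_nonzero(cut_v))
    return smooth + data + length


# A vertical step edge of height 1 between columns 1 and 2.
g = np.zeros((4, 4))
g[:, 2:] = 1.0
cut_v = np.zeros((3, 4), dtype=bool)
cut_h = np.zeros((4, 3), dtype=bool)
cut_h[:, 1] = True  # K runs along the step

print(mumford_shah_energy(g, g, cut_h, cut_v, mu=0.5))                   # 2.0
print(mumford_shah_energy(g, g, np.zeros((4, 3), bool), cut_v, mu=0.5))  # 4.0
```

With `mu=0.5`, placing K along the step is cheaper than paying the smoothness penalty across it; for this u, raising `mu` above 1.0 makes the uncut configuration the cheaper one.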
Here $q^{+}(x,y)$ represents the even-symmetric function and $q^{-}(x,y)$ its odd-symmetric Hilbert transform. Two constants specify the envelope of the oriented Gaussian, a third sets the frequency of the modulating sinusoid, and a fourth is a normalisation factor. Figure 2 shows the edge detection stage and the model hypercolumns, which are the initial data for the relaxation labelling process presented in the next section.
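The following Python/NumPy sketch illustrates a quadrature filter pair of the kind described above, using the standard Gabor form: an oriented Gaussian envelope (widths `sigma_x`, `sigma_y`) modulated by a cosine (even, $q^{+}$) or a sine (odd, $q^{-}$) of frequency `omega`. The parameter names, the normalisation choice, and the DC removal are our assumptions, not the paper's equations. Note that the sine Gabor only approximates the Hilbert transform of the cosine Gabor; the approximation improves as the envelope spans more periods of the carrier. The helper `on_off` shows the half-wave rectification of a signed response into ON and OFF channels.

```python
import numpy as np


def gabor_pair(size, theta, sigma_x, sigma_y, omega):
    """Even (cosine) and odd (sine) Gabor kernels oriented at angle theta."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    # Rotate the coordinates; xr runs perpendicular to the preferred edge.
    xr = x * np.cos(theta) + y * np.sin(theta)
    yr = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(xr ** 2 / (2 * sigma_x ** 2) + yr ** 2 / (2 * sigma_y ** 2)))
    c = 1.0 / envelope.sum()  # assumed normalisation: unit-mass envelope
    even = c * envelope * np.cos(omega * xr)
    odd = c * envelope * np.sin(omega * xr)
    even -= even.mean()  # remove the DC response so uniform regions give zero
    return even, odd


def on_off(response):
    """Half-wave rectify a signed filter response into ON and OFF channels."""
    response = np.asarray(response, dtype=float)
    return np.maximum(response, 0.0), np.maximum(-response, 0.0)


even, odd = gabor_pair(size=9, theta=0.0, sigma_x=2.0, sigma_y=2.0, omega=0.5)
on, off = on_off(np.array([-1.0, 0.5, 2.0]))
```

Both channels are non-negative, and ON minus OFF recovers the signed response, so no information is lost by the rectification.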
Figure 3. Scheme for relaxation and diffusion of phase labels. The intensity distribution (a) is filtered to extract intensity gradients (b) corresponding to perceived edges in the image. The smoothed derivative of the edge map is rectified into ON and OFF channels (c), allowing simple compatibility constraints between channels to modify an initially uniform phase map (d); (e) intermediate and (f) final phase distribution of the phase image evolving in parallel over time.

… an incomplete circle are grouped into a synchronised round disk, with a discontinuity at the upper right indicating the missing dot in phase space. Figure 4 shows the results of the proposed segmentation scheme for a scene with three simple objects. Although the objects are defined by different boundary types, ranging from intensity discontinuities through lines to dots, the phase gradient gives a common interpretation of all contour types. Figure 3 depicts the general idea of the proposed phase relaxation and diffusion mechanism. The principal processing is as follows [10]: we defined smoothly varying constraints on the interaction strength between all direction-selective responses of the second preprocessing stage. These constraints support orientation continuity through positive interactions between similar directions, and decouple the two sides of a contour through negative interactions between opposite directions. Labels spread into regions because phase oscillators at the contours synchronise with oscillators in the interior of objects. This filling-in is similar to brightness diffusion [5, 23], which allows the separation of figure and ground [16], but instead uses the coherence of cyclic phases to label the whole scene. The proposed labelling process can be formulated as the minimisation of an explicit functional depending on the basic compatibility relations, using results developed in [14]. The phases $\phi_{i,j}$ of each hypercolumnar vector at position $(i,j)$ are updated by a Gauss-Seidel procedure, using a sigmoid nonlinearity $g$ to sum the individual activations and a shifted cosine $f$ to calculate the contributions of neighbouring elements depending on their phase difference:

$$\dot{\phi}_{i,j} = \omega_{i,j} + \sum_{m,n \in M} v_{m,n}\, E_{i,j}(m)\, Z_{i,j}(n), \qquad Z_{i,j}(n) = g\Big(\sum_{k,l \in N} h_{k,l}\, E_{i+k,j+l}(n)\, f(\Delta\phi)\Big) \qquad (4)$$

$$f(\Delta\phi) = \alpha\,(\beta + \cos \Delta\phi), \qquad \Delta\phi = \phi_{i+k,j+l} - \phi_{i,j} \qquad (5)$$

Notation
$\phi_{i,j}$: phase at position $(i,j)$
$\omega_{i,j}$: random variable
$E_{i,j}(m)$: activity in the m-th feature map
$Z_{i,j}(n)$: contribution of the n-th feature map
$v_{m,n}$: compatibility constraints
$h_{k,l}$: connectivity matrix
$g(x)$: sigmoid nonlinearity ($\tanh x$)
$f(x)$: periodic function of the phase difference
$\Delta\phi$: phase difference
$\alpha, \beta$: constants
$m, n \in M$: set of discrete directions
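The update rule (4) can be sketched as a Gauss-Seidel sweep in Python/NumPy, in which each site's new phase is used immediately by the sites visited after it. This is an illustration under stated assumptions, not the authors' implementation: the Euler step size `eta`, the Gaussian distribution of the noise $\omega_{i,j}$ (standard deviation `noise`), ignoring out-of-grid neighbours, and wrapping phases to [0, 2π) are our choices. The periodic function `f` defaults to `np.sin` (the Kuramoto variant mentioned in the text); the shifted cosine of formulation (5) can be passed in instead. Starting from equal phases without noise, nothing changes, which mirrors the initial equilibrium state that the noise term has to break.

```python
import numpy as np


def relax_phases(E, v, offsets, h, n_iter=50, eta=0.1, noise=0.0,
                 f=np.sin, g=np.tanh, phi=None, rng=None):
    """Gauss-Seidel relaxation of phase labels following update rule (4).

    E:       feature-map activities, shape (M, H, W), E[m, i, j] = E_ij(m)
    v:       compatibility matrix v[m, n], shape (M, M)
    offsets: neighbour offsets (k, l) forming the neighbourhood N
    h:       connection weights h_kl, one per offset
    """
    rng = np.random.default_rng() if rng is None else rng
    M, H, W = E.shape
    phi = np.zeros((H, W)) if phi is None else np.array(phi, dtype=float)
    for _ in range(n_iter):
        for i in range(H):
            for j in range(W):
                # Z_ij(n): phase-weighted support from neighbours in map n.
                z = np.zeros(M)
                for (k, l), w in zip(offsets, h):
                    a, b = i + k, j + l
                    if 0 <= a < H and 0 <= b < W:  # ignore out-of-grid sites
                        z += w * E[:, a, b] * f(phi[a, b] - phi[i, j])
                z = g(z)
                omega = noise * rng.standard_normal()  # zero-mean noise
                dphi = omega + E[:, i, j] @ v @ z  # sum_mn v_mn E_ij(m) Z_ij(n)
                # Gauss-Seidel: the updated phase is visible to later sites.
                phi[i, j] = (phi[i, j] + eta * dphi) % (2 * np.pi)
    return phi


# Usage: one uniformly active feature map, 4-neighbourhood, no noise.
rng = np.random.default_rng(0)
E = np.ones((1, 5, 5))
v = np.array([[1.0]])
offsets = [(-1, 0), (1, 0), (0, -1), (0, 1)]
phi0 = rng.uniform(0.0, 1.0, size=(5, 5))
phi = relax_phases(E, v, offsets, [1.0] * 4, n_iter=100, phi=phi0)


def coherence(p):
    """Phase coherence |mean(exp(i*phi))|; 1.0 means fully synchronised."""
    return abs(np.exp(1j * p).mean())


print(coherence(phi0), coherence(phi))
```

With a single positive compatibility the sweep acts like phase diffusion and the coherence rises towards 1; negative entries of `v` between opposite directions would instead push the two sides of a contour apart.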
The compatibility function $v_{m,n}$, depicted in Fig. 5a), is modelled as a shifted Gaussian. A sparse horizontal connectivity scheme $h_{k,l}$ was chosen to improve the synchronisation behaviour. Figure 5b) depicts the qualitative convergence properties of the system, showing the average phase change and the normalised average energy over the iteration steps. The periodic function $f(x)$ can be set to $\sin x$ to resemble the Kuramoto oscillator; we instead used formulation (5) to speed up convergence. The zero-mean random variable $\omega_{i,j}$ introduces noise into the decision process, thereby resolving ambiguous situations and forcing the process to move away from the initial equilibrium state, in which all phases are equal, towards a global solution in phase space. As equation (4) shows, the change in phase at each location is governed by correlated activity in at least one feature map at neighbouring positions. To allow phase labels to spread into the regions enclosed by the oriented contours, a uniform activity $E_{i,j}(m)$ is added in an additional feature map $m$, resembling spontaneous neuronal activity.

Figure 5. a) Competitive/cooperative interaction constraints between direction-selective responses; b) qualitative convergence behaviour of the relaxation process over 200 iteration steps (continuous: average phase change; dashed: average energy).

Figure 9b)-d) shows the extracted direction-selective edges of the test image Paolina, obtained using only odd-symmetric Gabor filters whose oriented responses are half-wave rectified into ON and OFF channels. The result of the constraint satisfaction relaxation procedure is shown in Figure 9e), from which the phase gradient in Figure 9f) has been computed. To assess the quality of the segmentation, the binarised gradient of the phase image and the edges detected by a Canny edge detector [4] are shown. The contours of the binarised phase gradient in Figure 9g) resemble the Canny edges, although no postprocessing such as edge linking or maximum detection was necessary. Figure 10 shows the same maps for a boat image.
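Two ingredients of the preceding paragraphs can be sketched in Python/NumPy. `compatibility_matrix` builds a shifted Gaussian over the angular distance between discrete directions, so that similar directions cooperate (positive v) and opposite directions compete (negative v); the number of directions and the `sigma` and `shift` values are hypothetical. `phase_gradient` computes the gradient magnitude of a cyclic phase image, wrapping each difference into (−π, π] so that phases just below 2π and just above 0 count as close; thresholding it gives a binarised phase gradient. Neither function is taken from the paper.

```python
import numpy as np


def compatibility_matrix(n_dirs, sigma=0.6, shift=0.5):
    """Hypothetical v[m, n]: Gaussian of the angular distance between the
    directions m and n, shifted down so that similar directions cooperate
    (v > 0) and opposite directions compete (v < 0)."""
    angles = 2 * np.pi * np.arange(n_dirs) / n_dirs
    # Angular distance wrapped into (-pi, pi].
    d = np.angle(np.exp(1j * (angles[:, None] - angles[None, :])))
    return np.exp(-d ** 2 / (2 * sigma ** 2)) - shift


def wrap(a):
    """Wrap angles into (-pi, pi]."""
    return np.angle(np.exp(1j * a))


def phase_gradient(phi):
    """Gradient magnitude of a cyclic phase image; differences are wrapped
    so that 0.1 and 2*pi - 0.1 are only 0.2 apart."""
    gx = wrap(np.diff(phi, axis=1, append=phi[:, -1:]))
    gy = wrap(np.diff(phi, axis=0, append=phi[-1:, :]))
    return np.hypot(gx, gy)


v = compatibility_matrix(8)  # 8 directions, 45 degrees apart
phi = np.full((4, 4), 0.1)
phi[:, 2:] = 2 * np.pi - 0.1  # two regions whose cyclic phases differ by 0.2
edges = phase_gradient(phi) > 0.1  # binarised phase gradient
```

Without the wrapping step, the boundary between the two regions would appear as a jump of almost 2π instead of 0.2.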
4. Selective Attention
Two types of theory have been proposed to explain how attention is allocated to perform visual tasks. According to region-based theories, an attentional spotlight of circular shape and varying diameter is directed to spatial positions in the visual field. Object-based theories, on the other hand, propose that attention is directed to perceptual groups and not just to locations. In either case, the main advantage of an attentional mechanism is its capability to reduce information by spatially selecting salient portions of the visual field, and the possible simplification of the binding problem by linking together the outputs of cells coding different features of the attended object. Recent research provides evidence for object-based theories of attention [29], with objects acting as wholes in a slow, competitive process working in parallel across the visual field [7], although spatial selection and top-down control are also part of the attentional system. Figure 6 shows a simplified sketch of the brain maps involved in segmenting objects from a complex scene by applying a cortical grouping mechanism and an attentional focus to the early representation of the scene. Both processes are part of early vision mechanisms [15], which operate bottom-up, whereby attentive control couples the data-driven and cognitive processing streams, both of which contain cyclic and feedback loops.
The visual information of an image is decomposed into the features of multiple feature maps (V1-V5), which interact through excitatory and inhibitory connections between locations (horizontal) and features (vertical). The preattentively grouped visual information is further processed by an attention mechanism (pulvinar), which chooses the most salient perceptual group and selectively enhances the responsiveness of neurons to this location at the expense of information from other groups or locations. The target selection map (SC) precomputes the expected saccade in a retinotopic coordinate frame, which is transformed into a spatial attentional map in viewer-centred (environmental) coordinates (PP). The spatial modulation map (FEF) integrates information about attentionally relevant locations from PP with recently visited locations (IOR) and cognitive information such as expected locations and overall scanning behaviour (compare with [30]).
… recently attended objects. The dynamics of the system have been adapted from the shunting feedback network proposed by S. Grossberg [13] and rewritten for discrete simulation on a computer. In the resulting update equation (6), $C_{ij}$ corresponds to the map element at position $(i,j)$,

$$I = \sum_{m}\sum_{n} C_{mn}^{2}$$

is the squared sum over all activations, and $A_{ij}$ is the normalised result of convolving $C^2$ with the kernel $h$ at $C_{ij}$. $ext_{ij}$ denotes the excitatory input and $inh_{ij}$ the inhibitory input from inhibition of return (IOR). B and D are arbitrarily chosen constants that bound the activation of $C_{ij}$ between $-D$ and $B$; for simplicity we have chosen $D = 0$ and $B = 1$.

In the presented simulations, the remaining constants of equation (6) have been set to 1.0, 1.0, 0.1, 1.0, 3.0, and 0.1. The overall performance of the network depends critically on the size and shape of the convolution kernel h, for which we have chosen a Gaussian with diameter five, and on the parameter that controls the size of the variable attentional spotlight. In the presented simulations, the excitatory input consists of two arrays holding the phase and the activity at each spatial location. In the last processing stage, the selected visual information from the feature maps is integrated in a target selection map (SC), which executes a saccade by applying a nonlinear model of local lateral interactions for saccade averaging [28], based on ensemble coding and linear vector addition of movement contributions [20]. Figure 7 shows the sequence of attentional foci computed from an image of objects, overlaid on its phase image. Figure 8 shows the phase and activity maps of the excitatory input and the sequence of inhibitory maps that prevent the system from revisiting recently attended locations. As can be seen, the selected regions are a compromise between spatial and phase coherence, allowing perceptual groups and objects to be extracted from the input.
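The following Python/NumPy sketch illustrates the generic shunting equation of Grossberg [13], dc/dt = −decay·c + (B − c)·exc − (c + D)·inh, integrated with an explicit Euler step. It is not equation (6) itself: the normalisation terms $A_{ij}$ and $I$ are omitted, and `decay` and the step size `dt` are our own names and choices. As long as dt·(decay + exc + inh) ≤ 1, each step keeps the activation within [−D, B], which is the bounding property the text attributes to B and D. The second function illustrates the saccade read-out as a plain activity-weighted average of the movement vectors coded by active SC cells, a linear simplification of the nonlinear averaging model of [28].

```python
import numpy as np


def shunting_step(c, exc, inh, decay=1.0, B=1.0, D=0.0, dt=0.05):
    """One explicit Euler step of the generic shunting equation
    dc/dt = -decay*c + (B - c)*exc - (c + D)*inh, with exc, inh >= 0.
    The activation stays in [-D, B] while dt*(decay + exc + inh) <= 1."""
    return c + dt * (-decay * c + (B - c) * exc - (c + D) * inh)


def saccade_vector(activity, vectors):
    """Activity-weighted average of the movement vectors coded by SC cells."""
    w = np.asarray(activity, dtype=float)
    vecs = np.asarray(vectors, dtype=float)
    return (w[:, None] * vecs).sum(axis=0) / w.sum()


c = np.zeros(3)
exc = np.array([0.0, 5.0, 10.0])
inh = np.array([10.0, 0.0, 0.0])
for _ in range(500):
    c = shunting_step(c, exc, inh)
# Equilibrium of the shunting equation: (B*exc - D*inh) / (decay + exc + inh)
print(c)
print(saccade_vector([1.0, 1.0], [[1.0, 0.0], [0.0, 1.0]]))  # [0.5 0.5]
```

The equilibrium saturates towards B as excitation grows, so strong inputs cannot drive the map element outside its bounds.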
6. Conclusion
A four-stage processing model for object segmentation and selection has been proposed, which draws on neurophysiological and psychological data to support its biological plausibility. We have described a relaxation phase labelling procedure for the preattentive grouping and perceptual segmentation of objects in phase space, and an attention mechanism that sequentially extracts perceptual groups from a cluttered scene, consistent with an object-based theory of visual attention. The original contribution of the presented biological framework for perceptual segmentation and selection of objects in a real-world scene is the transformation of the grouping process into phase space using a simple relaxation labelling procedure. By introducing directional responses and local constraints on them, which group similar directions and decouple the two sides of a contour line, the proposed mechanism detects zero-crossings in phase space without an explicit and biologically implausible search. The gradient in phase space is sharper than the edge response or the intensity discontinuity, and the whole scene is labelled into objects and background. Furthermore, the relaxation phase labelling (RPL) process extracts the most salient contour lines of perceptual groups in phase space while suppressing false responses generated by the preprocessing stage. The RPL process can therefore be used to link edges into object boundaries by closing small gaps in the contour lines of the intensity image, or to group perceptual primitives such as dots, points, or dashes into perceptual wholes using the grouping principles originally proposed by Gestalt psychology. For a more complete segmentation scheme involving different spatial frequencies and multiple feature domains, the system could be extended by a scale-space approach [3, 23] and by the integration of parallel texture-, motion-, and colour-specific processing channels [25, 1]. A further extension at the feature level will be the integration of distinctive maps for two-dimensional features such as direction of motion, texture, curvature, end-stopping, and junctions.
Figure 10. a) Boat scene (Size 200x200); b) Summed responses of six ON channels; c) Summed responses of six OFF channels; d) Phase image after 51 iteration steps; e)-f) same as in Fig. 9.
References
[1] J. Aloimonos and D. Shulman. Integration of Visual Modules: An Extension to the Marr Paradigm. Academic Press, 1989.
[2] A. Blake and A. Zisserman. Invariant surface reconstruction using weak continuity constraints. In Proceedings IEEE Conf. on Computer Vision and Pattern Recognition, pages 62-67. IEEE, 1986.
[3] P. J. Burt and E. H. Adelson. The Laplacian pyramid as a compact image code. IEEE Trans. on Communications, 31(4):532-540, 1983.
[4] J. F. Canny. A computational approach to edge detection. IEEE Trans. on Pattern Analysis and Machine Intelligence, 8(6):679-698, 1986.
[5] M. A. Cohen and S. Grossberg. Neural dynamics of brightness perception: Features, boundaries, diffusion, and resonance. Perception and Psychophysics, 36:428-456, 1984.
[6] J. G. Daugman. Uncertainty relation for resolution in space, spatial frequency, and orientation optimized by two-dimensional visual cortical filters. J. Opt. Soc. Am. A, 2:1160-1168, July 1985.
[7] R. Desimone and J. Duncan. Neural mechanisms of selective visual attention. Annual Review of Neuroscience, 18:193-222, 1995.
[8] R. Eckhorn, R. Bauer, W. Jordan, M. Brosch, M. Kruse, W. Munk, and H. J. Reitboeck. Coherent oscillations: A mechanism of feature linking in the visual cortex? Biol. Cybern., 60:121-130, 1988.
[9] W. A. Fellenz. A sequential model for attentive object selection. In Proc. 39th IWK, Sept. 27-30, vol. II, pages 109-116, TU Ilmenau, 1994.
[10] W. A. Fellenz and G. Hartmann. Image segmentation by phase label diffusion. In Proc. of the Int. Conference on Artificial Neural Networks, ICANN-95, Paris, vol. II, pages 309-314, 1995.
[11] S. Geman and D. Geman. Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6:721-741, 1984.
[12] C. M. Gray, P. König, A. K. Engel, and W. Singer. Oscillatory responses in cat visual cortex exhibit inter-columnar synchronization which reflects global stimulus properties. Nature, 338:334-336, 1989.
[13] S. Grossberg. Nonlinear neural networks: Principles, mechanisms, and architectures. Neural Networks, 1:17-61, 1988.
[14] R. A. Hummel and S. W. Zucker. On the foundations of relaxation labeling processes. IEEE Trans. on Pattern Analysis and Machine Intelligence, 5:267-287, 1983.
[15] B. Julesz. Foundations of Cyclopean Perception. University of Chicago Press, 1971.
[16] P. K. Kienker, T. J. Sejnowski, G. E. Hinton, and L. E. Schumacher. Separating figure from ground with a parallel network. Perception, 15:197-216, 1986.
[17] C. Koch, J. Marroquin, and A. Yuille. Analog neuronal networks in early vision. Proceedings of the National Academy of Sciences, 83:4263-4267, 1986.
[18] K. Koffka. Principles of Gestalt Psychology. Harcourt, Brace & World, New York, 1935.
[19] D. Marr and E. Hildreth. Theory of edge detection. Proceedings of the Royal Society of London B, 207:187-216, 1980.
[20] J. T. McIlwain. Distributed spatial coding in the superior colliculus: A review. Visual Neuroscience, 6:3-13, 1991.
[21] J.-M. Morel and S. Solimini. Variational Methods in Image Segmentation. Birkhäuser, Boston, 1995.
[22] M. C. Morrone and D. C. Burr. Feature detection in human vision: a phase-dependent energy model. Proceedings of the Royal Society of London B, 235:221-245, 1988.
[23] P. Perona and J. Malik. Detecting and localizing edges composed of steps, peaks and roofs. In Proc. of the 3rd Int. Conf. on Computer Vision, pages 52-57. IEEE Comp. Soc., Osaka, 1990.
[24] P. Perona and J. Malik. Scale-space and edge detection using anisotropic diffusion. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12(7):629-639, 1990.
[25] T. Poggio, E. B. Gamble, and J. J. Little. Parallel integration of visual modules. Science, 242:436-440, 1988.
[26] A. Rosenfeld, R. A. Hummel, and S. W. Zucker. Scene labeling by relaxation operations. IEEE Transactions on Systems, Man and Cybernetics, 6:420-433, 1976.
[27] D. Terzopoulos. Regularization of inverse visual problems involving discontinuities. IEEE Trans. on Pattern Analysis and Machine Intelligence, 8(4):413-424, 1986.
[28] A. J. Van Opstal and J. A. M. Van Gisbergen. A nonlinear model for collicular spatial interactions underlying the metrical properties of electrically elicited saccades. Biol. Cybern., 60:171-183, 1989.
[29] S. Yantis. Multielement visual tracking: Attention and perceptual organization. Cognitive Psychology, 24:295-340, 1992.
[30] A. L. Yarbus. Eye Movements and Vision. Plenum, New York, 1967.