
(IJCNS) International Journal of Computer and Network Security, Vol. 1, No. 2, November 2009

Low Memory Strip Based Visual Saliency Algorithm for Hardware Constrained Environment
Christopher Wing Hong Ngau, Li-Minn Ang, and Kah Phooi Seng
School of Electrical and Electronic Engineering
The University of Nottingham Malaysia Campus
Jalan Broga, 43500 Semenyih, Selangor Darul Ehsan, Malaysia
{keyx8nwh, kezklma, kezkps}@nottingham.edu.my

Abstract: This paper presents a low memory visual saliency algorithm for implementation in hardware constrained environments such as wireless sensor networks (WSNs). While visual saliency has found importance in various applications, it suffers from heavy memory requirements since low-level information from different image scales must be stored for later computations. Therefore, a low memory algorithm is required that does not compromise the performance of the saliency model. The proposed approach uses a strip-based processing method where an image is first partitioned into image strips and the bottom-up visual saliency is then applied to each of the strips. The strips are recombined to form the final saliency map. To further reduce the memory requirement of the standard visual saliency algorithm, the Gaussian pyramid is replaced with a hierarchical wavelet decomposition using the lifting based 5/3 Discrete Wavelet Transform (DWT). Simulation results verified that the proposed approach manages to achieve the same output as its non-strip based counterpart while keeping the memory requirements low.

Keywords: low-memory visual saliency model, saliency map, strip based, hardware constrained environment.

1. Introduction

With the advancement in the miniaturization of hardware, minute multimedia sensing devices can be developed to collect valuable information. The wireless sensor network (WSN) utilises such devices to constantly harvest information and transmit the collected data by means of wireless transmission. These sensing devices in the WSN, having multimedia and wireless capabilities, can be deployed almost anywhere to effectively provide a large coverage of the area of interest. Therefore, the WSN is considered versatile and has been found useful in various applications. Initially developed for military applications [1], WSNs are now available for residential and commercial applications. Since then, WSNs have found their way into non-military applications such as habitat monitoring, environmental monitoring, object tracking, surveillance, and traffic monitoring [2]–[4].

Besides the main purpose of information gathering, the WSN has the capability to detect objects in various environments. WSNs are particularly useful in detecting enemy vehicles, detecting illegal border crossings, tracking wildlife movement, and locating missing persons [2], [5]. For most detection applications, the algorithm is required to be trained with a large database beforehand. Although many algorithms developed for detection purposes have accurate detection capabilities, they suffer from two disadvantages. Most algorithms are trained to detect a specific object or objects with similar features. If an object which is alien to the algorithm is captured by the sensing devices, identification leading to detection would be unsatisfactory since the algorithm is not trained to detect objects apart from its given prior data. Another disadvantage is the parameter tuning: parameters governing the performance of the algorithm have to be tuned for applications in different scenarios.

Visual saliency (VS) can be used when applications dealing with object detection are involved. The main attribute of a VS model is to detect or locate salient objects in a given scene. Most VS models operate on easily available low-level features such as intensity, colour, and orientation (bottom-up). As in their usual range of applications, they can be applied to the WSN for applications which involve object detection. The advantage of VS over specifically trained object detection algorithms is that VS considers how a human perceives objects. Objects that are important, that stand out from their surroundings, or even suspicious moving objects are all easily captured by the human eye. Therefore, it can be said that detection using visual salience is more generic and natural. Furthermore, parameters in a VS model are usually global and do not require tuning for different scenarios unless top-down tasks are involved.

The advancement in technology has enabled the sensing devices in a WSN to be embedded with processing capabilities [6]. Due to space and size restrictions as well as the cost of adding additional memory, the amount of memory available on-chip in the sensing devices is limited. The limited amount of memory is seen as a major constraint when dealing with large or high resolution images. Because of this, implementing a VS algorithm in a WSN can be a challenge. Most VS algorithms depend on low-level features, and information on these features has to be stored before it is processed stage by stage. A single scale of information can be as large as the image itself. Therefore, VS models are effectively tied down by heavy memory requirements.

In this paper, a low memory VS algorithm for implementation in hardware constrained environments such as WSNs is proposed. The low memory VS is implemented using a strip based approach where the input image is first partitioned into image strips before each individual strip is processed. By doing so, the size of the memory buffer used in storing the image during processing can be significantly reduced. To further reduce the memory requirements of the VS algorithm, the hierarchical wavelet decomposition of Mallat [7] is used instead of the standard dyadic Gaussian pyramid. By using the wavelet decomposition method, a lower resolution approximation of the image at the previous level can be obtained along with the orientation sub-bands. From there, orientation features can be taken directly from the orientation sub-bands instead of having to be computed using Gabor filters.
The remaining sections of this paper are organised as follows: Section 2 presents a brief development of visual saliency models along with an overview of the low memory VS model using the strip based approach. Section 3 describes the bottom-up VS algorithm used in the low memory approach. In Section 4, the simulation results of the low memory VS algorithm are presented along with a discussion on the performance of the approach. Finally, Section 5 concludes the paper.

2. Low Memory Strip Based Visual Saliency

2.1 The Development of Visual Saliency Models

Since the mid 1950s, researchers have been trying to understand the underlying mechanisms of visual salience and attention [8]. A simple framework of how saliency maps are computed in the human brain was developed over the past few decades by Treisman and Gelade (1980) [9], Koch and Ullman (1985) [10], Wolfe (1994) [11], and Itti and Koch (2000) [12]. In 1980, Treisman and Gelade [9] introduced the idea of the master map of locations, which remains the foundation of the saliency maps used in VS models today.

Attention, as described by Treisman and Gelade, can be represented by the spotlight metaphor. In the spotlight metaphor, our attention moves around our field of vision like a spotlight beam. Objects which fall within the spotlight beam are then processed. In this idea of spotlight attention, attention in humans can be consciously or unconsciously directed.

In the framework of Koch and Ullman [10], the idea of the saliency map is introduced. The saliency map is a master map which encodes the locations of all salient objects in topographic form, similar to the master map in [9]. A winner-take-all (WTA) network is utilised to allow competition within the maps. The winner is the most salient location at that moment. The salient point is then extracted and inhibited using the inhibition of return method introduced in 1985 by Posner et al. [13].

Wolfe [11] introduced a search model in which the limited resources available are utilised by referring to the earlier output. By doing so, the search guidance can be more efficient, as the current output is influenced by the previous output, indicating a top-down mechanism. In the model, features such as colour and orientation can be processed in parallel. The summation of the processed features generates an activation map, where the locations of the search objects are encoded topographically.

Recently, Itti and Koch [12] presented a newer implementation of the model in [10]. The model provides a fast and parallel extraction of low-level visual features such as intensity, colour, and orientation. The features are computed using linear filtering (Gaussian filter) and centre-surround structures. A normalisation operator is used to normalise the combined maps according to the three features to form the conspicuity maps and, finally, the saliency map. A WTA neural network is used to select the most salient location in the input image. The selected location is then inhibited using inhibition of return, and the WTA process is repeated to select the next most salient location. This model is used as the building block for many VS models today.

2.2 Strip Based Processing

In the low memory strip based approach, the input colour image of size Y rows × X columns captured by an optical device is first partitioned into N strips of size R rows × X columns, where R is the minimum number of rows required for a J level DWT decomposition. One image strip is processed at a time, passing through the bottom-up VS model. The output of the VS model is a saliency strip which contains a part of the possible salient locations in the actual input image. The processed strip is then added to an empty array of size Y rows × X columns according to its actual strip location in the input image. The process is repeated until all strips are processed. The recombined strips form the final saliency map, where all possible salient objects are encoded. An overview of the low memory strip based approach is shown in Figure 1.

Figure 1. Overview of low memory strip based VS approach
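As a concrete illustration of Figure 1, the following is a minimal Python sketch of the strip partitioning, per-strip processing, and recombination described above (the paper's own simulations were done in Matlab, so this is illustrative only). The function compute_saliency_strip stands in for the bottom-up VS model of Section 3, and the strip sizes follow the 32-line strips of Section 2.3 (16 core rows plus overlapping lines above and below); the border handling here is an assumption.

```python
import numpy as np

def strip_based_saliency(image, compute_saliency_strip, strip_core=16, overlap=8):
    """Partition an image into overlapping strips, run the bottom-up VS model
    on each strip, and recombine the results into a full saliency map.

    image: Y x X (x channels) array, with Y assumed to be a multiple of strip_core.
    strip_core: rows each strip contributes to the final map.
    overlap: extra rows read above and below the core (clamped at the borders).
    compute_saliency_strip: placeholder for the Section 3 algorithm; it must
    return a 2-D array with the same number of rows as its input strip.
    """
    Y = image.shape[0]
    saliency = np.zeros(image.shape[:2], dtype=np.float32)  # empty Y x X array

    for top in range(0, Y, strip_core):
        # Read the strip plus its overlapping lines.
        lo = max(0, top - overlap)
        hi = min(Y, top + strip_core + overlap)
        strip = image[lo:hi]

        strip_saliency = compute_saliency_strip(strip)  # bottom-up VS model

        # Discard the overlapping lines and place the core rows back at
        # their original location in the output saliency map.
        core_start = top - lo
        saliency[top:top + strip_core] = strip_saliency[core_start:core_start + strip_core]

    return saliency
```

Only one strip (plus its overlap) and the output array need to be held in memory at any time, which is where the memory saving of the approach comes from.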
In the VS algorithm, the DWT is used to construct the image pyramid. There are two approaches to performing the DWT. The first approach uses the convolution based filter bank method [14]–[16] and the second approach uses the lifting based filtering method [17]. The lifting based DWT is preferred over the conventional convolution based DWT due to its computational efficiency and memory savings [18]. In this approach, the reversible Le Gall 5/3 filter is used in the image pyramid construction [18].
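For reference, below is a sketch of one level of the forward lifting steps for the reversible Le Gall 5/3 filter, assuming an even-length input and symmetric boundary extension in the spirit of the JPEG2000 formulation [18]; the function name and exact boundary handling are assumptions, not taken from the paper. Applying the transform along the rows and then the columns of a strip yields the LL, HL, LH, and HH sub-bands used in Section 3.1.

```python
import numpy as np

def legall_53_forward_1d(x):
    """One level of the reversible Le Gall 5/3 DWT using the lifting scheme.

    x: 1-D integer array of even length.
    Returns (approximation, detail), i.e. the low-pass and high-pass coefficients.
    """
    x = np.asarray(x, dtype=np.int64)
    s = x[0::2].copy()          # even samples
    d = x[1::2].copy()          # odd samples

    # Predict step: detail = odd - floor((left_even + right_even) / 2),
    # with symmetric extension at the right boundary.
    s_right = np.append(s[1:], s[-1])
    d -= (s + s_right) >> 1

    # Update step: approx = even + floor((left_detail + right_detail + 2) / 4),
    # with symmetric extension at the left boundary.
    d_left = np.insert(d[:-1], 0, d[0])
    s += (d_left + d + 2) >> 2

    return s, d
```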
Although the strip based approach provides a reduction in memory, there are some trade-offs in this method. The first trade-off is that if only just enough lines are used in a strip for the DWT decomposition, horizontal line artifacts will appear in the final output. This is due to insufficient information being available when the strip is decomposed during the image pyramid construction. To solve this problem, additional overlapping lines are used. Although there is a slight increase in the memory required, a better output is obtained.

The second trade-off is that the min-max normalisation used in standard VS models becomes inaccurate if performed on individual strips. The actual global minimum and maximum are not known until the final strip is processed. Normalising using local minimum and maximum values, on the other hand, gives an incorrect value range representation, and the contrast of one strip will differ from another, giving rise to distortion. Using a saturation function as another means of normalisation will not work either, since the minimum and maximum values are still required, as in the case of min-max normalisation.

One possible solution is to store the minimum and maximum values and use them for the next input image. If the optical device captures images continuously, similar to a video stream, then the next input image (frame) will not differ much from the previous one. As time goes on, the quality of the output saliency map will be good, since an accurate estimate of the maximum and minimum values is available. With this solution, only the output from the first frame will suffer from distortions, while the subsequent outputs will have reasonable to good quality.

2.3 Image Pyramid Construction and Minimum Required Lines

Before the feature maps in any VS model can be computed, image pyramids have to be constructed for each feature. In Itti and Koch's model [12] and other models that adapt the work of [12], the dyadic Gaussian pyramid is used as the image pyramid. The input image is sub-sampled using a Gaussian filter and then decimated by a factor of two. The process is repeated until nine levels of the image pyramid are obtained. Image pyramids are constructed for the intensity, colour, and orientation features. In [12], Gabor filters are used to create four sets of image pyramids for orientation by convolving with each level of the intensity pyramid.

In the approach presented in this paper, the wavelet decomposition method is used instead of the dyadic Gaussian pyramid, as discussed in Section 1. In order to use the wavelet decomposition method, the number of scales (levels of decomposition) is required. The number of lines in the strip depends on the number of scales, and the number of scales is mainly based on the preference of the user. Depending on how strongly the user wants the salient object to be highlighted, the value of J is varied accordingly. A higher value of J requires more lines in a strip and hence more memory, but gives better object highlighting. In this paper, a value of J = 3 is used.

For a three level DWT decomposition, a minimum of 8 lines in a strip is required. However, it is always advisable to include an additional level of decomposition, since a single line at the last decomposition level will not provide a satisfactory result later in the centre-surround operation. By adding an additional level, the minimum number of lines becomes 16. To compensate for the trade-off discussed in the early part of Section 2, two additional lines have to be added at the top and at the bottom of the strip for one level of decomposition. With the overlapping of lines, the total number of lines required is 32 for a three level decomposition. Equation (1) relates the number of lines required with strip overlaps:

Minimum number of lines = 2 × 2^(J+1)    (1)

An illustration of the strip overlapping is shown in Figure 2. The overlapping only occurs at the top and bottom borders of the strips. The overlaps at the left and right borders shown in Figure 2 are only for the sake of clarity.

Figure 2. Illustration of strip overlapping
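As a quick check of Equation (1) and of the strip sizes used later in Section 4, the arithmetic for the settings in this paper works out as follows (illustrative Python; the 320-row figure comes from the 320 × 480 test images):

```python
J = 3                                        # levels of decomposition used in this paper
minimum_lines = 2 ** J                       # lines needed for a J-level DWT: 8
with_extra_level = 2 ** (J + 1)              # one additional level for the CS operation: 16
with_overlap = 2 * 2 ** (J + 1)              # Equation (1), overlapping lines included: 32

image_rows = 320                             # test images are 320 x 480 pixels
num_strips = image_rows // with_extra_level  # each strip contributes 16 unique rows

print(minimum_lines, with_extra_level, with_overlap, num_strips)  # 8 16 32 20
```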


3. Bottom-up Visual Saliency Algorithm

This section describes the bottom-up visual saliency algorithm which is used in both the original non-strip and the strip based approaches. In this section, the term scale is used when describing the visual saliency algorithm, whereas in the DWT computation, level of decomposition is used, although both can be used interchangeably.

3.1 General Visual Saliency Algorithm

The input image is first converted to the YCbCr colour space. Then, the lifting based 5/3 filter is applied to each of the Y, Cb, and Cr channels. This results in four sub-bands: LL (approximation coefficients, A); HL (vertical detail, V); LH (horizontal detail, H); and HH (diagonal detail, D). The HL, LH, and HH bands for the Cb and Cr channels are discarded, whereas these bands for the Y channel are used in the orientation feature computation. The LL bands for all three channels are kept to compute the intensity and colour features. The DWT process is repeated another two times to form a three level image pyramid (excluding the images at level 0).

After the wavelet decomposition, there will be three intensity maps, six colour maps, and nine orientation maps at scales J = 1 to 3. All maps at all scales are bilinearly interpolated to facilitate point-to-point subtraction. A centre-surround (CS) operation is applied to each of the maps to form the feature maps. The CS process is used to enhance important regions relative to their surroundings. The centre is a pixel at scale c ∈ {1, 2, …, J} and the surround is a pixel at a coarser scale s ∈ {c + 1, c + 2}. The CS for the intensity and colour features is computed as shown in Equations (2) to (4).

I_{C,S}(m, n) = | I_C(m, n) − I_S(m, n) |    (2)

Cb_{C,S}(m, n) = | Cb_C(m, n) − Cb_S(m, n) |    (3)

Cr_{C,S}(m, n) = | Cr_C(m, n) − Cr_S(m, n) |    (4)

For the orientation feature, the CS computation is described in (5):

O_{C,S}(m, n) = | Y_D^J(m, n) − Y_H^J(m, n) | + | Y_D^J(m, n) − Y_V^J(m, n) | + | Y_V^J(m, n) − Y_H^J(m, n) |    (5)

Figure 3. Saliency maps generated by the non-strip based and strip based approaches
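The centre-surround differences in Equations (2) to (5) amount to absolute differences between maps that have been brought to a common resolution. Below is a minimal sketch, assuming the maps are already bilinearly interpolated to the same size; the function names are illustrative and not taken from the paper.

```python
import numpy as np

def center_surround(center_map, surround_map):
    """Across-scale centre-surround difference, as in Equations (2) to (4).
    Both maps are assumed to be bilinearly interpolated to a common size."""
    return np.abs(center_map - surround_map)

def orientation_feature(y_diag, y_horiz, y_vert):
    """Orientation feature from the Y-channel detail sub-bands at one scale,
    as in Equation (5): pairwise absolute differences of the D, H, and V bands."""
    return (np.abs(y_diag - y_horiz)
            + np.abs(y_diag - y_vert)
            + np.abs(y_vert - y_horiz))
```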

The feature maps are then summed across scales and normalised using min-max normalisation to form four conspicuity maps for intensity, colour (Cb and Cr), and orientation respectively. The saliency map is finally computed by summing all four conspicuity maps and dividing by four, as shown in Equation (6).

S(m, n) = (CM_I + CM_Cb + CM_Cr + CM_O) / 4    (6)

3.2 Modification for the Strip Based Approach

For the strip based approach, the overall VS algorithm is the same as the non-strip approach, with the exception of the image pyramid and the normalisation method. As discussed in Section 2, the strip based approach would suffer from line artifacts and would not know the global minimum and maximum values until all the strips are processed.

To overcome the problem of line artifacts, overlapping is done at the pyramid construction level. The overlapping part of the strip has to be removed before the strip can be used for computation. For example, a non-overlapping strip of 16 lines would result in an eight line approximation after one level of decomposition. With overlapping, the number of lines is 32. After one level of decomposition, the lines are reduced to 16. Only the middle eight lines are required for the computation; therefore, the top four and bottom four lines are discarded during the feature computation.

The algorithm is also modified to allow the minimum and maximum to be updated and reused for subsequent strips and image frames. The allowed range is between 0 and 255. The global minimum is first initialised to 255 while the maximum is initialised to 0. At the first strip, the global values are compared with the local minimum and maximum of the strip. If the maximum value of the strip is higher than the current global maximum, the global value is updated. The same is true for the minimum value. The process is repeated every time a new strip is processed. Initially, the first few strips will be severely distorted, but as the process continues, the output strips will be properly normalised.
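A minimal sketch of this running min-max update is shown below, assuming saliency values in the range 0 to 255; the class name and interface are illustrative, not taken from the paper.

```python
import numpy as np

class RunningMinMaxNormaliser:
    """Keeps a running estimate of the global minimum and maximum across strips
    (and frames) and uses it to min-max normalise each strip to [0, 255]."""

    def __init__(self):
        self.global_min = 255.0   # initialised as described in Section 3.2
        self.global_max = 0.0

    def normalise(self, strip):
        # Update the running global extrema with this strip's local values.
        self.global_min = min(self.global_min, float(strip.min()))
        self.global_max = max(self.global_max, float(strip.max()))

        span = self.global_max - self.global_min
        if span <= 0:                      # guard against a flat estimate
            return np.zeros_like(strip, dtype=np.float32)
        out = (strip - self.global_min) / span * 255.0
        return np.clip(out, 0.0, 255.0).astype(np.float32)
```

Because the stored extrema are carried over to the next frame, the second frame onwards is normalised with a near-global estimate, which is what the results in Section 4 rely on.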
4. Simulation Results and Discussion

All results are simulated in Matlab. In the low memory strip-based approach, three levels of decomposition are used. 20 image strips are used with overlapping, resulting in 32 line strips. All test images are of size 320 × 480.

4.1 Results

Simulations were performed to verify the performance of the proposed approach. Comparisons of the saliency maps generated by the non-strip based and strip based algorithms are shown in Figure 3. In Figure 3, the saliency maps for the first image frame and its successive second image frame are shown.

Table 1: Memory savings according to image size

4.2 Discussion of Results

By comparing the saliency maps in column 2 and column 4 of Figure 3, it can be seen that the strip-based approach provides an identical output to the non-strip based approach. This holds if the optical device provides a continuous image stream. With successive image frames, the contents of the frame would not change drastically under normal conditions; therefore, the min-max updating and normalising method will actually provide a rather accurate estimate of the required global values.

As seen in the third column of Figure 3, the top portion of the saliency maps is the most distorted, and the quality gradually improves as the strip moves down towards the bottom of the image. As more strips are processed, the stored global minimum and maximum values are updated with more accurate values. Once an image frame is fully processed, the stored global values are used for the next frame, resulting in improved performance, near or identical to the result generated using the non-strip based approach. If the next frame contains changes which are not present in the prior frame, the stored values could still provide a good estimate of the min-max values while being updated as in the previous frame.

To investigate the amount of memory saved using the approach, consider a memory bank having many memory blocks, where a single block holds a single value at location (m, n). For simplicity, let the number of bits in a memory block be a certain number B. The actual number of bits allocated will not be considered and is assumed to be equal in all the memory blocks.

The values calculated are based on the memory used in storing the images (maps) before they are used for further computation at the different stages, using the test image of size 320 × 480. Table 1 shows the amount of memory saved when the image size is varied for a three level decomposition. As the image size increases, the saving curve tends to level off towards near 100%.
5. Conclusion

A low memory VS algorithm using image strips for implementation in hardware constrained environments has been proposed in this paper. Simulation results verified that the proposed strip based approach performs as well as the non-strip approach while saving more than 80% of memory resources, depending on the image size.

References

[1] K. Romer and F. Mattern, "The Design Space of Wireless Sensor Networks", IEEE Wireless Communications, Volume 11, Issue 6, pp. 54-61, December 2004.
[2] I. F. Akyildiz, T. Melodia, and K. R. Chowdhury, "A Survey on Wireless Multimedia Sensor Networks", Computer Networks 51, pp. 921-960, 2007.
[3] N. Xu, "A Survey of Sensor Network Applications", University of Southern California, 2003, available at http://enl.usc.edu/~ningxu/papers/survey.pdf.
[4] A. Mainwaring, J. Polastre, R. Szewczyk, D. Culler, and J. Anderson, "Wireless Sensor Networks for Habitat Monitoring", WSNA'02, Atlanta, Georgia, USA, September 2002.
[5] H.-W. Tsai, C.-P. Chu, and T.-S. Chen, "Mobile Object Tracking in Wireless Sensor Networks", Computer Communications, Volume 30, Issue 8, June 2007.
[6] L. W. Chew, L.-M. Ang, and K. P. Seng, "Survey of Image Compression Algorithms in Wireless Sensor Networks", International Symposium on Information Technology 2008, Volume 4, pp. 1-9, 2008.
[7] S. Mallat, "A Theory for Multiresolution Signal Decomposition: The Wavelet Representation", IEEE Transactions on Pattern Analysis and Machine Intelligence, Volume 11, pp. 674-693, 1989.
[8] J. K. Tsotsos, L. Itti, and G. Rees, "A Brief and Selective History of Attention", Neurobiology of Attention, Elsevier Press, 2005.
[9] A. Treisman and G. Gelade, "A Feature Integration Theory of Attention", Cognitive Psychology 12, pp. 97-136, 1980.
[10] C. Koch and S. Ullman, "Shifts in Selective Visual Attention: Towards the Underlying Neural Circuitry", Human Neurobiology 4, pp. 219-227, 1985.
[11] J. Wolfe, "Guided Search 2.0: A Revised Model of Visual Search", Psychonomic Bulletin and Review, 1(2), pp. 202-238, 1994.
[12] L. Itti and C. Koch, "A Saliency-based Search Mechanism for Overt and Covert Shifts of Visual Attention", Vision Research, Volume 40, pp. 1489-1506, 2000.
[13] M. I. Posner, R. D. Rafal, L. S. Choate, and J. Vaughan, "Inhibition of Return: Neural Basis and Function", Cognitive Neuropsychology, 2(3), pp. 211-228, 1985.
[14] A. Jensen and A. la Cour-Harbo, "Ripples in Mathematics: The Discrete Wavelet Transform", Springer, 2000.
[15] M. Weeks, "Digital Signal Processing Using Matlab and Wavelets", Infinity Science Press LLC, 2007.
[16] G. Strang and T. Nguyen, "Wavelets and Filter Banks", 2nd Edition, Wellesley-Cambridge, 1996.
[17] W. Sweldens, "The Lifting Scheme: A Custom-Design Construction of Biorthogonal Wavelets", Applied and Computational Harmonic Analysis, Volume 3, No. 2, pp. 186-200, Elsevier, 1996.
[18] T. Acharya and P.-S. Tsai, "JPEG2000 Standard for Image Compression: Concepts, Algorithms and VLSI Architectures", Wiley-Interscience, 2004.

Authors Profile
Christopher Wing Hong Ngau received
his Bachelor degree from the University of
Nottingham Malaysia Campus in 2008. He
is currently pursuing his PhD at the
University of Nottingham Malaysia Campus.
His research interests are in the fields of
image processing, hardware architectures,
vision processing, and wireless sensor networks.

Li-Minn Ang received his PhD and
Bachelor degrees from Edith Cowan
University, Australia in 2001 and 1996
respectively. He is currently an Associate
Professor at the University of Nottingham
Malaysia Campus. His research interests are
in the fields of signal, image, vision
processing, intelligent processing
techniques, hardware architectures, and
reconfigurable computing.

Kah Phooi Seng received her PhD and
Bachelor degrees from the University of
Tasmania, Australia in 2001 and 1997
respectively. She is currently an Associate
Professor at the University of Nottingham
Malaysia Campus. Her research interests are
in the fields of intelligent visual processing,
biometrics and multi-biometrics, artificial
intelligence, and signal processing.
