
CHAPTER 3

REAL TIME EMBEDDED VIDEO IMAGE PROCESSING SYSTEM


3.1 Introduction

One of the major advantages of embedded video image processing systems in video
surveillance applications, as discussed in the earlier chapters, is that they can capture the
event of interest automatically based on some pre-defined criteria. However, achieving
real-time operation remains a major problem during the implementation of such systems,
as raised by most researchers [11, 12, 13, 14, 15]. The development of a real-time
embedded video image processing system involves the integration of several electronic
modules, e.g. an image acquisition module, video image processing modules, memories and
a display monitor unit. With the market trend forcing products to be faster, more compact,
smarter and more interactive, designers are working under pressure to fulfill the demand by
selecting appropriate technologies for each of the mentioned modules among the available
alternatives. Speed, area and cost are always a trade-off. In this chapter,
several alternative technologies for the implementation of a real-time embedded video image
processing system are reviewed to motivate the choice of the best architecture for the
implementation. The design of the gradient derivative based edge detection computation,
the design of the controlling part, and the integration of the design core architecture with an
embedded processor core are also described in this chapter.
3.2 Hardware Architecture Features and the Trend of Processors

Since the most resource-demanding operations in terms of required computations and
memory bandwidth involve low-level and intermediate-level operations, considerable
research has been devoted to developing hardware architectural features for eliminating
bottlenecks within the video image processing chain, freeing up more time for performing
high-level interpretation operations. While the major focus has been on speeding up
low-level and intermediate-level operations, there have also been architectural developments
to speed up high-level operations.
From the literature, one can see there are three major architectural features that are
essential to any video image processing system, namely single instruction multiple data
(SIMD), very long instruction word (VLIW), and an efficient memory subsystem. The

concept of SIMD processing is a key architectural feature found in one way or another in
most modern real time video image processing systems [71, 72, 73]. It embodies
broadcasting a single instruction to multiple processors, which simultaneously execute
the instruction on different portions of data in parallel, thus allowing more computations
to be performed in a shorter time [74]. This mode of processing fits low-level and
intermediate-level operations well, as they require applying the same operation to different
pixel data. Naturally, SIMD can also be used to speed up matrix-vector operations.
The SIMD concept has been used extensively since the 1980s, as evident from its
widespread use in vision accelerator boards, instruction set extensions for general-purpose
processors (GPPs), and packed data processing of digital signal or media
processors. In fact, the most common instantiation of the concept of SIMD in today's
GPPs, digital signal and media processors, is in the form of the packed data processing
extension, which is also known as subword parallelism or word-wide data optimization
[75, 76, 77, 78]. These extensions have primarily been developed to help speed up the
processing of multimedia data. Since pixel data are usually represented by 8 bits or 16
bits, and since most modern processors have 32-bit registers, packed data processing
allows packing four 8-bit pixels or two 16-bit pixels into 32-bit registers and then issuing
an instruction to be performed on the individual 8-bit or 16-bit pixels at the same time.
These types of packed data instructions not only alleviate the computation burden of
low-level and intermediate-level operations, but also help to reduce memory access
bottlenecks, because multiple pixel data can be read using one instruction. Packed data
processing is a basic form of SIMD. In general, SIMD is a useful tool for speeding up
low-level, intermediate-level, and matrix-vector operations on modern processors. Thus, one can
think of SIMD as a tool for exploiting data level parallelism (DLP).
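
To make the subword parallelism idea concrete in hardware terms, the following VHDL
sketch (illustrative only; the entity name, the 32-bit word, and the wrap-around
modulo-256 lane arithmetic are assumptions, and real SIMD extensions often add
saturation) adds four packed 8-bit pixels to four packed 8-bit offsets in one operation:

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity packed_add is
  port (
    a, b : in  std_logic_vector(31 downto 0);  -- four packed 8-bit pixels each
    y    : out std_logic_vector(31 downto 0)   -- four packed 8-bit sums
  );
end entity;

architecture rtl of packed_add is
begin
  -- four independent 8-bit adders working on the lanes of one 32-bit word
  gen_lanes : for i in 0 to 3 generate
    y(8*i+7 downto 8*i) <= std_logic_vector(
      unsigned(a(8*i+7 downto 8*i)) + unsigned(b(8*i+7 downto 8*i)));
  end generate;
end architecture;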
While SIMD can be used for exploiting DLP, VLIW can be used for exploiting
instruction level parallelism (ILP) [79], and thus for speeding up high-level operations
[80]. VLIW furnishes the ability to execute multiple instructions within one processor
clock cycle, all running in parallel, hence allowing software-oriented pipelining of
instructions by the programmer. Although for VLIW to work properly there must
be no dependencies among the data being operated on, the ability to execute more than
one instruction per clock cycle is essential for video image processing applications that
require operations on the order of giga-operations per second [81]. Of course, while SIMD
and VLIW can help speed up the processing of diverse video image operations, the time
saved through such mechanisms would be completely wasted if there did not exist an
efficient way to transfer data throughout the system [82]. Thus, an efficient memory
subsystem is considered a crucial component of a real-time video image processing
system, especially for low-level and intermediate-level operations that require massive
amounts of data transfer bandwidth as well as high-performance computation power.
Concepts such as direct memory access (DMA) and internal versus external memory are
important. DMA allows data to be transferred within a system without burdening the CPU
with data transfers. DMA is a well-known tool for hiding memory access latencies,
especially for image data. Efficient use of any available on-chip memory is also critical,
since such memory can be accessed faster than external memory. Memory usage
optimization techniques are discussed further in the following sub chapters, together
with a review of different standard processor architectures.
3.2.1 General Purpose Processors
There are two types of General Purpose Processors (GPPs) on the market today, one
geared toward non-embedded applications such as desktop PCs and the other geared
toward embedded applications. Today's desktop GPPs are extremely high-performance
processors with highly parallel architectures, containing features that help to
exploit Instruction Level Parallelism (ILP) in control-intensive, high-level video image
operations. Single Instruction Multiple Data (SIMD) extensions have also been
incorporated in their instruction sets, allowing such processors to exploit Data Level
Parallelism (DLP) and enabling moderate acceleration of multimedia operations
corresponding to low-level and intermediate-level video image processing operations.
GPPs have been outfitted with the multilevel cache feature. This feature provides the
potential of having low-latency memory accesses for frequently used data. These
processors also require a Real-Time Operating System (RTOS) in order to guarantee
real-time execution. Desktop GPPs are characterized by their large size, requiring a
separate chip set for proper operation and communication with external memory and
peripherals.

Although GPPs have massive general-purpose processing power, they are extremely
high-powered devices requiring hundreds of watts of power. Clearly such processors are not
suitable for embedded applications. Despite this fact, advances in desktop GPPs have
allowed standard commercial off-the-shelf desktop PCs to be used for implementing
non-embedded real-time video image processing systems. In [16], it is even claimed that
the desktop PC is the de facto standard for industrial machine vision applications where
there is usually enough space and power available to handle a workstation. It should be
noted that such industrial inspection systems usually augment the processing power of the
desktop GPP with vision accelerator boards. These boards often furnish a dedicated
SIMD video image processor for high-performance real-time processing not normally
met by the SIMD extensions to the desktop GPP. Recently, a paradigm shift toward
multicore processor designs for desktop PCs has occurred in order to continue making
gains in processor performance.
On the embedded front, there are also several GPPs available on the market today with
high-performance general-purpose processing capability suitable for exploiting ILP
coupled with low power consumption and SIMD-type extensions for moderately
accelerating multimedia operations, enabling the exploitation of DLP for low-level and
intermediate-level video image processing operations. Embedded GPPs have been used
in multicore embedded System-on-Chips (SoCs), providing the horsepower to cope with
control-intensive and branch-intensive instructions. Both embedded and
desktop GPPs are supported by mature development tools and efficient compilers,
allowing quick development cycles. While GPPs are quite powerful, they are neither
created nor specialized to accelerate massively data parallel computations.
3.2.2 Digital Signal Processors
The implementation of a real-time embedded video image processing system involves one
or more processing units to support the video image processing functionality. The
processing unit runs programs that implement the functions required of the video image
processing system. Unlike a general-purpose computer, a computer in an embedded
system has only the resources dedicated to supporting its specialized operation. Dedicated
embedded computers like Digital Signal Processors (DSPs) offer a fast and economic

solution in combination with very low power consumption for executing signal and
image processing algorithms. While it may have been true in the past that DSPs were not
suitable for processing image/video data in that they could not meet the real-time
requirements for video rate processing, this is no longer the case with newly available
high-performance DSPs that contain specific architectural enhancements addressing the
data/computation throughput barrier. DSPs have been optimized for repetitive
computation kernels with special addressing modes for signal processing such as circular
or modulo addressing. This helps to accelerate the critical core routines within inner
loops of low level and intermediate level image/video processing operations.
In many DSP implementations, it is observed that a large percentage of the execution
time is due to a very low percentage of the code, which simply emphasizes the fact that
DSPs are best for accelerating critical loops with few branching and control operations,
which are best handled by a General Purpose Processor (GPP) [51]. DSPs also allow
saturated arithmetic operations that are useful in image/video processing to avoid pixel
wraparound from a maximum intensity level to a minimum level or vice versa [52]. DSPs
possess either a fixed point or a floating point Central Processing Unit (CPU), depending
on the required accuracy for a given application.
In most cases, a fixed point CPU is more than adequate for the computations involved in
image/video processing. DSPs also have predictable, deterministic execution times that
constitute a critical feature for ensuring that real-time deadlines are met. In addition,
DSPs have highly parallel architectures with multiple functional units and Very Long
Instruction Word (VLIW) / Single Instruction Multiple Data (SIMD) features, further
proving their suitability for image/video processing. DSPs have been designed with high
memory bandwidth in mind, on-chip DMA controllers, multilevel caches, buses, and
peripherals, allowing efficient movement of data on-chip and off-chip from memories
and other devices.
DSPs support the use of Real-Time Operating Systems (RTOSs), which again help in
guaranteeing that critical system-level hard real-time deadlines are met. Of course, DSPs
are fully programmable, which gives them the inherent flexibility to accommodate
algorithm updates. Modern development tools such as efficient C code compilers and the
use of hardware-specific intrinsic functions have replaced the need to generate hand-coded
assembly for all but the most critical core loops, leading to more efficient development
cycles and faster time to market. Indeed, DSPs contain specific architectural features that
help one to speed up repetitive, compute-intensive signal processing routines, making
them a viable option for inclusion in a real-time video image processing system. That is
why DSPs have been used in many real time video image processing systems. More
recently, DSPs have been included as a core in dual-core processing system-on-chips for
consumer electronics devices such as PDAs, cell phones, digital cameras, portable media
players, etc.
3.2.3 System-on-Chip
In the consumer electronics market, there has been a drive toward single-chip solutions or
Systems-on-Chip (SoCs) for portable embedded devices, which require high-performance
computation and memory throughput coupled with low power consumption in order to
meet the real-time video image processing constraints of battery-powered products such
as digital cameras, digital video camcorders, camera-equipped cell phones, etc. These
systems exhibit elegant designs where one can learn how the industry has approached the
battery-powered embedded real-time video image processing problem. For example,
consider the TMS320DM320 digital media processor manufactured by Texas
Instruments [53]. This is a multiprocessor chip with a reduced instruction set (RISC)
microprocessor coupled with a low-power fixed-point DSP. The RISC microprocessor
serves as the master handling system control, running an RTOS and providing the
necessary processing power for complex control-intensive operations. The DSP, acting as
a slave to the RISC, is a low-power component for performing computationally intensive
signal processing operations.
The presence of a memory traffic controller allows high-throughput access to
memory. In this device, the RISC and DSP are accompanied by a set of parameter
customizable application-specific processors that provide a boost, that is to say, they
provide the extra computational horsepower that is necessary to perform functions such
as real-time LCD preview (Preview Engine) and real-time computation of low-level
statistics necessary for auto exposure, auto white balance, and autofocus (H3A Engine).

The DSP along with its accelerators and dedicated image processing memory buffers
provides a high-computation throughput and memory bandwidth for performing various
image/video processing related functions such as rendering the final captured image
through the image pipeline and running video image compression routines.
By examining this architecture, one can see that this SoC has been designed with a DSP
plus dedicated hardware accelerators for low-level and intermediate-level operations
along with a GPP hardware for more complex high-level operations. This is an
illustrative example showing that a complete real-time video image processing system
can be characterized as a heterogeneous architecture with a computation-oriented front
end coupled with a general-purpose processing back-end. Of course, the TMS320DM320
is just one good example of many currently available multiprocessor embedded SoCs. In
fact, as it will be seen in the examples section, the low-power, moderate performance
DSPs plus accelerators have been widely used by many research groups in the form of
DSP systems, most likely due to cost issues associated with ASIC development. An
interesting recent hardware development for digital imaging is the Texas Instruments
DaVinci technology that couples an Advanced RISC Machines (ARM) processor with a
high performance C64x DSP core [54]. This technology provides the necessary
processing and memory bandwidth to achieve a complete imaging SoC. Examples of
research performed on multicore embedded SoCs for digital camera applications can be
found in the references [55, 56, 57, 58, 59, 60], which cover the development and
implementation of the automatic white balancing, automatic focusing, and zoom
tracking algorithms encountered in today's digital camera systems.
3.2.4 Field Programmable Gate Arrays (FPGAs)
The growing need for flexible and cost-effective systems has caused a shift toward Field
Programmable Gate Arrays (FPGAs). FPGAs are arrays of reconfigurable complex logic
blocks with a network of programmable interconnect [61]. The number of gates and
capabilities of FPGAs are expected to continue to grow in future generations. FPGAs
allow fully application-specific custom circuits to be designed by using a software
programming language known as a hardware description language (HDL). They provide
precise execution times helping to meet hard real-time deadlines. FPGAs can be

configured to interface with various external devices. Since they are reprogrammable
devices, they are flexible in the sense that they can be reconfigured to form a completely
different circuit. Current generation FPGAs can be either fully reconfigured or partially
reconfigured, with reconfiguration times of less than 1 ms, making it possible to have
dynamic run-time reconfiguration. This capability is useful for reducing the system size
of embedded devices. Due to their programmable nature, FPGAs can be programmed to
exploit different types of parallelism inherent in a video image processing algorithm. This
in turn leads to highly efficient real-time video image processing for low-level,
intermediate-level, or high-level operations, enabling an entire imaging system to be
implemented on a single FPGA. In general, FPGAs have extremely high memory
bandwidth. As a result, one can use custom memory configurations and/or addressing
techniques to exploit data locality in high-dimensional data. In many cases, FPGAs have
the potential to meet or exceed the performance of a single DSP or multiple DSPs.
FPGAs can be thought of as combining the flexibility of software programmability with
the speed of an application-specific integrated circuit (ASIC) within a shorter design
cycle or time-to-market. Often an FPGA implementation is the first step toward transitioning
to an ASIC, or in some cases it is the final product. However, there is a disadvantage related
to FPGAs, namely their energy or power consumption efficiency. Lately, low-power FPGAs
are becoming more available.
In essence, FPGAs have the high computational and memory bandwidth capabilities that are
essential to real-time video image processing systems. Because of such features, there has
been an increasing interest in using FPGAs to solve real-time video image processing
problems [62]. FPGAs have already been used to solve many practical real-world,
real-time video image processing problems, from a preprocessing component to the entire
processing chain. FPGAs have also been used in conjunction with DSPs. A current trend
in FPGAs is to include a GPP core on the same chip as the FPGA for a customizable
system-on-chip (SoC) solution.
3.2.5 Memory Performance Gap
Most hardware designers understand that yearly increases in memory performance lag
well behind the corresponding increases in computing performance. Because of this fact,
memory resource management must be carefully considered in a real-time video image
processing system, especially when a vast amount of data must be dealt with. While
memory management strategies could be regarded as memory optimization strategies, a
distinction is made here between the two because of the overwhelming importance of
memory performance bottlenecks as opposed to computation bottlenecks. Memory
optimizations are meant to alleviate memory performance bottlenecks, while software
optimizations are meant to alleviate computation bottlenecks. An overview of some
memory optimization strategies is given in the following sub chapters.
3.2.5.1 On-Chip Memory Strategy
In video image processing applications, it is beneficial to place the image being operated
on within the on-chip memory to enable the processor to quickly access the necessary data
with minimum latencies, reducing the overhead of external memory accesses. Since it is
often the case that an entire image or video frame cannot fit within the available on-chip
memory, the processing has to be reorganized or restructured to enable an efficient
implementation on the target hardware. An important strategy to deal with this issue
involves allocating a buffer section in the available internal memory, partitioning the
image data into blocks the size of the allocated buffer, and performing the processing on the
smaller data blocks. Some important image data partitioning schemes include row-stripe
partitioning and block partitioning. The most commonly used partitioning scheme is the
row-stripe scheme, where a few lines or rows of image data are pre-fetched to a buffer
within the on-chip memory to enable faster data accesses. The fetching of a few lines to
internal memory before any processing commences also has the benefit of reducing cache
misses for operations that require 2D spatial locality, since vertically adjacent pixels
would now be located in the cache. Another partitioning scheme is to divide an image
into either overlapping or non-overlapping blocks, depending on the type of processing
being performed.
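
On an FPGA, the row-stripe idea typically takes the form of an on-chip line buffer. The
following VHDL sketch is a minimal illustration (the 640-pixel row length, 8-bit pixels,
and all names are assumptions); one such buffer holds a single row, and a few instances
in series give a 2D operator the vertically adjacent pixels it needs:

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity line_buffer is
  port (
    clk     : in  std_logic;
    we      : in  std_logic;             -- write strobe for the incoming row
    w_addr  : in  unsigned(9 downto 0);  -- assumed to stay below 640
    r_addr  : in  unsigned(9 downto 0);
    w_pixel : in  std_logic_vector(7 downto 0);
    r_pixel : out std_logic_vector(7 downto 0)
  );
end entity;

architecture rtl of line_buffer is
  type ram_t is array (0 to 639) of std_logic_vector(7 downto 0);
  signal ram : ram_t;
begin
  process (clk)
  begin
    if rising_edge(clk) then
      if we = '1' then
        ram(to_integer(w_addr)) <= w_pixel;  -- prefetch the incoming row
      end if;
      r_pixel <= ram(to_integer(r_addr));    -- registered read, infers block RAM
    end if;
  end process;
end architecture;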
In addition to placing image data in internal memory, other frequently used items should
also be placed in internal memory [63]. Since many embedded processors have internal
program and data on-chip memories, critical portions of the code and other frequently
used data items such as tables should also be considered for inclusion into on-chip

memory as space permits. The benefits of on-chip memory over that of external memory
cannot be stressed enough as efficient use and handling of image data and program code
portions within on-chip memory is often critical to achieving a real-time performance.

3.2.5.2 Direct Memory Access (DMA) Strategy


Making efficient use of available internal memory for storing image data is important for
obtaining real-time performance. A key peripheral available in most modern processor
architectures is the Direct Memory Access (DMA) controller, which can manage the
movement of data without CPU assistance, leaving it free to focus on time critical
computations rather than becoming engaged in data management. A DMA controller can
usually manage multiple DMA channels simultaneously so that multiple data transfers
can occur at the same time.
With the availability of DMA, efficient multi-buffering strategies have been developed
that allow concurrent processing and movement of data. As the name implies,
multi-buffering strategies make use of multiple buffers, usually placed within on-chip memory,
to allow concurrent processing and movement of data. Depending on the type
of processing being performed, usually three buffers are employed including buffer1 and
buffer2 operating in the so-called ping-pong manner and buffer3 operating as an output
buffer. The scheme usually takes the form where a DMA channel is used to store a block
of data in buffer1, while processing proceeds on data in buffer2 and results are placed in
buffer3. After processing on buffer2 has been completed, the results in buffer3 are sent
out to external memory through a DMA channel, while processing proceeds with data in
buffer1 and another DMA channel is used to bring in another block of image data into
buffer2. An important and often overlooked issue regarding memory accesses is the
alignment of data. DMA transfers can benefit from proper data alignment by maximizing
the data bus throughput [64, 65].
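
At the control level, the ping-pong part of this scheme reduces to a single buffer-select
signal that toggles each time a DMA block transfer completes. A minimal VHDL sketch
follows (all names are assumptions; the buffers themselves and the DMA channels are
outside this fragment):

library ieee;
use ieee.std_logic_1164.all;

entity pingpong_ctrl is
  port (
    clk, rst   : in  std_logic;
    block_done : in  std_logic;  -- one-cycle strobe from the DMA channel
    sel        : out std_logic   -- '0': process buffer1, fill buffer2; '1': the reverse
  );
end entity;

architecture rtl of pingpong_ctrl is
  signal sel_r : std_logic := '0';
begin
  process (clk)
  begin
    if rising_edge(clk) then
      if rst = '1' then
        sel_r <= '0';
      elsif block_done = '1' then
        sel_r <= not sel_r;  -- swap the processing and fill roles
      end if;
    end if;
  end process;
  sel <= sel_r;
end architecture;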
3.2.5.3 Spatial and Temporal Locality Strategy
Another method of reducing slow external memory accesses is to move from an
image-based processing scheme to a pixel-based processing scheme when multiple operations
have to be performed on image data and there are no data dependencies between the
operations [66]. An image-based processing scheme involves applying one operation to
all the pixels and then applying another operation to all the pixels, etc. A pixel-based
processing scheme on the other hand is one that applies all the operations to one pixel,
and the same is repeated for all the pixels. The problem with an image-based processing
scheme is that it does not make an efficient use of the cache memory scheme, since the
same pixel would have to be read many times to complete the entire processing.
In the pixel-based processing scheme, the pixel is read only once and all the operations
are performed while the pixel resides in the internal on-chip memory. Thus, not only does
pixel-based processing improve the spatial and temporal locality of memory accesses, but
it also increases the computational intensity of the implementation, a measure
commonly used to gauge whether an implementation is memory limited or not [67].
Computational intensity is defined as the ratio of the number of instructions executed to
the number of memory accesses. If many instructions are being executed per memory
access, then the coded routine is said to have a high computational intensity, while on the
other hand if a small number of instructions are executed per memory access, then the
coded routine is said to have a low computational intensity. In other words, a low
computational intensity means that the coded routine is memory inefficient. Therefore,
since more operations are performed per memory access in a pixel-based processing
scheme, the use of such a scheme is beneficial when it is applicable.
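
As a worked illustration with assumed numbers: suppose k operations, each costing
roughly c instructions, are applied to an N-pixel image. An image-based scheme re-reads
and re-writes every pixel on each of the k passes, giving about 2kN memory accesses,
while a pixel-based scheme reads and writes each pixel once, giving about 2N accesses
for the same kcN instructions:

CI (image-based) = kcN / 2kN = c / 2
CI (pixel-based) = kcN / 2N = kc / 2

so, cache effects aside, the pixel-based scheme raises the computational intensity by a
factor of about k.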

3.3 Modeling and Implementation of Algorithm

As a result of reviewing the alternative implementation technologies for a real-time
video image processing system, the decision was made to choose the FPGA as the target
hardware platform. Benefits such as flexibility and ease of prototyping, with the resulting
short time-to-market, triggered this decision. There are two issues that need to be addressed
in this sub chapter, i.e. algorithm simplification and algorithm modeling for FPGA
implementation. These issues are described in the following sub chapters.

3.3.1. Algorithm Simplification


An algorithm is simply a set of prescribed rules or procedures that are used to solve a
given problem [68, 69]. Although there may exist different possible algorithms for
solving a video image processing problem, when transitioning to a real-time
implementation, having efficient algorithms takes higher precedence. Efficiency implies
low computational complexity as well as low memory and power requirements. Due to
the vast amounts of data associated with digital video and images, developing algorithms
that can deal with such amounts of data in a computational, memory, and power-efficient
manner is a challenging task, especially when they are meant for real-time deployment on
resource-constrained embedded platforms.
Algorithms often have to be optimized for achieving real-time performance on a given
hardware platform, since they are usually prototyped in development environments not
suffering from resource constraints. While special hardware and software optimization
techniques can be used to realize a real-time version of an algorithm, in general, greater
gains in performance are obtained through simplifications at the algorithmic level [70,
71]. Such modifications or simplifications performed at the algorithmic level help to
streamline the algorithm down to its core functionality, which not only leads to a lower
computational complexity but also to lower memory and power requirements. Thus, the
very first step in transitioning an algorithm from a research environment to a real-time
environment involves applying simplification strategies to the algorithm. It is more
effective to perform these simplifications while still working in the research development
environment, which possesses a higher design flexibility than the implementation
environment. The following sub chapters describe three major strategies for achieving
algorithmic simplifications.
3.3.1.1. Reduction of Computation Strategy
In the reduction of computation strategy, there are two popular methods, i.e. pure reduction
and reduction through approximations. The strategy of pure computation or operation
reduction has been used by many researchers to simplify algorithms for real-time
implementation. This strategy has been primarily used for algorithms with repetitive,
well-structured computations in low-level operations, such as filtering, transforms,

matrix-vector operations, and local statistics extraction. This sub chapter includes several
examples illustrating the strategy of computation reduction.
Exploiting any available symmetry in the computations involved can often lead to pure
operation reduction. For instance, in [72], the symmetry of the coefficients of linear-phase
filters in the bi-orthogonal wavelet transform allowed streamlining the computation
of the forward and inverse transform into a single architecture while reducing the number
of expensive multiplication operations. This produced a more efficient transform
computation. Other common techniques used for reducing the number of multiplication
operations in linear filtering include making use of the separability of the kernel involved,
or eliminating multiplications by ones or zeros [73]. Computations can often be cleverly
rearranged or factored to reduce the number of operations. One example of this can be
found in [74], where the symmetry in the elements of the Discrete Cosine Transform
(DCT) matrix allowed rearranging the computations, reducing the number of expensive
multiplication as well as addition operations. As a result, a more efficient transform was
achieved through this operation reduction.
Another possible way to achieve pure operation reduction in matrix computations
encountered in video image processing algorithms is to seek out encoding schemes that
can transform a matrix into a sparse matrix. For example, in [75], an exact
two-dimensional (2D) polynomial expansion in terms of integer coefficients provided a sparse
representation of the elements of the DCT matrix. This allowed a more efficient
computation by replacing expensive multiplication operations with simple bit-shift and
addition operations and reducing the number of multiplications as well as additions.
Another popular technique to reduce the amount of operations in matrix computations
involves exploiting matrix properties. For example, in [76], a rearrangement of the
equations for a 2.5D affine motion parameter estimation allowed an efficient solution via
an orthogonal matrix factorization using a Householder transformation, thus reducing the
computation operations over that of a 2D affine estimation. Frequently, an efficient
computational structure derived from digital signal processing theory can be utilized to
achieve a reduction in the number of operations. For example, in [77], a one-dimensional
(1D) infinite impulse response filter provided a reduction in the number of expensive
multiplication operations per pixel over that of a 1D finite impulse response filter in

addition to saving memory space via using a lower order filter. These changes led to an
efficient scan-line-based image enhancement at video rates. Another example of this
approach can be found in [78], where the relationship between the computation structure
of discrete, geometric moments and that of all-pole digital filters was exploited, allowing
the computation of any order geometric moment using a series of accumulators. This
resulted in a significant reduction in the number of multiplications and thus allowed
real-time computation of geometric moments.
The other method of computation reduction is reduction through approximations. When
it comes to real-time implementation, sometimes sacrifices in
accuracy have to be made in order to achieve the required performance. In fact, most
algorithms that are transitioned to a real-time environment are simplified using various
approximations. Approximations are often used for reducing the computations in
transform operations. For instance, in [79], approximating the DCT computations by
using up to three bi-orthogonal matrices led to replacing expensive multiplication
operations with bit-shifting and addition operations. The approximation could be varied
between two or three matrices, producing trade-offs between speed and accuracy. A
high-performance version was able to process one 8 x 8 image block using less than one
processor clock cycle per pixel.
Approximating computations by utilizing simple operations is often used to reduce the
amount of processing time. In [80], simplifying the computations for computing the
gradient derivative image via a nonlinear filter led to the speedup needed to achieve an
efficient real-time implementation of the noise-robust image enhancement procedure.
Two key algorithmic simplifications that were made included approximating the
quadratic filter using a normalized squared-gradient computation and replacing the
normalization of the filter output by a power-of-2 division. The loss in accuracy due to
the division operation was kept low by considering the maximum value of the filter
output for different images in the test set. Also, in [81], a subpixel edge detector was
simplified by approximating a critical but expensive division operation via a simple
iterative minimization procedure using integer arithmetic.
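
The power-of-2 normalization mentioned above maps directly to an arithmetic shift in
hardware, which costs essentially nothing compared to a true divider. A minimal VHDL
sketch (the entity name, the 16-bit width, and the shift amount are assumptions made for
illustration):

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity pow2_div is
  generic (SHIFT : natural := 4);  -- divide by 2**SHIFT
  port (
    x : in  signed(15 downto 0);   -- raw filter output
    y : out signed(15 downto 0)    -- x / 2**SHIFT, truncated toward minus infinity
  );
end entity;

architecture rtl of pow2_div is
begin
  y <= shift_right(x, SHIFT);  -- arithmetic shift: no multiplier or divider needed
end architecture;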

In some cases, the processing resources are so scarce that a desired computation cannot
be fully realized without some simplifying approximation. For example, in [82], the
computations required for applying a large 13 x 13 filter kernel had to be approximated
by using two sequential passes of a smaller 7 x 7 filter kernel due to lack of processing
resources supporting larger kernels. The approximation did not have any detrimental
effect on the outcome of the object tracking system under consideration.

3.3.1.2. Reduction of Data Strategy


Reduction in the amount of data to be processed plays a prominent role in bringing video
image processing algorithms to real-time performance, and there are several methods in the
literature showing how to use this strategy to simplify an algorithm toward reaching
real-time performance, e.g. sub-sampling, partitioning, selective processing and
dimensionality reduction.
Sub-sampling is one of the simplest and most effective approaches for the incoming
image frame or video sequence. The objective here is to reduce the amount of data to be
processed and thus to obtain a speedup for subsequent stages of processing. There are
many examples involving spatial sub-sampling. For example, in [83], sub-images were
reduced in size to 20 x 20 pixels before being subjected to classification processing in
order to meet the hard real-time performance demands of an industrial inspection system.
In another application involving obstacle detection, the input image was sub-sampled to
reduce the size by a factor of 2, after having been spatially low-pass filtered for reducing
the effects of aliasing. Spatial sub-sampling has also been found to be useful for speeding
up face recognition systems on embedded platforms. In [84], to meet the real-time
performance requirements, input images of size 320 x 240 were downscaled to 80 x 60 to
help speed up the discussed face recognition system. Similarly, in [85], input images of
size 640 x 480 were downsized to 128 x 128 to help speed up the subsequent processing.
Naturally, temporal sub-sampling can be applied to the processing of video sequences.
For example, in [86], it was suggested that a complex range or scene depth computation be
performed for every nth incoming frame instead of every frame in order to meet the
real-time requirement of a gesture tracking algorithm. Of course, it should be pointed out that

such methods do impart a certain loss in accuracy due to the discarding of some of the
original data. Hence, it is best to explore several different sub-sampling intervals to
determine the right balance between speed and accuracy. Also of note here is that in some
cases, spatial sub-sampling might not be appropriate. In such cases, one may formulate
the problem as a scalable algorithm [87]. This approach involves creating a quality
control module to manage the processing for achieving a certain level of output quality
according to the availability of system resources.
Partitioning an image frame into smaller sub-images is another simple method of reducing
the amount of data to be processed, since each sub-image can be processed at a faster speed
than the entire image frame. This is similar to the divide-and-conquer strategy where the
problem is divided into several smaller problems that are easier to solve [88, 89]. The
most popular partitioning schemes include row-wise, column-wise, and block-wise. A
good example appearing in [90] covers the computation of the discrete bi-orthogonal
wavelet transform (DBWT) for high definition TV compression. Due to the large data set
involved, real-time processing was not feasible without partitioning a frame into
non-overlapping, square sub-images and calculating the DBWT separately on each sub-image.
In another application discussed in [91], the image frame was partitioned into vertical
strips in order to process the image in parallel on multiple processors. This generated a
real-time implementation of the à trous wavelet transform. In the face detection
algorithm covered in [92] for an embedded device, the input image was partitioned into
non-overlapping sub-images, the size of which was chosen to balance the gain in speed
versus the accuracy of detection. In another example mentioned in [93], it was found that
partitioning the input image into sub-images enabled the real-time implementation of the
most time-consuming portion of a color-quantization algorithm. Since edge artifacts could
be present between adjacent sub-images, an overlapping partitioning scheme can be
employed, thus producing a trade-off between speed and artifact suppression [94]. In
[95], an overlapped block-partitioning scheme was used to correct for processing across
partition borders in a wavelet transform computation.
Another popular data reduction method is selective processing, which involves narrowing
down the region of interest before applying any subsequent processing. As a result, only a
certain subset of the entire image is processed, hence the name selective processing. For

example, in [96], in locating an object of interest to sub-pixel accuracy, instead of


applying sub-pixel calculations on the entire image first and then locating the object of
interest, the location was first narrowed down to a specific area, after which appropriate
computations were performed to refine this location to sub-pixel precision. In this case,
narrowing down the area of interest helped to reduce the amount of data to be processed
by the sub-pixel refinement stage, thus generating a real-time sub-pixel accurate object
detection. Similarly, in another application involving the use of range data for gesture
tracking [97], the range data was calculated only for selected regions of interest and not
for the entire image. Therefore, if computationally complex processing cannot be avoided
to save processing time, it is best to apply such processing only to the areas of interest
and not to the entire image.
After having determined appropriate features to use, it is often helpful to further reduce
the amount of data by applying dimensionality reduction techniques, such as principal
component analysis (PCA), linear discriminant analysis (LDA), or a Kohonen
self-organizing map (SOM). There are several examples in the literature that employ such
techniques. For instance, in [98], the Kohonen SOM neural network was used to reduce
the dimension of a feature vector to help speed up the subsequent processing, while in
[99], PCA was employed for the same purpose. In [100], a combination of PCA and LDA
was used to reduce the amount of data, providing a better separation of classes and thus
easing the task of the subsequent classifier. Also, in [101], the dimensionality of
chrominance feature vectors extracted from a 2D chrominance histogram was reduced by
modeling the histogram as a multivariable Gaussian distribution and then applying PCA
to remove the correlation in the two components. This allowed the use of 1D marginal
statistics of each component instead of 2D statistics.
Dimensionality reduction has also been used when dealing with color images. When
dealing with color image data, researchers have often made use of normalized 2D color
spaces as opposed to 3D color spaces to achieve real-time implementations. For instance,
in the face recognition application discussed in [102], the 2D normalized r-g color space
was used, saving computations by reducing the 3D color space by one dimension. Also,
reducing a three-channel color image to a one-channel image can sometimes serve as a
means of achieving a real-time performance through dimensionality reduction. For

instance, in [103], a moment-preserving threshold scheme was used to reduce a
three-channel color image to a one-channel image by taking into account the correlation
amongst the three color channels.

3.3.1.3. Simple Algorithms Strategy


In order to meet real-time requirements, many researchers divide problems into stages
composed of computationally simple operations. Simple operations include the use of
binary, morphological, frame differencing, and various other computationally efficient
operations. Due to the fact that such algorithms are often employed in real-world
situations, they often have to be made robust to noise sources or disturbances in the scene
such as changes in lighting conditions, or low-contrast scenes. The algorithm used to
achieve a robust performance must also be computationally simple, and preferably fully
automatic with little human intervention in the setting of thresholds or parameters. In the
formulation of such an algorithm, a trade-off analysis between speed and accuracy needs
to be performed in order to find a balance between achieving real-time performance and
robustness to disturbances in the scene. Of course, the use of appropriate features that are
robust to disturbances can help. It should be noted that the construction of simple or
simplified algorithms involves both the use of computational and data reduction
simplification strategies mentioned in the previous sections. Three examples are presented
next to illustrate the practical usefulness of such simple algorithms.
The tracking of objects within sequences of images has long been an important problem
in video image processing. For example, in [104], a simple algorithm for tracking an
object based on template matching was presented. To provide more robust operation under
changes in object shape, a rather simple extended snake algorithm was employed. First,
the object was extracted by using simple frame differencing and applying simple
morphological closing on the result to merge homogenous regions. Then, a data reduction
procedure was done by using an appropriate feature to succinctly represent the object and
provide control points for the snake algorithm. The use of simple algorithms during each
stage allowed achieving robust real-time tracking.

The increase in the practical applications of face detection and recognition has increased
the interest in their real-time implementations. Such implementations are possible via the
use of simple algorithms. For instance, in [105] a simple algorithm for face detection
based on the use of skin color features was discussed. In order to make the detection
algorithm robust to regions in the background with skin-like colors, a simplified method
was developed. First, motion detection via a simple frame differencing operation was
used to distinguish the human face from the background. Then, to reduce the noise in the
difference image, a simple morphological opening operation was used. Finally, a simple
labeling operation was used to determine the region where the face was located. Again,
the use of simple algorithms during the various stages of the image processing chain
allowed achieving robust real-time face detection.
Industrial inspection systems are known for their strict hard real-time deadlines, forcing
developers to devise algorithms with simple operations. For example, in [106], simple
defect location and feature extraction algorithms were employed to meet the hard
real-time constraints in an industrial inspection system. In order to speed up the location
detection process, the image was first binarized to find a rough estimate of the location of
defect. This rough location was then propagated to the full-resolution image, and a region
around this location was selected for feature extraction. A simple gray-level difference
method was used to extract texture features from the narrowed down defect region. The
use of relatively simple operations enabled the real-time detection of defects and
extraction of features for the subsequent classification.

3.3.2. VHDL based Algorithm Modeling


The design of a Digital Signal Processor (DSP) with an FPGA often utilizes both high-level
algorithm development tools and Hardware Description Language (HDL) tools. This sub
chapter concentrates on VHDL as one HDL tool to model a DSP. VHDL is a
language widely used to model and design digital hardware. VHDL is the subject of
IEEE standards 1076 and 1164 and is supported by numerous Computer Aided Design
(CAD) tool and programmable logic vendors. VHDL is an acronym for VHSIC
Hardware Description Language. VHSIC (Very High Speed Integrated Circuits) was a
US Department of Defense program in the 1980s that sponsored the early development
of VHDL.
The fundamental element of VHDL-based algorithm modeling is the datapath. The datapath
is a circuit that allows performing operations involving multiple steps. It is responsible for
the manipulation of data. For DSP functionality, it includes functional units such as adders,
shifters, multipliers, comparators, registers and other memory elements for the temporary
storage of data. In order for the datapath to function correctly, appropriate control signals
must be asserted at the right time. Control signals are needed for all of the select and
control lines for all of the components used in the datapath. This includes all of the select
lines for multiplexers and other functional units having multiple operations; all of the
read / write enable signals for registers and register files; address lines for register files;
and enable signals for tri-state buffers. Thus, the operation of the datapath is determined
by which control signals are asserted or de-asserted and at what time. In a DSP system,
these control signals are generated by the control unit, which will be discussed in sub
chapter 3.4.
The objective in designing a dedicated datapath is to build a circuit that implements a single
specific DSP algorithm. In a register-transfer level design, the focus is on how data move
from register to register via some functional units where they are modified. In the design
process, some issues need to be solved: what kind of registers to use and how many are
needed; what kind of functional units to use and how many are needed; can a certain
functional unit be shared between two or more operations; and how are the registers and
functional units connected together so that all of the data movements specified by the
algorithm can be realized. Since the datapath is responsible for performing all of the data
operations, it must be able to perform all of the data manipulation statements and
conditional tests specified by the algorithm. For example, the assignment statement:

A = A + 3     (1)

takes the value that is stored in the variable A, adds the constant 3 to it, and stores the
result back into A. Note that whatever the initial value of A is here is irrelevant since that
is a logical issue. In order for the datapath to perform the data operation specified by this
statement, the datapath must have a register for storing the value A. Furthermore, there
must be an adder for performing the addition. The constant 3 can be hardwired into the
circuit as a binary value.
The next question to ask is how to connect the register, the adder, and the constant 3
together so that the execution of the assignment statement can be realized. The value
stored in a register is available at the Q output of the register. Since the requirement is
A + 3, the Q output is connected to the first operand input of the adder, and the constant 3
is connected to the second operand input. The result of the addition is then stored back
into A, i.e. back into the same register.
The storing of the adder result into the register is accomplished by asserting the load control
signal of the register (i.e., asserting Aload). This Aload signal is an example of what we
have been referring to as a datapath control signal. This control signal controls the
operation of this datapath. The actual storing of the value into the register, however, does
not occur immediately when Aload is asserted. Since the register is synchronous to the
clock signal, the actual storing of the value occurs at the next active clock edge. Because
of this, the new value of A is not available at the Q output of the register during the current
clock cycle, but is available at the beginning of the next clock cycle.
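
A minimal VHDL sketch of this A = A + 3 datapath follows; the Aload control signal and
the behavior at the active clock edge match the description above, while the 8-bit width
and the remaining names are assumptions made for illustration:

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity a_plus_3_dp is
  port (
    clk   : in  std_logic;
    Aload : in  std_logic;                    -- datapath control signal
    A_out : out std_logic_vector(7 downto 0)  -- Q output of register A
  );
end entity;

architecture rtl of a_plus_3_dp is
  signal A : unsigned(7 downto 0) := (others => '0');
begin
  process (clk)
  begin
    if rising_edge(clk) then
      if Aload = '1' then
        A <= A + 3;  -- constant 3 hardwired; value stored at the next active edge
      end if;
    end if;
  end process;
  A_out <= std_logic_vector(A);  -- new value visible only in the next clock cycle
end architecture;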
In most situations, one register is needed for each variable used by the algorithm.
However, if two variables are not used at the same time, then they can share the same
register. If two or more variables share the same register, then the data transfer
connections leading to the register and out from the register usually are made more
complex, since the register now has more than one source and destination. Having
multiple destinations is not too big of a problem, since all of them are connected to the
same source. However, having multiple sources will require a multiplexer to select one of
the several sources to transfer to the destination. A multiplexer is needed in order to
select which one of these two sources is to be the input to the register.

After deciding how many registers are needed, the next issue is to determine whether to
use a single register file containing enough register locations, separate individual
registers, or a combination of both for storing the variables in. Furthermore, registers with
built-in special functions, such as shift registers and counters, can also be used. For
example, if the algorithm has a FOR loop statement, a single counter register can be used
to not only store the count variable but also to increment the count. This way, not only
do we reduce the component count, but the amount of datapath connections between
components is also reduced. Decisions for selecting the type of registers to use will affect
how the data transfer connections between the registers and functional units are
connected.
It is fairly straightforward to decide what kind of functional units are required. For
example, if the algorithm requires the addition of two numbers, then the datapath must
include an adder. However, some options are available to choose whether to use a
dedicated adder or an adder-subtractor combination. Of course, these questions can be
answered by knowing what other data operations are needed by the algorithm. If the
algorithm has only an addition and a subtraction, then the decision may be to use the
adder-subtractor combination unit. On the other hand, if the algorithm requires several
addition operations, the options are to use one adder or several adders. Using one adder
may decrease the datapath size in terms of number of functional units, but it may also
increase the datapath size because more complex data transfer paths are needed. For
example, if the algorithm contains the following two addition operations:

g = h + i     (2)
j = k + l     (3)

Using two separate adders will result in the datapath shown in figure 3.1 (a); whereas,
using one adder will require the use of two extra 2-to-1 multiplexers to select which
registers will supply the inputs to the adder operands, as shown in figure 3.1 (b).

Figure 3.1 (a) Separate adders; (b) One adder

Furthermore, this second datapath requires two extra control signals for the two
multiplexers. In terms of execution speed, the datapath of figure 3.1 (a) can execute both
addition statements simultaneously within the same clock cycle, since they are
independent of each other. However, the datapath of figure 3.1 (b) will have to execute
these additions sequentially in two different clock cycles, since there is only one adder
available. The final decision as to which datapath to use is up to the designer.
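
A VHDL sketch of the shared-adder datapath of figure 3.1 (b) is given below (the 10-bit
width follows the figure; the sel, gload and jload control signal names are assumptions).
The two conditional assignments are the two extra 2-to-1 multiplexers, and the two load
signals decide which register captures the sum:

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity shared_adder_dp is
  port (
    clk          : in  std_logic;
    sel          : in  std_logic;    -- '0': route h and i, '1': route k and l
    gload, jload : in  std_logic;    -- which register stores the sum
    h, i, k, l   : in  unsigned(9 downto 0);
    g, j         : out unsigned(9 downto 0)
  );
end entity;

architecture rtl of shared_adder_dp is
  signal op1, op2, sum : unsigned(9 downto 0);
begin
  op1 <= h when sel = '0' else k;  -- first operand multiplexer
  op2 <= i when sel = '0' else l;  -- second operand multiplexer
  sum <= op1 + op2;                -- the single shared adder

  process (clk)
  begin
    if rising_edge(clk) then
      if gload = '1' then g <= sum; end if;  -- cycle 1: g = h + i
      if jload = '1' then j <= sum; end if;  -- cycle 2: j = k + l
    end if;
  end process;
end architecture;

The two additions now take two clock cycles (sel = '0' with gload asserted, then
sel = '1' with jload asserted), which is exactly the speed versus area trade-off
discussed above.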
There are several methods by which the registers and functional units can be connected
together so that the correct data transfers between the different units can be made, e.g.
multiple sources, multiple destinations, and a tri-state bus.
For multiple sources, if the input to a unit has more than one source, then a multiplexer
can be used to select which one of the multiple sources to use. The sources can be from
registers, constant values, or outputs from other functional units. On the other hand, a
source having multiple destinations does not require any extra circuitry. The one source
can be connected directly to the different destinations, and all of the destinations where
the data is not needed would simply ignore the data source. For example, in figure 3.1
(b), the output of the adder has two destinations: register g, and register j. If the output of
the adder is for register g, then the Load line for register g is asserted, while the Load line
j is not; and if the output of the adder is for register j, then the Load line for register j is
asserted, while the Load line for register g is not. In either case, only the correct register

will take the data while the other units simply ignore it. This also works if one of the
destinations is a combinational functional unit. In this case, the functional unit will
take the source data and manipulate it. However, the output of the functional unit will
not be used (that is, not stored in any register), so functionally it does not matter that the
functional unit worked on the source, because the result is not stored. However, the
functional unit does consume power when manipulating the data, so if low power
consumption is a design goal, the functional unit should be kept from manipulating the
data unnecessarily.
Another scheme where multiple sources and destinations can be connected to the same
data bus is through the use of tri-state buffers. The point to note here is that when
multiple sources are connected to the same bus, only one source can output at any one
time. If two or more sources output to the same bus at the same time, then there will be
data conflicts. This occurs when one source outputs a 0 while another source outputs a 1.
By using tri-state buffers to connect the various sources to the common data
bus, one can ensure that only one tri-state buffer is enabled at any one time, while the rest of
them are all disabled. Tri-state buffers that are disabled output high-impedance Z values,
therefore no data conflicts occur. An advantage of using a tri-state bus is that the bus is
no data conflicts occur. An advantage of using a tri-state bus is that the bus is
bidirectional, so that data can travel in both directions on the bus. Connections for data
going from a component to the bus need to be tri-stated, while connections for data going
from the bus to a component need not be. Notice also that data input and output of a
register both can be connected to the same tri-state bus; whereas, the input and output of
a functional unit cannot be connected to the same tri-state bus.
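
In VHDL, such a tri-state bus is modeled by letting each source drive the resolved bus
signal through a conditional assignment that outputs 'Z' when disabled, as in the sketch
below (names and the 8-bit width are assumptions; note that most modern FPGAs have
no internal tri-state buffers, so synthesis tools replace such internal buses with
multiplexers):

library ieee;
use ieee.std_logic_1164.all;

entity tristate_bus is
  port (
    src_a, src_b : in  std_logic_vector(7 downto 0);
    en_a, en_b   : in  std_logic;  -- at most one may be '1' at any time
    data_bus     : out std_logic_vector(7 downto 0)
  );
end entity;

architecture rtl of tristate_bus is
begin
  -- two tri-state buffers driving the same resolved bus
  data_bus <= src_a when en_a = '1' else (others => 'Z');
  data_bus <= src_b when en_b = '1' else (others => 'Z');
end architecture;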

3.4 Controlling Unit Design

The control unit for a DSP system is a finite state machine (FSM). There are two types of
FSM, i.e. the Mealy FSM and the Moore FSM. In a Mealy FSM, the
output function depends on both the current state and the values of the input. In such an
FSM, the connection drawn with a dashed line in figure 3.2 is present. If the input values
change during a clock cycle, the output values may change as a consequence. In a Moore
FSM, on the other hand, the output function depends only on the current state, and not on
the input values. The dashed connection in figure 3.2 is absent in a Moore FSM. If the
input values change during a clock cycle, the outputs remain unchanged. In theory, for
any Mealy FSM, there is an equivalent Moore FSM, and vice versa. However, in
practice, one or the other kind of FSM will be more appropriate. A Mealy FSM may be
able to implement a given control sequence with fewer states, but it may be harder to
meet timing constraints, due to delays in the arrival of inputs used to compute the next state.
In many Finite State Machines (FSMs), there is an idle state that indicates that the system
is waiting to start a sequence of operations. When an input indicates that the sequence
should start, the FSM follows a sequence of states on successive clock cycles, with the
output values controlling the operations in a datapath. By stepping through a sequence of
states, the control unit controls the operations of the datapath. For each state that the
control unit is in, the output logic that is inside the control unit will generate all of the
appropriate control signals for the datapath to perform one data operation. These data
operations are referred to as register-transfer operations. Each register-transfer operation
consists of reading a value from a register, modifying the value by one or more functional
units, and finally, writing the modified value back into the same or a different register.
Eventually, when the sequence of operations is complete, the FSM returns to the idle
state.

Figure 3.2 Mealy Finite State Machine
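
A minimal VHDL sketch of such a control unit is shown below as a Moore FSM, here
sequencing the single A = A + 3 register-transfer operation used earlier (the state and
signal names are assumptions). The output function depends only on the current state,
matching the Moore description above:

library ieee;
use ieee.std_logic_1164.all;

entity ctrl_fsm is
  port (
    clk, rst, start : in  std_logic;
    Aload           : out std_logic  -- control signal to the datapath
  );
end entity;

architecture rtl of ctrl_fsm is
  type state_t is (s_idle, s_add);
  signal state : state_t := s_idle;
begin
  -- state register and next-state function
  process (clk)
  begin
    if rising_edge(clk) then
      if rst = '1' then
        state <= s_idle;
      else
        case state is
          when s_idle => if start = '1' then state <= s_add; end if;
          when s_add  => state <= s_idle;  -- one register-transfer, then back to idle
        end case;
      end if;
    end if;
  end process;

  -- Moore output function: depends on the current state only
  Aload <= '1' when state = s_add else '0';
end architecture;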

3.5 Embedded Processor Integration

An alternative way of realizing a hardware function is by programming an existing
processor to perform the function. A processor used in this way is an embedded
processor. Design with an embedded processor requires several steps, e.g. processor
selection, design of its interfaces, and writing the program that it runs. These steps are
described in more detail in the following sub chapters.
3.5.1 Processor Selection
Selection of an embedded processor is like selecting a microcontroller for a specific
function. The main difference is that there are more options when it comes to embedded
processors. These options allow a hardware designer to tailor the embedded processor
and perform optimizations to best fit the hardware function being designed.
Generally optimizations available to a hardware designer include elimination of
instructions that are not needed, use of just enough memory, memory mapping, data and
address lengths, memory structure, use of proper cache size or elimination of it, and use
or elimination of hardware processors like multipliers. Depending on a specific
application, a hardware designer selects an embedded processor and tailors it to best
satisfy design constraints.
An embedded processor may come as a softcore, a hardcore, or it may be designed by the
user. A softcore may be available as a pre-synthesis HDL description or as a post-synthesis
description for a specific target. A hardcore is fixed in a chip or a layout and, like
a microcontroller, offers very little, if any, customization or flexibility in the processor
core hardware. An embedded processor can also be designed by the designer who is using it
in a larger system. In this case, the designer uses a Hardware Description Language (HDL) to
design the processor and has all the freedom to choose the functionality of the processor.
3.5.2 Processor Interface Design
The next step after processor selection is configuring the external bus structure. This
includes tasks such as the design of memory and device selection logic, interrupt
handling hardware, priority and bus arbitration logic, memory-related
hardware components, and other Input / Output (I/O). For large systems this part becomes
so complex that it shifts the focus of the hardware designer from implementing the
hardware function of an embedded processor to designing the interface logic and external
processor bus structure. On the simple side, memory-mapped I/O, the use of fast single-cycle
dedicated memories, and limiting a design to a single processor simplify a design
to the level that one can design an entire embedded system without needing complex
hardware configuration tools. The implementation of interrupts, the use of complex
arbitration schemes, and the use of multiple bus masters are factors that complicate the
design of an embedded system.
3.5.3 Software Development
The last step in the embedded design of a hardware function is the development of the
software to run on the embedded processor. This step involves writing the program,
which is generally done in assembly language or C language, and compiling it to the
machine language of the newly configured embedded processor.

3.6 Summary

In this chapter, DSP algorithm modeling with innovative techniques leading toward the
design of a real-time embedded video image processing system was presented. VHDL was
used to design the DSP algorithm model. Different hardware architecture
features and processor trends were presented in this chapter. The memory performance
gap and various strategies to address this problem were also summarized. The
controlling units, represented by two models of Finite State Machine (FSM),
together with their characteristics, were discussed. The embedded processor integration,
together with its implementation procedure, was also presented. The complete flow chart
of the real-time embedded video image processing system design is shown below:

[Flow chart] Start → A DSP algorithm hardware system design → "Meet hardware
optimization design?" (if No, the hardware design is reworked; if Yes, proceed for
hardware target implementation) → Proceed for embedded processor integration process
→ "Is software able to control the hardware?" (if No, the integration is reworked; if Yes,
the real time embedded design is completed).

Figure 3.3 Real time embedded design flow
