
2011 Seventh International Conference on Natural Computation

Dual Image Processing Algorithms and Parameter Optimization


Wael Wasfy
School of Automation Science and Electrical Engineering Beijing University of Aeronautics and Astronautics, Beijing, 100191, P. R. China Email: wwassfy@yahoo.com

Hong Zheng
School of Automation Science and Electrical Engineering Beijing University of Aeronautics and Astronautics, Beijing, 100191, P. R. China Email: julyanna@vip.sina.com

Abstract—Increasing calculation speed without sacrificing per-pixel accuracy has long been a goal of fast image processing, but the speed-up obtainable through parallel computation is bounded by Amdahl's law. In practice, adding processors that share the same data bus reduces speed rather than delivering the expected 20-fold gain, and the number of processors is limited precisely because they share that bus. This paper divides the image frame into sections and discusses the ability to apply two different algorithms at the same time, using the same FPGA resources while keeping high accuracy and speed as the main targets. Inter-processor communication traditionally depends on a main data bus that sends or receives data for one processor at a time; the research in this paper avoids that technique. The work is divided into three steps. The first step computes one image frame on the FPGA in a data-flow manner, measures the timing, and formulates the equation that governs it. The second step parallelizes the FPGA computation, formulates its equation, and verifies it in practice. The third step enhances the result of step 2 by overlapping data to remove the accuracy difference between the divided-frame calculation and the complete-frame calculation of step 1. The image frame is divided into four equal parts (64 rows by 256 columns) as mentioned in step 2. Local image operations depend on neighbouring pixels, so whereas step 1 loses edge accuracy only at the borders of the whole frame, step 2 loses it at the borders of all four parts. This drawback is overcome by immersing an overlapped parallel calculation between the lower half of the upper RAM part and the upper half of the lower RAM part (step 3). Comparing the whole-frame results with the enhanced parallel results confirms that steps 1 and 3 produce the same Gaussian output inside the image frame.
Formulating an equation for FPGA parallel computing is very useful for performance evaluation and design-time estimation, with the added advantage of not using a data bus as a common transfer medium. The parallel design also makes it possible to apply a different algorithm to each part of the image frame.

Index Terms—Fast image processing; DSP slice; Embedded systems

I. INTRODUCTION

Researchers have long aimed to reduce processing time and increase data computation speed; reaching real time, or even near real time, remains the main goal. Two different trends have pursued this target: the first increases the clock frequency of the operating processor as far as possible, while the other uses parallelism (multiple processing elements for one task). Because the drawbacks of increasing the clock frequency are power consumption and high temperature, researchers turned again to parallel computation. The concurrent advance of VLSI technology allows a very large number of components to fit on a chip, and changing the computing architecture for parallelism is one of the methods of parallel processing. Moore's law is the empirical observation that transistor density in a microprocessor doubles every 18 to 24 months [1].

Our contribution in this paper is threefold: first, defining a loosely coupled parallel computing design method for fast image processing in which the processing time in machine cycles does not exceed 1.5 times the convolution of the image frame size; second, formulating the equation that gives the exact machine-cycle count (and from it the exact processing time); third, optimizing the selection of the image kernel size through a parameter study of the Gaussian algorithm.

Studying parallel computation capabilities. Parallel computing is defined as many calculations carried out simultaneously [2], each in a processing element (PE): large problems can often be divided into independent smaller ones, and each processing element executes its part, so the parts are solved concurrently (in parallel). Processing elements can be diverse, including a single computer with multiple processors, several networked computers, specialized hardware, or any combination of the above [3]. In other words, parallel computing is a collection of processing elements that communicate and cooperate to solve large problems fast.

Studying kernel size optimization. Reducing the kernel size cuts the number of inputs, and consequently the number of operations, by more than half. This was the starting point of our research: discussing the difference in pixel output when using different kernel sizes, such as 3x3 (9 inputs) versus 5x5 (25 inputs).

Studying parameter optimization versus kernel size. The discussion leads to optimizing the volume of processing operations and the number of
978-1-4244-9953-3/11/$26.00 2011 IEEE


operations needed. The sigma parameter in the Gaussian filter algorithm determines the kernel size needed, and consequently the volume of operations per output pixel calculation.

Studying the application of different algorithms. The proposed design allows two different image processing algorithms: a Gaussian filter and a Sobel-x edge detector were implemented with a one-bit select option deciding which algorithm to calculate. Logic 0 selects the Gaussian filter and logic 1 selects the Sobel-x edge detector.

Solving drawbacks and verification. A design drawback arises when calculating the Gaussian filter and Sobel-x edge detector algorithms: the output of a local-operation image processing algorithm depends on the values of the surrounding pixel elements. In sequential processing (as in the first step) this effect appears once, at the frame boundary; in parallel computing it appears at the edges of every branch the frame is divided into, four branches in our case. It was solved, and verified, by taking a copy of the boundary data, processing it separately with all the correct surrounding pixels, and then immersing it back with a suitable multiplexer.

II. PARALLEL COMPUTING STUDY

A. Amdahl's law

B. Parallel computing dependencies

The computation speed-up is strongly tied to data dependency, so understanding data dependencies is fundamental when implementing parallel algorithms. No program can run more quickly than its longest chain of dependent calculations (the critical path), since calculations that depend upon prior calculations in the chain must be executed in order. However, most algorithms do not consist of just one long chain of dependent calculations; there are usually opportunities to execute independent calculations in parallel if Bernstein's conditions [4] are satisfied. Let Pi and Pj be two program fragments.
Bernstein's conditions describe when these two fragments are independent and can be executed in parallel. For Pi, let Ii be its set of input variables and Oi its set of output variables, and likewise for Pj. Then Pi and Pj are independent if they satisfy:

Ij ∩ Oi = ∅   (1)
Ii ∩ Oj = ∅   (2)
Oi ∩ Oj = ∅   (3)
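As an illustration (our own sketch, not part of the paper's FPGA design), conditions (1)-(3) can be checked directly with Python sets; the function name and the example fragments are ours:

```python
def bernstein_independent(in_i, out_i, in_j, out_j):
    """Bernstein's conditions: fragments Pi and Pj can run in parallel
    iff Ij∩Oi, Ii∩Oj and Oi∩Oj are all empty."""
    return (not (in_j & out_i) and   # condition (1)
            not (in_i & out_j) and   # condition (2)
            not (out_i & out_j))     # condition (3)

# Pi: c = a + b  (reads a, b; writes c)
# Pj: d = a * 2  (reads a; writes d)          -> independent
print(bernstein_independent({"a", "b"}, {"c"}, {"a"}, {"d"}))  # True

# Pj': d = c * 2 (reads c, which Pi writes)   -> must run after Pi
print(bernstein_independent({"a", "b"}, {"c"}, {"c"}, {"d"}))  # False
```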

C. Flynn's taxonomy

Michael J. Flynn created one of the earliest classification systems for parallel and sequential computers and programs, now known as Flynn's taxonomy. Flynn classified programs and computers by whether they operate using a single set or multiple sets of instructions [5].

III. PARALLEL PROCESSING ON FPGA FOR IMAGE PROCESSING ALGORITHM

Reviewing the parallelization classifications to find a suitable method for image processing: various architectures can be used to solve vision problems, such as pipelines, Multiple Instruction Multiple Data stream (MIMD) machines [6], Single Instruction Multiple Data stream (SIMD) machines [7], or processors with extended instruction sets. Parallel programming languages are considered an open issue and a direction for future research [8]. Regarding parallel architectures for image processing: the image processing community relied for many years on special-purpose computers to accelerate its computations before turning towards parallel processing technology, seeking to benefit simultaneously from reliable, fast hardware and from software programming tools that allow implementing more complex image and vision applications. The commercial success of any computer architecture depends significantly on the availability of tools that simplify software development; consequently, many programming tools have been developed that attempt to solve the difficulties and obstacles facing software language design for parallel and distributed computers [8]. Our study uses the ability to connect Matlab 2007a with the FPGA software tool, the ISE 10.1 package from Xilinx.

A. Step 1: one image frame in a data-flow computing technique

In our research we construct a structure using the DSP slices of an FPGA Virtex-4, XC4VSX55-12FF1148, to build a module that calculates the Gaussian filter algorithm. From this module we obtain the Gaussian difference code as the difference between an input pixel and its Gaussian output, taking into consideration the delay, the number of machine cycles it corresponds to, and the time at a running frequency of 100 MHz.

1) Xilinx FPGA DSP slice: After making the design with Matlab Simulink from MathWorks and verifying it using Matlab simulation capabilities.
The design is then converted into the Xilinx Integrated Software Environment (ISE 10.1) for simulation before downloading to the FPGA chip, and for checking the number of machine cycles of delay needed to obtain the corresponding output pixel value for a given input. The output pixel brightness represents the Gaussian low-pass filter of the input frame; the delay was found to be 9 machine cycles.


Fig. 1. Parameter study of sigma (σ) for kernel-size optimization.

During this work the issue of the kernel size of local image operations came up: increasing the kernel size increases complexity (a 5x5 kernel versus a simpler 3x3 module), raises the number of calculation processes, and increases the machine-cycle delay. A larger kernel size is also proportionally related to higher FPGA resource utilization.

2) Gaussian filter: The Gaussian low-pass filter can be used with many kernel sizes, such as 3x3, 5x5 and 7x7, depending on the 2-D Gaussian equation used for the weight matrix calculation, where x and y vary over (-1, 0, 1) for a 3x3 kernel, (-2, -1, 0, 1, 2) for a 5x5 kernel, and (-3, -2, -1, 0, 1, 2, 3) for a 7x7 kernel:

g(x, y) = (1 / (2πσ²)) · e^(−(x² + y²) / (2σ²))   (4)
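For illustration, the weight matrices of equation (4) can be generated in a few lines. This is our own Python sketch (the helper name `gaussian_kernel` is ours, not from the paper), and it also shows that a smaller kernel reappears unchanged at the centre of the next larger one:

```python
import numpy as np

def gaussian_kernel(size, sigma):
    """Weight matrix from g(x, y) = exp(-(x² + y²) / (2σ²)) / (2πσ²),
    with x, y ranging over (-1, 0, 1) for 3x3, (-2..2) for 5x5, etc."""
    half = size // 2
    x, y = np.meshgrid(np.arange(-half, half + 1),
                       np.arange(-half, half + 1))
    return np.exp(-(x**2 + y**2) / (2 * sigma**2)) / (2 * np.pi * sigma**2)

k3, k5 = gaussian_kernel(3, 0.5), gaussian_kernel(5, 0.5)
# The 3x3 weights sit at the centre of the 5x5 matrix:
print(np.allclose(k5[1:4, 1:4], k3))  # True
```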

3) Gaussian filter, 3x3 kernel: The low-level-vision 3x3 Gaussian low-pass filter algorithm [9], [10] is a two-dimensional low-pass filter using a 3x3 kernel. Calculating each destination pixel from a number of neighbouring source pixels gives a region of interest (ROI) reduced from the source image by one row and one column on each side.

4) Gaussian filter, 5x5 kernel: Using a 5x5 kernel not only increases the number of calculations but also reduces the ROI from the source image by two rows and two columns on each side. With a 7x7 kernel the concept is the same: it increases the calculation steps and reduces the ROI by three rows and three columns on each side. The frame size used is 256x256 as a calculation unit; other frame sizes can be measured from it. The delay line used for a 3x3 kernel is 256x3, for a 5x5 kernel it is 256x5, and consequently for 7x7 it is 256x7, which increases the FPGA resource utilization. The number of delay blocks depends on the size of the convolution kernel, while the delay-line depth depends on the number of pixels in each line. Each incoming pixel is at the centre of the mask, and the line buffers produce the neighbouring pixels in adjacent rows and columns [11]. Our goal in this paper is a study of the sigma parameter

in the Gaussian low-pass filter and its effect for different kernel sizes, letting designers choose for themselves the optimum kernel size for their Gaussian low-pass filter according to the sigma value. To achieve this goal we calculate the weight matrix for each sigma value in (0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1). The first value is 0.01, treated as almost zero, because the equation cannot be divided by zero (it is undefined); 0.01 is also small enough, since the output pixel result already exceeds 255, our limit for pixel brightness. Table 1 represents the weighting matrix for sigma = 0.01; this matrix contains all three weighting matrices, 3x3, 5x5 and 7x7. Using the different sizes and multiplying each by data matrices that vary from all-1 to all-255 in Matlab gives three output series for the different kernels. Subtracting the 3x3 series from the 5x5 series gives the comparison chart of the two kernel sizes over the whole data range 1:255; doing the same for the 7x7 and 5x5 series gives the comparison chart between kernel sizes 7x7 and 5x5. As a result, two comparison charts at each sigma step of 0.1 guide the designer to the kernel that best fits the sigma value.

5) First method, using the Gaussian equation: Calculate the kernel weight coefficients at the minimum sigma parameter, near zero but not zero. Here x and y are the kernel-size parameters: for kernel size 3x3 they are x = [-1 0 1], y = [-1 0 1]; for 5x5, x = [-2 -1 0 1 2], y = [-2 -1 0 1 2]; for 7x7, x = [-3 -2 -1 0 1 2 3], y = [-3 -2 -1 0 1 2 3]; and so on.
In that case the 5x5 kernel equals the 3x3 kernel plus the boundary weight coefficients around the 3x3 kernel (x = -2, 2 for all y, and y = -2, 2 for all x); likewise the 7x7 kernel equals the 5x5 kernel plus the boundary coefficients around the 5x5 kernel (x = -3, 3 for all y, and y = -3, 3 for all x). The weights calculated this way are each multiplied by a pixel brightness from 0 to 255 and summed to give the final pixel value. It follows that a 5x5 kernel is equal to a 3x3 kernel if the sum of the boundary weight coefficients around the 3x3 kernel, multiplied by the maximum brightness value (255), is less than or equal to 0.444. We choose 0.444 because grayscale pixel brightness takes only integer values from 0 to 255, so rounding 0.444 yields zero; we multiply by 255 to make sure the maximum possible contribution still does not exceed 0.444. After verifying this rule we increased the sigma parameter iteratively in steps of 0.01 under the same condition. As a result we obtain the exact sigma values: kernel 5x5 equals kernel 3x3 for sigma less than or equal to 0.53, kernel 7x7 equals 5x5 for sigma less than or equal to 0.81, kernel 9x9 equals 7x7 for sigma less than or equal to 1.11, and kernel 11x11 equals 9x9 for sigma less than or equal to 1.41, as shown in figure 1.

6) Second method, by practical calculation: The second method to determine and prove at which sigma values adjacent kernel sizes are equal to each other takes sigma values from near zero up to 1, in the steps 0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1. For each sigma value we compute the kernel weight coefficients, then take data matrices that vary from all ones upward in steps of 1 until the maximum brightness value 255, giving 255 matrices. For each matrix we calculate the output brightness for the different kernel sizes and compare each pair of adjacent sizes. This is not a precise method for finding the exact sigma value, but with a step of 0.1 it gives nearby values. For sigma = 0.5 the weight coefficients are as shown in table 1: only the 3x3 kernel has weighted coefficient values, and the kernel boundary is all zeros. These zeros come from the mathematical formula, and their physical meaning is that those cells take no part in the calculation because their weight is zero. Starting from sigma = 0.6, as shown in table 2, the boundaries have weighted values, which means they affect the calculated output pixel brightness, producing a difference between kernels 5x5 and 3x3. For further confirmation we calculate the Gaussian low-pass filter for kernel 3x3 over the data range 1 to 255, calculate the same for kernel 5x5 over the same range, and subtract the two to see the difference values over the whole range.

7) Sigma parameter optimization: In this study the value of the sigma parameter (σ) in the Gaussian filter has a great effect on the kernel size.
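The iterative search of the first method can be sketched as below. This is our own approximation of the procedure (all function names are ours), using the ≤ 0.444 rounding criterion on the outer ring of weights; the thresholds it finds land near, though not necessarily exactly on, the reported values such as 0.53 and 0.81:

```python
import numpy as np

def gaussian_weight(x, y, sigma):
    # one weight coefficient from the 2-D Gaussian equation
    return np.exp(-(x**2 + y**2) / (2 * sigma**2)) / (2 * np.pi * sigma**2)

def boundary_weight_sum(size, sigma):
    """Sum of the outermost ring of a size x size kernel, i.e. what that
    kernel adds on top of the next smaller one."""
    half = size // 2
    return sum(gaussian_weight(x, y, sigma)
               for x in range(-half, half + 1)
               for y in range(-half, half + 1)
               if max(abs(x), abs(y)) == half)

def sigma_threshold(size, step=0.01):
    """Largest sigma (in 0.01 steps) for which the outer ring, scaled by
    the maximum brightness 255, still rounds to zero (<= 0.444)."""
    sigma = step
    while boundary_weight_sum(size, sigma + step) * 255 <= 0.444:
        sigma += step
    return round(sigma, 2)

print(sigma_threshold(5))  # near the reported 0.53 (5x5 reduces to 3x3)
print(sigma_threshold(7))  # near the reported 0.81 (7x7 reduces to 5x5)
```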
The advantage of this parameter study is that the designer can choose the minimum kernel size that has the same effect as a larger one, which reduces resource utilization on the target FPGA and removes unused calculations, saving time; processing time is the most important target in image processing. The Gaussian low-pass filter is the essential low-level image processing algorithm and is commonly used as a preprocessing technique. Studying the sigma parameter effect can also avoid unnecessarily complicated design connections and keep them to a minimum. Since a 3x3 kernel is sufficient for our sigma parameter, with 9 machine cycles of delay for an image size of 256x256, the concluded formula for the time taken to calculate one image frame is as follows:

total number of machine cycles = delay for one pixel + number of elements flowing through one frame

total number of machine cycles = 9 + 256 × 256 = 65,545 machine cycles; at a 100 MHz frequency (10 ns per cycle), the total time is 65,545 / 100 MHz = 0.65545 ms per frame.

B. Step 2: parallel computing, formulating its equation, and getting the output result

1) Block diagram: As mentioned above when discussing DSP, the best parallel computing structure is loosely coupled, since it can expand to any number of processing elements without sharing the same data bus or a similar communication port such as the HPI in a DSP. We used the facility of creating RAM inside the FPGA to make the separation, using the RAM's write-while-read option to save calculation time.

2) Design drawback: Calculating the Gaussian filter, a local-operation image processing algorithm, requires the values of all surrounding pixel elements. When the data flows as in the first step there is only one image frame, but when the image is divided into four parts the result has edge miscalculations not only at the boundary of the frame but also at the boundaries of the four parts, as seen in figure 2.

C. Step 3: performance enhancement and finalizing the formula by avoiding the edge drawback

This drawback is resolved, as figure 3 shows, by dividing the problem into two parts. First, another RAM, which we call RAM1_2, is immersed between RAM1 and RAM2; it copies 8 rows from the RAM above and 8 rows from the RAM below, and its contents are calculated separately, giving three outputs: from RAM1, RAM2 and RAM1_2. Second, a decoding system ensures that at the edges of RAM1 and RAM2 the data is taken from RAM1_2. We verified the output data and compared it with the first case, and found that the immersed-RAM technique overcomes and corrects the drawback of dividing the data for parallel processing.
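The principle of step 3 can be illustrated in software. The sketch below is our own Python model of the idea, not of the RAM1_2 hardware: each of the four row bands borrows the minimum neighbour rows a 3x3 kernel needs (the actual design copies 8 rows), so the stitched result matches a whole-frame calculation:

```python
import numpy as np

def convolve_valid(image, kernel):
    """'Valid' 2-D convolution: a k x k kernel shrinks an R x C frame
    to (R - k + 1) x (C - k + 1), since edge pixels lack neighbours."""
    k = kernel.shape[0]
    out = np.empty((image.shape[0] - k + 1, image.shape[1] - k + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            out[r, c] = np.sum(image[r:r + k, c:c + k] * kernel)
    return out

def filter_in_four_bands(image, kernel):
    """Split the frame into 4 row bands, let each band borrow the
    neighbour rows it needs (the role RAM1_2 plays in the design),
    filter the bands independently, and stitch them back together."""
    pad = kernel.shape[0] // 2        # rows borrowed per shared edge
    n = image.shape[0]
    band = n // 4
    pieces = []
    for i in range(4):
        top = max(i * band - pad, 0)
        bot = min((i + 1) * band + pad, n)
        pieces.append(convolve_valid(image[top:bot], kernel))
    return np.vstack(pieces)

gaussian_3x3 = np.array([[1, 2, 1], [2, 4, 2], [1, 2, 1]]) / 16.0
frame = np.random.default_rng(0).random((64, 64))
whole = convolve_valid(frame, gaussian_3x3)
split = filter_in_four_bands(frame, gaussian_3x3)
print(np.allclose(whole, split))  # True: band edges no longer differ
```

With the borrowed rows in place, the edge miscalculation of figure 2 disappears and only the whole-frame boundary reduction remains.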
As a result of using RAM for splitting and combining the data, the cost is 2 machine cycles added to the formula, so the concluding result is as follows:

total number of machine cycles = delay for one pixel + number of elements flowing through one frame = 11 + 256 × 256 = 65,547 machine cycles; at 100 MHz (10 ns per cycle), the total time is 65,547 / 100 MHz = 0.65547 ms per frame.

IV. PROPOSED COMPUTATION EQUATION

The processing time is

t_Proc = (N1 + N2 + N3 + N4) / Frequency(Hz)   (5)

where N1 is the number of machine cycles taken for serial-to-parallel conversion; N2 is the number of machine cycles taken for a serial input to reach the output in the specific algorithm; N3 is the image size (total number of pixels); and N4 is the number of machine cycles taken for parallel-to-serial conversion.


In our case N1 = 1, N2 = 9, N3 = 65,536 and N4 = 1, at a frequency of 100 MHz. The proposed parallel computing design fixes N1 = N4 = 1 for different algorithms, and N3 = 256 × 256 (m × n) is the image frame size; the only variable is N2, which is 9 for the Gaussian filter algorithm. For other algorithms N2 will always be very small compared with the image size N3, so the processing time is roughly 1 ms at 100 MHz:

t_Proc = (1 + 9 + 65536 + 1) / 100,000,000 = 0.65547 ms   (6)
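Equation (5) can be wrapped in a small helper for design-time estimation; this is our own sketch, and the parameter names are ours:

```python
def processing_time(rows, cols, algo_delay_cycles,
                    freq_hz=100e6, ser_to_par=1, par_to_ser=1):
    """Equation (5): t_Proc = (N1 + N2 + N3 + N4) / Frequency, with
    N1/N4 the serial<->parallel conversion cycles, N2 the algorithm's
    pipeline delay and N3 the number of pixels in the frame."""
    n3 = rows * cols
    return (ser_to_par + algo_delay_cycles + n3 + par_to_ser) / freq_hz

# Gaussian filter (N2 = 9) on a 256 x 256 frame at 100 MHz:
print(processing_time(256, 256, 9) * 1e3, "ms")  # ≈ 0.65547 ms
```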

Fig. 2. Four-branch parallel computing: (b) original image; (c) image side-effect result.

Fig. 3. Algorithm manipulation by the Sobel-x edge detector and Gaussian filter: (a) result of eliminating the design side effect; (b) Sobel-x algorithm; (c) Sobel-x/Gaussian algorithms.

CONCLUSION

The advantage of this parameter study of the Gaussian filter algorithm is that the designer can choose the minimum kernel size that has the same effect as a larger one, which reduces resource utilization on the target FPGA and removes unused calculations to save time; processing time is the most important target in fast image processing. The Gaussian low-pass filter is the essential low-level image processing algorithm and is commonly used as a preprocessing technique. Studying the sigma parameter effect can also avoid unnecessarily complicated design connections and keep them to a minimum. A further advantage is that the image frame can be split into four divisions at a minimum time cost of 2 machine cycles, which allows more freedom in the data calculation. Finally, we obtained the formula that governs fast image processing data-flow calculations. The limitations due to fixed-point arithmetic on the FPGA can be overcome by the hardware experience of the designer. In our view, FPGA implementations of image processing algorithms will prove increasingly suitable and stable, in addition to offering higher-speed processing techniques.

REFERENCES

[1] G. E. Moore, "Cramming more components onto integrated circuits," Electronics Magazine, p. 4, 1965.
[2] G. S. Almasi and A. Gottlieb, Highly Parallel Computing. Redwood City, CA: Benjamin Cummings, 1989.
[3] B. Barney, "Introduction to Parallel Computing," Lawrence Livermore National Laboratory, 2007.
[4] A. J. Bernstein, "Analysis of programs for parallel processing," IEEE Trans. on Electronic Computers, vol. EC-15, pp. 757-762, Oct. 1966.
[5] J. L. Hennessy and D. A. Patterson, Computer Architecture: A Quantitative Approach, 3rd ed. Morgan Kaufmann, p. 43, 2002.
[6] R. C. Pearce and J. C. Majithia, "Analysis of a shared resource MIMD computer organization," IEEE Transactions on Computers, vol. C-27, no. 1, 1978.
[7] L. J. Siegel, H. J. Siegel, and P. H. Swain, "Performance measures for evaluating algorithms for SIMD machines," IEEE Transactions on Software Engineering, vol. SE-8, no. 4, 1982.
[8] A. Merigot and A. Petrosino, "Parallel processing for image and video processing: issues and challenges," Parallel Computing, vol. 34, pp. 694-699, 2008.
[9] R. C. Gonzalez and R. E. Woods, Digital Image Processing, 2nd ed. Pearson Education International, 2002.
[10] D. Baumgartner, P. Rossler, and W. Kubinger, "Performance benchmark of DSP and FPGA implementations of low-level vision algorithms," IEEE, 2007.
[11] J. A. Kalomiros and J. Lygouras, "Design and evaluation of a hardware/software FPGA-based system for fast image processing," Microprocessors and Microsystems, pp. 95-106, 2008.
