
Feature-Based Aerial Image Registration and Mosaicing

A Report Submitted in Partial Fulfillment of the Requirements for the Degree of Bachelor of Technology

by Gaurav Gupta Y2150

to the Department of Electrical Engineering

Indian Institute of Technology, Kanpur


April, 2006

Certificate

This is to certify that the work contained in the thesis entitled Feature-Based Aerial Image Registration and Mosaicing, by Gaurav Gupta, has been carried out under my supervision and that this work has not been submitted elsewhere for a degree.

April, 2006

-----------------------------------------(Dr. Sumana Gupta) Department of Electrical Engineering, Indian Institute of Technology, Kanpur.

------------------------------------------(Dr. Amitabha Mukerjee) Department of Computer Science and Engineering, Indian Institute of Technology, Kanpur.

ACKNOWLEDGEMENT

I am extremely thankful to Dr. Amitabha Mukerjee and Dr. Sumana Gupta, who provided me with the support and guidance that were indispensable for my work. All my doubts were welcome. I would also like to thank Dr. Jharna Majumdar, under whose guidance I gained knowledge and experience during my short stay at ADE, Bangalore, which helped me greatly in working on this project. I would like to thank Dr. A. K. Ghosh for organizing the flight for collecting the data required for this project. Furthermore, I would like to extend my sincere gratitude to Mr. Shobhit Niranjan, M.Tech (Dual) student, Dept. of Electrical Engineering, IIT Kanpur, who sat with me to sort out problems in my code and algorithms. I would also like to thank Mr. Subhranshu Maji, B.Tech student, Dept. of Computer Science and Engineering, who helped me with coding some parts. I also thank my teammates from Aerospace Engineering and Computer Science and Engineering who are with me in the UAV Project Group at IIT Kanpur for motivation, support and company in critical times.


Contents

1 Introduction
  1.1 Geo-Registration
  1.2 Geometric Transformations
  1.3 Aerial Image Registration
  1.4 Image Mosaicing

2 Feature Extraction
  2.1 Corner Detection
  2.2 Saliency Map and Salient Points based on Itti's Model

3 Registration Technique
  3.1 Registration without using Correspondence and Mosaicing

4 Discussion

5 Conclusion and Further Study


Chapter 1 Introduction
The ability to locate scenes and objects visible in aerial video imagery at their corresponding locations in a reference coordinate system is becoming increasingly important in visually guided navigation, surveillance and monitoring systems [2] [11]. The availability of low-cost, lightweight video camera systems, high-bandwidth VHF communication links and a growing inventory of Unmanned Aerial Vehicles (UAVs) and Mini Aerial Vehicles (MAVs) has created dramatic new opportunities for surveillance and sensing applications of such algorithms. A typical UAV/MAV mission involves registering the video of a particular surveillance flight with a reference image, i.e., aligning the video frames with pre-calibrated reference imagery (a DEM and satellite data). Frame-to-reference registration of video is difficult because a typical frame does not cover a large enough area to provide stable features/salient points, and it is also computationally expensive. A possible solution is to register a mosaic, created from the MAV mission flight data, with the reference image. The objective of this work is to study feature-based techniques for aerial registration of typical MAV mission flight data in a semi-urban environment and to create a mosaic using the estimated geometric transformation parameters. The next task is to register this generated mosaic with a high-resolution satellite reference image, which would allow us to locate any given landmark in a MAV mission video in the world coordinate system. The two aspects of this work are: 1. Feature extraction, 2. Registration. In this semester so far, we have studied various important features that can be used for registration: the Harris corner detector, the KLT corner detector, and visually salient points using Itti's model.

Features are extracted successfully. Rotation and scale parameters are determined using a recent registration technique devised by Xiong et al. [1]. The algorithm in [1] differs from conventional feature-based image registration algorithms in that it does not need image matching or correspondence. We compare different feature extraction algorithms, use those features for registration, and compare the results. We also extend the approach in [1] to find the translation and use the estimated parameters to generate a mosaic. Code for mosaicing has been written and some results have been obtained, although the mosaicing algorithm is not yet robust to noise in the registration parameters. The implementation is done on the Windows platform in Visual C++ 6.0 with the OpenCV [29] library. Next semester's tasks will be to find the translation parameters robustly and to create a mosaic from MAV video data. In the future the mosaic will be geo-registered with ortho-rectified satellite image data (the reference image data). We are planning to buy satellite image data of the Kanpur area for this purpose from NRSA (National Remote Sensing Agency). Some preprocessing is also needed: the collected aerial data generally has noise due to weather conditions, motion compensation is required, and camera parameters have to be estimated for pre-warping the images; these factors affect the registration procedure and its accuracy.

1.1 Geo-Registration

Computer vision techniques can be used to align any given video frame with pre-calibrated reference imagery. This kind of registration is known as geo-registration [2]. The Reference Imagery is a high-resolution orthographic image, usually with a Ground Sampling Distance of 1 m (meaning a pixel corresponds to 1 m² on the ground). This Reference Imagery is geodetically aligned and has an associated Digital Elevation Map (DEM), so that each pixel of the Reference Imagery has a precise longitude, latitude, and height associated with it. The Reference Imagery, which covers a substantial area, can be cropped on the basis of the telemetry data (telemetry is an automatic measurement of data that defines the position of the camera in terms of nine parameters: vehicle latitude, vehicle longitude, vehicle height, vehicle roll, vehicle pitch, vehicle heading, camera elevation, camera scan angle and camera focal length) to a smaller area corresponding to Ivideo(x), which denotes the video frame or mosaic created from the aerial data. This cropped reference image is referred to as Iref(x).

Two transformation functions exist between the reference and the aerial image. If x is an image point (in pixels), I1(x) is the aerial image array and I2(x) is the reference image array, then:

1. freg(x) is the geometric transformation. Finding this mapping is the problem of image registration,

\[
I_1(x) \approx I_2\big(f_{reg}(x)\big)
\tag{1.1}
\]

2. fcolor is the intensity/colour mapping between the aerial and reference image,

\[
I_1(x) = f_{color}\big(I_2(f_{reg}(x))\big)
\tag{1.2}
\]

Finding freg(x) is the larger challenge. Shah et al. [2] identify the following difficulties:

1. The two imageries are in different projection views: Ivideo(x) is an image of perspective projection, whereas Iref(x) is an image of orthographic projection. While the telemetry information can be used with a sensor model to bring both images into a single projection view, this correction is only approximate and residual projection differences remain.

2. Because of the large duration of time that elapses between the capturing of the two images, data distortions such as severe lighting and atmospheric variations and object changes in the form of forest growth or new construction cause a high number of disjoint features (features present in one image but not in the other).

3. Remotely sensed terrain imagery, in particular, has the property of being highly self-correlated both as image data and elevation data. This includes first-order correlations (locally similar luminance or elevation values in buildings), second-order correlations (edge continuations in roads, forest edges, and ridges), as well as higher-order correlations (homogeneous textures in forests and homogeneous elevations in plateaus).

In the past, substantial research has been directed towards determining the geo-location of objects from an aerial view. Several systems such as Terrain Contour Matching (TERCOM) [3], SITAN, Inertial Navigation/Guidance Systems (INS/IGS), Global Positioning Systems (GPS) and, most recently, Digital Scene-Matching and Area Correlation (DSMAC) have already been deployed in applications requiring geo-location. While each of these systems has had some degree of success, several shortcomings and deficiencies have become increasingly apparent.

By understanding the limitations of these systems, we can acquire a better appreciation for the need for effective image-based systems.

Two types of approaches can be distinguished for the geo-registration problem: Elevation-Based Correspondence and Image-Based Correspondence. Elevation-based algorithms attempt to achieve alignment by matching the DEM with an elevation map recovered from the video data. Aggarwal et al. [4] perform pixel-wise stereo analysis of successive frames to yield a recovered elevation map, or REM, as the initial data rectification step. In [5], Sim and Park propose another geo-registration algorithm that reconstructs a REM from stereo analysis of successive video frames. Normalized cross correlation based point matching is used to recover the elevation values. Elevation-based approaches (based on DEMs) have the general drawback that they rely on the accuracy of the elevation recovered from two frames, a task found to be notoriously difficult.

Intensity-based approaches to geo-registration use intensity properties of both imageries to achieve alignment. Work has been done on image-based techniques for the registration of two sets of reference imageries [6], as well as the registration of two successive video images ([7], [8]). In [9], Cannata et al. use the telemetry information to bring a video frame into an orthographic projection view by associating each pixel with an elevation value from the DEM. By ortho-rectifying the aerial video frame, the process of alignment is simplified to a strict 2D registration problem. Correspondence is achieved by taking 32 × 32 pixel patches uniformly over the aerial image and correlating them with a larger search patch in the Reference Image, using normalized cross correlation. Finally, the sensor parameters are updated using a conjugate gradient method, or by a Kalman filter to stress temporal continuity. An alternate approach is presented by Kumar et al. in [10] where, instead of ortho-rectifying the aerial video frame, a perspective projection of the associated area of the Reference Image is performed. In [10], two further data rectification steps are performed. Video frame-to-frame alignment is used to create a mosaic, providing greater context for alignment than a single image. For data rectification, a Laplacian filter at multiple scales is then applied to both the video mosaic and the reference image. To achieve correspondence, two stages of alignment are used: coarse followed by fine alignment. For coarse alignment, salient (feature) points are defined as the locations where the response in both scale and space is maximum. Normalized correlation is used as a match measure between salient points and the associated reference patch. One feature point is picked as a reference, and the correlation surfaces for each feature point are then translated to be centered at the reference feature point.

In subsequent work [11], the filter is modified to use the Laplacian of Gaussian filter as well as the Hilbert transform, in four directions, to yield four oriented energy images for each aerial video frame and for each perspectively projected reference image. Instead of considering video mosaics for alignment, the authors use a mosaic of 3 key-frames from the data stream, each with at least 50 percent overlap. The major limitation of the intensity-based approaches is the assumptions that are made. The research literature on image-based correspondence is quite vast; [12] is a general survey of some of these registration techniques. Alignment by maximization of mutual information [13] is another frequently used registration approach; while it provides a high level of robustness, it also allows many false positives when matching over a search area of the nature encountered in geo-registration. In addition to working without GPS, it is also possible to consider situations where the telemetry data is unavailable or corrupted; however, due to the lack of an initial point for the visual search, this results in a significant increase in computational time.

1.2 Geometric Transformations

A geometric transformation is a mapping that relocates image points. Transformations can be global or local in nature. Global transformations are usually defined by a single set of parameters, which is applied to the whole image. Some of the most common global transformations are affine, perspective, and polynomial transformations. The affine transformations include translation, scaling and shear motion parameters. Translation and rotation transforms are usually caused by the different orientation of the sensor, while a scaling transform is the effect of a change in the altitude of the sensor. Sensor distortion or the viewing angle may cause stretching and shearing. Rigid transformations account for object or sensor movement in which objects in the images maintain their relative shape and size [14]. A rigid-body transformation is composed of a combination of a rotation θ, a translation tx in the x direction, a translation ty in the y direction, and a scale s. It can be written as

\[
\begin{bmatrix} x_2 \\ y_2 \end{bmatrix} =
\begin{bmatrix} t_x \\ t_y \end{bmatrix} +
s \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix}
\begin{bmatrix} x_1 \\ y_1 \end{bmatrix}
\tag{1.3}
\]

where (x2, y2) is the new transformed coordinate of (x1, y1), tx and ty are the x-axis and y-axis translations, and s is a scale factor. The general 2D affine transformation can be expressed as shown in the following equation:

Figure 1.1: (a) Based on the telemetry data, which specifies the area of the Reference Imagery the camera is capturing, the Reference Image is cropped. (b) The aerial video frame before and (c) after geo-registration with the cropped Reference Image. Note that the Reference Image is an orthographic image while the aerial video frame is a perspective image. Images are taken from [2] for illustration.

\[
\begin{bmatrix} x_2 \\ y_2 \end{bmatrix} =
\begin{bmatrix} t_x \\ t_y \end{bmatrix} +
A \begin{bmatrix} x_1 \\ y_1 \end{bmatrix},
\qquad
A = \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix}
\tag{1.4}
\]

where (x2, y2) is the new transformed coordinate of (x1, y1). The matrix A can be a combination of rotation, scale, or shear. The rotation matrix is similar to that in Eq. 1.3. The scale for both the x and y axes can be expressed as:

\[
\mathrm{Scale} = \begin{bmatrix} S_x & 0 \\ 0 & S_y \end{bmatrix}
\tag{1.5}
\]

However, local distortions may be present in the scene due to motion parallax, the movement of objects, etc. The parameters of a local mapping transformation vary across the different regions of the image to handle local deformations. These parameters can be determined by subdividing the image into small image parts. A small sketch of applying the global rigid transformation of Eq. 1.3 is given below.
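As an illustration of Eq. 1.3, the following is a minimal sketch in C++ with OpenCV (the library used for this work) that builds the 2 × 3 matrix for a given rotation, scale and translation and warps a frame with it; the file names and parameter values are hypothetical, chosen only for the example.

```cpp
// Minimal sketch: apply the rigid-body transform of Eq. 1.3 to an image.
// The angle, scale, translation and file names below are illustrative only.
#include <opencv2/opencv.hpp>
#include <cmath>

int main()
{
    cv::Mat src = cv::imread("frame.png");          // hypothetical input frame
    if (src.empty()) return 1;

    double theta = 5.0 * CV_PI / 180.0;             // rotation in radians
    double s = 0.97, tx = 57.5, ty = 1.6;           // scale and translation (pixels)

    // [x2; y2] = [tx; ty] + s * R(theta) * [x1; y1]  (Eq. 1.3) as a 2x3 matrix.
    cv::Mat M = (cv::Mat_<double>(2, 3) <<
                 s * std::cos(theta), -s * std::sin(theta), tx,
                 s * std::sin(theta),  s * std::cos(theta), ty);

    cv::Mat dst;
    cv::warpAffine(src, dst, M, src.size());        // resample the frame
    cv::imwrite("frame_warped.png", dst);
    return 0;
}
```

The same matrix form covers the general affine case of Eq. 1.4 once the four entries of A are filled in directly.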

1.3 Aerial Image Registration

Image registration is the process of determining the geometric transformation defined above between a newly sensed image, called the input image, and a reference image of the same scene that may have been taken at a different time, from a different sensor, or from a different viewpoint. Current automated registration techniques can be classified into two broad categories: area-based and feature-based techniques. In the area-based algorithms, a small window of points in the sensed image is compared statistically with windows of the same size in the reference image [15]. Window correspondence is based on the similarity measure between two given windows. The measure of similarity is usually the normalized cross correlation. Area-based techniques can be implemented via the Fourier transform using the fast Fourier transform (FFT) [16]. A majority of the area-based methods have the limitation of registering only images with small misalignment, and therefore the images must initially be roughly aligned with each other. The correlation measures become unreliable when the images have multiple modalities and the gray-level characteristics vary (e.g., TM and synthetic aperture radar (SAR) data). In contrast, the feature-based methods are more robust and more suitable in these cases. There are two critical procedures generally involved in the feature-based techniques: feature extraction and feature correspondence.

The basic building block of a feature-based image registration scheme involves matching feature points extracted from a sensed image to their counterparts in a reference image. Features may be control points, corners, junctions or interest points. These features are also known as visually salient points. Feature matching overcomes the limitations of area-based signal correlation by matching only information-rich points.

1.3.1 Image Integration

This step deals with finding fcolor between two aerial images or between two different imaging systems. Various techniques have been developed for modifying the image grey levels in the vicinity of a boundary to obtain a smooth transition between images by removing these seams and creating a blended image. These mainly consist of choosing a frontier that induces a minimum of discontinuity [18]. A polynomial curve was proposed in [17]: in each line of the common area one point is retained, and the curve is defined by a minimum squared error procedure. These methods are appropriate if the common area is nearly identical (planar support). Heitz presented a simplification using a parametric plane function s = ax + by + c determined by a mean square error procedure. A more elaborate transformation must be applied when the common regions include large differences; Westerkamp et al. [19] described a polynomial function to assemble distorted microscopic images.
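As a concrete illustration of adjusting grey levels near the boundary, the sketch below performs a simple linear cross-fade (feathering) over an assumed vertical overlap band. This is a generic example in C++/OpenCV, not an implementation of the seam-selection or polynomial methods cited above, and the seam position and overlap width are illustrative parameters.

```cpp
// Generic sketch: linear feathering (cross-fade) across a vertical overlap band
// between two images assumed already warped into the same mosaic frame.
// Seam position and overlap width are illustrative, not derived from the data.
#include <opencv2/opencv.hpp>

cv::Mat featherBlend(const cv::Mat& left, const cv::Mat& right, int overlap)
{
    CV_Assert(left.size() == right.size() && left.type() == right.type());
    int seam = left.cols / 2;                 // assumed seam position
    int x0 = seam - overlap / 2, x1 = seam + overlap / 2;

    cv::Mat out = left.clone();
    // Right of the blend band, take the right image unchanged.
    right.colRange(x1, right.cols).copyTo(out.colRange(x1, out.cols));
    // Inside the band, the weight shifts linearly from the left to the right image.
    for (int x = x0; x < x1; ++x) {
        double a = double(x - x0) / overlap;  // 0 -> 1 across the band
        cv::Mat dstCol = out.col(x);          // header sharing data with 'out'
        cv::addWeighted(left.col(x), 1.0 - a, right.col(x), a, 0.0, dstCol);
    }
    return out;
}
```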

1.4 Image Mosaicing

An image mosaic is a synthetic composition generated from a sequence of images, and it can be obtained by understanding the geometric relationships between the images. The geometric relations are coordinate transformations that relate the different image coordinate systems. By applying the appropriate transformations via a warping operation and merging the overlapping regions of the warped images, it is possible to construct a single image, indistinguishable from a single large image of the same object, covering the entire visible area of the scene. This merged single image is called the mosaic. The basic scheme for mosaicing comprises two main steps, outlined below.

1. Image registration using geometric transformations derived from the image data and/or camera models.

2. Image integration or blending.

We have adopted a feature-based approach to solve the registration problem. Differently from conventional feature-based image registration algorithms, our approach is based on the work of Xiong et al. [1], which does not need image matching or correspondence. In [1] only Harris corners are considered as features. We compare different feature extraction algorithms, use those features for registration, and compare the results. We also extend the approach in [1] to find the translation and use the estimated parameters to generate a mosaic.

Chapter 2 Feature Extraction


A feature is the result of an interpretation of n pixels, usually within a compact support, in a window of p × p. An important step in almost all machine as well as biological vision systems is to process the input image(s) to extract features or primal sketches. In general, the feature detection process involves computing the response R of one or multiple detectors (filters/operators) to the input image(s), followed by the analysis of R to isolate points (or regions) that satisfy certain constraints. In fact, the best definition of a feature is the operator itself. Several kinds of features are used for matching. They may be divided into four groups as follows:

- Visual features (edges, textures, junctions and corners)
- Transform coefficient features (Fourier descriptors, Hadamard coefficients)
- Algebraic features (based on matrix decomposition of an image)
- Statistical features (moment invariants)

2.1 Corner Detection

Corners are defined as the junction points of two straight line edges. Most existing edge detectors perform poorly at corners because they assume an edge to be an entity with infinite extent, an assumption which is violated at corners. Since most of the gray-level based corner detectors are built on existing edge detectors, the performance of such corner detectors is not satisfactory. For example, the Canny edge detector [20] is found to be incapable of accurately locating edges near a corner due to the well-known rounding effect. The Harris corner detector [21] and the KLT feature detector [22] are the most widely used corner detectors, so we compare them for our application.

2.1.1 Harris Corner Detector

The Harris corner detector [21] computes a matrix which is related to the autocorrelation function of the image intensity. This matrix averages the first derivatives of the signal over a window:

\[
C = \exp\!\left(-\frac{x^2 + y^2}{2\sigma^2}\right) \otimes
\begin{bmatrix} I_x^2 & I_x I_y \\ I_x I_y & I_y^2 \end{bmatrix}
\tag{2.1}
\]

where Ix and Iy are the gradients (derivatives) in the x and y directions. The eigenvalues of this matrix are the principal curvatures of the autocorrelation function. If both curvatures are high, an interest point is present. The algorithm for Harris corner detection is as follows.

Algorithm for the Harris Corner Detector

1. Compute the matrix C for each pixel of the input image.

2. The standard Harris corner detection algorithm proposes two different criteria for corner point selection. The first is to compare the value of \(\det(C) - k\,\mathrm{trace}(C)^2\) with a threshold, and the second is to compare the value of \(R = \det(C)/\mathrm{trace}(C)\) with a threshold, where C is the covariance matrix of gradients computed above. We have used the second criterion while implementing the Harris algorithm because the first method depends heavily on the chosen value of the constant k.

Feature Reduction

While selecting corner points using the Harris algorithm we have applied a two-level corner strength comparison. Suppose λ1 and λ2 are the two eigenvalues of the covariance matrix C; the criterion for feature reduction is as follows. First we compare the value of \(\mathrm{norm} = \sqrt{\lambda_1^2 + \lambda_2^2}\) with a threshold, and if it is greater, the point is a first-level corner point. Then we divide the image into 25 × 25 grids, and in each grid we select at most one corner point, namely the one with the highest value of the norm of eigenvalues defined above. Fig. 2.1(a) shows the corners detected using the Harris corner detection algorithm [21] and Fig. 2.1(b) shows the corners after applying the feature reduction algorithm.

Figure 2.1: (a) Corners obtained using the Harris corner detector algorithm; (b) detected features after the feature reduction algorithm.
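A minimal sketch of this two-level Harris selection is given below in C++ with OpenCV; cv::cornerEigenValsAndVecs provides the per-pixel eigenvalues of the gradient covariance matrix of Eq. 2.1, the input is assumed to be a single-channel 8-bit image, and the grid size and norm threshold are illustrative values rather than those used in the project.

```cpp
// Hedged sketch: Harris-style corner detection followed by the grid-based
// feature reduction described above. Parameter values are illustrative.
#include <opencv2/opencv.hpp>
#include <vector>
#include <cmath>

std::vector<cv::Point> detectAndReduceCorners(const cv::Mat& gray,   // CV_8UC1
                                              int gridSize = 25,
                                              double normThresh = 1e5)
{
    // Per-pixel eigenvalues/eigenvectors of the gradient covariance matrix C.
    cv::Mat eig;  // CV_32FC(6): lambda1, lambda2, then the two eigenvectors
    cv::cornerEigenValsAndVecs(gray, eig, /*blockSize=*/3, /*ksize=*/3);

    std::vector<cv::Point> corners;
    for (int gy = 0; gy < gray.rows; gy += gridSize) {
        for (int gx = 0; gx < gray.cols; gx += gridSize) {
            double bestNorm = 0.0;
            cv::Point best(-1, -1);
            for (int y = gy; y < std::min(gy + gridSize, gray.rows); ++y)
                for (int x = gx; x < std::min(gx + gridSize, gray.cols); ++x) {
                    cv::Vec6f v = eig.at<cv::Vec6f>(y, x);
                    double norm = std::sqrt(double(v[0]) * v[0] +
                                            double(v[1]) * v[1]);   // sqrt(l1^2+l2^2)
                    if (norm > normThresh && norm > bestNorm) {      // first-level test
                        bestNorm = norm;
                        best = cv::Point(x, y);
                    }
                }
            if (best.x >= 0) corners.push_back(best);   // at most one corner per grid cell
        }
    }
    return corners;
}
```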

2.1.2 KLT Corner Detector

KLT features [22] are geometrically stable under different transformations. Hence features detected by KLT have a high repeatability factor and high information content. The detector is also based on the autocorrelation function of the image intensity.

KLT Corner Detector Algorithm

1. Compute the matrix C for each pixel of the input image, and let λ1 and λ2 denote its eigenvalues.

2. The KLT detector performs first-level corner detection based on the value of the smaller eigenvalue. It is computed in a window about the point under consideration and compared with a threshold; if it is greater than the threshold, the point is a first-level corner point. The array of all corner points is then sorted in decreasing order of the minimum eigenvalue of the window about each point.

3. Moving from top to bottom, we delete all the points that lie below the point under consideration in the array and satisfy the 8-neighborhood criterion.

Fig. 2.2(a) shows the corners found using the KLT algorithm and Fig. 2.2(b) shows the reduced KLT corners.


Figure 2.2: (a) Corners obtained using the KLT corner detector algorithm; (b) detected features after feature reduction.
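For comparison, the KLT-style detection can be sketched with OpenCV's goodFeaturesToTrack, which implements the minimum-eigenvalue (Shi-Tomasi) criterion together with a minimum-distance rule that plays a role similar to the 8-neighborhood suppression described above; the parameter values here are illustrative, not those used in the project.

```cpp
// Hedged sketch: KLT-style (minimum-eigenvalue) corner detection. The quality
// level and minimum distance stand in for the threshold and suppression steps
// of the algorithm above; their values are illustrative only.
#include <opencv2/opencv.hpp>
#include <vector>

std::vector<cv::Point2f> detectKltCorners(const cv::Mat& gray)   // CV_8UC1
{
    std::vector<cv::Point2f> corners;
    cv::goodFeaturesToTrack(gray, corners,
                            /*maxCorners=*/200,
                            /*qualityLevel=*/0.01,   // fraction of the best min-eigenvalue
                            /*minDistance=*/10.0,    // suppress clustered responses
                            cv::noArray(),
                            /*blockSize=*/3,
                            /*useHarrisDetector=*/false);
    return corners;
}
```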

2.2 Saliency Map and Salient Points based on Itti's Model

Visual attention is basically a biological mechanism used by primates to compensate for the inability of their brains to process the huge amount of visual information gathered by the two eyes. Early work on attention modeling was mostly inspired by the biological model of the brain. The Caltech hypothesis [23], elaborated by Itti and Koch [24], represents one of the first concrete descriptions of how the visual attention model works. According to the hypothesis, the elementary features are combined into a unique map of attention, the saliency map, which resides either in the LGN (lateral geniculate nucleus) or in V1 (the primary visual cortex). The Winner-Take-All (WTA) network [27], which is responsible for detecting the most salient scene location, is located around the thalamic reticular nucleus. Itti's model, one of the first and most popular computational models of saliency, is based on this hypothesis. It rests on four main principles: visual attention is based on multi-featured inputs; the saliency of a region is affected by the surrounding context; the saliency of locations is represented by a saliency map; and Winner-Take-All and inhibition of return are suitable mechanisms to allow attention shifts.


Figure 2.3: Schematic of Itti's model [23]

2.2.1 Feature Maps for Static Images

First, a number of features (1, ..., j, ..., n) are extracted from the scene by computing the so-called feature maps Fj. Such a map represents the image of the scene based on a well-defined feature, which leads to a multi-featured representation of the scene. In his implementation, Itti considered seven different features, computed from an RGB color image, which belong to three main cues, namely intensity, color, and orientation.

Intensity feature:

\[
F_1 = I = 0.3R + 0.59G + 0.11B
\tag{2.2}
\]

Two chromatic features are based on the two color opponency filters R+G− and B+Y−, where the yellow signal is defined as Y = (R + G)/2. Such chromatic opponency exists in the human visual cortex:

\[
F_2 = \frac{R - G}{I}
\tag{2.3}
\]

\[
F_3 = \frac{B - Y}{I}
\tag{2.4}
\]

The normalization of the features by I decouples hue from intensity. Four local orientation features F4...7 are computed according to the angles {0°, 45°, 90°, 135°}. Gabor filters, which represent a suitable mathematical model of the receptive field impulse response of orientation-selective neurons in the primary visual cortex [25], are used to compute the orientation features. In this implementation of the model it is possible to use an arbitrary number of orientations; however, it has been observed that using more than four orientations does not improve the performance of the model drastically.
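A minimal sketch of the intensity and color-opponency feature maps (Eqs. 2.2-2.4) in C++ with OpenCV follows; it assumes an 8-bit BGR input as returned by cv::imread, and the small constant used to avoid division by zero is an implementation convenience not specified in the text.

```cpp
// Hedged sketch: intensity and color-opponency feature maps (Eqs. 2.2-2.4).
// Input is assumed to be an 8-bit BGR image; the epsilon guard is an
// implementation convenience, not part of the model description.
#include <opencv2/opencv.hpp>
#include <vector>

void computeColorFeatureMaps(const cv::Mat& bgr,
                             cv::Mat& F1, cv::Mat& F2, cv::Mat& F3)
{
    cv::Mat img;
    bgr.convertTo(img, CV_32FC3, 1.0 / 255.0);
    std::vector<cv::Mat> ch(3);
    cv::split(img, ch);                       // ch[0]=B, ch[1]=G, ch[2]=R
    const cv::Mat &B = ch[0], &G = ch[1], &R = ch[2];

    F1 = 0.3f * R + 0.59f * G + 0.11f * B;    // intensity (Eq. 2.2)
    cv::Mat Y = 0.5f * (R + G);               // yellow channel, Y = (R+G)/2
    cv::Mat I = cv::max(F1, 1e-3f);           // avoid division by zero
    F2 = (R - G) / I;                         // red/green opponency (Eq. 2.3)
    F3 = (B - Y) / I;                         // blue/yellow opponency (Eq. 2.4)
}
```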

2.2.2 Center-Surround Receptive Field Profiles

In a second step, each feature map is transformed into its conspicuity map, which highlights the parts of the scene that strongly differ, according to a specific feature, from their surroundings. In biologically plausible models this is usually achieved by using a center-surround mechanism. Practically, this mechanism can be implemented with a difference-of-Gaussians filter (DoG), which can be applied to the feature maps to extract local activities for each feature type. A visual attention task has to detect conspicuous regions regardless of their size, so a multiscale conspicuity operator is required. Applying variable-size center-surround filters to fixed-size images has a high computational cost; the method used here is instead based on a multiresolution representation of the images. For a feature j, a Gaussian pyramid Ij is created by progressively low-pass filtering and sub-sampling the feature map Fj by a factor of 2, using a Gaussian filter G:

\[
I_j(0) = F_j
\tag{2.5}
\]

\[
I_j(i) = \downarrow\!\big(I_j(i-1) * G\big)
\tag{2.6}
\]

where (*) refers to the spatial convolution operator and ↓ refers to the downsampling operation. Center-surround is then implemented as the difference between fine (c for center) and coarse (s for surround) scales. Indeed, for a feature j (1...j...n), a set of intermediate multiscale conspicuity maps Mj,k (1...k...K) are computed according to the equation below, giving rise to (n × K) maps for the n considered features:

\[
M_{j,k} = \big|\, I_j(c_k) \ominus I_j(s_k) \,\big|
\tag{2.7}
\]

where ⊖ is a cross-scale difference operator that first interpolates the coarser scale to the finer one and then carries out a point-by-point subtraction. The absolute value of the difference between the center and the surround allows the simultaneous computation of both sensitivities: dark center on bright surround and bright center on dark surround (red/green and green/red, or blue/yellow and yellow/blue, for color).
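The center-surround computation of Eqs. 2.5-2.7 can be sketched with a Gaussian pyramid built by cv::pyrDown and a cross-scale difference implemented by upsampling the surround level and taking an absolute difference; the particular (center, surround) level pairs below are illustrative choices, not the ones prescribed by the model.

```cpp
// Hedged sketch: multiscale center-surround maps (Eqs. 2.5-2.7) for one
// feature map. cv::pyrDown performs the Gaussian smoothing plus factor-2
// subsampling of Eq. 2.6; cv::resize interpolates the coarse (surround)
// level back to the fine (center) level before the absolute difference.
#include <opencv2/opencv.hpp>
#include <vector>

std::vector<cv::Mat> centerSurroundMaps(const cv::Mat& feature)
{
    std::vector<cv::Mat> pyr{feature.clone()};
    for (int i = 1; i <= 6; ++i) {                      // I_j(i) from I_j(i-1)
        cv::Mat next;
        cv::pyrDown(pyr.back(), next);
        pyr.push_back(next);
    }

    std::vector<cv::Mat> maps;
    const int pairs[][2] = {{2, 5}, {2, 6}, {3, 6}};    // illustrative (c, s) levels
    for (const auto& p : pairs) {
        cv::Mat surroundUp, diff;
        cv::resize(pyr[p[1]], surroundUp, pyr[p[0]].size(), 0, 0, cv::INTER_LINEAR);
        cv::absdiff(pyr[p[0]], surroundUp, diff);       // M_{j,k} = |I_j(c) (-) I_j(s)|
        maps.push_back(diff);
    }
    return maps;
}
```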

2.2.3 Saliency Map

The purpose of the saliency map is to represent the conspicuity, or saliency, at every location in the visual field by a scalar quantity, and to guide the selection of attended locations based on the spatial distribution of saliency. At each spatial location, all the feature maps consequently need to be combined into a unique scalar measure of salience. In this implementation, all the feature maps are normalized to the same dynamic range (e.g., 0 to 255) and then summed into the saliency map. This normalization operation is denoted N(.).
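A minimal sketch of this combination step follows; here N(.) is taken as plain min-max range normalization, and the additional peak-promotion weighting of Itti's full normalization scheme is omitted for brevity.

```cpp
// Hedged sketch: combine conspicuity maps into a single saliency map by
// normalizing each map to a common dynamic range and summing.
#include <opencv2/opencv.hpp>
#include <vector>

cv::Mat buildSaliencyMap(const std::vector<cv::Mat>& maps)
{
    CV_Assert(!maps.empty());
    cv::Mat saliency = cv::Mat::zeros(maps[0].size(), CV_32F);
    for (const cv::Mat& m : maps) {
        cv::Mat resized, norm;
        cv::resize(m, resized, saliency.size());         // bring maps to one size
        resized.convertTo(norm, CV_32F);
        cv::normalize(norm, norm, 0, 255, cv::NORM_MINMAX); // N(.): common range
        saliency += norm;                                  // sum into saliency map
    }
    return saliency;
}
```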

2.2.4 Selection of the Point of Attention

Once the saliency map has been computed, Winner-Take-All (WTA) and inhibition of return are suitable mechanisms to imitate eye movements and the focus of attention [27]. The WTA selects the point with maximum salience at each iteration. The movement of the attention point is achieved by inhibiting the saliency of the object currently being attended [26] [27]: at each iteration the saliency of the attended object is decayed, so that eventually objects not being attended to increase in relative saliency and take over the focus of attention. Another approach is to divide the saliency map into a sufficient number of grid cells and take, in each cell, the local maxima of the saliency map intensity above a certain threshold value. Fig. 2.4 shows the saliency map generated by combining all the feature maps of Itti's model for finding visually salient points; the salient locations clearly have larger intensity in the image. Fig. 2.5(a) and Fig. 2.5(b) show the visually salient points detected on a pair of images.
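A minimal sketch of WTA selection with inhibition of return follows: the most salient location is picked, a disc around it is suppressed, and the process repeats; the inhibition radius is an illustrative parameter.

```cpp
// Hedged sketch: winner-take-all selection with inhibition of return.
// Repeatedly pick the most salient location, then suppress a disc around it
// so that attention shifts to the next most salient region.
#include <opencv2/opencv.hpp>
#include <vector>

std::vector<cv::Point> selectSalientPoints(const cv::Mat& saliency,  // CV_32FC1
                                           int numPoints,
                                           int inhibitRadius = 20)
{
    cv::Mat work = saliency.clone();
    std::vector<cv::Point> points;
    for (int i = 0; i < numPoints; ++i) {
        double maxVal; cv::Point maxLoc;
        cv::minMaxLoc(work, nullptr, &maxVal, nullptr, &maxLoc);   // winner-take-all
        if (maxVal <= 0) break;
        points.push_back(maxLoc);
        // Inhibition of return: suppress the neighborhood of the attended point.
        cv::circle(work, maxLoc, inhibitRadius, cv::Scalar(0), cv::FILLED);
    }
    return points;
}
```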


Figure 2.4: Saliency map obtained from the normalized summation of all feature maps.

Figure 2.5: (a) Visually salient points obtained in the first image; (b) salient points obtained in the second image.


Chapter 3 Registration Technique


3.1 Registration without using Correspondence and Mosaicing
This approach is based on Xiong et al. [1], who study the problem of aerial image registration without any correspondence using a novel algorithm. Features are detected on the pair of images (observed and reference) using either a corner detector or the saliency map approach based on Itti's model. Image patches are created using these features as positions. A circle is used as the shape of the image patches to deal with rotation. By changing the size of the image patches, we can handle scaling. Orientations of the image patches are computed with an eigenvector approach. From the orientation differences of patches between the reference and observed images, an angle histogram is created by a voting procedure. The orientation difference corresponding to the maximum peak of the histogram is the rotation angle between the reference and observed images. Different sizes of image patches are used to create different angle histograms. The scaling value between the two images can be determined by the angle histogram which has the highest maximum peak. The approach is described in the following subsections.

3.1.1 The Orientation of an Image Patch

For a given patch p(i, j) (i = 1, 2, ..., m), the covariance matrix is defined as

\[
COV_p = E\big[(X - m_x)(X - m_x)^T\big]
\tag{3.1}
\]

where X = [i, j]^T is the position of a pixel and m_x = [m_{x_i}, m_{x_j}]^T is the centroid of the image patch p(i, j), i.e., its first-order moment. The eigenvalues can be found by solving

\[
\big| COV_p - \lambda I \big| = 0
\tag{3.2}
\]

Equation 3.2 gives two eigenvalues. Suppose λ1 is the larger eigenvalue and λ2 is the smaller one. The normalized eigenvectors V1 and V2 that correspond to the eigenvalues λ1 and λ2 are of course orthogonal. The direction of the eigenvector V1 is defined as the orientation of the image patch p(i, j). By applying this approach, we can compute orientations for all image patches on the reference and observed images.
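A minimal sketch of this orientation computation is given below. The text leaves the expectation in Eq. 3.1 implicit; here pixel positions inside the circular patch are weighted by intensity, which is one natural reading (an unweighted circular patch would give an isotropic covariance and no defined orientation), and the orientation is obtained from the closed form of the dominant eigenvector of the 2 × 2 covariance matrix.

```cpp
// Hedged sketch: orientation of a circular patch (Eqs. 3.1-3.2) as the direction
// of the dominant eigenvector of the position covariance matrix. Pixel positions
// are intensity-weighted (an assumption); input is assumed to be CV_8UC1.
#include <opencv2/opencv.hpp>
#include <cmath>

double patchOrientation(const cv::Mat& gray, cv::Point center, int radius)
{
    double w = 0, mi = 0, mj = 0, cii = 0, cjj = 0, cij = 0;
    // First pass: weighted centroid m_x.
    for (int i = -radius; i <= radius; ++i)
        for (int j = -radius; j <= radius; ++j) {
            if (i * i + j * j > radius * radius) continue;     // circular patch
            int y = center.y + i, x = center.x + j;
            if (y < 0 || y >= gray.rows || x < 0 || x >= gray.cols) continue;
            double v = gray.at<uchar>(y, x);
            w += v; mi += v * i; mj += v * j;
        }
    if (w <= 0) return 0.0;
    mi /= w; mj /= w;
    // Second pass: covariance of positions (i, j) about the centroid.
    for (int i = -radius; i <= radius; ++i)
        for (int j = -radius; j <= radius; ++j) {
            if (i * i + j * j > radius * radius) continue;
            int y = center.y + i, x = center.x + j;
            if (y < 0 || y >= gray.rows || x < 0 || x >= gray.cols) continue;
            double v = gray.at<uchar>(y, x);
            cii += v * (i - mi) * (i - mi);
            cjj += v * (j - mj) * (j - mj);
            cij += v * (i - mi) * (j - mj);
        }
    // Principal-axis angle of the 2x2 symmetric covariance matrix: the direction
    // of the eigenvector belonging to the larger eigenvalue (radians).
    return 0.5 * std::atan2(2.0 * cij, cii - cjj);
}
```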

3.1.2 Angle Histogram for Image Rotation and Scaling

For the observed image, we can create a patch set Pt = {p_t^j, j = 1, 2, ..., nt} and obtain an orientation set Θt = {θ_t^j, j = 1, 2, ..., nt}. Similarly, for the reference image, we can create a patch set Pf = {p_f^i, i = 1, 2, ..., nf} and an orientation set Θf = {θ_f^i, i = 1, 2, ..., nf}. Suppose that the rotation angle between the observed and reference images is θ and that both images cover the same scene. For an image patch p_t^j on the observed image, we compute the orientation differences with all patches p_f^i, i = 1, 2, ..., nf on the reference image:

\[
\Delta\theta_l = \big|\theta_t^{\,j} - \theta_f^{\,i}\big|, \quad i = 1, 2, \ldots, n_f
\tag{3.3}
\]

If we do the same computation for all patches p_t^j, j = 1, 2, ..., nt on the observed image, we obtain a set of orientation differences ΔΘ = {Δθ_l, l = 1, 2, ..., nt·nf}, among which there are nt pairs of correspondence patches on the reference image. For these nt pairs of correspondence patches, the value of the orientation difference is the rotation angle:

\[
\Delta\theta_l = \big|\theta_t^{\,j} - \theta_f^{\,i}\big| = \theta
\tag{3.4}
\]

If we create a histogram of the orientation differences, the count of the bin corresponding to the orientation difference between correspondence patches will be the highest. For finding the scale, we can obtain the value of the scaling through a series of voting processes. By changing the size of the image patches and computing the angle histograms, we obtain a series of angle histograms and choose the one which has the highest maximum peak. Let At and Af denote the patch sizes on the observed and reference images corresponding to the histogram Hh which has the highest maximum peak. The value of the scaling between the observed and reference images can then be computed by

\[
s = \frac{A_t}{A_f}
\tag{3.5}
\]

Figure 3.1: Typical angle histogram. Each column along the X-axis represents a bin of angle difference, and the Y-axis shows the number of occurrences of that bin. The peak for the rotation angle is very dominant, assuming the same scale.

At the same time, the orientation difference corresponding to the histogram Hh is the rotation angle between the observed and reference images.
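A minimal sketch of the rotation vote is given below: every observed/reference orientation pair casts a vote into an angle-difference histogram and the peak bin gives the rotation; the bin width (1 degree) and the use of degrees are illustrative choices. To estimate the scale, the same voting would be repeated over several patch sizes and the histogram with the strongest peak retained, as described above.

```cpp
// Hedged sketch: angle-histogram voting for the rotation (Eqs. 3.3-3.4).
// Orientation differences of all observed/reference patch pairs vote into
// 1-degree bins; the peak bin gives the rotation angle. Bin width is illustrative.
#include <vector>
#include <cmath>
#include <algorithm>

double estimateRotationDeg(const std::vector<double>& thetaObserved,   // radians
                           const std::vector<double>& thetaReference)  // radians
{
    const double PI = 3.14159265358979323846;
    const int bins = 180;                                // orientations are mod 180 deg
    std::vector<int> hist(bins, 0);
    for (double tObs : thetaObserved)
        for (double tRef : thetaReference) {
            double diffDeg = std::fabs(tObs - tRef) * 180.0 / PI;
            diffDeg = std::fmod(diffDeg, 180.0);
            hist[static_cast<int>(diffDeg) % bins] += 1; // cast a vote
        }
    int peak = static_cast<int>(
        std::max_element(hist.begin(), hist.end()) - hist.begin());
    return peak + 0.5;                                   // center of winning bin, degrees
}
```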

3.1.3 Finding Translation

Algorithm:

1. Select one feature point in the reference image (currently done manually).

2. Find similar points in the target image according to the similarity measure defined in [28], with a threshold of 0.85.

3. Take a window about the interest point in the reference image and calculate the normalized cross correlation with a window about each similar point; the similar point in the target image with the best correlation is the corresponding point/patch.

Once the corresponding point is known, finding the translation is trivial using the simple transformation.
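A minimal sketch of step 3 is given below, using cv::matchTemplate with the normalized correlation mode to score each candidate window against the reference window; the window size is illustrative, and the similarity measure of [28] used in step 2 is assumed to have already produced the candidate list.

```cpp
// Hedged sketch: step 3 of the translation algorithm - score each candidate
// point with normalized cross correlation against the reference window and
// derive the translation from the best match. Window size is illustrative.
#include <opencv2/opencv.hpp>
#include <vector>

cv::Point2f estimateTranslation(const cv::Mat& refGray, const cv::Mat& tgtGray,
                                cv::Point refFeature,
                                const std::vector<cv::Point>& candidates,
                                int half = 16)
{
    cv::Rect refWin(refFeature.x - half, refFeature.y - half, 2 * half + 1, 2 * half + 1);
    cv::Mat templ = refGray(refWin & cv::Rect(0, 0, refGray.cols, refGray.rows));

    double bestScore = -1.0;
    cv::Point bestPt = refFeature;
    for (const cv::Point& c : candidates) {
        cv::Rect win(c.x - half, c.y - half, 2 * half + 1, 2 * half + 1);
        win &= cv::Rect(0, 0, tgtGray.cols, tgtGray.rows);
        if (win.size() != templ.size()) continue;        // skip truncated border windows
        cv::Mat score;
        cv::matchTemplate(tgtGray(win), templ, score, cv::TM_CCORR_NORMED);
        double ncc = score.at<float>(0, 0);              // equal-sized windows -> 1x1 result
        if (ncc > bestScore) { bestScore = ncc; bestPt = c; }
    }
    // Translation of the best-matching target point relative to the reference feature.
    return cv::Point2f(float(bestPt.x - refFeature.x), float(bestPt.y - refFeature.y));
}
```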


Figure 3.2: (a) A test image; (b) the test image rotated by 11°. The algorithm reports an 11.46° rotation.

3.1.4 Results of Registration

After extracting the features we use them to find the rotation, scale and translation parameters and register the images according to the algorithm described above. Table 3.1 shows the results obtained with Harris features, Table 3.2 with KLT features, and Table 3.3 with salient features using Itti's model. To validate the algorithm we applied it to a pair of test images: Fig. 3.2(a) is the image without any rotation and Fig. 3.2(b) is the same image rotated by 11° anticlockwise. The result obtained shows an 11.46° rotation and a scale of 1.0, which is very good. We performed similar experiments with other images as well. To validate the final correspondence after finding the translation, we highlighted the estimated corresponding feature in the destination image with a black blob, as shown in Fig. 3.3(a) and Fig. 3.3(b). Fig. 3.4(a) and Fig. 3.4(b) show the pair of images (source and destination images for registration) for which the above registration parameters were found. Fig. 3.5 shows the mosaic obtained by combining the source and destination images using the registration parameters.


Figure 3.3: (a) Source image with a feature selected manually; (b) destination image with the estimated feature corresponding to the source image feature.

Table 3.1: Results using Harris corner features
  Angle of rotation: 3.6492°
  Scale: 0.973329
  Translation Tx: 57.547512 pixels
  Translation Ty: 1.562878 pixels

Table 3.2: Results using KLT corner features
  Angle of rotation: 5.1567°
  Scale: 1.000000
  Translation Tx: 48.427681 pixels
  Translation Ty: 8.734428 pixels

Table 3.3: Results using visually salient points
  Angle of rotation: 1.43312°
  Scale: 1.000000
  Translation Tx: 50.427681 pixels
  Translation Ty: 1.734428 pixels


Figure 3.4: (a) First image; (b) second image.

Figure 3.5: Merged image using corner features.


Chapter 4 Discussion
KLT features are more stable than Harris features. Salient points are also found to be stable. A degree of blurring in the images affects the detected features because the presence of noise affects the texture (orientation). The blurring is clearly due to shaking of the camera mount, which should be minimized. The registration parameters are comparable using any of the features. The rotation is very small, as expected, because the motion at the time of data collection was mainly translational. The mosaic generated by combining the pair of images does not have a perfect overlap of the inlier regions and is not robust to noise in the registration parameters.


Chapter 5 Conclusion and Further Study


Features are extracted successfully. As of now, the algorithm for finding the translation after finding the rotation and scale is manual; it needs to become statistical and automatic, without any human intervention. Key-frames need to be detected from the mission video data, which can be done only when sufficient mission data is available; in the future, more flight data has to be collected. The algorithm for mosaicing is not robust to noise in the registration parameters, and this artifact has to be taken care of. Some preprocessing is also needed: the collected aerial data generally has noise due to weather conditions, motion compensation is required, and camera parameters have to be estimated for pre-warping the images; these factors affect the registration procedure and its accuracy. In the future the mosaic needs to be geo-registered with ortho-rectified satellite image data (the reference image data). We are planning to buy satellite image data of the Kanpur area for this purpose from NRSA (National Remote Sensing Agency). For funding, we have submitted a detailed research proposal to ARDB (Aeronautics Research & Development Board).


Bibliography
[1] Xiong, Y., and Quek, F., "Automatic Aerial Image Registration Without Correspondence," The 4th IEEE International Conference on Computer Vision Systems (ICVS 2006), January 5-7, 2006, St. John's University, Manhattan, New York City, New York, USA.

[2] Sheikh, Y., Khan, S., Shah, M., and Cannata, R. W., "Geodetic Alignment of Aerial Video Frames," VideoRegister03, Chapter 7, 2003.

[3] Golden, J. P., "Terrain Contour Matching (TERCOM): A cruise missile guidance aid," Proc. Image Processing for Missile Guidance, vol. 238, pp. 10-18, 1980.

[4] Rodriquez, J., and Aggarwal, J., "Matching Aerial Images to 3D terrain maps," IEEE PAMI, 12(12), pp. 1138-1149, 1990.

[5] Sim, D., and Park, R., "Localization based on the gradient information for DEM matching," IEEE Transactions on Image Processing, 11(1), pp. 52-55, 2002.

[6] Zheng, Q., and Chellappa, R., "A computational vision approach to image registration," IEEE Transactions on Image Processing, 2(3), pp. 311-326, 1993.

[7] Bergen, J., Anandan, P., Hanna, K., and Hingorani, R., "Hierarchical model-based motion estimation," Proc. European Conference on Computer Vision, pp. 237-252, 1992.

[8] Szeliski, R., "Image mosaicing for tele-reality applications," IEEE Workshop on Applications of Computer Vision, pp. 44-53, 1994.

[9] Cannata, R., Shah, M., Blask, S., and Van Workum, J., "Autonomous Video Registration Using Sensor Model Parameter Adjustments," Applied Imagery Pattern Recognition Workshop, 2000.

[10] Kumar, R., Sawhney, H., Asmuth, J., Pope, A., and Hsu, S., "Registration of video to georeferenced imagery," Fourteenth International Conference on Pattern Recognition, vol. 2, pp. 1393-1400, 1998.

[11] Wildes, R., Hirvonen, D., Hsu, S., Kumar, R., Lehman, W., Matei, B., and Zhao, W., "Video Registration: Algorithm and quantitative evaluation," Proc. International Conference on Computer Vision, vol. 2, pp. 343-350, 2001.

[12] Horn, B., and Schunck, B., "Determining Optical Flow," Artificial Intelligence, vol. 17, pp. 185-203, 1981.

[13] Lucas, B., and Kanade, T., "An iterative image registration technique with an application to stereo vision," Proceedings of the 7th International Joint Conference on Artificial Intelligence, pp. 674-679, 1981.

[14] Brown, L. G., "A Survey of Image Registration Techniques," ACM Computing Surveys, vol. 24, no. 4, pp. 325-376, December 1992.

[15] Li, H., Manjunath, B. S., and Mitra, S. K., "A contour-based approach to multisensor image registration," IEEE Transactions on Image Processing, pp. 320-334, March 1995.

[16] Cideciyan, A. V., "Registration of high resolution images of the retina," Proc. SPIE, Medical Imaging VI: Image Processing, vol. 1652, pp. 310-322, February 1992.

[17] Tainxi, W., "A New Mosaicing Method for Landsat Remote Sensing Images," Kexue Tongbao (Science Bulletin), 32(12), pp. 854-859, 1987.

[18] Herbert, P., and Rouge, B., Digital Image Mosaics, Prentice Hall, Englewood Cliffs, New Jersey, 1979.

[19] Westerkamp, D., and Gahm, T., "Non-Distorted Assemblage of the Digital Images of Adjacent Fields in Histological Sections," Universitat Hannover, Appelstrasse 9A, 3000 Hannover 1, Germany, March 1992.

[20] Canny, J., "A computational approach to edge detection," IEEE PAMI, pp. 679-698, 1986.

[21] Harris, C., and Stephens, M., "A combined corner and edge detector," Proc. 4th Alvey Vision Conference, pp. 189-192, 1988.

[22] Tomasi, C., and Kanade, T., "Detection and Tracking of Point Features," CMU Technical Report CMU-CS-91-132, April 1991.

[23] Itti, L., Models of Bottom-Up and Top-Down Visual Attention, PhD thesis, Pasadena, California, 2000.

[24] Itti, L., and Koch, C., "Computational modelling of visual attention," Nature Reviews Neuroscience, 2(3), pp. 194-203, 2001.

[25] Ouerhani, N., Visual Attention: From Bio-Inspired Modelling to Real-Time Implementation, PhD thesis, 2003.

[26] Backer, G., and Mertsching, B., "Two selection stages provide efficient object-based attentional control for dynamic vision," International Workshop on Attention and Performance in Computer Vision, 2004.

[27] Maji, S., and Mukerjee, A., "Motion Conspicuity Detection: A Visual Attention Model for Dynamic Scenes," CS497 course report, IIT Kanpur, available at www.cse.iitk.ac.in/report-repository/2005/Y2383 497-report.pdf

[28] Kyung, and Lacroix, S., "A Robust Interest Point Matching Algorithm," IEEE, 2001.

[29] Intel Open Source Computer Vision Library. http://www.intel.com/technology/computing/opencv/index.htm

