
remote sensing

Article
Remote Sensing Image Scene Classification Using
Multi-Scale Completed Local Binary Patterns and
Fisher Vectors
Longhui Huang 1 , Chen Chen 2 , Wei Li 1, * and Qian Du 3
1 College of Information Science and Technology, Beijing University of Chemical Technology, 100029 Beijing,
China; 15117950611@163.com
2 Department of Electrical Engineering, University of Texas at Dallas, Dallas, TX 75080, USA;
chenchen870713@gmail.com
3 Department of Electrical and Computer Engineering, Mississippi State University, Starkville, MS 39762,
USA; du@ece.msstate.edu
* Correspondence: liwei089@ieee.org; Tel.: +86-1814-6529-853

Academic Editors: Gonzalo Pajares Martinsanz, Xiaofeng Li and Prasad S. Thenkabail


Received: 18 February 2016; Accepted: 30 May 2016; Published: 8 June 2016

Abstract: An effective remote sensing image scene classification approach using patch-based
multi-scale completed local binary pattern (MS-CLBP) features and a Fisher vector (FV) is proposed.
The approach extracts a set of local patch descriptors by partitioning an image and its multi-scale
versions into dense patches and using the CLBP descriptor to characterize local rotation invariant
texture information. Then, Fisher vector encoding is used to encode the local patch descriptors (i.e.,
patch-based CLBP features) into a discriminative representation. To improve the discriminative
power of feature representation, multiple sets of parameters are used for CLBP to generate multiple
FVs that are concatenated as the final representation for an image. A kernel-based extreme learning
machine (KELM) is then employed for classification. The proposed method is extensively evaluated
on two public benchmark remote sensing image datasets (i.e., the 21-class land-use dataset and
the 19-class satellite scene dataset) and leads to superior classification performance (93.00% for the
21-class dataset with an improvement of approximately 3% when compared with the state-of-the-art
MS-CLBP and 94.32% for the 19-class dataset with an improvement of approximately 1%).

Keywords: remote sensing image scene classification; completed local binary patterns; multi-scale
analysis; Fisher vector; extreme learning machine

1. Introduction
Remote sensing is an effective tool for Earth observation, which has been widely applied in
surveying land-use and land-cover classifications and monitoring their dynamic changes. With the
improvement of spatial resolution, remote-sensing images present more detailed information such
as spatial arrangement information and textural structures, which are of great help in recognizing
different land-use and land-cover scene categories. The goal of image scene classification is to recognize
the semantic categories of a given image based on a priori knowledge. Due to intra-class variations
and a wide range of illumination and scale changes, scene classification of high-resolution remote
sensing images remains a challenging problem.
The last decade has seen considerable effort to employ computer vision techniques to classify aerial
or satellite image scenes. The bag-of-visual-words (BOVW) model [1], which is one of the most popular
approaches in image analysis and classification applications, provides an efficient approach to solve
the problem of scene classification. The BOVW model, derived from document classification in text
analysis, represents an image as a histogram of frequencies of a set of visual words by mapping the
local features to a visual vocabulary. The vocabulary is pre-established by clustering the local features
extracted from a collection of images. The traditional BOVW model ignores spatial and structural
information, which severely limits its descriptive ability. To overcome this issue, a spatial pyramid
matching (SPM) framework was proposed in [2]. This approach partitions an image into sub-regions,
computes a BOVW histogram for each sub-region, and then concatenates the histograms from all
sub-regions to form the SPM representation of an image. However, SPM only considers the absolute
spatial arrangement, and the resulting features are sensitive to rotation variations. Thus, a spatial
co-occurrence kernel, which is general enough to characterize a variety of spatial arrangements,
was proposed in [3] to capture both the absolute and relative spatial layout of an image. In [4],
a multi-resolution representation was incorporated into the bag-of-features model and two modalities
of horizontal and vertical partitions were adopted to partition all resolution images into sub-regions
to improve the SPM framework. In [5], a concentric circle-structured multi-scale BOVW model was
presented to incorporate rotation-invariant spatial layout information into the original BOVW model.
The aforementioned BOVW variants focus on capturing the spatial layout information of scene
images. However, the rich texture and structure information in high-resolution remote sensing images
has not been fully exploited since they merely use the scale-invariant feature transform (SIFT) [6]
descriptors to capture local features. There has also been a great effort to evaluate various features and
combinations of features for scene classification. In [7], a local structural texture similarity descriptor
was applied to image blocks to represent structural texture for aerial image classification. In [8],
semantic classifications of aerial images based on Gabor and Gist descriptors [9] were evaluated
individually. In [10], four types of features consisting of raw pixel intensity values, oriented filter
responses, SIFT-based feature descriptors, and self-similarity were used within the framework of
unsupervised feature learning. In [11], global features extracted using the enhanced Gabor texture
descriptor (EGTD) and local features extracted using the SIFT descriptor were fused in a hierarchical
approach to improve the performance of remote sensing image scene classification.
Recently, deep learning has received great attention. Different from the afore-mentioned BOVW
and its variants that are considered mid-level representations, deep learning is an end-to-end
feature learning method (e.g., from an image to semantic label). Among deep learning-based
networks, convolutional neural networks (CNNs) [12,13] may be the most popular for learning visual
features in computer vision applications, such as remote sensing and large-scale visual recognition.
K. Nogueira et al. [14] presented the PatreoNet, which has the capability to learn specific spatial
features from remote sensing images without any pre-processing step or descriptor evaluation.
AlexNet, proposed by Krizhevsky et al. [15], was the first to employ non-saturating neurons, GPU
implementation of the convolution operation and dropout to prevent overfitting. GoogLeNet [16]
deployed the CNN architecture and utilized filters of different sizes at the same layer to reduce
the number of parameters of the network. However, CNNs have an intrinsic limitation, i.e., the
complicated pre-training process to adjust parameters.
In [17], multi-scale completed local binary patterns (MS-CLBP) features were utilized for remote
sensing image classification. The extracted features can be considered global features in an image.
However, the global feature representation may not be able to characterize detailed structures and
distinct objects. For example, some land-use and land-cover classes are defined mainly by individual
objects, e.g., baseball fields and storage tanks. In this paper, we propose a local feature representation
method based on patch-based MS-CLBP, which can be used to extract complementary features to
global features. Specifically, the CLBP descriptor is applied to densely partitioned image patches to
extract a set of local patch descriptors, which characterize the detailed local structure and texture
information in high-resolution remote sensing images. Since the CLBP [18] operator is a gray-scale
and rotation-invariant texture operator, the extracted local descriptors are robust to rotations of
the image. Then, the Fisher kernel representation [19] is employed to encode the local
descriptors into a discriminative representation (i.e., Fisher vector (FV)). FV describes patch descriptors
by their deviation from a universal generative Gaussian mixture model (GMM). To improve the
discriminative power of the feature representation, multiple sets of parameters for the CLBP operator
(i.e., MS-CLBP) are utilized to generate multiple FVs. The final representation for an image is
achieved by concatenating all the FVs. For classification, the kernel-based extreme learning machine
(KELM) [20] is adopted for its efficient computation and good classification performance.
There are two main contributions from this work. First, a local feature representation method
using patch-based MS-CLBP features and FV is proposed. The MS-CLBP operator is applied to
the partitioned dense regions to extract a set of local patch descriptors, and then the Fisher kernel
representation is used to encode the local descriptors into a discriminative representation of remote
sensing images. Second, the two implementations of MS-CLBP are combined into a unified framework
to build a more powerful feature representation. The proposed local feature representation method is
evaluated using two public benchmark remote sensing image datasets. The experimental results verify
the effectiveness of our proposed method as compared to state-of-the-art algorithms.
The remainder of the paper is organized as follows. Section 2 presents the related works including
CLBP and the Fisher vector. Section 3 describes two implementations of MS-CLBP, patch-based
MS-CLBP feature extraction, and the details of the proposed feature representation method. Section 4
provides the experimental results. Finally, Section 5 concludes the paper.

2. Related Works

2.1. Completed Local Binary Patterns


Local binary patterns (LBP) [21,22] are an effective measure of the spatial structure information of
local image texture. Consider a center pixel with gray value $t_c$. Its neighboring pixels are equally
spaced on a circle of radius $r$ centered at $t_c$. If the coordinates of $t_c$ are $(0, 0)$ and $m$
neighbors $\{t_i\}_{i=0}^{m-1}$ are considered, the coordinates of $t_i$ are
$(-r\sin(2\pi i/m), r\cos(2\pi i/m))$. The LBP is calculated by thresholding the neighbors
$\{t_i\}_{i=0}^{m-1}$ with the center pixel $t_c$ to generate an $m$-bit binary number. The resulting
LBP of $t_c$ in decimal form can be expressed as follows:

$$ LBP_{m,r}(t_c) = \sum_{i=0}^{m-1} s(t_i - t_c)\, 2^i = \sum_{i=0}^{m-1} s(d_i)\, 2^i, \qquad s(x) = \begin{cases} 1, & x \ge 0 \\ 0, & x < 0 \end{cases} \qquad (1) $$

where $d_i = (t_i - t_c)$ represents the difference between the center pixel and each neighbor, which
characterizes the spatial local structure at the center location. The differences $d_i$ are robust to
illumination changes, and they are more efficient than the original image in pattern classification
because the central gray level $t_c$ is removed. The difference vector can be further decomposed
into two components: the signs and the magnitudes (absolute values of $d_i$, i.e., $|d_i|$). However,
the original LBP only uses the sign information of $d_i$ while ignoring the magnitude information.
In the improved CLBP [18], the signs and magnitudes are complementary, and the difference vector can
be perfectly reconstructed from them. Figure 1 illustrates an example of the sign and magnitude
components of the CLBP extracted from a sample block, where Figure 1a–d denote the original
$3 \times 3$ local structure, the difference vector, the sign vector and the magnitude vector,
respectively. Note that "0" is coded as "−1" in CLBP (as seen in Figure 1c). Two operators,
CLBP-Sign (CLBP_S) and CLBP-Magnitude (CLBP_M), are used to encode these two components. CLBP_S is
equivalent to the traditional LBP operator, while the CLBP_M operator can be expressed as,
$$ CLBP\_M_{m,r} = \sum_{i=0}^{m-1} f(|d_i|, c)\, 2^i, \qquad f(x, y) = \begin{cases} 1, & x \ge y \\ 0, & x < y \end{cases} \qquad (2) $$

where $c$ is a threshold that is set to the mean value of $|d_i|$. Using Equations (1) and (2), two
binary strings can be produced and denoted as the CLBP_S and CLBP_M codes, respectively. Two ways to
combine the CLBP_S and CLBP_M codes are presented in [18]. Here, the first way (concatenation) is
used, in which the histograms of the CLBP_S and CLBP_M codes are calculated separately and then
concatenated. Note that there is also a CLBP-Center part, which codes the values of the center pixels
in the original CLBP. Here, only the CLBP_S and CLBP_M operators are considered for computational
efficiency.

Figure 1. (a) 3 × 3 sample block; (b) the local differences; (c) the sign component of CLBP; (d) the
absolute value of local differences; (e) the magnitude component of CLBP.

Figure 2 presents an example of the CLBP_S and CLBP_M coded images corresponding to an input aerial
scene (viaduct scene). It can be observed that the CLBP_S and CLBP_M operators can both capture the
spatial pattern and the contrast of local image texture, such as edges and corners.

Figure 2. (a) Input image; (b) CLBP_S coded image; (c) CLBP_M coded image.
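
As a concrete illustration of Equations (1) and (2), the following is a minimal sketch of the CLBP_S
and CLBP_M coding. It is not the authors' implementation: it assumes a grayscale floating-point image,
uses raw code values rather than a rotation-invariant mapping, and crops the image border for simplicity.

```python
import numpy as np

def clbp_s_m(image, m=8, r=1):
    """Minimal sketch of CLBP_S / CLBP_M coding (Equations (1) and (2)); illustrative only."""
    h, w = image.shape
    pad = int(np.ceil(r)) + 1          # +1 keeps bilinear interpolation inside the image
    angles = 2 * np.pi * np.arange(m) / m
    # neighbor offsets (-r*sin(2*pi*i/m), r*cos(2*pi*i/m)) relative to the center pixel
    offsets = np.stack([-r * np.sin(angles), r * np.cos(angles)], axis=1)

    diffs = np.zeros((h, w, m))
    for i, (dy, dx) in enumerate(offsets):
        for y in range(pad, h - pad):
            for x in range(pad, w - pad):
                # bilinear interpolation of the neighbor gray value t_i
                y0, x0 = int(np.floor(y + dy)), int(np.floor(x + dx))
                fy, fx = y + dy - y0, x + dx - x0
                ti = (image[y0, x0] * (1 - fy) * (1 - fx)
                      + image[y0, x0 + 1] * (1 - fy) * fx
                      + image[y0 + 1, x0] * fy * (1 - fx)
                      + image[y0 + 1, x0 + 1] * fy * fx)
                diffs[y, x, i] = ti - image[y, x]          # d_i = t_i - t_c

    interior = (slice(pad, h - pad), slice(pad, w - pad))
    weights = 1 << np.arange(m)                            # 2^i
    c = np.abs(diffs[interior]).mean()                     # threshold: mean of |d_i| over the image
    clbp_s = ((diffs[interior] >= 0) * weights).sum(axis=2)          # Equation (1)
    clbp_m = ((np.abs(diffs[interior]) >= c) * weights).sum(axis=2)  # Equation (2)
    return clbp_s, clbp_m
```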

2.2. Fisher Vector

After local feature extraction (especially patch-based feature extraction), the popular BOVW model is
usually employed to encode the features into histograms. However, the BOVW model has an intrinsic
limitation, namely the computational cost of assigning local features to visual words, which scales as
the product of the number of visual words, the number of regions and the local feature dimensionality
[23]. Several extensions to the basic BOVW model that build compact vocabularies have been proposed.
The most appealing one is the Fisher kernel image representation [19,24], which uses a
high-dimensional gradient representation to represent an image. Due to its informative representation
with compact vocabularies, it contains more information than a simple histogram representation.

An FV is a special case of Fisher kernel construction. Let $X = \{x_t, t = 1 \ldots T\}$ be the set of
local patch descriptors extracted from an image. A Gaussian mixture model (GMM) is trained on the
training images using Maximum Likelihood (ML) estimation [25,26]. Let $P$ denote the probability
density function of the GMM with parameters $\lambda = \{\omega_i, \mu_i, \Sigma_i, i = 1 \ldots K\}$,
where $K$ is the number of components. $\omega_i$, $\mu_i$ and $\Sigma_i$ are the mixture weight, mean
vector, and covariance matrix of the $i$th Gaussian component, respectively. The image can be
characterized by the gradient of the log-likelihood of the data on the model:

$$ G_{\lambda}^{X} = \nabla_{\lambda} \log P(X|\lambda) \qquad (3) $$

The gradient describes the direction along which the parameters should be adjusted to best fit the
data. Under an independence assumption, the covariance matrices are diagonal, i.e.,
$\Sigma_i = \mathrm{diag}(\sigma_i^2)$. Then, following [27], $L(X|\lambda) = \log P(X|\lambda)$ is
written as

$$ L(X|\lambda) = \sum_{t=1}^{T} \log P(x_t|\lambda) \qquad (4) $$

The probability density function of $x_t$ generated by the GMM is

$$ P(x_t|\lambda) = \sum_{i=1}^{K} \omega_i\, p_i(x_t|\lambda) \qquad (5) $$

Let $\gamma_t(i)$ be the occupancy probability, i.e., the probability of descriptor $x_t$ being
generated by the $i$th Gaussian:

$$ \gamma_t(i) = P(i|x_t, \lambda) = \frac{\omega_i\, p_i(x_t|\lambda)}{\sum_{j=1}^{K} \omega_j\, p_j(x_t|\lambda)} \qquad (6) $$

With the Bayes formula, mathematical derivations provide the following results:

$$ \frac{\partial L(X|\lambda)}{\partial \omega_i} = \sum_{t=1}^{T} \left[ \frac{\gamma_t(i)}{\omega_i} - \frac{\gamma_t(1)}{\omega_1} \right] \quad \text{for } i \ge 2 \qquad (7) $$

$$ \frac{\partial L(X|\lambda)}{\partial \mu_i^d} = \sum_{t=1}^{T} \gamma_t(i) \left[ \frac{x_t^d - \mu_i^d}{(\sigma_i^d)^2} \right] \qquad (8) $$

$$ \frac{\partial L(X|\lambda)}{\partial \sigma_i^d} = \sum_{t=1}^{T} \gamma_t(i) \left[ \frac{(x_t^d - \mu_i^d)^2}{(\sigma_i^d)^3} - \frac{1}{\sigma_i^d} \right] \qquad (9) $$

where $d$ denotes the $d$th dimension of a vector. The diagonal closed-form approximation in [27] is
used to normalize the gradient vector by multiplying it by the square root of the inverse of the
Fisher information matrix, i.e., $F^{-1/2}$. Let $f_{\omega_i}$, $f_{\mu_i^d}$, and $f_{\sigma_i^d}$
denote the diagonal terms of $F$ corresponding to $\partial L(X|\lambda)/\partial \omega_i$,
$\partial L(X|\lambda)/\partial \mu_i^d$, and $\partial L(X|\lambda)/\partial \sigma_i^d$,
respectively. We have the following approximations:

$$ f_{\omega_i} = T \left( \frac{1}{\omega_i} + \frac{1}{\omega_1} \right) \qquad (10) $$

$$ f_{\mu_i^d} = \frac{T \omega_i}{(\sigma_i^d)^2} \qquad (11) $$

$$ f_{\sigma_i^d} = \frac{2 T \omega_i}{(\sigma_i^d)^2} \qquad (12) $$

Thus, the normalized partial derivatives are $f_{\omega_i}^{-1/2}\, \partial L(X|\lambda)/\partial \omega_i$,
$f_{\mu_i^d}^{-1/2}\, \partial L(X|\lambda)/\partial \mu_i^d$, and
$f_{\sigma_i^d}^{-1/2}\, \partial L(X|\lambda)/\partial \sigma_i^d$. The final gradient vector (i.e.,
the FV) is simply the concatenation of all the partial derivative vectors. Therefore, the
dimensionality of the FV is $(2D + 1) \times K$, where $D$ denotes the size of the local descriptors.
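
To make Equations (6)–(12) concrete, here is a minimal sketch of FV encoding for a single image,
assuming a diagonal-covariance GMM fitted with scikit-learn's GaussianMixture (an illustration, not the
authors' code). Note that Equation (7) yields $K - 1$ weight gradients because the mixture weights sum
to one, so this sketch returns a vector of length $(2D + 1)K - 1$.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fisher_vector(X, gmm):
    """Encode local descriptors X (T x D) into a Fisher vector; illustrative sketch only."""
    T, D = X.shape
    w, mu, var = gmm.weights_, gmm.means_, gmm.covariances_   # diagonal covariances: (K, D)
    sigma = np.sqrt(var)

    gamma = gmm.predict_proba(X)                              # occupancy probabilities, Eq. (6): (T, K)

    # Gradient w.r.t. the mixture weights, Eq. (7), defined for i >= 2 (K - 1 values)
    g_w = (gamma[:, 1:] / w[1:] - gamma[:, [0]] / w[0]).sum(axis=0)

    # Gradients w.r.t. means and standard deviations, Eqs. (8) and (9): shape (K, D)
    diff = X[:, None, :] - mu[None, :, :]
    g_mu = (gamma[:, :, None] * diff / var[None, :, :]).sum(axis=0)
    g_sigma = (gamma[:, :, None] * (diff ** 2 / var[None] - 1.0) / sigma[None]).sum(axis=0)

    # Normalization by the diagonal Fisher information approximation, Eqs. (10)-(12)
    g_w /= np.sqrt(T * (1.0 / w[1:] + 1.0 / w[0]))
    g_mu /= np.sqrt(T * w[:, None] / var)
    g_sigma /= np.sqrt(2.0 * T * w[:, None] / var)

    return np.concatenate([g_w, g_mu.ravel(), g_sigma.ravel()])

# Usage sketch with hypothetical names: fit the GMM on training descriptors, then encode each image.
# gmm = GaussianMixture(n_components=35, covariance_type="diag").fit(train_descriptors)
# fv = fisher_vector(image_descriptors, gmm)
```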

3. Proposed Feature Representation Method


Inspired by the success of CLBP and FV in computer vision applications, we propose an effective
image representation approach for remote sensing image scene classification based on patch-based
MS-CLBP features and FV. The patch-based MS-CLBP is applied as the local patch descriptors and then
the FV is chosen as the encoding strategy to generate a high-dimensional representation of an image.

3.1. Two Implementations of Multi-Scale Completed Local Binary Patterns

CLBP features computed from a single scale may not be able to detect the dominant texture features
from an image. A possible solution is to characterize the image texture at multiple resolutions, i.e.,
MS-CLBP. There are two implementations of the MS-CLBP descriptor [17].

In the first implementation, the radius of the circle $r$ is altered to change the spatial resolution.
The multi-scale analysis is accomplished by combining the information provided by multiple operators
of varying $(m, r)$. For simplicity, the number of neighbors is fixed to $m$ and different values of
$r$ are tuned to achieve the optimal combination. An example of a 3-scale (three $r$ values) CLBP
operator is illustrated in Figure 3. The CLBP_S and CLBP_M histogram features extracted from each
scale are concatenated to form an MS-CLBP representation. One disadvantage of this multi-scale
analysis implementation is that the computational complexity increases due to the multiple resolutions.

Figure 3. An example of the first implementation of a 3-scale CLBP operator (m = 8, r1 = 1, r2 = 2,
and r3 = 3).

In the second implementation, the original image is down-sampled using bicubic interpolation to obtain
multiple images at different scales. The value of the scale is between 0 and 1 (here, 1 denotes the
original image). Then, the CLBP_S and CLBP_M operators with a fixed radius and number of neighbors are
applied to the multiple-scale images. The CLBP_S and CLBP_M histogram features extracted from each
scale image are concatenated to form an MS-CLBP representation. An example of the second
implementation of the MS-CLBP descriptor is shown in Figure 4.

Figure 4. An example of the second implementation of a 3-scale CLBP operator (m = 8, r = 2).

3.2. Patch-Based MS-CLBP Feature Extraction

Given an image, the CLBP [18] operator with a parameter set $(m, r)$ is applied to generate two CLBP
coded images, one corresponding to the sign component (i.e., the CLBP_S coded image) and the other to
the magnitude component (i.e., the CLBP_M coded image). The two complementary components of CLBP
(CLBP_S and CLBP_M) can capture the spatial patterns and the contrast of local image texture, such as
edges and corners. Then, the CLBP coded images are partitioned into $B \times B$ overlapped patches in
an image grid. For simplicity, the overlap between two patches is half of the patch size (i.e., $B/2$)
in both the horizontal and vertical directions. To incorporate spatial structures of an image at
different scales (or sizes) and create more patch descriptors, the second implementation of MS-CLBP is
employed here by resizing the original image to different scales (e.g., 1/2 and 1/3 of the original
image). Specifically, the CLBP operator with the same parameter set is applied to the multi-scale
images to generate patch-based CLBP histogram features. For patch $i$, two occurrence histograms
(i.e., the nonparametric statistical estimate) are computed from the sign component (CLBP_S) and the
magnitude component (CLBP_M). A histogram feature vector denoted by $h_i$ is formed by concatenating
the two histograms. If $M$ patches are extracted from the multi-scale images, a feature matrix denoted
by $H = [h_1, h_2, \ldots, h_M]$ is generated to represent the original image. Each column of the
matrix $H$ is a histogram feature vector for a patch. The proposed patch-based CLBP feature extraction
method is illustrated in Figure 5.

Figure 5. Patch-based CLBP feature extraction.

As noted in [21], LBP features computed from a single scale may not be able to represent intrinsic
texture features. Therefore, different parameter sets $(m, r)$ are utilized for the CLBP operator to
achieve the first implementation of the MS-CLBP as described in [17]. Specifically, the number of
neighbors ($m$) is fixed and multiple radii ($r$) are used in the patch-based CLBP feature extraction
as shown in Figure 5. If $q$ parameter sets (i.e., $\{(m, r_1), (m, r_2), \ldots, (m, r_q)\}$) are
considered, a set of feature matrices denoted by $\{H_{(m,r_1)}, H_{(m,r_2)}, \ldots, H_{(m,r_q)}\}$
can be obtained for an image. It is worth noting that the proposed patch-based MS-CLBP feature
extraction effectively unifies the two implementations of the MS-CLBP [17].
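
A minimal sketch of the patch-based extraction just described is given below (illustrative only, not
the authors' code). It reuses the clbp_s_m sketch from Section 2.1, realizes the second MS-CLBP
implementation by resizing the image, and histograms raw CLBP codes rather than the rotation-invariant
patterns used in the paper; parameter names and scales are hypothetical.

```python
import numpy as np
from skimage.transform import resize

def patch_histograms(image, m=8, r=1, patch=32, scales=(1.0, 0.5, 1 / 3)):
    """Sketch of patch-based CLBP feature extraction (Figure 5): one column per patch."""
    bins = np.arange(2 ** m + 1)
    columns = []
    for s in scales:
        size = (max(int(image.shape[0] * s), patch), max(int(image.shape[1] * s), patch))
        scaled = resize(image, size, anti_aliasing=True)
        cs, cm = clbp_s_m(scaled, m=m, r=r)                  # CLBP_S / CLBP_M coded images
        step = patch // 2                                    # half-patch overlap
        for y in range(0, cs.shape[0] - patch + 1, step):
            for x in range(0, cs.shape[1] - patch + 1, step):
                hs, _ = np.histogram(cs[y:y + patch, x:x + patch], bins=bins)
                hm, _ = np.histogram(cm[y:y + patch, x:x + patch], bins=bins)
                columns.append(np.concatenate([hs, hm]))     # h_i
    return np.stack(columns, axis=1)                         # H = [h_1, ..., h_M]

# First MS-CLBP implementation: repeat with several radii r, giving one feature
# matrix H_(m,r) per parameter set for the later FV encoding.
# H_list = [patch_histograms(img, m=8, r=r) for r in range(1, 7)]
```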

3.3. A Fisher Kernel Representation

Fisher kernel representation [19] is an effective patch aggregation mechanism to characterize a sample
of low-level features, and it exhibits superior performance over the BOVW model. Therefore, the Fisher
kernel representation is employed to encode the dense local patch descriptors. Given $N_T$ training
images, $N_T$ feature matrices $\{H^{[1]}, H^{[2]}, \ldots, H^{[N_T]}\}$ representing the local patch
descriptors (i.e., patch-based CLBP features) of the images are obtained using the feature extraction
method illustrated in Figure 5. Since $q$ parameter sets (i.e., $(m, r_1), (m, r_2), \ldots, (m, r_q)$)
are employed for the CLBP operator, each image yields $q$ feature matrices denoted by
$\{H^{[j]}_{(m,r_1)}, H^{[j]}_{(m,r_2)}, \ldots, H^{[j]}_{(m,r_q)}\}$, where $j \in [1, 2, \ldots, N_T]$.
For each CLBP parameter set, the corresponding feature matrices of the training data are used to
estimate the GMM parameters via the Expectation-Maximization (EM) algorithm. Therefore, for $q$ CLBP
parameter sets, $q$ GMMs are created. After estimating the GMM parameters, $q$ FVs are obtained for an
image. Then, the $q$ FVs are simply concatenated to form the final feature representation. Figure 6
shows the detailed procedure for generating FVs. As illustrated in Figure 6, the stacked FVs (f) from
the $q$ CLBP parameter sets serve as the final feature representation of an image before being fed
into a classifier.

Figure 6. Fisher vector representation.

4. Experiments

Two standard public domain datasets are used to demonstrate the effectiveness of the proposed image
representation method for remote sensing land-use scene classification. In the experiments, KELM with
a radial basis function (RBF) kernel is employed for classification due to its generally excellent
classification performance and low computational cost. The classification performance of the proposed
method is compared with the state-of-the-art in the literature.

4.1. Experimental Data and Setup

The first dataset is the well-known UC-Merced land-use dataset [28]. It is the first public ground
truth land-use scene image dataset; it consists of 21 land-use classes, and each class contains 100
images with a size of 256 × 256 pixels. The images were manually extracted from aerial orthoimagery
downloaded from the United States Geological Survey (USGS) National Map. This is a challenging dataset
due to the variety of spatial patterns in the 21 classes. Sample images of each land-use class are
shown in Figure 7. To facilitate a fair comparison, the same experimental setting reported in [28] is
followed. Five-fold cross-validation is performed, in which the dataset is randomly partitioned into
five equal subsets. There are 20 images from each land-use class in a subset. Four subsets are used
for training and the remaining subset for testing. The classification accuracy is the average over the
five cross-validation evaluations.
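
For reference, the KELM classifier with an RBF kernel mentioned at the beginning of this section admits
a simple closed-form implementation. The following is an illustrative sketch (not the authors' code)
based on the standard kernel ELM solution; the regularization constant C and kernel width gamma are
hypothetical values, and the inputs are the per-image feature vectors used in each cross-validation fold.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """RBF kernel matrix between the rows of A and B."""
    d2 = np.square(A).sum(1)[:, None] + np.square(B).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * d2)

def kelm_train(X, y, C=100.0, gamma=1.0):
    """Closed-form KELM training: beta = (I/C + Omega)^-1 T, with Omega the kernel matrix."""
    n_classes = int(y.max()) + 1
    T = -np.ones((X.shape[0], n_classes))
    T[np.arange(X.shape[0]), y] = 1.0                        # one-vs-all targets
    omega = rbf_kernel(X, X, gamma)
    return np.linalg.solve(np.eye(X.shape[0]) / C + omega, T)

def kelm_predict(X_train, beta, X_test, gamma=1.0):
    return (rbf_kernel(X_test, X_train, gamma) @ beta).argmax(axis=1)

# Usage sketch for one fold (hypothetical variable names):
# beta = kelm_train(train_features, train_labels, C=100.0, gamma=1.0)
# predictions = kelm_predict(train_features, beta, test_features, gamma=1.0)
```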

Figure 7. Examples from the 21-class land-use dataset: (1) agricultural; (2) airplane; (3) baseball
diamond; (4) beach; (5) buildings; (6) chaparral; (7) dense residential; (8) forest; (9) freeway;
(10) golf course; (11) harbor; (12) intersection; (13) medium density residential; (14) mobile home
park; (15) overpass; (16) parking lot; (17) river; (18) runway; (19) sparse residential; (20) storage
tanks; (21) tennis courts.

The second dataset used in our experiments is the 19-class satellite scene dataset [29]. It consists
of 19 classes of high-resolution satellite scenes. There are 50 images with a size of 600 × 600 pixels
for each class. The images were extracted from large satellite images on Google Earth. An example of
each class is shown in Figure 8. The same experimental setup as in [30] was used. Here, 30 images are
randomly selected per class as training data and the remaining images as testing data. The experiment
is repeated 10 times with different realizations of randomly selected training and testing images.
Classification accuracy is averaged over the 10 trials.
Figure 8. Examples from the 19-class satellite scene dataset: (1) airport; (2) beach; (3) bridge;
(4) commercial; (5) desert; (6) farmland; (7) football field; (8) forest; (9) industrial; (10) meadow;
(11) mountain; (12) park; (13) parking; (14) pond; (15) port; (16) railway station; (17) residential;
(18) river; (19) viaduct.

Note that the original images in these two datasets are color images; the images are converted from
the RGB color space to the YCbCr color space, and the Y component (luminance) is used for scene
classification.

4.2. Parameters Setting

The number of neighbors ($m$) in the CLBP operator has a direct impact on the dimensionality of the FV
since patch-based CLBP features are used as local patch descriptors. A large value of $m$ will
increase the feature dimensionality and, in turn, the computational complexity. Based on the parameter
tuning results in [17], $m = 8$ is empirically chosen for both the 21-class land-use dataset and the
19-class satellite scene dataset as it balances classification performance and computational
complexity. In addition, the parameter settings in [17] are adopted for the MS-CLBP descriptor.
Specifically, 6 radii (i.e., $r = [1 : 6]$) are used for the MS-CLBP descriptor, resulting in 6
parameter sets $\{(m = 8, r_1 = 1), \ldots, (m = 8, r_6 = 6)\}$.

Then, the number of scales for the first implementation of the MS-CLBP operator for generating
multi-scale images and the number of Gaussians ($K$) in the GMM are studied. For the 21-class land-use
dataset, 80 images are randomly selected per class for training and the remaining images for testing.
For the 19-class satellite scene dataset, 30 images per class are randomly selected for training and
the remaining images for testing. Different numbers of Gaussians are examined along with different
choices of multiple scales including $\{1, 1/[1:2], \ldots, 1/[1:6]\}$. For instance, $1/[1:2]$
indicates that scale = 1 (original image) and scale = 1/2 (down-sampled image at half of the size of
the original image) are used to generate two images with two scales. Figures 9 and 10 present the
classification results with different numbers of Gaussians in the GMM and different numbers of scales
for the two datasets, respectively.
Figure 9. Classification accuracy (%) versus varying numbers of Gaussians and scales for our proposed
method for the 21-class land-use dataset.

Figure 10. Classification accuracy (%) versus varying numbers of Gaussians and scales for our proposed
method for the 19-class satellite scene dataset.

Thus, the optimal number of Gaussians for the 21-class land-use dataset is 35 and the optimal multiple
scales are $1/[1:4]$, simultaneously considering classification accuracy and computational complexity.
Similarly, the optimal number of Gaussians for the 19-class satellite scene dataset is 20 and the
optimal multiple scales are $1/[1:4]$.

Since the proposed method extracts dense local patches, the size of the patch ($B \times B$) is
determined empirically. The classification accuracies with varying patch sizes are illustrated in
Figure 11. It is obvious that $B = 32$ achieves the best classification performance for the 21-class
land-use dataset. The size of the images in the 19-class dataset is 600 × 600 pixels, which is about
twice the size of the images in the 21-class dataset. Therefore, the patch size is set to $B = 64$ for
the 19-class dataset.

In addition, to gain computational efficiency, principal component analysis (PCA) [31,32] is employed
to reduce the dimensionality of the FV features. The PCA projection matrix was calculated using the
features of the training data, and the principal components that account for 95% of the total
variation of the training features are considered in our experiments.

Figure 11. Classification accuracy (%) versus varying patch sizes for the 21-class land-use dataset.
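
A brief sketch of the dimensionality-reduction step described above, with scikit-learn's PCA standing
in for whatever implementation the authors used; the feature-matrix names are hypothetical.

```python
from sklearn.decomposition import PCA

# Fit the projection on the training FVs only, keeping the principal components
# that explain 95% of the training-set variance, then apply it to both splits.
pca = PCA(n_components=0.95, svd_solver="full")
train_fv_reduced = pca.fit_transform(train_fv)   # train_fv: (n_train_images, fv_dim)
test_fv_reduced = pca.transform(test_fv)
```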

4.3. FV Representation vs. BOVW Model

To verify the advantage of the FV as compared to the BOVW model, MS-CLBP+BOVW is applied to both the
21-class land-use dataset and the 19-class satellite scene dataset, and its performance is compared
with our approach. The same parameters are used for the MS-CLBP feature. In the BOVW model, 30,000
patches are randomly selected from all patches and K-means clustering is employed to generate 1024
visual words as a typical setting. The classification performance of the proposed method and
MS-CLBP+BOVW is evaluated over each category of the two datasets as shown in Figures 12 and 13,
respectively. As can be seen from Figure 12, the proposed method provides better performance than
MS-CLBP+BOVW in almost all categories except two (medium density residential and parking lot), and two
categories (agricultural and forest) have equal performance. In Figure 13, the proposed method
achieves greater accuracy in all classes except beach and industrial for the 19-class satellite
scene dataset.

Figure 12. Per-class accuracy (%) of the proposed method and MS-CLBP+BOVW on the 21-class
land-use dataset.

Figure 13. Per-class accuracy (%) of the proposed method and MS-CLBP+BOVW on the 19-class satellite
scene dataset.
4.4. Comparison to the State-of-the-Art Methods


4.4.4.4. Comparison
Comparison toto
thetheState-of-the-Art
State-of-the-ArtMethods
Methods
In
Inthis
thissection,
section,thetheeffectiveness
effectivenessofofthe
theproposed
proposedimage
imagerepresentation
representationmethod
methodisisevaluated
evaluatedby by
In this section, the effectiveness of the proposed image representation method is evaluated by
comparing
comparingits itsperformance
performancewith withpreviously
previouslyreported
reportedperformance
performancein inthe
theliterature.
literature.Specifically,
Specifically,the
the
comparing its performance with previously reported performance in the literature. Specifically, the
proposed
proposed method
method isis compared
compared withwith the
the MS-CLBP
MS-CLBP descriptor
descriptor [17]
[17] applied
applied toto an
an entire
entire remote
remote
proposed method is compared with the MS-CLBP descriptor [17] applied to an entire remote sensing
sensing
sensing image
image toto obtain
obtain aa global
global feature
feature representation.
representation. The
The comparison
comparison results
results are
are reported
reported in
in
image
Table to obtain athe global feature representation. The comparison results are reported in Table 1.
Table 1.1. From
From the comparison
comparison results,
results, the
the proposed
proposed method
method achieves
achieves superior
superior classification
classification
From the comparison results, the proposed method achieves superior classification performance over
Remote Sens. 2016, 8, 483 13 of 17

other existing methods, which demonstrates the effectiveness of the proposed image representation for remote sensing land-use scene classification. The improvement of the proposed method over the global representation developed in [17] is 2.4%. This improvement is mainly due to the proposed local feature representation framework, which unifies the two implementations of the MS-CLBP descriptor. Moreover, the proposed approach achieves an approximately 4.7% improvement over the MS-CLBP + BOVW method, which verifies the advantage of the Fisher kernel representation as compared to the BOVW model. Figure 14 shows the confusion matrix of the proposed method for the 21-class land-use dataset. The diagonal elements of the matrix denote the mean class-specific classification accuracy (%). An interesting observation from Figure 14 is that the diagonal elements for beach and forest are very large, whereas the diagonal element for storage tank is relatively small. The reasons are that images of beach and forest present rich texture and structure information; within-class similarity is high for the beach and forest categories but relatively low for the storage tank category; and some images of storage tanks are similar to those of other classes such as buildings.
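To make the contrast between the Fisher kernel representation and the BOVW model concrete, the sketch below encodes a set of local patch descriptors into an improved Fisher vector (mean and variance gradients of a diagonal-covariance GMM, with power and L2 normalization) in the spirit of [19,24]. It is an illustrative implementation built on scikit-learn rather than the code used in the experiments; the descriptor matrix, the number of Gaussians (16), and the descriptor dimension (59) are hypothetical placeholders.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fisher_vector(patches, gmm):
    """Improved Fisher vector (mean and variance gradients) of local descriptors.

    patches: (N, D) array of patch descriptors; gmm: fitted diagonal-covariance GMM.
    Returns a 2*K*D-dimensional, power- and L2-normalized vector.
    """
    N, _ = patches.shape
    gamma = gmm.predict_proba(patches)                        # (N, K) posteriors
    w, mu, var = gmm.weights_, gmm.means_, gmm.covariances_  # (K,), (K, D), (K, D)

    diff = (patches[:, None, :] - mu[None, :, :]) / np.sqrt(var)[None, :, :]  # (N, K, D)
    g_mu = (gamma[:, :, None] * diff).sum(axis=0) / (N * np.sqrt(w)[:, None])
    g_var = (gamma[:, :, None] * (diff ** 2 - 1)).sum(axis=0) / (N * np.sqrt(2 * w)[:, None])

    fv = np.hstack([g_mu.ravel(), g_var.ravel()])
    fv = np.sign(fv) * np.sqrt(np.abs(fv))                   # power normalization
    return fv / (np.linalg.norm(fv) + 1e-12)                 # L2 normalization

# Hypothetical usage: 500 patch descriptors of dimension 59, 16 Gaussians.
patches = np.random.rand(500, 59)
gmm = GaussianMixture(n_components=16, covariance_type='diag', random_state=0).fit(patches)
print(fisher_vector(patches, gmm).shape)                      # (2 * 16 * 59,)
```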
Table 1. Comparison of classification accuracy (%) for the 21-class land-use dataset.

Method                                                  Accuracy (Mean ± std)
BOVW [28]                                               76.8
SPM [28]                                                75.3
BOVW + Spatial Co-occurrence Kernel [28]                77.7
Color Gabor [28]                                        80.5
Color histogram (HLS) [28]                              81.2
Structural texture similarity [7]                       86.0
Unsupervised feature learning [33]                      81.7 ± 1.2
Saliency-Guided unsupervised feature learning [34]      82.7 ± 1.2
Concentric circle-structured multiscale BOVW [5]        86.6 ± 0.8
Multifeature concatenation [35]                         89.5 ± 0.8
Pyramid-of-Spatial-Relatons (PSR) [36]                  89.1
MCBGP + E-ELM [37]                                      86.52 ± 1.3
ConvNet with specific spatial features [38]             89.39 ± 1.10
Gradient boosting random convolutional network [39]     94.53
GoogLeNet [40]                                          92.80 ± 0.61
OverFeat ConvNets [40]                                  90.91 ± 1.19
MS-CLBP [17]                                            90.6 ± 1.4
MS-CLBP + BOVW                                          89.27 ± 2.9
The Proposed                                            93.00 ± 1.2

Figure 14. Confusion matrix of the proposed method for the 21-class land-use dataset.

When compared with CNNs, it can be found that the classification accuracy of CNNs is close to that of our method. Even though the performance of some CNNs is better than that of the proposed method, they need a pre-training process with a large amount of external data. Thus, our method is still competitive in terms of its limited requirement for external data.
The comparison results for the 19-class satellite scene dataset are listed in Table 2. They indicate that the proposed method outperforms the other existing methods and achieves the best performance. The proposed method provides about 7% improvement over the method in [31], which utilized a combination of multiple sets of features, indicating the superior discriminative power of the proposed feature representation. The confusion matrix of the proposed method for the 19-class satellite scene dataset is shown in Figure 15. From the diagonal elements of the matrix, the classification accuracy for bridge is relatively low because some texture information in the images of bridges is similar to that in the images of ports.
Table 2. Comparison of classification accuracy (%) for the 19-class satellite scene dataset.

Method                                      Accuracy (Mean ± std)
Bag of colors [25]                          70.6 ± 1.5
Tree of c-shapes [25]                       80.4 ± 1.8
Bag of SIFT [25]                            85.5 ± 1.2
Multifeature concatenation [25]             90.8 ± 0.7
LTP-HF [23]                                 77.6
SIFT + LTP-HF + Color histogram [23]        93.6
MS-CLBP [1]                                 93.4 ± 1.1
MS-CLBP + BOVW                              89.29 ± 1.3
The Proposed                                94.32 ± 1.2

Figure 15. Confusion matrix of the proposed method for the 19-class satellite scene dataset.
5. Conclusions
In this paper, an effective image representation method for remote sensing image scene classification was introduced. The proposed representation is based on multi-scale completed local binary pattern (MS-CLBP) features and Fisher vectors. The MS-CLBP descriptor was applied to the partitioned dense regions of an image to extract a set of local patch descriptors, which characterize the detailed structure and texture information in high-resolution remote sensing images. The Fisher vector was employed to
encode the local descriptors into a high-dimensional gradient representation, which can enhance the
discriminative power of feature representation. Experimental results on two land-use scene datasets
demonstrated that the proposed image representation approach obtained superior performance as
compared to the existing methods for scene classification, with improvements of approximately 3% over the state-of-the-art MS-CLBP for the 21-class land-use dataset and approximately 1% for the 19-class satellite scene dataset. In future work, combining global and local feature representations for remote
sensing image scene classification will be investigated.
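As a rough illustration of the patch-based feature extraction stage summarized above (a sketch only, not the authors' implementation), the code below partitions several rescaled versions of an image into dense patches and computes a rotation-invariant LBP histogram per patch; the plain LBP operator from scikit-image stands in for the completed LBP, and the patch size, stride, and scale factors are illustrative choices rather than the values used in the experiments.

```python
import numpy as np
from skimage.feature import local_binary_pattern
from skimage.transform import rescale

def dense_patch_descriptors(image, scales=(1.0, 0.75, 0.5), patch=32, stride=16,
                            points=8, radius=1):
    """LBP histograms of dense patches over multiple image scales.

    image: 2-D gray-scale array with values in [0, 1]. Returns an
    (num_patches, points + 2) array, one uniform LBP histogram per patch.
    """
    descriptors = []
    for s in scales:
        # Multi-scale analysis: rescale, then quantize to 8-bit for the LBP operator.
        img = (np.clip(rescale(image, s), 0.0, 1.0) * 255).astype(np.uint8)
        lbp = local_binary_pattern(img, points, radius, method='uniform')
        n_bins = points + 2                              # uniform patterns + "other"
        for r in range(0, lbp.shape[0] - patch + 1, stride):
            for c in range(0, lbp.shape[1] - patch + 1, stride):
                block = lbp[r:r + patch, c:c + patch]
                hist, _ = np.histogram(block, bins=n_bins, range=(0, n_bins))
                descriptors.append(hist / hist.sum())    # normalized patch histogram
    return np.asarray(descriptors)

# Hypothetical usage on a random 256 x 256 "image".
feats = dense_patch_descriptors(np.random.rand(256, 256))
print(feats.shape)                                        # (num_patches, 10)
```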

Acknowledgments: This work was supported by the National Natural Science Foundation of China under Grants
No. NSFC-61571033, 61302164, and partly by the Fundamental Research Funds for the Central Universities under
Grants No. BUCTRC201401, BUCTRC201615, XK1521.
Author Contributions: Longhui Huang, Chen Chen and Wei Li provided the overall conception of this research,
and designed the methodology and experiments. Longhui Huang and Chen Chen carried out the implementation
of the proposed algorithm, conducted the experiments and analysis, and wrote the manuscript. Wei Li and
Qian Du reviewed and edited the manuscript.
Conflicts of Interest: The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript:
LBP Local binary patterns
CLBP Completed local binary patterns
MS-CLBP Multi-scale completed local binary patterns
FV Fisher vector
ELM Extreme learning machine
KELM Kernel-based extreme learning machine
BOVW Bag-of-visual-words
SPM Spatial pyramid matching
SIFT Scale-invariant feature transform
EGTD Enhanced Gabor texture descriptor
GMM Gaussian mixture model
CLBP_S Completed local binary patterns sign component
CLBP_M Completed local binary patterns magnitude component
RBF Radial basis function
USGS United States Geological Survey
PCA Principal component analysis

References
1. Yang, J.; Jiang, Y.-G.; Hauptmann, A.G.; Ngo, C.-W. Evaluating bag-of-visual-words representations in scene classification. In Proceedings of the International Workshop on Multimedia Information Retrieval, 15th ACM International Conference on Multimedia, Augsburg, Bavaria, Germany, 23–28 September 2007; pp. 197–206.
2. Lazebnik, S.; Schmid, C.; Ponce, J. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, New York, NY, USA, 17–22 June 2006; pp. 2169–2178.
3. Yang, Y.; Newsam, S. Spatial pyramid co-occurrence for image classification. In Proceedings of the International Conference on Computer Vision, Barcelona, Spain, 6–13 November 2011; pp. 1465–1472.
4. Zhou, L.; Zhou, Z.; Hu, D. Scene classification using a multi-resolution bag-of-features model. Pattern Recognit. 2013, 46, 424–433. [CrossRef]
5. Zhao, L.-J.; Tang, P.; Huo, L.-Z. Land-use scene classification using a concentric circle-structured multiscale bag-of-visual-words model. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2014, 7, 4620–4631. [CrossRef]
6. Lowe, D.G. Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis. 2004, 60, 91–110. [CrossRef]
7. Risojevic, V.; Babic, Z. Aerial image classification using structural texture similarity. In Proceedings of the IEEE International Symposium on Signal Processing and Information Technology (ISSPIT), Bilbao, Spain, 14–17 December 2011; pp. 190–195.
8. Risojevic, V.; Momic, S.; Babic, Z. Gabor descriptors for aerial image classification. In Adaptive and Natural Computing Algorithms; Springer: Berlin/Heidelberg, Germany, 2011; pp. 51–60.
9. Oliva, A.; Torralba, A. Modeling the shape of the scene: A holistic representation of the spatial envelope. Int. J. Comput. Vis. 2001, 42, 145–175. [CrossRef]
10. Zheng, X.; Sun, X.; Fu, K.; Wang, H. Automatic annotation of satellite images via multifeature joint sparse coding with spatial relation constraint. IEEE Geosci. Remote Sens. Lett. 2013, 10, 652–656. [CrossRef]
11. Risojevic, V.; Babic, Z. Fusion of global and local descriptors for remote sensing image classification. IEEE Geosci. Remote Sens. Lett. 2013, 10, 836–840. [CrossRef]
12. Goodfellow, I.; Courville, A.; Bengio, Y. Deep Learning; Book in preparation for MIT Press; The MIT Press: Cambridge, MA, USA, 2016.
13. Bengio, Y. Learning deep architectures for AI. Found. Trends Mach. Learn. 2009, 2, 1–127. [CrossRef]
14. Yue, J.; Zhao, W.; Mao, S.; Liu, H. Spectral–spatial classification of hyperspectral images using deep convolutional neural networks. Remote Sens. Lett. 2015, 6, 468–477. [CrossRef]
15. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. In Neural Information Processing Systems; Neural Information Processing Systems Foundation, Inc.: San Diego, CA, USA, 2012; pp. 1106–1114.
16. Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1–9.
17. Chen, C.; Zhang, B.; Su, H.; Li, W.; Wang, L. Land-use scene classification using multi-scale completed local binary patterns. Signal Image Video Process. 2015, 10, 1–8. [CrossRef]
18. Guo, Z.; Zhang, L.; Zhang, D. A completed modeling of local binary pattern operator for texture classification. IEEE Trans. Image Process. 2010, 19, 1657–1663. [PubMed]
19. Perronnin, F.; Sánchez, J.; Mensink, T. Improving the Fisher kernel for large-scale image classification. In Computer Vision–ECCV 2010; Lecture Notes in Computer Science; Springer-Verlag: Berlin, Germany, 2010; pp. 143–156.
20. Huang, G.-B.; Zhou, H.; Ding, X.; Zhang, R. Extreme learning machine for regression and multiclass classification. IEEE Trans. Syst. Man Cybern. 2012, 42, 513–529. [CrossRef] [PubMed]
21. Ojala, T.; Pietikainen, M.; Maenpaa, T. Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. IEEE Trans. Pattern Anal. Mach. Intell. 2002, 24, 971–987. [CrossRef]
22. Li, W.; Chen, C.; Su, H.; Du, Q. Local binary patterns and extreme learning machine for hyperspectral imagery classification. IEEE Trans. Geosci. Remote Sens. 2015, 53, 3681–3693. [CrossRef]
23. Krapac, J.; Verbeek, J.; Jurie, F. Modeling spatial layout with Fisher vectors for image categorization. In Proceedings of the International Conference on Computer Vision, Barcelona, Spain, 6–13 November 2011; pp. 1487–1494.
24. Sánchez, J.; Perronnin, F.; Mensink, T.; Verbeek, J. Image classification with the Fisher vector: Theory and practice. Int. J. Comput. Vis. 2013, 105, 222–245. [CrossRef]
25. Liu, C. Maximum likelihood estimation from incomplete data via EM-type algorithms. In Advanced Medical Statistics; World Scientific Publishing Co.: Hackensack, NJ, USA, 2003; pp. 1051–1071.
26. Jaakkola, T.S.; Haussler, D. Exploiting generative models in discriminative classifiers. Adv. Neural Inf. Process. Syst. 1999, 11, 487–493.
27. Perronnin, F.; Dance, C. Fisher kernels on visual vocabularies for image categorization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Minneapolis, MN, USA, 17–22 June 2007; pp. 1–8.
28. Yang, Y.; Newsam, S. Bag-of-visual-words and spatial extensions for land-use classification. In Proceedings of the 18th SIGSPATIAL International Conference on Advances in Geographic Information Systems, San Jose, CA, USA, 3–5 November 2010; pp. 270–279.
29. Dai, D.; Yang, W. Satellite image classification via two-layer sparse coding with biased image representation. IEEE Geosci. Remote Sens. Lett. 2011, 8, 173–176. [CrossRef]
30. Sheng, G.; Yang, W.; Xu, T.; Sun, H. High-resolution satellite scene classification using a sparse coding based multiple feature combination. Int. J. Remote Sens. 2012, 33, 2395–2412. [CrossRef]
31. Ren, J.; Zabalza, J.; Marshall, S.; Zheng, J. Effective feature extraction and data reduction in remote sensing using hyperspectral imaging [applications corner]. IEEE Signal Process. Mag. 2014, 31, 149–154. [CrossRef]
32. Chen, C.; Li, W.; Tramel, E.W.; Fowler, J.E. Reconstruction of hyperspectral imagery from random projections using multihypothesis prediction. IEEE Trans. Geosci. Remote Sens. 2014, 52, 365–374. [CrossRef]
33. Cheriyadat, A.M. Unsupervised feature learning for aerial scene classification. IEEE Trans. Geosci. Remote Sens. 2014, 52, 439–451. [CrossRef]
34. Zhang, F.; Du, B.; Zhang, L. Saliency-guided unsupervised feature learning for scene classification. IEEE Trans. Geosci. Remote Sens. 2015, 53, 2175–2184. [CrossRef]
35. Shao, W.; Yang, W.; Xia, G.-S.; Liu, G. A hierarchical scheme of multiple feature fusion for high-resolution satellite scene categorization. In Computer Vision Systems; Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 2013; pp. 324–333.
36. Chen, S.; Tian, Y. Pyramid of spatial relatons for scene-level land use classification. IEEE Trans. Geosci. Remote Sens. 2015, 53, 1947–1957. [CrossRef]
37. Cvetkovic, S.; Stojanovic, M.B.; Nikolic, S.V. Multi-channel descriptors and ensemble of extreme learning machines for classification of remote sensing images. Signal Process. 2015, 39, 111–120. [CrossRef]
38. Keiller, N.; Waner, O.; Jefersson, A.; Dos, S. Improving spatial feature representation from aerial scenes by using convolutional networks. In Proceedings of the SIBGRAPI Conference on Graphics, Patterns and Images, Salvador, Brazil, 26–29 August 2015; pp. 44–51.
39. Zhang, F.; Du, B.; Zhang, L. Scene classification via a gradient boosting random convolutional network framework. IEEE Trans. Geosci. Remote Sens. 2016, 54, 1793–1802. [CrossRef]
40. Keiller, N.; Otavio, P.; Jefersson, S. Towards better exploiting convolutional neural networks for remote sensing scene classification. arXiv 2016, arXiv:1602.01517. Available online: http://arxiv.org/abs/1602.01517.

© 2016 by the authors; licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC-BY) license (http://creativecommons.org/licenses/by/4.0/).
