Xin Lu¹ Zhe Lin² Hailin Jin² Jianchao Yang² James Z. Wang¹
¹ The Pennsylvania State University
² Adobe Research
{xinlu, jwang}@psu.edu, {zlin, hljin, jiayang}@adobe.com
ABSTRACT
Effective visual features are essential for computational aesthetic quality rating systems. Existing methods used machine learning and statistical modeling techniques on handcrafted features or generic image descriptors. A recently-published large-scale dataset, the AVA dataset, has further empowered machine learning based approaches. We present the RAPID (RAting PIctorial aesthetics using Deep learning) system, which adopts a novel deep neural network approach to enable automatic feature learning. The central idea is to incorporate heterogeneous inputs generated from the image, which include a global view and a local view, and to unify the feature learning and classifier training using a double-column deep convolutional neural network. In addition, we utilize the style attributes of images to help improve the aesthetic quality categorization accuracy. Experimental results show that our approach significantly outperforms the state of the art on the AVA dataset.
Categories and Subject Descriptors
I.4.7 [Image Processing and Computer Vision]: Feature measurement; I.4.10 [Image Processing and Computer Vision]: Image Representation; I.5 [Pattern Recognition]: Classifier design and evaluation
General Terms
Algorithms, Experimentation
Keywords
Deep Learning; Image Aesthetics; Multi-Column Deep Neural Networks
∗The research has been primarily supported by Penn State’s College of Information Sciences and Technology and Adobe Research. The authors would like to thank the anonymous reviewers.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
MM’14, November 03-07, 2014, Orlando, FL, USA.
Copyright 2014 ACM 978-1-4503-3063-3/14/11 ...$15.00.
http://dx.doi.org/10.1145/2647868.2654927.
[Figure 1: panels labeled Original Image, Center-crop, Warp, and Padding]
1. INTRODUCTION
Automated assessment or rating of pictorial aesthetics has many applications. In an image retrieval system, the ranking algorithm can incorporate aesthetic quality as one of the factors. In picture editing software, aesthetics can be used in producing appealing polished photographs. Datta et al. [6] and Ke et al. [13] formulated the problem as a classification or regression problem where a given image is mapped to an aesthetic rating, which is normally quantized with discrete values. Under this framework, the effectiveness of the image representation, or the extracted features, can often be the accuracy bottleneck. Various handcrafted aesthetics-relevant features have been proposed [6, 13, 21, 3, 20, 7, 26, 27], including low-level image statistics such as distributions of edges and color histograms, and high-level photographic rules such as the rule of thirds.
While these handcrafted aesthetics features are often inspired by the photography or psychology literature, they share some known limitations. First, the aesthetics-sensitive attributes are manually designed and hence have limited scope. It is possible that some effective attributes have not yet been discovered through this process. Second, because of the vagueness of certain photographic or psychological rules and the difficulty in implementing them computationally, these handcrafted features are often merely approximations of such rules. There is often no principled approach to improving the effectiveness of such features.
Generic image features [23, 24, 22] were proposed to address the limitations of the handcrafted aesthetics features. They used well-designed common image features such as SIFT and Fisher Vector [18, 23], which have been successfully used for object classification tasks. The generic image features have been shown to outperform the handcrafted aesthetics features [23]. However, because these features are meant to be generic, they may be unable to attain the upper performance limits in aesthetics-related problems.
In this work, we intend to explore beyond generic image features by learning effective aesthetics features from images directly. We are motivated by the recent work in large scale image classification using deep convolutional neural networks [15] where the features are automatically learned from RGB images. The deep convolutional neural network takes pixels as inputs and learns a suitable representation through multiple convolutional and fully connected layers. However, the originally proposed architecture cannot be directly applied to our task. Image aesthetics relies on a combination of local and global visual cues. For example, the rule of thirds is a global image cue while sharpness and noise
levels are local visual characteristics. Given an image, we generate two heterogeneous inputs to represent its global cues and local cues respectively. Figure 1 illustrates global vs. local views. To support network training on heterogeneous inputs, we extend the method in [15] by developing a double-column neural network structure which takes parallel inputs from the two columns. One column takes a global view of the image and the other column takes a local view of the image. We integrate the two columns after some layers of transformations to form the final classifier. We further improve the aesthetic quality categorization by exploring style attributes associated with images. We named our system RAPID, which stands for RAting PIctorial aesthetics using Deep learning. We used a recently-released large dataset to show the advantages of our approach.
1.1 Related Work
Earlier visual aesthetics assessment research focused on examining handcrafted visual features based on common cues such as color [6, 26, 27], texture [6, 13], composition [21, 20, 7], and content [20, 7], as well as generic image descriptors [23, 31, 24]. Commonly investigated color features include lightness, colorfulness, color harmony, and color distribution [6, 26, 27]. Texture descriptors vary from wavelet-based texture features [6] and distribution of edges to blur descriptors and shallow depth-of-field descriptors [13]. Composition features typically include the rule of thirds, size and aspect ratio [20], and foreground and background composition [21, 20, 7]. There have been attempts to represent the content of images using people and portrait descriptors [20, 7], scene descriptors [7], and generic image features such as SIFT [18], GIST [28], and Fisher Vector [23, 24, 22].
Despite the success of handcrafted and generic visual features, the usefulness of automatically learned features has been demonstrated in many vision applications [15, 4, 32, 30]. Recently, trained deep neural networks have been used to build and associate mid-level features with class labels. The convolutional neural network (CNN) [16] is one of the most powerful learning architectures among the various types of neural networks (e.g., Deep Belief Net [10] and Restricted Boltzmann Machine [9]). Krizhevsky et al. [15] significantly advanced the 1000-class classification task in the ImageNet challenge with a deep CNN architecture in conjunction with dropout and normalization techniques; Sermanet et al. [30] achieved state-of-the-art performance on all major pedestrian detection datasets; and Ciresan et al. [4] reached near-human performance on the MNIST¹ dataset.
The effectiveness of CNN features has also been demonstrated in image style classification [12]. Without training a deep neural network, Karayev et al. extracted existing Decaf features [8] and used those features as input for style classification. There are key differences between that work [12] and ours. First, they mainly targeted style classification whereas we focus on aesthetic categorization, which is a different problem. Second, they used existing features as input to classification and did not train specific neural networks for style or aesthetics categorization. In contrast, we train deep neural networks directly from RGB inputs, which are optimized for the given task. Third, they relied on features from global views, while we leverage heterogeneous input sources, i.e., global and local views, and propose double-column neural networks to learn features jointly from both sources. Finally, we propose a regularized neural network based on related attributes to further boost aesthetics categorization.
As designing handcrafted features has been widely considered an appropriate approach in assessing image aesthetics, insufficient effort has been devoted to automatic feature learning on a large collection of labeled ground-truth data. The recently-developed AVA dataset [24] contains 250,000 images with aesthetic ratings and a 14,000-image subset with style labels (e.g., rule of thirds, motion blur, and complementary colors), making automatic feature learning using deep learning approaches possible.
¹ http://yann.lecun.com/exdb/mnist/
Figure 2: Single-column convolutional neural network for aesthetic quality rating and categorization. We have four convolutional layers and two fully-connected layers. The first and second convolutional layers are followed by max-pooling layers and normalization layers. The input patch of the size 224 × 224 × 3 is randomly cropped from the normalized input of the size 256 × 256 × 3 as done in [15].
In this work, we train deep neural networks on the AVA dataset to categorize image aesthetic quality. Specifically, we propose a double-column CNN architecture to automatically discover effective features that capture image aesthetics from two heterogeneous input sources. The proposed architecture is different from the recent work in multi-column neural networks [4, 1]. Agostinelli et al. [1] extended the stacked sparse autoencoder to a multi-column version by computing the optimal column weights and applied the model to image denoising. Ciresan et al. [4] averaged the output of several columns trained on inputs with different standard preprocessing methods. Our architecture is different from that work because the two columns in our architecture are jointly trained using two different inputs: the first column of the network takes the global image representation as the input, while the second column takes local image representations as the input. This allows us to leverage both compositional and local visual information.
The problem of assessing image aesthetics is also relevant to recent work on image popularity estimation [14]. Aesthetic value is connected with the notion of popularity, although there is a fundamental difference between the two concepts. Aesthetics is concerned primarily with the nature and appreciation of beauty, while in the measurement of popularity both aesthetics and how interesting the visual stimulus is to the viewer population are important. For instance, a photograph of some thought-provoking subject may not be considered of high aesthetic value, but can be appreciated by many people based on the subject alone. On the other hand, a beautiful picture of flowers may not become popular if the viewers do not consider the subject sufficiently interesting.
1.2 Contributions
Our main contributions are as follows.
• We conducted systematic evaluation of the single-column deep convolutional neural network approach with different types of input modalities for aesthetic quality categorization;
• We developed a double-column deep convolutional neural network architecture to jointly learn features from heterogeneous inputs;
• We developed a regularized double-column deep convolutional neural network to further improve aesthetic categorization using style attributes.
2. THE ALGORITHM
Patterns in aesthetically-pleasing photographs often indicate photographers’ visual preferences. Among those patterns, composition [17] and visual balance [25] are important factors [2]. They are reflected in the global view (e.g., top row in Figure 1) and the local view (e.g., bottom row in the figure). Popular composition principles include the rule of thirds, diagonal lines, and the golden ratio [11], while visual balance is affected by position, form, size, tone, color, brightness, contrast, and proximity to the fulcrum [25]. Some of these patterns are not well-defined or even abstract, making it difficult to calculate those features for assessing image aesthetic quality. Motivated by this, we aim to leverage the power of CNN to automatically identify useful patterns and employ learned visual features to rate or to categorize the aesthetic quality of images.
However, applying CNN to the aesthetic quality categorization task is not straightforward. The different aspect ratios and resolutions in photographs and the importance of image details in aesthetics make it difficult to directly train a CNN, whose inputs are typically normalized to the same size and aspect ratio. A challenging question, therefore, is how to perform automatic feature learning with regard to both the global and the local views of the input images. To address this challenge, we take several different representations of an image, i.e., the global and the local views of the image, which can be encoded by jointly considering those heterogeneous representations. We first use each of the representations to train a single-column CNN (SCNN) to assess image aesthetics. We further develop a double-column CNN (DCNN) to allow our model to use the heterogeneous inputs from one image, aiming at identifying visual features in terms of both global and local views. Finally, we investigate how the style of images can be leveraged to boost aesthetic classification accuracy [29]. We present an aesthetic quality categorization approach with style attributes by learning a regularized double-column network (RDCNN), a three-column network.
[Figure 3 diagram labels: Global View, Fine-grained View, Column 1, Column 2]
Figure 3: Double-column convolutional neural network. Each training image is represented by its global and local views, and is associated with its aesthetic quality label: 0 refers to a low quality image and 1 refers to a high quality image. Networks in different columns are independent in convolutional layers and the first two fully-connected layers. The final fully-connected layer is jointly trained.
2.1 Single-column Convolutional Neural Network
A deep convolutional neural network [15] takes inputs of fixed aspect ratio and size. However, an input image can be of arbitrary size and aspect ratio. To normalize image sizes, we propose three different transformations: center-crop ($g_c$), warp ($g_w$), and padding ($g_p$), which reflect the global view ($I_g$) of an image $I$. $g_c$ isotropically resizes the original image by normalizing its shorter side to a fixed length $s$, and then center-crops it to generate an $s \times s \times 3$ input. $g_c$ was adopted in a recent image classification work [15]. $g_w$ anisotropically resizes (or warps) the original image into a normalized input with a fixed size $s \times s \times 3$. $g_p$ resizes the original image by normalizing the longer side of the image to a fixed length $s$ and padding border pixels with zeros to generate a normalized input of a fixed size $s \times s \times 3$. For each image $I$ and each type of transformation, we generate an $s \times s \times 3$ input $I_{g_j}$ with the transformation $g_j$, where $j \in \{c, w, p\}$. As resizing inputs can cause harmful information loss (i.e., the high-resolution local views) for aesthetic assessment, we also use randomly sampled fixed size
(at $s \times s \times 3$) crops with the transformation $l_r$. Here we use $g$ to denote global transformations and $l$ to denote local transformations. This results in normalized inputs $\{I_{l_r}\}$ ($r$ is an index of normalized inputs for each random cropping), which preserve the local views of an image with details from the original high-resolution image. We used these normalized inputs $I_t \in \{I_{g_c}, I_{g_w}, I_{g_p}, I_{l_r}\}$ for CNN training. In this work, we set $s$ to 256, thus the size of $I_t$ is 256 × 256 × 3. To alleviate overfitting in network training, for each normalized input $I_t$, we extracted a random 224 × 224 × 3 patch $I_p$ or its horizontal reflection to be the input patch to our network.
We present an example for the four transformations, $g_w$, $g_c$, $g_p$, and $l_r$, in Figure 1. As shown, the global view of an image is maintained via the transformations of $g_c$, $g_w$, and $g_p$. Among the three global views, $I_{g_w}$ and $I_{g_p}$ maintain the relative spatial layout among elements in the original image. $I_{g_w}$ and $I_{g_p}$ follow the rule of thirds whereas $I_{g_c}$ does not. In the bottom row of the figure, the local views of an original image are represented by randomly-cropped patches $\{I_{l_r}\}$. These patches depict the local details in the original resolution of the image.
The architecture of the SCNN used for aesthetic quality assessment is shown in Figure 2. It has a total of four convolutional layers. The first and the second convolutional layers are followed by max-pooling layers and normalization layers. The first convolutional layer filters the 224 × 224 × 3 patch with 64 kernels of the size 11 × 11 × 3 with a stride of 2 pixels. The second convolutional layer filters the output of the first convolutional layer with 64 kernels of the size 5 × 5 × 64. Each of the third and fourth convolutional layers has 64 kernels of the size 3 × 3 × 64, and the two fully-connected layers have 1000 and 256 neurons respectively.
Suppose for the input patch $I_p$ of the $i$-th image, we have the feature representation $x_i$ extracted from layer fc256 (the output of the convolutional layers and the fc1000 layer), and the label $y_i \in C$. The training of the last layer is done by maximizing the following log likelihood function:
$$ l(W) = \sum_{i=1}^{N} \sum_{c \in C} \mathbb{I}(y_i = c) \log p(y_i = c \mid x_i, w_c), \tag{1} $$
where $N$ is the number of images, $W = \{w_c\}_{c \in C}$ is the set of model parameters, and $\mathbb{I}(x) = 1$ if $x$ is true and 0 otherwise. The probability $p(y_i = c \mid x_i, w_c)$ is expressed as
$$ p(y_i = c \mid x_i, w_c) = \frac{\exp(w_c^T x_i)}{\sum_{c' \in C} \exp(w_{c'}^T x_i)}. \tag{2} $$
The aesthetic quality categorization task can be defined as a binary classification problem where each input patch is associated with an aesthetic label $c \in C = \{0, 1\}$. In Section 2.3, we explain an SCNN for image style categorization, which can be considered a multi-class classification task.
As indicated by the previous study [15], the architecture of the deep neural network may critically affect the performance. Our experiments suggest that the general guideline for training a good-performing network is to first allow sufficient learning power of the network by using a sufficient number of neurons. Meanwhile, we adjust the number of convolutional layers and the fully-connected layers to support the feature learning and classifier training. In particular, we extensively evaluate the network trained with different numbers of convolutional layers and fully-connected layers, and with or without normalization layers. Candidate architectures are shown in Table 1. To determine the optimal architecture for our task, we conduct experiments on candidate architectures and pick the one with the highest performance, as shown in Figure 2.
With the selected architecture, we train SCNN with four different types of inputs ($I_{g_c}$, $I_{g_w}$, $I_{g_p}$, $I_{l_r}$) using the AVA dataset [24]. During training, we handle the overfitting problem by adopting dropout and shuffling the training data in each epoch. Specifically, we found that $l_r$ serves as an effective data augmentation approach which alleviates overfitting. Because $I_{l_r}$ is generated by random cropping, an image contributes to the network training with different inputs when a different patch is used.
We experimentally evaluate the performance of these inputs with SCNN. Results will be presented in Section 3. $I_{g_w}$ performs the best among the three global input variations ($I_{g_c}$, $I_{g_w}$, $I_{g_p}$). $I_{l_r}$ yields even better results than $I_{g_w}$. Hence, we use $I_{l_r}$ and $I_{g_w}$ as the two inputs to train the proposed double-column network. In our experiments, we fix the dropout rate at 0.5 and initialize the learning rate to 0.001. Given a test image, we compute its normalized input and then generate the input patch, with which we calculate the probability of the input patch being assigned to each aesthetic category. We repeat this process 50 times, average those results, and pick the class with the highest probability.
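The test-time procedure above (normalize, sample a patch, apply Equation 2, average over 50 patches) can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the implementation used in this work: images are plain 2-D grids of scalar pixels rather than $s \times s \times 3$ arrays, `warp` uses nearest-neighbour sampling (no interpolation method is specified in the text), the random horizontal reflection is assumed to be applied at test time as well, and `score_fn` and `toy_scores` are hypothetical stand-ins for a trained SCNN that returns the per-class scores $w_c^T x$.

```python
import math
import random


def softmax(scores):
    """Equation (2): map per-class scores w_c^T x to class probabilities."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def warp(image, s):
    """g_w: anisotropically resize a 2-D grid to s x s.

    Nearest-neighbour sampling is an assumption made for this sketch.
    """
    h, w = len(image), len(image[0])
    return [[image[r * h // s][c * w // s] for c in range(s)] for r in range(s)]


def random_patch(image, size, rng):
    """Random size x size crop, mirrored horizontally with probability 0.5."""
    h, w = len(image), len(image[0])
    top = rng.randint(0, h - size)    # randint is inclusive on both ends
    left = rng.randint(0, w - size)
    patch = [row[left:left + size] for row in image[top:top + size]]
    if rng.random() < 0.5:
        patch = [row[::-1] for row in patch]
    return patch


def predict(image, score_fn, s=256, patch_size=224, n_patches=50, seed=0):
    """Average class probabilities over random patches; return (class, mean)."""
    rng = random.Random(seed)
    normalized = warp(image, s)
    totals = None
    for _ in range(n_patches):
        probs = softmax(score_fn(random_patch(normalized, patch_size, rng)))
        totals = probs if totals is None else [t + p for t, p in zip(totals, probs)]
    mean = [t / n_patches for t in totals]
    return mean.index(max(mean)), mean


def toy_scores(patch):
    """Hypothetical scorer (not a trained network): favours class 1 if bright."""
    brightness = sum(map(sum, patch)) / (len(patch) * len(patch[0]))
    return [0.0, brightness - 0.5]
```

For example, `predict([[1.0] * 12 for _ in range(6)], toy_scores, s=8, patch_size=4)` warps a 6 × 12 grid to 8 × 8, scores 50 random 4 × 4 patches, and returns the class with the highest averaged probability together with the averaged probabilities.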
2.2 Double-column Convolutional Neural Network
For each image, its global or local information may be lost when transformed to a normalized input using $g_c$, $g_w$, $g_p$, or $l_r$. Representing an image through multiple inputs can somewhat alleviate the problem. As a first attempt, we generate one input to depict the global view of an image and another to represent its local view.
We propose a novel double-column convolutional neural network (DCNN) to support automatic feature learning with heterogeneous inputs, i.e., a global-view input and a local-view input. We present the architecture of the DCNN in Figure 3. As shown in the figure, networks in different columns are independent in convolutional layers and the first two fully-connected layers. The inputs of the two columns are $I_{g_w}$ and $I_{l_r}$. We take the two 256 × 1 vectors from each of the fc256 layers and jointly train the weights of the final fully-connected layer. We avoid the interaction between two columns in convolutional layers because they are in different spatial scales. During training, the error is back-propagated to the networks in each column respectively with stochastic gradient descent. With the proposed architecture, we can also automatically discover both the global and the local features of an image from the fc1000 layers and fc256 layers.
The proposed network architecture could easily be expanded to multi-column convolutional networks by incorporating more types of normalized inputs. DCNN allows different architectures in individual networks, which may facilitate the parameter learning for networks in different columns.
In our work, network architectures are the same for both columns. Given a test image, we perform a similar procedure as we do with SCNN to evaluate the aesthetic quality of an image.
[Figure 4 diagram labels: Style-SCNN (pre-trained), Style Column, Aesthetic Column, DCNN, $x_s$, $x_a$]
Figure 4: Regularized double-column convolutional neural network (RDCNN). The style attributes $x_s$ are generated through the pre-trained Style-SCNN, and we leverage the style attributes to regularize the training process of RDCNN. The dashed line indicates that the parameters of the style column are fixed during RDCNN training. While training the RDCNN, we only fine-tune the parameters in the aesthetic column, and the learning process is supervised by the aesthetic label.
2.3 Learning and Categorization with Style Attributes
The discrete aesthetic labels, i.e., high quality and low quality, provide only weak supervision for making the network converge properly, owing to the large intra-class variation. This motivates us to exploit extra labels from the training images to help identify their aesthetic characteristics. We propose to leverage style attributes, such as complementary colors, macro, motion blur, rule of thirds, and shallow depth-of-field (DOF), to help determine the aesthetic quality of images because they are regarded as highly relevant attributes [24].
There are two natural ways to formulate the problem. The first is to leverage the idea of multi-task learning [5], which jointly constructs the feature representation and minimizes the classification error for both labels. Assuming we have aesthetic quality labels $\{y_{ai}\}$ and style labels $\{y_{si}\}$ for all training images, the problem becomes an optimization problem:
$$ \max_{X, W_a, W_s} \sum_{i=1}^{N} \Big( \sum_{c \in C_A} \mathbb{I}(y_{ai} = c) \log p(y_{ai} \mid x_i, w_{ac}) + \sum_{c \in C_S} \mathbb{I}(y_{si} = c) \log p(y_{si} \mid x_i, w_{sc}) \Big), \tag{3} $$
where $X$ denotes the features of all training images, $C_A$ is the label set for aesthetic quality, $C_S$ is the label set for style, and $W_a = \{w_{ac}\}_{c \in C_A}$ and $W_s = \{w_{sc}\}_{c \in C_S}$ are the model parameters. It is more difficult to obtain images with style attributes. In the AVA benchmark, among 230,000 images with aesthetic labels, only 14,000 have style labels. As a result, we cannot jointly perform aesthetics categorization and style classification with a single neural network due to many missing labels.
Alternatively, we can use ideas from inductive transfer learning [29], where we target minimizing the classification error with one label, whereas we construct feature representations with both labels. As we only have a subset of images with style labels, we first train a style classifier with them. We then extract style attributes for all training images, and apply those attributes to regularize the feature learning and classifier training for aesthetic quality categorization.
To learn style attributes for 230,000 training images, we first train a style classifier by performing the training procedure discussed in Section 2.1 on 11,000 labeled training images (Style-SCNN). We adopted the same architecture as shown in Figure 2. The only difference is that we reduced the number of filters in the first and fourth convolutional layers to one half due to the reduced number of training images. With Style-SCNN, we maximize the log likelihood function in Equation 1 where $C$ is the set of style labels in the AVA dataset. We experimentally select the best architectures (to be shown in Table 4) and inputs ($I_{g_c}$, $I_{g_w}$, $I_{g_p}$, $I_{l_r}$). The details are described in Section 3. Given an image, we apply the learned weights and extract the features from the fc256 layer as its style attribute.
To facilitate the network training with style attributes of images, we propose a regularized double-column convolutional neural network (RDCNN) with the architecture shown in Figure 4. The two normalized inputs of the aesthetic column are $I_{g_w}$ and $I_{l_r}$, the same as in DCNN (Section 2.2). The input of the style column is $I_{l_r}$. The training of RDCNN is done by solving the following optimization problem:
$$ \max_{X_a, W_a} \sum_{i=1}^{N} \sum_{c \in C_A} \mathbb{I}(y_{ai} = c) \log p(y_{ai} \mid x_{ai}, x_{si}, w_{ac}), \tag{4} $$
where $x_{si}$ are the style attributes of the $i$-th training image, and $x_{ai}$ are the features to be learned. Note that the maximiza-
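As a rough illustration of the objective in Equation 4, the Python sketch below evaluates the log likelihood for fixed features. It rests on assumptions that are not stated in the text: the aesthetic features $x_{ai}$ and the fixed style attributes $x_{si}$ are combined by simple concatenation, and the final layer is a linear softmax classifier of the form given in Equation 2. The names `class_probs` and `log_likelihood` are hypothetical.

```python
import math


def softmax(scores):
    """Softmax of Equation (2), shifted by the maximum for stability."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def class_probs(x_a, x_s, weights):
    """p(y | x_a, x_s, w_c) for every class c.

    Assumption: the classifier sees the concatenation [x_a; x_s], and
    weights[c] is the weight vector w_ac for aesthetic class c.
    """
    x = list(x_a) + list(x_s)
    scores = [sum(w_k * x_k for w_k, x_k in zip(w_c, x)) for w_c in weights]
    return softmax(scores)


def log_likelihood(samples, weights):
    """Objective of Equation (4) for given features and weights.

    samples is a list of (x_a, x_s, y_a) triples; only the term for the
    true label y_a contributes, because the indicator is zero elsewhere.
    """
    return sum(
        math.log(class_probs(x_a, x_s, weights)[y_a])
        for x_a, x_s, y_a in samples
    )
```

In this sketch only `weights` (and, in a full network, the layers producing `x_a`) would be updated during training, while `x_s` stays fixed, mirroring the fixed style column in Figure 4. With all-zero weights every class receives probability 0.5, so each image contributes log 0.5 to the objective.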
Table 1: Accuracy for Different SCNN Architectures
Candidate layers: conv1 (64), pool1, rnorm1, conv2 (64), pool2, rnorm2, conv3 (64), conv4 (64), conv5 (64), conv6 (64), fc1K, fc256, fc2
Architecture   Layers checked (of 13)   Accuracy
Arch 1         11                       71.20%
Arch 2         9                        60.25%
Arch 3         10                       62.68%
Arch 4         11                       65.14%
Arch 5         10                       70.52%
Arch 6         12                       62.49%
Arch 7         13                       70.93%
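For reference, the kernel sizes given for the selected architecture in Section 2.1 imply the following convolutional weight counts. The Python snippet below is plain arithmetic on those stated sizes; it excludes biases and the fully-connected layers, whose input dimensions are not given in the text, and the names `ARCH1_CONV_LAYERS` and `kernel_weights` are hypothetical.

```python
# (name, kernel_height, kernel_width, input_channels, num_kernels),
# taken from the description of the selected SCNN in Section 2.1.
ARCH1_CONV_LAYERS = [
    ("conv1", 11, 11, 3, 64),
    ("conv2", 5, 5, 64, 64),
    ("conv3", 3, 3, 64, 64),
    ("conv4", 3, 3, 64, 64),
]


def kernel_weights(layers):
    """Number of kernel weights per convolutional layer (biases excluded)."""
    return {name: kh * kw * cin * n for name, kh, kw, cin, n in layers}


counts = kernel_weights(ARCH1_CONV_LAYERS)
total = sum(counts.values())
print(counts, total)
```

The second convolutional layer dominates the convolutional weight count because its 5 × 5 kernels span all 64 input channels.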
With the network architecture fixed to Arch 1, we compare the performance of SCNN with different inputs, i.e., $I_{g_c}$, $I_{g_w}$, $I_{g_p}$, $I_{l_r}$. We train classifiers with both δ = 0 and δ = 1 for each input type. The overall accuracy is presented in Table 2. The results show that $I_{l_r}$ yields the highest accuracy among the four types of inputs, which indicates that $l_r$ serves as an effective data augmentation approach to capture the local aesthetic details of images. $I_{g_w}$ performs much better than $I_{g_c}$ and $I_{g_p}$, making it the best of the three inputs for capturing the global view of images.
Based on the above observation, we choose Arch 1 as the architecture of our model, with $I_{l_r}$ as input. As shown in
[Figure 10 image labels: High, High, High, Low, Low]
Figure 10: Test images correctly classified by RDCNN and misclassified by DCNN. The label on each image
indicates the ground truth aesthetic quality of images.