Neurocomputing
journal homepage: www.elsevier.com/locate/neucom

a School of Computer and Information, Hefei University of Technology, Hefei 230009, China
b Department of EEIS, University of Science and Technology of China, Hefei 230026, China
c College of Computer Science, Zhejiang University, Hangzhou 310027, China
d School of Computing, National University of Singapore, 117417 Singapore, Singapore
Article info

Keywords:
Video advertising
Visual relevance
Product

Abstract
We have witnessed the booming of contextual video advertising in recent years. However, existing advertising systems take only metadata into account, such as titles, descriptions and tags. This kind of text-based contextual advertising reveals a number of shortcomings in ad insertion and ad association. In this paper, we present a novel video advertising system called VideoAder. The system leverages the well-organized media information of the video corpus to embed visually relevant ads at a set of precisely located insertion positions. Given a product, we utilize content-based object retrieval techniques to identify the relevant ads and their potential embedding positions in the video stream. We then formulate ad association as an optimization problem that maximizes the total revenue of the system. Specifically, the Single-Merge and Merge methods are proposed to tackle complex queries in visual representation. Typical Feature Intensity (TFI) is used to train a classifier that automatically decides which method is more representative. Experimental results demonstrate the accuracy and feasibility of the system.
© 2013 Elsevier B.V. All rights reserved.
1. Introduction
The explosive growth of online multimedia data has brought new challenges to online video advertising. In traditional video advertising, the association between videos and ads is decided by keyword matching, as in Google AdWords and AdSense [1]. The relevance is calculated on the basis of video metadata such as title, description and tags. However, traditional text-based contextual advertising reveals a number of disadvantages. First, conventional video advertising systems usually determine ad insertion points by metadata analysis [2] or video structure [15], without considering the visual coherence between the ads and the insertion point in the video. This makes the ads highly intrusive to the audience. Second, user-tagged text is generally incomplete and inaccurate; that is, text quality varies in coverage and accuracy with the subjectivity of the text creator, which might decrease the revenue of the advertising system.
There exists a wide variety of advertising schemes for contextual video advertisement. A typical scheme for contextual relevance is the content-relevance system based on the video's webpage (e.g., Google AdSense [16]). YouTube [19] and Hulu [21] select relevant ads by mining contextual metadata and insert ads
Corresponding author.
E-mail address: hongrc.hfut@gmail.com (R. Hong).
0925-2312/$ - see front matter © 2013 Elsevier B.V. All rights reserved.
http://dx.doi.org/10.1016/j.neucom.2012.04.040
Please cite this article as: R. Hong, et al., Advertising object in web videos, Neurocomputing (2013), http://dx.doi.org/10.1016/j.neucom.2012.04.040
Fig. 2. Examples of products suitable for (a) the Single-Merge approach (Nano 6, Amazon Kindle, iPhone 3GS) and (b) the Merge approach (camera).
2. System framework
The overall system framework is depicted in Fig. 1.
2.1. Preprocessing
All videos of a video community website are decomposed into a series of keyframes, one every 5 s. Intuitively, one keyframe per 5 s
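The fixed-rate sampling described above can be sketched as follows; the function name and the use of integer second offsets are our own illustrative assumptions, not details from the paper.

```python
def keyframe_times(duration_s, step_s=5):
    """Timestamps (in seconds) at which keyframes are sampled,
    one every step_s seconds, as in the preprocessing step."""
    return list(range(0, int(duration_s), step_s))
```

For example, a 22-second clip yields keyframes at 0, 5, 10, 15 and 20 s.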
$$\mathrm{TFI}_p = \frac{1}{2|S_p|}\sum_{i=1}^{|S_p|}\sum_{j=1}^{|M_p|} E\left(S_i^p, M_j^p\right)$$
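The TFI score for a product p can be sketched as below; since the paper does not spell out the form of E(·,·), the sketch assumes it is given as a precomputed |S_p| × |M_p| matrix of pairwise matching scores.

```python
def tfi(match_scores):
    """TFI-style score: sum all pairwise scores E(S_i^p, M_j^p)
    over the |S_p| x |M_p| matrix, normalized by 2*|S_p|."""
    s = len(match_scores)                      # |S_p|
    total = sum(sum(row) for row in match_scores)
    return total / (2 * s)
```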
2.2.2. Merge
We observe that for some products, such as computers and cameras (see Fig. 2(b)), feature points are homogeneously distributed over all sides of the 3D object surface, and all sides occur with similar frequency in videos. Hence the first query representation approach can hardly represent the query integrally,
2.3. Searching
Table 1
The TFI value for 10 query products.

Product              TFI
1. Amazon Kindle     0.593
2. iPhone 3GS        0.669
3. iPod Nano 6       0.516
4. BlackBerry 9700   0.368
5. Cisco 7600 Phone  0.284
6. MacBook Pro       0.157
7. Nikon P7000       0.223
8. Nintendo Wii      0.122
9. ThinkPad          0.161
10. Xbox 360         0.101
[Figure: bar chart (residue removed); legend: Single Merge, Merge. Products 1–4 are labeled S (Single-Merge) and products 5–10 are labeled M (Merge); the bar values are not recoverable.]
$$\sum_{i=1}^{|W_d|} \mathrm{TFIDF}(W_i, d)$$
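The TF-IDF sum over the visual words of a keyframe can be sketched as follows; the paper does not give the exact TF and IDF definitions, so this sketch assumes the standard term-frequency times log inverse-document-frequency weighting.

```python
import math
from collections import Counter

def tfidf_score(doc_words, corpus):
    """Sum of TF-IDF weights over the visual words W_i of one
    keyframe d; corpus is the list of all keyframes' word lists."""
    n_docs = len(corpus)
    tf = Counter(doc_words)
    score = 0.0
    for w, f in tf.items():
        df = sum(1 for d in corpus if w in d)   # document frequency
        idf = math.log(n_docs / df)             # df > 0: doc_words is in corpus
        score += (f / len(doc_words)) * idf
    return score
```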
2.3.2. Filtering
We devise filters to make the search results more satisfactory. After ranking, we filter keyframes as follows:
(a) keyframes with too many or too few detected feature points;
(b) keyframes ranked high but with few matching points;
(c) in the case of one-to-many matching, the redundant matching points.
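Filters (a) and (b) can be sketched as below; the threshold values and dictionary keys are illustrative assumptions, since the paper does not report the cutoffs it used.

```python
def filter_keyframes(ranked, min_feat=50, max_feat=5000, min_match=10):
    """Drop ranked keyframes that (a) have too many or too few
    detected feature points, or (b) rank high but have few
    matching points. Thresholds are hypothetical."""
    kept = []
    for kf in ranked:
        if not (min_feat <= kf["n_features"] <= max_feat):
            continue                      # rule (a)
        if kf["n_matches"] < min_match:
            continue                      # rule (b)
        kept.append(kf)
    return kept
```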
2.3.3. Re-ranking
We apply the computationally expensive geometric verification to the small retrieved image set after searching and filtering the top 100 keyframes. The spatial consistency of the k (k = 10) spatially nearest neighbors is used to filter the visual words.
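One way to realize this spatial-consistency check is sketched below; the paper does not give its exact scoring rule, so this sketch assumes matches are index-paired between the query and the result keyframe and counts matches whose k nearest neighbors overlap in both images.

```python
import math

def knn(points, idx, k):
    """Indices of the k spatially nearest neighbours of points[idx]."""
    order = sorted(range(len(points)),
                   key=lambda j: math.dist(points[idx], points[j]))
    return set(order[1:k + 1])            # skip the point itself

def consistency(q_pts, r_pts, k=10):
    """Fraction of matches whose k-NN neighbourhoods agree between
    query and result keyframe; match i pairs q_pts[i] with r_pts[i]."""
    n = len(q_pts)
    ok = sum(1 for i in range(n) if knn(q_pts, i, k) & knn(r_pts, i, k))
    return ok / n
```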
2.4. Advertisement association optimization
In AdWords [1], each advertiser places bids on a number of keywords and specifies a maximum daily budget. The objective is to maximize the total revenue while respecting ad relevance.
In our framework, let $A = \{a_i\}_{i=1}^{N_a}$ denote the keywords (products) bid by advertisers, containing $N_a$ keywords. Let $B = \{b_j\}_{j=1}^{N_b}$ denote all possible insertion points in the video corpus, containing $N_b$ insertion points. Let $C = \{c_k\}_{k=1}^{N_c}$ denote the candidate ad images for insertion. Online ad insertion can be described as the selection of $M$ keywords, $N$ insertion points and $N$ ad images from $A$, $B$ and $C$; $M$ and $N$ can be given by the publisher of VideoAder. The objective of ad association is to maximize the total revenue while respecting ad relevance. Ad relevance can be measured by two factors: one is the contextual relevance between a keyword and an insertion point; the other is the local similarity between an ad image and the keyframe at the insertion point. Thus, we introduce the following three items for optimization. Let $R_b(a_i)$ denote the daily budget of keyword $a_i$, and $R_r(a_i, b_j)$ the contextual relevance between keyword $a_i$ and insertion point $b_j$. Let $R_s(b_j, c_k)$ denote the local similarity between ad image $c_k$ and the keyframe at insertion point $b_j$. Local similarity requires that we preferentially select the products whose product images are most relevant to the video scenes (e.g., product images sharing the same viewpoint as the product in the video have higher priority than those with different viewpoints).

The ad association can now be formulated as an optimization problem that simultaneously maximizes the three items [22–24]. We introduce the design variables $X = (x_1, x_2, \ldots, x_{N_a})$, $x_i \in \{0,1\}$; $Y = (y_1, y_2, \ldots, y_{N_b})$, $y_j \in \{0,1\}$; and $Z = (z_1, z_2, \ldots, z_{N_c})$, $z_k \in \{0,1\}$, where $x_i$, $y_j$, $z_k$ indicate whether keyword $a_i$, insertion point $b_j$ and ad image $c_k$ are selected ($x_i = 1$, $y_j = 1$, $z_k = 1$) or not ($x_i = 0$, $y_j = 0$, $z_k = 0$). The optimization can be expressed as the following nonlinear 0-1 programming problem [20]:
$$\max_{x,y,z} f(x,y,z) = w_b \sum_{i=1}^{N_a} x_i R_b(a_i) + w_r \sum_{i=1}^{N_a}\sum_{j=1}^{N_b} x_i y_j R_r(a_i, b_j) + w_s \sum_{j=1}^{N_b}\sum_{k=1}^{N_c} y_j z_k R_s(b_j, c_k)$$

$$\text{s.t.}\quad \sum_{i=1}^{N_a} x_i = M,\quad \sum_{j=1}^{N_b} y_j = N,\quad \sum_{k=1}^{N_c} z_k = N,\quad x_i, y_j, z_k \in \{0,1\}$$
The parameters $(w_b, w_r, w_s)$ control the emphasis on daily budget and ad relevance, and satisfy the constraints $0 \le w_b, w_r, w_s \le 1$ and $w_b + w_r + w_s = 1$. They can be set according to the importance of each optimization item.
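Evaluating the objective f(x, y, z) for one candidate selection can be sketched as below; the example weight values are illustrative, not the ones used in the paper.

```python
def objective(x, y, z, Rb, Rr, Rs, wb=0.4, wr=0.3, ws=0.3):
    """f(x,y,z) = wb * sum_i x_i*Rb[i]
                + wr * sum_{i,j} x_i*y_j*Rr[i][j]
                + ws * sum_{j,k} y_j*z_k*Rs[j][k]
    with binary selection vectors x, y, z and wb+wr+ws = 1."""
    f = wb * sum(xi * Rb[i] for i, xi in enumerate(x))
    f += wr * sum(x[i] * y[j] * Rr[i][j]
                  for i in range(len(x)) for j in range(len(y)))
    f += ws * sum(y[j] * z[k] * Rs[j][k]
                  for j in range(len(y)) for k in range(len(z)))
    return f
```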
By examining Eq. (3), we can observe that there are $C_{N_a}^{M}\,C_{N_b}^{N}\,C_{N_c}^{N}\,M N N!$ solutions in total. The search space becomes tremendous when the dataset is large. For practical usage, we introduce a heuristic search algorithm to solve the optimization problem [21]. As described in Algorithm 1, the number of solutions can be significantly decreased to $O(N_a + N_b + N_c + M N_0 N_0)$.
Algorithm 1. The heuristic search algorithm for Eq. (3)

1. Initialize: set the labels of all elements in X, Y and Z to 0.
2. Rank all elements in X by $w_b R_b(a_i)$ in descending order, select the top $M$ elements, and set their labels to 1.
3. Rank all elements in Y by $w_b R_b(a_i) + w_r R_r(a_i, b_j)$ in descending order, and select the top $N_0$ ($N < N_0 \ll N_b$) elements.
4. For each $y_j$ among the top $N_0$ elements of Y, select the $z_k$ with the maximum $w_r R_r(a_i, b_j) + w_s R_s(b_j, c_k)$.
5. For each $x_i$ among the top $M$ elements of X, select the unselected $y_j$ and $z_k$ with the maximum $w_b R_b(a_i) + w_r R_r(a_i, b_j) + w_s R_s(b_j, c_k)$, and set the labels of $y_j$ and $z_k$ to 1.
6. Output all triples with $x_i = 1$, $y_j = 1$, $z_k = 1$.
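A compact sketch of this heuristic is given below. The extracted text leaves the keyword-to-insertion-point pairing ambiguous, so this sketch resolves it greedily per keyword and folds steps 3–5 into one greedy pass; that simplification is our assumption, not the paper's exact procedure.

```python
def heuristic_select(Rb, Rr, Rs, M, N, wb=0.4, wr=0.3, ws=0.3):
    """Greedy sketch of Algorithm 1: rank keywords by weighted budget,
    then let each selected keyword claim the best remaining insertion
    point / ad image pair. Returns (keyword, point, ad) index triples."""
    Na, Nb, Nc = len(Rb), len(Rr[0]), len(Rs[0])
    top_kw = sorted(range(Na), key=lambda i: wb * Rb[i], reverse=True)[:M]
    used_b, used_c, triples = set(), set(), []
    for i in top_kw:
        if len(triples) == N:
            break
        best = None
        for j in range(Nb):
            if j in used_b:
                continue
            for k in range(Nc):
                if k in used_c:
                    continue
                score = wb * Rb[i] + wr * Rr[i][j] + ws * Rs[j][k]
                if best is None or score > best[0]:
                    best = (score, j, k)
        if best:
            _, j, k = best
            used_b.add(j)
            used_c.add(k)
            triples.append((i, j, k))
    return triples
```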
3. Experiment
To ensure that our algorithm and methods work in practice, we conduct experiments on a web video set consisting of 307 videos collected from YouTube. In order to ensure that the results make a significant impact in practice, we concentrate on 10 popular and distinctive product queries for searching and annotation. Typical queries include iPod, iPhone, Camera, etc. All the ads are collected from Amazon. For each product query, we extract the most typical or representative product image from Amazon for the Single-Merge query representation. To obtain multi-view images of products with apparent 3D structure, we search Google Images to form the merged product query representation.
Fig. 4. Search result for three products: (a) search result of Amazon Kindle, (b) search result of iPod Nano 6th and (c) search result of Nikon P7000.
Fig. 5. The demo system of VideoAder: (a) main interface and (b) video displaying and advertising interface.
Acknowledgment
This work was supported by a grant from the National Natural
Science Foundation of China, No. 61172164.
References
[1] A. Mehta, A. Saberi, U. Vazirani, V. Vazirani, AdWords and generalized on-line matching, in: Proceedings of the 46th Annual IEEE Symposium on Foundations of Computer Science, Pittsburgh, USA, October 2005.
[2] T. Mei, X.-S. Hua, L. Yang, S. Li, VideoSense: towards effective online video advertising, ACM Multimedia, Augsburg, Germany, 2007.
[3] B.-J. Yi, J.-T. Lee, H.-W. Woo, H.-C. Rim, Contextual video advertising system using scene information inferred from video scripts, in: Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval, Geneva, Switzerland, July 2010.
[4] T. Lindeberg, Feature detection with automatic scale selection, Int. J. Comput. Vision 30 (2) (1998) 79–116.
[5] D.G. Lowe, Object recognition from local scale-invariant features, in: Proceedings of the International Conference on Computer Vision, Kerkyra, Greece, 1999.
[6] M. Wang, X.-S. Hua, J. Tang, R. Hong, Beyond distance measurement: constructing neighborhood similarity for video annotation, IEEE Trans. Multimedia 11 (3) (2009) 465–476.
[7] M. Wang, X.-S. Hua, Active learning in multimedia annotation and retrieval: a survey, ACM Trans. Intell. Syst. Technol. 2 (2) (2011) 10.
[8] M. Wang, X.-S. Hua, R. Hong, J. Tang, G.-J. Qi, Y. Song, Unified video annotation via multigraph learning, IEEE Trans. Circuits Syst. Video Technol. 19 (5) (2009) 733–746.
[9] R. Hong, M. Wang, M. Xu, S. Yan, T.-S. Chua, Dynamic captioning: video accessibility enhancement for hearing impairment, ACM Multimedia, Beijing, China, 2010.
[10] Z.-J. Zha, L. Yang, T. Mei, M. Wang, Z. Wang, T. Chua, X.-S. Hua, Visual query suggestion: towards capturing user intent in Internet image search, ACM TOMCCAP 6 (3) (2010).
[11] R. Hong, J. Tang, Z.-J. Zha, Z. Luo, T.-S. Chua, Mediapedia: mining web knowledge to construct multimedia encyclopedia, Lect. Notes Comput. Sci. 5916 (2010) 556–566.
[12] M. Wang, K. Yang, X.-S. Hua, H.-J. Zhang, Towards a diverse relevant search of social images, IEEE Trans. Multimedia 12 (8) (2010) 829–842.
[13] M. Wang, Y. Sheng, B. Liu, X.-S. Hua, In-image accessibility indication, IEEE Trans. Multimedia 12 (4) (2010) 330–336.
[14] M. Wang, X.-S. Hua, T. Mei, R. Hong, G.-J. Qi, Y. Song, L.R. Dai, Semi-supervised kernel density estimation for video annotation, Comput. Vis. Image Understand. 113 (3) (2009) 384–396.
[15] T. Mei, J. Guo, X.-S. Hua, F. Liu, AdOn: toward contextual overlay in-video advertising, Multimedia Syst. 16 (4) (2010) 335–344.
[16] AdSense. Available at: http://adsense.google.com
[17] R. Hong, J. Tang, H.-K. Tan, S. Yan, C.-W. Ngo, T.-S. Chua, Beyond search: event-driven summarization for web videos, ACM Trans. Multimedia Comput. Commun. Appl. 7 (4) (2011) 35.
[18] R. Hong, M. Wang, G. Li, Z.-J. Zha, T.-S. Chua, Multimedia question answering, IEEE Multimedia 19 (4) (2012) 72–78.
[19] YouTube. Available at: http://www.youtube.com
[20] Y. Gao, J. Tang, R. Hong, S. Yan, Q. Dai, N. Zhang, T.-S. Chua, Camera constraint-free view-based 3-D object retrieval, IEEE Trans. Image Process. 21 (4) (2012) 2269–2281.
[21] Hulu. Available at: http://www.hulu.com
[22] S.H. Srinivasan, N. Sawant, S. Wadhwa, vADeo: video advertising system, ACM Multimedia, Augsburg, Germany, 2007.
[23] S. Boyd, L. Vandenberghe, Convex Optimization, Cambridge University Press, UK, 2004.