Hong Chen1∗, Nan-Ning Zheng1, Lin Liang2, Yan Li2, Ying-Qing Xu2, Heung-Yeung Shum2
1 Xi'an Jiaotong University
2 Microsoft Research, Asia
ABSTRACT
In this paper, we present PicToon, a cartoon system which can generate a personalized cartoon face from an input picture. PicToon is easy to use and requires little user interaction. Our system consists of three major components: an image-based Cartoon Generator, an interactive Cartoon Editor for exaggeration, and a speech-driven Cartoon Animator. First, to capture an artistic style, cartoon generation is decoupled into two processes: sketch generation and stroke rendering. An example-based approach is taken to automatically generate sketch lines which depict the facial structure. Inhomogeneous non-parametric sampling combined with a flexible facial template is employed to extract the vector-based facial sketch. Various styles of strokes can then be applied. Second, with the pre-designed templates in Cartoon Editor, the user can easily make the cartoon exaggerated or more expressive. Third, a real-time lip-syncing algorithm is developed that recovers a statistical audio-visual mapping between the character's voice and the corresponding lip configuration. Experimental results demonstrate the effectiveness of our system.

Categories and Subject Descriptors
I.3.3 [Computer Graphics]: Picture/Image Generation; I.6.3 [Computer Graphics]: Methodology and Techniques; I.4.8 [Image Processing and Computer Vision]: General

General Terms
Applications

Keywords
User interfaces, multi-modal interaction and integration, example-based learning, non-parametric sampling, lip-syncing.

1. INTRODUCTION
People love cartoons. Cartoons are humorous, satirical, and at times opinionated. Drawing cartoons is, however, not easy. Only those well-trained artists who possess this great skill can do it well. Recently, many technologies have been developed to make it possible for a skilled cartoonist to work entirely on the computer. Such technologies include stroke rendering [10, 16, 17, 6] and tone control [16, 17, 6, 21]. By integrating these rendering technologies, various animation systems [11, 20, 13, 19] have been developed for interactive cartoon design. Although these systems provide various drawing templates, it is still difficult for an ordinary user to create a "personalized cartoon", i.e., a cartoon resembling a particular person.

The PicToon system is designed to let ordinary people create a personalized cartoon easily. There are two design goals for PicToon:

• Creating personalized cartoons
• Making the system easy to use

To create a personalized cartoon, we adopt an image-based approach. Specifically, we propose an example-based learning approach to generate a cartoon sketch from an input face image, by observing how a particular artist would draw cartoons from training face images. Because it is difficult to describe the rules for how an artist draws, we use a non-parametric sampling scheme that captures the relationship between the training images and their corresponding sketches drawn by the same artist.

The generated cartoon sketch can then be enhanced by adding stroke styles with a stroke model like the one in [10]. The cartoon face can then be easily exaggerated by interactively applying pre-designed facial expression templates. To animate the cartoon face, a real-time lip-syncing algorithm is applied to automatically generate cartoon animation. Different facial expressions can be combined with lip-syncing to make the animation more lively and believable.

The rest of this paper is organized as follows. In the next section we give an overview of related work. The system architecture is presented in Section 3. We then introduce the key technologies of our system, followed by a set of examples and demos. Finally, we sum up the features and benefits of our system and outline some future work.

∗This work was done while Hong Chen was a visiting student at Microsoft Research Asia.
2. RELATED WORK
Many cartoon creation systems [11, 20, 13, 19] focus on low-level
editing and control tools for easy and flexible cartoon drawing and
animation. Inkwell [13], for example, developed effective techniques for exploiting layers and grouping/hierarchy of components, and for manipulating motion functions. Similarly, the
CharToon system [20] provided special skeleton-driven components,
[Figure: system diagram with components Cartoon Generator, Cartoon Editor, Exaggerated Templates Library, Exaggerated Cartoon, Cartoon Animator, and Interactive Hair Contour Extraction.]
Figure 3: Our cartoon generation approach consists of two steps: sketch generation and stroke rendering.
4. CARTOON GENERATOR
4.1 Decoupling Cartoon Generation
According to artistic drawing books [12, 15, 7], there are two key aspects of a face drawing: the line sketch and the stroke style. As shown in Figure 2, the sketch suggests the fundamental visual perception characteristics of a face: simple lines depict the global facial structure and highlight subtle but important facial features such as double eye lines. Note that the sketch is drawn in a plain style without shape exaggeration. On the other hand, stroke styles, such as pencil, ink or brush, are used by the artist to describe his or her perception in different ways.

Thus we naturally decouple cartoon generation into two stages: sketch generation and stroke rendering, as shown in Figure 3.

Sketch generation creates a vector-based sketch. The sketch lines encode the drawing language of the artist: where and how facial lines should be drawn to describe the facial structure. Since there is no precise rule of grammar in such a language, we have chosen to take an example-based learning approach.

Stroke rendering turns the sketch into a cartoon using "artistic strokes". Existing stroke rendering techniques can be applied; in this paper, a stroke model similar to [10] is used.

Our decoupling approach has several advantages. By separating the sketch from the stroke styles, it is easy to generate a faithful sketch with an example-based learning scheme. Moreover, it is easy to create different stylistic cartoons from the same sketch simply by applying different stroke styles.

4.2 Sketch Generation
4.2.1 An Example-based Approach
To construct a sketch, our example-based approach requires a set of training examples: face images and their corresponding sketches drawn by an artist, as shown in Figure 3.

In this work, an inhomogeneous Markov Random Field (MRF) model is used to describe the statistical relationship between the sketch and the facial image. The probability distribution of a sketch point depends on its neighborhood in the facial image. To capture the inhomogeneity of facial features, the probability distribution is also related to the pixel's position.

To construct such a distribution, an inhomogeneous non-parametric sampling strategy is employed: only the points at the same facial location as a sketch point, taken from different training images, are sampled.

Since the input image and the training images are usually not aligned with each other, we warp all images to the mean shape to determine the correspondence of facial points. A constrained AAM model [5] is employed to locate the feature points required by the warping process.

Once the probability distribution for each pixel is obtained, we integrate the distributions to get the "expected sketch image"; template fitting is then employed to extract the vector-based sketch.

The hair of different people does not have consistent statistical properties or a clear correspondence, so the hair is not part of the example-based approach; it is added in a post-processing step by tracing the hair contour. The whole procedure is shown in Figure 3. Detailed algorithms are explained in the following sections.

4.2.2 Interactive face alignment
Current alignment algorithms are not robust enough to locate the feature points properly in a fully automatic way, but a little user interaction greatly improves the result. Thus a constrained AAM model [5] is adopted in our system. The user can modify the alignment result by dragging facial points to the expected positions. These constraints are then added as an additional energy term in the posterior energy, and the optimal solution is searched for again by a gradient descent algorithm. User-input constraints help the algorithm escape from local minima. With little user interaction, much better and more robust alignment results can be obtained.
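The user-constrained fitting just described can be sketched numerically: a quadratic penalty pulls user-dragged points toward their targets while the model's own energy is re-minimized by gradient descent. This is a minimal illustration under stated assumptions, not the actual constrained AAM of [5]; `model_energy`, `lam`, and all values below are toy stand-ins.

```python
import numpy as np

def model_energy(points):
    # Toy stand-in for the AAM posterior energy: pulls all points
    # toward a hypothetical mean shape at the origin.
    return 0.5 * np.sum(points ** 2)

def constrained_energy(points, constraints, lam=10.0):
    """Model energy plus a quadratic user-constraint term.

    constraints: {point_index: (x, y)} targets from user dragging.
    """
    e = model_energy(points)
    for idx, target in constraints.items():
        e += lam * np.sum((points[idx] - np.asarray(target)) ** 2)
    return e

def refine(points, constraints, lr=0.01, steps=500, eps=1e-5):
    """Re-minimize the constrained energy by numerical gradient descent."""
    p = np.asarray(points, dtype=float).copy()
    for _ in range(steps):
        grad = np.zeros_like(p)
        base = constrained_energy(p, constraints)
        flat = grad.ravel()
        for i in range(p.size):
            dp = p.ravel().copy()
            dp[i] += eps
            flat[i] = (constrained_energy(dp.reshape(p.shape),
                                          constraints) - base) / eps
        p -= lr * grad
    return p
```

With the penalty weight `lam` large relative to the model term, a dragged point settles near its user-given target while unconstrained points follow the model, mirroring how the constraints steer the fit out of local minima.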
Figure 4: Generated Cartoon. (a) Original image; (b) Generated sketch; (c) and (d) Cartoon rendered with two different strokes.
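To make the inhomogeneous non-parametric sampling strategy described above concrete, the sketch below builds the distribution for one sketch position from neighborhoods taken at the same facial location in the training images: the k nearest neighborhoods are weighted by exp(−d/T) and normalized, as in Eq. (1). All arrays, names, and the squared-distance metric are illustrative assumptions (the paper computes the neighborhood distance via cross-correlation).

```python
import numpy as np

def sampling_distribution(query_patch, train_patches, train_sketch_vals,
                          k=3, T=0.1):
    """Return (values, weights): a discrete distribution over the sketch
    values observed at this facial location in the training set."""
    query = query_patch.ravel()
    # Neighborhood distance; plain squared distance is a stand-in for
    # the cross-correlation-based distance used in the paper.
    dists = np.array([np.sum((p.ravel() - query) ** 2)
                      for p in train_patches])
    nearest = np.argsort(dists)[:k]            # k nearest neighborhoods
    weights = np.exp(-dists[nearest] / T)      # Eq. (1), un-normalized
    weights /= weights.sum()                   # divide by Z_q
    vals = np.asarray(train_sketch_vals, dtype=float)
    return vals[nearest], weights
```

Sampling a sketch value from this discrete distribution (or synthesizing a 3 × 3 patch at a time, as the paper does for speed) then fills in the sketch point by point.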
4.2.3 Non-parametric sampling
Learning the above MRF model parameters is very complex. Inspired by a non-parametric sampling method successfully used in texture synthesis [8], we employ an inhomogeneous non-parametric sampling scheme.

At a point q in sketch S, we want to construct the distribution p_q(S(q) | N_I(q)) given its neighborhood in the facial image. First, the sampling set Ω_q is constructed from the training exemplars at the corresponding position. Then the k nearest neighbors in Ω_q are selected to construct p_q(S(q) | N_I(q)). The probability α_{qi} of the sketch point q taking sketch pixel v_{qi}'s value is inversely proportional to the neighborhood distance:

\alpha_{qi} = \frac{1}{Z_q} \exp\left( -\, d(N_I(q), N_I(v_{qi})) / T \right) \quad (1)

where the square neighborhood distance d(N_I(q), N_I(v_{qi})) is calculated using cross-correlation, Z_q is the normalizing constant, and the temperature T controls the smoothness of the distribution. Thus we get the non-parametric distribution:

p_q(S(q) \mid I) = \sum_{i=1}^{k} \alpha_{qi} \, \delta(q - v_{qi}) \quad (2)

where δ(·) is the Delta function.

Inspired by a patch-pasting method successfully used in texture synthesis [23], we extend the synthesized unit from a pixel to a small 3 × 3 square patch to accelerate the sampling procedure.

4.2.4 Template Fitting
We obtain an "expected sketch image" by integrating the distribution of each point, and then apply template fitting to extract the final vector-based facial sketch. To preserve the artist's drawing style, template fitting is based on a flexible sketch model. The sketch model is represented as a set that contains a fixed number of lines. Each line is defined by an on-off switch and the positions of its control points. The on-off switch is necessary to capture the subtle differences of facial features, such as the presence of double eye lines.

To fit the sketch model to the "expected sketch image", we formulate the task as the minimization of an energy function, which consists of two components: prior energy and likelihood energy.

We first build a prior model from the training sketches. Similar to the ASM model [4], the coordinates of the control points in each line are modelled by a Gaussian distribution. For the on-off switch, we have defined three types of lines in our system: always appearing; probably appearing but independent of other lines; and dependent on other lines. We learn the probability of each type from the training examples. We define the difference between the "expected sketch image" and the generated sketch as the likelihood energy.

During the fitting procedure, we directly sample from the prior model, followed by a local search to determine each line's parameters more precisely. A final generated facial sketch is shown in Figure 4(b).

4.2.5 Interactive Hair Contour Extraction
For hair contour extraction, we first apply color segmentation to find the hair region, then use edge tracing to get the hair contour.

To determine the color distributions of the hair region and the background more robustly, two brushes are provided for the user to mark each region interactively. According to the marked regions, the color distributions of the hair and background regions are calculated respectively. The pixels with smaller Mahalanobis distance to the hair color distribution are then connected into one hair region by the floodfill algorithm. Each color distribution is represented by a mixture of Gaussians; the number of kernels and their parameters are updated adaptively to the marked samples. With the two brushes, the user can modify the segmentation result interactively and easily. Figure 4(b) shows one hair contour extraction result automatically combined with the face.

4.3 Stroke Rendering
Stroke rendering places stylized strokes along a newly-created sketch. After consulting with cartoonists, we propose the following attributes for each stroke to simulate different styles:

• Texture: a reference texture image corresponding to a particular stroke style.
• Width: the width of the whole stroke, representing its thickness.
• Direction: the drawing direction of a stroke.
• Path: the skeleton of a stroke. The generated sketch lines are taken as stroke paths.

Stroke textures are obviously important. As shown in Figure 4, by applying different artistic strokes, the cartoon appearance can be significantly changed. Like [10], we can use stroke textures to simulate pen, ink or brush, and also produce various stroke styles, such as taper, flare, wiggle and so on.

Drawing styles are also reflected in the line width. For example, the face and hair contours are often rendered with thick strokes, while the strokes depicting the eyes, mouth and nose can be a little more slender. Other supporting lines, such as double-eye lines, are the thinnest.

The direction of a stroke is related to the selected stroke texture and the artist's habit. Stroke direction is introduced to properly place the texture. This is important for some stroke styles, such as a stroke texture with just one tapering end. As shown in Figure 5, different stroke texture directions generate different visual effects.

Figure 5: Cartoon face rendered with different stroke directions. (a) Stroke texture; (b) and (c) Rendered cartoon faces.

Usually a sketch line is rendered as one stroke, but in some cases a single sketch line may be drawn with multiple strokes by the artist to enhance the expressive power. For example, when rendering a cartoon face with the style shown in Figure 4(c)(d), the facial contour line is rendered with three separate strokes. For the hair, we automatically break the hair contour at strong corners and render it with the resulting short strokes. The hair appears more natural this way.

To stretch a texture map over the length of the stroke, each stroke path is resampled into even segments and described by a Catmull-Rom spline. The selected stroke texture is warped along the skeleton by a local parametric coordinate transformation and texture mapping [10].

5. CARTOON EDITOR
Cartoon Editor is a graphical editor that lets the user edit the cartoon face freely, i.e., change its size or shape, or redraw some facial parts. For convenience, various pre-designed exaggerated templates are supplied, including expressions such as smiling, rage, grimacing, etc. A tool to generate in-between expression states is also provided. Figure 6 shows the user interface of our cartoon editor.

Figure 6: Cartoon editor user interface.

Cartoon Editor groups facial lines into seven parts: face contour, mouth, nose, right eye, left eye, right eyebrow and left eyebrow. The editing is performed on the sketch generated by Cartoon Generator. To edit each facial part, the following manipulations are designed:

• Modify: The user can drag the control points of the sketch to modify its shape or scale, and rotate or move the selected facial part.
• Delete Lines: Delete original lines from the sketch.
• Add Lines: Add additional lines not defined in the sketch template.
• Apply Pre-designed Template: The user can select pre-designed templates to change the expression of the cartoon face or to exaggerate it.
• In-between Edit: The user can also generate in-between states by dragging a scroll bar.

Cartoon Editor records the changes for each facial part separately. To save the edited results as templates for reuse, three kinds of transformation modes can be set for each facial part:

• Replace: The absolute coordinates of the edited shape are saved in the template. When the template is reused, the facial part is totally replaced with the new shape.
• Add Changes: The difference between the new shape and the original shape is saved in the template. When using this mode, the shape difference is added to the cartoon face.
• Erase: The facial part will not be drawn.

With these simple tools, the user can design expressions for the cartoon or exaggerate it. Saved templates can easily be applied to any cartoon face and can create impressive effects. Figure 7 shows some results generated using pre-designed templates.

6. CARTOON ANIMATOR
To let users animate cartoons easily, we have developed lip-syncing technology which Cartoon Animator uses to drive cartoons by speech. The lip movements are automatically synthesized by our algorithm. In addition, expressions designed with Cartoon Editor are used as key frames in the animation.

These pre-designed expressions are all saved in the cartoon template library. During animation, Cartoon Animator selects appropriate expression templates from the library as key frames. Since all the cartoon faces are vector-based, facial animation is generated by morphing between these key frames.

The key to our speech-driven cartoon animator is a real-time lip-syncing algorithm. Instead of a conventional phoneme-viseme mapping (e.g., [3, 14, 22]), our algorithm uses an acoustic feature vector (e.g., the MFCC features used in speech recognition) as system input. The advantage of using the acoustic feature vector is that different languages (e.g., Chinese and English) do not require training different models.

... which is associated with a proto-lip template. We further assume that the random vector a_i for class i has a Gaussian distribution and that each dimension of the vector is distributed independently. By regression, we can compute the mean ā_ij and covariance σ_ij for each Gaussian model (for class i, dimension j). After the training process, we have the following model parameters:

• Proto-lip templates µ_i and their relative proportions π_i (i = 1...n).
• Means ā_ij and covariances σ_ij for the j-th dimension of the i-th class of the acoustic feature vector (i = 1...n; j = 1...18).

6.1.2 Audio to Visual Mapping
Given a new audio clip, we first segment the audio signal into frames of 40 ms each. Then the acoustic feature vector a for each frame is calculated as the system input. Since we assume each dimension of the acoustic feature vector is distributed independently, the likelihood p(a | µ_i) can be represented by:

p(a \mid \mu_i) = \prod_{j=1}^{18} \frac{1}{\sqrt{2\pi}\,\sigma_{ij}} \exp\left( -\frac{(a_j - \bar{a}_{ij})^2}{2\sigma_{ij}^2} \right) \quad (4)

According to Bayesian estimation, the posterior probability is

p(\mu_i \mid a) = \frac{p(a \mid \mu_i)\, p(\mu_i)}{\sum_{i=1}^{n} p(a \mid \mu_i)\, p(\mu_i)} \quad (5)

where p(µ_i) = π_i is the prior. Then the mapping result becomes
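Eqs. (4) and (5) can be evaluated directly for each audio frame; the sketch below does so in log space to avoid numerical underflow when the 18 per-dimension densities are multiplied. Function and variable names are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def lip_posterior(a, means, sigmas, priors):
    """Posterior over proto-lip templates for one audio frame.

    a:      (D,) acoustic feature vector (D = 18 in the paper).
    means:  (n, D) per-class means   (a_bar_ij).
    sigmas: (n, D) per-class spreads (sigma_ij).
    priors: (n,)  class proportions  (pi_i).
    Returns p(mu_i | a) as an (n,) vector.
    """
    # Eq. (4) in log form: the product of independent 1-D Gaussians
    # becomes a sum of log-densities.
    log_lik = -0.5 * np.sum(
        np.log(2.0 * np.pi * sigmas ** 2) + (a - means) ** 2 / sigmas ** 2,
        axis=1)
    # Eq. (5): Bayes' rule with p(mu_i) = pi_i, normalized over classes.
    log_post = log_lik + np.log(priors)
    log_post -= log_post.max()   # stabilize before exponentiating
    post = np.exp(log_post)
    return post / post.sum()
```

The mouth shape for the frame can then be taken as a posterior-weighted blend of the proto-lip templates, or simply the template with the highest posterior.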
[7] B. Edwards. The New Drawing on the Right Side of the Brain. J. P. Tarcher, 1999.