
Abstract

The Captchacker Project exploits the potential of
Support Vector Machines to break visual captchas. We
define "simulation-based" and "captcha-based" methods
to build our models.

Our method performs extremely well for breaking
easily segmentable captchas, with robust recognition.
We also give a generalization to non-easily segmentable
(thus harder to break) captchas, using Dynamic
Programming. Besides, we explain how to get a large
training set without going through the long and boring
task of manual labelling.
Finally, our source code is available (for free!) at [19].
1. Introduction
A CAPTCHA is a program that can generate and grade
tests that humans can pass but current computer programs
cannot [2].
The term CAPTCHA (for Completely Automated Public
Turing test to tell Computers and Humans Apart) was
coined in 2000 by Luis von Ahn, Manuel Blum, Nicholas
Hopper and John Langford of Carnegie Mellon University.
At the time, they developed the first CAPTCHA to be used
by Yahoo.
CAPTCHAs are widely used: spam comment
prevention, website registration protection, online polls,
dictionary attack prevention, search engine bots, worms
and spam... As the newspaper Le Monde pointed out [1], 200
billion spams are sent every day... so there's definitely a
mass market in here!
So if you wanna be able to viagra spam, hack your sista's
or gf's MSN account, check this paper out!
Section 2 gives a very quick overview of the common
methods in word recognition. In section 3, we explain how
to break a simple type of captchas: easily segmentable
captchas. In section 4, we move to a harder problem:
breaking non-easily segmentable captchas. Section 5
explains some of the future challenges captcha breakers
will have to overcome. Finally, we conclude this project in
section 6.
2. State of the art
To break captchas, a first class of algorithms is based on
geometric detections [13, 14, 15, 16, 17]. We did not
study these methods, because they are said to be not
robust, which sounds pretty intuitive. (We did not check
this: as this project is a school project, we had limited
time.)
Another class of algorithms is based on neural networks
[4, 5, 6, 7, 8, 9, 10, 11]. These methods are extremely
popular mostly thanks to gurus Y. LeCun and G. E. Hinton.
We encourage the reader to have a look at "convolutional
neural networks" and "deep-belief networks". In this
project, we decided not to implement these methods
despite their excellent results, because of their relative
opacity, and the difficulty of tuning parameters to get good
models. We nonetheless invite the reader to have a look at
Mike O'Neill's information about convolutional neural
network implementation [5], which is a good help to
understand neural networks at the beginning.
Finally, more common classification methods such as
Support Vector Machines are used [18], using subparts of
the captchas as input data. This is the method we have
chosen to study, on several types of captchas, easily and
non-easily segmentable (sections 3 and 4).
The Captchacker Project
March 2009.
Jean-Baptiste Fiot
École Centrale Paris
École Normale Supérieure de Cachan
jean-baptiste.fiot@student.ecp.fr
Rémi Paucher
École Centrale Paris
École Normale Supérieure de Cachan
remi.paucher@student.ecp.fr
3. Easily segmentable Captchas
3.1. Captchas studied
As a study case, we focused on captchas from
Egoshare's website (http://www.egoshare.com/).
These captchas can be segmented by thresholding the
intensity and separating connected components.
Here are some Egoshare captchas:
They are small 80x25 images which always contain three
digits. Characters do not overlap, therefore segmentation is
quite easy.
3.2. Preprocessing
Segmentation was implemented in C++ with OpenCV
and consists of three steps:
1. First, we convert the image into a gray-scale one,
and we threshold it to remove the background;
2. Then, we calculate the three largest connected
components with both 4-connectivity and 8-
connectivity;
3. Finally, if there are fewer than three 8-connected
components, we have to consider the 4-connected
components. Besides, if the third (smallest) of the 4-
connected components is too small, this means that
a digit has been split into several different parts, so
we have to consider the 8-connected components.
This segmentation performs very well, with a success
rate of almost 100%. In our experiment, we tested it on
about 3000 captchas, and the segmentation technique
failed only once, on this captcha:
Using color information would easily solve this problem.
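The three preprocessing steps can be sketched as follows. This is a minimal pure-Python illustration, not the authors' C++/OpenCV code: the function names, the threshold of 128, the assumption that digits are darker than the background, and the `min_size` cut-off for a "too small" component are all hypothetical choices made for the example.

```python
from collections import deque

def threshold(gray, limit=128):
    """Binary mask: True where a pixel is darker than `limit` (assumed foreground)."""
    return [[px < limit for px in row] for row in gray]

def connected_components(mask, connectivity=4):
    """Return foreground components (lists of (row, col)), largest first."""
    if connectivity == 4:
        steps = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    else:  # 8-connectivity also follows the four diagonals
        steps = [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1)
                 if (dr, dc) != (0, 0)]
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    components = []
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c] or seen[r][c]:
                continue
            seen[r][c] = True
            queue, pixels = deque([(r, c)]), []
            while queue:  # breadth-first flood fill from (r, c)
                y, x = queue.popleft()
                pixels.append((y, x))
                for dy, dx in steps:
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < rows and 0 <= nx < cols
                            and mask[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            components.append(pixels)
    components.sort(key=len, reverse=True)
    return components

def segment_digits(gray, limit=128, min_size=3):
    """Pick three components following the rule of step 3, ordered left to right."""
    mask = threshold(gray, limit)
    four = connected_components(mask, 4)[:3]
    eight = connected_components(mask, 8)[:3]
    if len(eight) < 3:
        chosen = four      # digits merged through a diagonal contact
    elif len(four[2]) < min_size:
        chosen = eight     # a digit was split into fragments
    else:
        chosen = four
    return sorted(chosen, key=lambda comp: min(x for _, x in comp))
```

On an 80x25 Egoshare-style image, `segment_digits` would return up to three pixel lists, one per digit, which can then be cropped and passed to the classifier.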
3.3. Learning features
3.3.1 Support Vector Machines (SVM)
As explained in section 2, we decided to use SVM to
solve this classification problem. We wrote Python scripts
using the libSVM Python wrapper for this part.
The idea of SVM is to separate classes via a hyperplane.
The choice of this hyperplane is based on the maximization
of the margin (the minimum distance between the hyperplane
and the two classes).
In this picture, we have two classes; the black
hyperplane is the one maximizing the margin, thus chosen
to be the decision boundary. The circles with strong edges
are the "support vectors".
When the data are not linearly separable, we can either
use a higher dimensional space and/or authorize outliers.
Putting the data into a higher dimensional space can
make the classes linearly separable:
Thanks to this trick, the optimization algorithm in this
new space is similar to the linear case in the original space.
However, the mapping Φ is generally not so easy to find.
Another idea is to allow outliers, but this requires
choosing an error cost C. Choosing the cost C means
striking a happy medium: too low a cost generates many
outliers, and too high a cost generates overfitting.
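For reference, the role of the cost C can be read off the standard soft-margin (C-SVC) problem. The notation below is the usual textbook one and is not taken from the original text: w and b define the hyperplane, Φ is the map into the higher dimensional space, the y_i are labels in {-1, +1}, and the ξ_i are slack variables measuring how far each training point violates the margin.

```latex
\min_{w,\,b,\,\xi}\ \frac{1}{2}\lVert w\rVert^{2} + C\sum_{i=1}^{n}\xi_i
\quad\text{subject to}\quad
y_i\bigl(w^{\top}\Phi(x_i) + b\bigr) \ge 1 - \xi_i,
\qquad \xi_i \ge 0 .
```

A small C makes slack cheap, so many points may sit inside or beyond the margin (many outliers); a large C makes slack expensive, which pushes the boundary to fit every training point (overfitting).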
Here are some illustrations in 2D with two classes, with
respectively a low and a high error cost:
3.3.2 "Simulation-based" method
We call the following method "simulation-based"
because we use simulated training data to build our
models.
In our experiments, we generated the simulation-
based database with three different fonts (Californian FB,
Comic and Vera), each with two different thicknesses,
a rotation parameter ranging from -30° to 30°, and a scale
ranging from 17 to 22 pixels. We ended up with a database
containing 1116 images per character.
This diagram shows the basic idea of the font-based
database generation. Starting from font files, we create
rescaled and rotated digits. These digits are centered via a
simple C++ program we wrote, and their intensities are
normalized. Finally, we used these pictures to build the
SVM model.
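The size of the simulated database can be checked with a small enumeration of the parameter grid. The step sizes below (2 degrees for rotation, 1 pixel for scale) are assumptions: the text gives only the ranges, and these steps are simply one choice that reproduces the stated 1116 images per character. The thickness values are placeholders.

```python
from itertools import product

FONTS = ["Californian FB", "Comic", "Vera"]  # fonts named in the text
THICKNESSES = [1, 2]            # two thicknesses; actual values are not given
ROTATIONS = range(-30, 31, 2)   # assumed 2-degree step: 31 angles
SCALES = range(17, 23)          # assumed 1-pixel step: 17..22, 6 sizes

def variants():
    """Every (font, thickness, rotation, scale) combination for one character."""
    return list(product(FONTS, THICKNESSES, ROTATIONS, SCALES))

print(len(variants()))  # 3 * 2 * 31 * 6 = 1116
```

Each tuple would drive one rendering of a character, followed by the centering and intensity normalization described above.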
3.3.3 "Captcha-based" method
On the other hand, we worked on a "captcha-based"
method, meaning our training data was based on captchas
from the website.
As we did not want to label millions of captchas
manually, we used the following method. We build a first
model based on a few labelled captchas (typically a
hundred), we use it to label other captchas, we manually
correct the wrongly labelled ones, we reinject them in the
training set to build a better model, and continue this
iterative process until we have satisfying results. This trick
is called "active learning".
This method, less general than the simulation-based one,
obviously leads to better results (see next section for
performance analysis).
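The labelling loop of the captcha-based method can be sketched as follows, under loud assumptions: a toy nearest-mean classifier on one-dimensional features stands in for the SVM, and `human_label` is a hypothetical callback playing the role of the manual correction step. None of these names come from the original project.

```python
def nearest_mean_train(labelled):
    """Toy stand-in for SVM training: one mean feature value per class."""
    sums = {}
    for x, y in labelled:
        total, count = sums.get(y, (0.0, 0))
        sums[y] = (total + x, count + 1)
    return {y: total / count for y, (total, count) in sums.items()}

def nearest_mean_predict(model, x):
    """Predict the class whose mean is closest to x."""
    return min(model, key=lambda y: abs(model[y] - x))

def bootstrap_labelling(seed, unlabelled, human_label,
                        batch_size=2, target_accuracy=1.0):
    """Grow the training set batch by batch until a batch is labelled well enough."""
    labelled = list(seed)              # initial, manually labelled set
    pool = list(unlabelled)
    model = nearest_mean_train(labelled)
    while pool:
        batch, pool = pool[:batch_size], pool[batch_size:]
        guesses = [nearest_mean_predict(model, x) for x in batch]
        truth = [human_label(x) for x in batch]   # manual check and correction
        accuracy = sum(g == t for g, t in zip(guesses, truth)) / len(batch)
        labelled.extend(zip(batch, truth))        # reinject corrected samples
        model = nearest_mean_train(labelled)      # improved model
        if accuracy >= target_accuracy:           # "satisfying results"
            break
    return model, labelled
```

In practice only the wrongly labelled samples need a human keystroke, which is what makes the loop much cheaper than labelling every captcha by hand.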
[Diagram, simulation-based method: Font files -> Preprocessed digits -> SVM Model]
[Diagram, captcha-based method: Initial Labelled Captcha Set (manually or using models from section 3.3.2) -> Temporary SVM Model -> Automatically Labelled Captchas (using last computed model) -> Enlarged Training Set (wrongly labelled captchas are manually corrected) -> Improved SVM Model -> Satisfying results? NO: loop again; YES: Final SVM Model]
3.4. Performance
Once our two databases were generated, we had to
choose the parameters used by the C-Support Vector
Classification algorithm. Thanks to cross-validation, we
chose the model minimizing the empirical risk on a 714
captcha test set. Our test set was obviously randomly
selected, without any common data with the training set of
the captcha-based database; otherwise all the success rates
would have been biased.
3.4.1 Simulation-based database
Based on the training set described in section 3.3.2, we
trained a multiclass SVM. The error cost was found via a
k-fold validation.
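A k-fold search for a hyperparameter such as the error cost can be sketched as below. This is a generic illustration, not the authors' script: the helper names are hypothetical, and a toy nearest-neighbour classifier (whose parameter is the number of neighbours) replaces LibSVM so that the example stays self-contained. With LibSVM, the candidates would be values of C and `train_with` would train a C-SVC model.

```python
from collections import Counter

def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k interleaved folds of near-equal size."""
    return [list(range(start, n_samples, k)) for start in range(k)]

def cross_validate(samples, labels, k, train, predict):
    """Mean accuracy when each fold is held out once and the rest is used to train."""
    correct = 0
    for held_out in k_fold_indices(len(samples), k):
        held = set(held_out)
        kept = [i for i in range(len(samples)) if i not in held]
        model = train([samples[i] for i in kept], [labels[i] for i in kept])
        correct += sum(predict(model, samples[i]) == labels[i] for i in held_out)
    return correct / len(samples)

def select_parameter(samples, labels, candidates, k, train_with, predict):
    """Return the candidate value with the best cross-validated accuracy."""
    def score(value):
        return cross_validate(samples, labels, k,
                              lambda xs, ys: train_with(xs, ys, value), predict)
    return max(candidates, key=score)

# Toy stand-in for the SVM: nearest neighbours on 1-D features.
def knn_train(xs, ys, neighbours):
    return (neighbours, list(zip(xs, ys)))

def knn_predict(model, x):
    neighbours, points = model
    nearest = sorted(points, key=lambda p: abs(p[0] - x))[:neighbours]
    return Counter(label for _, label in nearest).most_common(1)[0][0]
```

Because every sample is held out exactly once, the accuracy estimate uses the whole training set without ever testing a model on data it was trained on.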
As far as the kernel type of the SVM is concerned, we
have four different choices in the LibSVM library:
Radial Basis Function (RBF),
polynomial kernel,
linear kernel,
sigmoid kernel.
Here is a chart illustrating the performance of these
different kernels:
The polynomial kernel (with a degree higher than 3)
performs slightly better than the RBF one. The last two
have worse performance.
These rates are success rates for fully decrypted
captchas, which means that the polynomial kernel
achieves a 92.2% success rate per character.
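The link between a per-captcha rate and a per-character rate can be made explicit under one simplifying assumption that the text does not state: the three digits of a captcha are recognized independently, each with the same success probability p. The chart value itself is not reproduced here, so the number below is only what that assumption would imply.

```latex
P_{\text{captcha}} = p^{3}
\quad\Longleftrightarrow\quad
p = P_{\text{captcha}}^{1/3},
\qquad\text{e.g.}\quad
p = 0.922 \;\Rightarrow\; P_{\text{captcha}} = 0.922^{3} \approx 0.784 .
```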
3.4.2 Captcha-based database
The captcha-based database was generated from 1896
labeled captchas, which makes about 570 images per
character. Thanks to cross-validation, we find the optimal
cost C=128.
The kernel performance chart has a similar shape:
As expected, we obtain much better results than with the
simulation-based training set, because we used training data
much closer to the testing data. However, this approach is
less general than the simulation-based one.
3.5. Other easily segmentable captchas
We tried our model with other easily segmentable
captchas, and it seems to work well; we do not have an
accurate success rate, as we have not manually labelled
captchas from these websites.
Here are some breakable examples from other websites:
This shows that our method is quite robust, and can give
good results even with a quite specific training set.
4. Non-easily segmentable Captchas
Since captchas are easily breakable once segmentation
is done, companies have begun to design more complex
captchas using overlap.
Examples from yahoo.com:
Examples from gmail.com:
In this section we studied the new Hotmail captchas,
which are of a kind that is hard to break. Here are examples:
4.1. Dynamic programming
4.1.1 Presentation and formalization of the problem
To break Hotmail captchas, we trained an SVM model on
our simulated database. Then, thanks to this model, we are
able to associate to each subwindow a prediction (the
likeliest class) as well as a score telling how sure we are
that this subwindow belongs to this class.
\[ \bigl\{\, ([A_i, B_i],\ s_i) \ \big/\ s_i \in [0, 1],\ (A_i, B_i) \in [0, K]^2 \,\bigr\} \]
Given a Hotmail captcha, we would like to segment the
image into six rectangular subwindows making a partition of
the captcha, that is to say finding six non-overlapping
segments [Ai, Bi] that entirely cover the original image and
maximize the sum of the likelihoods si over the six
subwindows.
This is basically the following discrete optimization
problem:
\[ \max_{(i_1, i_2, i_3, i_4, i_5, i_6)} \; s_{i_1} + s_{i_2} + s_{i_3} + s_{i_4} + s_{i_5} + s_{i_6} \]
given the constraints:
\[ \begin{cases} A_{i_1} = 0 \\ B_{i_6} = K \\ \forall k \in [1, 5],\ B_{i_k} = A_{i_{k+1}} \end{cases} \]
More visually, given a set of labeled segments, the
problem consists in choosing the set of six distinct
segments that entirely cover the whole segment.
In this example, the valid solution is the set of red
segments.
In our case, the set of segments will have all segments of
lengths from 9 (small characters like "3") to 30 (large
letters like "M" or "W").
4.1.2 Resolution of the problem
To solve this type of problem, scanning all possible
solutions one by one is not an option, since there is a very
large number of solutions, so the computation time would
be extremely high.
By "path to a point K", let us refer to a sequence of
non-overlapping segments making a partition of [0, K].
As we have two constraints (exactly six segments
and no overlapping), classic methods of dynamic
programming, which define a function at each point
telling which path is the best to get to this point, cannot be
applied directly and need to be improved.
To solve the problem, we associated to each point a
dictionary telling the optimal path to get to this point for a
path length from 1 to 6, each key L representing the
optimal path from 0 to this point of length L. The solution
of the problem is the sixth key of the dictionary at the point
of abscissa K (the best 6-long path to get to the last point).
The problem is now to determine this function at each
horizontal point of the image. The idea is that, given a new
segment i, we will consider the concatenation of the
previous optimal paths to Ai with [Ai, Bi] and we will see if
this path has a better score than the one in the dictionary of
point Bi. So here is the algorithm:
Sort the list of labeled segments with respect to Bi
Initialize the target function at point 0 (one key in the
dictionary representing the 0-long path)
For each segment I in the list of segments:
    For each path P in the dictionary of the point Ai:
        Let Lj be the length of the path;
        If Lj > 5, continue (the concatenation of the
        path and the segment will have a length higher
        than 6)
        Else, if the path of length Lj+1 in the dictionary
        of Bi has a lower score than the path P+I,
        replace it with the path P+I
So for each horizontal position in the image, we end up
with the optimal paths of all lengths from 1 to 6.
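The algorithm above can be sketched in Python as follows. This is an illustrative reimplementation, not the project's code: the function name, the tuple layout `(start, end, score, label)` and the `n_parts` parameter are assumptions made for the example. Segments are chained when the end of one equals the start of the next, as in the constraints of section 4.1.1.

```python
def best_partition(segments, width, n_parts=6):
    """Best chain of exactly `n_parts` segments covering [0, width].

    segments: iterable of (start, end, score, label) with start < end.
    Returns (total_score, [segments in order]) or None if no chain exists.
    """
    # best[x][length] = (score, path) of the best path of `length`
    # segments going from 0 to abscissa x.
    best = {0: {0: (0.0, [])}}
    # Sorting by end point guarantees that every path reaching `start`
    # is final before a segment leaving `start` is examined.
    for seg in sorted(segments, key=lambda s: s[1]):
        start, end, score, _label = seg
        for length, (path_score, path) in list(best.get(start, {}).items()):
            if length >= n_parts:
                continue  # adding a segment would exceed n_parts
            candidate = (path_score + score, path + [seg])
            slot = best.setdefault(end, {})
            if length + 1 not in slot or slot[length + 1][0] < candidate[0]:
                slot[length + 1] = candidate
    return best.get(width, {}).get(n_parts)


if __name__ == "__main__":
    demo = [
        (0, 3, 0.9, "a"), (3, 6, 0.8, "b"), (6, 10, 0.7, "c"),
        (0, 5, 0.95, "x"), (5, 10, 0.99, "y"),
        (3, 7, 0.1, "z"), (7, 10, 0.9, "w"),
    ]
    score, path = best_partition(demo, width=10, n_parts=3)
    print("".join(label for *_, label in path), round(score, 2))
```

In the Hotmail setting, `segments` would hold every subwindow of width 9 to 30 together with its SVM score and predicted class, `width` would be the image width K, and `n_parts` would stay at 6. Keeping one entry per path length at each abscissa is what enforces the "exactly six segments" constraint that a plain best-path table cannot express.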
This method is very efficient in terms of computation
time (O(n log(n))). The slow part was to determine all
scores on every subwindow. Indeed, we used here the
Python Imaging Library, so cropping images before
computing the score and then converting the image to a list
of coefficients was very slow, as these operations are not
handled natively. The best way would have been to do that
in C++.
4.1.3 Results
The model we choose has to be very good to give the
best performance. Indeed the score has to be very high if
and only if the right character has been predicted.
We noticed that the higher the number of classes is, the
harder it becomes to obtain good results. Indeed some
classes are not well detected (in our test the 'D' character
was always detected as a 'U'). When dealing with many
classes, it becomes slower and slower to build the model. It
took us one hour to build a 12 class model. Our model was
not so good, so we did not obtain very good results.
As we have just been lent a PC from our school's Math
Lab, we have launched a 36 class model computation,
with a much larger training set than we can afford on our
laptops. The results do not make any sense: the letter V is
over-represented in the results. Given the long computing
time to build one full 36 class model, we have not tried
other models with different parameters yet.
4.1.4 Improvement ways
To perform better regarding the model, our training set
needs to be improved. For example, we need to model the
distortion the same way that it is modeled in Hotmail
captchas. In our training set, we only considered cosine
distortions whereas distortion in Hotmail captchas is far
more complicated. This is why some characters are not
well detected in our model.
We also noticed that, despite the very poor recognition
results with the sliding window method, the captchas were
rather well segmented. In our tests our segmentation
success rate was about 40%. Therefore, we could use the
same captcha-based method to generate our training set
based on segmented characters in original captchas, as we
did with Egoshare captchas.
Once we have improved the 36 class model (see
4.1.3), we will try to use it to segment captchas
automatically, then build a captcha-based model, and apply the
active learning technique to get better and better models.
We will need to switch some Python parts to C++ (see
4.1.2), because so far the segmentation lasts about 40
seconds (on a 2 GHz PC, with the 12 class model built,
not the 36 class model).
5. Future battles in the world of Captchas.
As you know, the captcha world also has its battles.
Hackers want quick and robust guesses in their programs,
and designers want captchas that are as easy as possible
for humans to understand, and as hard as possible for algorithms.
In this project, we focused on visual captchas, but here
are some other ideas, more or less used: audio captchas,
animated captchas, ...
6. Conclusion
To cut a long story short, this project combined
theoretical studies and implementation well, on an exciting
subject: captcha breaking.
We would like to thank our instructor Iasonas
Kokkinos for all the time he spent and for his good advice.
References
[1] 200 milliards de spams sont envoyés chaque jour dans le
monde. http://www.lemonde.fr/technologies/article/2009/02/09/200-milliards-de-spams-sont-envoyes-chaque-jour-dans-le-monde_1152784_651865.html#ens_id=1152867
[2] http://www.lafdc.com/captcha/
[3] K. Chellapilla, P. Simard. Using Machine Learning to Break
Visual Human Interaction Proofs (HIPs).
[4] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner. Gradient-Based
Learning Applied to Document Recognition. Proc. of the
IEEE, November 1998.
[5] Mike O'Neill. Neural Network for Recognition of
Handwritten Digits.
[6] L. Bottou, Y. Bengio, Y. LeCun. Global Training of
Document Processing Systems using Graph Transformer
Networks.
[7] Y. LeCun, L. Bottou, Y. Bengio. Reading checks with
multilayer graph transformer networks.
[8] P. Simard, D. Steinkraus, J. Platt. Best Practices for
Convolutional Neural Networks Applied to Visual
Document Analysis.
[9] C. Burges, O. Matan, Y. LeCun, J. Denker, L. Jackel, C.
Stenard, C. Nohl, J. Ben. Shortest Path Segmentation: a
method for training a Neural Network to Recognize
Character Strings.
[10] D. You, G. Kim. An approach for locating segmentation
points of handwritten digit strings using a neural network.
[11] Geoffrey E. Hinton's homepage.
http://www.cs.toronto.edu/~hinton/
[12] G. Mori, J. Malik. Recognizing Objects in Adversarial Clutter:
Breaking a Visual Captcha.
[13] S. Huang, Y. Lee, G. Bell, Z. Ou. A projection-based
Segmentation Algorithm for Breaking MSN and Yahoo
captchas.
[14] J. Yan, A.S. El Ahmad. A low cost attack on a Microsoft
Captcha.
[15] J. Sadri, C. Y. Suen, Tien D. Bui. New Approach for
segmentation and recognition of handwritten numeral strings.
[16] E. Vellasques, L.S. Oliveira, A.S. Britto Jr, A.L. Koerich, R.
Sabourin. Filtering Segmentation cuts for digit string
recognition.
[17] L. de Oliveira, E. Lethelier, F. Bortolozzi, R. Sabourin.
Handwritten Digits Segmentation based on Structural
Approach.
[18] L. S. Oliveira, R. Sabourin. Support Vector Machines for
Handwritten Numerical String Recognition.
[19] The Captchacker Project Home Page
http://code.google.com/p/captchacker
