Support Vector Machines to break visual captchas. We define "simulation-based" and "captcha-based" methods to build our models. Our method performs extremely well for breaking easily segmentable captchas, with robust recognition. We also give a generalization to non-easily segmentable (and thus harder to break) captchas, using dynamic programming. Besides, we explain how to get a large training set without going through the long and boring task of manual labelling. Finally, our source code is available (for free!) at [19].

1. Introduction

A CAPTCHA is a program that can generate and grade tests that humans can pass but current computer programs cannot [2]. The term CAPTCHA (for Completely Automated Public Turing test to tell Computers and Humans Apart) was coined in 2000 by Luis von Ahn, Manuel Blum, Nicholas Hopper and John Langford of Carnegie Mellon University. At the time, they developed the first CAPTCHA to be used by Yahoo.

CAPTCHAs are widely used: spam comment prevention, website registration protection, online polls, dictionary attack prevention, protection against search engine bots, worms and spam... As the newspaper Le Monde pointed out [1], 200 billion spams are sent every day... so there's definitely a mass market in here! So if you wanna be able to viagra spam, hack your sista's or gf's MSN account, check this paper out!

Section 2 gives a very quick overview of the common methods in word recognition. In section 3, we explain how to break a simple type of captcha: easily segmentable captchas. In section 4, we move to a harder problem: breaking non-easily segmentable captchas. Section 5 explains some of the future challenges captcha breakers will have to overcome. Finally, we conclude this project in section 6.

2. State of the art

To break captchas, a first class of algorithms is based on geometric detections [13, 14, 15, 16, 17]. We did not study these methods, because they are said not to be robust, which sounds pretty intuitive. (We did not check this: as this is a school project, we had limited time.) Another class of algorithms is based on neural networks [4, 5, 6, 7, 8, 9, 10, 11].
These methods are extremely popular, mostly thanks to gurus Y. LeCun and G. E. Hinton. We encourage the reader to have a look at "convolutional neural networks" and "deep-belief networks". In this project, we decided not to implement these methods despite their excellent results, because of their relative opacity and the difficulty of tuning parameters to get good models. We nonetheless invite the reader to have a look at Mike O'Neill's information about convolutional neural network implementation [5], which is a good help for understanding neural networks at the beginning.

Finally, more common classification methods such as Support Vector Machines are used [18], using subparts of the captchas as input data. This is the method we have chosen to study, on several types of captchas, easily and non-easily segmentable (sections 3 and 4).

The Captchacker Project, March 2009.
Jean-Baptiste Fiot, École Centrale Paris, École Normale Supérieure de Cachan, jean-baptiste.fiot@student.ecp.fr
Rémi Paucher, École Centrale Paris, École Normale Supérieure de Cachan, remi.paucher@student.ecp.fr

3. Easily segmentable Captchas

3.1. Captchas studied

As a case study, we focused on captchas from Egoshare's website (http://www.egoshare.com/). These captchas can be segmented by thresholding the intensity and separating connected components. Here are some Egoshare captchas:

[Figure: sample Egoshare captchas]

They are small 80x25-pixel images which always contain three digits. Characters do not overlap, therefore segmentation is quite easy.

3.2. Preprocessing

Segmentation was implemented in C++ with OpenCV and consists of three steps:

1. First, we convert the image into a gray-scale one, and we threshold it to remove the background;
2. Then, we calculate the three largest connected components with both 4-connectivity and 8-connectivity;
3. Finally, if there are fewer than three 8-connected components, we have to consider the 4-connected components. Besides, if the smallest of the three 4-connected components is too small, this means that a digit has been split into several parts, so we have to consider the 8-connected components.

This segmentation performs very well, with a success rate of almost 100%. In our experiment, we tested it on about 3000 captchas, and the segmentation technique failed only once, on this captcha:

[Figure: the single captcha on which segmentation failed]

Using color information would easily solve this problem.

3.3. Learning features

3.3.1 Support Vector Machines (SVM)

As explained in section 2, we decided to use SVM to solve this classification problem. We wrote Python scripts using the libSVM Python wrapper for this part.

The idea of SVM is to separate classes via a hyperplane. The choice of this hyperplane is based on the maximization of the margin (the minimum distance between the hyperplane and the two classes).

[Figure: two classes separated by the maximum-margin hyperplane]

In this picture, we have two classes; the black hyperplane is the one maximizing the margin, and is thus chosen to be the decision boundary. The circles with strong edges are the "support vectors".

When the data are not linearly separable, we can use a higher dimensional space and/or authorize outliers. Putting the data into a higher dimensional space can make the classes linearly separable:

[Figure: mapping the data into a higher dimensional space]

Thanks to this trick, the optimization algorithm in this new space is similar to the linear case in the original space. However, the mapping into the higher dimensional space is generally not so easy to find.

Another idea is to allow outliers, but this implies choosing an error cost C. Choosing the cost C requires striking a happy medium: too low a cost generates many outliers, and too high a cost generates overfitting. Here are some illustrations in 2D with two classes, with a low and a high error cost respectively:

[Figure: decision boundaries with a low and a high error cost]

3.3.2 "Simulation-based" method

We call the following method "simulation-based" because we use simulated training data to build our models.
In our experiments, we generated the simulation-based database with three different fonts (Californian FB, Comic and Vera), each with two different thicknesses, a rotation parameter ranging from -30° to 30°, and a scale ranging from 17 to 22 pixels. We ended up with a database containing 1116 images per character.

[Figure: font-based database generation. Font files -> preprocessed digits -> SVM model]

This diagram shows the basic idea of the font-based database generation. Starting from font files, we create rescaled and rotated digits. These digits are centered via a simple C++ program we wrote, and their intensities are normalized. Finally, we used these pictures to build the SVM model.

3.3.3 "Captcha-based" method

On the other hand, we worked on a "captcha-based" method, meaning our training data was based on captchas from the website. As we did not want to label millions of captchas manually, we used the following method. We build a first model based on a few labelled captchas (typically a hundred), we use it to label other captchas, we manually correct the wrongly labelled ones, we reinject them in the training set to build a better model, and we continue this iterative process until we have satisfying results. This trick is called "active learning". This method, less general than the simulation-based one, obviously leads to better results (see next section for performance analysis).

[Figure: active learning loop. Initial labelled captcha set (labelled manually or using models from section 3.3.2) -> temporary SVM model -> automatically labelled captchas (using the last computed model) -> enlarged training set (wrongly labelled captchas are manually corrected) -> improved SVM model -> satisfying results? If no, repeat; if yes, final SVM model]

3.4. Performance

Once our two databases were generated, we had to choose the parameters used by the C-Support Vector Classification algorithm. Thanks to cross-validation, we chose the model minimizing the empirical risk on a 714-captcha test set.
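As a rough illustration of this selection step, the sketch below picks an error cost by k-fold cross-validation in plain Python. It is not the project's code: `k_fold_indices`, `select_cost` and the `train_and_score` callback are hypothetical names, and the callback stands in for whatever trains an SVM with a given cost (for example through the libSVM wrapper) and returns its accuracy on the held-out part.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Split range(n_samples) into k shuffled, disjoint folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def select_cost(samples, labels, costs, train_and_score, k=5):
    """Return (best_cost, mean_accuracy) over k validation folds.

    train_and_score(train_x, train_y, val_x, val_y, cost) is a hypothetical
    callback: it trains a classifier with error cost `cost` on the training
    part and returns its accuracy on the validation part.
    """
    folds = k_fold_indices(len(samples), k)
    best_cost, best_acc = None, -1.0
    for cost in costs:
        accuracies = []
        for fold in folds:
            held_out = set(fold)
            train_idx = [i for i in range(len(samples)) if i not in held_out]
            accuracies.append(train_and_score(
                [samples[i] for i in train_idx],
                [labels[i] for i in train_idx],
                [samples[i] for i in fold],
                [labels[i] for i in fold],
                cost,
            ))
        mean_acc = sum(accuracies) / len(accuracies)
        # Keep the first cost reaching the highest mean validation accuracy.
        if mean_acc > best_acc:
            best_cost, best_acc = cost, mean_acc
    return best_cost, best_acc
```

A typical call would scan a geometric grid of costs such as `[2 ** p for p in range(11)]`; that grid is an assumption made here for illustration, not a setting taken from the experiments.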
Our test set was of course randomly selected, without any data in common with the training set of the captcha-based database; otherwise all the success rates would have been biased.

3.4.1 Simulation-based database

Based on the training set described in section 3.3.2, we trained a multiclass SVM. The error cost was found via k-fold cross-validation. As far as the kernel type of the SVM is concerned, we have four different choices in the LibSVM library: Radial Basis Function (RBF), polynomial kernel, linear kernel and sigmoid kernel. Here is a chart illustrating the performance of these different kernels:

[Figure: success rate per kernel type, simulation-based database]

The polynomial kernel (with a degree higher than 3) performs slightly better than the RBF one. The last two have worse performance. These rates are success rates for fully decoded captchas, which means that the polynomial kernel reaches a 92.2% success rate per character.

3.4.2 Captcha-based database

The captcha-based database was generated from 1896 labelled captchas, which makes about 570 images per character. Thanks to cross-validation, we find the optimal cost C=128. The kernel performance chart has a similar shape:

[Figure: success rate per kernel type, captcha-based database]

As expected, we obtain much better results than with the simulation-based training set, because we used training data much closer to the testing data. However, this approach is less general than the simulation-based one.

3.5. Other easily segmentable captchas

We tried our model on other easily segmentable captchas, and it seems to work well. We do not have an accurate success rate, as we have not manually labelled captchas from these websites. Here are some breakable examples from other websites:

[Figure: breakable captchas from other websites]

This shows that our method is quite robust, and can give good results even with a quite specific training set.

4. Non-easily segmentable Captchas

Since captchas are easily breakable once segmentation is done, companies have begun to design more complex captchas in which characters overlap.
Examples from yahoo.com:

[Figure: sample Yahoo captchas]

Examples from gmail.com:

[Figure: sample Gmail captchas]

In this section we studied the new Hotmail captchas, a kind that is hard to break. Here are examples:

[Figure: sample Hotmail captchas]

4.1. Dynamic programming

4.1.1 Presentation and formalization of the problem

To break Hotmail captchas, we trained an SVM model on our simulated database. Then, thanks to this model, we are able to associate with each subwindow a prediction (the likeliest class) as well as a score telling how sure we are that this subwindow belongs to this class. Writing $w$ for the width of the image, this gives a set of labelled segments:

$$\{ ([A_i, B_i], s_i) \mid A_i, B_i \in [0, w],\ s_i \in [0, 1] \}$$

Given a Hotmail captcha, we would like to segment the image into six rectangular subwindows making a partition of the captcha, that is to say, to find six non-overlapping segments $[A_i, B_i]$ that entirely cover the original image and maximize the sum of the likelihoods $s_i$ over the six subwindows. This is basically the following discrete optimization problem:

$$\max_{(i_1, i_2, i_3, i_4, i_5, i_6)} \; s_{i_1} + s_{i_2} + s_{i_3} + s_{i_4} + s_{i_5} + s_{i_6}$$

given the constraints:

$$A_{i_1} = 0, \qquad B_{i_6} = w, \qquad \forall k \in [1, 5],\ B_{i_k} = A_{i_{k+1}}$$

More visually, given a set of labelled segments, the problem consists in choosing the set of six distinct segments that entirely cover the whole segment.

[Figure: candidate labelled segments over the image width; the valid solution is drawn in red]

In this example, the valid solution is the set of red segments. In our case, the set of segments contains all segments of lengths from 9 (small characters like "3") to 30 (large letters like "M" or "W").

4.1.2 Resolution of the problem

To solve this type of problem, scanning all possible solutions one by one is not an option, since there is a very large number of solutions, so the computation time would be extremely high. By "path to a point K", let us refer to a sequence of non-overlapping segments making a partition of [0, K].
As we have two constraints (exactly six segments, and no overlapping), classic dynamic programming methods, which define at each point a function telling which path is the best to get to this point, cannot be applied directly and need to be improved. To solve the problem, we associated with each point a dictionary giving the optimal path to this point for each path length from 1 to 6, each key L holding the optimal path of length L from 0 to this point. The solution of the problem is the sixth key of the dictionary at the point of abscissa $w$ (the best 6-long path to get to the last point).

The problem is now to determine this function at each horizontal point of the image. The idea is that, given a new segment i, we consider the concatenation of the previous optimal paths to $A_i$ with $[A_i, B_i]$, and we check whether this path has a better score than the one in the dictionary of point $B_i$. So here is the algorithm:

  Sort the list of labelled segments with respect to B_i
  Initialize the target function at point 0 (one key in the dictionary, representing the 0-long path)
  For each segment I in the list of segments:
    For each path P in the dictionary of the point A_i:
      Let L_P be the length of the path
      If L_P > 5, continue (the concatenation of the path and the segment would have a length higher than 6)
      Else, if the path of length L_P + 1 in the dictionary of B_i is missing or has a lower score than the path P + I, replace it with the path P + I

So for each horizontal position in the image, we end up with the optimal paths of all lengths from 1 to 6. This method is very efficient in terms of computation time (O(n log(n))). The slow part was to determine all scores on every subwindow. Indeed, we used the Python Imaging Library here, so cropping images before computing the score and then converting the image to a list of coefficients was very slow, since these operations are not handled natively. The best way would have been to do that in C++.
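The algorithm above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the project's actual code: the function name `best_partition`, the tuple layout `(start, end, label, score)` and the `n_chars` parameter are hypothetical, and the scores are assumed to have been computed beforehand by the classifier on every candidate subwindow.

```python
def best_partition(segments, width, n_chars=6):
    """Pick n_chars adjacent segments covering [0, width] with maximal total score.

    segments: iterable of (start, end, label, score) tuples with start < end.
    Returns (total_score, [(start, end, label), ...]), or None if no valid
    partition exists.
    """
    # best[x][n] = (score, path): best path made of n segments covering [0, x].
    best = {0: {0: (0.0, [])}}
    # Sorting by right end guarantees that every path reaching `start`
    # is final before a segment beginning at `start` is processed.
    for start, end, label, score in sorted(segments, key=lambda s: s[1]):
        if start not in best:
            continue  # no path reaches the left end of this segment
        for length, (path_score, path) in list(best[start].items()):
            if length >= n_chars:
                continue  # extending this path would exceed n_chars segments
            candidate = (path_score + score, path + [(start, end, label)])
            slot = best.setdefault(end, {})
            if length + 1 not in slot or slot[length + 1][0] < candidate[0]:
                slot[length + 1] = candidate
    return best.get(width, {}).get(n_chars)
```

For a captcha of width `w`, one would call `best_partition(segments, w)` and read the predicted text from the labels of the returned path. Each path is stored as a full list here for readability; keeping only a back-pointer per dictionary entry would use less memory.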
4.1.3 Results

The model we choose has to be very good to give the best performance. Indeed, the score has to be very high if and only if the right character has been predicted. We noticed that the higher the number of classes, the harder it becomes to obtain good results. Indeed, some classes are not well detected (in our test the 'D' character was always detected as a 'U').

When dealing with many classes, it becomes slower and slower to build the model. It took us one hour to build a 12-class model. Our model was not so good, so we did not obtain very good results.

As we have just been lent a PC from our school's Math Lab, we have launched a 36-class model computation, with a much larger training set than we can afford on our laptops. The results do not make any sense: a single letter is heavily over-represented in the results. Given the long computing time needed to build one full 36-class model, we have not tried other models with different parameters yet.

4.1.4 Improvement ways

To perform better regarding the model, our training set needs to be improved. For example, we need to model the distortion the same way that it is modelled in Hotmail captchas. In our training set, we only considered cosine distortions, whereas distortion in Hotmail captchas is far more complicated. This is why some characters are not well detected by our model.

We also noticed that, despite the very poor recognition results with the sliding window method, the captchas were rather well segmented. In our tests, our segmentation success rate was about 40%. Therefore, we could use the same captcha-based method to generate our training set based on segmented characters in original captchas, as we did with Egoshare captchas. Once we have improved the 36-class model (see 4.1.3), we will try to use it to segment captchas automatically, then build a captcha-based model, and apply the active learning technique to get better and better models.
We will need to switch some Python parts to C++ (see 4.1.2), because so far the segmentation takes about 40 seconds (on a 2 GHz PC, with the 12-class model, not the 36-class model).

5. Future battles in the world of Captchas

As you know, the captcha world also has its battles. Hackers want quick and robust guesses in their programs, and designers want captchas that are as easy as possible for humans to understand, and as hard as possible for algorithms. In this project, we focused on visual captchas, but here are some other ideas, more or less used: audio captchas, animated captchas, ...

6. Conclusion

To cut a long story short, this project combined theoretical studies and implementation well, on an exciting subject: captcha breaking.

We would like to thank our instructor Iasonas Kokkinos for all the time he spent and for his good advice.

References

[1] 200 milliards de spams sont envoyés chaque jour dans le monde. http://www.lemonde.fr/technologies/article/2009/02/09/200-milliards-de-spams-sont-envoyes-chaque-jour-dans-le-monde_1152784_651865.html#ens_id=1152867
[2] http://www.lafdc.com/captcha/
[3] K. Chellapilla, P. Simard. Using Machine Learning to Break Visual Human Interaction Proofs (HIPs).
[4] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner. Gradient-Based Learning Applied to Document Recognition. Proc. of the IEEE, November 1998.
[5] Mike O'Neill. Neural Network for Recognition of Handwritten Digits.
[6] L. Bottou, Y. Bengio, Y. LeCun. Global Training of Document Processing Systems using Graph Transformer Networks.
[7] Y. LeCun, L. Bottou, Y. Bengio. Reading checks with multilayer graph transformer networks.
[8] P. Simard, D. Steinkraus, J. Platt. Best Practices for Convolutional Neural Networks Applied to Visual Document Analysis.
[9] C. Burges, O. Matan, Y. LeCun, J. Denker, L. Jackel, C. Stenard, C. Nohl, J. Ben. Shortest Path Segmentation: a method for training a Neural Network to Recognize Character Strings.
[10] D. You, G. Kim. An approach for locating segmentation points of handwritten digit strings using a neural network.
[11] Geoffrey E. Hinton's homepage. http://www.cs.toronto.edu/~hinton/
[12] G. Mori, J. Malik. Recognizing Objects in Adversarial Clutter: Breaking a Visual Captcha.
[13] S. Huang, Y. Lee, G. Bell, Z. Ou. A projection-based Segmentation Algorithm for Breaking MSN and Yahoo captchas.
[14] J. Yan, A. S. El Ahmad. A low cost attack on a Microsoft Captcha.
[15] J. Sadri, C. Y. Suen, Tien D. Bui. New Approach for segmentation and recognition of handwritten numeral strings.
[16] E. Vellasques, L. S. Oliveira, A. S. Britto Jr, A. L. Koerich, R. Sabourin. Filtering Segmentation cuts for digit string recognition.
[17] L. de Oliveira, E. Lethelier, F. Bortolozzi, R. Sabourin. Handwritten Digits Segmentation based on Structural Approach.
[18] L. S. Oliveira, R. Sabourin. Support Vector Machines for Handwritten Numerical String Recognition.
[19] The Captchacker Project Home Page. http://code.google.com/p/captchacker