
Python OCR or how to break CAPTCHAs

http://blog.c22.cc/2010/10/12/python-ocr-or-how-to-break-captchas/
After my little stint writing the scr.im PoC script, a few people on Twitter reminded me of a
blog post that Andreas Riancho from Bonsai-sec wrote back in February. Andreas (the creator
of the excellent W3AF tool) wrote a short Python script to take a CAPTCHA image and perform
OCR on it. As a geek, I found this piqued my interest, but the problem I had with it was that
the script relied on the pytesser Python library, which is Windows only!
I had a few issues with that library:
1. It's Windows only, and I prefer to avoid Windows unless there's no other choice
2. The project only ever reached version 0.0.1
3. The project has been abandoned since May 2007

So, not wanting to give up on something that looked fun, and also useful, I started a search
for an alternative. I quickly found that the pytesser Python library is a wrapper around
the tesseract-ocr project, and that there had been some work on another Python library
called Python-Tesseract that looks like it does the job (and isn't platform dependent).
After installing tesseract-ocr (apt-get install tesseract-ocr on Backtrack) I downloaded the
Python-tesseract files and modified the script from Andreas Riancho a little (the actual
changes to make things work are minimal). I also changed a few things to get the script to
reasonably accurately decode scr.im captcha images.

#!/usr/bin/python
# [PoC] tesseract OCR script - tuned for scr.im captcha
#
# Chris John Riley
# blog.c22.cc
# contact [AT] c22 [DOT] cc
# 12/10/2010
# Version: 1.0
#
# Changelog
# 0.1> Initial version taken from Andreas Riancho's \
#      example script (bonsai-sec.com)
# 1.0> Altered to use Python-tesseract, tuned image \
#      manipulation for scr.im specific captchas
#

from PIL import Image

img = Image.open('captcha.jpg') # Your image here!
img = img.convert("RGBA")

pixdata = img.load()

# Make the letters bolder for easier recognition

# Pixels with a low red value become solid black
for y in xrange(img.size[1]):
    for x in xrange(img.size[0]):
        if pixdata[x, y][0] < 90:
            pixdata[x, y] = (0, 0, 0, 255)

# Pixels with a low green value become solid black
for y in xrange(img.size[1]):
    for x in xrange(img.size[0]):
        if pixdata[x, y][1] < 136:
            pixdata[x, y] = (0, 0, 0, 255)

# Any remaining pixel with a non-zero blue value becomes white
for y in xrange(img.size[1]):
    for x in xrange(img.size[0]):
        if pixdata[x, y][2] > 0:
            pixdata[x, y] = (255, 255, 255, 255)

img.save("input-black.gif", "GIF")

# Make the image bigger (needed for OCR)
im_orig = Image.open('input-black.gif')
big = im_orig.resize((1000, 500), Image.NEAREST)

ext = ".tif"
big.save("input-NEAREST" + ext)

# Perform OCR using tesseract-ocr library
from tesseract import image_to_string

image = Image.open('input-NEAREST.tif')
print image_to_string(image)
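
Each of the three threshold passes in the script looks only at the pixel it is currently visiting, so they can be read as a single per-pixel rule. The sketch below restates that rule in plain Python 3 with no imaging library, purely to make the logic easier to follow. The names binarize, binarize_rows, red_threshold and green_threshold are made up for this illustration and are not part of the original script or of any library; the values 90 and 136 are the ones used in the script.

```python
# A minimal restatement of the script's three threshold passes as one rule.
# All names here are hypothetical; only the thresholds come from the script.

BLACK = (0, 0, 0, 255)
WHITE = (255, 255, 255, 255)


def binarize(pixel, red_threshold=90, green_threshold=136):
    """Map one RGBA pixel the way the three passes would, taken together."""
    r, g, b, _alpha = pixel
    # Passes 1 and 2: a low red or low green value turns the pixel black.
    if r < red_threshold or g < green_threshold:
        return BLACK
    # Pass 3: any pixel that is not black and has some blue turns white.
    if b > 0:
        return WHITE
    # A pixel with blue == 0 that survived passes 1 and 2 is left untouched.
    return pixel


def binarize_rows(rows):
    """Apply binarize() to an image given as a list of rows of RGBA tuples."""
    return [[binarize(p) for p in row] for row in rows]
```

With the pixel access object from the script, the same rule could presumably be applied in a single loop as pixdata[x, y] = binarize(pixdata[x, y]), though the thresholds would still need tuning for any CAPTCHA other than scr.im.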

Most of the script is preparation; the actual OCR job is performed in the final lines using
the image_to_string call. Simple, isn't it?
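
If no Python binding is available, roughly the same final step can be done by calling the tesseract command-line tool directly. The following is a sketch in Python 3, assuming the tesseract binary is on the PATH and that it writes its result to a file named after the output base with a .txt suffix. The helper names build_tesseract_command and ocr_with_cli are invented for this example; this is not how Python-Tesseract itself is implemented.

```python
import os
import subprocess
import tempfile


def build_tesseract_command(image_path, output_base):
    """Return the argument list: tesseract <image> <output base>."""
    return ["tesseract", image_path, output_base]


def ocr_with_cli(image_path):
    """Run the tesseract CLI on image_path and return the recognised text.

    Assumes tesseract is installed and writes <output base>.txt.
    """
    with tempfile.TemporaryDirectory() as tmp:
        base = os.path.join(tmp, "ocr")
        subprocess.run(
            build_tesseract_command(image_path, base),
            check=True,                  # raise if tesseract exits with an error
            stdout=subprocess.DEVNULL,   # hide tesseract's banner output
            stderr=subprocess.DEVNULL,
        )
        with open(base + ".txt") as handle:
            return handle.read().strip()
```

Calling ocr_with_cli('input-NEAREST.tif') on the enlarged image saved by the script should then return whatever text tesseract recognises in it.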
The script is tuned to the scr.im captcha image, as the examples below show:

As you can see, after running it through some filters (thanks Andreas), the CAPTCHA becomes
a lot clearer, and significantly easier to OCR. Even in this case, however, tesseract-ocr
sometimes returns the value as W6BHP instead of W68HP. Still, that's an easy mistake to
make, and I'm sure that with more tweaking the preparation could be perfected!
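
One possible way to live with a B/8 mix-up like this, rather than tuning the filters further, is to generate every variant of the OCR result with look-alike characters swapped and try each one. This is only a sketch of that idea and not something the script above does. The B/8 pair comes from the example above; the O/0 pair and the names CONFUSABLE and candidates are assumptions added for illustration.

```python
from itertools import product

# Characters that may be mistaken for one another. Only B/8 is taken from
# the example in this post; O/0 is an illustrative assumption.
CONFUSABLE = {"B": "B8", "8": "8B", "O": "O0", "0": "0O"}


def candidates(text):
    """Return every string obtainable by swapping confusable characters.

    The original text is always the first entry in the result.
    """
    options = [CONFUSABLE.get(ch, ch) for ch in text]
    return ["".join(combo) for combo in product(*options)]


print(candidates("W6BHP"))  # prints ['W6BHP', 'W68HP']
```

The number of candidates doubles with every ambiguous character, so this only stays practical for short values such as the five-character one above.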
So, next time somebody says "we implemented a CAPTCHA to prevent scripted attacks", you
can take it with a pinch of salt!
Links:

[PoC] scr.im.tesseract.py script > here
Breaking Weak CAPTCHA in 26 Lines of Code > bonsai-sec.com
Pytesser > here
Tesseract-OCR > here
Python-Tesseract > here
