Async Hell: Gevent & Requests

Gevent and Requests are both absolutely fantastic libraries, for different reasons:
gevent is technically brilliant, and requests is brilliant from a design perspective.
Despite this, the combination of the two sent me all the way down the rabbit hole
and into my own little hell. Before I go on, let me distill the lesson of this episode
in a convenient little quip:
If you are profiling something, start by profiling it.
For about 3 of the past 4 weeks, I've been working on a pretty major project to
replace a large Twisted application we have with a small gevent application, and
it's been a huge success. The resultant code is fully usable in synchronous
contexts (like a REPL or most unit testing frameworks), was way faster in testing,
and since it was able to leverage more libraries for its work, about 1/5th the size.
It's been a huge success, that is, until Thursday, when I went to put it into
production.
This application handles ~250 jobs per second across a number of different
machines. Each Twisted-based worker process seems to have a capacity for
around 30 jobs per second, and operates at a load of ~20 at around 60% CPU
utilization. The new gevent-based workers could only handle about 5-10 per
second each, and pegged the CPU at full tilt boogie.
The logging I had enabled showed that we were resource starved; the new app
was unable to pull jobs in off of the queue at a fast enough rate. This sent me on a
long chase, separating out the job fetch portion to its own network daemon,
before finally finding the root of that issue (using channel.get, which did 4000
gets per 100 messages received, rather than channel.consume, which used
pre-fetching and made this almost instant) and pulling it back in process.
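For reference, the difference between the two approaches looks roughly like this. This is a minimal sketch using pika's blocking API rather than whatever AMQP client the worker actually used, and the queue name, connection details, and handle_job are all hypothetical:

    import pika

    # Hypothetical connection and queue; the real worker's details differ.
    connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
    channel = connection.channel()

    def handle_job(body):
        pass  # placeholder for actual job processing

    # Polling style: every basic_get is a full broker round-trip, and most of
    # them come back empty -- hence thousands of gets per hundred messages.
    method, properties, body = channel.basic_get(queue="jobs", auto_ack=True)
    if method is not None:
        handle_job(body)

    # Consumer style with prefetch: the broker pushes messages ahead of the
    # acknowledgements, so the worker almost never waits on the wire.
    channel.basic_qos(prefetch_count=100)

    def on_message(ch, method, properties, body):
        handle_job(body)
        ch.basic_ack(delivery_tag=method.delivery_tag)

    channel.basic_consume(queue="jobs", on_message_callback=on_message)
    channel.start_consuming()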
With the resource starvation problem nicked, I prepared to run my triumphant
production test. 5-10 jobs/sec.
I added more logging, this time around every requests.get and around all
potentially time-intensive input processing (json via ujson, xml via lxml), and
noticed that the worker stopped during the non-cooperative CPU-bound steps.
This was, of course, entirely expected; but the durations of these stops were 100x
longer than I'd experienced outside of this context.
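The logging itself was nothing fancy; a timing wrapper along these lines (a sketch, not the actual code) around each requests.get and each parse step:

    import logging
    import time
    from contextlib import contextmanager

    logger = logging.getLogger("worker")

    @contextmanager
    def timed(label):
        # Log the wall-clock duration of the wrapped block; the label names
        # the step being measured ("requests.get", "ujson.loads", ...).
        start = time.time()
        try:
            yield
        finally:
            logger.info("%s took %.3fs", label, time.time() - start)

    # with timed("requests.get"):
    #     response = requests.get(url)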
Now convinced that gevent had a major fault with running C-calling code, I
attempted to produce a synthetic example that would confirm my findings in a
way I could easily share and debug. I wrote a little program which read 50 xml
feeds and ran speedparser on all of them; first, serially in a normal loop, and then
"asynchronously" via a gevent pool. I ran my test, expecting this gevent bug to
show itself, but the timings were nearly identical. "Maybe", I thought, "this only
happens with network traffic."
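The synthetic test amounted to something like the following; the load_feeds helper is assumed, and speedparser.parse is used here on the strength of its feedparser-like API:

    import time

    from gevent import monkey
    monkey.patch_all()

    from gevent.pool import Pool
    import speedparser

    def load_feeds():
        """Assumed helper: return the raw bodies of the 50 cached xml feeds."""
        ...

    def parse(document):
        return speedparser.parse(document)

    feeds = load_feeds()

    start = time.time()
    serial = [parse(f) for f in feeds]
    print("serial: %.2fs" % (time.time() - start))

    start = time.time()
    pooled = Pool(20).map(parse, feeds)
    print("gevent: %.2fs" % (time.time() - start))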
So I wrote a gevent-based wsgi application which served these feeds from
memory, loaded the feeds off of that, and the timings still ended up being nearly
identical. Exasperated, and with nowhere else to turn, I decided to write a cProfile
mode for my app and see what kinds of functions we were calling.
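The "cProfile mode" boiled down to wrapping the worker's main loop in a profiler and dumping the stats to disk for later exploration with pstats; roughly (run_worker and the filename are placeholders):

    import cProfile
    import pstats

    profiler = cProfile.Profile()
    profiler.enable()
    run_worker()                  # placeholder for the worker's main loop
    profiler.disable()
    profiler.dump_stats("worker.prof")

    # Later, in a REPL:
    stats = pstats.Stats("worker.prof")
    stats.sort_stats("cumulative").print_stats(25)
    stats.print_callees("_parse")   # where does _parse's time actually go?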
The first thing I noticed was that, somewhere down the line, I had been using
response.text.encode("utf-8"), and text in turn was using chardet to
detect the character encoding. I recalled from my many hours of performance
testing while writing speedparser that chardet can be an absolute dog, so,
excitedly assuming I'd had the problem nicked again, I replaced that with the
un-encoded response.content and gave it a run. 2-3 jobs/sec.
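The swap itself was trivial; something like this, where the url and the ujson.loads call stand in for the real request handling:

    import requests
    import ujson

    url = "http://example.com/some.json"   # placeholder

    response = requests.get(url)

    # response.text decodes the body, and when the server declares no
    # charset, requests falls back to character detection -- which can be
    # painfully slow on large documents.
    parsed = ujson.loads(response.text.encode("utf-8"))

    # response.content is the raw, undecoded byte string; ujson and lxml
    # both accept bytes directly, so the detection pass is skipped.
    parsed = ujson.loads(response.content)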
Deflated, embarrassed, and depressed, I went back to my profiler. json and xml
were still taking far too long... In my encoding speed tests, ujson was decoding
over 100MB/sec, but my printed timing information showed it taking sometimes
up to a second for even small (50 kB) requests. Furthermore, the separation
between the log timestamps showed that nothing else was done during this time.

When I went back into the pstats output, I noticed that cProfile disagreed; very
little time was spent in ujson. Since ujson is a bit of a black box to the profiler, I
turned my attention to speedparser, which uses considerably more python code,
and while examining the callees noticed something very peculiar indeed:
    Function                  called...
                                  ncalls  tottime  cumtime
    .../feed.py:48(_parse) ->     14      0.001    0.323    .../feedlib.py:55(speedparse_s
                                  14      0.005    0.024    .../feedlib.py:98(postprocess
                                  14      0.000   11.976    .../requests/models.py:759

Why was my completely non-cooperative CPU-bound feed parser (which itself
processes ~60MB/sec) spending 12 seconds in the requests library over 14 calls?
Digging deeper, I discovered that content is actually a property which returns a
lazily-loaded iterator, which itself reads chunk-wise from the raw socket. In 1-byte
chunks. Since saving memory is not a requirement for me, I wrote my own
Response which just reads the whole thing in at once, and once again went to the
server to test it out. 65-70 jobs/sec.
Note: Since the publication of this blog entry, requests has
changed the generator to read the content in 10k chunks. This was
still a performance problem for me, and I wrote a small wrapper
to load it in 500k chunks instead.
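With a current requests, a wrapper along those lines can be sketched with stream=True and iter_content; the 500k chunk size mirrors the note above, and everything else here is illustrative:

    import requests

    def get_body(url, chunk_size=500 * 1024):
        # Stream the response and drain it from the socket in large chunks,
        # rather than letting .content drip-feed it through tiny reads.
        response = requests.get(url, stream=True)
        return b"".join(response.iter_content(chunk_size=chunk_size))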
What I suspect was happening, but cannot confirm, is that the generator returned
by response.content made it into a GIL-holding section of one of these C
libraries, which then yielded (via the generator's use of read) back to gevent's
hub, which then went loco for a bit until it chanced upon the right greenthread,
read more, and repeated, until the content was totally loaded. This kept any code
from executing during these CPU-bound critical sections, but had the undesired
side effect of adding potentially thousands of random delays. This kept the log
clean of any other goings-on during that interval, which led me to mistakenly
believe that this code was running unnaturally slowly. The profiler was right; that
code actually spent very little time running; it spent most of its time waiting on
read calls.

So now that this is all in the past, what could I have done to discover this in less
than 50 hours of debugging and hair pulling? Despite it seeming like the
proverbial howitzer when a simple pistol would do, I really should reach for the
profiler earlier in the future. The profiling code and subsequent pstats exploration
actually only took about 2 hours; and indeed, my only other major breakthrough
(the amqp bug) was discovered using a python-based gevent profiler, which sadly
slowed down my app a bit too much to be practical.
In addition, it's made me appreciate optimization and how it can really bite you
unknowingly. The lazy chunked-read behavior of response.content is superior
to reading the whole response all the time, provided the resource you're
concerned about is memory. Laziness is generally considered a virtue (certainly,
by perl and haskell hackers), as it enables you to occasionally skip work
altogether. Ironically, it was the combination of requests' laziness and gevent's
cleverness that led to this incredible performance nightmare.
Apr 23 2012

by jason moiron, 2013
