Streaming JSON objects with python

Dear Lazyweb,

Suppose I'd like to create an HTTP service that gives you a looong list of JSON objects. For example, one JSON object for every package in Debian.

One way I could do that is to just transfer a big, dozens-of-megabytes JSON array with one element per package. With the standard json module in python, this requires the client to buffer all the data in memory before decoding, as the JSON result would be something like this:

[
  { info for package 1 },
  { info for package 2 },
  ...dozens of megabytes...
  { info for package 30000 },
  ...
]

However, it would be reasonable to engineer things so that the client can process packages one at a time, as they are produced. This can dramatically reduce client memory usage, as well as remove a large latency between the start of the request and the start of processing.

One trivial idea would be to just send JSON objects one after the other, and sort of call load multiple times. It doesn't work:

import json
from cStringIO import StringIO
a = StringIO()
json.dump({1:3}, a)
json.dump({2:4}, a)
a.reset()
print a.getvalue()
print json.load(a)

gives:

$ python z1.py
{"1": 3}{"2": 4}
Traceback (most recent call last):
  File "z1.py", line 8, in <module>
    print json.load(a)
  File "/usr/lib/python2.6/json/__init__.py", line 267, in load
    parse_constant=parse_constant, **kw)
  File "/usr/lib/python2.6/json/__init__.py", line 307, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python2.6/json/decoder.py", line 322, in decode
    raise ValueError(errmsg("Extra data", s, end, len(s)))
ValueError: Extra data: line 1 column 8 - line 1 column 16 (char 8 - 16)

And the json module documentation does not provide any way of telling the "stream" decoder to just stop parsing after the first valid object.
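
To be fair, the decoder class does expose a raw_decode() method that parses one object out of a string and reports where it stopped, but it works on in-memory strings rather than file objects, so the client would still have to roll its own read-and-buffer loop. Something along the lines of this hypothetical helper:

import json

def iter_objects(fp, bufsize=8192):
    # Hypothetical helper, not part of the json module: keep reading into a
    # string buffer and peel complete objects off its front with raw_decode(),
    # which returns the decoded object plus the offset where parsing stopped.
    decoder = json.JSONDecoder()
    buf = ""
    while True:
        chunk = fp.read(bufsize)
        if not chunk:
            if buf.strip():
                raise ValueError("truncated JSON object at end of stream")
            return
        buf += chunk
        while True:
            buf = buf.lstrip()
            if not buf:
                break
            try:
                obj, end = decoder.raw_decode(buf)
            except ValueError:
                # no complete object at the start of the buffer yet: read more
                break
            yield obj
            buf = buf[end:]

That works, but it is buffering and bookkeeping that every client would have to reimplement.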

An alternative can be to use chunked transfer encoding and transfer each JSON object in a separate HTTP chunk. It would add the overhead of a small chunk header per JSON object, which could be significant when generating lots of small JSON objects, but it would be an implementation that follows the HTTP and JSON standards to the letter, in a clean way.
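
To get an idea of the overhead, this is roughly what the chunked framing would look like if every JSON object went out as its own chunk (a hand-rolled sketch of the wire format, just for illustration; a real server would normally do the framing itself):

import json

def chunked_body(objects):
    # Frame each JSON object as one HTTP/1.1 chunk: the chunk size in hex,
    # CRLF, the chunk data, CRLF; a zero-sized chunk terminates the body.
    for obj in objects:
        data = json.dumps(obj)
        yield "%x\r\n%s\r\n" % (len(data), data)
    yield "0\r\n\r\n"

for piece in chunked_body([{"package": "dpkg"}, {"package": "apt"}]):
    print repr(piece)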

Except:

  1. How do you stream HTTP chunks in WSGI? The WSGI specification's only mention of chunks is that if you produce a result as a generator of "chunks", they are just sent one after the other (effectively, concatenated) in the response stream.
  2. How do you read and process one chunk at a time with urllib/urllib2? I searched on Google but I only found people in tears.

Another alternative is to wrap the JSON objects in some custom container protocol, but then we are not talking standards anymore.

Is this all you can do with the standard python library? Can this only be done by adding a dependency on a C library with two obscure, separately maintained bindings, or by writing my own JSON parser?

Ideas? Please mail enrico@enricozini.org and I'll add them here.

Updates

Wouter van Heyst mentions that he heard of ijson, which indeed has my exact use case in the synopsis, but after some research it looks like a third binding for yajl.

Aigars Mahinovs suggests using YAML instead of JSON. Indeed YAML supports streaming very well (it's even already supported in DDE), but last time I tried to do anything serious with it, PyYaml was slooow. Of course, this being the Python ecosystem, There Are Lots Of Other Ways To Do It And You Shall Be Patronised For Not Choosing The Proper One, but Debian only has PyYaml. Another itch with YAML is that there is no YAML parser in the standard python library, which requires clients to add an extra dependency.
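
For reference, streaming with PyYaml would look more or less like this (a minimal sketch, using an in-memory buffer in place of the HTTP connection):

import yaml
from cStringIO import StringIO

# Write each package as a separate YAML document, with explicit "---" markers...
buf = StringIO()
yaml.safe_dump_all([{"package": "dpkg"}, {"package": "apt"}], buf, explicit_start=True)
buf.reset()

# ...and read the documents back one at a time, without slurping the whole stream.
for doc in yaml.safe_load_all(buf):
    print doc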

Jyrki Pulliainen has so far mailed my favourite solution: disable formatting in the encoder and separate JSON objects with newlines:

we built a streaming JSON app as a part of our platform. However, we
didn't stream a single JSON object as that was a bit cumbersome. When
having syntax like:

[
  <json-obj1>,
  <json-obj2>,
  ....
  <json-objN>
]

The reader has some problems. Should it read the whole stream to find
the closing bracket ] even if the stream would be tens of megabytes?
Should it try to split on , and parse single objects? What if the
objects contain , so should there be some regexp magic?

We ended up streaming separate JSON objects separated by newlines
(\n). This way the reader can read until hitting \n or end of stream
and then split the result on newlines. The new stream looked like
this:

<json-obj1>\n
<json-obj2>\n
...
<json-objN>\n

This way we could efficiently stream hundreds of thousands of JSON
objects from server to client without reserving a load of memory. For
the WSGI part how to do this, the best would probably be to do a
generator that would yield a single JSON object with the newline (we
didn't use WSGI back then). Reading can be done using any reader
capable of handling streams. We used pycurl for reading, but I can't
really recommend it as it is way too complicated for a simple task
like this :p

Hopefully this helps!

This means that if the client has no access to a JSON reader that supports a stream of concatenated toplevel objects, they can still trivially code it on top of any JSON library. It also means that working clients can be written using just Squeeze's python standard library. That's what I'm going to do; I'll probably implement it as a separate 'jsons' output format for DDE, because producing a JSON response with multiple toplevel objects somehow breaks the JSON standard.
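
A minimal sketch of what that could look like on both ends (iter_packages() and the URL are placeholders, not DDE's actual code):

import json

def iter_packages():
    # Placeholder producer: in DDE this would yield one dict per Debian package.
    yield {"package": "dpkg"}
    yield {"package": "apt"}

def application(environ, start_response):
    # WSGI app: return a generator yielding one JSON object per line, so data
    # goes out as it is produced instead of being buffered into one big array.
    start_response("200 OK", [("Content-Type", "application/json")])
    return (json.dumps(pkg) + "\n" for pkg in iter_packages())

and on the client side, using nothing but the standard library:

import json
import urllib2

res = urllib2.urlopen("http://example.org/packages?t=jsons")  # placeholder URL
while True:
    line = res.readline()
    if not line:
        break
    # process one package at a time, without keeping the whole response in memory
    pkg = json.loads(line)
    print pkg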

Thanks to all who provided this excellent information!