Consider this simple Python snippet:
```python
#!/usr/bin/python
import gzip
from urllib import urlopen

zfd = urlopen("http://ftp.debian.org/debian/dists/sid/Contents-udeb.gz")
fd = gzip.GzipFile(fileobj=zfd, mode="r")
for line in fd:
    foobar(line)
```
It does not work: it turns out GzipFile wants to call tell() on its file object:
```
$ /tmp/z.py
Traceback (most recent call last):
  File "/tmp/z.py", line 8, in <module>
    for line in fd:
  File "/usr/lib/python2.6/gzip.py", line 438, in next
    line = self.readline()
  File "/usr/lib/python2.6/gzip.py", line 393, in readline
    c = self.read(readsize)
  File "/usr/lib/python2.6/gzip.py", line 219, in read
    self._read(readsize)
  File "/usr/lib/python2.6/gzip.py", line 247, in _read
    pos = self.fileobj.tell()   # Save current position
AttributeError: addinfourl instance has no attribute 'tell'
```
Oh dear... this really shouldn't be. Let's look around the internet for details:
> Since opener.open returns a file-like object, and you know from the headers that when you read it, you're going to get gzip-compressed data, why not simply pass that file-like object directly to GzipFile? As you “read” from the GzipFile instance, it will “read” compressed data from the remote HTTP server and decompress it on the fly. It's a good idea, but unfortunately it doesn't work. Because of the way gzip compression works, GzipFile needs to save its position and move forwards and backwards through the compressed file. This doesn't work when the “file” is a stream of bytes coming from a remote server; all you can do with it is retrieve bytes one at a time, not move back and forth through the data stream. So the inelegant hack of using StringIO is the best solution: download the compressed data, create a file-like object out of it with StringIO, and then decompress the data from that.
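For reference, the buffering workaround that quote recommends looks roughly like this (spelled with io.BytesIO rather than StringIO so it also runs on Python 3; `download_and_decompress` is a made-up name for this sketch):

```python
import gzip
import io

def download_and_decompress(fileobj):
    # Read the whole compressed payload into memory, then wrap it in a
    # seekable BytesIO object so GzipFile can tell()/seek() at will.
    compressed = fileobj.read()
    return gzip.GzipFile(fileobj=io.BytesIO(compressed), mode="rb")
```

It works, but the entire compressed download sits in memory before the first decompressed byte comes out.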
Oh really, so it's a limitation of "the way gzip compression works", and the
only way to work around this fundamental design flaw of gzip is to buffer the
lot in memory or to a temporary file?
```shell
# Apparently, this is not how gzip compression works:
curl http://ftp.debian.org/debian/dists/sid/Contents-udeb.gz | zcat
```
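And indeed gzip can be decompressed as a forward-only stream from Python too: zlib.decompressobj with wbits=47 (32 + 15, which tells zlib to auto-detect the gzip header) consumes compressed data chunk by chunk, no tell() or seek() anywhere. A minimal sketch (`stream_decompress` is a made-up name):

```python
import zlib

def stream_decompress(chunks):
    # wbits=47 (32 + 15) makes zlib auto-detect the gzip header and
    # decompress incrementally; the input is only ever read forwards.
    d = zlib.decompressobj(47)
    for chunk in chunks:
        data = d.decompress(chunk)
        if data:
            yield data
    tail = d.flush()
    if tail:
        yield tail
```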
This is not the first time that, after hitting a frustrating spot in core Python, looking around the internet for more information has only amplified my frustration by destroying my expectations of common sense.
What is the gzip module doing anyway?
```python
if self._new_member:
    # If the _new_member flag is set, we have to
    # jump to the next member, if there is one.
    #
    # First, check if we're at the end of the file;
    # if so, it's time to stop; no more members to read.
    pos = self.fileobj.tell()   # Save current position
    self.fileobj.seek(0, 2)     # Seek to end of file
    if pos == self.fileobj.tell():
        raise EOFError, "Reached EOF"
    else:
        self.fileobj.seek( pos ) # Return to original position

    self._init_read()
    self._read_gzip_header()
    self.decompress = zlib.decompressobj(-zlib.MAX_WBITS)
    self._new_member = False
```
It only uses tell() and seek() in order to check whether it has reached the end of the file.
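If all that is needed is EOF detection, seeking is unnecessary: a one-byte lookahead wrapper around the non-seekable stream would do the same job. A sketch of the idea (`PeekableStream` is a made-up name, not anything from the standard library):

```python
class PeekableStream:
    """Wrap a non-seekable file-like object with one byte of lookahead."""

    def __init__(self, fileobj):
        self.fileobj = fileobj
        self.buf = b""

    def at_eof(self):
        # Peek one byte ahead; keep it buffered for the next read().
        if not self.buf:
            self.buf = self.fileobj.read(1)
        return not self.buf

    def read(self, n):
        # Hand back the buffered byte first, then read the rest forwards.
        data, self.buf = self.buf, b""
        if len(data) < n:
            data += self.fileobj.read(n - len(data))
        return data
```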
So, there's a flaw in the Python standard library, and:
- it is not explicitly mentioned in the module documentation, which just says "fileobj [...] can be a regular file, a StringIO object, or any other object which simulates a file".
- in basic Python learning material, we find a paragraph whose purpose is to explain that this is not Python's fault, but a fundamental limitation of the way gzip works.
I wish core Python documentation started properly documenting Python's flaws, so that one could code with the language's limitations in mind, instead of having to rediscover them and their workarounds every time.
Update: I reported the issue in the Python BTS, and it turns out this has finally been fixed in version 3.2, but the fix will not be backported to the 2.x series.
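With the 3.2 fix, GzipFile no longer insists on tell()/seek(). A quick way to check without a network connection is to feed it a deliberately non-seekable local stream (`NonSeekable` is a made-up name standing in for the urlopen result):

```python
import gzip
import io

class NonSeekable(io.RawIOBase):
    # Simulates a network stream: data can only be read forwards.
    def __init__(self, data):
        self._inner = io.BytesIO(data)

    def readable(self):
        return True

    def seekable(self):
        return False

    def readinto(self, b):
        chunk = self._inner.read(len(b))
        b[:len(chunk)] = chunk
        return len(chunk)

blob = gzip.compress(b"alpha\nbeta\n")
with gzip.GzipFile(fileobj=NonSeekable(blob), mode="rb") as fd:
    lines = list(fd)
```

On Python 3.2+ this iterates the decompressed lines without ever seeking; on 2.x it fails just like the urlopen example above.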