Rants, kept to a bare minimum and strictly with a useful component.
Evolution's old odd mail folders to mbox
Something wrong happened in my dad's Evolution. It just would get stuck checking mail forever, with no useful diagnostic that I could find. Fun. Not.
Anyway, I solved it by resetting everything to factory defaults, moving away all gconf entries and .evolution/ files. Then it started to work again; of course, I then needed to reconfigure it from scratch.
It turned out however that some old mail was only archived locally, and in a kind of weird format that looks like this:
$ ls -la Enrico/
total 336
drwx------ 2 enrico enrico 4096 Jul 23 03:05 .
drwxr-xr-x 7 enrico enrico 4096 Jul 23 03:12 ..
-rw------- 1 enrico enrico 3230 Dec 4 2010 113.HEADER
-rw------- 1 enrico enrico 14521 Dec 4 2010 113.TEXT
-rw------- 1 enrico enrico 3209 Oct 22 2010 134.HEADER
-rw------- 1 enrico enrico 2937 Oct 22 2010 134.TEXT
-rw------- 1 enrico enrico 3116 Jun 27 2011 15.
-rw------- 1 enrico enrico 3678 Jun 27 2011 168.
-rw------- 1 enrico enrico 73 Apr 27 2009 22.1.MIME
-rw------- 1 enrico enrico 3199 Apr 27 2009 22.2
-rw------- 1 enrico enrico 88 Apr 27 2009 22.2.MIME
[...]
I couldn't even find the name of that mail folder layout, let alone conversion tools. So I had to sit down and waste my Sunday break writing software to convert that to an mbox file. Here's the tool; may it save you the awful time I had today: http://anonscm.debian.org/gitweb/?p=users/enrico/evo2mbox.git
Note: feel free to fork it, or send patches, but don't bother with feature requests. Evolution isn't and won't be a personal interest of mine. Anything that makes an afternoon at my parents more tiresome than a whole busy month of paid work, doesn't deserve to be.
Luckily they now seem to have changed the local folder format to Maildir.
SQLAlchemy, MySQL and sql_mode=traditional
As everyone should know, by default MySQL is an embarrassingly stupid toy:
mysql> create table foo (val integer not null);
Query OK, 0 rows affected (0.03 sec)
mysql> insert into foo values (1/0);
ERROR 1048 (23000): Column 'val' cannot be null
mysql> insert into foo values (1);
Query OK, 1 row affected (0.00 sec)
mysql> update foo set val=1/0 where val=1;
Query OK, 1 row affected, 1 warning (0.00 sec)
Rows matched: 1 Changed: 1 Warnings: 1
mysql> select * from foo;
+-----+
| val |
+-----+
| 0 |
+-----+
1 row in set (0.00 sec)
Luckily, you can tell it to stop being embarrassingly stupid:
mysql> set sql_mode="traditional";
Query OK, 0 rows affected (0.00 sec)
mysql> update foo set val=1/0 where val=0;
ERROR 1365 (22012): Division by 0
(There is an even better sql mode you can choose, though: it is called "Install PostgreSQL")
Unfortunately, I've been hired to work on a project that relies on the embarrassingly stupid behaviour of MySQL, so I cannot set sql_mode=traditional globally or the existing house of cards will collapse.
Here is how you set it session-wide with SQLAlchemy 0.6.x; it took me quite a while to find out:
import sqlalchemy.interfaces
from sqlalchemy import create_engine

# Without this, MySQL will silently insert invalid values in the
# database, causing very long debugging sessions in the long run
class DontBeSilly(sqlalchemy.interfaces.PoolListener):
    def connect(self, dbapi_con, connection_record):
        # Runs once for every new DBAPI connection opened by the pool
        cur = dbapi_con.cursor()
        cur.execute("SET SESSION sql_mode='TRADITIONAL'")
        cur = None

engine = create_engine(..., listeners=[DontBeSilly()])
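For what it's worth, here is a sketch of the same idea for newer SQLAlchemy releases (0.7 and later), where the event API took over from PoolListener. The names set_traditional_mode and make_strict_engine are made up for this example; event.listen with the "connect" hook is the SQLAlchemy part. Treat it as a starting point, not as the one true way.

```python
def set_traditional_mode(dbapi_con, connection_record):
    """Run on every new DBAPI connection: switch MySQL to strict mode."""
    cur = dbapi_con.cursor()
    try:
        cur.execute("SET SESSION sql_mode='TRADITIONAL'")
    finally:
        cur.close()


def make_strict_engine(url, **kwargs):
    """Hypothetical helper: build an engine whose connections are all strict."""
    # Imported here so the hook above can be tried without SQLAlchemy installed
    from sqlalchemy import create_engine, event

    engine = create_engine(url, **kwargs)
    # "connect" fires once for each new connection the pool opens
    event.listen(engine, "connect", set_traditional_mode)
    return engine
```

Since the hook only needs an object with a cursor() method, it can be exercised with a stub connection before pointing it at a real database.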
Why it takes all that effort is beyond me. I'd have expected this to be turned on by default, possibly with a switch that insane people could use to turn it off.
Repubblica vs. automatic translators
I see this headline: "Cieco e senza ali: l'insetto della cava più profonda" ("Blind and wingless: the insect of the deepest quarry") and get that nails-on-a-blackboard feeling. They can't possibly have translated the English "cave" as the Italian "cava", which means quarry?
Yes.
And they are truly convinced of it, insisting in the short article that it really, really is an artificial quarry, really, really dug by humans:
…they found it while inspecting the deepest quarry ever dug
by man, Krubera-Voronja, in Abkhazia.
Dear Repubblica, this is a quarry ("cava"), and this is a cave ("caverna"). Does a quarry more than 2 kilometres deep, containing never-before-discovered life forms, sound plausible to you? Ah, but you know, in Abkhazia...
It would have been enough to type Krubera-Voronja into a search engine and spend 30 seconds looking into it before publishing the thing, but I realise that thinking is hard work, especially once one has lost the habit.
It does open up an interesting line of thought, though: an automatic translator does not make this mistake. This means that automatic translators have surpassed the intelligence of Repubblica. THAT is news worth publishing!
The next step is surpassing the intelligence of fruit flies.
Those lovely spammers at Aruba
Aruba has decided, out of the blue, to subscribe me to all of their newsletters.
The newsletters have no unsubscribe link. Or rather, maybe they do, but it only shows up if you decode the mail with programs that I have no intention of using. Unsubscribe link or not, why should I unsubscribe from newsletters I never subscribed to in the first place?
I sent this mail to abuse@staff.aruba.it, followed by 3 more reports, all of which were of course ignored:
Good morning,
I am reporting this spam sent by yourselves (attached is the mail with
the headers intact).
Could you please take disciplinary action against
yourselves? Your behaviour on the internet violates the most basic rules
of netiquette, and it is in your interest, as a provider, to educate yourselves
about them and make yourselves comply.
Kind regards,
Enrico
What can I say, a third-world nation deserves third-world ISPs.
Still, it is an excellent excuse to study postfix's header_checks: now Aruba's newsletter mails, which incidentally are 300Kb bricks each, get a REJECT right in the SMTP session:
550 5.7.1 Criminal third-world ISP spammers not accepted here.
To do that, I added to /etc/postfix/main.cf:
# Reject aruba spam right away
header_checks = pcre:/etc/postfix/known_idiots.pcre
And then I created the file /etc/postfix/known_idiots.pcre:
/^Received:.+smtpnewsletter[0-9]+\.aruba\.it/
    REJECT Criminal third-world ISP spammers not accepted here.
In the meantime I have sent an email to the Garante Privacy (the Italian data protection authority) and one to AGCOM (the communications regulator), more out of curiosity than anything else. I do not expect any reply, but if anything happens I will gladly add it here.
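Before letting a rule like this loose on real mail, it may be worth trying the pattern on a couple of header lines. Below is a small sketch in Python, with the dots written as \. so that they only match a literal dot. The sample Received: lines and the would_reject name are invented for the example, and Python's re module is only standing in for Postfix's PCRE engine here, which should behave the same for a pattern this simple.

```python
import re

# The rule's pattern. Postfix matches pcre tables case-insensitively by
# default, hence re.IGNORECASE.
ARUBA_NEWSLETTER = re.compile(
    r"^Received:.+smtpnewsletter[0-9]+\.aruba\.it", re.IGNORECASE)


def would_reject(header_line):
    """Return True if the header line would trigger the REJECT rule."""
    return ARUBA_NEWSLETTER.search(header_line) is not None


# Invented sample headers, using documentation-only IP addresses
sample_spam = "Received: from smtpnewsletter12.aruba.it (unknown [192.0.2.10])"
sample_ham = "Received: from mail.example.org (mail.example.org [192.0.2.20])"
print(would_reject(sample_spam))   # True
print(would_reject(sample_ham))    # False
```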
Python list gotcha
Suppose in Python you're building a list of buckets:
>>> a = [[]] * 10
>>> print a
[[], [], [], [], [], [], [], [], [], []]
Looks good. However:
>>> a[5].append(1)
>>> print a
[[1], [1], [1], [1], [1], [1], [1], [1], [1], [1]]
Surprising? What happens here is that multiplying the list replicates the reference to the same empty list. You have the exact same mutable list replicated 10 times: instead of 10 buckets, you have 10 references to 1 bucket, so appending to one looks like appending to all.
What you need here is a way to invoke the list constructor [] multiple times:
>>> a = [[] for i in range(10)]
>>> print a
[[], [], [], [], [], [], [], [], [], []]
>>> a[5].append(1)
>>> print a
[[], [], [], [], [], [1], [], [], [], []]
A mistake like this can take quite a bit of time to track down.
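One quick way to spot the problem while all the buckets are still empty, and therefore print identically, is to compare them by identity. A small sketch, in Python 3 syntax:

```python
shared = [[]] * 3                    # three references to one list
separate = [[] for _ in range(3)]    # three distinct lists

# `is` compares identity, so it tells the two cases apart even though
# both lists print as [[], [], []] at this point.
print(shared[0] is shared[1])        # True
print(separate[0] is separate[1])    # False

shared[0].append(1)
separate[0].append(1)
print(shared)      # [[1], [1], [1]]
print(separate)    # [[1], [], []]
```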
Gzip streaming in Python (or lack thereof)
Consider this simple Python snippet:
#!/usr/bin/python
import gzip
from urllib import urlopen
zfd = urlopen("http://ftp.debian.org/debian/dists/sid/Contents-udeb.gz")
fd = gzip.GzipFile(fileobj=zfd, mode="r")
for line in fd:
    foobar(line)
It does not work: it turns out GzipFile wants to seek() and tell() on its file object:
$ /tmp/z.py
Traceback (most recent call last):
  File "/tmp/z.py", line 8, in <module>
    for line in fd:
  File "/usr/lib/python2.6/gzip.py", line 438, in next
    line = self.readline()
  File "/usr/lib/python2.6/gzip.py", line 393, in readline
    c = self.read(readsize)
  File "/usr/lib/python2.6/gzip.py", line 219, in read
    self._read(readsize)
  File "/usr/lib/python2.6/gzip.py", line 247, in _read
    pos = self.fileobj.tell() # Save current position
AttributeError: addinfourl instance has no attribute 'tell'
Oh dear... this really shouldn't be. Let's look around the internet for details:
Since opener.open returns a file-like object, and you know from the headers that when you read it, you're going to get gzip-compressed data, why not simply pass that file-like object directly to GzipFile? As you “read” from the GzipFile instance, it will “read” compressed data from the remote HTTP server and decompress it on the fly. It's a good idea, but unfortunately it doesn't work. Because of the way gzip compression works, GzipFile needs to save its position and move forwards and backwards through the compressed file. This doesn't work when the “file” is a stream of bytes coming from a remote server; all you can do with it is retrieve bytes one at a time, not move back and forth through the data stream. So the inelegant hack of using StringIO is the best solution: download the compressed data, create a file-like object out of it with StringIO, and then decompress the data from that.
Oh really, so it's a limitation of "the way gzip compression works", and the
only way to work around this fundamental design flaw of gzip is to buffer the
lot in memory or to a temporary file? [facepalm]
# Apparently, this is not how gzip compression works:
curl http://ftp.debian.org/debian/dists/sid/Contents-udeb.gz | zcat
This is not the first time that, after finding a frustrating spot in core Python things, looking around the internet for extra information has the effect of amplifying my frustration by destroying my expectations of common sense.
What is the gzip module doing anyway?
if self._new_member:
    # If the _new_member flag is set, we have to
    # jump to the next member, if there is one.
    #
    # First, check if we're at the end of the file;
    # if so, it's time to stop; no more members to read.
    pos = self.fileobj.tell() # Save current position
    self.fileobj.seek(0, 2) # Seek to end of file
    if pos == self.fileobj.tell():
        raise EOFError, "Reached EOF"
    else:
        self.fileobj.seek( pos ) # Return to original position
    self._init_read()
    self._read_gzip_header()
    self.decompress = zlib.decompressobj(-zlib.MAX_WBITS)
    self._new_member = False
Right, tell() and seek() in order to check if it's reached end of file.
Oh dear...
So, there's a flaw in the Python standard library, and:
- it is not explicitly mentioned in the module documentation, which just says "fileobj [...] can be a regular file, a StringIO object, or any other object which simulates a file".
- in basic Python learning material, we find a paragraph whose purpose is to explain that this is not Python's fault, but it is a fundamental limitation on the way gzip works.
I wish core Python documentation started properly documenting Python flaws so that one can code with language limitations in mind, instead of having to rediscover them and their workarounds every time.
Update: I reported the issue to the Python BTS and it turns out this has been finally fixed in version 3.2, but the fix will not be backported to the 2.x series.
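The curl | zcat point can also be made from within Python: the zlib module can decompress gzip data incrementally, using nothing but read() on the input. Below is a rough sketch in Python 3 syntax. The name iter_gunzip_lines and the chunk size are arbitrary choices for the example, and it assumes the stream holds a single gzip member, since a decompressobj stops at the end of the first one.

```python
import zlib


def iter_gunzip_lines(fileobj, chunk_size=64 * 1024):
    """Yield decompressed lines from a gzip stream, calling only read()."""
    # wbits = 16 + MAX_WBITS tells zlib to expect a gzip header and trailer
    decomp = zlib.decompressobj(16 + zlib.MAX_WBITS)
    pending = b""
    while True:
        chunk = fileobj.read(chunk_size)
        # An empty read means end of stream: collect whatever zlib still holds
        pending += decomp.decompress(chunk) if chunk else decomp.flush()
        # Everything before the last newline is complete lines; keep the rest
        *lines, pending = pending.split(b"\n")
        for line in lines:
            yield line + b"\n"
        if not chunk:
            break
    if pending:
        yield pending    # final line without a trailing newline
```

With Python 3, the response object returned by urllib.request.urlopen has a read() method, so something like `for line in iter_gunzip_lines(urlopen(url)): ...` should work without buffering the whole download.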
Streaming JSON objects with python
Dear Lazyweb,
Suppose I'd like to create an HTTP service that gives you a looong list of JSON objects. For example, one JSON object for every package in Debian.
One way I could do that is to just transfer a big, dozens of megabytes JSON
array with one element per package. With the standard json module in python,
this requires the client to buffer all the data in memory for decoding, as the
JSON result would be something like this:
[
{ info for package 1 },
{ info for package 2 },
...dozens of megabytes...
{ info for package 30000 },
...
]
However, it would be reasonable to engineer things so that the client can process packages one at a time, as they are produced. This can dramatically reduce client memory usage, as well as remove a large latency between the start of the request and the start of processing.
One trivial idea would be to just send JSON objects one after the other, and sort of call load multiple times.
It doesn't work:
import json
from cStringIO import StringIO
a = StringIO()
json.dump({1:3}, a)
json.dump({2:4}, a)
a.reset()
print a.getvalue()
print json.load(a)
gives:
$ python z1.py
{"1": 3}{"2": 4}
Traceback (most recent call last):
  File "z1.py", line 8, in <module>
    print json.load(a)
  File "/usr/lib/python2.6/json/__init__.py", line 267, in load
    parse_constant=parse_constant, **kw)
  File "/usr/lib/python2.6/json/__init__.py", line 307, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python2.6/json/decoder.py", line 322, in decode
    raise ValueError(errmsg("Extra data", s, end, len(s)))
ValueError: Extra data: line 1 column 8 - line 1 column 16 (char 8 - 16)
And the json module documentation
does not provide any way of telling the "stream" decoder to just stop parsing
after the first valid object.
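One thing that does help with concatenated objects, at least once the data is already in a string, is the raw_decode method of json.JSONDecoder: it parses one value and returns it together with the index where it stopped, instead of complaining about extra data. A small sketch in Python 3 syntax follows; iter_concatenated is a made-up name. It still needs the whole text in memory, so it does not solve the streaming problem by itself.

```python
import json


def iter_concatenated(text):
    """Yield each top-level JSON value from a string of concatenated values."""
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(text):
        # raw_decode does not skip leading whitespace, so do it by hand
        if text[pos].isspace():
            pos += 1
            continue
        # Returns the decoded value and the index just past its end
        obj, pos = decoder.raw_decode(text, pos)
        yield obj


print(list(iter_concatenated('{"1": 3}{"2": 4}')))   # [{'1': 3}, {'2': 4}]
```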
An alternative can be to use chunked transfer encoding and transfer each JSON object in a separate HTTP chunk. It would add the overhead of a little HTTP header per JSON object, which could be significant when generating lots of small JSON objects, but it would be an implementation that strictly follows HTTP and JSON standards to the letter in a clean way.
Except:
- How do you stream HTTP chunks in WSGI? The WSGI specification's only mention of chunks is that if you produce a result as a generator of "chunks", they are just sent one after the other (effectively, concatenated) in the response stream.
- How do you read and process one chunk at a time with urllib/urllib2? I searched on Google but I only found people in tears.
Another alternative is to wrap the JSON objects in some custom container protocol, but then we are not talking standards anymore.
Is this all you can do with the standard python library? Can this only be done by adding a dependency on a C library with two obscure separately maintained bindings, or by writing my own JSON parser?
Ideas? Please mail enrico@enricozini.org and I'll add them here.
Updates
Wouter van Heyst mentions that he heard of ijson, which indeed has my exact use case in the synopsis, but after some research it looks like a third binding for yajl.
Aigars Mahinovs suggests using YAML instead of JSON. Indeed YAML supports streaming very well (it's even already supported in DDE), but last time I tried to do anything serious with it, PyYaml was slooow. Of course, this being the Python ecosystem, There Are Lots Of Other Ways To Do It And You Shall Be Patronised For Not Choosing The Proper One, but Debian only has PyYaml. Another itch with YAML is that there is no YAML parser in the standard Python library, which requires clients to add an extra dependency.
Jyrki Pulliainen has so far mailed my favourite solution: disable formatting in the encoder and separate JSON objects with newlines:
we built a streaming JSON app as a part of our platform. However, we
didn't stream a single JSON object as that was a bit cumbersome. When
having syntax like:
[
<json-obj1>,
<json-obj2>,
....
<json-objN>
]
The reader has some problems. Should it read the whole stream to find
the closing bracket ] even if the stream would be tens of megabytes?
Should it try to split on , and parse single objects? What if the
objects contain , so should there be some regexp magic?
We ended up streaming separate JSON objects separated by newlines
(\n). This way the reader can read until hitting \n or end of stream
and then split the result on newlines. The new stream looked like
this:
<json-obj1>\n
<json-obj2>\n
...
<json-objN>\n
This way we could efficiently stream hundreds of thousands of JSON
objects from server to client without reserving a load of memory. For
the WSGI part how to do this, the best would probably to do a
generator that would yield a single JSON object with the newline (we
didn't use WSGI back then). Reading can be done using any reader
capable of handling streams. We used pycurl for reading, but I can't
really recommend it as it is way too complicated for a simple task
like this :p
Hopefully this helps!
This means that if the client has no access to a JSON reader that supports a stream of concatenated toplevel objects, they can still trivially code it on top of any JSON library. This means that working clients can be written based just on Squeeze's python standard library. That's what I'm going to do; I'll probably implement it as a separate 'jsons' output format for DDE, because producing a JSON response with multiple toplevel objects somehow breaks the JSON standard.
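To give an idea of how little code the newline-separated approach needs, here is a sketch of both ends using only the standard library, in Python 3 syntax. The function names are invented for the example; the encoder is a generator of byte strings, which is the shape a WSGI application would return as its response body.

```python
import json


def encode_json_lines(objects):
    """Yield one compact JSON document per line, as bytes."""
    for obj in objects:
        # Without indent, json.dumps never emits a raw newline:
        # newlines inside strings are escaped as \n
        yield (json.dumps(obj) + "\n").encode("utf-8")


def decode_json_lines(fileobj):
    """Yield objects one at a time from a binary file-like object."""
    for line in fileobj:
        if line.strip():              # tolerate blank lines
            yield json.loads(line.decode("utf-8"))
```

The reader never holds more than one line in memory, so the client can start processing the first package while the server is still producing the rest.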
Thanks to all who provided this excellent information!
My rule to see if a framework is worthy of attention
I came up with a little rule:
In order to be worthy of any attention, a framework must be stable enough that I can charge money to train people to use it.
This probably applies to other kinds of software stacks, libraries, development environments and, well, to most software applications.
In the context of Python web frameworks, this means that:
- If it changes API all the time it is not worthy of attention, because my customers won't get value for their money, as they'd continuously need retraining and rewriting their software.
- If I see lots of DeprecationWarnings it is not worthy of attention, because my customers will see them and blame me for teaching them deprecated stuff.
- If fixes for bugs affecting the stable version are only distributed "in a recent git" or "in the next development version", and they are not backported into a new bugfix-only stable release, then it is not worthy of attention, because:
  - My customers' business is to develop their own products based on the framework.
  - My customers' business is not to be maintaining in-house stable updates of the framework. Although if the framework's community is nice enough they might end up giving a hand.
- If it requires virtualenv or can only be obtained through easy_install it is not worthy of attention, because:
  - My customers are not interested in maintaining custom deployment environments over time.
  - My customers are not interested in tracking each and every single library's upstream development to keep their production system free of bugs.
  - My customers are used to getting software through a proper distribution which also takes care of security updates.
  - I am paid to teach them how to use a framework, not a custom python-only package management system.
  - In my experience, if distributions have trouble keeping packages up to date, upstream is doing something fundamentally wrong.
In light of this rule, I regret to notice that I see very few Python web frameworks worthy of any attention.
On python stable APIs
There is a theory which states that if ever anyone discovers exactly what the Universe is for and why it is here, it will instantly disappear and be replaced by something even more bizarre and inexplicable.
There is another theory which states that this has already happened.
In Debian testing:
/usr/lib/python2.6/dist-packages/sqlalchemy/types.py:547: SADeprecationWarning: The Binary type has been renamed to LargeBinary.
In Debian Lenny:
ImportError: cannot import name LargeBinary
I was starting to think that SQLAlchemy wasn't too bad, since I've been using it for 6 months and I haven't seen its API change yet.
But there it is, a beautiful reminder that SQLAlchemy, too, is part of the marvelously autistic Python ecosystem.
