Posts containing useful tips.

Computing time offsets between EXIF and GPS

I like the idea of matching photos to GPS traces. In Debian there is gpscorrelate but it's almost unusable to me because of bug #473362 and it has an awkward way of specifying time offsets.

Here at SoTM10 someone told me that exiftool gained -geosync and -geotag options. So it's just a matter of creating a little tool that shows a photo and asks you to type the GPS time you see in it.

Apparently there are no bindings or GIR files for gtkimageview in Debian, so I'll have to use C.

Here is a C prototype:

/*
 * gpsoffset - Compute EXIF time offset from a photo of a gps display
 *
 * Use with exiftool -geosync=... -geotag trace.gpx DIR
 *
 * Copyright (C) 2009--2010  Enrico Zini <enrico@enricozini.org>
 *
 * This program is free software; you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation; either version 2 of the License, or
 * (at your option) any later version.
 *
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with this program; if not, write to the Free Software
 * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
 */


#define _XOPEN_SOURCE /* glibc2 needs this */
#include <time.h>
#include <gtkimageview/gtkimageview.h>
#include <libexif/exif-data.h>
#include <stdio.h>
#include <stdlib.h>

static int load_time(const char* fname, struct tm* tm)
{
    ExifData* exif_data = exif_data_new_from_file(fname);
    ExifEntry* exif_time = exif_data_get_entry(exif_data, EXIF_TAG_DATE_TIME);
    if (exif_time == NULL)
    {
        fprintf(stderr, "Cannot find EXIF timetamp\n");
        return -1;
    }

    char buf[1024];
    exif_entry_get_value(exif_time, buf, 1024);
    //printf("val2: %s\n", exif_entry_get_value(t2, buf, 1024));

    if (strptime(buf, "%Y:%m:%d %H:%M:%S", tm) == NULL)
    {
        fprintf(stderr, "Cannot match EXIF timetamp\n");
        return -1;
    }

    return 0;
}

static time_t exif_ts;
static GtkWidget* res_lbl;

void date_entry_changed(GtkEditable *editable, gpointer user_data)
{
    const gchar* text = gtk_entry_get_text(GTK_ENTRY(editable));
    struct tm parsed;
    if (strptime(text, "%Y-%m-%d %H:%M:%S", &parsed) == NULL)
    {
        gtk_label_set_text(GTK_LABEL(res_lbl), "Please enter a date as YYYY-MM-DD HH:MM:SS");
    } else {
        time_t img_ts = mktime(&parsed);
        int c;
        int res;
        if (exif_ts < img_ts)
        {
            c = '+';
            res = img_ts - exif_ts;
        }
        else
        {
            c = '-';
            res = exif_ts - img_ts;
        }
        char buf[1024];
        if (res > 3600)
            snprintf(buf, 1024, "Result: %c%ds -geosync=%c%d:%02d:%02d",
                    c, res, c, res / 3600, (res / 60) % 60, res % 60);
        else if (res > 60)
            snprintf(buf, 1024, "Result: %c%ds -geosync=%c%02d:%02d",
                    c, res, c, (res / 60) % 60, res % 60);
        else
            snprintf(buf, 1024, "Result: %c%ds -geosync=%c%d",
                    c, res, c, res);
        gtk_label_set_text(GTK_LABEL(res_lbl), buf);
    }
}

int main (int argc, char *argv[])
{
    // Work in UTC to avoid mktime applying DST or timezones
    setenv("TZ", "UTC");

    const char* filename = "/home/enrico/web-eddie/galleries/2010/04-05-Uppermill/P1080932.jpg";

    gtk_init (&argc, &argv);

    struct tm exif_time;
    if (load_time(filename, &exif_time) != 0)
        return 1;

    printf("EXIF time: %s\n", asctime(&exif_time));
    exif_ts = mktime(&exif_time);

    GtkWidget* window = gtk_window_new(GTK_WINDOW_TOPLEVEL);
    GtkWidget* vb = gtk_vbox_new(FALSE, 0);
    GtkWidget* hb = gtk_hbox_new(FALSE, 0);
    GtkWidget* lbl = gtk_label_new("Timestamp:");
    GtkWidget* exif_lbl;
    {
        char buf[1024];
        strftime(buf, 1024, "EXIF time: %Y-%m-%d %H:%M:%S", &exif_time);
        exif_lbl = gtk_label_new(buf);
    }
    GtkWidget* date_ent = gtk_entry_new();
    res_lbl = gtk_label_new("Result:");
    GtkWidget* view = gtk_image_view_new();
    GdkPixbuf* pixbuf = gdk_pixbuf_new_from_file(filename, NULL);

    gtk_box_pack_start(GTK_BOX(hb), lbl, FALSE, TRUE, 0);
    gtk_box_pack_start(GTK_BOX(hb), date_ent, TRUE, TRUE, 0);

    gtk_signal_connect(GTK_OBJECT(date_ent), "changed", (GCallback)date_entry_changed, NULL);
    {
        char buf[1024];
        strftime(buf, 1024, "%Y-%m-%d %H:%M:%S", &exif_time);
        gtk_entry_set_text(GTK_ENTRY(date_ent), buf);
    }

    gtk_widget_set_size_request(view, 500, 400);
    gtk_image_view_set_pixbuf(GTK_IMAGE_VIEW(view), pixbuf, TRUE);
    gtk_container_add(GTK_CONTAINER(window), vb);
    gtk_box_pack_start(GTK_BOX(vb), view, TRUE, TRUE, 0);
    gtk_box_pack_start(GTK_BOX(vb), hb, FALSE, TRUE, 0);
    gtk_box_pack_start(GTK_BOX(vb), exif_lbl, FALSE, TRUE, 0);
    gtk_box_pack_start(GTK_BOX(vb), res_lbl, FALSE, TRUE, 0);
    gtk_widget_show_all(window);

    gtk_main ();

    return 0;
}

And here is its simple makefile:

CFLAGS=$(shell pkg-config --cflags gtkimageview libexif)
LDFLAGS=$(shell pkg-config --libs gtkimageview libexif)

gpsoffset: gpsoffset.c

It's a simple prototype but it's a working prototype and seems to do the job for me.

I currently cannot find out why after I click on the text box, there seems to be no way to give the focus back to the image viewer so I can control it with keys.

There is another nice algorithm to compute time offsets to be implemented: you choose a photo taken from a known place and drag it on that place on a map: you can then look for the nearest point on your GPX trace and compute the time offset from that.

I have seen that there are programs for geotagging photos that implement all such algorithms, and have a nice UI, but I haven't seen any in Debian.

Are there any such softwares that can be packaged?

If not, the interpolation and annotation tasks can now already be performed by exiftool, so it's just a matter of building a good UI, and I would love to see someone picking up the task.

Posted Sun Jul 11 12:34:04 2010 Tags: tips

Searching OSM nodes in Spatialite

Third step of my SoTM10 pet project: finding the POIs.

I put together a query to find all nodes with a given tag inside a bounding box, and also a query to find all the tag values for a given tag name inside a bounding box.

The result is this simple POI search engine:

#
# poisearch - simple geographical POI search engine
#
# Copyright (C) 2010  Enrico Zini <enrico@enricozini.org>
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program; if not, write to the Free Software
# Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
#

from pysqlite2 import dbapi2 as sqlite

class PoiDB(object):
    def __init__(self):
        self.db = sqlite.connect("pois.db")
        self.db.enable_load_extension(True)
        self.db.execute("SELECT load_extension('libspatialite.so')")
        self.oldsearch = []
        self.bbox = None

    def set_bbox(self, xmin, xmax, ymin, ymax):
        '''Set bbox for searches'''
        self.bbox = (xmin, xmax, ymin, ymax)

    def tagid(self, name, val):
        '''Get the database ID for a tag'''
        c = self.db.cursor()
        c.execute("SELECT id FROM tag WHERE name=? AND value=?", (name, val))
        res = None
        for row in c:
            res = row[0]
        return res

    def tagnames(self):
        '''Get all tag names'''
        c = self.db.cursor()
        c.execute("SELECT DISTINCT name FROM tag ORDER BY name")
        for row in c:
            yield row[0]

    def tagvalues(self, name, use_bbox=False):
        '''
        Get all tag values for a given tag name,
        optionally in the current bounding box
        '''
        c = self.db.cursor()
        if self.bbox is None or not use_bbox:
            c.execute("SELECT DISTINCT value FROM tag WHERE name=? ORDER BY value", (name,))
        else:
            c.execute("SELECT DISTINCT tag.value FROM poi, poitag, tag"
                      " WHERE poi.rowid IN (SELECT pkid FROM idx_poi_geom WHERE ("
                      "       xmin >= ? AND xmax <= ? AND ymin >= ? AND ymax <= ?) )"
                      "   AND poitag.tag = tag.id AND poitag.poi = poi.id"
                      "   AND tag.name=?",
                      self.bbox + (name,))
        for row in c:
            yield row[0]

    def search(self, name, val):
        '''Get all name:val tags in the current bounding box'''
        # First resolve the tagid
        tagid = self.tagid(name, val)
        if tagid is None: return

        c = self.db.cursor()
        c.execute("SELECT poi.name, poi.data, X(poi.geom), Y(poi.geom) FROM poi, poitag"
                  " WHERE poi.rowid IN (SELECT pkid FROM idx_poi_geom WHERE ("
                  "       xmin >= ? AND xmax <= ? AND ymin >= ? AND ymax <= ?) )"
                  "   AND poitag.tag = ? AND poitag.poi = poi.id",
                  self.bbox + (tagid,))
        self.oldsearch = []
        for row in c:
            self.oldsearch.append(row)
            yield row[0], simplejson.loads(row[1]), row[2], row[3]

    def count(self, name, val):
        '''Count all name:val tags in the current bounding box'''
        # First resolve the tagid
        tagid = self.tagid(name, val)
        if tagid is None: return

        c = self.db.cursor()
        c.execute("SELECT COUNT(*) FROM poi, poitag"
                  " WHERE poi.rowid IN (SELECT pkid FROM idx_poi_geom WHERE ("
                  "       xmin >= ? AND xmax <= ? AND ymin >= ? AND ymax <= ?) )"
                  "   AND poitag.tag = ? AND poitag.poi = poi.id",
                  self.bbox + (tagid,))
        for row in c:
            return row[0]

    def replay(self):
        for row in self.oldsearch:
            yield row[0], simplejson.loads(row[1]), row[2], row[3]

Problem 3 solved: now on to the next step, building a user interface for it.

Posted Sat Jul 10 15:50:31 2010 Tags: tips

Importing OSM nodes into Spatialite

Second step of my SoTM10 pet project: creating a searchable database with the points. What a fantastic opportunity to learn Spatialite.

Learning Spatialite is easy. For example, you can use the two tutorials with catchy titles that assume your best wish in life is to create databases out of shapefiles using a pre-built, i386-only executable GUI binary downloaded over an insecure HTTP connection.

To be fair, the second of those tutorials is called "An almost Idiot's Guide", thus expliciting the requirement of being an almost idiot in order to happily acquire and run software in that way.

Alternatively, you can use A quick tutorial to SpatiaLite which is so quick it has examples that lead you to write SQL queries that trigger all sorts of vague exceptions at insert time. But at least it brought me a long way forward, at which point I could just cross reference things with PostGIS documentation to find out the right way of doing things.

So, here's the importer script, which will probably become my reference example for how to get started with Spatialite, and how to use Spatialite from Python:

#!/usr/bin/python

#
# poiimport - import nodes from OSM into a spatialite DB
#
# Copyright (C) 2010  Enrico Zini <enrico@enricozini.org>
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program; if not, write to the Free Software
# Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
#

import xml.sax
import xml.sax.handler
from pysqlite2 import dbapi2 as sqlite
import simplejson
import sys
import os

class OSMPOIReader(xml.sax.handler.ContentHandler):
    '''
    Filter SAX events in a OSM XML file to keep only nodes with names
    '''
    def __init__(self, consumer):
        self.consumer = consumer

    def startElement(self, name, attrs):
        if name == "node":
            self.attrs = attrs
            self.tags = dict()
        elif name == "tag":
            self.tags[attrs["k"]] = attrs["v"]

    def endElement(self, name):
        if name == "node":
            lat = float(self.attrs["lat"])
            lon = float(self.attrs["lon"])
            id = int(self.attrs["id"])
            #dt = parse(self.attrs["timestamp"])
            uid = self.attrs.get("uid", None)
            uid = int(uid) if uid is not None else None
            user = self.attrs.get("user", None)

            self.consumer(lat, lon, id, self.tags, user=user, uid=uid)

class Importer(object):
    '''
    Create the spatialite database and populate it
    '''
    TAG_WHITELIST = set(["amenity", "shop", "tourism", "place"])

    def __init__(self, filename):
        self.db = sqlite.connect(filename)
        self.db.enable_load_extension(True)
        self.db.execute("SELECT load_extension('libspatialite.so')")
        self.db.execute("SELECT InitSpatialMetaData()")
        self.db.execute("INSERT INTO spatial_ref_sys (srid, auth_name, auth_srid,"
                        " ref_sys_name, proj4text) VALUES (4326, 'epsg', 4326,"
                        " 'WGS 84', '+proj=longlat +ellps=WGS84 +datum=WGS84"
                        " +no_defs')")
        self.db.execute("CREATE TABLE poi (id int not null unique primary key,"
                        " name char, data text)")
        self.db.execute("SELECT AddGeometryColumn('poi', 'geom', 4326, 'POINT', 2)")
        self.db.execute("SELECT CreateSpatialIndex('poi', 'geom')")
        self.db.execute("CREATE TABLE tag (id integer primary key autoincrement,"
                        " name char, value char)")
        self.db.execute("CREATE UNIQUE INDEX tagidx ON tag (name, value)")
        self.db.execute("CREATE TABLE poitag (poi int not null, tag int not null)")
        self.db.execute("CREATE UNIQUE INDEX poitagidx ON poitag (poi, tag)")
        self.tagid_cache = dict()

    def tagid(self, k, v):
        key = (k, v)
        res = self.tagid_cache.get(key, None)
        if res is None:
            c = self.db.cursor()
            c.execute("SELECT id FROM tag WHERE name=? AND value=?", key)
            for row in c:
                self.tagid_cache[key] = row[0]
                return row[0]
            self.db.execute("INSERT INTO tag (id, name, value) VALUES (NULL, ?, ?)", key)
            c.execute("SELECT last_insert_rowid()")
            for row in c:
                res = row[0]
            self.tagid_cache[key] = res
        return res

    def __call__(self, lat, lon, id, tags, user=None, uid=None):
        # Acquire tag IDs
        tagids = []
        for k, v in tags.iteritems():
            if k not in self.TAG_WHITELIST: continue
            for val in v.split(";"):
                tagids.append(self.tagid(k, val))

        # Skip elements that don't have the tags we want
        if not tagids: return

        geom = "POINT(%f %f)" % (lon, lat)
        self.db.execute("INSERT INTO poi (id, geom, name, data)"
                        "     VALUES (?, GeomFromText(?, 4326), ?, ?)",
                (id, geom, tags["name"], simplejson.dumps(tags)))

        for tid in tagids:
            self.db.execute("INSERT INTO poitag (poi, tag) VALUES (?, ?)", (id, tid))


    def done(self):
        self.db.commit()

# Get the output file name
filename = sys.argv[1]

# Ensure we start from scratch
if os.path.exists(filename):
    print >>sys.stderr, filename, "already exists"
    sys.exit(1)

# Import
parser = xml.sax.make_parser()
importer = Importer(filename)
handler = OSMPOIReader(importer)
parser.setContentHandler(handler)
parser.parse(sys.stdin)
importer.done()

Let's run it:

$ ./poiimport pois.db < pois.osm 
SpatiaLite version ..: 2.4.0    Supported Extensions:
        - 'VirtualShape'        [direct Shapefile access]
        - 'VirtualDbf'          [direct Dbf access]
        - 'VirtualText'         [direct CSV/TXT access]
        - 'VirtualNetwork'      [Dijkstra shortest path]
        - 'RTree'               [Spatial Index - R*Tree]
        - 'MbrCache'            [Spatial Index - MBR cache]
        - 'VirtualFDO'          [FDO-OGR interoperability]
        - 'SpatiaLite'          [Spatial SQL - OGC]
PROJ.4 Rel. 4.7.1, 23 September 2009
GEOS version 3.2.0-CAPI-1.6.0
$ ls -l --si pois*
-rw-r--r-- 1 enrico enrico 17M Jul  9 23:44 pois.db
-rw-r--r-- 1 enrico enrico 37M Jul  9 16:20 pois.osm
$ spatialite pois.db
SpatiaLite version ..: 2.4.0    Supported Extensions:
        - 'VirtualShape'        [direct Shapefile access]
        - 'VirtualDbf'          [direct DBF access]
        - 'VirtualText'         [direct CSV/TXT access]
        - 'VirtualNetwork'      [Dijkstra shortest path]
        - 'RTree'               [Spatial Index - R*Tree]
        - 'MbrCache'            [Spatial Index - MBR cache]
        - 'VirtualFDO'          [FDO-OGR interoperability]
        - 'SpatiaLite'          [Spatial SQL - OGC]
PROJ.4 version ......: Rel. 4.7.1, 23 September 2009
GEOS version ........: 3.2.0-CAPI-1.6.0
SQLite version ......: 3.6.23.1
Enter ".help" for instructions
spatialite> select id from tag where name="amenity" and value="fountain";
24
spatialite> SELECT poi.name, poi.data, X(poi.geom), Y(poi.geom) FROM poi, poitag WHERE poi.rowid IN (SELECT pkid FROM idx_poi_geom WHERE (xmin >= 2.56 AND xmax <= 2.90 AND ymin >= 41.84 AND ymax <= 42.00) ) AND poitag.tag = 24 AND poitag.poi = poi.id;
Font Picant de la Cellera|{"amenity": "fountain", "name": "Font Picant de la Cellera"}|2.616045|41.952449
Font de Can Pla|{"amenity": "fountain", "name": "Font de Can Pla"}|2.622354|41.974724
Font de Can Ribes|{"amenity": "fountain", "name": "Font de Can Ribes"}|2.62311|41.979193

It's impressive: I've got all sort of useful information for the whole of Spain in just 17Mb!

Let's put it to practice: I'm thirsty, is there any water fountain nearby?

spatialite> SELECT count(1) FROM poi, poitag WHERE poi.rowid IN (SELECT pkid FROM idx_poi_geom WHERE (xmin >= 2.80 AND xmax <= 2.85 AND ymin >= 41.97 AND ymax <= 42.00) ) AND poitag.tag = 24 AND poitag.poi = poi.id;
0

Ouch! No water fountains mapped in Girona... yet.

Problem 2 solved: now on to the next step, trying to show the results in some usable way.

Posted Sat Jul 10 09:10:35 2010 Tags: tips

Filtering nodes out of OSM files

I have a pet project here at SoTM10: create a tool for searching nearby POIs while offline.

The idea is to have something in my pocket (FreeRunner or N900), which doesn't require an internet connection, and which can point me at the nearest fountains, post offices, atm machines, bars and so on.

The first step is to obtain a list of POIs.

In theory one can use Xapi but all the known Xapi servers appear to be down at the moment.

Another attempt is to obtain it by filtering all nodes with the tags we want out of a planet OSM extract. I downloaded the Spanish one and set to work.

First I tried with xmlstarlet, but it ate all the RAM and crashed my laptop, because for some reason, on my laptop the Linux kernels up to 2.6.32 (don't now about later ones) like to swap out ALL running apps to cache I/O operations, which mean that heavy I/O operations swap out the very programs performing them, so the system gets caught in some infinite I/O loop and dies. Or at least this is what I've figured out so far.

So, we need SAX. I put together this prototype in Python, which can process a nice 8MB/s of OSM data for quite some time with a constant, low RAM usage:

#!/usr/bin/python

#
# poifilter - extract interesting nodes from OSM XML files
#
# Copyright (C) 2010  Enrico Zini <enrico@enricozini.org>
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program; if not, write to the Free Software
# Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
#


import xml.sax
import xml.sax.handler
import xml.sax.saxutils
import sys

class XMLSAXFilter(xml.sax.handler.ContentHandler):
    '''
    A SAX filter that is a ContentHandler.

    There is xml.sax.saxutils.XMLFilterBase in the standard library but it is
    undocumented, and most of the examples using it you find online are wrong.
    You can look at its source code, and at that point you find out that it is
    an offensive practical joke.
    '''
    def __init__(self, downstream):
        self.downstream = downstream

    # ContentHandler methods

    def setDocumentLocator(self, locator):
        self.downstream.setDocumentLocator(locator)

    def startDocument(self):
        self.downstream.startDocument()

    def endDocument(self):
        self.downstream.endDocument()

    def startPrefixMapping(self, prefix, uri):
        self.downstream.startPrefixMapping(prefix, uri)

    def endPrefixMapping(self, prefix):
        self.downstream.endPrefixMapping(prefix)

    def startElement(self, name, attrs):
        self.downstream.startElement(name, attrs)

    def endElement(self, name):
        self.downstream.endElement(name)

    def startElementNS(self, name, qname, attrs):
        self.downstream.startElementNS(name, qname, attrs)

    def endElementNS(self, name, qname):
        self.downstream.endElementNS(name, qname)

    def characters(self, content):
        self.downstream.characters(content)

    def ignorableWhitespace(self, chars):
        self.downstream.ignorableWhitespace(chars)

    def processingInstruction(self, target, data):
        self.downstream.processingInstruction(target, data)

    def skippedEntity(self, name):
        self.downstream.skippedEntity(name)

class OSMPOIHandler(XMLSAXFilter):
    '''
    Filter SAX events in a OSM XML file to keep only nodes with names
    '''
    PASSTHROUGH = ["osm", "bound"]
    TAG_WHITELIST = set(["amenity", "shop", "tourism", "place"])

    def startElement(self, name, attrs):
        if name in self.PASSTHROUGH:
            self.downstream.startElement(name, attrs)
        elif name == "node":
            self.attrs = attrs
            self.tags = []
            self.propagate = False
        elif name == "tag":
            if self.tags is not None:
                self.tags.append(attrs)
                if attrs["k"] in self.TAG_WHITELIST:
                    self.propagate = True
        else:
            self.tags = None
            self.attrs = None

    def endElement(self, name):
        if name in self.PASSTHROUGH:
            self.downstream.endElement(name)
        elif name == "node":
            if self.propagate:
                self.downstream.startElement("node", self.attrs)
                for attrs in self.tags:
                    self.downstream.startElement("tag", attrs)
                    self.downstream.endElement("tag")
                self.downstream.endElement("node")

    def ignorableWhitespace(self, chars):
        pass

    def characters(self, content):
        pass

# Simple stdin->stdout XMl filter
parser = xml.sax.make_parser()
handler = OSMPOIHandler(xml.sax.saxutils.XMLGenerator(sys.stdout, "utf-8"))
parser.setContentHandler(handler)
parser.parse(sys.stdin)

Let's run it:

$ bzcat /store/osm/spain.osm.bz2 | pv | ./poifilter > pois.osm
[...]
$ ls -l --si pois.osm
-rw-r--r-- 1 enrico enrico 19M Jul 10 23:56 pois.osm
$ xmlstarlet val pois.osm 
pois.osm - valid

Problem 1 solved: now on to the next step: importing the nodes in a database.

Posted Fri Jul 9 16:28:15 2010 Tags: tips

Tweaking locale settings

I sometimes meet some Italian programmer who prefers his system to be in English, so they get untranslated manpages and error messages.

I sometimes notice that their solution often leaves them something to complain about.

Some set LANG=C and complain they can't see accented characters.

Some set LANG=en_US.UTF-8 and complain that OpenOffice Calc wants dates in the format MM/DD/YYYY which is an Abomination unto Nuggan, as well as unto Me.

But the locales system can do much better than that. In fact, most times people would be extremely happy with LC_MESSAGES in English, and everything else in Italian:

$ locale
LANG=it_IT.UTF-8
LC_CTYPE="it_IT.UTF-8"
LC_NUMERIC="it_IT.UTF-8"
LC_TIME="it_IT.UTF-8"
LC_COLLATE="it_IT.UTF-8"
LC_MONETARY="it_IT.UTF-8"
LC_MESSAGES=en_US.UTF-8
LC_PAPER="it_IT.UTF-8"
LC_NAME="it_IT.UTF-8"
LC_ADDRESS="it_IT.UTF-8"
LC_TELEPHONE="it_IT.UTF-8"
LC_MEASUREMENT="it_IT.UTF-8"
LC_IDENTIFICATION="it_IT.UTF-8"
LC_ALL=

A way to do this (is there a better one?) is to tell the display manager (GDM, KDM...) to use the Italian locale, and then override the right LC_* bits in ~/.xsessionrc:

$ cat ~/.xsessionrc
export LC_MESSAGES=en_US.UTF-8

That does the trick: English messages, with Proper currency, Proper dates, accented letters sort Properly, Proper A4 printer paper, Proper SI units. Even Nuggan would be happy.

Posted Sat Apr 17 15:39:19 2010 Tags: tips

Temporarily disabling file caching

Does it happen to you that you cp a big, big file (say, similar in order of magnitude to the amount of RAM) and the system becomes rather unusable?

It looks like Linux is saying "let's cache this", and as you copy it will try to free more and more ram in order to cache the big file you're copying. In the end, all the RAM is full with file data that you are not going to need.

This varies according to how /proc/sys/vm/swappiness is set.

I learnt about posix_fadvise and I tried to play with it. The result is this preloadable library that hooks into open(2) and fadvises everything as POSIX_FADV_DONTNEED.

It is all rather awkward. fadvise in that way will discard existing cache pages if the file is already cached, which is too much. Ideally one would like to say "don't cache this because of me" without stepping on the feet of other system activities.

Also, I found I need to also hook into write(2) and run fadvise after every single write, because you can't fadvise a file to be written in its entirety, unless you pass fadvise the file size in advance. But the size of the output file cannot be known by the preloaded library, so meh.

So, now I can run: nocache cp bigfile someplace/ without trashing the existing caches. I can also run nocache tar zxf foo.tar.gz and so on. I wish, of course, that there were no need to do so in the first place.

Here is the nocache library source code, for reference:

/*
 * nocache - LD_PRELOAD library to fadvise written files to not be cached
 *
 * Copyright (C) 2009--2010 Enrico Zini <enrico@enricozini.org>
 *
 * This program is free software; you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation; either version 2 of the License, or
 * (at your option) any later version.
 *
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with this program; if not, write to the Free Software
 * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
 */

#define _XOPEN_SOURCE 600
#include <fcntl.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <dlfcn.h>
#include <stdarg.h>
#include <errno.h>
#include <stdio.h>

typedef int (*open_t)(const char*, int, ...);
typedef int (*write_t)(int fd, const void *buf, size_t count);

int open(const char *pathname, int flags, ...)
{
    static open_t func = 0;
    int res;
    if (!func)
        func = (open_t)dlsym(RTLD_NEXT, "open");

    // Note: I wanted to add O_DIRECT, but it imposes restriction on buffer
    // alignment
    if (flags & O_CREAT)
    {
        va_list ap;
        va_start(ap, flags);
        mode_t mode = va_arg(ap, mode_t);
        res = func(pathname, flags, mode);
        va_end(ap);
    } else
        res = func(pathname, flags);

    if (res >= 0)
    {
        int saved_errno = errno;
        int z = posix_fadvise(res, 0, 0, POSIX_FADV_DONTNEED);
        if (z != 0) fprintf(stderr, "Cannot fadvise on %s: %m\n", pathname);
        errno = saved_errno;
    }

    return res;
}

int write(int fd, const void *buf, size_t count)
{
    static write_t func = 0;
    int res;
    if (!func)
        func = (write_t)dlsym(RTLD_NEXT, "write");

    res = func(fd, buf, count);

    if (res > 0)
    {
        int saved_errno = errno;
        int z = posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
        if (z != 0) fprintf(stderr, "Cannot fadvise during write: %m\n");
        errno = saved_errno;
    }

    return res;
}

Updates

Steve Schnepp writes:

Robert Love did a O_STREAMING patch for 2.4. It wasn't merged in 2.6 since POSIX_FADV_NOREUSE should be used instead.

But unfortunatly it's currently mapped either as WILLNEED or as a noop.

It seems that there is a google code project that has spawned to control this.

Posted Mon Mar 8 13:26:28 2010 Tags: tips

Cropping images with GDAL

I am working to get a better integration between Meteosatlib and GDAL.

A nice aspect of GDAL is that it allows to create read/write drivers around two functions: Open and CreateCopy.

Open opens a datset read only, and to implement that all you have to do is to implement read access to your data using the GDALDataset interface.

To implement CreateCopy, all you have to do is to create a new file reading information from a GDALDataset; then call Open on it. This means that there is no need to support incremental updates, and that all the data required to create a new file is readily available. This simplifies matters a lot.

GDAL provides some interesting image manipulation functions, that can work over just these two calls. The way it does it is by exploiting the concept of a virtual dataset, which wraps an existing dataset but changes some parameters on the fly.

This means that you can wrap a read only dataset with a virtual dataset that transforms it somehow, and then pass the virtual dataset to CreateCopy to save the transformed image in a format of your choice.

On top of that, for many transformations you do not need to create your own virtual datasets, but you can use the functions provided by the VRT GDAL driver.

One can learn a lot on how to use VRTDataset by reading the source code for gdal_translate. By doing that I could come up with this code for cropping an image, which is an interesting VRTDataset example.

GDALDataset* crop(GDALDataset* poDS, int xoff, int yoff, int xsize, int ysize)
{
        VRTDataset *poVDS = (VRTDataset*)VRTCreate(xsize, ysize);

        // Copy dataset info
        const char* pszProjection = poDS->GetProjectionRef();
        if (pszProjection != NULL && strlen(pszProjection) > 0)
                poVDS->SetProjection(pszProjection);

        double adfGeoTransform[6];
        if (poDS->GetGeoTransform(adfGeoTransform) == CE_None)
        {
                // Adapt the geotransform matrix to the subarea
                adfGeoTransform[0] += xoff * adfGeoTransform[1]
                        + yoff * adfGeoTransform[2];
                adfGeoTransform[3] += xoff * adfGeoTransform[4]
                        + yoff * adfGeoTransform[5];

                poVDS->SetGeoTransform(adfGeoTransform);
        }

        poVDS->SetMetadata(poDS->GetMetadata());

        // Here I also copy metadata from my own domain
        char **papszMD;
        papszMD = poDS->GetMetadata(MD_DOMAIN_MSAT);
        if (papszMD != NULL)
                poVDS->SetMetadata(papszMD, MD_DOMAIN_MSAT);

        for (int i = 0; i < poDS->GetRasterCount(); ++i)
        {
                GDALRasterBand* poSrcBand = poDS->GetRasterBand(i + 1);
                GDALDataType eBandType = poSrcBand->GetRasterDataType();
                poVDS->AddBand(eBandType, NULL);
                VRTSourcedRasterBand* poVRTBand = (VRTSourcedRasterBand*)poVDS->GetRasterBand(i + 1);
                poVRTBand->AddSimpleSource(poSrcBand,
                                xoff, yoff,
                                xsize, ysize,
                                0, 0, xsize, ysize);
                poVRTBand->CopyCommonInfoFrom(poSrcBand);

                // Again, I copy my own metadata
                papszMD = poSrcBand->GetMetadata(MD_DOMAIN_MSAT);
                if (papszMD != NULL)
                        poVRTBand->SetMetadata(papszMD, MD_DOMAIN_MSAT);
        }

        return poVDS;
}

This function wraps a dataset with a virtual dataset that crops it. Just pass the resulting dataset to GDALCreateCopy to save it in the format that you need.

Posted Wed Feb 3 16:19:54 2010 Tags: tips

Custom function decorators with TurboGears 2

I am exposing some library functions using a TurboGears2 controller (see web-api-with-turbogears2). It turns out that some functions return a dict, some a list, some a string, and TurboGears 2 only allows JSON serialisation for dicts.

A simple work-around for this is to wrap the function result into a dict, something like this:

@expose("json")
@validate(validator_dispatcher, error_handler=api_validation_error)
def list_colours(self, filter=None, productID=None, maxResults=100, **kw):
    # Call API
    res = self.engine.list_colours(filter, productID, maxResults)

    # Return result
    return dict(r=res)

It would be nice, however, to have an @webapi() decorator that automatically wraps the function result with the dict:

def webapi(func):
    def dict_wrap(*args, **kw):
        return dict(r=func(*args, **kw))
    return dict_wrap

# ...in the controller...

    @expose("json")
    @validate(validator_dispatcher, error_handler=api_validation_error)
    @webapi
    def list_colours(self, filter=None, productID=None, maxResults=100, **kw):
        # Call API
        res = self.engine.list_colours(filter, productID, maxResults)

        # Return result
        return res

This works, as long as @webapi appears last in the list of decorators. This is because if it appears last it will be the first to wrap the function, and so it will not interfere with the tg.decorators machinery.

Would it be possible to create a decorator that can be put anywhere among the decorator list? Yes, it is possible but tricky, and it gives me the feeling that it may break in any future version of TurboGears:

class webapi(object):
    def __call__(self, func):
        def dict_wrap(*args, **kw):
            return dict(r=func(*args, **kw))
        # Migrate the decoration attribute to our new function
        if hasattr(func, 'decoration'):
            dict_wrap.decoration = func.decoration
            dict_wrap.decoration.controller = dict_wrap
            delattr(func, 'decoration')
        return dict_wrap

# ...in the controller...

    @expose("json")
    @validate(validator_dispatcher, error_handler=api_validation_error)
    @webapi
    def list_colours(self, filter=None, productID=None, maxResults=100, **kw):
        # Call API
        res = self.engine.list_colours(filter, productID, maxResults)

        # Return result
        return res

As a convenience, TurboGears 2 offers, in the decorators module, a way to build decorator "hooks":

class before_validate(_hook_decorator):
    '''A list of callables to be run before validation is performed'''
    hook_name = 'before_validate'

class before_call(_hook_decorator):
    '''A list of callables to be run before the controller method is called'''
    hook_name = 'before_call'

class before_render(_hook_decorator):
    '''A list of callables to be run before the template is rendered'''
    hook_name = 'before_render'

class after_render(_hook_decorator):
    '''A list of callables to be run after the template is rendered.

    Will be run before it is returned returned up the WSGI stack'''

    hook_name = 'after_render'

The way these are invoked can be found in the _perform_call function in tg/controllers.py.

To show an example use of those hooks, let's add a some polygen wisdom to every data structure we return:

class wisdom(decorators.before_render):
    def __init__(self, grammar):
        super(wisdom, self).__init__(self.add_wisdom)
        self.grammar = grammar
    def add_wisdom(self, remainder, params, output):
        from subprocess import Popen, PIPE
        output["wisdom"] = Popen(["polyrun", self.grammar], stdout=PIPE).communicate()[0]

# ...in the controller...

    @wisdom("genius")
    @expose("json")
    @validate(validator_dispatcher, error_handler=api_validation_error)
    def list_colours(self, filter=None, productID=None, maxResults=100, **kw):
        # Call API
        res = self.engine.list_colours(filter, productID, maxResults)

        # Return result
        return res

These hooks cannot however be used for what I need, that is, to wrap the result inside a dict. The reason is because they are called in this way:

        controller.decoration.run_hooks(
                'before_render', remainder, params, output)

and not in this way:

        output = controller.decoration.run_hooks(
                'before_render', remainder, params, output)

So it is possible to modify the output (if it is a mutable structure) but not to exchange it with something else.

Can we do even better? Sure we can. We can assimilate @expose and @validate inside @webapi to avoid repeating those same many decorator lines over and over again:

class webapi(object):
    def __init__(self, error_handler = None):
        self.error_handler = error_handler

    def __call__(self, func):
        def dict_wrap(*args, **kw):
            return dict(r=func(*args, **kw))
        res = expose("json")(dict_wrap)
        res = validate(validator_dispatcher, error_handler=self.error_handler)(res)
        return res

# ...in the controller...

    @expose("json")
    def api_validation_error(self, **kw):
        pylons.response.status = "400 Error"
        return dict(e="validation error on input fields", form_errors=pylons.c.form_errors)

    @webapi(error_handler=api_validation_error)
    def list_colours(self, filter=None, productID=None, maxResults=100, **kw):
        # Call API
        res = self.engine.list_colours(filter, productID, maxResults)

        # Return result
        return res

This got rid of @expose and @validate, and provides almost all the default values that I need. Unfortunately I could not find out how to access api_validation_error from the decorator so that I can pass it to the validator, therefore I remain with the inconvenience of having to explicitly pass it every time.

Posted Wed Nov 4 17:52:38 2009 Tags: tips

Building a web-based API with Turbogears2

I am using TurboGears2 to export a python API over the web. Every API method is wrapper by a controller method that validates the parameters and returns the results encoded in JSON.

The basic idea is this:

@expose("json")
def list_colours(self, filter=None, productID=None, maxResults=100, **kw):
    # Call API
    res = self.engine.list_colours(filter, productID, maxResults)

    # Return result
    return res

To validate the parameters we can use forms, it's their job after all:

class ListColoursForm(TableForm):
    fields = [
            # One field per parameter
            twf.TextField("filter", help_text="Please enter the string to use as a filter"),
            twf.TextField("productID", help_text="Please enter the product ID"),
            twf.TextField("maxResults", validator=twfv.Int(min=0), default=200, size=5, help_text="Please enter the maximum number of results"),
    ]
list_colours_form=ListColoursForm()

#...

    @expose("json")
    @validate(list_colours_form, error_handler=list_colours_validation_error)
    def list_colours(self, filter=None, productID=None, maxResults=100, **kw):
        # Parameter validation is done by the form

        # Call API
        res = self.engine.list_colours(filter, productID, maxResults)

        # Return result
        return res

All straightforward so far. However, this means that we need two exposed methods for every API call: one for the API call and one error handler. For every API call, we have to type the name several times, which is error prone and risks to get things mixed up.

We can however have a single error handler for all methonds:

def get_method():
    '''
    The method name is the first url component after the controller name that
    does not start with 'test'
    '''
    found_controller = False
    for name in pylons.c.url.split("/"):
        if not found_controller and name == "controllername":
            found_controller = True
            continue
        if name.startswith("test"):
            continue
        if found_controller:
            return name
    return None

class ValidatorDispatcher:
    '''
    Validate using the right form according to the value of the "method" field
    '''
    def validate(self, args, state):
        method = args.get("method", None)
    # Extract the method from the URL if it is missing
        if method is None:
            method = get_method()
            args["method"] = method
        return forms[method].validate(args, state)

validator_dispatcher = ValidatorDispatcher()

This validator will try to find the method name, either as a form field or by parsing the URL. It will then use the method name to find the form to use for validation, and pass control to the validate method of that form.

We then need to add an extra "method" field to our forms, and arrange the forms inside a dictionary:

class ListColoursForm(TableForm):
    fields = [
            # One hidden field to have a place for the method name
            twf.HiddenField("method")
            # One field per parameter
            twf.TextField("filter", help_text="Please enter the string to use as a filter"),
    #...

forms["list_colours"] = ListColoursForm()

And now our methods become much nicer to write:

    @expose("json")
    def api_validation_error(self, **kw):
        pylons.response.status = "400 Error"
        return dict(form_errors=pylons.c.form_errors)

    @expose("json")
    @validate(validator_dispatcher, error_handler=api_validation_error)
    def list_colours(self, filter=None, productID=None, maxResults=100, **kw):
        # Parameter validation is done by the form

        # Call API
        res = self.engine.list_colours(filter, productID, maxResults)

        # Return result
        return res

api_validation_error is interesting: it returns a proper HTTP error status, and a JSON body with the details of the error, taken straight from the form validators. It took me a while to find out that the form errors are in pylons.c.form_errors (and for reference, the form values are in pylons.c.form_values). pylons.response is a WebOb Response that we can play with.

So now our client side is able to call the API methods, and get a proper error if it calls them wrong.

But now that we have the forms ready, it doesn't take much to display them in web pages as well:

def _describe(self, method):
    "Return a dict describing an API method"
    ldesc = getattr(self.engine, method).__doc__.strip()
    sdesc = ldesc.split("\n")[0]
    return dict(name=method, sdesc = sdesc, ldesc = ldesc)

@expose("myappserver.templates.myappapi")
def index(self):
    '''
    Show an index of exported API methods
    '''
    methods = dict()
    for m in forms.keys():
        methods[m] = self._describe(m)
    return dict(methods=methods)

@expose('myappserver.templates.testform')
def testform(self, method, **kw):
    '''
    Show a form with the parameters of an API method
    '''
    kw["method"] = method
    return dict(method=method, action="/myapp/test/"+method, value=kw, info=self._describe(method), form=forms[method])

@expose(content_type="text/plain")
@validate(validator_dispatcher, error_handler=testform)
def test(self, method, **kw):
    '''
    Run an API method and show its prettyprinted result
    '''
    res = getattr(self, str(method))(**kw)
    return pprint.pformat(res)

In a few lines, we have all we need: an index of the API methods (including their documentation taken from the docstrings!), and for each method a form to invoke it and a page to see the results.

Make the forms children of AjaxForm, and you can even see the results together with the form.

Posted Thu Oct 15 15:45:39 2009 Tags: tips

Lessons learnt from Oracle

Lesson number 1: "how to handle temporary files".

$ rm tp*
$ proc code=cpp lines=yes dba_vm.pc 

Pro*C/C++: Release 10.2.0.1.0 - Production on Wed Sep 30 11:10:00 2009

Copyright (c) 1982, 2005, Oracle.  All rights reserved.

System default option values taken from: /usr/local/oracle/10.2.01/db_1/precomp/admin/pcscfg.cfg

$ echo $?
0
$ ls tp*
tpoHRjc8  tpqkU4Cp  tpY5Eo4G
$ 

Today we learn this: for every successful compilation of a single source file, it is a Good Thing to leave three temporary files around as a tribute to the Holy Trinity who bestowed upon us the grace of seeing our prayers fulfilled and our work rewarded.

For those illiterate of us who do not want to learn from the Market Leaders, and do not want to put a rm -f tp* in their makefiles just to avoid in the future to see hours of work on tpl_support.cc disappear at the first invocation of make, here's the Yokel's Wrapper:

#!/bin/sh -ue

if [ $# -ne 2 ]
then
        echo "Usage: $0 infile outfile" >&2
        exit 1
fi

DIR=`mktemp -d`

cleanup() {
        rm -rf "$DIR"
}

trap cleanup EXIT

cp "$1" "$DIR"
(
        cd $DIR
        proc code=cpp lines=yes "$1"
        mv `basename "$1" .pc`.c "$2"
)
mv "$DIR/$2" .

exit 0
Posted Wed Sep 30 11:31:02 2009 Tags: tips