Filtering nodes out of OSM files
I have a pet project here at SoTM10: create a tool for searching nearby POIs while offline.
The idea is to have something in my pocket (FreeRunner or N900), which doesn't require an internet connection, and which can point me at the nearest fountains, post offices, atm machines, bars and so on.
The first step is to obtain a list of POIs.
In theory one can use Xapi but all the known Xapi servers appear to be down at the moment.
Another attempt is to obtain it by filtering all nodes with the tags we want out of a planet OSM extract. I downloaded the Spanish one and set to work.
First I tried with xmlstarlet, but it ate all the RAM and crashed my laptop, because for some reason, on my laptop the Linux kernels up to 2.6.32 (don't now about later ones) like to swap out ALL running apps to cache I/O operations, which mean that heavy I/O operations swap out the very programs performing them, so the system gets caught in some infinite I/O loop and dies. Or at least this is what I've figured out so far.
So, we need SAX. I put together this prototype in Python, which can process a nice 8MB/s of OSM data for quite some time with a constant, low RAM usage:
#!/usr/bin/python # # poifilter - extract interesting nodes from OSM XML files # # Copyright (C) 2010 Enrico Zini <firstname.lastname@example.org> # # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation; either version 2 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License # along with this program; if not, write to the Free Software # Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA # import xml.sax import xml.sax.handler import xml.sax.saxutils import sys class XMLSAXFilter(xml.sax.handler.ContentHandler): ''' A SAX filter that is a ContentHandler. There is xml.sax.saxutils.XMLFilterBase in the standard library but it is undocumented, and most of the examples using it you find online are wrong. You can look at its source code, and at that point you find out that it is an offensive practical joke. ''' def __init__(self, downstream): self.downstream = downstream # ContentHandler methods def setDocumentLocator(self, locator): self.downstream.setDocumentLocator(locator) def startDocument(self): self.downstream.startDocument() def endDocument(self): self.downstream.endDocument() def startPrefixMapping(self, prefix, uri): self.downstream.startPrefixMapping(prefix, uri) def endPrefixMapping(self, prefix): self.downstream.endPrefixMapping(prefix) def startElement(self, name, attrs): self.downstream.startElement(name, attrs) def endElement(self, name): self.downstream.endElement(name) def startElementNS(self, name, qname, attrs): self.downstream.startElementNS(name, qname, attrs) def endElementNS(self, name, qname): self.downstream.endElementNS(name, qname) def characters(self, content): self.downstream.characters(content) def ignorableWhitespace(self, chars): self.downstream.ignorableWhitespace(chars) def processingInstruction(self, target, data): self.downstream.processingInstruction(target, data) def skippedEntity(self, name): self.downstream.skippedEntity(name) class OSMPOIHandler(XMLSAXFilter): ''' Filter SAX events in a OSM XML file to keep only nodes with names ''' PASSTHROUGH = ["osm", "bound"] TAG_WHITELIST = set(["amenity", "shop", "tourism", "place"]) def startElement(self, name, attrs): if name in self.PASSTHROUGH: self.downstream.startElement(name, attrs) elif name == "node": self.attrs = attrs self.tags =  self.propagate = False elif name == "tag": if self.tags is not None: self.tags.append(attrs) if attrs["k"] in self.TAG_WHITELIST: self.propagate = True else: self.tags = None self.attrs = None def endElement(self, name): if name in self.PASSTHROUGH: self.downstream.endElement(name) elif name == "node": if self.propagate: self.downstream.startElement("node", self.attrs) for attrs in self.tags: self.downstream.startElement("tag", attrs) self.downstream.endElement("tag") self.downstream.endElement("node") def ignorableWhitespace(self, chars): pass def characters(self, content): pass # Simple stdin->stdout XMl filter parser = xml.sax.make_parser() handler = OSMPOIHandler(xml.sax.saxutils.XMLGenerator(sys.stdout, "utf-8")) parser.setContentHandler(handler) parser.parse(sys.stdin)
Let's run it:
$ bzcat /store/osm/spain.osm.bz2 | pv | ./poifilter > pois.osm [...] $ ls -l --si pois.osm -rw-r--r-- 1 enrico enrico 19M Jul 10 23:56 pois.osm $ xmlstarlet val pois.osm pois.osm - valid
Problem 1 solved: now on to the next step: importing the nodes in a database.