Index of categories

Pages related to my visit to Addis Ababa for a Linux training course.

Fourth day in Addis

Unix file permissions:

    drwxr-xr-x   2 root root    38 2006-07-14 
    |
    +- Is a directory

    drwxr-xr-x   2 root root    38 2006-07-14 
     ---
      |
      +- User permissions (u)

    drwxr-xr-x   2 root root    38 2006-07-14 
        ---
         |
         +- Group permissions (g)

    drwxr-xr-x   2 root root    38 2006-07-14 
           ---
            |
            +- Permissions for others (o)

    drwxr-xr-x   2 root root    38 2006-07-14 
                   ----
                    |
                    +- Owner user

    drwxr-xr-x   2 root root    38 2006-07-14 
                        ----
                         |
            Owner group -+

Other bits:

  • 4000 Set user ID:

    • For executable files: run as the user who owns the file, instead of the user who runs the file
    • For directories: I think it's not used
  • 2000 Set group ID:

    • For executable files: run as the group who owns the file, instead of the group of the user who runs the file
    • For directories: when a file is created inside the directory, it belongs to the group of the directory instead of the default group of the user who created the file
  • 1000 Sticky bit:

    • For files: I think it's not used anymore
    • For directories: only the owner of a file can delete or rename the file

The executable bit for directories means "can access the files in the directory".

If a directory is readable but not executable, then I can see the list of files (with ls) but I cannot access the files.

To access a file, all the directories of its path up to / need to be executable.
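
For instance, a quick demo of the last two rules (a sketch, using a scratch directory):

    mkdir demo; touch demo/file
    chmod u=rw,go= demo   # readable but not executable
    ls demo               # listing the file names still works
    cat demo/file         # fails with "Permission denied": the x bit is missing
    chmod u=rwx demo      # the x bit restores access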

Commands to manipulate permissions:

  • chown - change file owner and group
  • chgrp - change group ownership
  • chmod - change file access permissions

  • sudo adduser enrico www-data adds the user enrico to the group www-data.

Example setup for a website for students:

    # Create the group 'students'
    addgroup students
    mkdir /var/www/students
    chgrp students /var/www/students
    chmod 2775 /var/www/students

    # If you don't want other users to read the files of the students:

    chmod 2770 /var/www/students
    # Add the web server user to the group,
    # so that it can read the pages:
    adduser www-data students

    # when you add a user to a group, it does not affect running processes:

     - users need to log out and in again
     - servers need to be restarted

Apache:

  • To install apache2 without a graphical interface:

     apt-cache search apache2 | less
     sudo apt-get install apache2
    
  • By default, /var/www is where the static website lives.

  • By default, ~/public_html is the personal webspace for every user, accessible as: http://localhost/~user

  • By default, /usr/lib/cgi-bin contains scripts that are executed when someone browses http://website/cgi-bin/script

  • By default, apache reads the server name from the DNS. If we don't have a name in the DNS and we want to use the IP, we need to set:

     ServerName 10.4.15.158
    

    in /etc/apache2/apache2.conf (set it to your IP address)

  • To access the Apache manual: http://localhost/doc/apache2-doc/manual/

  • http://localhost/doc/apache2-doc/manual/mod/mod_access.html The access control module

  • http://localhost/doc/apache2-doc/manual/mod/mod_auth.html The user authentication module

  • To edit a user password file, use htpasswd (see the example after this list):

     htpasswd - Manage user files for basic authentication
    
  • Example .htaccess file to password protect a directory:

     AuthUserFile /etc/apache2/students
     AuthType Basic
     AuthName "Students"
     Require valid-user
    
  • Information about .htaccess is in http://localhost/doc/apache2-doc/manual/howto/htaccess.html

  • If you need to tell apache to listen on different ports, add a Listen directive to /etc/apache2/ports.conf. Then you can use:

     <VirtualHost www.training.aau.edu.et:9000>
     [...]
     </VirtualHost>
    
  • To setup an HTTPS website:

    • Documentation is in http://localhost/doc/apache2-doc/manual/ssl/
    • How to create a certificate: http://www.tc.umn.edu/~brams006/selfsign.html

    • Create a certificate:

      /usr/sbin/apache2-ssl-certificate -days 365

    • Create a virtual host on port 443:

      [...]

    • Enable SSL in the VirtualHost:

      SSLEngine On
      SSLCertificateFile /etc/apache2/ssl/apache.pem

    • Enable listening on the HTTPS port (/etc/apache2/ports.conf):

      Listen 443
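
To create the password file used in the .htaccess example above (the user names are just examples):

    # -c creates the file; use it only the first time
    sudo htpasswd -c /etc/apache2/students enrico
    # Add further users without -c, which would overwrite the file
    sudo htpasswd /etc/apache2/students yuwei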

Apache troubleshooting:

  • check that there are no errors in the configuration file:

     apache2ctl configtest
    

    This is always a good thing to do before restarting or reloading apache.

  • read logs in /var/log/apache2/

  • if you made a change but you don't see it on the web, it can be that you have the old page in the cache of the browser: try reloading a few times.
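
A convenient habit while testing changes (the log path is the Debian default):

    sudo apache2ctl configtest && sudo /etc/init.d/apache2 reload
    tail -f /var/log/apache2/error.log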

To install PHP

  • apt-get install libapache2-mod-php5
  • then, by default, every .php file is executed as PHP code
  • Small but useful test php file:

     <?php phpinfo(); ?>
    

To install MySQL

  • apt-get install mysql-client mysql-server
  • for administration run mysql as root:

    • Create a database with:

      create database students;

  • Give a user access to the database:

     # Without password
     grant all on students.* to enrico;
    
     # With password
     grant all on students.* to enrico identified by "SECRET";
    
  • More information can be found at http://www-css.fnal.gov/dsg/external/freeware/mysqlAdmin.html
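
To check that the grant works, try connecting to the database as that user (names as in the example above):

    mysql -u enrico students -e "show tables;"
    # With a password:
    mysql -u enrico -p students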

To use MySQL from PHP:

    apt-get install php5-mysqli php5-mysql

Problems found today:

  • the apache2 manual in /usr/share/doc/apache2-doc/manual can only be viewed using apache because it uses MultiViews. So you need to have a working apache to read how to have a working apache.

  • chmod does not have examples in the manpage.
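
Until that is fixed, these are the kinds of examples that would help:

    chmod u+x file        # add execute permission for the owner
    chmod g-w,o-r file    # remove group write and world read
    chmod 2775 dir        # octal mode, including the setgid bit
    chmod -R go-rwx dir   # recursively remove group and world access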

Posted Sat Jun 6 00:57:39 2009

Ethiopia

It is interesting, beautiful and sad at the same time to find yourself redefining the meaning of "Abyssinia". And to curse that, for the first 30 years of your life, you only ever heard that word when some asshole sang "Faccetta nera".

Posted Sat Jun 6 00:57:39 2009

Addis course Tasks & Skills questions

  • What does the command find /etc | less do?

  • What does the command ps aux do?

  • What does the command mii-tool do and when would you use it?

  • What does the command host www.google.com do?

  • How do you get the MAC address of your computer?

  • What can you use dnsmasq for?

  • What is in /etc/dnsmasq.conf?

  • What is the use of the dhcp-option configuration parameter of /etc/dnsmasq.conf?

  • What is the difference between chown, chgrp and chmod?

  • What would you use nmap for?

  • How do you check to see if a network service is running on your computer?

  • What does apache2ctl configtest do? When should you run it?

  • Consider this piece of configuration of apache:

     AuthUserFile /etc/apache2/students
     AuthType Basic
     AuthName "Students"
     Require valid-user
    

    What does it do?

    What command would you use to add a new username and password to /etc/apache2/students? (you can write the entire commandline if you know it, but just the name of the command is fine)

  • You created the configuration for a new apache site in /etc/apache2/sites-available. How do you activate the new site?

  • When do you need to add the line Listen 443 to /etc/apache2/ports.conf?

  • What do you normally find in /var/log/syslog, and when would you read it?

  • What does the command smbclient //localhost/web do?

  • What does the command sudo smbpasswd -a enrico do?

  • Where do you look for the explanation of the many directives found in /etc/samba/smb.conf?

  • What is the purpose of the package cupsys?

  • What is the purpose of the command iptables?

  • What is the difference between MDA, MTA and MUA?

  • In a normal mail server configuration, when should you accept a mail coming from outside your local network?

  • Suppose you are a mail software and you need to send a mail to addis@yahoo.com: how do you find out the internet host to which you should connect to send the mail?

  • What is the difference between man 5 postconf and man 8 postconf?

  • What are the different uses of SMTP and IMAP?

  • What is a "smarthost" in the context of mail server configuration?

  • What does the command mailq do?

  • What does the command sudo postsuper -d ALL deferred do?

  • Postfix has four mail queues: "incoming", "active", "deferred" and "hold". What is the difference among them?

  • What does the package dovecot do?

  • In the file /etc/dovecot/dovecot.conf, what is the difference between having protocols = imap and protocols = imaps?

  • What happens if I put the line enrico@enricozini.org in the file /home/enrico/.forward?

  • Consider this list of possible strategies for handling mail classified as spam:

    • silently delete it
    • refuse the mail and send a notification to the sender
    • refuse the mail and send a notification to the receiver
    • quarantine the e-mail
    • refuse delivery with a SMTP error
    • deliver with an extra header that says that it's spam

    What are their advantages and disadvantages?

Posted Sat Jun 6 00:57:39 2009

Buildings

From a song in Amharic:

"Your love has grown old

like the buildings built by the Italians"

Posted Sat Jun 6 00:57:39 2009

Tenth day in Addis

Procedure to check if all the services of Dream University are up and running

If a machine blocks pings, use arping instead.

  1. Test DHCP:

    $ sudo ifdown eth0
    $ sudo ifup eth0
    $ ifconfig
    
  2. Test the DNS:

    # See if the DNS machine is on the network
    $ ping -n 192.168.0.1
    
    # See if the DNS resolves names
    $ host www.dream.edu.et
    
  3. Test the gateway:

    # Ping the gateway
    $ ping gateway
    # Ping an outside host
    $ ping -n 10.4.15.6
    
  4. Test the proxy:

    # Ping the proxy
    $ ping proxy
    # Open a web page and see if it displays
    # See if it caches
    http_proxy=http://proxy.dream.edu.et:3030/ wget -S -O/dev/null http://www.enricozini.org  2>&1 | grep X-Cache
    
  5. Test the mail server:

    $ ping smtp
    $ nmap smtp -p 25 |grep 25/tcp
    $ if nmap gateway -p 25 |grep 25/tcp | grep -q open ; then echo "It works"; fi
    # send a mail and see if you receive it
    

To do more advanced network and service monitoring, try nagios.
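
Until then, the basic reachability checks can be strung together in a small script (a sketch; the host names are the hypothetical ones used above):

    #!/bin/sh
    # Ping each critical host once and report failures
    for host in gateway proxy smtp www.dream.edu.et; do
        if ping -c 1 -W 2 "$host" > /dev/null 2>&1; then
            echo "$host: ok"
        else
            echo "$host: DOWN"
        fi
    done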

New useful tools seen today

wget - The non-interactive network downloader.

Special devices

  • /dev/null:
    • On read, there is no data.
    • On write, discards data.
  • /dev/zero:
    • On read, reads an infinite stream of zero bytes.
    • On write, discards data.
  • /dev/random, /dev/urandom
    • On read, reads random bits.
    • On write, discards data.
    • Difference: /dev/random is cryptographically secure, but it can block waiting for the system to gather enough entropy; /dev/urandom never blocks.

Example uses:

wget -O/dev/null http://www.example.org

dd if=/dev/zero of=testdisk bs=1M count=50
mke2fs testdisk
sudo mount -o loop testdisk  /mnt

Tiny little commands

  • true - do nothing, successfully
  • false - do nothing, unsuccessfully
  • yes - output a string repeatedly until killed

Example uses:

  • while /bin/true; do echo ciao; done
  • Using /bin/false as a shell
  • yes | boring-tool-that-asks-lots-of-silly-questions

Some more shell syntax

  • 2>&1 Redirects the standard error to the standard output
  • 2> Redirects the standard error instead of the standard output

Some people run commands ignoring the standard error (command 2> /dev/null): this causes unexpected error messages to go unnoticed. Please do not do it.
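
A couple of examples (some-command stands for any command):

    # Send both standard output and standard error to a log file
    some-command > output.log 2>&1

    # Keep standard output on the terminal, log only the errors
    some-command 2> errors.log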

What to check if a machine is very slow

  • See if the RAM is full: $ free. If it is, see which are the fattest programs using top, pressing M to sort by memory usage.
  • See if there are lots of programs competing for CPU: $ top
  • Check if you have I/O bottlenecks: $ vmstat (but I don't know how to read it)
  • For a desktop on older hardware, you can try xubuntu instead of ubuntu

More VIM command mode

Command mode allows you to perform various text editing functions.

You work by performing operations on selected blocks of text.

Some common operations:

  • y: copy ("yank")
  • p: paste
  • P: paste before
  • d: cut ("delete")
  • c: change
  • i: insert
  • a: append
  • .: repeat last operation

Some common blocks:

  • w: word
  • }: paragraph
  • left and right arrow: one character left or right
  • up and down arrow: this line and the one on top or below
  • f letter: from the cursor until the given letter
  • v: selection
  • V: line selection
  • ^V: block selection

Examples:

  • yw: copy word
  • dw: cut word
  • yy: copy line
  • dd: cut line
  • V (select lines) y: copy a selection of lines
  • V (select lines) d: cut a selection of lines
  • p: paste

The best way to learn more vim is always to run vimtutor.

Installing squirrelmail

To install squirrelmail:

  1. apt-get install squirrelmail
  2. Run /usr/sbin/squirrelmail-configure and configure IMAP and SMTP.

    In our case, since we use IMAPS, the IMAP server is imap.dream.edu.et, port 993, secure IMAP enabled and SMTP is smtp.dream.edu.et.

  3. Read /usr/share/doc/squirrelmail/README.Debian.gz (with zless) for how to proceed with setup. A short summary:
    • link /etc/squirrelmail/apache.conf into the apache conf.d directory
    • customise /etc/squirrelmail/apache.conf for example setting up the virtual hosts, or running it only on SSL

To have different virtual hosts over HTTPS, you need to have a different IP for every virtual host: name based virtual hosts do not work on HTTPS.

You can configure multiple IP addresses on the same computer: use network interfaces named eth0:1, eth0:2, eth0:3, and so on. These are called interface aliases.

You cannot set up interface aliases using the graphical network configuration: you need to add them to /etc/network/interfaces:

    iface eth0:1 inet static
          address 192.168.0.201
          netmask 255.255.255.0
          gateway 192.168.0.3
    auto eth0:1

This is the trick commonly used to put different virtual HTTPS hosts on the same computer.


Posted Sat Jun 6 00:57:39 2009

Ninth day in Addis

SSH

To enable remote logins with ssh

apt-get install openssh-server

Then you can login with:

$ ssh efossnet@proxy.dream.edu.et

To verify the host key fingerprint of a machine:

$ ssh-keygen -l -f /etc/ssh/ssh_host_rsa_key.pub

Note: you need to verify it before logging in!

More information at http://www.securityfocus.com/infocus/1806

Example ssh usages

To log in:

    $ ssh efossnet@proxy

To run a command in the remote computer:

    $ ssh efossnet@proxy "cat /etc/hosts"

To copy a file to the remote computer:

    $ scp Desktop/july-18.tar.gz efossnet@proxy:

To copy a file from the remote computer:

    $ scp efossnet@proxy:july-18.tar.gz /tmp/

Beware of brute-force login attempts

Warning about SSH: there are people who run automated scans for ssh servers and try to log in using commonly used easy passwords.

If you have an SSH server on the network, use strong passwords; better still, if you can, disable password authentication entirely: in /etc/ssh/sshd_config, add:

    PasswordAuthentication no
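
Then reload the ssh server so the change takes effect (using the Debian init script):

    sudo /etc/init.d/ssh reload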

To log in using public/private keys:

  1. Create your key:

    ssh-keygen -t rsa
    
  2. Copy your public key to the machine where you want to log in:

    ssh-copy-id -i .ssh/id_rsa.pub efossnet@proxy
    
  3. Now you can ssh using your RSA key

proxy

Problems we had today with the proxy:

ssl does not work

Reason: squid tries to directly connect to the ssl server, but the AAU network wants us to go through their proxy.

Ideal solution: none. There is no way to tell squid to use a parent proxy for SSL connections.

Solution: update the documentation for the Dream university users, telling them to set up a different proxy for SSL connections.

Longer term solution: get the AAU network admins to enable outgoing SSL connections from the Dream university proxy.

Other things that can be done:

  • report a bug on squid reporting the need and requesting the feature
  • download squid source code and implement the feature ourselves, then submit the patch to the squid people

Browsing normal pages returns an error of 'Connection refused'.

In the logs, the line is:

1153294204.912    887 192.168.0.200 TCP_MISS/503 1441 GET http://www.google.com.et/search? - NONE/- text/html

That "/503" is one of the HTTP error codes.

Reason: the other proxy is refusing connections from our proxy.

Solution: none so far. Will need to get in touch with the admins of the other proxy to try to find out why it refuses connection to our proxy, and how we can fix the problem.

postfix on smtp.dream.edu.et

Basic information is at http://www.postfix.org/basic.html.

Difference between mail name and smarthost:

  • The mail name is the name of the mail server you're setting up (TODO: need more details on what it's used for)
  • The smarthost is the name of the mail server that will relay mail for you.

Quick way to send test mails:

apt-get install mailx
echo ciao | mail efossnet@localhost

To configure a workstation not to do any mail delivery locally and send all mail produced locally to smtp.dream.edu.et:

  1. install postfix choosing "Satellite system"
  2. put smtp.dream.edu.et as a smarthost.
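
Behind the scenes, the smarthost answer ends up as a single parameter in /etc/postfix/main.cf:

    relayhost = smtp.dream.edu.et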

To set up a webmail: apt-get install squirrelmail (on a working apache setup).

To set up mailing lists: apt-get install mailman, then follow the instructions in /usr/share/doc.

Mail server issues we encountered

When a mail is sent to efossnet@localhost, the system tries to send it to efossnet@yoseph.org

Investigation:

  • "yoseph.org" does not appear anywhere in /etc or /var/spool/postfix
  • postfix configuration has been reloaded
  • postfix logs show that the mail has been 'forwarded'

Cause: the user efossnet had forgotten that he or she had set up a .forward file in the home directory.

Solution:

 rm ~efossnet/.forward

Apache

To add a new website:

  1. cd /etc/apache2/sites-available
  2. sudo cp default course
  3. sudo vi course:

    1. Remove the first line
    2. Add a ServerName directive with the address of your server: ServerName course.dream.edu.et
    3. Customize the rest as needed: you at least want to remove the support for browsing /usr/share/doc and you want to use a different document root.
  4. sudo a2ensite course

  5. sudo /etc/init.d/apache2 reload

More VIM

Undo: u (in command mode)

Redo: ^R (in command mode)

You can undo and redo multiple times.

To recover a lost password for root or for the ubuntu admin user

Boot with a live CD, mount the system on the hard disk (the live CD usually does it automatically), then edit the file /etc/shadow, removing the password:

enrico:$1$3AJfasjJFHa234dfh230:13343:0:99999:7:::

becomes:

enrico::13343:0:99999:7:::

You can edit the file because, in the live CD system, you can always become root.

After you do this, reboot the system: you can log in without password, and set yourself a new password using the command passwd.
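
The same edit can also be done non-interactively with sed (a sketch; it assumes the live CD mounted the disk on /mnt and the user is enrico):

    sudo sed -i 's/^enrico:[^:]*:/enrico::/' /mnt/etc/shadow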

Installing packages not on the CDs

To get a package for installing when offline:

  1. apt-get --print-uris install dnsmasq
  2. Manually download the packages at the URLs that it gives you

Otherwise, apt-get --download-only install dnsmasq will download the package for you in /var/cache/apt/archives.

You can install various previously downloaded debian packages with:

dpkg -i *.deb

Backups

There are various ways:

  • dump (for ext2/ext3 file systems) or xfsdump (for xfs file systems).

    Makes a low-level dump of the file system.

    It must be used for every different partition.

    It makes the most exact backup possible, including inode numbers.

    It can do full and incremental backups.

    To see the type of the filesystems, use 'mount' with no parameters.

    To restore: restore or xfsrestore.

  • tar

    Filesystem independent.

    It can work across partitions.

    It correctly backs up permissions and hard links.

    It can do full and incremental backups.

    Example:

     tar lzcpf backup.tar.gz /home /var /etc /usr/local
     tar lzcpf root.tar.gz /
    

    To restore:

     tar zxpf backup.tar.gz
    
  • faubackup

    Filesystem independent.

    Uses hard drive as backup storage.

    Always incremental.

    It cannot do compression.

    Unchanged files in new backups are just links to old backups, and do not occupy space.

    Any old backup can be deleted at any time without compromising the others.

    It can be used to provide a "yesterday's files" service to users (both locally and exported as a read-only samba share...).

    To restore, just copy the files from the backup area.

  • amanda

     apt-get install amanda-client amanda-server
    

    It is a network backup system.

    It can do full and incremental backups.

    You can have a backup server which handles the storage and various backup clients that send the files to backup to the server.

    It takes some studying to set up.

    To restore: it has its own tool.

Some data requires exporting before backing it up:

  • To save the list of installed packages and the answer to configuration questions:

     dpkg --get-selections > pkglist
     debconf-get-selections > pkgconfig
    

    To restore:

     dpkg --set-selections < pkglist
     debconf-set-selections < pkgconfig
     apt-get dselect-upgrade
    

    If you do this, then you only need to back up /etc, /home, /usr/local, /var.

  • To save the contents of a MySQL database:

     mysqldump name-of-database | gzip > name-of-database.dump.gz
    

    To restore:

     zcat name-of-database.dump.gz | mysql name-of-database
    

You can schedule these dumps to be made one hour before the time you make backups.
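
For example, with a crontab -e entry like this (a sketch: it assumes backups run at 3:00, so the dump is made at 2:00):

    0 2 * * * mysqldump students | gzip > /var/backups/students.dump.gz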

Scheduling tasks

As a user:

crontab -e

As root: add a file in one of the /etc/cron.* directories.

In cron.{hourly,daily,weekly,monthly} you put scripts.

In the other directories you put crontab files (man 5 crontab).

If the system is turned off during normal maintenance hours, you can do two things:

  1. Change /etc/crontab to use different maintenance hours
  2. Install anacron (it's installed by default in ubuntu)

For scheduling one-shot tasks, use at(1):

$ at 17:40
echo "Please tell Enrico that the lesson is finished" | mail efossnet@dream.edu.et
^D

When and how to automate

  1. First, you manage to do it yourself
  2. Then, you document it
  3. Then, you automate it

Start at step 1 and go to 2 or 3 if/when you actually need it.

(credits to sto@debian.org: he's the one from whom I heard it for the first time, said so well).

Interesting programs to schedule during maintenance

  • rkhunter, chkrootkit
  • checksecurity
  • debsecan
  • tiger

Important keys to know in a Unix terminal

These are special keys that work on Unix terminals:

  • ^C: interrupt (sends SIGINT)
  • ^\: interrupt (sends SIGQUIT)
  • ^D: end of input
  • ^S: stop scrolling
  • ^Q: resume scrolling

Therefore, if the terminal looks like it got stuck, try hitting ^Q.

Problems we had today with postfix

  • Problem: mail to efossnet@dream.edu.et is accepted only if sent locally.

    Reason:

     $ host -t mx dream.edu.et
     Host dream.edu.et not found: 3(NXDOMAIN)
    

    Solution: tell dnsmasq to handle a MX record also for dream.edu.et:

    mx-host=dream.edu.et,smtp.dream.edu.et,50

  • The problem was not solved by the previous solution.

    Reason: postfix was making complaints which mentioned localhost as a domain name.

    Solution: fixed by changing 'myhostname' in main.cf to something different from localhost.

    Note: solved by luck. Investigate why this happened.

Problems found yesterday and today

  • there is no way to tell squid to use another proxy for SSL connections: it only does them directly
  • if you want to configure evolution to get mail from /var/mail/user, you need to explicitly enter the path. It would be trivially easier if evolution presented a good default, since it's easy to compute. It would also be useful if below the "Path" entry there were some text telling what path is being requested: the mail spool? the evolution mail storage?
  • In Evolution: IMAP or IMAPv4r1? What is the difference? Why should I care?
  • apt-get --print-uris doesn't print the URIs if the package is in the local cache, and there seems to be no way to have it do it.
  • in /etc/apache2/sites-available/default, is the NameVirtualHost * directive appropriate there? It gets in the way when using 'default' as a template for new sites.

    Otherwise, one can add a new (disabled) site that can be used as a template for new sites instead of default.

  • the default comments put by crontab -e are not that easy to read.

Posted Sat Jun 6 00:57:39 2009

Third day in Addis

Believe it or not, a network that fails often is the best thing to have when you are teaching network troubleshooting.

Various tools useful for networking:

  • ifconfig - configure a network interface
  • dnsmasq - Simple DNS and DHCP server
  • host - DNS lookup utility
  • route - show / manipulate the IP routing table
  • arping - send ARP REQUEST to a neighbour host
  • mii-tool - view, manipulate media-independent interface status (IOW, see if the cable works)
  • nmap - Network exploration tool and security / port scanner

    Examples:

     # Look at what machines are active in the local network:
     nmap -sP 10.5.15.0/24
    
     # Look at what ports are open in a machine:
     nmap 10.5.15.26
    
  • tcpdump - dump traffic on a network

    It can be used to see if there is traffic, and to detect traffic that shouldn't be there.
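
    For instance (the interface name is an assumption):

     # Watch DNS traffic going through eth0
     sudo tcpdump -i eth0 -n port 53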

Useful tip:

    # Convert a unix timestamp to a readable date
    date -d @1152841341

What happens when you browse a web page:

  1. type the address www.google.com in the browser
  2. the browser needs the IP address of the web server:

     1. look for the DNS address in /etc/resolv.conf (/etc/resolv.conf is created automatically by the DHCP client)
     2. try all the DNS servers in /etc/resolv.conf until one gives you the IP address of www.google.com
     3. take the first address that comes from the DNS (in our case it was 64.233.167.104)

  3. figure out how to connect to 64.233.167.104:

     1. consult the routing table to see if it's in the local network:

        1. if it's in the local network, then look for the MAC address (using ARP - the Address Resolution Protocol)
        2. if it's not in the local network, then send through the gateway (again using ARP to find the MAC address of the gateway)

  4. send out the HTTP request to the local web server or through the gateway, using the Ethernet physical protocol, and the MAC address to refer to the other machine.

Troubleshooting network problems:

  1. See if the network driver works: with ifconfig, see if you see the HWaddr:. If you do not see it, then the linux driver for the network card is not working. Unfortunately there's no exact way to tell that it works perfectly.

  2. See if you have an IP address with ifconfig. If you find out that you need to rerun DHCP (for example, if the network cable was disconnected when the system started), then you can do it either by deactivating/reactivating the Ethernet interface using System/Administration/Networking or, on a terminal, running:

    # ifdown eth0
    # ifup eth0

    If you don't get an IP, try to see if the DHCP server is reachable by running:

    $ arping -D [address of DHCP server]

  3. See if the local physical network works:

     1. With sudo mii-tool, see if the cable link is ok. If it's not, then it's a problem in the cable or the plugs, or simply the device at the other end of the cable is turned off.
     2. Try arping or ping -n on a machine in the local network (like the gateway) to see if the local network works.

  4. See if the DNS works:

     1. Find out the DNS address:

        cat /etc/resolv.conf

     2. If it's local, arping it.
     3. If it's not local, ping -n it.
     4. Try to resolve a famous name using that DNS:

        $ host [name] [IP address of the DNS]

     5. Try to resolve the name of the machine you're trying to connect to. If you can resolve a famous name but not the name you need, then it's likely a problem with their DNS.

  5. If you use a proxy, see if the proxy is reachable: check if the proxy name resolves to an IP, if you can ping it, if you can telnet to the proxy address and port:

    $ telnet [proxy address] [proxy port]

    You quit telnet with ^] followed by quit.

  6. If you can connect directly to the web server, try to see if it answers:

    $ telnet [address] 80

    If you are connected, you can confirm that it's a web server:

    GET / HTTP/1.0 (then Enter twice)

    If it's a web server, it should give you something like a webpage or an HTTP redirect.

When you try to set up a service and it doesn't work:

  1. check that it's running:

    $ ps aux | grep dnsmasq
    
  2. check that it's listening on the right port:

    $ sudo netstat -lp
    
  3. check that it's listening from the outside:

    $ nmap [hostname]
    
  4. check for messages in /var/log/daemon.log or /var/log/syslog

  5. check that the configuration is correct and reload or restart the server to make sure it's running with the right configuration:

    # /etc/init.d/dnsmasq restart
    

dnsmasq:

By default: works as a DNS server that serves the data in /etc/hosts.

By default: uses /etc/resolv.conf to find addresses of other DNS to use when a name is not found in /etc/hosts.

To enable the DHCP server, uncomment:

    dhcp-range=192.168.0.50,192.168.0.150,12h

in /etc/dnsmasq.conf and set it to the range of addresses you want to serve. Pay attention to never put two DHCP servers on the same local network, or they will interfere with each other.

To test if the DHCP server is working, use dhcping (not installed by default on Ubuntu).

To communicate other information like DNS, gateway and netmask to the clients, use this piece of dnsmasq.conf:

    # For reference, the common options are:
    # subnet mask - 1
    # default router - 3
    # DNS server - 6
    # broadcast address - 28
    dhcp-option=1,255.255.255.0
    dhcp-option=3,192.168.0.1
    dhcp-option=6,192.168.0.1
    dhcp-option=28,192.168.0.255

Problems found today:

  • changing the name of the local machine in /etc/hosts breaks sudo, and without sudo it's impossible to edit the file. The only way to fix this is a reboot in recovery mode.

  • dhclient -n -w is different from dhclient -nw

Quick start examples with tar:

    # Create an archive
    tar zcvf nmap.tar.gz *.deb

    # Extract an archive
    tar zxvf nmap.tar.gz

    # Look at the contents of an archive
    tar ztvf nmap.tar.gz

Quick & dirty way to send a file between two computers without web server, e-mail, shared disk space or any other infrastructure:

    # To send
    nc -l -p 12345 -q 1 < nmap.tar.gz

    # To receive
    nc 10.5.15.123 12345 > nmap.tar.gz

    # To repeat the send command 20 times
    for i in `seq 1 20`; do nc -l -p 12345 -q 1 < nmap.tar.gz ; done

Update: Javier Fernandez-Sanguino writes:

Your "XXX day in Addis" is certainly good reading, nice to see somebody reviewing common tools from a novice point of view. Some comments:

  • Regarding your comments on how to troubleshoot network connectivity problems I just wanted to point you to the network test script I wrote and submitted to the debian-goodies package ages ago. It's available at http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=307694 and should do automatically most of the stuff you commented on your blog.

  • Your example to test hosts alive in the network using nmap -sP 10.5.15.0/24 is good. However, newer (v4) versions can do ARP ping in the local network which is much more efficient (some systems might block ICMP outbound), that's the -PR option and should be enabled (by default). See http://www.insecure.org/nmap/man/man-host-discovery.html Also, you might want to add a '-n' there so that nmap does not try to do DNS resolution of the hosts (which might take up some time if your DNS does not include local IPs)

  • Regarding tcpdump: it would be wiser to point novice users to ethereal, since it has a much better UI than tcpdump and it is able to dissect (interpret) protocols that tcpdump can't analyse.

  • you are missing arp as a tool in itself, it is useful to debug network issues since if the host is local and does not show up in arp output either a) it's down or b) you don't have proper network connectivity. (If you are missing an ARP entry for your default gateway your setup is broken)

Update: Marius Gedminas writes:

Re: http://www.enricozini.org/blog/eng/third-day-in-addis

In my experience if sudo cannot resolve the hostname (e.g. if you break /etc/hosts), you can still use sudo, but you have to wait something like 30 seconds until the DNS request times out.

I tried to break my /etc/hosts (while keeping a root shell so I can fix it if something goes wrong), but couldn't even get the timeout now. Sudo just said unable to lookup $hostname via gethostbyname() and gave me a root shell.

Posted Sat Jun 6 00:57:39 2009

Eighth day in Addis

Useful things to keep in mind when setting up a service:

  • always take note of what you do
  • make yourself always able to explain to another person what you did
  • keep a copy of the configuration files before changing them, so that you can see what you changed
  • be always able to move the service to another computer
  • make sure that it works after reboot

Example use of vim block selection:

  • ESC: exits insert mode.
  • ^V: starts block selection. Move the arrows to form a rectangle.
  • c: change. Type the new content for the line.
  • ESC: gets out of insert mode, and the change will happen in all the lines.

To change network configuration with config files, edit:

/etc/network/interfaces

To also setup DNS in /etc/network/interfaces, use dns-search and dns-nameservers (for this to work, you need to have the package resolvconf):

dns-search dream.edu.et
dns-nameservers 192.168.0.1 192.168.0.2

To make a router that connects to the internet on demand using a modem:

apt-get install diald

To see the path of network packets:

mtr 4.2.2.2

Basic NAT script:

#!/bin/sh
# Interfaces: $OUT faces the internet, $IN the local network
OUT=eth2
IN=eth0

modprobe iptable_nat
# Masquerade everything going out of the external interface
iptables -t nat -A POSTROUTING -o $OUT -j MASQUERADE
# Enable packet forwarding between interfaces
echo 1 > /proc/sys/net/ipv4/ip_forward

What happens at system startup:

  1. the BIOS loads and runs the boot loader
  2. the boot loader loads the kernel and the initrd ramdisk and runs the kernel
  3. the kernel runs the script 'init' in the initrd ramdisk
  4. the script 'init' mounts the root directory
  5. the script 'init' runs the command /sbin/init in the new root directory
  6. 'init' starts the system with the configuration in /etc/inittab

To install a new startup script:

sudo ln -s /usr/local/sbin/firewall /etc/init.d
sudo update-rc.d firewall defaults 16 75

Normally you can just do:

sudo update-rc.d [servicename] defaults

To have a look at the start and stop order numbers, look at /etc/rc2.d for other start scripts and /etc/rc0.d for other stop scripts

To test a proxy, low level way:

$ telnet proxy 8080
Trying 192.168.0.6...
Connected to proxy.dream.edu.et.
Escape character is '^]'.
GET http://www.google.com HTTP/1.0 [press enter twice]
Posted Sat Jun 6 00:57:39 2009

First day in Addis

First day in Addis Ababa, after the introductory session for this 10-day Linux training.

Interesting new quotes I picked up from the excellent presentation of Dr. Dawit:

Much that I bound I could not free
Much that I freed returned to me

(I didn't manage to transcribe the attribution)

And this one for Bubulle, about translation:

When you speak to me in my language
you speak to my heart
when you speak to me in English
you speak to my head

(sb.)

Incomplete list of questions I've been asked, in bogosort -n order:

  • How do I get support?
  • Are the configuration files always the same accross different distributions?
  • What is the level of interoperability between the various Linux distributions? And between different Unix-like systems?
  • Does plug and play work well when I change hardware?
  • Can I access NTFS partitions?
  • How do I play multimedia files in restricted formats?
  • I heard that NFS has security problems: can it be secured, or are there other file sharing alternatives?
  • Can I access a desktop remotely?
  • Can I install Linux on a computer where there's Windows already? Do I need to partition?
  • Can I be sure to find drivers for my hardware?

I'm happy to find that we've been successful in building more and more good answers for these questions.

Posted Sat Jun 6 00:57:39 2009

First practical lesson

Notes after today's training session.

Small index of most used shell commands:

  • ls - list directory contents
  • cp - copy files and directories
  • mv - move (rename) files
  • rm - remove files or directories
  • find - search for files in a directory hierarchy
  • cat - concatenate files and print on the standard output
  • more - file perusal filter for crt viewing
  • less - opposite of more (quit with 'q')
  • cd - Change the current directory to DIR. (use "help cd" instead of "man cd")
  • mkdir - make directories
  • rmdir - remove empty directories

Small index of commands useful for combining in pipelines:

  • grep, egrep, fgrep, rgrep - print lines matching a pattern
  • tail - output the last part of files
  • head - output the first part of files
  • sort - sort lines of text files
  • uniq - report or omit repeated lines
  • sed - stream editor
  • wc - print the number of newlines, words, and bytes in files
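
For example, counting which login shells are used in /etc/passwd (a toy pipeline; cut is another handy command for this):

    cut -d: -f7 /etc/passwd | sort | uniq -c | sort -rn | head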

Problems found during the lesson:

  • You set the system default locale to Amharic, and the gdm login will be in Amharic input mode. We didn't find out how to switch it back to input roman characters. Right click on the input field to set the input method doesn't work. Since usernames are not in Amharic, you're locked out.
  • So you CTRL+ALT+F1, login and try dpkg-reconfigure locales. On Ubuntu Dapper, it does not work anymore.
  • So you dig and dig and dig and finally find that you can force a locale in /etc/default/gdm (but not in /etc/gdm/locale.conf, nor in /etc/gdm/gdm.conf).
  • Then the internet works for a bit and you look up how to reconfigure locales in Ubuntu. Turns out you have to use localeconf, which is not installed by default, is not in universe and thus not on the CDs, and needs to be downloaded from the Internet.
  • The Ubuntu wiki is all on https, which defeats any attempt of proxy caching.
  • An Internet proxy needs to be configured 3 times: in Gnome, in Firefox and in Synaptic (well, apt). This is especially tricky when you forgot to setup the proxy in Synaptic and seemingly unrelated applications fail, like the Ubuntu language selector, which internally invokes the package manager to download missing langpacks.
  • Some short descriptions in the NAME section of manpages are hard to understand, or wrong. Noted on apt-get, apt-cache and less. Top prize goes to apt-cache:

     NAME
            apt-cache - APT package handling utility -- cache manipulator
     DESCRIPTION
            [...] apt-cache does not manipulate the state of the system but
            does provide operations to search and generate interesting output
            from the package metadata. [...]
    

    So apt-cache is a manipulator that doesn't manipulate. A possible improvement could be "query the APT package cache".

  • The language selector in Ubuntu Breezy doesn't really exit and keeps the package database locked. This seems to be fixed in Dapper, and probably had been fixed in some Breezy update. System updates here are a problem: my Dapper (with some Universe things in it) wanted to download more than 120Mb of data, and the Uni network was giving me 14Kbps. It's been a nice opportunity to teach about fuser -uva and kill.
  • dict, squid and many other packages from 'main' are not on the normal Ubuntu CDs: is there an easy way to build a CD with them? Or do Ubuntu CDs with extra packages already exist? I'll have to find out.
  • cupsys has documentation outside of /usr/share/doc, in /usr/share/cups/doc-root.
  • man works on all commands, except cd, which is an internal shell command and thus needs help instead of man. I should remember to ponder about autogenerating manpages from help output.
  • Is there an index-like manpage with a list of the core Unix commands and their short descriptions? If there's not, it's easy to generate:

     #!/bin/sh
     DIR=${1:-"/bin"}
     (
     find $DIR | while read FILE
     do
         if [ -x $FILE ] && ! [ -d $FILE ]
         then
             LANG=C COLUMNS=2000 man `basename $FILE` | \
                      grep ^SYNOPSIS -B 100 | grep ^NAME -A 100 | \
                      tail -n +2 | head -n 2 | \
                      grep -v '^[ \t]*$'
         fi
     done
     ) | sort | uniq | sed 's/^ \+//'
    

    Try running it on /bin and /sbin: it's great! Also, since it doesn't redirect stderr, it nicely exposes a number of manpage problems.

Lots of bugs to report when I come home: from here it'll take ages, and lots of money on the hotel internet connection, and some are Ubuntu-specific so I'd need to do everything online with Malone.

As usual, teaching is one of the best ways to find bugs.

I propose an Etch training session a month before release.

Other things to do:

  • Find more info about that live CD with Wikipedia browsable without the Internet.
  • Make a collection of Free technical E-books: even those Indian low-cost book editions are too expensive here, so E-books mean a lot.

Update: Matt Zimmerman writes:

I read your blog entry at http://www.enricozini.org/blog/eng/second-day-in-addis and wanted to respond as follows:

  • localeconf is not the standard way to configure locales in Ubuntu; what documentation told you that? It's an unsupported package from Progeny. If what you wanted was to set the system default locale from the command line, editing /etc/environment is probably the best way.

  • I suggest filing a bug report at <https://launchpad.net/products/ubuntu-website> about the HTTPS issue; I don't think it's necessary for the entire wiki to be HTTPS, only authentication.

  • Synaptic may be able to use the GNOME proxy settings without introducing undesirable dependencies; please file a wishlist bug

  • dict, squid and other packages from main are not on the Ubuntu CDs because there is no space. The DVD contains these packages.

  • The cupsys documentation bug was quite likely inherited from Debian and should be reported there

  • You can file bugs in Malone via email; this has been possible for a long time now. Please don't reinforce this misconception.

    https://help.launchpad.net/UsingMaloneEmail

Posted Sat Jun 6 00:57:39 2009

Posts containing useful tips.

Resolving IP addresses in vim

A friend on IRC said: "I wish vim had a command to resolve all the IP addresses in a block of text".

But it does:

:<block>!perl -MSocket -pe 's/(\d+\.\d+\.\d+\.\d+)/gethostbyaddr(inet_aton($1), AF_INET)/ge'

If you use it often, put the perl command in a one-liner script and call it from an editor macro. It works on other editors, too, and even without an editor at all. And it can be scripted!

We live with the power of Unix every day, so much that we risk forgetting how awesome it is.

Posted Wed Mar 7 14:07:07 2012

SQLAlchemy, MySQL and sql_mode=traditional

As everyone should know, by default MySQL is an embarrassingly stupid toy:

mysql> create table foo (val integer not null);
Query OK, 0 rows affected (0.03 sec)

mysql> insert into foo values (1/0);
ERROR 1048 (23000): Column 'val' cannot be null

mysql> insert into foo values (1);
Query OK, 1 row affected (0.00 sec)

mysql> update foo set val=1/0 where val=1;
Query OK, 1 row affected, 1 warning (0.00 sec)
Rows matched: 1  Changed: 1  Warnings: 1

mysql> select * from foo;
+-----+
| val |
+-----+
|   0 |
+-----+
1 row in set (0.00 sec)

Luckily, you can tell it to stop being embarrassingly stupid:

mysql> set sql_mode="traditional";
Query OK, 0 rows affected (0.00 sec)

mysql> update foo set val=1/0 where val=0;
ERROR 1365 (22012): Division by 0

(There is an even better sql mode you can choose, though: it is called "Install PostgreSQL")

Unfortunately, I've been hired to work on a project that relies on the embarrassingly stupid behaviour of MySQL, so I cannot set sql_mode=traditional globally or the existing house of cards will collapse.

Here is how you set it session-wide with SQLAlchemy 0.6.x: it took me quite a while to find out:

import sqlalchemy.interfaces

# Without this, MySQL will silently insert invalid values in the
# database, causing very long debugging sessions in the long run
class DontBeSilly(sqlalchemy.interfaces.PoolListener):
    def connect(self, dbapi_con, connection_record):
        cur = dbapi_con.cursor()
        cur.execute("SET SESSION sql_mode='TRADITIONAL'")
        cur = None
engine = create_engine(..., listeners=[DontBeSilly()])

Why it takes all that effort is beyond me. I'd have expected this to be turned on by default, possibly with a switch that insane people could use to turn it off.

Posted Mon Feb 27 19:45:58 2012

Those friendly spammers at Aruba

Aruba decided, out of the blue, to subscribe me to all of their newsletters.

The newsletters have no unsubscribe link. Or rather, maybe they do, but it would only show up after decoding the mail with programs that I have no intention of using. Unsubscribe links aside, why should I have to unsubscribe from newsletters I never subscribed to in the first place?

I sent this mail to abuse@staff.aruba.it, plus 3 more reports after this one, all of which were of course ignored:

Good morning,

I am reporting this spam sent by you yourselves (the mail with intact
headers is attached).

Could you please proceed with disciplinary measures against yourselves?
Your behaviour on the internet violates the most basic rules of
netiquette, and it is in your own interest, as a provider, to educate
yourselves about them and to enforce them on yourselves.


Best regards,

Enrico

What can I say: a third-world country deserves third-world ISPs.

It is still an excellent excuse to study postfix's header_checks: now the mails from Aruba's newsletters, which incidentally are 300Kb lumps each, meet a REJECT right in the SMTP session:

550 5.7.1 Criminal third-world ISP spammers not accepted here.

To do this, I added to /etc/postfix/main.cf:

# Reject aruba spam right away
header_checks = pcre:/etc/postfix/known_idiots.pcre

Then I created the file /etc/postfix/known_idiots.pcre:

/^Received:.+smtpnewsletter[0-9]+.aruba.it/ REJECT Criminal third-world ISP spammers not accepted here.
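
After changing main.cf, postfix needs a reload to pick up the new header_checks (pcre tables do not need postmap):

    sudo /etc/init.d/postfix reload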

In the meantime I sent an email to the Garante Privacy and one to AGCOM, more out of curiosity than anything else. I do not expect any answer, but if anything happens I will gladly add it here.

Posted Fri Oct 14 16:14:54 2011

Award winning code

Me and Yuwei had a fun day at hhhmcr (#hhhmcr) and even managed to put together a prototype that won the first prize \o/

We played with the gmp24 dataset kindly extracted from Twitter by Michael Brunton-Spall of the Guardian into a convenient JSON dataset. The idea was to find ways of making it easier to look at the data and making sense of it.

This is the story of what we did, including the code we wrote.

The original dataset has several JSON files, so the first task was to put them all together:

#!/usr/bin/python

# Merge the JSON data
# (C) 2010 Enrico Zini <enrico@enricozini.org>
# License: WTFPL version 2 (http://sam.zoy.org/wtfpl/)

import simplejson
import os

res = []
for f in os.listdir("."):
    if not f.startswith("gmp24"): continue
    data = open(f).read().strip()
    if data == "[]": continue
    parsed = simplejson.loads(data)
    res.extend(parsed)

print simplejson.dumps(res)

The results however were not ordered by date, as GMP had to use several accounts to tweet, because Twitter was putting Greater Manchester Police into jail for generating too much traffic. There would be quite a bit to write about that, but let's stick to our work.

Here is code to sort the JSON data by time:

#!/usr/bin/python

# Sort the JSON data
# (C) 2010 Enrico Zini <enrico@enricozini.org>
# License: WTFPL version 2 (http://sam.zoy.org/wtfpl/)

import simplejson
import sys
import datetime as dt

all_recs = simplejson.load(sys.stdin)
all_recs.sort(key=lambda x: dt.datetime.strptime(x["created_at"], "%a %b %d %H:%M:%S +0000 %Y"))

simplejson.dump(all_recs, sys.stdout)

I then wanted to play with Tf-idf for extracting the most important words of every tweet:

#!/usr/bin/python

# tfifd - Annotate JSON elements with Tf-idf extracted keywords
#
# Copyright (C) 2010  Enrico Zini <enrico@enricozini.org>
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program.  If not, see <http://www.gnu.org/licenses/>.

import sys, math
import simplejson
import re

# Read all the twits
records = simplejson.load(sys.stdin)

# All the twits by ID
byid = dict(((x["id"], x) for x in records))

# Stopwords we ignore
stopwords = set(["by", "it", "and", "of", "in", "a", "to"])

# Tokenising engine
re_num = re.compile(r"^\d+$")
re_word = re.compile(r"(\w+)")
def tokenise(tweet):
    "Extract tokens from a tweet"
    for tok in tweet["text"].split():
        tok = tok.strip().lower()
        if re_num.match(tok): continue
        mo = re_word.match(tok)
        if not mo: continue
        if mo.group(1) in stopwords: continue
        yield mo.group(1)

# Extract tokens from tweets
tokenised = dict(((x["id"], list(tokenise(x))) for x in records))

# Aggregate token counts
aggregated = {}
for d in byid.iterkeys():
    for t in tokenised[d]:
        if t in aggregated:
            aggregated[t] += 1
        else:
            aggregated[t] = 1

def tfidf(doc, tok):
    "Compute TFIDF score of a token in a document"
    return doc.count(tok) * math.log(float(len(byid)) / aggregated[tok])

# Annotate tweets with keywords
res = []
for name, tweet in byid.iteritems():
    doc = tokenised[name]
    keywords = sorted(set(doc), key=lambda tok: tfidf(doc, tok), reverse=True)[:5]
    tweet["keywords"] = keywords
    res.append(tweet)

simplejson.dump(res, sys.stdout)

I thought this was producing a nice summary of every tweet, but nobody was particularly interested, so we moved on to adding categories to tweets.

Thanks to Yuwei who put together some useful keyword sets, we managed to annotate each tweet with a place name (i.e. "Stockport"), a social place name (i.e. "pub", "bank") and a social category (i.e. "man", "woman", "landlord"...)

The code is simple; the biggest work in it was the dictionary of keywords:

#!/usr/bin/python

# categorise - Annotate JSON elements with categories
#
# Copyright (C) 2010  Enrico Zini <enrico@enricozini.org>
# Copyright (C) 2010  Yuwei Lin <yuwei@ylin.org>
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program.  If not, see <http://www.gnu.org/licenses/>.

import sys, math
import simplejson
import re

# Electoral wards from http://en.wikipedia.org/wiki/List_of_electoral_wards_in_Greater_Manchester
placenames = ["Altrincham", "Sale West",
"Altrincham", "Ashton upon Mersey", "Bowdon", "Broadheath", "Hale Barns", "Hale Central", "St Mary", "Timperley", "Village",
"Ashton-under-Lyne",
"Ashton Hurst", "Ashton St Michael", "Ashton Waterloo", "Droylsden East", "Droylsden West", "Failsworth East", "Failsworth West", "St Peter",
"Blackley", "Broughton",
"Broughton", "Charlestown", "Cheetham", "Crumpsall", "Harpurhey", "Higher Blackley", "Kersal",
"Bolton North East",
"Astley Bridge", "Bradshaw", "Breightmet", "Bromley Cross", "Crompton", "Halliwell", "Tonge with the Haulgh",
"Bolton South East",
"Farnworth", "Great Lever", "Harper Green", "Hulton", "Kearsley", "Little Lever", "Darcy Lever", "Rumworth",
"Bolton West",
"Atherton", "Heaton", "Lostock", "Horwich", "Blackrod", "Horwich North East", "Smithills", "Westhoughton North", "Chew Moor", "Westhoughton South",
"Bury North",
"Church", "East", "Elton", "Moorside", "North Manor", "Ramsbottom", "Redvales", "Tottington",
"Bury South",
"Besses", "Holyrood", "Pilkington Park", "Radcliffe East", "Radcliffe North", "Radcliffe West", "St Mary", "Sedgley", "Unsworth",
"Cheadle",
"Bramhall North", "Bramhall South", "Cheadle", "Gatley", "Cheadle Hulme North", "Cheadle Hulme South", "Heald Green", "Stepping Hill",
"Denton", "Reddish",
"Audenshaw", "Denton North East", "Denton South", "Denton West", "Dukinfield", "Reddish North", "Reddish South",
"Hazel Grove",
"Bredbury", "Woodley", "Bredbury Green", "Romiley", "Hazel Grove", "Marple North", "Marple South", "Offerton",
"Heywood", "Middleton",
"Bamford", "Castleton", "East Middleton", "Hopwood Hall", "Norden", "North Heywood", "North Middleton", "South Middleton", "West Heywood", "West Middleton",
"Leigh",
"Astley Mosley Common", "Atherleigh", "Golborne", "Lowton West", "Leigh East", "Leigh South", "Leigh West", "Lowton East", "Tyldesley",
"Makerfield",
"Abram", "Ashton", "Bryn", "Hindley", "Hindley Green", "Orrell", "Winstanley", "Worsley Mesnes",
"Manchester Central",
"Ancoats", "Clayton", "Ardwick", "Bradford", "City Centre", "Hulme", "Miles Platting", "Newton Heath", "Moss Side", "Moston",
"Manchester", "Gorton",
"Fallowfield", "Gorton North", "Gorton South", "Levenshulme", "Longsight", "Rusholme", "Whalley Range",
"Manchester", "Withington",
"Burnage", "Chorlton", "Chorlton Park", "Didsbury East", "Didsbury West", "Old Moat", "Withington",
"Oldham East", "Saddleworth",
"Alexandra", "Crompton", "Saddleworth North", "Saddleworth South", "Saddleworth West", "Lees", "St James", "St Mary", "Shaw", "Waterhead",
"Oldham West", "Royton",
"Chadderton Central", "Chadderton North", "Chadderton South", "Coldhurst", "Hollinwood", "Medlock Vale", "Royton North", "Royton South", "Werneth",
"Rochdale",
"Balderstone", "Kirkholt", "Central Rochdale", "Healey", "Kingsway", "Littleborough Lakeside", "Milkstone", "Deeplish", "Milnrow", "Newhey", "Smallbridge", "Firgrove", "Spotland", "Falinge", "Wardle", "West Littleborough",
"Salford", "Eccles",
"Claremont", "Eccles", "Irwell Riverside", "Langworthy", "Ordsall", "Pendlebury", "Swinton North", "Swinton South", "Weaste", "Seedley",
"Stalybridge", "Hyde",
"Dukinfield Stalybridge", "Hyde Godley", "Hyde Newton", "Hyde Werneth", "Longdendale", "Mossley", "Stalybridge North", "Stalybridge South",
"Stockport",
"Brinnington", "Central", "Davenport", "Cale Green", "Edgeley", "Cheadle Heath", "Heatons North", "Heatons South", "Manor",
"Stretford", "Urmston",
"Bucklow-St Martins", "Clifford", "Davyhulme East", "Davyhulme West", "Flixton", "Gorse Hill", "Longford", "Stretford", "Urmston",
"Wigan",
"Aspull New Springs Whelley", "Douglas", "Ince", "Pemberton", "Shevington with Lower Ground", "Standish with Langtree", "Wigan Central", "Wigan West",
"Worsley", "Eccles South",
"Barton", "Boothstown", "Ellenbrook", "Cadishead", "Irlam", "Little Hulton", "Walkden North", "Walkden South", "Winton", "Worsley",
"Wythenshawe", "Sale East",
"Baguley", "Brooklands", "Northenden", "Priory", "Sale Moor", "Sharston", "Woodhouse Park"]

# Manual coding from Yuwei
placenames.extend(["City centre", "Tameside", "Oldham", "Bury", "Bolton",
"Trafford", "Pendleton", "New Moston", "Denton", "Eccles", "Leigh", "Benchill",
"Prestwich", "Sale", "Kearsley", ])
placenames.extend(["Trafford", "Bolton", "Stockport", "Levenshulme", "Gorton",
"Tameside", "Blackley", "City centre", "Airport", "South Manchester",
"Rochdale", "Chorlton", "Uppermill", "Castleton", "Stalybridge", "Ashton",
"Chadderton", "Bury", "Ancoats", "Whalley Range", "West Yorkshire",
"Fallowfield", "New Moston", "Denton", "Stretford", "Eccles", "Pendleton",
"Leigh", "Altrincham", "Sale", "Prestwich", "Kearsley", "Hulme", "Withington",
"Moss Side", "Milnrow", "outskirt of Manchester City Centre", "Newton Heath",
"Wythenshawe", "Mancunian Way", "M60", "A6", "Droylesden", "M56", "Timperley",
"Higher Ince", "Clayton", "Higher Blackley", "Lowton", "Droylsden",
"Partington", "Cheetham Hill", "Benchill", "Longsight", "Didsbury",
"Westhoughton"])


# Social categories from Yuwei
soccat = ["man", "woman", "men", "women", "youth", "teenager", "elderly",
"patient", "taxi driver", "neighbour", "male", "tenant", "landlord", "child",
"children", "immigrant", "female", "workmen", "boy", "girl", "foster parents",
"next of kin"]
for i in range(100):
    soccat.append("%d-year-old" % i)
    soccat.append("%d-years-old" % i)

# Types of social locations from Yuwei
socloc = ["car park", "park", "pub", "club", "shop", "premises", "bus stop",
"property", "credit card", "supermarket", "garden", "phone box", "theatre",
"toilet", "building site", "Crown court", "hard shoulder", "telephone kiosk",
"hotel", "restaurant", "cafe", "petrol station", "bank", "school",
"university"]


extras = { "placename": placenames, "soccat": soccat, "socloc": socloc }

# Normalise keyword lists
for k, v in extras.iteritems():
    # Remove duplicates
    v = list(set(v))
    # Sort by length, longest first, so that longer keywords are matched
    # before their substrings
    v.sort(key=lambda x:len(x), reverse=True)
    # Store the result back: v is a new list at this point, so sorting it
    # in place would not update extras by itself
    extras[k] = v

# Add keywords
def add_categories(tweet):
    text = tweet["text"].lower()
    for field, categories in extras.iteritems():
        for cat in categories:
            if cat.lower() in text:
                tweet[field] = cat
                break
    return tweet

# Read all the tweets, adding categories to each one
records = (add_categories(x) for x in simplejson.load(sys.stdin))

simplejson.dump(list(records), sys.stdout)
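
As an aside, add_categories can be exercised on its own with an illustrative record (not from the real dataset); matching is a case-insensitive substring search, trying longer keywords first:

tweet = add_categories({"text": "Man assaulted outside pub in Wigan"})
# tweet now also contains:
#   placename="Wigan", soccat="man", socloc="pub"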

All these scripts form a nice processing chain: each script takes a list of JSON records on standard input, adds some bits to each record and passes the result on.
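
For example, a hypothetical run of the whole chain (the script names are made up for illustration):

$ ./fetch-tweets | ./add-keywords | ./add-categories > tweets.json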

In order to see what we have so far, here is a simple script to convert the JSON tweets to CSV so they can be viewed in a spreadsheet:

#!/usr/bin/python

# Convert the JSON tweets to CSV
# (C) 2010 Enrico Zini <enrico@enricozini.org>
# License: WTFPL version 2 (http://sam.zoy.org/wtfpl/)

import simplejson
import sys
import csv

fields = ["id", "created_at", "text", "keywords", "placename"]

writer = csv.writer(sys.stdout)
writer.writerow(fields)
for rec in simplejson.load(sys.stdin):
    rec["keywords"] = " ".join(rec["keywords"])
    rec["placename"] = rec.get("placename", "")
    # Encode as UTF-8: the csv module in Python 2 cannot write unicode
    writer.writerow([unicode(rec[field]).encode("utf-8") for field in fields])

At this point we were coming up with lots of questions: "were there more reports on women or men?", "which place had most incidents?", "what were the incidents involving animals?"... Time to bring Xapian into play.

This script reads all the JSON tweets and builds a Xapian index with them:

#!/usr/bin/python

# toxapian - Index JSON tweets in Xapian
#
# Copyright (C) 2010  Enrico Zini <enrico@enricozini.org>
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program.  If not, see <http://www.gnu.org/licenses/>.

import simplejson
import sys
import os, os.path
import xapian

DBNAME = sys.argv[1]

db = xapian.WritableDatabase(DBNAME, xapian.DB_CREATE_OR_OPEN)

stemmer = xapian.Stem("english")
indexer = xapian.TermGenerator()
indexer.set_stemmer(stemmer)
indexer.set_database(db)

data = simplejson.load(sys.stdin)
for rec in data:
    doc = xapian.Document()
    doc.set_data(str(rec["id"]))

    indexer.set_document(doc)
    indexer.index_text_without_positions(rec["text"])

    # Index categories as categories
    if "placename" in rec:
        doc.add_boolean_term("XP" + rec["placename"].lower())
    if "soccat" in rec:
        doc.add_boolean_term("XS" + rec["soccat"].lower())
    if "socloc" in rec:
        doc.add_boolean_term("XL" + rec["socloc"].lower())

    db.add_document(doc)

db.flush()

# Also save the whole dataset so we know where to find it later if we want to
# show the details of an entry
simplejson.dump(data, open(os.path.join(DBNAME, "all.json"), "w"))
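
To build the index, pass the tool a database directory name and feed it the JSON tweets on standard input (file names here are illustrative):

$ ./toxapian gmp24.db < tweets.json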

And this is a simple command line tool to query the database:

#!/usr/bin/python

# xgrep - Command line tool to query the GMP24 tweet Xapian database
#
# Copyright (C) 2010  Enrico Zini <enrico@enricozini.org>
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program.  If not, see <http://www.gnu.org/licenses/>.

import simplejson
import sys
import os, os.path
import xapian

DBNAME = sys.argv[1]

db = xapian.Database(DBNAME)

stem = xapian.Stem("english")

qp = xapian.QueryParser()
qp.set_default_op(xapian.Query.OP_AND)
qp.set_database(db)
qp.set_stemmer(stem)
qp.set_stemming_strategy(xapian.QueryParser.STEM_SOME)
qp.add_boolean_prefix("place", "XP")
qp.add_boolean_prefix("soc", "XS")
qp.add_boolean_prefix("loc", "XL")

query = qp.parse_query(sys.argv[2],
    xapian.QueryParser.FLAG_BOOLEAN |
    xapian.QueryParser.FLAG_LOVEHATE |
    xapian.QueryParser.FLAG_BOOLEAN_ANY_CASE |
    xapian.QueryParser.FLAG_WILDCARD |
    xapian.QueryParser.FLAG_PURE_NOT |
    xapian.QueryParser.FLAG_SPELLING_CORRECTION |
    xapian.QueryParser.FLAG_AUTO_SYNONYMS)

enquire = xapian.Enquire(db)
enquire.set_query(query)

count = 40
matches = enquire.get_mset(0, count)
estimated = matches.get_matches_estimated()
print "%d/%d results" % (matches.size(), estimated)

data = dict((str(x["id"]), x) for x in simplejson.load(open(os.path.join(DBNAME, "all.json"))))

for m in matches:
    rec = data[m.document.get_data()]
    print rec["text"]

total = db.get_doccount()
estimated = matches.get_matches_estimated()
print "%d results over %d documents, %d%%" % (estimated, total, estimated * 100 / total)

Neat! Now that we had a proper index supporting all sorts of cool things, like stemming, tag clouds, full text search with complex queries, lookup of similar documents, keyword suggestions and so on, it seemed only fair to put together a web service to share it with the other people at the event.

It helped that I had already written similar code for apt-xapian-index and dde before.

Here is the server, quickly built on bottle. The very last line starts the server and it is where you can configure the listening interface and port.

#!/usr/bin/python

# xserve - Make the GMP24 tweet Xapian database available on the web
#
# Copyright (C) 2010  Enrico Zini <enrico@enricozini.org>
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program.  If not, see <http://www.gnu.org/licenses/>.

import bottle
from bottle import route, post
from cStringIO import StringIO
import cPickle as pickle
import simplejson
import sys
import os, os.path
import xapian
import urllib
import math

bottle.debug(True)

DBNAME = sys.argv[1]
QUERYLOG = os.path.join(DBNAME, "queries.txt")

data = dict((str(x["id"]), x) for x in simplejson.load(open(os.path.join(DBNAME, "all.json"))))

prefixes = { "place": "XP", "soc": "XS", "loc": "XL" }
prefix_desc = { "place": "Place name", "soc": "Social category", "loc": "Social location" }

db = xapian.Database(DBNAME)

stem = xapian.Stem("english")

qp = xapian.QueryParser()
qp.set_default_op(xapian.Query.OP_AND)
qp.set_database(db)
qp.set_stemmer(stem)
qp.set_stemming_strategy(xapian.QueryParser.STEM_SOME)
for k, v in prefixes.iteritems():
    qp.add_boolean_prefix(k, v)

def make_query(qstring):
    return qp.parse_query(qstring,
        xapian.QueryParser.FLAG_BOOLEAN |
        xapian.QueryParser.FLAG_LOVEHATE |
        xapian.QueryParser.FLAG_BOOLEAN_ANY_CASE |
        xapian.QueryParser.FLAG_WILDCARD |
        xapian.QueryParser.FLAG_PURE_NOT |
        xapian.QueryParser.FLAG_SPELLING_CORRECTION |
        xapian.QueryParser.FLAG_AUTO_SYNONYMS)


@route("/")
def index():
    query = urllib.unquote_plus(bottle.request.GET.get("q", ""))

    out = StringIO()
    print >>out, '''
<html>
<head>
<title>Query</title>
<script src="http://ajax.googleapis.com/ajax/libs/jquery/1.4.2/jquery.min.js"></script>
<script type="text/javascript">
$(function(){
    $("#queryfield")[0].focus()
})
</script>
</head>
<body>
<h1>Search</h1>
<form method="POST" action="/query">
Keywords: <input type="text" name="query" value="%s" id="queryfield">
<input type="submit">
<a href="http://xapian.org/docs/queryparser.html">Help</a>
</form>''' % query

    print >>out, '''
<p>Example: "car place:wigan"</p>

<p>Available prefixes:</p>

<ul>
'''
    for pfx in prefixes.keys():
        print >>out, "<li><a href='/catinfo/%s'>%s - %s</a></li>" % (pfx, pfx, prefix_desc[pfx])
    print >>out, '''
</ul>
'''

    oldqueries = []
    if os.path.exists(QUERYLOG):
        total = db.get_doccount()
        fd = open(QUERYLOG, "r")
        while True:
            try:
                q = pickle.load(fd)
            except EOFError:
                break
            oldqueries.append(q)
        fd.close()

        def print_query(q):
            count = q["count"]
            print >>out, "<li><a href='/query?query=%s'>%s (%d/%d %.2f%%)</a></li>" % (urllib.quote_plus(q["q"]), q["q"], count, total, count * 100.0 / total)

        print >>out, "<p>Last 10 queries:</p><ul>"
        for q in oldqueries[:-11:-1]:  # the last 10, most recent first
            print_query(q)
        print >>out, "</ul>"

        # Remove duplicates
        oldqueries = dict(((x["q"], x) for x in oldqueries)).values()

        print >>out, "<table>"
        print >>out, "<tr><th>10 queries with most results</th><th>10 queries with least results</th></tr>"
        print >>out, "<tr><td>"

        print >>out, "<ul>"
        oldqueries.sort(key=lambda x:x["count"], reverse=True)
        for q in oldqueries[:10]:
            print_query(q)
        print >>out, "</ul>"

        print >>out, "</td><td>"

        print >>out, "<ul>"
        nonempty = [x for x in oldqueries if x["count"] > 0]
        nonempty.sort(key=lambda x:x["count"])
        for q in nonempty[:10]:
            print_query(q)
        print >>out, "</ul>"

        print >>out, "</td></tr>"
        print >>out, "</table>"

    print >>out, '''
</body>
</html>'''
    return out.getvalue()

@route("/query")
@route("/query/")
@post("/query")
@post("/query/")
def query():
    query = bottle.request.POST.get("query", bottle.request.GET.get("query", ""))
    enquire = xapian.Enquire(db)
    enquire.set_query(make_query(query))

    count = 40
    matches = enquire.get_mset(0, count)
    estimated = matches.get_matches_estimated()
    total = db.get_doccount()

    out = StringIO()
    print >>out, '''
<html>
<head><title>Results</title></head>
<body>
<h1>Results for "<b>%s</b>"</h1>
''' % query

    if estimated == 0:
        print >>out, "No results found."
    else:
        # Give as results the first 30 documents; also use them as the key
        # ones to use to compute relevant terms
        rset = xapian.RSet()
        for m in enquire.get_mset(0, 30):
            rset.add_document(m.document.get_docid())

        # Compute the tag cloud
        class NonTagFilter(xapian.ExpandDecider):
            def __call__(self, term):
                return not term[0].isupper() and not term[0].isdigit()
        cloud = []
        maxscore = None
        for res in enquire.get_eset(40, rset, NonTagFilter()):
            # Normalise the score in the interval [0, 1]
            weight = math.log(res.weight)
            if maxscore is None: maxscore = weight
            tag = res.term
            cloud.append([tag, float(weight) / maxscore])
        max_weight = cloud[0][1]
        min_weight = cloud[-1][1]
        cloud.sort(key=lambda x:x[0])

        def mklink(query, term):
            return "/query?query=%s" % urllib.quote_plus(query + " and " + term)
        print >>out, "<h2>Tag cloud</h2>"
        print >>out, "<blockquote>"
        weight_span = max_weight - min_weight
        for term, weight in cloud:
            # Scale the font size between 100% and 200%, guarding against a
            # cloud where all terms have the same weight
            size = (100 + 100.0 * (weight - min_weight) / weight_span) if weight_span else 100
            print >>out, "<a href='%s' style='font-size:%d%%; color:brown;'>%s</a>" % (mklink(query, term), size, term)
        print >>out, "</blockquote>"

        print >>out, "<h2>Results</h2>"
        print >>out, "<p><a href='/'>Search again</a></p>"

        print >>out, "<p>%d results over %d documents, %.2f%%</p>" % (estimated, total, estimated * 100.0 / total)
        print >>out, "<p>%d/%d results</p>" % (matches.size(), estimated)

        print >>out, "<ul>"
        for m in matches:
            rec = data[m.document.get_data()]
            print >>out, "<li><a href='/item/%s'>%s</a></li>" % (rec["id"], rec["text"])
        print >>out, "</ul>"

        fd = open(QUERYLOG, "a")
        qinfo = dict(q=query, count=estimated)
        pickle.dump(qinfo, fd)
        fd.close()

    print >>out, '''
<a href="/">Search again</a>

</body>
</html>'''
    return out.getvalue()

@route("/item/:id")
@route("/item/:id/")
def show(id):
    rec = data[id]

    out = StringIO()
    print >>out, '''
<html>
<head><title>Result %s</title></head>
<body>
<h1>Raw JSON record for tweet %s</h1>
<pre>''' % (rec["id"], rec["id"])

    print >>out, simplejson.dumps(rec, indent=" ")

    print >>out, '''
</pre>
</body>
</html>'''
    return out.getvalue()

@route("/catinfo/:name")
@route("/catinfo/:name/")
def catinfo(name):
    prefix = prefixes[name]
    out = StringIO()
    print >>out, '''
<html>
<head><title>Values for %s</title></head>
<body>
''' % name

    terms = [(x.term[len(prefix):], db.get_termfreq(x.term)) for x in db.allterms(prefix)]
    terms.sort(key=lambda x:x[1], reverse=True)
    # terms is sorted by frequency in descending order
    freq_max = terms[0][1]
    freq_min = terms[-1][1]

    def mklink(name, term):
        return "/query?query=%s" % urllib.quote_plus(name + ":" + term)

    # Build tag cloud
    print >>out, "<h1>Tag cloud</h1>"
    print >>out, "<blockquote>"
    freq_span = freq_max - freq_min
    for term, freq in sorted(terms[:20], key=lambda x:x[0]):
        # Scale the font size between 100% and 200%, guarding against all
        # terms having the same frequency
        size = (100 + 100.0 * (freq - freq_min) / freq_span) if freq_span else 100
        print >>out, "<a href='%s' style='font-size:%d%%; color:brown;'>%s</a>" % (mklink(name, term), size, term)
    print >>out, "</blockquote>"

    print >>out, "<h1>All terms</h1>"
    print >>out, "<table>"
    print >>out, "<tr><th>Occurrences</th><th>Name</th></tr>"
    for term, freq in terms:
        print >>out, "<tr><td>%d</td><td><a href='/query?query=%s'>%s</a></td></tr>" % (freq, urllib.quote_plus(name + ":" + term), term)
    print >>out, "</table>"

    print >>out, '''
</body>
</html>'''
    return out.getvalue()

# Change here for bind host and port
bottle.run(host="0.0.0.0", port=8024)
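
Starting it is just a matter of (database directory as before):

$ ./xserve gmp24.db

and then pointing a browser at http://localhost:8024/.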

...and then we presented our work and ended up winning the contest.

This was the story of how we wrote this set of award winning code.

Posted Sat Oct 16 01:36:08 2010 Tags:

Computing time offsets between EXIF and GPS

I like the idea of matching photos to GPS traces. In Debian there is gpscorrelate but it's almost unusable to me because of bug #473362 and it has an awkward way of specifying time offsets.

Here at SoTM10 someone told me that exiftool gained -geosync and -geotag options. So it's just a matter of creating a little tool that shows a photo and asks you to type the GPS time you see in it.

Apparently there are no bindings or GIR files for gtkimageview in Debian, so I'll have to use C.

Here is a C prototype:

/*
 * gpsoffset - Compute EXIF time offset from a photo of a gps display
 *
 * Use with exiftool -geosync=... -geotag trace.gpx DIR
 *
 * Copyright (C) 2009--2010  Enrico Zini <enrico@enricozini.org>
 *
 * This program is free software; you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation; either version 2 of the License, or
 * (at your option) any later version.
 *
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with this program; if not, write to the Free Software
 * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
 */


#define _XOPEN_SOURCE /* glibc2 needs this */
#include <time.h>
#include <gtkimageview/gtkimageview.h>
#include <libexif/exif-data.h>
#include <stdio.h>
#include <stdlib.h>

static int load_time(const char* fname, struct tm* tm)
{
    ExifData* exif_data = exif_data_new_from_file(fname);
    if (exif_data == NULL)
    {
        fprintf(stderr, "Cannot read EXIF data from %s\n", fname);
        return -1;
    }
    ExifEntry* exif_time = exif_data_get_entry(exif_data, EXIF_TAG_DATE_TIME);
    if (exif_time == NULL)
    {
        fprintf(stderr, "Cannot find EXIF timestamp\n");
        return -1;
    }

    char buf[1024];
    exif_entry_get_value(exif_time, buf, 1024);
    //printf("val2: %s\n", exif_entry_get_value(t2, buf, 1024));

    if (strptime(buf, "%Y:%m:%d %H:%M:%S", tm) == NULL)
    {
        fprintf(stderr, "Cannot match EXIF timetamp\n");
        return -1;
    }

    return 0;
}

static time_t exif_ts;
static GtkWidget* res_lbl;

void date_entry_changed(GtkEditable *editable, gpointer user_data)
{
    const gchar* text = gtk_entry_get_text(GTK_ENTRY(editable));
    struct tm parsed = {0}; /* zero it: strptime only sets the fields it parses */
    if (strptime(text, "%Y-%m-%d %H:%M:%S", &parsed) == NULL)
    {
        gtk_label_set_text(GTK_LABEL(res_lbl), "Please enter a date as YYYY-MM-DD HH:MM:SS");
    } else {
        time_t img_ts = mktime(&parsed);
        int c;
        int res;
        if (exif_ts < img_ts)
        {
            c = '+';
            res = img_ts - exif_ts;
        }
        else
        {
            c = '-';
            res = exif_ts - img_ts;
        }
        char buf[1024];
        if (res >= 3600)
            snprintf(buf, 1024, "Result: %c%ds -geosync=%c%d:%02d:%02d",
                    c, res, c, res / 3600, (res / 60) % 60, res % 60);
        else if (res >= 60)
            snprintf(buf, 1024, "Result: %c%ds -geosync=%c%02d:%02d",
                    c, res, c, (res / 60) % 60, res % 60);
        else 
            snprintf(buf, 1024, "Result: %c%ds -geosync=%c%d",
                    c, res, c, res);
        gtk_label_set_text(GTK_LABEL(res_lbl), buf);
    }
}

int main (int argc, char *argv[])
{
    // Work in UTC to avoid mktime applying DST or timezones
    setenv("TZ", "UTC", 1);
    tzset();

    const char* filename = "/home/enrico/web-eddie/galleries/2010/04-05-Uppermill/P1080932.jpg";

    gtk_init (&argc, &argv);

    struct tm exif_time = {0};
    if (load_time(filename, &exif_time) != 0)
        return 1;

    printf("EXIF time: %s\n", asctime(&exif_time));
    exif_ts = mktime(&exif_time);

    GtkWidget* window = gtk_window_new(GTK_WINDOW_TOPLEVEL);
    GtkWidget* vb = gtk_vbox_new(FALSE, 0);
    GtkWidget* hb = gtk_hbox_new(FALSE, 0);
    GtkWidget* lbl = gtk_label_new("Timestamp:");
    GtkWidget* exif_lbl;
    {
        char buf[1024];
        strftime(buf, 1024, "EXIF time: %Y-%m-%d %H:%M:%S", &exif_time);
        exif_lbl = gtk_label_new(buf);
    }
    GtkWidget* date_ent = gtk_entry_new();
    res_lbl = gtk_label_new("Result:");
    GtkWidget* view = gtk_image_view_new();
    GdkPixbuf* pixbuf = gdk_pixbuf_new_from_file(filename, NULL);

    gtk_box_pack_start(GTK_BOX(hb), lbl, FALSE, TRUE, 0);
    gtk_box_pack_start(GTK_BOX(hb), date_ent, TRUE, TRUE, 0);

    gtk_signal_connect(GTK_OBJECT(date_ent), "changed", (GCallback)date_entry_changed, NULL);
    {
        char buf[1024];
        strftime(buf, 1024, "%Y-%m-%d %H:%M:%S", &exif_time);
        gtk_entry_set_text(GTK_ENTRY(date_ent), buf);
    }

    gtk_widget_set_size_request(view, 500, 400);
    gtk_image_view_set_pixbuf(GTK_IMAGE_VIEW(view), pixbuf, TRUE);
    gtk_container_add(GTK_CONTAINER(window), vb);
    gtk_box_pack_start(GTK_BOX(vb), view, TRUE, TRUE, 0);
    gtk_box_pack_start(GTK_BOX(vb), hb, FALSE, TRUE, 0);
    gtk_box_pack_start(GTK_BOX(vb), exif_lbl, FALSE, TRUE, 0);
    gtk_box_pack_start(GTK_BOX(vb), res_lbl, FALSE, TRUE, 0);
    gtk_widget_show_all(window);

    gtk_main ();

    return 0;
}

And here is its simple makefile:

CFLAGS=$(shell pkg-config --cflags gtkimageview libexif)
LDFLAGS=$(shell pkg-config --libs gtkimageview libexif)

gpsoffset: gpsoffset.c

It's a simple prototype but it's a working prototype and seems to do the job for me.
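
Once the tool has printed the offset, geotagging a whole directory of photos is a single exiftool run (the offset value and paths here are illustrative):

$ exiftool -geosync=+0:01:23 -geotag trace.gpx photos/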

I currently cannot figure out why, after I click on the text box, there seems to be no way to give focus back to the image viewer so that I can control it with the keyboard.

There is another nice algorithm for computing time offsets that is still waiting to be implemented: you choose a photo taken in a known place and drag it onto that place on a map; the tool can then look for the nearest point on your GPX trace and compute the time offset from that.
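
A minimal sketch of that idea, assuming the GPX trace has already been parsed into a list of (time, lat, lon) tuples (all the names here are made up for illustration):

def time_offset(exif_time, lat, lon, trackpoints):
    '''trackpoints is a list of (datetime, lat, lon) tuples'''
    def dist2(p):
        # Rough planar distance: good enough to pick the nearest point
        return (p[1] - lat) ** 2 + (p[2] - lon) ** 2
    nearest = min(trackpoints, key=dist2)
    # The result is a timedelta: positive if the camera clock is behind
    # the GPS clock
    return nearest[0] - exif_time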

I have seen that there are programs for geotagging photos that implement all such algorithms, and have a nice UI, but I haven't seen any in Debian.

Is there any such software that could be packaged?

If not, the interpolation and annotation tasks can now already be performed by exiftool, so it's just a matter of building a good UI, and I would love to see someone picking up the task.

Posted Sun Jul 11 12:34:04 2010 Tags:

Searching OSM nodes in Spatialite

Third step of my SoTM10 pet project: finding the POIs.

I put together a query to find all nodes with a given tag inside a bounding box, and also a query to find all the tag values for a given tag name inside a bounding box.

The result is this simple POI search engine:

#
# poisearch - simple geographical POI search engine
#
# Copyright (C) 2010  Enrico Zini <enrico@enricozini.org>
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program; if not, write to the Free Software
# Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
#

import simplejson
from pysqlite2 import dbapi2 as sqlite

class PoiDB(object):
    def __init__(self):
        self.db = sqlite.connect("pois.db")
        self.db.enable_load_extension(True)
        self.db.execute("SELECT load_extension('libspatialite.so')")
        self.oldsearch = []
        self.bbox = None

    def set_bbox(self, xmin, xmax, ymin, ymax):
        '''Set bbox for searches'''
        self.bbox = (xmin, xmax, ymin, ymax)

    def tagid(self, name, val):
        '''Get the database ID for a tag'''
        c = self.db.cursor()
        c.execute("SELECT id FROM tag WHERE name=? AND value=?", (name, val))
        res = None
        for row in c:
            res = row[0]
        return res

    def tagnames(self):
        '''Get all tag names'''
        c = self.db.cursor()
        c.execute("SELECT DISTINCT name FROM tag ORDER BY name")
        for row in c:
            yield row[0]

    def tagvalues(self, name, use_bbox=False):
        '''
        Get all tag values for a given tag name,
        optionally in the current bounding box
        '''
        c = self.db.cursor()
        if self.bbox is None or not use_bbox:
            c.execute("SELECT DISTINCT value FROM tag WHERE name=? ORDER BY value", (name,))
        else:
            c.execute("SELECT DISTINCT tag.value FROM poi, poitag, tag"
                      " WHERE poi.rowid IN (SELECT pkid FROM idx_poi_geom WHERE ("
                      "       xmin >= ? AND xmax <= ? AND ymin >= ? AND ymax <= ?) )"
                      "   AND poitag.tag = tag.id AND poitag.poi = poi.id"
                      "   AND tag.name=?",
                      self.bbox + (name,))
        for row in c:
            yield row[0]

    def search(self, name, val):
        '''Get all name:val tags in the current bounding box'''
        # First resolve the tagid
        tagid = self.tagid(name, val)
        if tagid is None: return

        c = self.db.cursor()
        c.execute("SELECT poi.name, poi.data, X(poi.geom), Y(poi.geom) FROM poi, poitag"
                  " WHERE poi.rowid IN (SELECT pkid FROM idx_poi_geom WHERE ("
                  "       xmin >= ? AND xmax <= ? AND ymin >= ? AND ymax <= ?) )"
                  "   AND poitag.tag = ? AND poitag.poi = poi.id",
                  self.bbox + (tagid,))
        self.oldsearch = []
        for row in c:
            self.oldsearch.append(row)
            yield row[0], simplejson.loads(row[1]), row[2], row[3]

    def count(self, name, val):
        '''Count all name:val tags in the current bounding box'''
        # First resolve the tagid
        tagid = self.tagid(name, val)
        if tagid is None: return

        c = self.db.cursor()
        c.execute("SELECT COUNT(*) FROM poi, poitag"
                  " WHERE poi.rowid IN (SELECT pkid FROM idx_poi_geom WHERE ("
                  "       xmin >= ? AND xmax <= ? AND ymin >= ? AND ymax <= ?) )"
                  "   AND poitag.tag = ? AND poitag.poi = poi.id",
                  self.bbox + (tagid,))
        for row in c:
            return row[0]

    def replay(self):
        for row in self.oldsearch:
            yield row[0], simplejson.loads(row[1]), row[2], row[3]
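
A quick example of how the class is meant to be used, with a bounding box around Girona (the same coordinates queried in the next post):

db = PoiDB()
db.set_bbox(2.80, 2.85, 41.97, 42.00)
for name, data, x, y in db.search("amenity", "fountain"):
    print name, x, y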

Problem 3 solved: now on to the next step, building a user interface for it.

Posted Sat Jul 10 15:50:31 2010 Tags:

Importing OSM nodes into Spatialite

Second step of my SoTM10 pet project: creating a searchable database with the points. What a fantastic opportunity to learn Spatialite.

Learning Spatialite is easy. For example, you can use the two tutorials with catchy titles that assume your best wish in life is to create databases out of shapefiles using a pre-built, i386-only executable GUI binary downloaded over an insecure HTTP connection.

To be fair, the second of those tutorials is called "An almost Idiot's Guide", thus making explicit the requirement of being an almost idiot in order to happily acquire and run software that way.

Alternatively, you can use A quick tutorial to SpatiaLite, which is so quick it has examples that lead you to write SQL queries that trigger all sorts of vague exceptions at insert time. But at least it brought me a long way forward, at which point I could cross-reference things with the PostGIS documentation to find out the right way of doing them.

So, here's the importer script, which will probably become my reference example for how to get started with Spatialite, and how to use Spatialite from Python:

#!/usr/bin/python

#
# poiimport - import nodes from OSM into a spatialite DB
#
# Copyright (C) 2010  Enrico Zini <enrico@enricozini.org>
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program; if not, write to the Free Software
# Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
#

import xml.sax
import xml.sax.handler
from pysqlite2 import dbapi2 as sqlite
import simplejson
import sys
import os

class OSMPOIReader(xml.sax.handler.ContentHandler):
    '''
    Filter SAX events in a OSM XML file to keep only nodes with names
    '''
    def __init__(self, consumer):
        self.consumer = consumer

    def startElement(self, name, attrs):
        if name == "node":
            self.attrs = attrs
            self.tags = dict()
        elif name == "tag":
            self.tags[attrs["k"]] = attrs["v"]

    def endElement(self, name):
        if name == "node":
            lat = float(self.attrs["lat"])
            lon = float(self.attrs["lon"])
            id = int(self.attrs["id"])
            #dt = parse(self.attrs["timestamp"])
            uid = self.attrs.get("uid", None)
            uid = int(uid) if uid is not None else None
            user = self.attrs.get("user", None)

            self.consumer(lat, lon, id, self.tags, user=user, uid=uid)

class Importer(object):
    '''
    Create the spatialite database and populate it
    '''
    TAG_WHITELIST = set(["amenity", "shop", "tourism", "place"])

    def __init__(self, filename):
        self.db = sqlite.connect(filename)
        self.db.enable_load_extension(True)
        self.db.execute("SELECT load_extension('libspatialite.so')")
        self.db.execute("SELECT InitSpatialMetaData()")
        self.db.execute("INSERT INTO spatial_ref_sys (srid, auth_name, auth_srid,"
                        " ref_sys_name, proj4text) VALUES (4326, 'epsg', 4326,"
                        " 'WGS 84', '+proj=longlat +ellps=WGS84 +datum=WGS84"
                        " +no_defs')")
        self.db.execute("CREATE TABLE poi (id int not null unique primary key,"
                        " name char, data text)")
        self.db.execute("SELECT AddGeometryColumn('poi', 'geom', 4326, 'POINT', 2)")
        self.db.execute("SELECT CreateSpatialIndex('poi', 'geom')")
        self.db.execute("CREATE TABLE tag (id integer primary key autoincrement,"
                        " name char, value char)")
        self.db.execute("CREATE UNIQUE INDEX tagidx ON tag (name, value)")
        self.db.execute("CREATE TABLE poitag (poi int not null, tag int not null)")
        self.db.execute("CREATE UNIQUE INDEX poitagidx ON poitag (poi, tag)")
        self.tagid_cache = dict()

    def tagid(self, k, v):
        key = (k, v)
        res = self.tagid_cache.get(key, None)
        if res is None:
            c = self.db.cursor()
            c.execute("SELECT id FROM tag WHERE name=? AND value=?", key)
            for row in c:
                self.tagid_cache[key] = row[0]
                return row[0]
            self.db.execute("INSERT INTO tag (id, name, value) VALUES (NULL, ?, ?)", key)
            c.execute("SELECT last_insert_rowid()")
            for row in c:
                res = row[0]
            self.tagid_cache[key] = res
        return res

    def __call__(self, lat, lon, id, tags, user=None, uid=None):
        # Acquire tag IDs
        tagids = []
        for k, v in tags.iteritems():
            if k not in self.TAG_WHITELIST: continue
            for val in v.split(";"):
                tagids.append(self.tagid(k, val))

        # Skip elements that don't have the tags we want
        if not tagids: return

        geom = "POINT(%f %f)" % (lon, lat)
        self.db.execute("INSERT INTO poi (id, geom, name, data)"
                        "     VALUES (?, GeomFromText(?, 4326), ?, ?)", 
                (id, geom, tags["name"], simplejson.dumps(tags)))

        for tid in tagids:
            self.db.execute("INSERT INTO poitag (poi, tag) VALUES (?, ?)", (id, tid))


    def done(self):
        self.db.commit()

# Get the output file name
filename = sys.argv[1]

# Ensure we start from scratch
if os.path.exists(filename):
    print >>sys.stderr, filename, "already exists"
    sys.exit(1)

# Import
parser = xml.sax.make_parser()
importer = Importer(filename)
handler = OSMPOIReader(importer)
parser.setContentHandler(handler)
parser.parse(sys.stdin)
importer.done()

Let's run it:

$ ./poiimport pois.db < pois.osm 
SpatiaLite version ..: 2.4.0    Supported Extensions:
        - 'VirtualShape'        [direct Shapefile access]
        - 'VirtualDbf'          [direct Dbf access]
        - 'VirtualText'         [direct CSV/TXT access]
        - 'VirtualNetwork'      [Dijkstra shortest path]
        - 'RTree'               [Spatial Index - R*Tree]
        - 'MbrCache'            [Spatial Index - MBR cache]
        - 'VirtualFDO'          [FDO-OGR interoperability]
        - 'SpatiaLite'          [Spatial SQL - OGC]
PROJ.4 Rel. 4.7.1, 23 September 2009
GEOS version 3.2.0-CAPI-1.6.0
$ ls -l --si pois*
-rw-r--r-- 1 enrico enrico 17M Jul  9 23:44 pois.db
-rw-r--r-- 1 enrico enrico 37M Jul  9 16:20 pois.osm
$ spatialite pois.db
SpatiaLite version ..: 2.4.0    Supported Extensions:
        - 'VirtualShape'        [direct Shapefile access]
        - 'VirtualDbf'          [direct DBF access]
        - 'VirtualText'         [direct CSV/TXT access]
        - 'VirtualNetwork'      [Dijkstra shortest path]
        - 'RTree'               [Spatial Index - R*Tree]
        - 'MbrCache'            [Spatial Index - MBR cache]
        - 'VirtualFDO'          [FDO-OGR interoperability]
        - 'SpatiaLite'          [Spatial SQL - OGC]
PROJ.4 version ......: Rel. 4.7.1, 23 September 2009
GEOS version ........: 3.2.0-CAPI-1.6.0
SQLite version ......: 3.6.23.1
Enter ".help" for instructions
spatialite> select id from tag where name="amenity" and value="fountain";
24
spatialite> SELECT poi.name, poi.data, X(poi.geom), Y(poi.geom) FROM poi, poitag WHERE poi.rowid IN (SELECT pkid FROM idx_poi_geom WHERE (xmin >= 2.56 AND xmax <= 2.90 AND ymin >= 41.84 AND ymax <= 42.00) ) AND poitag.tag = 24 AND poitag.poi = poi.id;
Font Picant de la Cellera|{"amenity": "fountain", "name": "Font Picant de la Cellera"}|2.616045|41.952449
Font de Can Pla|{"amenity": "fountain", "name": "Font de Can Pla"}|2.622354|41.974724
Font de Can Ribes|{"amenity": "fountain", "name": "Font de Can Ribes"}|2.62311|41.979193

It's impressive: I've got all sorts of useful information for the whole of Spain in just 17MB!

Let's put it into practice: I'm thirsty, is there any water fountain nearby?

spatialite> SELECT count(1) FROM poi, poitag WHERE poi.rowid IN (SELECT pkid FROM idx_poi_geom WHERE (xmin >= 2.80 AND xmax <= 2.85 AND ymin >= 41.97 AND ymax <= 42.00) ) AND poitag.tag = 24 AND poitag.poi = poi.id;
0

Ouch! No water fountains mapped in Girona... yet.

Problem 2 solved: now on to the next step, trying to show the results in some usable way.

Posted Sat Jul 10 09:10:35 2010 Tags:

Filtering nodes out of OSM files

I have a pet project here at SoTM10: create a tool for searching nearby POIs while offline.

The idea is to have something in my pocket (FreeRunner or N900) which doesn't require an internet connection, and which can point me at the nearest fountains, post offices, ATMs, bars and so on.

The first step is to obtain a list of POIs.

In theory one can use Xapi but all the known Xapi servers appear to be down at the moment.

Another attempt is to obtain it by filtering all nodes with the tags we want out of a planet OSM extract. I downloaded the Spanish one and set to work.

First I tried with xmlstarlet, but it ate all the RAM and crashed my laptop: for some reason, on my laptop the Linux kernels up to 2.6.32 (I don't know about later ones) like to swap out ALL running apps to cache I/O operations, which means that heavy I/O operations swap out the very programs performing them, so the system gets caught in some infinite I/O loop and dies. Or at least this is what I've figured out so far.

So, we need SAX. I put together this prototype in Python, which can process a nice 8MB/s of OSM data for quite some time with a constant, low RAM usage:

#!/usr/bin/python

#
# poifilter - extract interesting nodes from OSM XML files
#
# Copyright (C) 2010  Enrico Zini <enrico@enricozini.org>
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program; if not, write to the Free Software
# Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
#


import xml.sax
import xml.sax.handler
import xml.sax.saxutils
import sys

class XMLSAXFilter(xml.sax.handler.ContentHandler):
    '''
    A SAX filter that is a ContentHandler.

    There is xml.sax.saxutils.XMLFilterBase in the standard library but it is
    undocumented, and most of the examples using it you find online are wrong.
    You can look at its source code, and at that point you find out that it is
    an offensive practical joke.
    '''
    def __init__(self, downstream):
        self.downstream = downstream

    # ContentHandler methods

    def setDocumentLocator(self, locator):
        self.downstream.setDocumentLocator(locator)

    def startDocument(self):
        self.downstream.startDocument()

    def endDocument(self):
        self.downstream.endDocument()

    def startPrefixMapping(self, prefix, uri):
        self.downstream.startPrefixMapping(prefix, uri)

    def endPrefixMapping(self, prefix):
        self.downstream.endPrefixMapping(prefix)

    def startElement(self, name, attrs):
        self.downstream.startElement(name, attrs)

    def endElement(self, name):
        self.downstream.endElement(name)

    def startElementNS(self, name, qname, attrs):
        self.downstream.startElementNS(name, qname, attrs)

    def endElementNS(self, name, qname):
        self.downstream.endElementNS(name, qname)

    def characters(self, content):
        self.downstream.characters(content)

    def ignorableWhitespace(self, chars):
        self.downstream.ignorableWhitespace(chars)

    def processingInstruction(self, target, data):
        self.downstream.processingInstruction(target, data)

    def skippedEntity(self, name):
        self.downstream.skippedEntity(name)

class OSMPOIHandler(XMLSAXFilter):
    '''
    Filter SAX events in a OSM XML file to keep only nodes with names
    '''
    PASSTHROUGH = ["osm", "bound"]
    TAG_WHITELIST = set(["amenity", "shop", "tourism", "place"])

    def startElement(self, name, attrs):
        if name in self.PASSTHROUGH:
            self.downstream.startElement(name, attrs)
        elif name == "node":
            self.attrs = attrs
            self.tags = []
            self.propagate = False
        elif name == "tag":
            if self.tags is not None:
                self.tags.append(attrs)
                if attrs["k"] in self.TAG_WHITELIST:
                    self.propagate = True
        else:
            self.tags = None
            self.attrs = None

    def endElement(self, name):
        if name in self.PASSTHROUGH:
            self.downstream.endElement(name)
        elif name == "node":
            if self.propagate:
                self.downstream.startElement("node", self.attrs)
                for attrs in self.tags:
                    self.downstream.startElement("tag", attrs)
                    self.downstream.endElement("tag")
                self.downstream.endElement("node")

    def ignorableWhitespace(self, chars):
        pass

    def characters(self, content):
        pass

# Simple stdin->stdout XMl filter
parser = xml.sax.make_parser()
handler = OSMPOIHandler(xml.sax.saxutils.XMLGenerator(sys.stdout, "utf-8"))
parser.setContentHandler(handler)
parser.parse(sys.stdin)

Let's run it:

$ bzcat /store/osm/spain.osm.bz2 | pv | ./poifilter > pois.osm
[...]
$ ls -l --si pois.osm
-rw-r--r-- 1 enrico enrico 19M Jul 10 23:56 pois.osm
$ xmlstarlet val pois.osm 
pois.osm - valid

Problem 1 solved: now on to the next step: importing the nodes in a database.

Posted Fri Jul 9 16:28:15 2010 Tags:

Tweaking locale settings

I sometimes meet Italian programmers who prefer their systems to be in English, so that they get untranslated manpages and error messages.

I often notice that their solutions leave them something to complain about.

Some set LANG=C and complain they can't see accented characters.

Some set LANG=en_US.UTF-8 and complain that OpenOffice Calc wants dates in the format MM/DD/YYYY which is an Abomination unto Nuggan, as well as unto Me.

But the locales system can do much better than that. In fact, most times people would be extremely happy with LC_MESSAGES in English, and everything else in Italian:

$ locale
LANG=it_IT.UTF-8
LC_CTYPE="it_IT.UTF-8"
LC_NUMERIC="it_IT.UTF-8"
LC_TIME="it_IT.UTF-8"
LC_COLLATE="it_IT.UTF-8"
LC_MONETARY="it_IT.UTF-8"
LC_MESSAGES=en_US.UTF-8
LC_PAPER="it_IT.UTF-8"
LC_NAME="it_IT.UTF-8"
LC_ADDRESS="it_IT.UTF-8"
LC_TELEPHONE="it_IT.UTF-8"
LC_MEASUREMENT="it_IT.UTF-8"
LC_IDENTIFICATION="it_IT.UTF-8"
LC_ALL=

A way to do this (is there a better one?) is to tell the display manager (GDM, KDM...) to use the Italian locale, and then override the right LC_* bits in ~/.xsessionrc:

$ cat ~/.xsessionrc
export LC_MESSAGES=en_US.UTF-8
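
The effect is easy to check per command, since any localised program honours these variables at startup; for example:

$ LC_MESSAGES=C ls --help | head -n 1
Usage: ls [OPTION]... [FILE]...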

That does the trick: English messages, with Proper currency, Proper dates, accented letters that sort Properly, Proper A4 printer paper and Proper SI units. Even Nuggan would be happy.

Posted Sat Apr 17 15:39:19 2010 Tags:

Temporarily disabling file caching

Does it happen to you that you cp a big, big file (say, similar in order of magnitude to the amount of RAM) and the system becomes rather unusable?

It looks like Linux is saying "let's cache this", and as you copy it will try to free more and more RAM in order to cache the big file you're copying. In the end, all the RAM is full of file data that you are not going to need.

This varies according to how /proc/sys/vm/swappiness is set.

I learnt about posix_fadvise and I tried to play with it. The result is this preloadable library that hooks into open(2) and fadvises everything as POSIX_FADV_DONTNEED.

It is all rather awkward. fadvise in that way will discard existing cache pages if the file is already cached, which is too much. Ideally one would like to say "don't cache this because of me" without stepping on the feet of other system activities.

Also, I found I need to also hook into write(2) and run fadvise after every single write, because you can't fadvise a file to be written in its entirety, unless you pass fadvise the file size in advance. But the size of the output file cannot be known by the preloaded library, so meh.

So, now I can run: nocache cp bigfile someplace/ without trashing the existing caches. I can also run nocache tar zxf foo.tar.gz and so on. I wish, of course, that there were no need to do so in the first place.

Here is the nocache library source code, for reference:

/*
 * nocache - LD_PRELOAD library to fadvise written files to not be cached
 *
 * Copyright (C) 2009--2010 Enrico Zini <enrico@enricozini.org>
 *
 * This program is free software; you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation; either version 2 of the License, or
 * (at your option) any later version.
 *
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with this program; if not, write to the Free Software
 * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
 */

#define _GNU_SOURCE        /* needed for RTLD_NEXT from dlfcn.h */
#define _XOPEN_SOURCE 600  /* needed for posix_fadvise */
#include <fcntl.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <dlfcn.h>
#include <stdarg.h>
#include <errno.h>
#include <stdio.h>

typedef int (*open_t)(const char*, int, ...);
typedef ssize_t (*write_t)(int fd, const void *buf, size_t count);

int open(const char *pathname, int flags, ...)
{
    static open_t func = 0;
    int res;
    if (!func)
        func = (open_t)dlsym(RTLD_NEXT, "open");

    // Note: I wanted to add O_DIRECT, but it imposes restriction on buffer
    // alignment
    if (flags & O_CREAT)
    {
        va_list ap;
        va_start(ap, flags);
        mode_t mode = va_arg(ap, mode_t);
        res = func(pathname, flags, mode);
        va_end(ap);
    } else
        res = func(pathname, flags);

    if (res >= 0)
    {
        int saved_errno = errno;
        int z = posix_fadvise(res, 0, 0, POSIX_FADV_DONTNEED);
        if (z != 0) fprintf(stderr, "Cannot fadvise on %s: %m\n", pathname);
        errno = saved_errno;
    }

    return res;
}

ssize_t write(int fd, const void *buf, size_t count)
{
    static write_t func = 0;
    ssize_t res;
    if (!func)
        func = (write_t)dlsym(RTLD_NEXT, "write");

    res = func(fd, buf, count);

    if (res > 0)
    {
        int saved_errno = errno;
        int z = posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
        if (z != 0) fprintf(stderr, "Cannot fadvise during write: %m\n");
        errno = saved_errno;
    }

    return res;
}
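
For reference, this is how such a library can be built and wrapped (the file names and the wrapper script are illustrative, not part of any package):

$ gcc -shared -fPIC -o nocache.so nocache.c -ldl
$ cat nocache
#!/bin/sh
LD_PRELOAD=$HOME/lib/nocache.so exec "$@"
$ ./nocache cp bigfile someplace/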

Updates

Steve Schnepp writes:

Robert Love did an O_STREAMING patch for 2.4. It wasn't merged in 2.6, since POSIX_FADV_NOREUSE should be used instead.

But unfortunately it's currently mapped either as WILLNEED or as a noop.

It seems that a Google Code project has been spawned to address this.

Posted Mon Mar 8 13:26:28 2010 Tags:
Posted Sat Jun 6 00:57:39 2009

Pages about Ubuntu.

Live CD on a removable disk, the Debian way

In [live-cd-on-removable-disk] at some point I wrote:

Enrico's note: do we have anything in Debian that we can install and just does that?

Here are the answers:

Sven Mueller writes:

Well, Enrico, a tool I really grew fond of, which auto-configures X on Debian systems, is xdebconfigurator. It is not auto-run on each system start, which I consider a feature on normal systems, but for your proposed usage (i.e. a portable USB-storage based Debian system) it would certainly be the right thing.

Essentially, it never failed on me, except for VMware virtual machines, where the only thing it got wrong was proposing resolutions too high, a consequence of the dual-screen Windows setup I ran VMware on. You might want to give it a try.

Tollef Fog Heen writes:

I added the support in casper for doing this almost a year ago and it has saved me lots of debugging time. Booting the live CD that way is almost as fast as booting an installed system. If you couple this with casper's persistent storage support, you get the configure-on-boot behaviour together with persistence.

In a later update, slh is quoted as saying that xresprobe doesn't work on AMD64. This is wrong: I wrote that support, based on code by Matthew Garrett, a little more than nine months ago. I wouldn't recommend incorporating it in newly written code, though: use libx86 instead.

And finally, Marco Amadori writes:

Without needing to look for tools external to Debian, there is already the Debian Live software in sid: live-package, which creates a live system, and casper, which generates an initramfs that can configure a Debian system on the fly.

So far there is no hard disk target for live-package, but the "Iso" target can already do the job quite well. At boot time, casper's initramfs scans all the block devices, so it also works for USB keys and hard drives.

To obtain a hard drive image, you just need to invoke "make-live" with the options to include the required software, then copy the content of the ISO (or of the directory ./debian-live/binary) to a partition and install the boot loader.

This is what the future "HD" target of live-package will do; so far it can only build ISO and Netboot images.

Posted Sat Jun 6 00:57:39 2009 Tags:

First practical lesson

Notes after today's training session.

Small index of most used shell commands:

  • ls - list directory contents
  • cp - copy files and directories
  • mv - move (rename) files
  • rm - remove files or directories
  • find - search for files in a directory hierarchy
  • cat - concatenate files and print on the standard output
  • more - file perusal filter for crt viewing
  • less - opposite of more (quit with 'q')
  • cd - Change the current directory to DIR. (use "help cd" instead of "man cd")
  • mkdir - make directories
  • rmdir - remove empty directories

Small index of commands useful for combining in pipelines:

  • grep, egrep, fgrep, rgrep - print lines matching a pattern
  • tail - output the last part of files
  • head - output the first part of files
  • sort - sort lines of text files
  • uniq - report or omit repeated lines
  • sed - stream editor
  • wc - print the number of newlines, words, and bytes in files

Problems found during the lesson:

  • You set the system default locale to Amharic, and the gdm login will be in Amharic input mode. We didn't find out how to switch it back to inputting roman characters: right-clicking on the input field to set the input method doesn't work. Since usernames are not in Amharic, you're locked out.
  • So you CTRL+ALT+F1, login and try dpkg-reconfigure locales. On Ubuntu Dapper, it does not work anymore.
  • So you dig and dig and dig and finally find that you can force a locale in /etc/default/gdm (but not in /etc/gdm/locale.conf, nor in /etc/gdm/gdm.conf).
  • Then the internet works for a bit and you look up how to reconfigure locales in Ubuntu. Turns out you have to use localeconf, which is not installed by default, is not in universe and thus not on the CDs, and needs to be downloaded from the Internet.
  • The Ubuntu wiki is all on https, which defeats any attempt of proxy caching.
  • An Internet proxy needs to be configured 3 times: in Gnome, in Firefox and in Synaptic (well, apt). This is especially tricky when you forgot to setup the proxy in Synaptic and seemingly unrelated applications fail, like the Ubuntu language selector, which internally invokes the package manager to download missing langpacks.
  • Some short descriptions in the NAME section of manpages are hard to understand, or wrong. Noted on apt-get, apt-cache and less. Top prize goes to apt-cache:

     NAME
            apt-cache - APT package handling utility -- cache manipulator
     DESCRIPTION
            [...] apt-cache does not manipulate the state of the system but
            does provide operations to search and generate interesting output
            from the package metadata. [...]
    

    So apt-cache is a manipulator that doesn't manipulate. A possible improvement could be "query the APT package cache".

  • The language selector in Ubuntu Breezy doesn't really exit and keeps the package database locked. This seems to be fixed in Dapper, and had probably been fixed in some Breezy update. System updates here are a problem: my Dapper (with some Universe things in it) wanted to download more than 120MB of data, and the Uni network was giving me 14Kbps. It's been a nice opportunity to teach about fuser -uva and kill.
  • dict, squid and many other packages from 'main' are not on the normal Ubuntu CDs: is there an easy way to build a CD with them? Or do Ubuntu CDs with extra packages already exist? I'll have to find out.
  • cupsys has documentation outside of /usr/share/doc, in /usr/share/cups/doc-root.
  • man works on all commands, except cd, which is an internal shell command and thus needs help instead of man. I should remember to ponder about autogenerating manpages from help output.
  • Is there an index-like manpage with a list of the core Unix commands and their short descriptions? If there isn't, it's easy to generate one:

     #!/bin/sh
     # List the short descriptions of all the commands in a directory
     DIR=${1:-"/bin"}
     (
     find "$DIR" | while read FILE
     do
         if [ -x "$FILE" ] && ! [ -d "$FILE" ]
         then
             LANG=C COLUMNS=2000 man `basename "$FILE"` | \
                      grep ^SYNOPSIS -B 100 | grep ^NAME -A 100 | \
                      tail -n +2 | head -n +2 | \
                      grep -v '^[ \t]*$'
         fi
     done
     ) | sort | uniq | sed 's/^ \+//'
    

    Try running it on /bin and /sbin: it's great! Also, since it doesn't redirect stderr, it nicely exposes a number of manpage problems.

Lots of bugs to report when I come home: from here it'll take ages, and lots of money on the hotel internet connection, and some are Ubuntu-specific so I'd need to do everything online with Malone.

As usual, teaching is one of the best ways to find bugs.

I propose an Etch training session a month before release.

Other things to do:

  • Find more info about that Wikipedia live CD with Wikipedia browsable without the Internet.
  • Make a collection of Free technical E-books: even those Indian low-cost book editions are too expensive here, so E-books mean a lot.

Update: Matt Zimmerman writes:

I read your blog entry at http://www.enricozini.org/blog/eng/second-day-in-addis and wanted to respond as follows:

  • localeconf is not the standard way to configure locales in Ubuntu; what documentation told you that? It's an unsupported package from Progeny. If what you wanted was to set the system default locale from the command line, editing /etc/environment is probably the best way.

  • I suggest filing a bug report at <https://launchpad.net/products/ubuntu-website> about the HTTPS issue; I don't think it's necessary for the entire wiki to be HTTPS, only authentication.

  • Synaptic may be able to use the GNOME proxy settings without introducing undesirable dependencies; please file a wishlist bug

  • dict, squid and other packages from main are not on the Ubuntu CDs because there is no space. The DVD contains these packages.

  • The cupsys documentation bug was quite likely inherited from Debian and should be reported there

  • You can file bugs in Malone via email; this has been possible for a long time now. Please don't reinforce this misconception.

    https://help.launchpad.net/UsingMaloneEmail

Posted Sat Jun 6 00:57:39 2009 Tags:

Live CD on a removable disk

Eros is a hardware guru who happened to be the unknown guy sitting next to me on a plane.

He happens to be a happy Kubuntuer. While chatting, he told me one of his systems is an external hard drive made by copying a Kubuntu live CD image on it.

Why did you do so? I asked.

Because this way I can plug it in any computer, and it'll do hardware detection at boot. However it's a hard drive, so it's fast, and I can keep my home and all my customisations on it.

I had never thought of it.

That's an interesting and smart (ab)use of a live CD.

Now I wonder: what would be required to plug the live CD boot-time hardware detection infrastructure into an existing Debian or Ubuntu installation?

Update: slh on IRC suggests (a bit edited by me):

A lot of the former "obscure black magic" for live CDs isn't needed anymore. What is needed is a kernel with static usb-storage, libusual, ehci-hcd, ohci-hcd and uhci-hcd (or an appropriate initrd/initramfs). udev takes care of most h/w detection issues these days.

As long as everything needed to boot is contained in a single partition you don't need an fstab: udev, hal and pmount take care of the rest; procfs, sysfs, devpts, usbfs and shm are mounted by sysvinit.

All that is left is a tool to create the xorg.conf while booting (those tools exist and just need to be called early).

Everything else is just a matter of convenience: extending the life span of the USB key by moving volatile data onto tmpfs, etc.; if passwordless logins are required then xsession and inittab need to be changed; new ssh host keys generated on boot; small stuff.

With ordinary flash storage, jffs2 and something to reduce write access is a good idea (perhaps unionfs for /var/ and /home/, bind mounting /tmp/ on /var/tmp/), but that's also not strictly necessary.

Mostly it boils down to running the xorg-creation script at every boot time.

There are various tools to do that. Some are here, but there is surely more. (Enrico's note: do we have anything in Debian that we can install and just does that?)

USB and PS/2 mice have shared the same device since kernel 2.6, so that part of xorg.conf doesn't strictly need to be detected; the same goes for the keyboard (ALPS and Synaptics touchpads can be easily detected), and X.org can use the screen's DDC info, although it's not always reliable.

It can boil down to just detecting the video chipset: something like this, that uses PCI IDs from discover1-data.

It can also become a lot easier with X.org's own DDC detection, with which it almost boils down to configuring input devices and selecting the video driver. If I understand Daniel Stone correctly, X.org will soon improve its detection routines (fail-safe X (auto-)configuration) as well, in X.org 7.3.

xresprobe is in Debian: it's pretty similar to ddcxinfo-kanotix; both forked off Red Hat's kudzu package, and all fail miserably on amd64. That's why ddcxinfo has a fallback to 1024*768 @75 Hz, which "always works (+manual overrides)".

Posted Sat Jun 6 00:57:39 2009 Tags:

Fixing problems after upgrade to Dapper

Laptop: Asus M3Ae

Problem: Can't mount root partition because of various ACPI errors. Breezy kernel works.

Solution:

  1. Boot with the old kernel.
  2. echo "libata noacpi=1" | sudo tee -a /etc/mkinitramfs/modules (note that sudo echo "..." >> file would not work: the redirection is performed by the unprivileged shell, not by sudo)
  3. sudo mv /boot/initrd.img-2.6.15-25-686 /boot/initrd.img-2.6.15-25-686.backup
  4. sudo mkinitramfs -o /boot/initrd.img-2.6.15-25-686 2.6.15-25-686

Thanks: Matthew Garrett

Posted Sat Jun 6 00:57:39 2009 Tags:
Posted Sat Jun 6 00:57:39 2009

Pages about OpenMoko.

Released nodm 0.7

I have released version 0.7 of nodm.

It only fixes one silly typo in autotools, which made it fail to build on Fedora.

Posted Sun May 23 21:36:52 2010 Tags:

Released nodm 0.6

I have released version 0.6 of nodm.

It is purely a bug fix release, trying harder to detect a console in order to get rid of a bug introduced with version 0.5.

Posted Mon Aug 3 12:34:16 2009 Tags:

Released nodm 0.5

I have released version 0.5 of nodm.

New features:

  • truncate ~/.xsession-errors on startup: finally that file stops growing, and growing, and growing...
  • dynamic VT allocation: it now avoids opening a virtual terminal that is already in use.
Posted Fri Jul 24 02:29:55 2009 Tags:

Getting dbus signatures right from Vala

I am trying to play a bit with Vala on the FreeRunner.

The freesmartphone.org stack on the OpenMoko is heavily based on DBus. Using DBus from Vala is rather simple, if mostly undocumented: you get a few examples in the Vala wiki and you make do with those.

All works fine with simple methods. But what about providing callbacks for signals that have complex nested structures in their signatures, like aa{sv}? You try, and if you don't get the method signature right, the signal is just silently not delivered, because it does not match the method signature.

So this is how to provide a callback to org.freesmartphone.Usage.ResourceChanged, with signature sba{sv}:

public void on_resourcechanged(dynamic DBus.Object pos,
                   string name,
                   bool state,
                   HashTable<string, Value?> attributes)
{
    stderr.printf("Resource %s changed\n", name);
}

And this is how to provide a callback to org.freesmartphone.GPS.UBX.DebugPacket, with signature siaa{sv}:

protected void on_ubxdebug_packet(dynamic DBus.Object ubx, string clid, int length,
        HashTable<string, Value?>[] wrongdata)
{
    stderr.printf("Received UBX debug packet");

    // Ugly ugly work-around
    PtrArray< HashTable<string, Value?> >* data = (PtrArray< HashTable<string, Value?> >)wrongdata;

    stderr.printf("%u elements received", data->len);
}

What is happening here is that the only method signature that I found matching the dbus signature is this one. However, the unmarshaller for some reason gets it wrong, and passes a PtrArray instead of a HashTable array. So you need to cast it back to what you've actually been passed.

Figuring all this out took several long hours and was definitely not fun.
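
For comparison, subscribing to a signal from Python with dbus-python sidesteps the signature-matching problem, since arguments are converted dynamically when the callback is invoked. This is a hypothetical sketch based on dbus-python's documented API, not code from this project; it assumes the FSO service lives on the system bus, and only the interface and signal names are taken from the post above:

import dbus
import gobject
from dbus.mainloop.glib import DBusGMainLoop

# dbus-python delivers signals through a running GLib main loop
DBusGMainLoop(set_as_default=True)
bus = dbus.SystemBus()

def on_resource_changed(name, state, attributes):
    # attributes arrives as a dbus.Dictionary: no signature gymnastics
    print("Resource %s changed" % name)

bus.add_signal_receiver(on_resource_changed,
                        signal_name="ResourceChanged",
                        dbus_interface="org.freesmartphone.Usage")
gobject.MainLoop().run()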

Posted Wed Jul 15 12:30:50 2009 Tags:

Mapping using the Openmoko FreeRunner headset

The FreeRunner has a headset which includes a microphone and a button. When doing OpenStreetMap mapping, it would be very useful to be able to keep tangogps on the display and be able to mark waypoints using the headset button, and to record an audio track using the headset microphone.

In this way, I can use tangogps to see where I need to go, where it's already mapped and where it isn't, and then I can use the headset to mark waypoints corresponding to the audio track, so that later I can take advantage of JOSM's audio mapping features.

Enter audiomap:

$ audiomap --help
Usage: audiomap [options]

Create a GPX and audio track

Options:
  --version      show program's version number and exit
  -h, --help     show this help message and exit
  -v, --verbose  verbose mode
  -m, --monitor  only keep the GPS on and monitor satellite status
  -l, --levels   only show input levels

If called without parameters (or with -v, which is suggested), it will:

  1. Fix the mixer settings so that it can record from the headset and detect headset button presses.
  2. Show a monitor of GPS satellite information until it gets a fix.
  3. Synchronize the system time with the GPS time so that the timestamps of the files that are created afterwards are accurate.
  4. Start recording a GPX track.
  5. Start recording audio.
  6. Record a GPX waypoint for every headset button press.

When you are done, you stop audiomap with ^C, and it will properly close the .wav file, close the tags in the GPX waypoint and track files, and restore the mixer settings.

You can unplug the headset and record using the handset microphone, but then you will not be able to set waypoints until you plug the headset back in.

After you stop audiomap, you will have a track, waypoints and .wav file ready to be loaded in JOSM.

Big thanks go to Luca Capello for finding out how to detect headset button presses.

Posted Sun Jun 7 23:51:37 2009 Tags:

Simple tool to query the GPS using the OpenMoko FSO stack

I was missing a simple command line tool that allows me to perform basic GPS queries in shell scripts.

Enter getgps:

# getgps --help
Usage: getgps [options]

Simple GPS query tool for the FSO stack

Options:
  --version          show program's version number and exit
  -h, --help         show this help message and exit
  -v, --verbose      verbose mode
  -q, --quiet        suppress normal output
  --fix              check if we have a fix
  -s, --sync-time    set system time from GPS time
  --info             get all GPS information
  --info-connection  get GPS connection information
  --info-fix         get GPS fix information
  --info-position    get GPS position information
  --info-accuracy    get GPS accuracy information
  --info-course      get GPS course information
  --info-time        get GPS time information
  --info-satellite   get GPS satellite information

So finally I can write little GPS-aware scripts:

if getgps --fix -q
then
    start_gps_aware_program
else
    start_gps_normal_program
fi

Or this.

Posted Sun Jun 7 17:59:32 2009 Tags:

Voice-controlled waypoints

I have it in my TODO list to implement taking waypoints when pressing the headset button of the openmoko, but that is not done yet.

In the meantime, I did some experiments with audio mapping, and since I did not manage to enter waypoints while recording them, I was looking for a way to make use of them anyway.

Enter findvoice:

$ ./findvoice  --help
Usage: findvoice [options] wavfile

Find the times in the wav file when there is clear voice among the noise

Options:
  --version             show program's version number and exit
  -h, --help            show this help message and exit
  -v, --verbose         verbose mode
  -p NUM, --percentile=NUM
            percentile to use to discriminate noise from voice
            (default: 90)
  -t, --timestamps      print timestamps instead of human readable information

You give it a wav file, and it will output a list of timestamps corresponding to where it thinks that you were talking clearly and near the FreeRunner / voice recorder, instead of leaving the recorder dangling to pick up background noise.

Its algorithm is crude and improvised, because I have no background whatsoever in audio processing, but it basically finds those parts of the audio file where the variance of the samples is above a given percentile: the higher the percentile, the fewer timestamps you get; the lower the percentile, the more likely it is that it picks up a period of louder noise.
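
To illustrate the idea, here is a minimal sketch of that approach (a hypothetical reimplementation, not findvoice's actual source; it assumes a mono 16-bit wav file and numpy):

import wave

import numpy as np

def loud_offsets(path, percentile=90, chunk=8000):
    '''Return the offsets, in seconds, of the fixed-size chunks of
    samples whose variance is above the given percentile.'''
    w = wave.open(path)
    rate = w.getframerate()
    samples = np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16)
    w.close()
    # Variance of each chunk of samples
    n = len(samples) // chunk
    var = samples[:n * chunk].reshape(n, chunk).astype(float).var(axis=1)
    # Keep only the chunks louder than the given percentile
    threshold = np.percentile(var, percentile)
    return [i * chunk / float(rate) for i in np.nonzero(var > threshold)[0]]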

For example, you can automatically extract waypoints out of an audio file by using it together with gpxinterpolate:

./findvoice -t today.wav | ./gpxinterpolate today.gpx > today-waypoints.gpx

The timestamps it outputs are computed using the modification time of the .wav file: if your system clock was decently synchronised (which you can do with getgps), then the mtime of the wav is the time of the end of the recording, which gives the needed reference to compute timestamps that are absolute in time.

For example:

getgps --sync-time
arecord file.wav
^C
./findvoice -t file.wav | ./gpxinterpolate today.gpx > today-waypoints.gpx
Posted Sun Jun 7 02:48:40 2009 Tags:

Geocoding Unix timestamps

Geocoding EXIF tags in JPEG images is fun, but there is more that can benefit from interpolating timestamps over a GPX track.

Enter gpxinterpolate:

$ ./gpxinterpolate --help
Usage: gpxinterpolate [options] gpxfile [gpxfile...]

Read one or more GPX files and a list of timestamps on standard input. Output
a GPX file with waypoints at the location of the GPX track at the given
timestamps.

Options:
  --version      show program's version number and exit
  -h, --help     show this help message and exit
  -v, --verbose  verbose mode

For example, you can create waypoints interpolating file modification times:

find . -printf "%Ts %p\n" | ./gpxinterpolate ~/tracks/*.gpx > myfiles.gpx

In case you wonder where you were when you modified or accessed a file, now you can find out.
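
The core of such a tool is plain linear interpolation between the two trackpoints that bracket each timestamp. A minimal sketch of the idea (hypothetical, not gpxinterpolate's actual code):

import bisect

def interpolate(track, t):
    '''Given a track as a sorted list of (time, lat, lon) tuples and a
    Unix timestamp t, return the interpolated (lat, lon) at time t.'''
    times = [p[0] for p in track]
    i = bisect.bisect_left(times, t)
    # Clamp timestamps that fall outside the track
    if i == 0:
        return track[0][1:]
    if i == len(track):
        return track[-1][1:]
    (t0, lat0, lon0), (t1, lat1, lon1) = track[i - 1], track[i]
    f = (t - t0) / float(t1 - t0)
    return (lat0 + f * (lat1 - lat0), lon0 + f * (lon1 - lon0))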

Posted Sun Jun 7 02:07:43 2009 Tags:

Recording audio on the FreeRunner

The FreeRunner can record audio. It is nice to record audio: for example, I can run the recording in the background while I keep tangogps on the screen, and take audio notes about where I am while I am doing mapping for OpenStreetMap.

Here is the script that I put together to create geocoded audio notes:

#!/bin/sh

WORKDIR=~/rec
TMPINFO=`mktemp $WORKDIR/info.XXXXXXXX`

# Sync system time and get GPS info
echo "Synchronising system time..."
getgps --sync-time --info > $TMPINFO

# Compute an accurate basename for the files we generate
BASENAME=~/rec/rec-$(date +%Y-%m-%d-%H-%M-%S)
# Then give a proper name to the file with saved info
mv $TMPINFO $BASENAME.info

# Proper mixer settings for recording
echo "Recording..."
alsactl -f /usr/share/openmoko/scenarios/voip-handset.state restore
arecord -D hw -f cd -r 8000 -t wav $BASENAME.wav

echo "Done"

It works like this:

  1. It synchronizes the system time from the GPS (if there is a fix) so that the timestamps on the wav files will be as accurate as possible.
  2. It also gets all sorts of information from the GPS and stores it into a file, should you want to inspect it later.
  3. It records audio until it gets interrupted.

The file name of the files that it generates corresponds to the beginning of the recording. The mtime of the wav file obviously corresponds to the end of the recording. This can be used to later georeference the start and end point of the recording.
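
For instance, recovering the absolute start and end times of a recording could look like this (a sketch, under the assumption that the file's mtime was not touched after arecord closed it):

import contextlib
import os
import wave

def recording_span(path):
    '''Return the (start, end) Unix timestamps of a wav recording,
    using the file mtime as the end and the audio duration to
    compute the start.'''
    end = os.path.getmtime(path)
    with contextlib.closing(wave.open(path)) as w:
        duration = w.getnframes() / float(w.getframerate())
    return end - duration, end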

You can use this to check mixer levels and that you're actually getting any input:

arecord -D hw -f cd -r 8000 -t wav -V mono /dev/null

The getgps script is now described in its own post.

You may now want to experiment, in JOSM, with "Preferences / Audio settings / Modified times (time stamps) of audio files".

Posted Sun Jun 7 01:30:37 2009 Tags:

Logging on the Freerunner

By default, the FreeRunner comes without a system log daemon, and rightfully so, because you don't want a daemon filling up /var, and a nightly cron job to rotate logs, on a mobile phone.

However, I do not want to give up logging. If there are bugs I want to investigate, I like to be able to look at the logs.

The solution: remote logging. Configure the phone to send all log info via UDP to my laptop, and configure my laptop to listen for UDP syslog messages from the phone only.

How to do it:

  1. apt-get install rsyslog (see below for how other logging daemons fall short)
  2. Add a /etc/hosts entry for the laptop on the phone: 192.168.0.200 base
  3. Add a /etc/hosts entry for the phone on the laptop: 192.168.0.202 openmoko
  4. Replace the phone's rsyslog rules to send everything to the laptop:

    ################
    ##### RULES ####
    ################
    
    # Everything goes to the main computer, via UDP, if available
    *.*     @base
    
    #
    # Emergencies are sent to everybody logged in.
    #
    *.emerg                         *
    
  5. Tell the laptop's rsyslog to accept UDP messages from the phone only:

    # provides UDP syslog reception
    $ModLoad imudp
    $UDPServerRun 514
    $UDPServerAddress 192.168.0.200
    

That is all. When the phone is not connected, the UDP packets from the logger will go nowhere. When it gets connected to the laptop, a script is triggered that creates the 192.168.0.200 interface, rsyslogd will listen on that interface and the UDP packets will be logged.

This also has the potential to capture error messages in case of bugs in the SD card driver, which I seem to hit from time to time.

Why rsyslog

sysklogd could receive the messages from the phone, but it can only listen on all interfaces, and therefore would require me to set up a firewall on my laptop in order to accept packets from the phone only.

syslog-ng can be told to listen on a given interface, but it will refuse to start if that interface isn't available. Since the phone is on a hotplugged interface that comes and goes, this is totally unacceptable.

And finally, rsyslog is going to replace sysklogd both in Debian and in Fedora, so this is as good an excuse as any to switch.

Posted Sat Jun 6 00:57:39 2009 Tags:
Posted Sat Jun 6 00:57:39 2009

Pages about Debian.

The smell of email

This was written in response to a message with a list of demotivating behaviours in email interactions, like fingerpointing, aggressiveness, resistance when being called out for misbehaving, public humiliation for mistakes, and so on.

There are times when I stumble on an instance of the set of things that were mentioned, and I think "ok, today I feel like doing some paid work rather than working on Debian".

If another day I wake up deciding to enjoy working on Debian, which I greatly do, I try and make sure that I can focus on bits of Debian where I don't stumble on any instances of the set of things that were mentioned.

Then I stumble on Gregor's GDAC and I feel like I'd happily lose one day of pay right now, and have fun with Debian.

I feel like Debian is this big open kitchen populated by a lot of people:

  • some dump shit
  • some poke the shit with a stick, contributing to the spread of the smell
  • some carefully clean up the shit, which in the short term still contributes to the smell, but makes things better in the long term
  • some prepare and cook, making a nice smell of food and NOMs
  • some try out the food and tell us how good it was

I have fun cooking and trying out the food. I have fun being around people who cook and try out the food.

The fun in the kitchen seems to be correlated to several things, one of which is that it seems to be inversely proportional to the stink.

I find this metaphor interesting, and I will start thinking about the smell of a mailing list post. I expect it will put posts into perspective; I expect I will develop an instinct for it, so that I won't give a stinky post the same importance as a post that smells of food.

I also expect that the more I learn to tell the smell of food from the smell of shit, the more I can help clean it up, and the more I can help telling people who repeatedly contribute to the stink to please try cooking instead or, failing that, to just stay out of the kitchen.

Posted Fri Dec 5 11:51:49 2014 Tags:

Fun and Sanity in Debian

A friend of mine recently asked: "is there anything happening in Debian besides systemd?"

Of course there is. He asked it 2 days after the freeze, which happened in time, and with an amazingly low RC bug count.

The most visible thing right now seems to be this endless init system argument, but there are fun and sane things in Debian. Many of them.

I think someone should put the spotlight on them, and here's my attempt.

Yesterday I set up a gobby document asking "What is now happening in Debian that is exciting, fun and sane?", and passed the link around the Cambridge Miniconf and some IRC channels.

Here are a few quotations that I collected:

The armhf and arm64 ports have for me been wonderful and exciting, and were a great time for me to start getting involved. (Jon "Aardvark" Ward)

We have a way of tracking random contributors, and as far as I know no other project has anything like it. (Enrico Zini)

codesearch.debian.net is an incredibly important resource, not just for us but for the free software community at large. (Ben Hutchings)

sources.debian.net is a very useful resource with lots of interested contributors, it received 10 OPW applicants (Stefano Zacchiroli)

It has never been easier to work on a new infrastructure project, thanks to the awesome work of the DSA team. We have dozens of contribution opportunities outside of just plain packaging. (Raphaël Hertzog)

The work on reproducible builds has achieved excellent results with 61.3% of packages being reproducible. (Paul Wise)

Porting arm64 has been (perversely) great fun. It's remarkably moreish, and I like nothing more than a tedious argument with autoconf macros. Working with lots of enthusiastic people from other teams, helping to get the port set up and built, has been great - thank you everybody. (Wookey)

And here are random exciting things that were listed:

  • build-profile support (for bootstrapping) is all in jessie (dpkg, apt, sbuild, python-apt, debhelper, libconfig-model-dpkg-perl, lintian).
  • PointCloudLibrary (PCL) got migrated from Ubuntu to Debian
  • Long Term Support has arrived!
  • http://ci.debian.net
  • Debian is participating for the second time in OPW as mentor orga
  • ftp-master is getting an API
  • cross-toolchains for jessie are available
  • arm64/ppc64el ready to go into jessie
  • wheezy-backports is more useful and used than ever
  • we froze, in time, with a remarkably low RC bug count, and we have a concrete plan for getting from that to a release
Posted Sun Nov 9 16:10:48 2014 Tags:

cryptsetup password and parallel boot

Since parallel boot arrived, the cryptsetup password prompt on my system gets flooded with other boot messages.

I fixed it, as suggested in #764555, installing plymouth, then editing /etc/default/grub to add splash to GRUB_CMDLINE_LINUX_DEFAULT:

GRUB_CMDLINE_LINUX_DEFAULT="quiet splash"

Besides showing pretty pictures (and most importantly, getting them out of my way if I press ESC), plymouth also provides a user prompt that works with parallel boot, which sounds like what I needed.

Posted Fri Oct 24 10:20:22 2014 Tags:

Alternate rescue boot entry with systemd

Since systemd version 215, adding systemd.debug-shell to the kernel command line activates the debug shell on tty9 alongside the normal boot. I like the idea of that, and I'd like to have it in my standard 'rescue' entry in my grub menu.

Unfortunately, by default update-grub does not allow customizing the rescue menu entry options. I have just filed #766530 hoping for that to change.

After testing the patch I proposed for /etc/grub.d/10_linux, I now have this in my /etc/default/grub, with some satisfaction:

GRUB_CMDLINE_LINUX_RECOVERY="systemd.log_target=kmsg systemd.log_level=debug systemd.debug-shell"

Further information:

Thanks to sjoerd and uau on #debian-systemd for their help.

Posted Thu Oct 23 22:06:30 2014 Tags:

Pressure

I've just stumbled on this bit that seems relevant to me:

Insist on using objective criteria

The final step is to use mutually agreed and objective criteria for evaluating the candidate solutions. During this stage they encourage openness and surrender to principle, not pressure.

http://www.wikisummaries.org/Getting_to_Yes

I find the concept of "pressure" very relevant, and I like the idea of discussions being guided by content rather than pressure.

I'm exploring the idea of filing under this concept of "pressure" most of the things described in code of conducts, and I'm toying with looking at gender or race issues from the point of view of making people surrender to pressure.

In that context, most codes of conduct seem to be giving a partial definition of "pressure". I've been uncomfortable at DebConf this year, because the conference PG12 code of conduct would cause me trouble for talking about what lessons Debian can learn from consent culture in BDSM communities, but it would still allow situations in which people would have to yield to pressure, as long as the pressure was applied while avoiding the behaviours blacklisted by the CoC.

Pressure could be the phrase "you are wrong" without further explanation, spoken by someone with more reputation than I have in a project. It could be someone with the time for writing ten emails a day discussing with someone with barely the time to write one. It could be someone using elaborate English discussing with someone who needs to look up every other word in a dictionary. It could be just ignoring emails from people who have issues different than mine.

I like the idea of having "please do not use pressure to bring your issues forward" written somewhere, rather than spending time blacklisting all possible ways of pressuring people.

I love how the Diversity Statement is elegantly getting all this where it says: «We welcome contributions from everyone as long as they interact constructively with our community.»

However, I also find it hard not to fall back to using pressure, even just for self-preservation: I have often found myself in the situation of having the responsibility to get a job done, and not having the time or emotional resources to even read the emails I get about the subject. All my life I've seen people in such a situation yell "shut up and let me work!", and I feel a burning thirst for other kinds of role models.

A CoC saying "do not use pressure" would not help me much here, but being around people who do that, learning to notice when and how they do it, and knowing that I could learn from them, that certainly would.

If you can link to examples, I'd like to add them here.

Posted Tue Sep 23 16:18:22 2014 Tags:

Laptop, I demand that you suspend!

Dear Lazyweb,

Sometimes some application prevents suspend on my laptop. I want to disable that feature: how?

I understand that there may exist some people who like that feature. I, on the other hand, consider a scenario like this inconceivable:

  1. I'm on a plane working with my laptop, the captain announces preparations for landing, so I quickly hit the suspend button (or close the lid) on my laptop and stow it away.
  2. One connecting flight later, I pick up my backpack, I feel it unusually hot and realise that my laptop has been on all along, and is now dead from either running out of battery or thermal protection.
  3. I think things that, if spoken aloud in front of a pentacle, might invoke major Lovecraftian horrors.

I do not want this scenario to ever be possible. I want my suspend button to suspend the laptop no matter what. If a process does not agree, I'm fine with suspending it anyway, or killing it.

If I want my laptop to suspend, I generally have a good enough real-world reason for it, and I cannot conceive that software could ever be allowed to override my command.

How do I change this? I don't know if I should look into systemd, upowerd, pm-utils, the kernel, the display manager or something else entirely. I worry that I cannot even figure where to start looking for a solution.

This happened to me multiple times already, and I consider it ridiculous. I know that it can cause me data loss. I know that it can cause me serious trouble in case I was relying on having some battery or state left at my arrival. I know that depending on what is in my backpack, this could also be physically dangerous.

So, what knob do I tweak for this? How do I make suspend reliable?

Update

Systemd has an inhibitor system, and systemd-inhibit --list only lists 'delay' blocks in my system. It is an interesting feature that seems to be implemented in the right way, and it could mean that I finally can get my screen to be locked before the system is suspended.

It is possible to configure the inhibitor system in /etc/systemd/logind.conf, including ways to ignore inhibitors, and a maximum time after which inhibitors are ignored if not yet released.

Try as I might to run everything that I was running on the plane that time, I could not manage to see anything take an inhibitor block that could have prevented my suspend. I now suspect that what happened to me was a glitch caused by something else (hardware? kernel? cosmic rays!) during that specific suspend.

When I had this issue in the past, it looks like the infrastructure at the time was far more primitive than what we have now with systemd, so I guess that when writing my blog post I had simply correlated my old experiences with a one-off suspend glitch.

If I want to investigate or tune further, to test the situation with a runaway block, I can use commands like systemd-inhibit --mode=block sleep 3600.

I'm quite happy to see that we're moving to a standard and sane system for this. In the meantime, I have learnt that pm-utils has now become superfluous and can be deinstalled, and so can acpi-support and acpi-support-base.

Thanks vbernat, mbiebl, and ah, on #debian-devel for all the help.

Posted Thu Sep 11 14:32:40 2014 Tags:

Wheezy for industrial software development

I'm helping with setting up a wheezy-based toolchain for industrial automation.

The basic requirements are: live-build, C++11, Qt 5.3, and a frozen internal wheezy mirror.

debmirror

A good part of a day's work was lost because of #749734 and possibly #628779. Mirror rebuild is still ongoing, and fingers crossed.

This is Italy, and you can't simply download 21Gb of debs just to see how it goes.

C++11

Stable toolchains for C++11 now exist and have gained fast adoption. It makes sense: given what is in C++11, it is unthinkable nowadays to start a new C++ project with the old standard.

C++11 is supported by g++ 4.8+ or clang 3.3+. None of them is available on wheezy or wheezy-backports.

Backports of g++ 4.8 exist only for Ubuntu 12.04, but they are uninstallable on wheezy, due at least to a different libc6. I tried rebuilding g++ 4.8 on wheezy but quickly gave up.

clang 3.3 has a build dependency on g++ 4.8. LOL.

However, LLVM provides an APT repository with their most recent compiler, and it works, too. C++11 problem solved!

Qt 5.3

Qt 5.3 is needed because of the range of platforms it can target. There is no wheezy backport that I can find.

I cannot simply get it from Qt's Download page and install it, since we need it packaged, to build live ISOs with it.

I'm attempting to backport the packages from experimental to wheezy.

Here are its build dependencies:

libxcb-1.10 (needed by qt5)

Building this is reasonably straightforward.

libxkbcommon 0.4.0 (needed by qt5)

The version from jessie builds fine on wheezy, provided you remove --fail-missing from the dh_install invocation.

libicu 52.1 (needed by harfbuzz)

The jessie packages build on wheezy, provided that mentions of clang are deleted from source/configure.ac, since it fails to build with clang 3.5 (the one currently available for wheezy on llvm.org).

libharfbuzz-dev

Backporting this is a bloodbath: the Debian packages from jessie depend on a forest of gobject hipsterisms of doom, all unavailable on wheezy. I gave up.

qt 5.3

qtbase-opensource-src-5.3.0+dfsg can be made to build with an embedded version of harfbuzz, with just this change:

diff -Naur a/debian/control a/debian/control
--- a/debian/control    2014-05-20 18:48:27.000000000 +0200
+++ b/debian/control    2014-05-29 17:45:31.037215786 +0200
@@ -28,7 +28,6 @@
                libgstreamer-plugins-base0.10-dev,
                libgstreamer0.10-dev,
                libgtk2.0-dev,
-               libharfbuzz-dev,
                libicu-dev,
                libjpeg-dev,
                libmysqlclient-dev,
diff -Naur a/debian/rules b/debian/rules
--- a/debian/rules  2014-05-18 01:56:37.000000000 +0200
+++ b/debian/rules  2014-05-29 17:45:25.738634371 +0200
@@ -108,7 +108,6 @@
                -plugin-sql-tds \
                -system-sqlite \
                -platform $(platform_arg) \
-               -system-harfbuzz \
                -system-zlib \
                -system-libpng \
                -system-libjpeg \

(thanks Lisandro Damián Nicanor Pérez Meyer for helping me there!)

There are probably going to be further steps in the Qt5 toolchain.

Actually, let's try prebuilt binaries

The next day with a fresh mind we realised that it is preferable to reduce our tampering with the original wheezy to a minimum. Our current plan is to use wheezy's original Qt and Qt-using packages, and use Qt's prebuilt binaries in /opt for all our custom software.

We ran Qt's installer, tarred the result, and wrapped it in a Debian package like this:

$ cat debian/rules
#!/usr/bin/make -f

QT_VERSION = 5.3

%:
    dh $@

override_dh_auto_build:
    dh_auto_build
    sed -re 's/@QT_VERSION@/$(QT_VERSION)/g' debian-rules.inc.in > debian-rules.inc

override_dh_auto_install:
    dh_auto_install
    # Download and untar the prebuilt Qt5 binaries
    install -d -o root -g root -m 0755 debian/our-qt5-sdk/opt/Qt
    curl http://localserver/Qt$(QT_VERSION).tar.xz | xz -d | tar -C debian/our-qt5-sdk/opt -xf -
    # Move the runtime part to our-qt5
    install -d -o root -g root -m 0755 debian/our-qt5/opt/Qt
    mv debian/our-qt5-sdk/opt/Qt/$(QT_VERSION) debian/our-qt5/opt/Qt/
    # Makes dpkg-shlibdeps work on packages built with Qt from /opt
    # Hack. Don't try this at home. Don't ever do this unless you
    # know what you are doing. This voids your warranty. If you
    # know what you are doing, you won't do this.
    find debian/our-qt5/opt/Qt/$(QT_VERSION)/gcc_64/lib -maxdepth 1 -type f -name "lib*.so*" \
        | sed -re 's,^.+/(lib[^.]+)\.so.+$$,\1 5 our-qt5 (>= $(QT_VERSION)),' > debian/our-qt5.shlibs


$ cat debian-rules.inc.in
export PATH := /opt/Qt/@QT_VERSION@/gcc_64/bin:$(PATH)
export QMAKESPEC=/opt/Qt/@QT_VERSION@/gcc_64/mkspecs/linux-clang/

To build one of our packages using Qt5.3 and clang, we just add this to its debian/rules:

include /usr/share/our-qt5/debian-rules.inc

Wrap up

We got the dependencies sorted. Hopefully the mirror will rebuild itself tonight, and tomorrow we can resume working on our custom live system.

Posted Thu May 29 18:05:17 2014 Tags:

On responsibilities

I feel like in my Debian projects I have two roles: the person with the responsibility of making the project happen, and the person who does the work to make it happen.

As the person responsible for the project, I need to keep track of vision, goals, milestones, status. To make announcements, find contributors, motivate them, deal with users and bug reports, maintain documentation, digest feedback.

As the person who does the work to make it happen, I need quiet time, I need to study technology, design code, write unit tests, merge patches, code, code, code, ask around about deployment information, more code.

I have a hard time doing both things at the same time: the first engages my social skills and extroversion, requires low-latency interaction, and acting when outside things happen. The second engages my technical skills and introversion, requires quiet uninterrupted periods of flow, and acting when inspiration strikes. I never managed to make good use of "gift bugs" or "minions": I often found the phrase "it's easier for me to do it than to explain it" sadly relevant. Now I understand that it's not because of the objective difficulty of explaining or doing things, nor about the value of doing or of involving people: it's about switching from one kind of workflow to the other. I could rephrase it as "it's easier for me to stay in flow and fix it than to switch my entire attitude and ask for help".

Of course this does not scale: we've all been saying it since I can remember.

Looking at the situation from the point of view of those two roles, however, I now wonder if those two roles shouldn't really require two people. In other words they are: the project managers, taking responsibility for making the project happen, and the software designers, artists, and all other kinds of artisans doing the work to make it happen.

Of course I don't want the kind of project manager that shifts responsibilities to artisans, does nothing and takes the credit for the project: not in paid work, not in Debian.

Project management is something else.

I would be interested instead in having the kind of project manager that takes responsibility for the project, checks how the artisans are doing and communicates what is happening to the rest of the world, deals with the community, motivates more people to help, test, try, use, give feedback on things as they happen. A project manager / community manager.

So that while I'm in flow there is someone who tags bugs as "gift", mentors people to find code and documentation, and remembers to write an announcement if I implemented three cool things in a row and I'm already busy working on the fourth.

So that I don't write cool ideas in my todo list where nobody can read them, but share them on a mailing list, where someone picks up a relevant one and finds someone to make it happen while I'm busy refactoring old code that only I can understand.

So that if I say "sorry, paid work calls, I won't be able to work on this project for a month", I'll be able to completely forget about that project for a whole month, without leaving the community out there to die.

That's an interesting job for non-uploading DDs: please take over my projects. Let's share a vision, and team up to make it happen. Give me the freedom of being the craftsman I enjoy being, and take away from me those responsibilities that I've never asked for.

The worst project managers are those that never asked to be one, but were promoted to it. Let's not repeat that mistake in Debian.

A good part of the credits for this post go to Francesca Ciceri, for the discussions we had on our way back from MiniDebConf Barcelona 2014.

P.S. I can see how a non-uploading DD could be in the Maintainer field for one or more packages, with uploading DDs being, well, uploaders. Food for thought.

Posted Tue Mar 18 16:58:58 2014 Tags:

An absolute truth

Every time people phrase their own opinions as absolute truths, they look grotesque and they incite violence.

If you now feel like stabbing me, then you may be seeing my point.

Posted Tue Feb 11 14:33:18 2014 Tags:

Debops

What I like the most about being a Developer is building tools to (hopefully) make someone's life better.

I like it when my software gets used, and people thank me for it, because they had a need that wasn't met before, and thanks to my software it now is.

I am maintaining software for meteorological research that is soon going to be 10 years old, and is still evolving and getting Real Work done.

I like to develop software as if it is going to become a part of human cultural heritage, developing beyond my capacity, eventually surviving me, allowing society to declare that the need, small as it was, is now met, and move on to worry about some other problem.

I feel that if I'm not thinking of my software in that way, then I am not being serious. Then I am not developing something fit for other people to use and rely on.

This involves Development as much as it involves Operations: tracking security updates for all the components that make up a system. Testing. Quality assurance. Scalability. Stability. Hardening. Monitoring. Maintenance requirements. Deployment and upgrade workflows. Security.

I came to learn that the requirements put forward by sysadmins are to be taken seriously, because they are the ones whose phone will ring in the middle of the night when your software breaks.

I am also involved in more than one software project. I am responsible for about a dozen web applications deployed out there in the wild, and possibly another dozen of non-web projects, from terabyte-sized specialised archival tools to little utilities that are essential links in someone's complex toolchain.

I build my software targeting Debian Stable + Backports. At FOSDEM I noticed that some people consider it uncool. I was perplexed.

It provides me with a vast and reasonably recent set of parts to use to build my systems.

It provides me with a single bug tracking system for all of them, and tools to track known issues in the systems I deployed.

It provides me with a stable platform, with a well documented upgrade path to the next version.

It gives me a release rhythm that allows me to enjoy the sweet hum of spinning fans thinking about my next mischief, instead of spending my waking time chasing configuration file changes and API changes deep down in my dependency chain.

It allows me to rely on Debian for security updates, so I don't have to track upstream activity for each one of the building blocks of the systems I deploy.

It allows me not to worry about a lot of obscure domain specific integration issues. Coinstallability of libraries with different ABI versions. Flawless support for different versions of Python, or Lua, or for different versions of C++ compilers.

It has often happened that I hear someone rant about a frustrating situation, wonder how come it has never happened to me, and realise that someone in Debian, who happens to be more expert than I can possibly be, had thought hard about how to deal with that issue, years before.

I know I cannot be an expert of the entire stack from bare iron all the way up, and I have learnt to stand on the shoulders of giants.

'Devops' makes sense for me in that it hints at this cooperation between developers and operators, having constructive communication, knowing that each side has their own needs, trying their best to meet them all.

It hints at a perfect world where developers and operators finally come to understand and trust each other's judgement.

I don't know that perfect world, but I, a developer, do like to try to understand and trust the judgement of sysadmins.

I sympathise with my sysadmin friends who feel that devops is turning into a trend of developers thinking they can do without sysadmins. Reinventing package managers. Bundling dependencies. Building "apps" instead of components.

I wish that the people who deploy a system built on such premises have it become so successful that they end up being paid to maintain it for their whole career. That is certainly what I wish and strive for, for me and my own projects.

In my experience, a sustainable and maintainable system won't come out of the startup mindset of building something quick&dirty, then selling it and moving on to something else.

In my experience, the basis for having sustainable and maintainable systems has been well known and tested in Debian, and several other distributions, for over two decades.

At FOSDEM, we thought that we need a name for such a mindset.

Between beers, that name came to be "debops". (It's not just Debian, though: many other distributions get it right, too)

Posted Tue Feb 4 18:52:39 2014 Tags:
Posted Sat Jun 6 00:57:39 2009

Localising free software for Taiwanese Aboriginal cultures.

People who participated so far:

Character list for the Paiwan language

We mapped the available glyphs and accents for the Paiwan language.

The letters in alphabetical order:

a b c d e f h i j k l m n p q r s t u v w y z ḏ nġ ḻ ṟ ṯ 

No uppercase.

Update: this character list has been improved and the good version is found in the Debian wiki.

All the characters are in Unicode except nġ, which already needs to be requested for the Amis script.

We need to design an input method to enter the underlined letters and the nġ.

Update: there is now a wiki page on the Debian wiki.

Posted Sat Jun 6 00:57:39 2009 Tags:

Creating a new locale

I'm currently in Cilamitay, in the east of Taiwan. There is a little meeting of Taiwanese Free Software people and people from the Amis, Taroko and Puyuma tribes, with the idea of starting localisation efforts for some aboriginal languages.

These are some of the issues we are going to discuss:

Language code

A new ISO standard (639-3) will hopefully be formalised in January that will include the language codes for the Taiwanese aboriginal tribes. We'll have to work out some temporary solution, but there's good hope that it won't have to stay temporary for long.

List of characters

Because of Christian missionary influence, both Amis and Taroko use a roman alphabet, with accents. We need to work out the complete list of characters and accent combinations, see if everything is in Unicode, and see how they sort.

We then need to find a comfortable way to input them using the keyboards normally available here (English US layout): compose key? Dead keys? How about on Windows?

Womble2 on IRC tells me that on Windows one can work with MSKLC.

Technical terms and country list

We need to work out how to map terms that do not exist in the language.

Technical terms are usually borrowed from Japanese.

Names for all the countries in the world probably do not exist.

Translation interface

We need to find an easy to use interface to input the translations.

There is Rosetta.

There is Pootle. (Thanks to Christian Perrier for pointing me at it)

There is Webpot.

Update: there is now a wiki page on the Debian wiki.

Posted Sat Jun 6 00:57:39 2009 Tags:

Happy new year

A year ago we got in touch with various Taiwanese aboriginal tribes to try to start localisation efforts.

Thanks to the research the Taroko people did during 2007 and the prototype work of tonight, the Taroko people in Taiwan can see the computer calendar of the new year in their own language:

trv_TZW Gnome calendar

Posted Sat Jun 6 00:57:39 2009 Tags:

Character list for the Amis language

We mapped the available glyphs and accents for the Amis language.

The letters in alphabetical order:

    a c d f ng h i k l m n o p r s t u w y

Every one of them can get an acute or circumflex accent on top. ng can get a dot on top of the g.

The accents are literally on top: i would get the dot PLUS the accent on top.

Not all accented characters exist directly in Unicode; however, Unicode provides various kinds of combining characters to take care of these cases.
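
For instance, one can build and inspect such combined characters with Python's unicodedata (a quick illustrative check, not part of any input method):

# -*- coding: utf-8 -*-
import unicodedata

# i with combining dot above (U+0307) plus combining acute (U+0301)
i_acute = u"i\u0307\u0301"

for ch in i_acute:
    print("U+%04X %s" % (ord(ch), unicodedata.name(ch, "?")))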

Then we need an input method that inserts ng instead of g and allows typing all the accent combinations.

Here is the full character set:

    a     á    â
    c     ć    ĉ
    d     d́    d̂
    f     f́    f̂
    ng    nǵ   nĝ  nġ
    h     h́    ĥ
    i     i̇́    i̇̂
    k     ḱ    k̂
    l     ĺ    l̂
    m     ḿ    m̂
    n     ń    n̂
    o     ó    ô
    p     ṕ    p̂
    r     ŕ    r̂
    s     ś    ŝ
    t     t́    t̂
    u     ú    û
    w     ẃ    ŵ
    y     ý    ŷ

Update: this character list has been improved and the good version is found in the Debian wiki.

The list is not displayed correctly with many fonts or rendering engines. Arne made a test page that explicitly sets a font that works.

The accents are not taken into account when sorting.

Uppercase letters are not used.

Note: the page has been updated to reflect further input from Unicode and Amis people.

Update: there is now a wiki page on the Debian wiki.

Posted Sat Jun 6 00:57:39 2009 Tags:

Amis and Paiwan input method and character set

Arne Götje (高盛華) created:

The scripts, especially Amis, make heavy use of Unicode combining characters. They should display well at least with the DejaVu Sans font in many applications.

Try it out: if it displays correctly, you should see:

  • accented letters instead of letters next to accents.
  • i with both the dot and the accent.

Update: there is now a wiki page on the Debian wiki.

Posted Sat Jun 6 00:57:39 2009 Tags:
Posted Sat Jun 6 00:57:39 2009

Python-related posts.

Custom function decorators with TurboGears 2

I am exposing some library functions using a TurboGears2 controller (see web-api-with-turbogears2). It turns out that some functions return a dict, some a list, some a string, and TurboGears 2 only allows JSON serialisation for dicts.

A simple work-around for this is to wrap the function result into a dict, something like this:

@expose("json")
@validate(validator_dispatcher, error_handler=api_validation_error)
def list_colours(self, filter=None, productID=None, maxResults=100, **kw):
    # Call API
    res = self.engine.list_colours(filter, productID, maxResults)

    # Return result
    return dict(r=res)

It would be nice, however, to have an @webapi() decorator that automatically wraps the function result with the dict:

def webapi(func):
    def dict_wrap(*args, **kw):
        return dict(r=func(*args, **kw))
    return dict_wrap

# ...in the controller...

    @expose("json")
    @validate(validator_dispatcher, error_handler=api_validation_error)
    @webapi
    def list_colours(self, filter=None, productID=None, maxResults=100, **kw):
        # Call API
        res = self.engine.list_colours(filter, productID, maxResults)

        # Return result
        return res

This works, as long as @webapi appears last in the list of decorators. This is because if it appears last it will be the first to wrap the function, and so it will not interfere with the tg.decorators machinery.

Would it be possible to create a decorator that can be put anywhere among the decorator list? Yes, it is possible but tricky, and it gives me the feeling that it may break in any future version of TurboGears:

class webapi(object):
    def __call__(self, func):
        def dict_wrap(*args, **kw):
            return dict(r=func(*args, **kw))
        # Migrate the decoration attribute to our new function
        if hasattr(func, 'decoration'):
            dict_wrap.decoration = func.decoration
            dict_wrap.decoration.controller = dict_wrap
            delattr(func, 'decoration')
        return dict_wrap

# ...in the controller...

    @expose("json")
    @validate(validator_dispatcher, error_handler=api_validation_error)
    @webapi
    def list_colours(self, filter=None, productID=None, maxResults=100, **kw):
        # Call API
        res = self.engine.list_colours(filter, productID, maxResults)

        # Return result
        return res

As a convenience, TurboGears 2 offers, in the decorators module, a way to build decorator "hooks":

class before_validate(_hook_decorator):
    '''A list of callables to be run before validation is performed'''
    hook_name = 'before_validate'

class before_call(_hook_decorator):
    '''A list of callables to be run before the controller method is called'''
    hook_name = 'before_call'

class before_render(_hook_decorator):
    '''A list of callables to be run before the template is rendered'''
    hook_name = 'before_render'

class after_render(_hook_decorator):
    '''A list of callables to be run after the template is rendered.

    Will be run before it is returned returned up the WSGI stack'''

    hook_name = 'after_render'

The way these are invoked can be found in the _perform_call function in tg/controllers.py.

To show an example use of those hooks, let's add some polygen wisdom to every data structure we return:

class wisdom(decorators.before_render):
    def __init__(self, grammar):
        super(wisdom, self).__init__(self.add_wisdom)
        self.grammar = grammar
    def add_wisdom(self, remainder, params, output):
        from subprocess import Popen, PIPE
        output["wisdom"] = Popen(["polyrun", self.grammar], stdout=PIPE).communicate()[0]

# ...in the controller...

    @wisdom("genius")
    @expose("json")
    @validate(validator_dispatcher, error_handler=api_validation_error)
    def list_colours(self, filter=None, productID=None, maxResults=100, **kw):
        # Call API
        res = self.engine.list_colours(filter, productID, maxResults)
    
        # Return result
        return res

These hooks cannot however be used for what I need, that is, to wrap the result inside a dict. The reason is that they are called in this way:

        controller.decoration.run_hooks(
                'before_render', remainder, params, output)

and not in this way:

        output = controller.decoration.run_hooks(
                'before_render', remainder, params, output)

So it is possible to modify the output (if it is a mutable structure) but not to exchange it with something else.

Can we do even better? Sure we can. We can assimilate @expose and @validate inside @webapi to avoid repeating those same many decorator lines over and over again:

class webapi(object):
    def __init__(self, error_handler = None):
        self.error_handler = error_handler

    def __call__(self, func):
        def dict_wrap(*args, **kw):
            return dict(r=func(*args, **kw))
        res = expose("json")(dict_wrap)
        res = validate(validator_dispatcher, error_handler=self.error_handler)(res)
        return res

# ...in the controller...

    @expose("json")
    def api_validation_error(self, **kw):
        pylons.response.status = "400 Error"
        return dict(e="validation error on input fields", form_errors=pylons.c.form_errors)

    @webapi(error_handler=api_validation_error)
    def list_colours(self, filter=None, productID=None, maxResults=100, **kw):
        # Call API
        res = self.engine.list_colours(filter, productID, maxResults)

        # Return result
        return res

This got rid of @expose and @validate, and provides almost all the default values that I need. Unfortunately I could not find out how to access api_validation_error from inside the decorator so that I can pass it to the validator, so I am left with the inconvenience of having to pass it explicitly every time.

Posted Wed Nov 4 17:52:38 2009 Tags:

Building a web-based API with Turbogears2

I am using TurboGears2 to export a python API over the web. Every API method is wrapped by a controller method that validates the parameters and returns the results encoded in JSON.

The basic idea is this:

@expose("json")
def list_colours(self, filter=None, productID=None, maxResults=100, **kw):
    # Call API
    res = self.engine.list_colours(filter, productID, maxResults)

    # Return result
    return res

To validate the parameters we can use forms; it's their job, after all:

class ListColoursForm(TableForm):
    fields = [
            # One field per parameter
            twf.TextField("filter", help_text="Please enter the string to use as a filter"),
            twf.TextField("productID", help_text="Please enter the product ID"),
            twf.TextField("maxResults", validator=twfv.Int(min=0), default=200, size=5, help_text="Please enter the maximum number of results"),
    ]
list_colours_form=ListColoursForm()

#...

    @expose("json")
    @validate(list_colours_form, error_handler=list_colours_validation_error)
    def list_colours(self, filter=None, productID=None, maxResults=100, **kw):
        # Parameter validation is done by the form
    
        # Call API
        res = self.engine.list_colours(filter, productID, maxResults)
    
        # Return result
        return res

All straightforward so far. However, this means that we need two exposed methods for every API call: one for the API call and one error handler. For every API call, we have to type the name several times, which is error-prone and risks getting things mixed up.

We can however have a single error handler for all methods:

def get_method():
    '''
    The method name is the first url component after the controller name that
    does not start with 'test'
    '''
    found_controller = False
    for name in pylons.c.url.split("/"):
        if not found_controller and name == "controllername":
            found_controller = True
            continue
        if name.startswith("test"):
            continue
        if found_controller:
            return name
    return None

class ValidatorDispatcher:
    '''
    Validate using the right form according to the value of the "method" field
    '''
    def validate(self, args, state):
        method = args.get("method", None)
        # Extract the method from the URL if it is missing
        if method is None:
            method = get_method()
            args["method"] = method
        return forms[method].validate(args, state)

validator_dispatcher = ValidatorDispatcher()

This validator will try to find the method name, either as a form field or by parsing the URL. It will then use the method name to find the form to use for validation, and pass control to the validate method of that form.

We then need to add an extra "method" field to our forms, and arrange the forms inside a dictionary:

class ListColoursForm(TableForm):
    fields = [
            # One hidden field to have a place for the method name
            twf.HiddenField("method"),
            # One field per parameter
            twf.TextField("filter", help_text="Please enter the string to use as a filter"),
    #...

forms["list_colours"] = ListColoursForm()

And now our methods become much nicer to write:

    @expose("json")
    def api_validation_error(self, **kw):
        pylons.response.status = "400 Error"
        return dict(form_errors=pylons.c.form_errors)

    @expose("json")
    @validate(validator_dispatcher, error_handler=api_validation_error)
    def list_colours(self, filter=None, productID=None, maxResults=100, **kw):
        # Parameter validation is done by the form
    
        # Call API
        res = self.engine.list_colours(filter, productID, maxResults)
    
        # Return result
        return res

api_validation_error is interesting: it returns a proper HTTP error status, and a JSON body with the details of the error, taken straight from the form validators. It took me a while to find out that the form errors are in pylons.c.form_errors (and for reference, the form values are in pylons.c.form_values). pylons.response is a WebOb Response that we can play with.

So now our client side is able to call the API methods, and get a proper error if it calls them wrong.
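
For instance, a client could call one of these methods along these lines (a sketch: the URL and parameter values are made up for illustration):

import json
import urllib
import urllib2

params = urllib.urlencode(dict(filter="red", maxResults=10))
try:
    res = json.load(urllib2.urlopen("http://localhost:8080/myapp/list_colours", params))
    print(res)
except urllib2.HTTPError, e:
    # Validation failures arrive as a 400 with a JSON body
    print(json.load(e)["form_errors"])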

But now that we have the forms ready, it doesn't take much to display them in web pages as well:

def _describe(self, method):
    "Return a dict describing an API method"
    ldesc = getattr(self.engine, method).__doc__.strip()
    sdesc = ldesc.split("\n")[0]
    return dict(name=method, sdesc = sdesc, ldesc = ldesc)

@expose("myappserver.templates.myappapi")
def index(self):
    '''
    Show an index of exported API methods
    '''
    methods = dict()
    for m in forms.keys():
        methods[m] = self._describe(m)
    return dict(methods=methods)

@expose('myappserver.templates.testform')
def testform(self, method, **kw):
    '''
    Show a form with the parameters of an API method
    '''
    kw["method"] = method
    return dict(method=method, action="/myapp/test/"+method, value=kw, info=self._describe(method), form=forms[method])

@expose(content_type="text/plain")
@validate(validator_dispatcher, error_handler=testform)
def test(self, method, **kw):
    '''
    Run an API method and show its prettyprinted result
    '''
    res = getattr(self, str(method))(**kw)
    return pprint.pformat(res)

In a few lines, we have all we need: an index of the API methods (including their documentation taken from the docstrings!), and for each method a form to invoke it and a page to see the results.

Make the forms children of AjaxForm, and you can even see the results together with the form.

Posted Thu Oct 15 15:45:39 2009 Tags:

Creating pipelines with subprocess

It is possible to create process pipelines using subprocess.Popen, by just using stdout=subprocess.PIPE and stdin=otherproc.stdout.
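
For example, a naive sketch (command and file names invented) that wires grep into wc:

import subprocess

p1 = subprocess.Popen(["grep", "line2", "testfile"], stdout=subprocess.PIPE)
p2 = subprocess.Popen(["wc", "-l"], stdin=p1.stdout, stdout=subprocess.PIPE)
print p2.communicate()[0]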

Almost.

In a pipeline created in this way, the stdout of all processes except the last is opened twice: once in the script that has run the subprocess and another time in the standard input of the next process in the pipeline.

This is a problem because if a process closes its stdin, the previous process in the pipeline does not get SIGPIPE when trying to write to its stdout, because that pipe is still open on the caller process. If this happens, a wait on that process will hang forever: the child process waits for the parent to read its stdout, the parent process waits for the child process to exit.

The trick is to close the stdout of each process in the pipeline except the last just after creating them:

#!/usr/bin/python
# coding=utf-8

import subprocess

def pipe(*args):
    '''
    Takes as parameters several dicts, each with the same
    parameters passed to popen.

    Runs the various processes in a pipeline, connecting
    the stdout of every process except the last with the
    stdin of the next process.
    '''
    if len(args) < 2:
        raise ValueError, "pipe needs at least 2 processes"
    # Set stdout=PIPE in every subprocess except the last
    for i in args[:-1]:
        i["stdout"] = subprocess.PIPE

    # Runs all subprocesses connecting stdins and stdouts to create the
    # pipeline. Closes stdouts to avoid deadlocks.
    popens = [subprocess.Popen(**args[0])]
    for i in range(1,len(args)):
        args[i]["stdin"] = popens[i-1].stdout
        popens.append(subprocess.Popen(**args[i]))
        popens[i-1].stdout.close()

    # Returns the array of subprocesses just created
    return popens

At this point, it's nice to write a function that waits for the whole pipeline to terminate and returns an array of result codes:

def pipe_wait(popens):
    '''
    Given an array of Popen objects returned by the
    pipe method, wait for all processes to terminate
    and return the array with their return values.
    '''
    results = [0] * len(popens)
    while popens:
        last = popens.pop(-1)
        results[len(popens)] = last.wait()
    return results

And, lo and behold, we can now easily run a pipeline and get the return codes of every single process in it:

process1 = dict(args='sleep 1; grep line2 testfile', shell=True)
process2 = dict(args='awk \'{print $3}\'', shell=True)
process3 = dict(args='true', shell=True)
popens = pipe(process1, process2, process3)
result = pipe_wait(popens)
print result

Update: Colin Watson suggests an improvement to compensate for Python's nonstandard SIGPIPE handling.
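
I have not reproduced his code here, but the usual technique (a sketch of mine, not necessarily his exact fix) is to restore the default SIGPIPE handler in the children, since Python starts with SIGPIPE set to SIG_IGN and processes run with exec inherit that:

import signal

def restore_sigpipe():
    # Put back the default SIGPIPE handler that Python disables at startup
    signal.signal(signal.SIGPIPE, signal.SIG_DFL)

# ...then pass it in each process dict given to pipe():
process1 = dict(args='sleep 1; grep line2 testfile', shell=True,
                preexec_fn=restore_sigpipe)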

Colin Watson has a similar library for C.

Posted Wed Jul 1 09:08:06 2009 Tags:

Turbogears form quirk

I had a great idea:

@validate(model_form)
@error_handler()
@expose(template='kid:myproject.templates.new')
def new(self, id, tg_errors=None, **kw):
    """Create new records in model"""
    if tg_errors:
        # Ask until there is still something missing
        return dict(record = defaults, form = model_form)
    else:
        # We have everything: save it
        i = Item(**kw)
        flash("Item was successfully created.")
        raise redirect("../show/%d" % i.id)

It was perfect: one simple method, simple error handling, nice helpful messages all around. Except that check boxes and select fields would not get their default values, while all other fields would.

After two hours of searching and cursing and tracing things into widget code, I found this bit in InputWidget.adjust_value:

# there are some input fields that when nothing is checked/selected
# instead of sending a nice name="" are totally missing from
# input_values, this little workaround let's us manage them nicely
# without interfering with other types of fields, we need this to
# keep track of their empty status otherwise if the form is going to be
# redisplayed for some errors they end up to use their defaults values
# instead of being empty since FE doesn't validate a failing Schema.
# posterity note: this is also why we need if_missing=None in
# validators.Schema, see ticket #696.

So, what is happening here is that since check boxes and option fields don't have a nice behaviour when unselected, turbogears has to work around it. In order to tell the difference between "I selected 'None'" and "I didn't select anything", it reasons that if the input has been validated, then the user has made some selections, so it defaults to "The user selected 'None'". If the input has not been validated, then we're showing the form for the first time, and a missing value means "Use the default provided".

Since I was doing the validation all the time, this meant that Checkboxes and Select fields would never use the default values.

Hence, if you use those fields then you necessarily need two different controller methods, one to present the form and one to save it:

@expose(template='kid:myproject.templates.new')
def new(self, id, **kw):
    """Create new records in model"""
    return dict(record = defaults(), form = model_form)

@validate(model_form)
@error_handler(new)
@expose()
def savenew(self, id, **kw):
    """Create new records in model"""
    i = Item(**kw)
    flash("Item was successfully created.")
    raise redirect("../show/%d"%i.id)

If someone else stumbles on the same problem, I hope they'll find this post and they won't have to spend another two awful hours tracking it down again.

Posted Sat Jun 6 00:57:39 2009 Tags:

Python scoping

How do you create a list of similar functions in Python?

As a simple example, let's say we want to create an array of 10 elements like this:

a[0] = lambda x: x
a[1] = lambda x: x+1
a[2] = lambda x: x+2
...
a[9] = lambda x: x+9

Simple:

>>> a = []
>>> for i in range(0,10): a.append(lambda x: x+i)
...

...but wrong:

>>> a[0](1)
10

What happened here? In Python, that lambda x: x+i uses the value that i will have when the function is invoked.

This is the trick to get it right:

>>> a = []
>>> for i in range(0,10): a.append(lambda x, i=i: x + i)
...
>>> a[0](1)
1

What happens here is explained in the section "A Jedi Mind Trick" of the Instant Python article: i=i assigns the current value of i as the default value of the parameter i. Default values are computed when the function is defined, so each lambda captures the i of its own iteration.

Strangely enough the same article has "A Note About Python 2.1 and Nested Scopes" which seems to imply that from Python 2.2 the scoping has changed to "work as it should". I don't understand: the examples above are run on Python 2.4.4.

Googling for keywords related to python closure scoping only yields various sorts of complicated PEPs and an even uglier list trick:

a lot of people might not know about the trick of using a list to box variables within a closure.
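
For reference, this is what that boxing trick looks like (my reconstruction): since a closure in Python 2 cannot rebind a variable in an enclosing scope, you wrap the value in a one-element list and mutate the list in place:

>>> def make_counter():
...     count = [0]          # boxed: the inner function mutates, never rebinds
...     def counter():
...         count[0] += 1
...         return count[0]
...     return counter
...
>>> c = make_counter()
>>> c(), c(), c()
(1, 2, 3)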

Now I know about the trick, but I wish I didn't need to know :-(

Posted Sat Jun 6 00:57:39 2009 Tags:

Passing values to turbogears widgets at display time (the general case)

Last time I dug this up I was not clear enough in documenting my findings, so I had to find them again. Here is the second attempt.

In Turbogears, in order to pass parameters to arbitrary widgets in a compound widget, the syntax is:

form.display(PARAMNAME=dict(WIDGETNAME=VALUE))
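
For example (widget and parameter names invented for illustration), if a compound form contains a TextField named "colour", this would pass a value for its attrs display-time parameter without touching its siblings:

form.display(attrs=dict(colour={"size": 40}))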

And if you have more complex nested widgets and would like to know what goes on, this monkey patch is good for inspecting the params lookup functions:

import turbogears.widgets.forms
old_rpbp = turbogears.widgets.forms.retrieve_params_by_path
def inspect_rpbp(params, path):
    print "RPBP", repr(params), repr(path)
    res = old_rpbp(params, path)
    print "RPBP RES", res
    return res
turbogears.widgets.forms.retrieve_params_by_path = inspect_rpbp

The code for the lookup itself is, as the name suggests, in the retrieve_params_by_path function in the file widgets/forms.py in the Turbogears source code.

Posted Sat Jun 6 00:57:39 2009 Tags:

TurboGears RemoteForm tip

In case your RemoteForm mysteriously behaves like a normal HTTP form, refreshing the page on submit, and the only hint that something is wrong is this bit in Iceweasel's error console:

Errore: uncaught exception: [Exception... "Component returned failure
code: 0x80070057 (NS_ERROR_ILLEGAL_VALUE) [nsIXMLHttpRequest.open]"
nsresult: "0x80070057 (NS_ERROR_ILLEGAL_VALUE)"  location: "JS frame ::
javascript: eval(__firebugTemp__); :: anonymous :: line 1"  data: no]

the problem may simply be a missing action= attribute on the form.

I found out after:

  1. reading the TurboGears remoteform wiki: "For some reason, the RemoteForm is acting like a regular html form, serving up a new page instead of performing the replacements we're looking for. I'll update this page as soon as I figure out why this is happening."

  2. finding this page on Google and meditating for a while while staring at it. I don't speak German, but often enough I manage to solve problems after meditating over Google results in all sorts of languages unknown or unreadable to me. I will call this practice Webomancy.

Posted Sat Jun 6 00:57:39 2009 Tags:

Turbogears i18n quirks

Collecting strings from .kid files

tg-admin i18n collect won't collect strings from your .kid files: you need the toolbox web interface for that.

Indentation problems in .kid files

The toolbox web interface chokes on indentation errors in your .kid files.

To see the name of the .kid file that causes the error, look at the tg-admin toolbox output in the terminal for lines like Working on app/Foo/templates/bar.kid.

What happens is that the .kid files are converted to python files, and if there are indentation glitches they end up in the python files, and python will complain.

Once the tg-admin toolbox standard error tells you which .kid file has the problem, edit it and make sure that all closing tags are at exactly the same indentation level as their corresponding opening tags. Even a single space matters.

Bad i18n bug in TurboKid versions earlier than 1.0.1

faide on #turbogears also says:

It is of outmost importance that you use TurboKid 1.0.1 because it is the first version that corrects a BIG bug regarding i18n filters ...

The version below had a bug where the filters kept being added at each page load in such a way that after a few hundreds of pages you could have page loading times as long as 5 minutes!

If one has a previous version of TurboKid, one (and only one) of the available fixes is needed.

So, in short, all i18n users should upgrade to TurboGears 1.0.2.2 or patch TurboKid using http://trac.turbogears.org/ticket/1301.

Posted Sat Jun 6 00:57:39 2009 Tags:

Linking to self in turbogears

I want to put in my master.kid some icons that allow changing the current language for the session.

First, all user-accessible methods need to handle a 'language' parameter:

@expose(template="myapp.templates.foobar")
def index(self, someparam, **kw):
    if 'language' in kw: turbogears.i18n.set_session_locale(kw['language'])

Then, we need a way to edit the current URL so that we can generate modified links to self that preserve the existing path_info and query parameters. In your main controller, add:

def linkself(**kw):
    params = {}
    params.update(cherrypy.request.params)
    params.update(kw)
    url = cherrypy.request.browser_url.split('?', 1)[0]
    return url + '?' + '&'.join(['='.join(x) for x in params.iteritems()])

def add_custom_stdvars(vars):
    return vars.update({"linkself": linkself})

turbogears.view.variable_providers.append(add_custom_stdvars)

(see the turbogears stdvars documentation and the cherrypy request documentation (cherrypy 2 documentation at the bottom of the page))
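
As an aside, a variant of linkself (an untested sketch) can let urllib take care of quoting parameter values that contain special characters:

import cherrypy
import urllib

def linkself(**kw):
    params = {}
    params.update(cherrypy.request.params)
    params.update(kw)
    url = cherrypy.request.browser_url.split('?', 1)[0]
    return url + '?' + urllib.urlencode(params)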

And finally, in master.kid:

<div id="footer">
  <div id="langselector">
    <span class="language">
      <a href="${tg.linkself(language='it_IT')}">
        <img src="${tg.url('/static/images/it.png')}"/>
      </a>
    </span>

    <span class="language">
      <a href="${tg.linkself(language='C')}">
        <img src="${tg.url('/static/images/en.png')}"/>
      </a>
    </span>
  </div><!-- langselector -->
</div><!-- footer -->
Posted Sat Jun 6 00:57:39 2009 Tags:

Turbogears quirks when testing controllers that use SingleSelectField

Suppose you have a User that can be a member of a Company. In SQLObject you model it something like this:

    class Company(SQLObject):
        name = UnicodeCol(length=16, alternateID=True, alternateMethodName="by_name")
        display_name = UnicodeCol(length=255)

    class User(InheritableSQLObject):
        company = ForeignKey("Company", notNull=False, cascade='null')

Then you want to make a form that allows choosing the company of a user:

def companies():
    return [ [ -1, 'None' ] ] + [ [c.id, c.display_name] for c in Company.select() ]

class NewUserFields(WidgetsList):
    """Fields for editing general settings"""
    user_name = TextField(label="User name")
    companyID = SingleSelectField(label="Company", options=companies)

Ok. Now you want to run tests:

  1. nosetests imports the controller to see if there's any initialisation code.
  2. The NewUserFields class is created.
  3. The SingleSelectField is created.
  4. The SingleSelectField constructor tries to guess the validator and peeks at the first option.
  5. This calls companies.
  6. companies accesses the database.
  7. The testing database has not yet been created, because nosetests imported the module before giving the test code a chance to set up the test database.
  8. Bang.

The solution is to add an explicit validator to disable this guessing code that is a source of so many troubles:

class NewUserFields(WidgetsList):
    """Fields for editing general settings"""
    user_name = TextField(label="User name")
    companyID = SingleSelectField(label="Company", options=companies, validator=v.Int(not_empty=True))
Posted Sat Jun 6 00:57:39 2009 Tags:
Posted Sat Jun 6 00:57:39 2009

Pages about Turbogears.

Custom function decorators with TurboGears 2

I am exposing some library functions using a TurboGears2 controller (see web-api-with-turbogears2). It turns out that some functions return a dict, some a list, some a string, and TurboGears 2 only allows JSON serialisation for dicts.

A simple work-around for this is to wrap the function result into a dict, something like this:

@expose("json")
@validate(validator_dispatcher, error_handler=api_validation_error)
def list_colours(self, filter=None, productID=None, maxResults=100, **kw):
    # Call API
    res = self.engine.list_colours(filter, productID, maxResults)

    # Return result
    return dict(r=res)

It would be nice, however, to have an @webapi() decorator that automatically wraps the function result with the dict:

def webapi(func):
    def dict_wrap(*args, **kw):
        return dict(r=func(*args, **kw))
    return dict_wrap

# ...in the controller...

    @expose("json")
    @validate(validator_dispatcher, error_handler=api_validation_error)
    @webapi
    def list_colours(self, filter=None, productID=None, maxResults=100, **kw):
        # Call API
        res = self.engine.list_colours(filter, productID, maxResults)

        # Return result
        return res

This works, as long as @webapi appears last in the list of decorators. This is because if it appears last it will be the first to wrap the function, and so it will not interfere with the tg.decorators machinery.
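
One detail worth adding (my own note, not something TurboGears requires): a wrapper like dict_wrap replaces the name and docstring of the function it wraps, and functools.wraps preserves them:

import functools

def webapi(func):
    @functools.wraps(func)   # keep func's __name__ and __doc__ on the wrapper
    def dict_wrap(*args, **kw):
        return dict(r=func(*args, **kw))
    return dict_wrap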

Would it be possible to create a decorator that can be put anywhere among the decorator list? Yes, it is possible but tricky, and it gives me the feeling that it may break in any future version of TurboGears:

class webapi(object):
    def __call__(self, func):
        def dict_wrap(*args, **kw):
            return dict(r=func(*args, **kw))
        # Migrate the decoration attribute to our new function
        if hasattr(func, 'decoration'):
            dict_wrap.decoration = func.decoration
            dict_wrap.decoration.controller = dict_wrap
            delattr(func, 'decoration')
        return dict_wrap

# ...in the controller...

    @expose("json")
    @validate(validator_dispatcher, error_handler=api_validation_error)
    @webapi
    def list_colours(self, filter=None, productID=None, maxResults=100, **kw):
        # Call API
        res = self.engine.list_colours(filter, productID, maxResults)

        # Return result
        return res

As a convenience, TurboGears 2 offers, in the decorators module, a way to build decorator "hooks":

class before_validate(_hook_decorator):
    '''A list of callables to be run before validation is performed'''
    hook_name = 'before_validate'

class before_call(_hook_decorator):
    '''A list of callables to be run before the controller method is called'''
    hook_name = 'before_call'

class before_render(_hook_decorator):
    '''A list of callables to be run before the template is rendered'''
    hook_name = 'before_render'

class after_render(_hook_decorator):
    '''A list of callables to be run after the template is rendered.

    Will be run before it is returned up the WSGI stack'''

    hook_name = 'after_render'

The way these are invoked can be found in the _perform_call function in tg/controllers.py.

To show an example use of those hooks, let's add some polygen wisdom to every data structure we return:

class wisdom(decorators.before_render):
    def __init__(self, grammar):
        super(wisdom, self).__init__(self.add_wisdom)
        self.grammar = grammar
    def add_wisdom(self, remainder, params, output):
        from subprocess import Popen, PIPE
        output["wisdom"] = Popen(["polyrun", self.grammar], stdout=PIPE).communicate()[0]

# ...in the controller...

    @wisdom("genius")
    @expose("json")
    @validate(validator_dispatcher, error_handler=api_validation_error)
    def list_colours(self, filter=None, productID=None, maxResults=100, **kw):
        # Call API
        res = self.engine.list_colours(filter, productID, maxResults)
    
        # Return result
        return res

These hooks cannot however be used for what I need, that is, to wrap the result inside a dict. The reason is that they are called in this way:

        controller.decoration.run_hooks(
                'before_render', remainder, params, output)

and not in this way:

        output = controller.decoration.run_hooks(
                'before_render', remainder, params, output)

So it is possible to modify the output (if it is a mutable structure) but not to exchange it with something else.

Can we do even better? Sure we can. We can assimilate @expose and @validate inside @webapi, to avoid repeating the same decorator lines over and over again:

class webapi(object):
    def __init__(self, error_handler = None):
        self.error_handler = error_handler

    def __call__(self, func):
        def dict_wrap(*args, **kw):
            return dict(r=func(*args, **kw))
        res = expose("json")(dict_wrap)
        res = validate(validator_dispatcher, error_handler=self.error_handler)(res)
        return res

# ...in the controller...

    @expose("json")
    def api_validation_error(self, **kw):
        pylons.response.status = "400 Error"
        return dict(e="validation error on input fields", form_errors=pylons.c.form_errors)

    @webapi(error_handler=api_validation_error)
    def list_colours(self, filter=None, productID=None, maxResults=100, **kw):
        # Call API
        res = self.engine.list_colours(filter, productID, maxResults)

        # Return result
        return res

This got rid of @expose and @validate, and provides almost all the default values that I need. Unfortunately I could not find a way to access api_validation_error from inside the decorator so that I could pass it to the validator by default, so I am left with the inconvenience of passing it explicitly every time.

Posted Wed Nov 4 17:52:38 2009 Tags:

File downloads with TurboGears

In TurboGears, I had to implement a file download method, but the file required access controls so it was put in a directory not exported by Apache.

In #turbogears I've been pointed at: http://cherrypy.org/wiki/FileDownload and this is everything put together:

from cherrypy.lib.cptools import serveFile
# In cherrypy 3 it should be:
#from cherrypy.lib.static import serve_file

@expose()
def get(self, *args, **kw):
    """Access the file pointed by the given path"""
    pathname = check_auth_and_compute_pathname()
    return serveFile(pathname)

Then I needed to export some CSV:

@expose()
def getcsv(self, *args, **kw):
    """Get the data in CSV format"""
    rows = compute_data_rows()
    headers = compute_headers(rows)
    filename = compute_file_name()

    cherrypy.response.headers['Content-Type'] = "application/x-download"
    cherrypy.response.headers['Content-Disposition'] = 'attachment; filename="'+filename+'"'

    csvdata = StringIO.StringIO()
    writer = csv.writer(csvdata)
    writer.writerow(headers)
    writer.writerows(rows)

    return csvdata.getvalue()

In my case it's not an issue, as I can only compute the headers after I have computed all the data; but I still have to find out how to serve the CSV file while I am generating it, instead of building it all into one big string and returning that.
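
One possible approach (an untested sketch of mine, which assumes the server can be told to stream generator output instead of buffering it) is to return a generator that encodes one CSV row at a time:

import csv
import StringIO
from itertools import chain

def csv_chunks(headers, rows):
    "Yield the CSV data encoded one row at a time, instead of one big string"
    for row in chain([headers], rows):
        buf = StringIO.StringIO()
        csv.writer(buf).writerow(row)
        yield buf.getvalue()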

Posted Sat Jun 6 00:57:39 2009 Tags:

Quirks when overriding SQLObject setters

Let's suppose you have a User that is, optionally, a member of a Company. In SQLObject you model it something like this:

    class Company(SQLObject):
        name = UnicodeCol(length=16, alternateID=True, alternateMethodName="by_name")
        display_name = UnicodeCol(length=255)

    class User(InheritableSQLObject):
        company = ForeignKey("Company", notNull=False, cascade='null')

Then you want to implement a user settings interface that uses a Select box to choose the company of the user.

For the Select widget to properly handle the validator for your data, you need to put a number in the first option. As my first option, I want to have the "None" entry, so I decided to use -1 to mean "None".

Now, to make it all blend nicely, I overrode the company setter to accept -1 and silently convert it to None:

    class User(InheritableSQLObject):
        company = ForeignKey("Company", notNull=False, cascade='null')

        def _set_company(self, id):
            "Set the company id, using None if -1 is given"
            if id == -1: id = None
            self._SO_set_company(id)

In the controller, after parsing and validating all the various keyword arguments, I do something like this:

            user.set(**kw)

However, the overridden method didn't get called.

After some investigation, and with the help of NandoFlorestan on IRC, we figured out the following things:

  1. That method needs to be rewritten as _set_companyID:

            def _set_companyID(self, id):
                "Set the company id, using None if -1 is given"
                if id == -1: id = None
                self._SO_set_companyID(id)
    
  2. Methods overridden in that way are also called by user.set(**kw), but not by the User(**kw) constructor; so using, for example, a similar override to transparently encrypt passwords would give you plaintext passwords for new users, and encrypted passwords after they changed them.

Posted Sat Jun 6 00:57:39 2009 Tags:
Posted Sat Jun 6 00:57:39 2009

Pages with tips about Debian.

Resolving IP addresses in vim

A friend on IRC said: "I wish vim had a command to resolve all the IP addresses in a block of text".

But it does:

:<block>!perl -MSocket -pe 's/(\d+\.\d+\.\d+\.\d+)/gethostbyaddr(inet_aton($1), AF_INET)/ge'

If you use it often, put the perl command in a one-liner script and call that from an editor macro. It works in other editors, too, and even without an editor at all. And it can be scripted!
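
For example, here is the same filter as a standalone script, in Python this time:

#!/usr/bin/python
# Replace every IPv4 address on standard input with its hostname, if it has one
import re
import socket
import sys

def resolve(mo):
    try:
        return socket.gethostbyaddr(mo.group(0))[0]
    except (socket.herror, socket.error):
        return mo.group(0)

for line in sys.stdin:
    sys.stdout.write(re.sub(r"\d+\.\d+\.\d+\.\d+", resolve, line))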

We live with the power of Unix every day, so much that we risk forgetting how awesome it is.

Posted Wed Mar 7 14:07:07 2012 Tags:

SQLAlchemy, MySQL and sql_mode=traditional

As everyone should know, by default MySQL is an embarrassingly stupid toy:

mysql> create table foo (val integer not null);
Query OK, 0 rows affected (0.03 sec)

mysql> insert into foo values (1/0);
ERROR 1048 (23000): Column 'val' cannot be null

mysql> insert into foo values (1);
Query OK, 1 row affected (0.00 sec)

mysql> update foo set val=1/0 where val=1;
Query OK, 1 row affected, 1 warning (0.00 sec)
Rows matched: 1  Changed: 1  Warnings: 1

mysql> select * from foo;
+-----+
| val |
+-----+
|   0 |
+-----+
1 row in set (0.00 sec)

Luckily, you can tell it to stop being embarrassingly stupid:

mysql> set sql_mode="traditional";
Query OK, 0 rows affected (0.00 sec)

mysql> update foo set val=1/0 where val=0;
ERROR 1365 (22012): Division by 0

(There is an even better sql mode you can choose, though: it is called "Install PostgreSQL")

Unfortunately, I've been hired to work on a project that relies on the embarrassingly stupid behaviour of MySQL, so I cannot set sql_mode=traditional globally or the existing house of cards will collapse.

Here is how you set it session-wide with SQLAlchemy 0.6.x; it took me quite a while to find out:

import sqlalchemy.interfaces

# Without this, MySQL will silently insert invalid values in the
# database, causing very long debugging sessions in the long run
class DontBeSilly(sqlalchemy.interfaces.PoolListener):
    def connect(self, dbapi_con, connection_record):
        cur = dbapi_con.cursor()
        cur.execute("SET SESSION sql_mode='TRADITIONAL'")
        cur = None
engine = create_engine(..., listeners=[DontBeSilly()])
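
For what it's worth, later SQLAlchemy versions (0.7 and onwards) replace pool listeners with the events API; if you are on one of those, the equivalent should look like this:

from sqlalchemy import event

@event.listens_for(engine, "connect")
def dont_be_silly(dbapi_con, connection_record):
    cur = dbapi_con.cursor()
    cur.execute("SET SESSION sql_mode='TRADITIONAL'")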

Why it takes all that effort is beyond me. I'd have expected this to be turned on by default, possibly with a switch that insane people could use to turn it off.

Posted Mon Feb 27 19:45:58 2012 Tags:

Tips on using python datetime module

Python's datetime module is one of those bits of code that tend not to do what one would expect.

I have come to adopt some extra usage guidelines in order to preserve my sanity:

  • Avoid using str(datetime_object) or isoformat to serialize a datetime: there is no function in the library that can parse all its possible outputs
  • datetime.strptime silently throws away all timezone information. If you look very closely, it even says so in its documentation
  • Timezones do not exist: all datetime objects have to be naive; aware means broken.
  • datetime objects must always contain UTC information
  • datetime.now() is never to be used. Always use datetime.utcnow()
  • Be careful of 3rd party python modules: people have a dangerous tendency to use datetime.now()
  • If a conversion to some local time is needed, it shall be done either via some ugly thing like time.localtime(int(dt.strftime("%s"))) or via the pytz module (a cleaner standard-library variant is sketched at the end of this post)
  • pytz must be used directly, and never via timezone aware datetime objects, because datetime objects fail in querying pytz:

That’s right, the datetime object created by a call to datetime.datetime constructor now seems to think that Finland uses the ancient “Helsinki Mean Time” which was obsoleted in the 1920s. The reason for this behaviour is clearly documented on the pytz page: it seems the Python datetime implementation never asks the tzinfo object what the offset to UTC on the given date would be. And without knowing it pytz seems to default to the first historical definition. Now, some of you fellow readers could insist on the problem going away simply by defaulting to the latest time zone definition. However, the problem would still persist: For example, Venezuela switched to GMT-04:30 on 9th December, 2007, causing the datetime objects representing dates either before, or after the change to become invalid.

From: http://blog.redinnovation.com/2008/06/30/relativity-of-time-shortcomings-in-python-datetime-and-workaround/

  • Timezone-aware datetime objects have other bugs: for example, they fail to compute Unix timestamps correctly. The following example shows two timezone-aware objects that represent the same instant but produce two different timestamps.
>>> import datetime as dt
>>> import pytz
>>> utc = pytz.timezone("UTC")
>>> italy = pytz.timezone("Europe/Rome")
>>> a = dt.datetime(2008, 7, 6, 5, 4, 3, tzinfo=utc)
>>> b = a.astimezone(italy)
>>> str(a)
'2008-07-06 05:04:03+00:00'
>>> a.strftime("%s")
'1215291843'
>>> str(b)
'2008-07-06 07:04:03+02:00'
>>> b.strftime("%s")
'1215299043'
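
As a footnote, here is the cleaner UTC-to-local conversion mentioned above: a sketch using only the standard library, assuming (as per the guidelines above) a naive datetime known to be in UTC. Note that it drops microseconds:

import calendar
import datetime
import time

def utc_to_local(dt):
    "Convert a naive datetime in UTC to a naive datetime in local time"
    ts = calendar.timegm(dt.timetuple())   # the inverse of time.gmtime
    return datetime.datetime(*time.localtime(ts)[:6])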
Posted Thu Jun 25 19:18:25 2009 Tags:

Undoable apt-get build-dep

If I do apt-get build-dep foobar, all the dependencies will be installed, but they will not be marked as autoinstalled: if I then do not need to build the package anymore, I have a hard time finding out what to remove.

What I would like is a script that turns the build dependencies of a package into a metapackage, so that I can later remove the metapackage and apt-get autoremove will remove everything from the system.

It turns out that it already exists: it's called mk-build-deps and is part of devscripts:

# apt-get install equivs
$ mk-build-deps lxlauncher

That will give you lxlauncher-build-deps_0.2-2_all.deb. It also works on arbitrary control files.

There is also a repository of such build-deps packages maintained by Frank Lichtenheld.

Posted Sat Jun 6 00:57:39 2009 Tags:

How to view the fingerprint of the ssh host key

This way, ready to copy and paste:

ssh-keygen -l -f /etc/ssh/ssh_host_rsa_key

Background:

It already takes a lot of resources to recall that to see the host key fingerprint you need to run something called 'keygen'. Then ssh-keygen doesn't support --help: it will try to generate a new key instead. We're in 2008. There should be a law against this sort of behaviour.

To figure out how to see the host key, you need to dig through a long manpage with no examples section. ssh-keygen does have commandline help, but does not implement any switch to invoke it (check the getopt invocation in the source code if you don't believe me). It will however show commandline help when given an unrecognised option, so it will mutter but at least give you love if you ask for it:

$ ssh-keygen -♥
ssh-keygen: illegal option -- 
Usage: ssh-keygen [options]
Options:
  [...]

After figuring out that it's -l -f, you still have to go and fish out the file from wherever it is. And luckily we had the recent Debian openssh problems, so now I can get the fingerprint of the RSA file only and be done with it.

But thanks to this blog entry, no more of that, at last.

Posted Sat Jun 6 00:57:39 2009 Tags:

Introducing the Humongous Merged Packages File From Hell

Suppose you want to build some service that requires information about all packages in all architectures. Like, for example, the debtags web tagging interface.

If that is the case, then you may be interested in the Humongous Merged Packages File From Hell.

It contains a merge of all Packages and Source files of etch, lenny, sid and experimental, for main, contrib and non-free, in all architectures.

The merge is done according to completely arbitrary criteria of mine:

  • One record per (package, version) couple found; multiple records for the same (package, version) are merged together
  • Size and Installed-Size are the average of all values found
  • MD5sum, SHA1, SHA256, Filename, Files, Directory, Checksums-Sha1, Checksums-Sha256, Binary are thrown away
  • Information (such as Homepage, Vcs* fields, Uploaders) found in the Sources file is repeated for every binary package
  • All other fields get as value a list of all the values found, with duplicates removed

It's there in case anyone other than me may find it useful.

Posted Sat Jun 6 00:57:39 2009 Tags:

rc-buggy packages sorted by popularity

After doing a round of tag reviews so that we release lenny with the latest tag contributions, I still had a bit of spare time. Since people are being urged to fix the last bugs, I thought I could contribute a list of RC-buggy packages sorted by popularity:

#!/usr/bin/python

import urllib2
import sys, re #, gzip

PKGRE = re.compile(r'<strong>Package: </strong><a href="[^"]+" name="([^"]+)">')
POPCONRE = re.compile(r'^(\d+)\s+(\S+)')
PKGMAPRE = re.compile(r'^([^(]+)\(([^)]+)')
COMMASPLIT = re.compile(r'\s*,\s*')

# Allow downloading other package views
if len(sys.argv) > 1:
    url = sys.argv[1]
else:
    url = "http://bts.turmzimmer.net/details.php"

def read_packages(url):
    "Download the list of rc-buggy packages"
    fd = urllib2.urlopen(url)
    for line in fd:
        mo = PKGRE.search(line)
        if mo:
            yield mo.group(1)

# FIXME: this is a deprecated data source, but at the moment it is just what
#        is needed
def read_bintosrc(url = "http://qa.debian.org/data/ddpo/results/ddpo_packages"):
    "Download the binary to source package map"
    fd = urllib2.urlopen(url)
    for line in fd:
        mo = PKGMAPRE.search(line)
        if mo:
            src = mo.group(1)
            bin = COMMASPLIT.split(mo.group(2))
            for p in bin:   
                yield p, src

def read_popcon(url = "http://popcon.debian.org/by_inst"):
    "Download popcon information"
    fd = urllib2.urlopen(url)
    # It doesn't work on urllib2 objects because it wants tell()
    #fd = gzip.GzipFile(fileobj=fd, filename=url, mode="rb")
    for line in fd:
        mo = POPCONRE.match(line)
        if mo:
            yield mo.group(2), int(mo.group(1))

print "Reading mapping of binary packages to source packages..."
bin2src = dict(read_bintosrc())
print "Reading popcon scores..."
# We actually want the score of source packages, and for it we use the score of
# its binary package with the highest score
scores = dict()
for pkg, score in read_popcon():
    pkg = bin2src.get(pkg, pkg)
    if pkg not in scores:
        scores[pkg] = score
#scores = dict(read_popcon("http://popcon.debian.org/by_inst.gz"))
print "Reading list of rc-buggy packages..."
# The bug page has a mix of source and binary packages: convert all to source
# packages
packages = sorted(read_packages(url), key=lambda x:scores.get(bin2src.get(x,x), 0))
for p in packages:
    print scores.get(bin2src.get(p,p), 0), p

This can be run on people.debian.org as ~enrico/popzimmer (0 marks packages that for some reason cannot be mapped to any popcon score):

$ ~enrico/popzimmer 
Reading mapping of binary packages to source packages...
Reading popcon scores...
Reading list of rc-buggy packages...
0 roxen-fonts-iso8859-2
2 dpkg
11 pam
61 glibc
61 libc6
123 initramfs-tools
149 bind9
150 grub
180 portmap
213 nfs-common
279 libgl1-mesa-dri
298 libsnmp-base
298 snmpd
309 avahi
397 mysql-dfsg-5.0
408 xserver-xorg-input-wacom
409 xserver-xorg-video-vesa
428 docbook-xml
499 openoffice.org-core
567 xine-lib
600 evolution-data-server
669 eog
671 xserver-xorg-video-intel
691 epiphany-browser
691 epiphany-gecko
695 kdelibs4c2a
698 uswsusp
754 guile-1.6
959 gpm
964 cups
990 kdebase
1002 ghostscript
1069 giflib
1076 xulrunner-1.9
1177 libapache2-mod-perl2
1217 vlc-nox
1229 kernel-package
1311 kdeutils-dev
1520 libparted1.8-10
1767 wireshark
1785 selinux-policy-default
1908 dia
1943 htop
1983 boost
2140 cantlr
2145 transmission
2159 emacs22
2312 swfdec-mozilla
2449 eclipse
2449 libswt3.2-gtk-java
2728 lilo
2898 libnss-ldap
3013 java-access-bridge
3051 blender
3099 libtemplate-perl
3499 fglrx-atieventsd
3602 virtualbox-ose
3619 jfsutils
3806 anjuta
3962 kbuild
4076 libgnustep-gui0.14
4222 libwebkit-1.0-1
4332 rosegarden
4412 matplotlib
4412 python-matplotlib
4950 mail-notification-evolution
5195 nut-cgi
5273 sbcl
5538 ruby1.9
5679 kmymoney2
6076 websvn
6147 miro
6264 gnat-4.1
6431 gallery2
6478 gnome-chess
6512 joystick
6862 istanbul
6937 wordpress
7476 luatex
7635 csound
7635 python-csoundac
8257 libpam-mount
8471 wmcpuload
8631 nbsmtp
8642 oxine
8774 axiom
8949 libengine-pkcs11-openssl
9052 clamav-getfiles
9221 open-iscsi
9262 ptex-bin
9701 egroupware-core
10012 debian-cd
10127 libjogl-java
10453 libjson-xs-perl
10698 gcjwebplugin
10736 request-tracker3.6
11294 python-extclass
11518 rkward
11600 alevt
12929 otrs2
13205 bindgraph
13956 icecc-monitor
14223 netmaze
14505 cdebconf
14526 mumble
14687 fnonlinear
14724 recite
14789 jbidwatcher
15766 libnss-ldapd
15812 glpi
15905 lustre-source
15982 win32-loader
16232 ampache
17168 libghc6-missingh-dev
17311 acl2
17830 apertium
18462 emacspeak
18469 glassfish
19257 isight-firmware-tools
19719 libnagios-object-perl
20144 libwx-perl
20465 dns2tcp
21614 python-kinterbasdb-dbg
21690 hf
21902 libcobra-java
22365 tomcat6
22573 libphp-snoopy
23501 coco-java
23926 boxbackup-server
24009 libghc6-wash-dev
26323 libdesktop-notify-perl
26439 opendb
28426 adolc
28795 xorp
29986 libgdata-java
31262 libjboss-serialization-java
31472 pixelpost
31658 cournol
32939 mediamate
34365 gforge-plugin-scmcvs
34659 libgearman-client-async-perl
42233 libhamcrest-java
48076 mahara
55303 mxallowd
85881 roxen-fonts-iso8859-1

So, if you don't know from what package to start fixing bugs: "Begin at the beginning," the King said, very gravely, "and go on till you come to the end: then stop."

Posted Sat Jun 6 00:57:39 2009 Tags:

Live CD on a removable disk, the Debian way

In [live-cd-on-removable-disk] at some point I wrote:

Enrico's note: do we have anything in Debian that we can install and just does that?

Here are the answers:

Sven Mueller writes:

Well, Enrico, a tool I really grew fond of, which auto-configures X on Debian systems is xdebconfigurator, it lacks being auto-run on each system start, which I consider a feature on normal systems, but for your proposed usage (i.e. a portable USB-storage based Debian system), it would certainly be the right thing.

Essentially, it never failed on me. Except for VMware virtual machines, where all it did wrong was that it proposed too high resolutions which resulted from my dual-screen Windows setup I ran VMware on. You might want to give it a try.

Tollef Fog Heen writes:

I added the support in casper for doing this almost a year ago and it has saved me lots of debugging time. Booting the live CD that way is almost as fast as booting an installed system. If you couple this with using the persistent storage support in casper, you can get the configure-on-boot support together with persistency.

In a later update, slh is quoted saying that xresprobe doesn't work on AMD64. This is wrong: I wrote that support, based on code by Matthew Garrett, a little more than nine months ago. I wouldn't recommend incorporating it in newly-written code, but rather use libx86

And finally, Marco Amadori writes:

Without needing to look for tools external to Debian, there is already the Debian Live software in sid: live-package, that creates a live system, and casper, that generates an initramfs that can configure a Debian system on the fly.

So far there is no hard disk target for live-package, but the "Iso" target can already do the job quite well. At boot time, Casper's initramfs scans all the block devices, so it works also for USB keys and hard drives.

To obtain a hard drive image, you just need to invoke "make-live" with the options to have the required software, then copy the content of the iso (or of the directory ./debian-live/binary) on a partition and install the boot loader.

This is what the future "HD" target of live-package will do; so far it can only build ISO and Netboot images.

Posted Sat Jun 6 00:57:39 2009 Tags:

Finding what was new in testing today

I played too much with sources.list and aptitude lost track of what packages are new and what was already in the system.

The place to recover this kind of information is the debian-testing-changes mailing list archive: new packages are those that say "(not in testing)" in the "previous" column.

Thank you dato for pointing me at it.

Posted Sat Jun 6 00:57:39 2009 Tags:

Converging to a solution

Sustaining a discussion towards solving a problem is sometimes more important than solving the problem.

I can't decide if this is trivial or counterintuitive. Anyway, it was quite enlightening when it came up. I once took this note:

I found that with my projects, when someone posted a mail about a problem I would work maybe some days to find a solution, and just post the solution at the end.

However, now I realised it's more constructive to have the problem-solving process itself happen online. This way, instead of keeping people waiting in silence for a few days, they can get quicker feedback and extra information, and they also have a chance to participate in solving the problem before I manage to.

For example, when I have to interrupt to go home or sleep, someone else can pick it up and do another step.

Plus, the entire problem-solving process remains documented, which will provide more written information for future readers.

This note was from a few months ago; however, I still fail to do it. Bad habits are sometimes hard to change. Please kick me about it :)

Posted Sat Jun 6 00:57:39 2009 Tags:
Posted Sat Jun 6 00:57:39 2009
osm

Pages about OpenStreetMap.

Computing time offsets between EXIF and GPS

I like the idea of matching photos to GPS traces. In Debian there is gpscorrelate, but it's almost unusable for me because of bug #473362, and it has an awkward way of specifying time offsets.

Here at SoTM10 someone told me that exiftool gained -geosync and -geotag options. So it's just a matter of creating a little tool that shows a photo and asks you to type the GPS time you see in it.

Apparently there are no bindings or GIR files for gtkimageview in Debian, so I'll have to use C.

Here is a C prototype:

/*
 * gpsoffset - Compute EXIF time offset from a photo of a gps display
 *
 * Use with exiftool -geosync=... -geotag trace.gpx DIR
 *
 * Copyright (C) 2009--2010  Enrico Zini <enrico@enricozini.org>
 *
 * This program is free software; you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation; either version 2 of the License, or
 * (at your option) any later version.
 *
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with this program; if not, write to the Free Software
 * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
 */


#define _XOPEN_SOURCE /* glibc2 needs this */
#include <time.h>
#include <gtkimageview/gtkimageview.h>
#include <libexif/exif-data.h>
#include <stdio.h>
#include <stdlib.h>

static int load_time(const char* fname, struct tm* tm)
{
    ExifData* exif_data = exif_data_new_from_file(fname);
    if (exif_data == NULL)
    {
        fprintf(stderr, "Cannot read EXIF data from %s\n", fname);
        return -1;
    }
    ExifEntry* exif_time = exif_data_get_entry(exif_data, EXIF_TAG_DATE_TIME);
    if (exif_time == NULL)
    {
        fprintf(stderr, "Cannot find EXIF timetamp\n");
        return -1;
    }

    char buf[1024];
    exif_entry_get_value(exif_time, buf, 1024);
    //printf("val2: %s\n", exif_entry_get_value(t2, buf, 1024));

    if (strptime(buf, "%Y:%m:%d %H:%M:%S", tm) == NULL)
    {
        fprintf(stderr, "Cannot match EXIF timetamp\n");
        return -1;
    }

    return 0;
}

static time_t exif_ts;
static GtkWidget* res_lbl;

void date_entry_changed(GtkEditable *editable, gpointer user_data)
{
    const gchar* text = gtk_entry_get_text(GTK_ENTRY(editable));
    struct tm parsed = {0}; /* zero it so mktime doesn't see garbage in unset fields */
    if (strptime(text, "%Y-%m-%d %H:%M:%S", &parsed) == NULL)
    {
        gtk_label_set_text(GTK_LABEL(res_lbl), "Please enter a date as YYYY-MM-DD HH:MM:SS");
    } else {
        time_t img_ts = mktime(&parsed);
        int c;
        int res;
        if (exif_ts < img_ts)
        {
            c = '+';
            res = img_ts - exif_ts;
        }
        else
        {
            c = '-';
            res = exif_ts - img_ts;
        }
        char buf[1024];
        if (res >= 3600)
            snprintf(buf, 1024, "Result: %c%ds -geosync=%c%d:%02d:%02d",
                    c, res, c, res / 3600, (res / 60) % 60, res % 60);
        else if (res >= 60)
            snprintf(buf, 1024, "Result: %c%ds -geosync=%c%02d:%02d",
                    c, res, c, (res / 60) % 60, res % 60);
        else 
            snprintf(buf, 1024, "Result: %c%ds -geosync=%c%d",
                    c, res, c, res);
        gtk_label_set_text(GTK_LABEL(res_lbl), buf);
    }
}

int main (int argc, char *argv[])
{
    // Work in UTC to avoid mktime applying DST or timezones
    setenv("TZ", "UTC", 1);
    tzset();

    const char* filename = "/home/enrico/web-eddie/galleries/2010/04-05-Uppermill/P1080932.jpg";

    gtk_init (&argc, &argv);

    struct tm exif_time = {0};
    if (load_time(filename, &exif_time) != 0)
        return 1;

    printf("EXIF time: %s\n", asctime(&exif_time));
    exif_ts = mktime(&exif_time);

    GtkWidget* window = gtk_window_new(GTK_WINDOW_TOPLEVEL);
    GtkWidget* vb = gtk_vbox_new(FALSE, 0);
    GtkWidget* hb = gtk_hbox_new(FALSE, 0);
    GtkWidget* lbl = gtk_label_new("Timestamp:");
    GtkWidget* exif_lbl;
    {
        char buf[1024];
        strftime(buf, 1024, "EXIF time: %Y-%m-%d %H:%M:%S", &exif_time);
        exif_lbl = gtk_label_new(buf);
    }
    GtkWidget* date_ent = gtk_entry_new();
    res_lbl = gtk_label_new("Result:");
    GtkWidget* view = gtk_image_view_new();
    GdkPixbuf* pixbuf = gdk_pixbuf_new_from_file(filename, NULL);

    gtk_box_pack_start(GTK_BOX(hb), lbl, FALSE, TRUE, 0);
    gtk_box_pack_start(GTK_BOX(hb), date_ent, TRUE, TRUE, 0);

    gtk_signal_connect(GTK_OBJECT(date_ent), "changed", (GCallback)date_entry_changed, NULL);
    {
        char buf[1024];
        strftime(buf, 1024, "%Y-%m-%d %H:%M:%S", &exif_time);
        gtk_entry_set_text(GTK_ENTRY(date_ent), buf);
    }

    gtk_widget_set_size_request(view, 500, 400);
    gtk_image_view_set_pixbuf(GTK_IMAGE_VIEW(view), pixbuf, TRUE);
    gtk_container_add(GTK_CONTAINER(window), vb);
    gtk_box_pack_start(GTK_BOX(vb), view, TRUE, TRUE, 0);
    gtk_box_pack_start(GTK_BOX(vb), hb, FALSE, TRUE, 0);
    gtk_box_pack_start(GTK_BOX(vb), exif_lbl, FALSE, TRUE, 0);
    gtk_box_pack_start(GTK_BOX(vb), res_lbl, FALSE, TRUE, 0);
    gtk_widget_show_all(window);

    gtk_main ();

    return 0;
}

And here is its simple makefile:

CFLAGS=$(shell pkg-config --cflags gtkimageview libexif)
LDLIBS=$(shell pkg-config --libs gtkimageview libexif)

gpsoffset: gpsoffset.c

It's a simple prototype, but it's a working prototype, and it seems to do the job for me.
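
Once the tool shows the offset, geotagging a whole directory of photos would look something like this (paths and the -geosync value here are made up; use whatever the tool printed):

$ make gpsoffset
$ ./gpsoffset
$ exiftool -geosync=+0:01:23 -geotag trace.gpx photos/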

What I cannot currently figure out is why, after I click on the text box, there seems to be no way to give focus back to the image viewer so that I can control it with the keyboard.

There is another nice algorithm for computing time offsets that could be implemented: you choose a photo taken at a known place and drag it onto that place on a map; you can then look for the nearest point on your GPX trace and compute the time offset from that.
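
For reference, here is a minimal sketch of that idea in Python; it assumes (hypothetically) that the GPX trace has already been parsed into a list of (unix timestamp, lat, lon) tuples:

import math

def time_offset(trace, photo_lat, photo_lon, photo_exif_ts):
    """Given a GPX trace as (unix_ts, lat, lon) tuples, the place where a
    photo was taken and the photo's EXIF timestamp, estimate the camera
    clock offset from the nearest trackpoint."""
    def dist(p):
        # Rough equirectangular distance: good enough to pick the
        # nearest trackpoint at these scales
        dlat = p[1] - photo_lat
        dlon = (p[2] - photo_lon) * math.cos(math.radians(photo_lat))
        return math.hypot(dlat, dlon)
    nearest = min(trace, key=dist)
    # Positive result: the camera clock is behind the GPS clock
    return nearest[0] - photo_exif_ts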

I have seen that there are programs for geotagging photos that implement all these algorithms and have a nice UI, but I haven't seen any in Debian.

Is there any such software that could be packaged?

If not, the interpolation and annotation tasks can now already be performed by exiftool, so it's just a matter of building a good UI, and I would love to see someone picking up the task.

Posted Sun Jul 11 12:34:04 2010 Tags:

Searching OSM nodes in Spatialite

Third step of my SoTM10 pet project: finding the POIs.

I put together a query to find all nodes with a given tag inside a bounding box, and also a query to find all the tag values for a given tag name inside a bounding box.

The result is this simple POI search engine:

#
# poisearch - simple geographical POI search engine
#
# Copyright (C) 2010  Enrico Zini <enrico@enricozini.org>
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program; if not, write to the Free Software
# Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
#

from pysqlite2 import dbapi2 as sqlite
import simplejson

class PoiDB(object):
    def __init__(self):
        self.db = sqlite.connect("pois.db")
        self.db.enable_load_extension(True)
        self.db.execute("SELECT load_extension('libspatialite.so')")
        self.oldsearch = []
        self.bbox = None

    def set_bbox(self, xmin, xmax, ymin, ymax):
        '''Set bbox for searches'''
        self.bbox = (xmin, xmax, ymin, ymax)

    def tagid(self, name, val):
        '''Get the database ID for a tag'''
        c = self.db.cursor()
        c.execute("SELECT id FROM tag WHERE name=? AND value=?", (name, val))
        res = None
        for row in c:
            res = row[0]
        return res

    def tagnames(self):
        '''Get all tag names'''
        c = self.db.cursor()
        c.execute("SELECT DISTINCT name FROM tag ORDER BY name")
        for row in c:
            yield row[0]

    def tagvalues(self, name, use_bbox=False):
        '''
        Get all tag values for a given tag name,
        optionally in the current bounding box
        '''
        c = self.db.cursor()
        if self.bbox is None or not use_bbox:
            c.execute("SELECT DISTINCT value FROM tag WHERE name=? ORDER BY value", (name,))
        else:
            c.execute("SELECT DISTINCT tag.value FROM poi, poitag, tag"
                      " WHERE poi.rowid IN (SELECT pkid FROM idx_poi_geom WHERE ("
                      "       xmin >= ? AND xmax <= ? AND ymin >= ? AND ymax <= ?) )"
                      "   AND poitag.tag = tag.id AND poitag.poi = poi.id"
                      "   AND tag.name=?",
                      self.bbox + (name,))
        for row in c:
            yield row[0]

    def search(self, name, val):
        '''Get all name:val tags in the current bounding box'''
        # First resolve the tagid
        tagid = self.tagid(name, val)
        if tagid is None: return

        c = self.db.cursor()
        c.execute("SELECT poi.name, poi.data, X(poi.geom), Y(poi.geom) FROM poi, poitag"
                  " WHERE poi.rowid IN (SELECT pkid FROM idx_poi_geom WHERE ("
                  "       xmin >= ? AND xmax <= ? AND ymin >= ? AND ymax <= ?) )"
                  "   AND poitag.tag = ? AND poitag.poi = poi.id",
                  self.bbox + (tagid,))
        self.oldsearch = []
        for row in c:
            self.oldsearch.append(row)
            yield row[0], simplejson.loads(row[1]), row[2], row[3]

    def count(self, name, val):
        '''Count all name:val tags in the current bounding box'''
        # First resolve the tagid
        tagid = self.tagid(name, val)
        if tagid is None: return

        c = self.db.cursor()
        c.execute("SELECT COUNT(*) FROM poi, poitag"
                  " WHERE poi.rowid IN (SELECT pkid FROM idx_poi_geom WHERE ("
                  "       xmin >= ? AND xmax <= ? AND ymin >= ? AND ymax <= ?) )"
                  "   AND poitag.tag = ? AND poitag.poi = poi.id",
                  self.bbox + (tagid,))
        for row in c:
            return row[0]

    def replay(self):
        for row in self.oldsearch:
            yield row[0], simplejson.loads(row[1]), row[2], row[3]
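
For example, this is roughly how I plan to use it from the UI code (the bounding box is made up; search is a generator, so you just iterate over it):

db = PoiDB()
db.set_bbox(2.56, 2.90, 41.84, 42.00)
for name, data, x, y in db.search("amenity", "fountain"):
    print name, x, y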

Problem 3 solved: now on to the next step, building a user interface for it.

Posted Sat Jul 10 15:50:31 2010 Tags:

Importing OSM nodes into Spatialite

Second step of my SoTM10 pet project: creating a searchable database with the points. What a fantastic opportunity to learn Spatialite.

Learning Spatialite is easy. For example, you can use the two tutorials with catchy titles that assume your best wish in life is to create databases out of shapefiles, using a pre-built, i386-only GUI binary downloaded over an insecure HTTP connection.

To be fair, the second of those tutorials is called "An almost Idiot's Guide", thus making explicit the requirement of being an almost idiot in order to happily acquire and run software that way.

Alternatively, you can use A quick tutorial to SpatiaLite, which is so quick that it has examples leading you to write SQL queries that trigger all sorts of vague exceptions at insert time. But at least it brought me a long way forward, and at that point I could just cross-reference things with the PostGIS documentation to find out the right way of doing things.

So, here's the importer script, which will probably become my reference example for how to get started with Spatialite, and how to use Spatialite from Python:

#!/usr/bin/python

#
# poiimport - import nodes from OSM into a spatialite DB
#
# Copyright (C) 2010  Enrico Zini <enrico@enricozini.org>
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program; if not, write to the Free Software
# Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
#

import xml.sax
import xml.sax.handler
from pysqlite2 import dbapi2 as sqlite
import simplejson
import sys
import os

class OSMPOIReader(xml.sax.handler.ContentHandler):
    '''
    Filter SAX events in an OSM XML file, passing each node and its tags
    to a consumer
    '''
    def __init__(self, consumer):
        self.consumer = consumer

    def startElement(self, name, attrs):
        if name == "node":
            self.attrs = attrs
            self.tags = dict()
        elif name == "tag":
            self.tags[attrs["k"]] = attrs["v"]

    def endElement(self, name):
        if name == "node":
            lat = float(self.attrs["lat"])
            lon = float(self.attrs["lon"])
            id = int(self.attrs["id"])
            #dt = parse(self.attrs["timestamp"])
            uid = self.attrs.get("uid", None)
            uid = int(uid) if uid is not None else None
            user = self.attrs.get("user", None)

            self.consumer(lat, lon, id, self.tags, user=user, uid=uid)

class Importer(object):
    '''
    Create the spatialite database and populate it
    '''
    TAG_WHITELIST = set(["amenity", "shop", "tourism", "place"])

    def __init__(self, filename):
        self.db = sqlite.connect(filename)
        self.db.enable_load_extension(True)
        self.db.execute("SELECT load_extension('libspatialite.so')")
        self.db.execute("SELECT InitSpatialMetaData()")
        self.db.execute("INSERT INTO spatial_ref_sys (srid, auth_name, auth_srid,"
                        " ref_sys_name, proj4text) VALUES (4326, 'epsg', 4326,"
                        " 'WGS 84', '+proj=longlat +ellps=WGS84 +datum=WGS84"
                        " +no_defs')")
        self.db.execute("CREATE TABLE poi (id int not null unique primary key,"
                        " name char, data text)")
        self.db.execute("SELECT AddGeometryColumn('poi', 'geom', 4326, 'POINT', 2)")
        self.db.execute("SELECT CreateSpatialIndex('poi', 'geom')")
        self.db.execute("CREATE TABLE tag (id integer primary key autoincrement,"
                        " name char, value char)")
        self.db.execute("CREATE UNIQUE INDEX tagidx ON tag (name, value)")
        self.db.execute("CREATE TABLE poitag (poi int not null, tag int not null)")
        self.db.execute("CREATE UNIQUE INDEX poitagidx ON poitag (poi, tag)")
        self.tagid_cache = dict()

    def tagid(self, k, v):
        key = (k, v)
        res = self.tagid_cache.get(key, None)
        if res is None:
            c = self.db.cursor()
            c.execute("SELECT id FROM tag WHERE name=? AND value=?", key)
            for row in c:
                self.tagid_cache[key] = row[0]
                return row[0]
            self.db.execute("INSERT INTO tag (id, name, value) VALUES (NULL, ?, ?)", key)
            c.execute("SELECT last_insert_rowid()")
            for row in c:
                res = row[0]
            self.tagid_cache[key] = res
        return res

    def __call__(self, lat, lon, id, tags, user=None, uid=None):
        # Acquire tag IDs
        tagids = []
        for k, v in tags.iteritems():
            if k not in self.TAG_WHITELIST: continue
            for val in v.split(";"):
                tagids.append(self.tagid(k, val))

        # Skip elements that don't have the tags we want
        if not tagids: return

        geom = "POINT(%f %f)" % (lon, lat)
        self.db.execute("INSERT INTO poi (id, geom, name, data)"
                        "     VALUES (?, GeomFromText(?, 4326), ?, ?)", 
                (id, geom, tags["name"], simplejson.dumps(tags)))

        for tid in tagids:
            self.db.execute("INSERT INTO poitag (poi, tag) VALUES (?, ?)", (id, tid))


    def done(self):
        self.db.commit()

# Get the output file name
filename = sys.argv[1]

# Ensure we start from scratch
if os.path.exists(filename):
    print >>sys.stderr, filename, "already exists"
    sys.exit(1)

# Import
parser = xml.sax.make_parser()
importer = Importer(filename)
handler = OSMPOIReader(importer)
parser.setContentHandler(handler)
parser.parse(sys.stdin)
importer.done()

Let's run it:

$ ./poiimport pois.db < pois.osm 
SpatiaLite version ..: 2.4.0    Supported Extensions:
        - 'VirtualShape'        [direct Shapefile access]
        - 'VirtualDbf'          [direct Dbf access]
        - 'VirtualText'         [direct CSV/TXT access]
        - 'VirtualNetwork'      [Dijkstra shortest path]
        - 'RTree'               [Spatial Index - R*Tree]
        - 'MbrCache'            [Spatial Index - MBR cache]
        - 'VirtualFDO'          [FDO-OGR interoperability]
        - 'SpatiaLite'          [Spatial SQL - OGC]
PROJ.4 Rel. 4.7.1, 23 September 2009
GEOS version 3.2.0-CAPI-1.6.0
$ ls -l --si pois*
-rw-r--r-- 1 enrico enrico 17M Jul  9 23:44 pois.db
-rw-r--r-- 1 enrico enrico 37M Jul  9 16:20 pois.osm
$ spatialite pois.db
SpatiaLite version ..: 2.4.0    Supported Extensions:
        - 'VirtualShape'        [direct Shapefile access]
        - 'VirtualDbf'          [direct DBF access]
        - 'VirtualText'         [direct CSV/TXT access]
        - 'VirtualNetwork'      [Dijkstra shortest path]
        - 'RTree'               [Spatial Index - R*Tree]
        - 'MbrCache'            [Spatial Index - MBR cache]
        - 'VirtualFDO'          [FDO-OGR interoperability]
        - 'SpatiaLite'          [Spatial SQL - OGC]
PROJ.4 version ......: Rel. 4.7.1, 23 September 2009
GEOS version ........: 3.2.0-CAPI-1.6.0
SQLite version ......: 3.6.23.1
Enter ".help" for instructions
spatialite> select id from tag where name="amenity" and value="fountain";
24
spatialite> SELECT poi.name, poi.data, X(poi.geom), Y(poi.geom) FROM poi, poitag WHERE poi.rowid IN (SELECT pkid FROM idx_poi_geom WHERE (xmin >= 2.56 AND xmax <= 2.90 AND ymin >= 41.84 AND ymax <= 42.00) ) AND poitag.tag = 24 AND poitag.poi = poi.id;
Font Picant de la Cellera|{"amenity": "fountain", "name": "Font Picant de la Cellera"}|2.616045|41.952449
Font de Can Pla|{"amenity": "fountain", "name": "Font de Can Pla"}|2.622354|41.974724
Font de Can Ribes|{"amenity": "fountain", "name": "Font de Can Ribes"}|2.62311|41.979193

It's impressive: I've got all sorts of useful information for the whole of Spain in just 17MB!

Let's put it to practice: I'm thirsty, is there any water fountain nearby?

spatialite> SELECT count(1) FROM poi, poitag WHERE poi.rowid IN (SELECT pkid FROM idx_poi_geom WHERE (xmin >= 2.80 AND xmax <= 2.85 AND ymin >= 41.97 AND ymax <= 42.00) ) AND poitag.tag = 24 AND poitag.poi = poi.id;
0

Ouch! No water fountains mapped in Girona... yet.

Problem 2 solved: now on to the next step, trying to show the results in some usable way.

Posted Sat Jul 10 09:10:35 2010 Tags:

Filtering nodes out of OSM files

I have a pet project here at SoTM10: create a tool for searching nearby POIs while offline.

The idea is to have something in my pocket (FreeRunner or N900) which doesn't require an internet connection, and which can point me at the nearest fountains, post offices, ATMs, bars and so on.

The first step is to obtain a list of POIs.

In theory one can use Xapi but all the known Xapi servers appear to be down at the moment.

Another option is to obtain one by filtering all the nodes with the tags we want out of a planet OSM extract. I downloaded the Spanish one and set to work.

First I tried xmlstarlet, but it ate all the RAM and crashed my laptop, because for some reason, on my laptop, Linux kernels up to 2.6.32 (I don't know about later ones) like to swap out ALL running apps to cache I/O operations. This means that heavy I/O operations swap out the very programs performing them, so the system gets caught in some infinite I/O loop and dies. Or at least this is what I've figured out so far.

So, we need SAX. I put together this prototype in Python, which can process a nice 8MB/s of OSM data for quite some time with a constant, low RAM usage:

#!/usr/bin/python

#
# poifilter - extract interesting nodes from OSM XML files
#
# Copyright (C) 2010  Enrico Zini <enrico@enricozini.org>
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program; if not, write to the Free Software
# Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
#


import xml.sax
import xml.sax.handler
import xml.sax.saxutils
import sys

class XMLSAXFilter(xml.sax.handler.ContentHandler):
    '''
    A SAX filter that is a ContentHandler.

    There is xml.sax.saxutils.XMLFilterBase in the standard library but it is
    undocumented, and most of the examples using it you find online are wrong.
    You can look at its source code, and at that point you find out that it is
    an offensive practical joke.
    '''
    def __init__(self, downstream):
        self.downstream = downstream

    # ContentHandler methods

    def setDocumentLocator(self, locator):
        self.downstream.setDocumentLocator(locator)

    def startDocument(self):
        self.downstream.startDocument()

    def endDocument(self):
        self.downstream.endDocument()

    def startPrefixMapping(self, prefix, uri):
        self.downstream.startPrefixMapping(prefix, uri)

    def endPrefixMapping(self, prefix):
        self.downstream.endPrefixMapping(prefix)

    def startElement(self, name, attrs):
        self.downstream.startElement(name, attrs)

    def endElement(self, name):
        self.downstream.endElement(name)

    def startElementNS(self, name, qname, attrs):
        self.downstream.startElementNS(name, qname, attrs)

    def endElementNS(self, name, qname):
        self.downstream.endElementNS(name, qname)

    def characters(self, content):
        self.downstream.characters(content)

    def ignorableWhitespace(self, chars):
        self.downstream.ignorableWhitespace(chars)

    def processingInstruction(self, target, data):
        self.downstream.processingInstruction(target, data)

    def skippedEntity(self, name):
        self.downstream.skippedEntity(name)

class OSMPOIHandler(XMLSAXFilter):
    '''
    Filter SAX events in an OSM XML file to keep only the nodes with
    interesting tags
    '''
    PASSTHROUGH = ["osm", "bound"]
    TAG_WHITELIST = set(["amenity", "shop", "tourism", "place"])

    def startElement(self, name, attrs):
        if name in self.PASSTHROUGH:
            self.downstream.startElement(name, attrs)
        elif name == "node":
            self.attrs = attrs
            self.tags = []
            self.propagate = False
        elif name == "tag":
            if self.tags is not None:
                self.tags.append(attrs)
                if attrs["k"] in self.TAG_WHITELIST:
                    self.propagate = True
        else:
            self.tags = None
            self.attrs = None

    def endElement(self, name):
        if name in self.PASSTHROUGH:
            self.downstream.endElement(name)
        elif name == "node":
            if self.propagate:
                self.downstream.startElement("node", self.attrs)
                for attrs in self.tags:
                    self.downstream.startElement("tag", attrs)
                    self.downstream.endElement("tag")
                self.downstream.endElement("node")

    def ignorableWhitespace(self, chars):
        pass

    def characters(self, content):
        pass

# Simple stdin->stdout XML filter
parser = xml.sax.make_parser()
handler = OSMPOIHandler(xml.sax.saxutils.XMLGenerator(sys.stdout, "utf-8"))
parser.setContentHandler(handler)
parser.parse(sys.stdin)

Let's run it:

$ bzcat /store/osm/spain.osm.bz2 | pv | ./poifilter > pois.osm
[...]
$ ls -l --si pois.osm
-rw-r--r-- 1 enrico enrico 19M Jul 10 23:56 pois.osm
$ xmlstarlet val pois.osm 
pois.osm - valid

Problem 1 solved: now on to the next step, importing the nodes into a database.

Posted Fri Jul 9 16:28:15 2010 Tags:

Mapping using the Openmoko FreeRunner headset

The FreeRunner has a headset which includes a microphone and a button. When doing OpenStreetMap mapping, it would be very useful to be able to keep tangogps on the display and be able to mark waypoints using the headset button, and to record an audio track using the headset microphone.

In this way, I can use tangogps to see where I need to go, where it's already mapped and where it isn't, and then I can use the headset to mark waypoints corresponding to the audio track, so that later I can take advantage of JOSM's audio mapping features.

Enter audiomap:

$ audiomap --help
Usage: audiomap [options]

Create a GPX and audio track

Options:
  --version      show program's version number and exit
  -h, --help     show this help message and exit
  -v, --verbose  verbose mode
  -m, --monitor  only keep the GPS on and monitor satellite status
  -l, --levels   only show input levels

If called without parameters, or with -v which is suggested, it will:

  1. Fix the mixer settings so that it can record from the headset and detect headset button presses.
  2. Show a monitor of GPS satellite information until it gets a fix.
  3. Synchronize the system time with the GPS time so that the timestamps of the files that are created afterwards are accurate.
  4. Start recording a GPX track.
  5. Start recording audio.
  6. Record a GPX waypoint for every headset button press.

When you are done, you stop audiomap with ^C and it will properly close the .wav file, close the tags in the GPX waypoint and track files and restore the mixer settings.

You can unplug the headset and record using the handset microphone, but then you will not be able to set waypoints until you plug the headset back in.

After you stop audiomap, you will have a track, waypoints and .wav file ready to be loaded in JOSM.

Big thanks go to Luca Capello for finding out how to detect headset button presses.

Posted Sun Jun 7 23:51:37 2009 Tags:

Simple tool to query the GPS using the OpenMoko FSO stack

I was missing a simple command-line tool that would let me perform basic GPS queries in shell scripts.

Enter getgps:

# getgps --help
Usage: getgps [options]

Simple GPS query tool for the FSO stack

Options:
  --version          show program's version number and exit
  -h, --help         show this help message and exit
  -v, --verbose      verbose mode
  -q, --quiet        suppress normal output
  --fix              check if we have a fix
  -s, --sync-time    set system time from GPS time
  --info             get all GPS information
  --info-connection  get GPS connection information
  --info-fix         get GPS fix information
  --info-position    get GPS position information
  --info-accuracy    get GPS accuracy information
  --info-course      get GPS course information
  --info-time        get GPS time information
  --info-satellite   get GPS satellite information

So finally I can write little GPS-aware scripts:

if getgps --fix -q
then
    start_gps_aware_program
else
    start_gps_normal_program
fi

Or this.

Posted Sun Jun 7 17:59:32 2009 Tags:

Voice-controlled waypoints

It is on my TODO list to implement taking waypoints when pressing the headset button of the Openmoko, but that is not done yet.

In the meantime, I did some experiments with audio mapping, and since I could not enter waypoints while recording, I was looking for a way to make use of the recordings anyway.

Enter findvoice:

$ ./findvoice  --help
Usage: findvoice [options] wavfile

Find the times in the wav file when there is clear voice among the noise

Options:
  --version             show program's version number and exit
  -h, --help            show this help message and exit
  -v, --verbose         verbose mode
  -p NUM, --percentile=NUM
            percentile to use to discriminate noise from voice
            (default: 90)
  -t, --timestamps      print timestamps instead of human readable information

You give it a wav file, and it will output a list of timestamps corresponding to where it thinks that you were talking clearly, near the FreeRunner / voice recorder, rather than leaving the recorder dangling to pick up background noise.

Its algorithm is crude and improvised, because I have no background whatsoever in audio processing, but it basically finds those parts of the audio file where the variance of the samples is above a given percentile: the higher the percentile, the fewer timestamps you get; the lower the percentile, the more likely it is to pick up periods of louder noise.
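
I am not posting findvoice's code here, but a minimal sketch of the idea could look like this (it assumes a mono 16-bit wav file and uses numpy; the actual tool may well do it differently):

import wave
import numpy

def voiced_offsets(fname, percentile=90, win=8000):
    """Yield offsets, in seconds from the start of the wav file, of the
    windows whose sample variance is above the given percentile."""
    w = wave.open(fname)
    rate = w.getframerate()
    samples = numpy.frombuffer(w.readframes(w.getnframes()), dtype=numpy.int16)
    # Split into fixed-size windows and compute the variance of each
    nwin = len(samples) // win
    variances = samples[:nwin * win].reshape(nwin, win).var(axis=1)
    threshold = numpy.percentile(variances, percentile)
    for i, v in enumerate(variances):
        if v > threshold:
            yield i * win / float(rate)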

For example, you can automatically extract waypoints out of an audio file by using it together with gpxinterpolate:

./findvoice -t today.wav | ./gpxinterpolate today.gpx > today-waypoints.gpx

The timestamps it outputs are computed using the modification time of the .wav file: if your system clock was decently synchronised (which you can do with getgps), then the mtime of the wav is the time of the end of the recording, which gives the needed reference to compute timestamps that are absolute in time.

For example:

getgps --sync-time
arecord file.wav
^C
./findvoice -t file.wav | ./gpxinterpolate today.gpx > today-waypoints.gpx
Posted Sun Jun 7 02:48:40 2009 Tags:

Geocoding Unix timestamps

Geocoding EXIF tags in JPEG images is fun, but there are more things that can benefit from interpolating timestamps over a GPX track.

Enter gpxinterpolate:

$ ./gpxinterpolate --help
Usage: gpxinterpolate [options] gpxfile [gpxfile...]

Read one or more GPX files and a list of timestamps on standard input. Output
a GPX file with waypoints at the location of the GPX track at the given
timestamps.

Options:
  --version      show program's version number and exit
  -h, --help     show this help message and exit
  -v, --verbose  verbose mode

For example, you can create waypoints interpolating file modification times:

find . -printf "%Ts %p\n" | ./gpxinterpolate ~/tracks/*.gpx > myfiles.gpx

In case you wonder where you were when you modified or accessed a file, now you can find out.
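
The interpolation itself is simple. Here is a minimal sketch of the idea (hypothetical data structures: the track is a time-sorted list of (unix timestamp, lat, lon) tuples):

import bisect

def interpolate(track, ts):
    """Linearly interpolate a position along a time-sorted track of
    (unix_ts, lat, lon) tuples; returns None outside the track."""
    times = [p[0] for p in track]
    i = bisect.bisect_right(times, ts)
    if i == 0 or i == len(track):
        return None  # the timestamp falls outside the track
    (t0, lat0, lon0), (t1, lat1, lon1) = track[i - 1], track[i]
    f = (ts - t0) / float(t1 - t0) if t1 != t0 else 0.0
    return lat0 + (lat1 - lat0) * f, lon0 + (lon1 - lon0) * f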

Posted Sun Jun 7 02:07:43 2009 Tags:

Recording audio on the FreeRunner

The FreeRunner can record audio. It is nice to record audio: for example, I can run the recording in the background while I keep tangogps on the screen, and take audio notes about where I am while I am doing mapping for OpenStreetMap.

Here is the script that I put together to create geocoded audio notes:

#!/bin/sh

WORKDIR=~/rec
TMPINFO=`mktemp $WORKDIR/info.XXXXXXXX`

# Sync system time and get GPS info
echo "Synchronising system time..."
getgps --sync-time --info > $TMPINFO

# Compute an accurate basename for the files we generate
BASENAME=~/rec/rec-$(date +%Y-%m-%d-%H-%M-%S)
# Then give a proper name to the file with saved info
mv $TMPINFO $BASENAME.info

# Proper mixer settings for recording
echo "Recording..."
alsactl -f /usr/share/openmoko/scenarios/voip-handset.state restore
arecord -D hw -f cd -r 8000 -t wav $BASENAME.wav

echo "Done"

It works like this:

  1. It synchronizes the system time from the GPS (if there is a fix) so that the timestamps on the wav files will be as accurate as possible.
  2. It also gets all sort of information from the GPS and stores them into a file, should you want to inspect it later.
  3. It records audio until it gets interrupted.

The names of the files it generates correspond to the beginning of the recording. The mtime of the wav file obviously corresponds to the end of the recording. This can be used later to georeference the start and end points of the recording.
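
For example, a little sketch of how to recover the absolute start and end times of one of these recordings, using the wav duration and the file mtime as described above:

import os
import wave

def recording_span(fname):
    """Return (start, end) unix timestamps of a recording, using the file
    mtime as the end time and the wav duration to derive the start."""
    w = wave.open(fname)
    duration = w.getnframes() / float(w.getframerate())
    end = os.path.getmtime(fname)
    return end - duration, end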

You can use this to check mixer levels and that you're actually getting any input:

arecord -D hw -f cd -r 8000 -t wav -V mono /dev/null

The getgps script is now described in its own post.

You may now want to experiment, in JOSM, with "Preferences / Audio settings / Modified times (time stamps) of audio files".

Posted Sun Jun 7 01:30:37 2009 Tags:

Saving waypoints with Holux M-241

The Holux M-241 is a nice unit, but it looks like it cannot store waypoints while you're recording a track.

In fact, if you set it to display latitude and longitude, every time you press enter it stores a track point with the current location. However, after downloading the data with gpsbabel, they are indistinguishable from all the other track points.

If you are not recording a track, this is sufficient: your track will be made of all the points you recorded.

Together with Riccio, another M-241 user, I noticed that if you are recording a track with one trackpoint per second, then when you press enter in the lat/lon screen you get two track points in the same second: one from the logger, and one from your keypress.

The duplicate timestamp is just enough information to distinguish a waypoint. Here is a little Python script that will add a waypoint every time there are two trackpoints with the same timestamp.

#!/usr/bin/python

import xml.dom.minidom
from xml.dom.minidom import Node
import sys

def get_text(node):
    res = ""
    for n in node.childNodes:
        if n.nodeType == Node.TEXT_NODE:
            res += n.data
    return res.strip()

def get_val(node, name):
    for n in node.childNodes:
        if n.nodeName == name:
            return get_text(n)
    return None

doc = xml.dom.minidom.parse(sys.argv[1])

# Scan the document looking for duplicate timestamps
wpt = []
last_ts = None
for node in doc.getElementsByTagName("trkpt"):
    ts = get_val(node, "time")
    if last_ts is not None and last_ts == ts:
        wpt.append(node)
    last_ts = ts

# Add the nodes with duplicate timestamps as waypoints
if len(wpt) > 0:
    for i in wpt:
        node = doc.createElement("wpt")
        for a in i.attributes.values():
            node.setAttribute(a.name, a.value)
        for n in i.childNodes:
            node.appendChild(n.cloneNode(True))
        doc.documentElement.appendChild(node)
 
doc.writexml(sys.stdout)
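
Assuming you save the script as dup2wpt (a name I just made up here), usage is simply:

./dup2wpt track.gpx > track-with-waypoints.gpx
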
Posted Sat Jun 6 00:57:39 2009 Tags: