While replying to a long message on the debtags-devel mailing list, I accidentally put together the pieces of a fun idea.

This is the bit of the message I was answering:

This is my answer:

Good point. The idea has popped up in the past to list supported mime types among the package metadata, so that one could point to a file and get a list of all the packages that can work with it.

I'm not sure it's a good idea to encode mime types in debtags and I'd like to see something ad-hoc for it. In the meantime works-with-format is the best we can do, but we should limit it to the most common formats.

This is the fun idea: if works-with-format is the best we can do, what can we do with it?

Earlier today I worked on resurrecting some old code of mine to expand Zack's ls2rss with Dublin Core metadata extracted from the files. The mime type scanner was ready for action.

Some imports:

import sys
# Requires python-extractor, python-magic, python-apt
# and an unreleased python-debtags from http://bzr.debian.org/bzr/pkg-python-debian/trunk/
import extractor
import magic
from debian_bundle import debtags
import re
from optparse import OptionParser
import apt

A tentative mapping between mime types and debtags tags:

mime_map = (
        ( r'text/html\b', ("works-with::text","works-with-format::html") ),
        ( r'text/plain\b', ("works-with::text","works-with-format::plaintext") ),
        ( r'text/troff\b', ("works-with::text", "works-with-format::man") ),
        ( r'image/', ("works-with::image",) ),
        ( r'image/jpeg\b', ("works-with::image:raster","works-with-format::jpg") ),
        ( r'image/png\b', ("works-with::image:raster","works-with-format::png") ),
        ( r'application/pdf\b', ("works-with::text","works-with-format::pdf")),
        ( r'application/postscript\b', ("works-with::text","works-with-format::postscript")),
        ( r'application/x-iso9660\b', ('works-with-format::iso9660',)),
        ( r'application/zip\b', ('works-with::archive', 'works-with-format::zip')),
        ( r'application/x-tar\b', ('works-with::archive', 'works-with-format::tar')),
        ( r'audio/', ("works-with::audio",) ),
        ( r'audio/mpeg\b', ("works-with-format::mp3",) ),
        ( r'audio/x-wav\b', ("works-with-format::wav",) ),
        ( r'message/rfc822\b', ("works-with::mail",) ),
        ( r'video/', ("works-with::video",)),
        ( r'application/x-debian-package\b', ("works-with::software:package",)),
        ( r'application/vnd\.oasis\.opendocument\.text\b', ("works-with::text",)),
        ( r'application/vnd\.oasis\.opendocument\.graphics\b', ("works-with::image:vector",)),
        ( r'application/vnd\.oasis\.opendocument\.spreadsheet\b', ("works-with::spreadsheet",)),
        ( r'application/vnd\.sun\.xml\.base\b', ("works-with::db",)),
        ( r'application/rtf\b', ("works-with::text",)),
        ( r'application/x-dbm\b', ("works-with::db",)),
)
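
To see how the table behaves, here is a small self-contained sketch. The `tags_for` helper and the trimmed `sample_map` are made up for illustration and are not part of the script. Since every pattern is tried, a generic prefix such as `image/` and a specific entry such as `image/jpeg\b` both contribute tags, and the `\b` still matches when the type string carries extra parameters after the type, which libmagic's output may do.

```python
import re

# Trimmed copy of the table above, just for this illustration
sample_map = (
    (r'image/', ("works-with::image",)),
    (r'image/jpeg\b', ("works-with::image:raster", "works-with-format::jpg")),
    (r'audio/', ("works-with::audio",)),
)

def tags_for(mime_type):
    """Collect the tags of every pattern that matches the start of mime_type."""
    found = set()
    for pattern, tags in sample_map:
        # re.match anchors at the beginning of the string
        if re.match(pattern, mime_type):
            found.update(tags)
    return found

print(sorted(tags_for("image/jpeg; charset=binary")))
```

A type that matches no pattern yields an empty set, which is the "unhandled mime type" case.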

Code that does its best to extract a mime type:

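# Bind the instances to new names so the modules are not shadowed
extract = extractor.Extractor()
cookie = magic.open(magic.MAGIC_MIME)
cookie.load()

def mimetype(fname):
    # Group the (keyword, value) pairs from libextractor by keyword
    keys = extract.extract(fname)
    xkeys = {}
    for k, v in keys:
        xkeys.setdefault(k, []).append(v)
    # libmagic's guess when handed the file by name
    namemagic = cookie.file(fname)
    # libmagic's guess from the first 4096 bytes of content
    fd = open(fname, "rb")
    try:
        contentmagic = cookie.buffer(fd.read(4096))
    finally:
        fd.close()
    # Prefer libextractor's answer, then fall back on libmagic
    return xkeys.get("mimetype", [None])[0] or contentmagic or namemagic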
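
The two moving parts of that function, grouping the pairs by keyword and picking the first usable answer, can be tried without libextractor or libmagic installed. This is only a sketch: `group_keys`, `pick_mimetype` and the sample pairs are names and data I made up for the example.

```python
from collections import defaultdict

def group_keys(pairs):
    """Group (keyword, value) pairs into a dict of lists."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)

def pick_mimetype(grouped, contentmagic, namemagic):
    """Return the first non-empty candidate, most trusted first."""
    candidates = grouped.get("mimetype", [None])[:1] + [contentmagic, namemagic]
    for candidate in candidates:
        if candidate:
            return candidate
    return None

# Hypothetical extractor output, invented for this example
pairs = [("mimetype", "application/pdf"), ("title", "Report"), ("title", "Draft")]
print(pick_mimetype(group_keys(pairs), "text/plain", None))
```

The explicit loop does the same job as the `or` chain, and makes the order of preference easier to read.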

Command line parser:

# VERSION is not defined in the snippets shown here; this is a placeholder value
VERSION = "0.1"

parser = OptionParser(usage="usage: %prog [options] filename",
        version="%prog "+ VERSION,
        description="search Debian packages that can handle a given file")
parser.add_option("--tagdb", default="/var/lib/debtags/package-tags", help="Tag database to use (default: %default)")
parser.add_option("--action", default=None, help="Show the packages that allow the given action on the file (default: %default)")

(options, args) = parser.parse_args()

if len(args) == 0:
    parser.error("Please provide the name of a file to scan")
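
A quick way to check the parser is to hand it an explicit argument list instead of the real command line. In this sketch the `VERSION` value, the `photo.jpg` file name and the `viewing` action are assumptions picked for the example.

```python
from optparse import OptionParser

VERSION = "0.1"  # placeholder value, assumed for this example

parser = OptionParser(usage="usage: %prog [options] filename",
        version="%prog " + VERSION,
        description="search Debian packages that can handle a given file")
parser.add_option("--tagdb", default="/var/lib/debtags/package-tags",
        help="Tag database to use (default: %default)")
parser.add_option("--action", default=None,
        help="Show the packages that allow the given action on the file (default: %default)")

# Parse an explicit argument list instead of sys.argv
options, args = parser.parse_args(["--action", "viewing", "photo.jpg"])
print(options.action, args)
```

Options that are not given keep their defaults, so `options.tagdb` still points at the system tag database here.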

And here the fun starts. First we load the debtags data:

# Read full database
fullcoll = debtags.DB()
tagFilter = re.compile(r"^special::.+$|^.+::TODO$")
fullcoll.read(open(options.tagdb, "r"), lambda x: not tagFilter.match(x))
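
The filter drops housekeeping tags, namely anything in the `special::` facet and the `::TODO` placeholders, before they reach the collection. Here is a sketch of what the predicate keeps; the `keep` helper and the sample tag list are made up for the example.

```python
import re

tagFilter = re.compile(r"^special::.+$|^.+::TODO$")

def keep(tag):
    """Same test as the lambda passed to read() above: True means keep the tag."""
    return not tagFilter.match(tag)

# Sample tags chosen for this example
sample_tags = ["use::viewing", "special::not-yet-tagged",
               "implemented-in::TODO", "role::program"]
kept = [t for t in sample_tags if keep(t)]
print(kept)
```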

Then we scan the mime type and look up tags in the mime_map above:

mime = mimetype(args[0])
#print >>sys.stderr, "Mime type:", mime
found = set()
for pattern, tags in mime_map:
    # re.match anchors at the start of the string, so r'image/' works as a prefix test
    if re.match(pattern, mime):
        found.update(tags)

if len(found) == 0:
    print >>sys.stderr, "Unhandled mime type:", mime
else:

If the user only gave the file name, let's show what Debian can do with that file:

    if options.action is None:
        print "Debtags query:", " && ".join(found)

        query = found.copy()
        query.add("role::program")
        subcoll = fullcoll.filterPackagesTags(lambda pt: query.issubset(pt[1]))
        uses = [t[5:] for t in subcoll.iterTags() if t.startswith("use::")]
        print "Available actions:", ", ".join(uses)

If the user picked one of the available actions, let's show the packages that do it:

    else:
        aptCache = apt.Cache()
        query = found.copy()
        query.add("role::program")
        query.add("use::"+options.action)
        print "Debtags query:", " && ".join(query)
        subcoll = fullcoll.filterPackagesTags(lambda pt: query.issubset(pt[1]))
        for i in subcoll.iterPackages():
            aptpkg = aptCache[i]
            desc = aptpkg.rawDescription.split("\n")[0]
            print i, "-", desc
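
Both branches boil down to the same set operation: keep the packages whose tags include everything in the query. The sketch below replays that logic on a toy database, with plain dicts and sets standing in for the debtags collection. The package names, `toy_db`, `packages_matching` and `available_actions` are all invented for the example.

```python
# Toy tag database: package name -> set of tags (entries invented for this example)
toy_db = {
    "viewer-a": {"role::program", "use::viewing", "works-with::image"},
    "editor-b": {"role::program", "use::editing", "works-with::image"},
    "lib-c": {"role::shared-lib", "works-with::image"},
    "player-d": {"role::program", "use::playing", "works-with::audio"},
}

def packages_matching(db, query):
    """Return the packages whose tag set contains every tag in query."""
    return dict((pkg, tags) for pkg, tags in db.items() if query.issubset(tags))

def available_actions(db, found):
    """List the use:: values offered by programs that carry all tags in found."""
    query = set(found) | {"role::program"}
    actions = set()
    for tags in packages_matching(db, query).values():
        actions.update(t[len("use::"):] for t in tags if t.startswith("use::"))
    return sorted(actions)

print(available_actions(toy_db, {"works-with::image"}))
```

Adding `use::` plus the chosen action to the query then narrows the result down to the packages to list.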

\o/

The moral of the story:

debian debtags eng pdo tips

2009-06-06 00:57:39+02:00