Match package names across distributions

What would happen if we had a quick and reliable way to match package names across distributions?

These ideas came up at the appinstaller2011 meeting:

  • it would be easy to lookup screenshots in the local distro, and if there are none then fall back on other distributions;
  • it would be easy to port Debtags to other distributions, and possibly get changes back;
  • it would be trivial to add a [patches in $DISTRO] link to the PTS
  • it would be easy to point to other BTSes

We thought they were good ideas, so we started hacking.

To try it, you need to get the code and build the index first:

git clone git://git.debian.org/users/enrico/distromatch.git
cd distromatch
# Careful: 90Mb
wget http://people.debian.org/~enrico/dist-info.tar.gz
tar zxf dist-info.tar.gz
# Takes a long time to do the indexing
./distromatch --reindex --verbose

Then you can query it this way:

./distromatch $DISTRO $PKGNAME [$PKGNAME1 ...]

This would give you, for the package $PKGNAME in $DISTRO, the corresponding package names in all other distros for which we have data. If you do not provide package names, it automatically shows output for all packages in $DISTRO.

For example:

$ time ./distromatch debian libdigest-sha1-perl
debian:libdigest-sha1-perl fedora:perl-Digest-SHA1
debian:libdigest-sha1-perl mandriva:perl-Digest-SHA1
debian:libdigest-sha1-perl suse:perl-Digest-SHA1

real    0m0.073s
user    0m0.056s
sys 0m0.016s

Yes it's quick. It builds a Xapian index with the information it needs, and then it reuses it. As soon as I find a moment, I intend to deploy an instance of it on DDE.

It is using a range of different heuristics:

  • match packages by name;
  • match packages by desktop files contained within;
  • match packages by pkg-config metadata files contained within;
  • match packages by [/usr]/bin/* files contained within;
  • match packages by shared library files contained within;
  • match packages by devel library files contained within;
  • match packages by man pages contained within;
  • match stemmed form of development library package names;
  • match stemmed form of shared library package names;
  • match stemmed form of perl library package names;
  • match stemmed form of python library package names.

This list may get obsolete soon as more heuristics get implemented.

Euristics will never cover all corner cases we surely have, but the idea is that if we can match a sizable amout of packages, the rest can be somehow fixed by hand as needed.

The data it requires for a distribution should be rather straightforward to generate:

  1. a file which maps binary package names to source package names
  2. a file with the list of files in all the packages

For example:

$ ls -l dist-debian/
total 39688
-rw-r--r--  1 enrico enrico  1688249 Jan 20 17:37 binsrc
drwxr-xr-x  2 enrico enrico     4096 Jan 21 19:12 db
-rw-r--r--  1 enrico enrico 29960406 Jan 21 10:02 files.gz
-rw-r--r--  1 enrico enrico  8914771 Jan 21 18:39 interesting-files

$ head dist-debian/binsrc 
openoffice.org-dev openoffice.org
ext4-modules-2.6.32-5-4kc-malta-di linux-kernel-di-mipsel-2.6
linux-headers-2.6.30-2-common linux-2.6
libnspr4 nspr
ipfm ipfm
libforks-perl libforks-perl
med-physics debian-med
libntfs-3g-dev ntfs-3g
libguppi16 guppi
selinux selinux

$ zcat dist-debian/files.gz | head
memstat etc/memstat.conf
memstat usr/bin/memstat
memstat usr/share/doc/memstat/changelog.gz
memstat usr/share/doc/memstat/copyright
memstat usr/share/doc/memstat/memstat-tutorial.txt.gz
memstat usr/share/man/man1/memstat.1.gz
libdirectfb-dev usr/bin/directfb-config
libdirectfb-dev usr/bin/directfb-csource
libdirectfb-dev usr/include/directfb-internal/core/clipboard.h
libdirectfb-dev usr/include/directfb-internal/core/colorhash.h

interesting-files and db are generated when indexing.

To prove the usefulness of the idea (but does it need proving?), you can find in the same git repo a little example app (it took me 10 minutes to write it), that uses the distromatch engine to export Debtags tags to other distributions:

$ ./exportdebtags fedora | head
memstat: admin::benchmarking, interface::commandline, role::program, use::monitor
libdirectfb-dev: devel::lang:c, devel::library, implemented-in::c, interface::framebuffer, role::devel-lib
libkonqsidebarplugin4a: implemented-in::c++, role::shared-lib, suite::kde, uitoolkit::qt
libemail-simple-perl: devel::lang:perl, devel::library, implemented-in::perl, role::devel-lib, role::shared-lib, works-with::mail
libpoe-component-pluggable-perl: devel::lang:perl, devel::library, implemented-in::perl, role::shared-lib
manpages-ja: culture::japanese, made-of::man, role::documentation
libhippocanvas-dev: devel::library, qa::low-popcon, role::devel-lib
libexpat-ocaml-dev: devel::lang:ocaml, devel::library, implemented-in::c, implemented-in::ocaml, role::devel-lib, works-with-format::xml
libgnutls-dev: devel::library, role::devel-lib, suite::gnu

Just in case this made you itch to play with Debtags in a non-Debian distribution, I've generated the full datasets for Fedora, Mandriva and OpenSUSE.

Others have been working on the same matching problem. After we started writing code we started to become aware of existing work:

I'd like to make use of those efforts, maybe to cross-validate results, maybe even better as yet another heuristics.

Update:

I built a simple distromatch query system into DDE!