Match package names across distributions
What would happen if we had a quick and reliable way to match package names across distributions?
These ideas came up at the appinstaller2011 meeting:
- it would be easy to look up screenshots in the local distro, and if there are none then fall back on other distributions;
- it would be easy to port Debtags to other distributions, and possibly get changes back;
- it would be trivial to add a [patches in $DISTRO] link to the PTS;
- it would be easy to point to other BTSes.
We thought they were good ideas, so we started hacking.
To try it, you need to get the code and build the index first:
    git clone git://git.debian.org/users/enrico/distromatch.git
    cd distromatch
    # Careful: 90Mb
    wget http://people.debian.org/~enrico/dist-info.tar.gz
    tar zxf dist-info.tar.gz
    # Takes a long time to do the indexing
    ./distromatch --reindex --verbose
Then you can query it this way:
./distromatch $DISTRO $PKGNAME [$PKGNAME1 ...]
This gives you, for the package $PKGNAME in $DISTRO, the corresponding package names in all other distros for which we have data. If you do not provide package names, it shows output for all packages in $DISTRO.
    $ time ./distromatch debian libdigest-sha1-perl
    debian:libdigest-sha1-perl fedora:perl-Digest-SHA1
    debian:libdigest-sha1-perl mandriva:perl-Digest-SHA1
    debian:libdigest-sha1-perl suse:perl-Digest-SHA1

    real    0m0.073s
    user    0m0.056s
    sys     0m0.016s
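If you want to consume that output from a script, here is a minimal Python sketch. It assumes that every result line holds exactly two whitespace-separated `distro:package` tokens, as in the run above; the `parse_matches` name is mine and not part of distromatch.

```python
from collections import defaultdict


def parse_matches(output):
    """Group distromatch result lines by the queried package.

    Assumption: each result line is "distro:package distro:package".
    Anything else (prompt line, blank lines, timing) is skipped.
    """
    matches = defaultdict(dict)
    for line in output.splitlines():
        fields = line.split()
        if len(fields) != 2 or not all(":" in f for f in fields):
            continue
        source, target = fields
        distro, name = target.split(":", 1)
        matches[source][distro] = name
    return dict(matches)


sample = """\
debian:libdigest-sha1-perl fedora:perl-Digest-SHA1
debian:libdigest-sha1-perl mandriva:perl-Digest-SHA1
debian:libdigest-sha1-perl suse:perl-Digest-SHA1

real 0m0.073s
"""
result = parse_matches(sample)
print(result["debian:libdigest-sha1-perl"]["fedora"])
```

Keying on the full `distro:package` token keeps the result usable when several package names are queried at once.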
It is using a range of different heuristics:
- match packages by name;
- match packages by desktop files contained within;
- match packages by pkg-config metadata files contained within;
- match packages by [/usr]/bin/* files contained within;
- match packages by shared library files contained within;
- match packages by devel library files contained within;
- match packages by man pages contained within;
- match stemmed form of development library package names;
- match stemmed form of shared library package names;
- match stemmed form of perl library package names;
- match stemmed form of python library package names.
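To give an idea of what "stemmed form" can mean, here is a rough sketch for Perl library names. The stemming rules and the function names are my own illustration, not the code that distromatch actually runs: I assume Debian-style names look like `libfoo-bar-perl` and the other style looks like `perl-Foo-Bar`.

```python
import re


def stem_perl_name(name):
    """Reduce a Perl library package name to a distro-neutral stem.

    Hypothetical rules: strip "lib" + "-perl" (Debian style) or a
    leading "perl-" (style seen in the Fedora, Mandriva and SUSE
    names above), then lowercase. Returns None if neither applies.
    """
    m = re.fullmatch(r"lib(.+)-perl", name)
    if m is None:
        m = re.fullmatch(r"perl-(.+)", name)
    if m is None:
        return None
    return m.group(1).lower()


def match_by_stem(names_a, names_b):
    """Pair up names from two distros whose stems coincide."""
    stems_b = {}
    for name in names_b:
        stem = stem_perl_name(name)
        if stem is not None:
            stems_b.setdefault(stem, []).append(name)
    pairs = []
    for name in names_a:
        stem = stem_perl_name(name)
        for other in stems_b.get(stem, []):
            pairs.append((name, other))
    return pairs


pairs = match_by_stem(
    ["libdigest-sha1-perl", "memstat"],
    ["perl-Digest-SHA1"],
)
print(pairs)
```

Names that do not follow either pattern get no stem and are left for the other heuristics.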
This list may become obsolete soon as more heuristics get implemented.
Heuristics will never cover all the corner cases we surely have, but the idea is that if we can match a sizable amount of packages, the rest can be fixed by hand as needed.
The data it requires for a distribution should be rather straightforward to generate:
- a file which maps binary package names to source package names
- a file with the list of files in all the packages
    $ ls -l dist-debian/
    total 39688
    -rw-r--r-- 1 enrico enrico  1688249 Jan 20 17:37 binsrc
    drwxr-xr-x 2 enrico enrico     4096 Jan 21 19:12 db
    -rw-r--r-- 1 enrico enrico 29960406 Jan 21 10:02 files.gz
    -rw-r--r-- 1 enrico enrico  8914771 Jan 21 18:39 interesting-files
    $ head dist-debian/binsrc
    openoffice.org-dev openoffice.org
    ext4-modules-2.6.32-5-4kc-malta-di linux-kernel-di-mipsel-2.6
    linux-headers-2.6.30-2-common linux-2.6
    libnspr4 nspr
    ipfm ipfm
    libforks-perl libforks-perl
    med-physics debian-med
    libntfs-3g-dev ntfs-3g
    libguppi16 guppi
    selinux selinux
    $ zcat dist-debian/files.gz | head
    memstat etc/memstat.conf
    memstat usr/bin/memstat
    memstat usr/share/doc/memstat/changelog.gz
    memstat usr/share/doc/memstat/copyright
    memstat usr/share/doc/memstat/memstat-tutorial.txt.gz
    memstat usr/share/man/man1/memstat.1.gz
    libdirectfb-dev usr/bin/directfb-config
    libdirectfb-dev usr/bin/directfb-csource
    libdirectfb-dev usr/include/directfb-internal/core/clipboard.h
    libdirectfb-dev usr/include/directfb-internal/core/colorhash.h
The interesting-files file and the db directory are generated when indexing.
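As an illustration of how the file list can feed a heuristic such as matching by [/usr]/bin/* files, here is a small Python sketch. It assumes the "package path" line format of files.gz, with paths lacking the leading slash. The function names are mine, and the second distribution's lines are made up for the example; this is not how distromatch itself is implemented.

```python
from collections import defaultdict


def index_binaries(lines):
    """Map each executable's basename to the set of packages shipping it.

    Assumption: every line is "package path". Only files directly
    under bin/ or usr/bin/ are indexed.
    """
    index = defaultdict(set)
    for line in lines:
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue
        package, path = parts[0], parts[1].strip()
        for prefix in ("usr/bin/", "bin/"):
            rest = path[len(prefix):]
            if path.startswith(prefix) and rest and "/" not in rest:
                index[rest].add(package)
                break
    return index


def match_by_binaries(index_a, index_b):
    """Pair packages from two distros that ship a same-named binary."""
    pairs = set()
    for binary, pkgs_a in index_a.items():
        for pkg_b in index_b.get(binary, ()):
            for pkg_a in pkgs_a:
                pairs.add((pkg_a, pkg_b))
    return pairs


debian_lines = [
    "memstat etc/memstat.conf",
    "memstat usr/bin/memstat",
    "libdirectfb-dev usr/bin/directfb-config",
]
# Made-up lines standing in for another distribution's file list.
other_lines = [
    "memstat-tools usr/bin/memstat",
    "unrelated usr/share/doc/unrelated/README",
]
bin_pairs = match_by_binaries(index_binaries(debian_lines),
                              index_binaries(other_lines))
print(sorted(bin_pairs))
```

To run it on the real data you could pass `gzip.open("dist-debian/files.gz", "rt")` as `lines`, since it yields one text line per entry.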
To prove the usefulness of the idea (but does it need proving?), you can find in the same git repo a little example app (it took me 10 minutes to write) that uses the distromatch engine to export Debtags tags to other distributions:
    $ ./exportdebtags fedora | head
    memstat: admin::benchmarking, interface::commandline, role::program, use::monitor
    libdirectfb-dev: devel::lang:c, devel::library, implemented-in::c, interface::framebuffer, role::devel-lib
    libkonqsidebarplugin4a: implemented-in::c++, role::shared-lib, suite::kde, uitoolkit::qt
    libemail-simple-perl: devel::lang:perl, devel::library, implemented-in::perl, role::devel-lib, role::shared-lib, works-with::mail
    libpoe-component-pluggable-perl: devel::lang:perl, devel::library, implemented-in::perl, role::shared-lib
    manpages-ja: culture::japanese, made-of::man, role::documentation
    libhippocanvas-dev: devel::library, qa::low-popcon, role::devel-lib
    libexpat-ocaml-dev: devel::library, devel::lang:ocaml, implemented-in::c, implemented-in::ocaml, role::devel-lib, works-with-format::xml
    libgnutls-dev: devel::library, role::devel-lib, suite::gnu
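The general shape of such an export is simple once a name mapping is available. The sketch below is not the exportdebtags code: `export_tags` is a hypothetical helper, and the name mapping in the example is invented for illustration.

```python
def export_tags(tags, name_map):
    """Yield "package: tag, tag" lines for the target distribution.

    tags maps Debian package names to collections of tags; name_map
    maps Debian package names to lists of names in the target distro.
    Packages without a match are skipped.
    """
    for deb_name in sorted(tags):
        for target in name_map.get(deb_name, []):
            yield "%s: %s" % (target, ", ".join(sorted(tags[deb_name])))


tags = {
    "memstat": {"admin::benchmarking", "interface::commandline",
                "role::program", "use::monitor"},
    "manpages-ja": {"culture::japanese", "made-of::man",
                    "role::documentation"},
}
# Invented mapping: only memstat is assumed to have a match.
name_map = {"memstat": ["memstat"]}

exported = list(export_tags(tags, name_map))
for line in exported:
    print(line)
```

Skipping unmatched packages keeps the export conservative: a tag is only carried over when a match exists.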
Others have been working on the same matching problem. After we started writing code, we became aware of existing work:
- Equivalent-Packages, statistically generated from package contents, more info in this post
I'd like to make use of those efforts, maybe to cross-validate results, or maybe even better as yet another heuristic.