apt-xapian-index: searching for similar packages
I've recently posted:
- an introduction to apt-xapian-index;
- an example of how to query it;
- a way to add simple result filters to the query;
- a way to suggest keywords and tags to use to improve the query.
Today I'll show how to abuse Xapian to show a list of packages similar to a given one.
This time I'll just link to the code in wsvn and show only the most important bits in the blog.
So, we have a package name, and we want to show which packages are similar to it.
To do it, we simply build a big OR query with all the terms indexed for that package: Xapian will show us the packages whose terms are most similar, and that does the trick.
This works because Xapian gives us the best results first: even if no package except the given one matches every term, we still get the nearest matches first.
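To see why an OR query puts similar packages first, here is a toy model in plain Python rather than Xapian itself. It scores each document by how many query terms it contains; Xapian's real weighting is more refined than a simple count, and the index contents and the `or_query` helper below are made up for illustration.

```python
# Toy model of an OR query: a document matches if it shares at least one
# term with the query, and it scores higher the more terms it shares.
# The index contents are invented for illustration.
index = {
    "debtags": {"XPdebtags", "tag", "package", "debian", "search"},
    "debtags-edit": {"XPdebtags-edit", "tag", "package", "debian", "gui"},
    "tagcoll": {"XPtagcoll", "tag", "collection"},
    "gimp": {"XPgimp", "image", "editor"},
}

def or_query(index, terms):
    """Return (name, score) pairs for matching documents, best first."""
    terms = set(terms)
    scored = [(name, len(terms & doc_terms)) for name, doc_terms in index.items()]
    matches = [(name, score) for name, score in scored if score > 0]
    # Sort by descending score, then by name to keep the order stable
    return sorted(matches, key=lambda pair: (-pair[1], pair[0]))

# Query with every term of "debtags", including its own package-name term
results = or_query(index, index["debtags"])
```

Only the input package matches all of its own terms, so it comes out on top; the packages sharing the most terms follow, and unrelated ones drop out.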
In order to get the list of indexed terms given a package name we need to do two things:
- Get the Xapian document for the package.
- Get the termlist of the document.
To get the Xapian document we search for a term that only that document can have. In the index, the package name is indexed with the special prefix "XP", so we can search for that:
    def docForPackage(pkgname):
        "Get the document corresponding to the package with the given name"
        # Query the term with the package name
        query = xapian.Query("XP"+pkgname)
        enquire = xapian.Enquire(db)
        enquire.set_query(query)
        # Get the top result only
        matches = enquire.get_mset(0, 1)
        if matches.size() == 0:
            return None
        else:
            m = matches[0]
            return m[xapian.MSET_DOCUMENT]
Then we build the big term list, by iterating the termlist of the document:
    # Build a term list with all the terms in the given package
    terms = []
    # Get the document corresponding to the package name
    doc = docForPackage(pkgname)
    if doc is not None:
        # Retrieve all the terms in the document, skipping the
        # package name term (prefix "XP")
        for t in doc.termlist():
            if len(t.term) < 2 or t.term[:2] != 'XP':
                terms.append(t.term)
Note that it's trivial to fetch terms from more than one document, if you want to query "all packages a bit like this one and a bit like that one", although that's less of a useful feature.
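As a rough sketch of the multi-package case, the helper below merges the term lists of several documents. It works on plain lists of strings instead of Xapian term iterators, and the name `collect_terms`, its `skip_prefix` parameter and the choice to drop duplicate terms are all assumptions of this example, not part of the original script.

```python
def collect_terms(termlists, skip_prefix="XP"):
    """Merge several term lists, dropping package-name terms and duplicates."""
    seen = set()
    merged = []
    for termlist in termlists:
        for term in termlist:
            # Skip the per-package name term and anything already collected
            if term.startswith(skip_prefix) or term in seen:
                continue
            seen.add(term)
            merged.append(term)
    return merged

# Invented term lists standing in for two indexed packages
terms = collect_terms([
    ["XPfoo", "tag", "debian"],
    ["XPbar", "debian", "search"],
])
```

The merged list can then feed the OR query in the same way as the terms of a single package.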
Lastly, we build the final query:
    # Build the big OR query
    query = xapian.Query(xapian.Query.OP_AND_NOT,
                # Terms we want
                xapian.Query(xapian.Query.OP_OR, terms),
                # AND NOT the input packages
                xapian.Query("XP"+pkgname))

I add an AND_NOT part with the input package name so that the package we asked for does not appear in the output.
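The effect of the AND_NOT part can be pictured with a small set-based sketch. This is plain Python, not the Xapian API: the `and_not` helper and the index contents are hypothetical, and ranking is left out to keep the focus on the exclusion.

```python
# Invented index: each package maps to its set of indexed terms
index = {
    "debtags": {"XPdebtags", "tag", "debian"},
    "debtags-edit": {"XPdebtags-edit", "tag", "gui"},
    "gimp": {"XPgimp", "image"},
}

def and_not(index, wanted_terms, unwanted_term):
    """Names of documents matching any wanted term but not the unwanted one."""
    wanted = set(wanted_terms)
    return sorted(
        name for name, doc_terms in index.items()
        if doc_terms & wanted and unwanted_term not in doc_terms
    )

# "debtags" matches the wanted terms, but its own XP term rules it out
similar = and_not(index, ["tag", "debian"], "XPdebtags")
```

Since only one document carries a given "XP" term, the right-hand side removes exactly the input package and nothing else.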
This is it:
    $ ./axi-query-similar.py debtags
    20309 results found.
    Results 1-20:
    33% debtags-edit - GUI browser and editor for Debian Package Tags
    27% tagcolledit - GUI editor for tagged collections
    25% libtagcoll2-dev - Functions used to manipulate tagged collections (development version)
    24% tagcoll - Commandline tool to perform operations on tagged collections
    19% packagesearch - GUI for searching packages and viewing package information
    18% doodle - Desktop Search Engine (client)
    18% doodled - Desktop Search Engine (daemon)
    18% libept0 - High-level library for managing Debian package information
    18% upgrade-system - system upgrader from Konflux
    18% libept-dev - High-level library for managing Debian package information
    17% ept-cache - Commandline tool to search the package archive
    16% tracker-utils - metadata database, indexer and search tool - commandline tools
    [...]
Next in the series: adaptive quality cutoff.