apt-xapian-index: performing a simple query

I've recently posted an introduction to apt-xapian-index.

Today I'll show how to make simple queries to apt-xapian-index. If you feel like reimplementing my examples in another language, let me know and I'll include them in the post.

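What I'm going to build is a replacement for apt-cache search that accepts both keywords and Debtags tags, also matches the stemmed forms of keywords, and sorts results by how well they match.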

This is just a beginning: in future blog posts I'll show how to enhance a search with interesting advanced features.

First, we need to import the Xapian module, which is provided by the python-xapian package. Documentation on the Python Xapian API can be found in /usr/share/doc/python-xapian.

import xapian

Then we open the apt-xapian-index database:

# Instantiate a xapian.Database object for read-only access to the index
db = xapian.Database("/var/lib/apt-xapian-index/index")

Now we build a query from the command line arguments. We'll assume that if an argument is in the form foo::bar, then it's a Debtags tag instead of a normal keyword.

For normal keywords, we also search for the stemmed version, so that we can, for example, find "edited" when the user searches for "editing".

# Stemmer function to generate stemmed search keywords
stemmer = xapian.Stem("english")

# Build the terms that will go in the query.
# args holds the command line arguments (for example, sys.argv[1:]).
terms = []
for word in args:
    if word.islower() and "::" in word:
        # According to /var/lib/apt-xapian-index/README, Debtags tags are
        # indexed with the 'XT' prefix.
        terms.append("XT"+word)
    else:
        # If it is not a Debtags tag, then we consider it a normal keyword.
        word = word.lower()  # The index stores keywords all in lowercase
        terms.append(word)
        # If the word has a stemmed version, add it to the query.
        # /var/lib/apt-xapian-index/README tells us that stemmed terms have a
        # 'Z' prefix.
        stem = stemmer(word)
        if stem != word:
            terms.append("Z"+stem)
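
To see what this loop produces without needing the index installed, here is a minimal, self-contained sketch of the same logic. The names `build_terms` and `toy_stem` are made up for this illustration. `toy_stem` only strips two suffixes and is a stand-in for `xapian.Stem("english")`, not a reimplementation of it.

```python
def toy_stem(word):
    """Hypothetical stand-in for xapian.Stem("english"): strips two suffixes only."""
    for suffix in ("ing", "ed"):
        # Only strip when a reasonably long root is left over
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word


def build_terms(args, stem=toy_stem):
    """Turn command line arguments into index terms (XT tags, keywords, Z stems)."""
    terms = []
    for word in args:
        if word.islower() and "::" in word:
            # Debtags tag: indexed with the 'XT' prefix
            terms.append("XT" + word)
        else:
            # Normal keyword: lowercase it, then add the stemmed form if it differs
            word = word.lower()
            terms.append(word)
            stemmed = stem(word)
            if stemmed != word:
                terms.append("Z" + stemmed)
    return terms


print(build_terms(["role::program", "Image", "editing"]))
```

Note that a keyword whose stem is identical to the word itself, such as "edit", contributes only the plain term and no 'Z' term.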

Now we have the terms for the query, and we can create a query that ORs them together.

One may ask, why OR and not AND? The reason is that, unlike apt-cache, Xapian scores results according to how well they match.

Documents that match all the terms score higher than the others, so an OR query effectively behaves like an AND query that degrades gracefully to partial matches once the perfect matches run out.

This allows stemmed searches to work nicely: if you look for 'editing', then the query will be 'editing OR Zedit'. Packages with the word 'editing' will match both and score higher, and packages with the word 'edited' will still match 'Zedit' and get included in the results.

# OR the terms together into a Xapian query.
query = xapian.Query(xapian.Query.OP_OR, terms)
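
As a rough illustration of why an OR query behaves like a graceful AND, here is a small self-contained sketch that ranks a few documents by how many query terms they contain. Everything in it is hypothetical: the package names and their term sets are invented, and counting matched terms is a deliberate simplification of Xapian's real weighting, which also looks at things like term frequency and document length.

```python
# Invented sample data: package name -> set of terms indexed for that package
INDEX = {
    "pkg-a": {"editing", "Zedit", "image"},
    "pkg-b": {"edited", "Zedit"},
    "pkg-c": {"image"},
}


def or_search(index, terms):
    """Return (score, name) pairs for documents matching at least one term."""
    results = []
    for name, doc_terms in index.items():
        # Toy score: the number of query terms found in the document
        score = sum(1 for term in terms if term in doc_terms)
        if score > 0:
            results.append((score, name))
    # Best score first; ties are broken by name to keep the order stable
    results.sort(key=lambda r: (-r[0], r[1]))
    return results


for score, name in or_search(INDEX, ["editing", "Zedit"]):
    print("%d %s" % (score, name))
```

The document containing both terms comes first, the one that only matches the stem follows, and the one matching neither is left out.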

We then run the query. Queries are run through a xapian.Enquire object:

# Perform the query
enquire = xapian.Enquire(db)
enquire.set_query(query)

The Enquire object returns results as an mset. An mset represents a view of the result set, and can be iterated over to access the resulting documents. Here we iterate over the mset and output the result of the query, looking up short descriptions with apt:

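# Open the Apt cache (from python-apt) to look up package descriptions.
# In a full script, this import would go at the top of the file.
import apt
cache = apt.Cache()

# Display the top 20 results, sorted by how well they match
matches = enquire.get_mset(0, 20)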
print "%i results found." % matches.get_matches_estimated()
print "Results 1-%i:" % matches.size()
for m in matches:
    # /var/lib/apt-xapian-index/README tells us that the Xapian document data
    # is the package name.
    name = m[xapian.MSET_DOCUMENT].get_data()

    # Get the package record out of the Apt cache, so we can retrieve the short
    # description
    pkg = cache[name]

    # Print the match, together with the short description
    print "%i%% %s - %s" % (m[xapian.MSET_PERCENT], name, pkg.summary)
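
The two arguments to get_mset are the offset of the first result and the maximum number of results to return, so paging through the results is just arithmetic on the offset. A minimal sketch, where `page_window` is a hypothetical helper of mine and not part of Xapian:

```python
def page_window(page, page_size=20):
    """Return (first, maxitems) for a zero-based page number."""
    if page < 0 or page_size <= 0:
        raise ValueError("page must be >= 0 and page_size must be > 0")
    return page * page_size, page_size


first, maxitems = page_window(2)
# With a real Enquire object, this would be: enquire.get_mset(first, maxitems)
print("Requesting up to %i results starting at offset %i" % (maxitems, first))
```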

That's it.

You can use the wsvn interface to get to the full source code.

You can run the code with keywords and Debtags tags as arguments. Try running it as:

    ./axi-query-simple.py role::program image edit
    ./axi-query-simple.py role::program game::arcade
    ./axi-query-simple.py kernel image

You can search Debtags tags using debtags tagsearch. In a later blog post, I'll show how to use apt-xapian-index to implement a better-than-you-would-ever-have-thought-possible tag search.

Next in the series: adding simple result filters to the query.