apt-xapian-index: suggesting new terms for improving the query

I've recently posted an introduction to apt-xapian-index, an example of how to query it, and a way to add simple result filters to the query.

Today I'll show how to use a very interesting feature of Xapian: computing a list of the best terms to add to a query to improve it. I'll expand apt-query-pkgtype.py so that it prints the suggested terms after the results.

If you feel like reimplementing my examples in another language, let me know and I'll include it in the post.

Let's start with a few definitions taken from the Xapian glossary:

RSet (Relevance Set)

The Relevance Set (RSet) is the set of documents which have been marked by the user as relevant. They can be used to suggest terms that the user may want to add to the query (these terms form an ESet), and also to adjust term weights to reorder query results.

ESet (Expand Set)

The Expand Set (ESet) is a ranked list of terms that could be used to expand the original query. These terms are those which are statistically good differentiators between relevant and non-relevant documents.
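To make the two definitions concrete, here is a toy sketch in plain Python that needs no Xapian at all. It is not how Xapian weights terms: the score is a deliberately naive stand-in (share of relevant documents containing the term, minus the share of non-relevant ones), and the function name suggest_terms, the sample documents and their terms are all made up for illustration.

```python
from collections import Counter


def suggest_terms(docs, relevant_ids, query_terms, max_terms=10):
    """Rank candidate expansion terms for a toy collection.

    docs maps a document id to the set of terms that index it,
    relevant_ids plays the role of the RSet, and the returned list of
    (term, score) pairs plays the role of the ESet.
    The scoring is a naive assumption, not Xapian's weighting scheme.
    """
    relevant_ids = set(relevant_ids) & set(docs)
    n_rel = len(relevant_ids)
    n_nonrel = len(docs) - n_rel
    if n_rel == 0:
        return []

    # Count in how many relevant / non-relevant documents each term occurs
    in_rel = Counter()
    in_nonrel = Counter()
    for docid, terms in docs.items():
        counter = in_rel if docid in relevant_ids else in_nonrel
        counter.update(terms)

    scored = []
    for term, rel_count in in_rel.items():
        if term in query_terms:
            continue  # no point in suggesting what the user already typed
        nonrel_share = in_nonrel[term] / float(n_nonrel) if n_nonrel else 0.0
        score = rel_count / float(n_rel) - nonrel_share
        if score > 0:
            scored.append((term, score))

    # Best score first; ties are broken alphabetically
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:max_terms]


# Invented sample data: five "packages" with keywords and XT-prefixed tags
docs = {
    1: {"image", "editor", "paint", "XTuse::editing", "XTworks-with::image", "XTinterface::x11"},
    2: {"image", "editor", "XTuse::editing", "XTworks-with::image", "XTinterface::x11"},
    3: {"image", "viewer", "XTuse::viewing", "XTworks-with::image", "XTinterface::x11"},
    4: {"text", "editor", "XTuse::editing", "XTinterface::commandline"},
    5: {"sound", "editor", "XTuse::editing", "XTinterface::x11"},
}

# Documents 1 and 2 are the ones "marked as relevant"
for term, score in suggest_terms(docs, [1, 2], {"image", "editor"}):
    print("%s (%.2f)" % (term, score))
```

The point of the sketch is only the shape of the computation: a set of relevant documents goes in, a ranked list of discriminating terms comes out. Xapian does the same job with proper probabilistic weights and on the real index.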

What I'm doing now is: after printing the normal results, I build an RSet from all the matches returned by the query. For better suggestions one can ask the user to pick the results they like (make a GUI with that: it should rock!), but starting with the top results in the RSet is a good enough default:

# Now, we ask Xapian what are the terms in the index that are most relevant to
# this search.  This can be used to suggest to the user the most useful ways of
# refining the search.

# Use the documents in the mset as the key ones to compute relevant terms
# (matches is the mset returned by a normal query, so it holds the top results)
rset = xapian.RSet()
for m in matches:
    rset.add_document(m[xapian.MSET_DID])

Then I use the RSet to compute the ESet, and display the results to the user:

# This is the "Expansion set" for the search: the 10 most relevant terms
eset = enquire.get_eset(10, rset)

# Print it out.  Note that some terms have a prefix from the database: can we
# filter them out?  Indeed: Xapian allows passing a filter to get_eset.
# Read on...
print
print "Terms that could improve the search:",
# res.weight is a relevance weight, not a percentage, so print it as a plain number
print ", ".join(["%s (%.2f)" % (res.term, res.weight) for res in eset])

You can also abuse this feature to show which tags are most related to the search results. This allows you to turn a search based on keywords into a search based on semantic attributes, which would be an absolutely stunning feature in a GUI.

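We can do it because Xapian lets us pass a filter for the output of get_eset. This filter rejects every term that is not a tag, that is, every term that does not carry the XT prefix. Terms that were already in the query are not suggested either: get_eset leaves them out by itself unless told otherwise.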

class Filter(xapian.ExpandDecider):
    def __call__(self, term):
        "Return true if we want the term, else false"
        return term[:2] == "XT"

I just rerun get_eset as above, this time passing the filter, so that only tags are suggested:

# This is the "Expansion set" for the search: the 10 most relevant terms that
# match the filter
eset = enquire.get_eset(10, rset, Filter())

# Print out the resulting tags
print
print "Tags that could improve the search:",
# Strip the "XT" prefix from each tag; res.weight is a weight, not a percentage
print ", ".join(["%s (%.2f)" % (res.term[2:], res.weight) for res in eset])

sys.exit(0)
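If you want to play with the prefix test without an index at hand, here is a small standalone sketch that sorts a list of suggestions into tags and plain keywords. The only thing taken from the script is the "XT" prefix for tags; the helper name split_suggestions, the sample terms and weights, and the "XP" prefix in the sample are hypothetical. It also assumes that any other term starting with an uppercase letter is a prefixed term to be dropped, which is a convention and not something the script checks.

```python
TAG_PREFIX = "XT"  # prefix used for Debtags tags, as in the Filter class


def split_suggestions(suggestions):
    """Split (term, weight) pairs into tags and plain keywords.

    Tags are returned with the prefix stripped.  Terms that start with
    an uppercase letter but are not tags are assumed to carry some other
    prefix and are dropped (an assumption, see the text).
    """
    tags = []
    keywords = []
    for term, weight in suggestions:
        if term.startswith(TAG_PREFIX):
            tags.append((term[len(TAG_PREFIX):], weight))
        elif not term[:1].isupper():
            keywords.append((term, weight))
    return tags, keywords


# Invented sample of what an unfiltered expansion set could contain
sample = [
    ("XTuse::editing", 12.5),
    ("gimp", 9.1),
    ("XPgimp", 8.0),  # hypothetical non-tag prefix
    ("XTworks-with::image", 7.3),
    ("raster", 4.2),
]

tags, keywords = split_suggestions(sample)
print("Tags: " + ", ".join("%s (%.2f)" % pair for pair in tags))
print("Keywords: " + ", ".join("%s (%.2f)" % pair for pair in keywords))
```

Doing the split after the fact is less efficient than handing a filter to get_eset, because the ten slots of the expansion set may already be taken by terms you then throw away. It is handy, though, when you want both lists from a single call.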

That's it.

You can use the wsvn interface to get to the full source code and the module it uses. Try running it as:

axi-query-expand.py --type=gui image editor
axi-query-expand.py --type=cmdline image editor
axi-query-expand.py --type=game image editor

You will see that as the results change, the suggestions change as well.

I'm happy to see that Xapian often suggests tags: this means that tags are useful discriminators in search results, and that all the work done on Debtags has been good work.

Notice that the suggestions change as you change any of the terms, the custom filter, and the tags you put in the search. Try to imagine how such a thing could improve Synaptic. What if I told you that Xapian is fast enough that all of this can happen live while you type?

Next in the series: searching for similar packages.