I've recently posted an introduction to apt-xapian-index.
Today I'll show how to make simple queries to apt-xapian-index. If you feel like reimplementing my examples in another language, let me know and I'll include it in the post.
What I'm going to build is a replacement for apt-cache search that:
- is much faster than apt-cache search
- scores results by relevance, so you get the best matches first
- does stemming of search terms, so it matches 'edit' when you type 'editing'
- understands debtags tags.
This is just a beginning: in future blog posts I'll show how to enhance a search with interesting advanced features.
First thing, we need to import the Xapian module, found in the package
python-xapian. Documentation on the Python Xapian API can be found in
Then we open the apt-xapian-index database:
    import xapian

    # Instantiate a xapian.Database object for read only access to the index
    db = xapian.Database("/var/lib/apt-xapian-index/index")
Now we build a query from the command line arguments. We'll assume that if an argument is in the form foo::bar, then it's a Debtags tag instead of a normal keyword.
For normal keywords, we also search for the stemmed version, so that we can, for example, find "edited" when the user searches for "editing".
    # Stemmer function to generate stemmed search keywords
    stemmer = xapian.Stem("english")

    # Build the terms that will go in the query
    terms = []
    for word in args:
        if word.islower() and "::" in word:
            # According to /var/lib/apt-xapian-index/README, Debtags tags are
            # indexed with the 'XT' prefix.
            terms.append("XT" + word)
        else:
            # If it is not a Debtags tag, then we consider it a normal keyword.
            # The index stores keywords all in lowercase.
            word = word.lower()
            terms.append(word)
            # If the word has a stemmed version, add it to the query.
            # /var/lib/apt-xapian-index/README tells us that stemmed terms have a
            # 'Z' prefix.
            stem = stemmer(word)
            if stem != word:
                terms.append("Z" + stem)
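To see what terms the loop above produces without touching the index, here is a minimal standalone sketch. The `build_terms` and `toy_stem` names are hypothetical and introduced only for illustration; `toy_stem` is a crude stand-in for `xapian.Stem("english")` and does not implement the real stemming algorithm.

```python
def toy_stem(word):
    """Crude stand-in for xapian.Stem("english"): strips a few common suffixes.

    Hypothetical helper for illustration only; the real stemmer is far
    more careful than this.
    """
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def build_terms(args, stem=toy_stem):
    """Turn command line arguments into index terms, mirroring the loop above."""
    terms = []
    for word in args:
        if word.islower() and "::" in word:
            # Debtags tags are indexed with the 'XT' prefix
            terms.append("XT" + word)
        else:
            word = word.lower()
            terms.append(word)
            stemmed = stem(word)
            if stemmed != word:
                # Stemmed terms are indexed with the 'Z' prefix
                terms.append("Z" + stemmed)
    return terms


print(build_terms(["role::program", "image", "Editing"]))
# prints ['XTrole::program', 'image', 'editing', 'Zedit']
```

Note that a keyword which is already its own stem, such as 'image' here, contributes only one term to the query.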
Now we have the terms for the query, and we can create a query that ORs them together.
One may ask, why OR and not AND? The reason is that, unlike apt-cache search, Xapian scores results according to how well they match.
Results that match all the terms will score higher than the others, so if we build an OR query what we really have is an AND query that degrades gracefully to partial matches when it runs out of perfect results.
This allows stemmed searches to work nicely: if you look for 'editing', then the query will be 'editing OR Zedit'. Packages with the word 'editing' will match both and score higher, and packages with the word 'edited' will still match 'Zedit' and get included in the results.
    # OR the terms together into a Xapian query.
    query = xapian.Query(xapian.Query.OP_OR, terms)
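The reasoning for OR over AND can be illustrated without Xapian at all. The sketch below is a toy model: it scores each package by how many query terms it contains, which is a deliberate simplification of Xapian's real weighting. The `rank_or` helper, the package names, and their term sets are all made up for the example.

```python
def rank_or(query_terms, docs):
    """Rank documents matching ANY query term, best matches first.

    Toy scoring: the score is simply the number of query terms the
    document contains. Xapian's real weighting scheme is more refined.
    """
    scored = []
    for name, doc_terms in docs.items():
        score = len(set(query_terms) & doc_terms)
        if score > 0:  # OR semantics: one matching term is enough
            scored.append((score, name))
    # Highest score first; ties broken alphabetically for a stable order
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [name for score, name in scored]


# Made-up index: package name -> set of indexed terms
docs = {
    "pkg-editing": {"editing", "Zedit", "image"},
    "pkg-edited": {"edited", "Zedit"},
    "pkg-unrelated": {"kernel", "Zkernel"},
}

# Searching for 'editing' produces the terms 'editing' and 'Zedit'
print(rank_or(["editing", "Zedit"], docs))
# prints ['pkg-editing', 'pkg-edited']
```

The package that matches both terms comes first, the one that only matches the stemmed term still shows up, and the unrelated one is left out. An AND query would have dropped the second package entirely.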
We then run the query. Queries are run through an Enquire object:
    # Perform the query
    enquire = xapian.Enquire(db)
    enquire.set_query(query)
The Enquire object returns results as an mset. An mset represents a view of the result set, and can be iterated to access the resulting documents. Here we iterate the mset and output the result of the query, looking up short descriptions with apt:
    import apt

    # Access the Apt cache, used below to look up short descriptions
    cache = apt.Cache()

    # Display the top 20 results, sorted by how well they match
    matches = enquire.get_mset(0, 20)
    print("%i results found." % matches.get_matches_estimated())
    print("Results 1-%i:" % matches.size())
    for m in matches:
        # /var/lib/apt-xapian-index/README tells us that the Xapian document data
        # is the package name.
        name = m[xapian.MSET_DOCUMENT].get_data()

        # Get the package record out of the Apt cache, so we can retrieve the short
        # description
        pkg = cache[name]

        # Print the match, together with the short description
        print("%i%% %s - %s" % (m[xapian.MSET_PERCENT], name, pkg.summary))
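Since get_mset takes the offset of the first result and the maximum number of results to return, the same call can also be used to page through a longer result list. The sketch below assumes an `enquire` object that has already been given a query as above; `iter_pages` is a hypothetical helper name, not part of Xapian.

```python
def iter_pages(enquire, page_size=20):
    """Yield successive msets of at most page_size results each."""
    first = 0
    while True:
        # get_mset(first, maxitems) returns results starting at offset
        # 'first', up to 'maxitems' of them
        mset = enquire.get_mset(first, page_size)
        if mset.size() == 0:
            break
        yield mset
        if mset.size() < page_size:
            # A short page means there are no further results
            break
        first += page_size
```

Each yielded mset can be iterated exactly like `matches` in the loop above, so the display code does not need to change.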
This is it.
You can run the code passing keywords and Debtags tags. Try running it as:
    ./axi-query-simple.py role::program image edit
    ./axi-query-simple.py role::program game::arcade
    ./axi-query-simple.py kernel image
You can search Debtags tags using debtags tagsearch. In a later blog post, I'll show how to use apt-xapian-index to implement a better-than-you-would-ever-have-thought-possible tag search.
Next in the series: adding simple result filters to the query.