apt-xapian-index: smart way of querying tags
I've recently posted:
- an introduction to apt-xapian-index;
- an example of how to query it;
- a way to add simple result filters to the query;
- a way to suggest keywords and tags to use to improve the query;
- a way to search for similar packages;
- a way to implement an adaptive cutoff on result quality.
Note that I've rewritten all the old posts to only show the main code snippets: if you were put off by the large lumps of code, you may want to give it another go.
Today I'll show how to implement a really good way of searching for Debtags tags. When I say really good, I mean the sort of good that, after you run it, makes you wonder how it could possibly manage to do it.
The idea is simple: you run a package search, but instead of showing the resulting packages, you ask Xapian to suggest tags like we saw in axi-query-expand.py.
For extra points, I'll use an adaptive cutoff in choosing the packages that go in the rset.
So, let's ask the user to enter some keywords to look for tags, and use them to run a normal package query:
    # Build the base query
    query = xapian.Query(xapian.Query.OP_OR, termsForSimpleQuery(args))

    # Perform the query
    enquire = xapian.Enquire(db)
    enquire.set_query(query)
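The `termsForSimpleQuery` helper comes from the earlier posts and is not repeated here. As a rough reminder of the kind of work it does, here is a minimal sketch in plain Python. The names `terms_for_simple_query`, `stem` and `toy_stem` are hypothetical, and the sketch assumes that tags are indexed with an `XT` prefix and stemmed keywords with a `Z` prefix; it is not the actual helper.

```python
def terms_for_simple_query(keywords, stem=lambda word: word):
    """Turn user keywords into a list of index terms (hypothetical sketch)."""
    terms = []
    for word in keywords:
        if "::" in word and word.islower():
            # Looks like a Debtags tag: assume tags carry the "XT" prefix
            terms.append("XT" + word)
        else:
            # Plain keyword: assume index terms are stored in lowercase
            word = word.lower()
            terms.append(word)
            # Assumption: stemmed forms are stored with a "Z" prefix
            stemmed = stem(word)
            if stemmed != word:
                terms.append("Z" + stemmed)
    return terms


def toy_stem(word):
    """Toy stand-in for a real stemmer: strip one trailing 's'."""
    return word[:-1] if word.endswith("s") else word


example_terms = terms_for_simple_query(
    ["explore", "the", "Dungeons", "game::rpg"], stem=toy_stem
)
print(example_terms)
```

With a real index you would pass a proper stemmer instead of `toy_stem`; the resulting list is what gets ORed together in the query above.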
Now, instead of showing the results of the query, we ask Xapian which tags in the index are most relevant to this search.
First, we pick some representative packages for the expand:
    # Use an adaptive cutoff to avoid picking bad results as references
    matches = enquire.get_mset(0, 1)
    topWeight = matches[0].weight
    enquire.set_cutoff(0, topWeight * 0.7)

    # Select the first 10 documents as the key ones to use to compute relevant
    # terms
    rset = xapian.RSet()
    for m in enquire.get_mset(0, 10):
        rset.add_document(m[xapian.MSET_DID])
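To see what the cutoff does to the choice of reference packages, here is a small sketch of the same selection in plain Python, without Xapian. The function name `pick_reference_docs` and the sample weights are made up for illustration; the 0.7 ratio and the limit of 10 mirror the snippet above.

```python
def pick_reference_docs(results, ratio=0.7, limit=10):
    """Return the ids of the results good enough to serve as references.

    `results` is a list of (docid, weight) pairs sorted by decreasing weight.
    """
    if not results:
        return []
    # The threshold adapts to the best match instead of being a fixed number
    top_weight = results[0][1]
    threshold = top_weight * ratio
    # Look at no more than `limit` results, dropping those below the threshold
    return [docid for docid, weight in results[:limit] if weight >= threshold]


# Made-up weights: two strong matches followed by a weaker tail
sample = [(11, 20.0), (42, 15.0), (7, 9.0), (3, 8.5)]
print(pick_reference_docs(sample))
```

Here only documents 11 and 42 reach 70% of the top weight, so the weaker tail never makes it into the reference set.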
Then we define the filter that only keeps tags:
    # Filter out all the keywords that are not tags
    class Filter(xapian.ExpandDecider):
        def __call__(self, term):
            "Return true if we want the term, else false"
            return term[:2] == "XT"
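The decision the filter makes is just a prefix test, which can be tried on its own. A minimal sketch, with hypothetical helper names and made-up sample terms (the non-tag prefixes shown are only there to have something to reject):

```python
TAG_PREFIX = "XT"


def is_tag_term(term):
    """Same test as the Filter above: keep only terms carrying the tag prefix."""
    return term.startswith(TAG_PREFIX)


def tag_name(term):
    """Strip the prefix to get the tag as the user knows it."""
    return term[len(TAG_PREFIX):]


# Made-up candidate terms, mixing tags with other kinds of terms
candidate_terms = ["XTgame::rpg", "Zdungeon", "dungeon", "XTuse::gameplaying", "XPnethack"]
tags = [tag_name(term) for term in candidate_terms if is_tag_term(term)]
print(tags)
```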
Then we print the tags:
    # This is the "Expansion set" for the search: the 10 most relevant terms that
    # match the filter
    eset = enquire.get_eset(10, rset, Filter())

    # Print out the results
    for res in eset:
        print "%.2f %s" % (res.weight, res.term[2:])
That's it. We turned a package search into a tag search, and this allows us to search for tags using keywords that are not present in the tag descriptions at all:
    $ ./axi-query-tags.py explore the dungeons
    27.50 game::rpg:rogue
    26.14 use::gameplaying
    17.53 game::rpg
    10.27 uitoolkit::ncurses
    ...
    $ ./axi-query-tags.py total world domination
    7.55 use::gameplaying
    5.68 x11::application
    5.35 interface::x11
    5.05 game::strategy
    ...
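Why this works can be illustrated with a toy model in plain Python. Note that this is not what Xapian computes: plain tag counting stands in for Xapian's term weighting, and the miniature index, the scoring and the name `suggest_tags` are all invented for the example. The point is only that the keywords match packages, and the tags come from those packages.

```python
from collections import Counter

# Made-up miniature index: package name -> terms (keywords and XT-prefixed tags)
INDEX = {
    "nethack": {"explore", "dungeon", "XTgame::rpg:rogue", "XTuse::gameplaying",
                "XTuitoolkit::ncurses"},
    "angband": {"explore", "dungeon", "monster", "XTgame::rpg:rogue",
                "XTuse::gameplaying"},
    "mc": {"explore", "file", "XTuse::browsing"},
    "gimp": {"image", "XTuse::editing"},
}


def suggest_tags(keywords, index=INDEX, ratio=0.7, limit=10):
    """Suggest tags for keywords by counting tags in the best-matching packages."""
    wanted = set(keywords)
    # OR query: score each package by how many of the keywords it contains
    scored = [(len(terms & wanted), name) for name, terms in index.items()]
    scored = sorted((s for s in scored if s[0] > 0), key=lambda s: (-s[0], s[1]))
    if not scored:
        return []
    # Adaptive cutoff: keep only packages scoring close to the best one
    threshold = scored[0][0] * ratio
    references = [name for score, name in scored[:limit] if score >= threshold]
    # Simplified "expand": count how often each tag occurs in the references
    counts = Counter(
        term[2:] for name in references for term in index[name]
        if term.startswith("XT")
    )
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


print(suggest_tags(["explore", "dungeon"]))
```

Neither "explore" nor "dungeon" appears in any tag name, yet the roguelike tags come out on top, while `use::browsing` is dropped because its only package falls below the cutoff.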
You can see a similar technique working in the Debtags tag editor: enter a package, then choose "Available tags: search".
Next in the series: search as you type.