Enrico's pages/ tags/ debtags

Pages about Debtags.

apt-xapian-index: search as you type

apt-xapian-index: search as you type

I've recently posted:

Note that I've rewritten all the old posts to only show the main code snippets: if you were put off by the large lumps of code, you may want to give it another go.

Today I'll show how to implement a very attractive feature for a user interface: search as you type. The idea is that you don't need to press enter to fire up a query: instead, the results materialise in front of your eyes as you type them.

The example I created uses curses, but the idea is good on any interactive user interface.

The main thing to keep in mind with search as you type is that the last word is likely to be partially typed, unless maybe some timeout expired since the user's last keystroke.

Xapian comes into help here, as it allows us to expand the partially typed word into an OR query with all the terms that start with it. This means that if we are typing, for example, "progr", we can turn the query into "program OR programmer OR programming OR programmed [...and so on...]".

I won't show the UI code, except a simple input loop that triggers the query at every keystroke:

    def mainloop(self):
        while True:
            c = self.win.getch()
            self.line += chr(c)
            self.results.update(self.line)

The interesting part is in the update function.

First we split the line in words and convert the words into a query:

        # Split the line in words
        args = self.splitline.split(line)
        # Convert the words into terms for the query
        terms = termsForSimpleQuery(args)

Then we expand the last word with all possible completions:

        # Since the last word can be partially typed, we add all words that
        # begin with the last one.
        terms.extend([x.term for x in db.allterms(args[-1])])

Now we can build the query. Of course you can add all other sorts of things to the query, for example a boolean expression of tag filter like in axi-query-pkgtype.py; Xapian will cope.

        # Build the query
        query = xapian.Query(xapian.Query.OP_OR, terms)

Finally the query. For bonus points you can do the adaptive cutoff trick to discard bad results.

In my case, since I don't implement scrolling of results, I also limit them to what fits in the window:

        # Retrieve as many results as we can show
        mset = enquire.get_mset(0, self.size - 1)

Finally, draw the results on screen:

        # Redraw the window
        self.win.clear()

        # Header
        self.win.addstr(0, 0, "%i results found." % mset.get_matches_estimated(), curses.A_BOLD)

        # Results
        for y, m in enumerate(mset):
            # /var/lib/apt-xapian-index/README tells us that the Xapian document data
            # is the package name.
            name = m[xapian.MSET_DOCUMENT].get_data()

            # Get the package record out of the Apt cache, so we can retrieve the short
            # description
            pkg = cache[name]

            # Print the match, together with the short description
            self.win.addstr(y+1, 0, "%i%% %s - %s" % (m[xapian.MSET_PERCENT], name, pkg.summary))

        self.win.refresh()

That's it, try it out.

You can use the wsvn interface to get to the full source code and the module it uses.

You can see a similar technique working in goplay, where it is also integrated with an interactive tag filter.

Posted Tue 06 Nov 2007 23:02:46 CET Tags: debtags
apt-xapian-index: smart way of querying tags

apt-xapian-index: smart way of querying tags

I've recently posted:

Note that I've rewritten all the old posts to only show the main code snippets: if you were put off by the large lumps of code, you may want to give it another go.

Today I'll show how to implement a really good way of searching for Debtags tags. When I say really good, I mean the sort of good that after you run it you wonder how could it possibly manage to do it.

The idea is simple: you run a package search, but instead of showing the resulting packages, you ask Xapian to suggest tags like we saw in axi-query-expand.py.

For extra points, I'll use an adaptive cutoff in chosing the packages that go in the rset.

So, let's ask the user to enter some keywords to look for tags, and use them to run a normal package query:

# Build the base query
query = xapian.Query(xapian.Query.OP_OR, termsForSimpleQuery(args))

# Perform the query
enquire = xapian.Enquire(db)
enquire.set_query(query)

Now, instead of showing the results of the query, we ask Xapian what are the tags in the index that are most relevant to this search.

First, we pick some representative packages for the expand:

# Use an adaptive cutoff to avoid to pick bad results as references
matches = enquire.get_mset(0, 1)
topWeight = matches[0].weight
enquire.set_cutoff(0, topWeight * 0.7)

# Select the first 10 documents as the key ones to use to compute relevant
# terms
rset = xapian.RSet()
for m in enquire.get_mset(0, 10):
    rset.add_document(m[xapian.MSET_DID])

Then we define the filter that only keeps tags:

# Filter out all the keywords that are not tags
class Filter(xapian.ExpandDecider):
    def __call__(self, term):
        "Return true if we want the term, else false"
        return term[:2] == "XT"

Then we print the tags:

# This is the "Expansion set" for the search: the 10 most relevant terms that
# match the filter
eset = enquire.get_eset(10, rset, Filter())

# Print out the results
for res in eset:
    print "%.2f %s" % (res.weight, res.term[2:])

That's it. We turned a package search into a tag search, and this allows us to search for tags using keywords that are not present in the tag descriptions at all:

$ ./axi-query-tags.py explore the dungeons
27.50 game::rpg:rogue
26.14 use::gameplaying
17.53 game::rpg
10.27 uitoolkit::ncurses
...

$ ./axi-query-tags.py total world domination
7.55 use::gameplaying
5.68 x11::application
5.35 interface::x11
5.05 game::strategy
...

You can use the wsvn interface to get to the full source code and the module it uses.

You can see a similar technique working in the Debtags tag editor: enter a package, then choose "Available tags: search".

Next in the series: search as you type.

Posted Tue 06 Nov 2007 23:02:46 CET Tags: debtags
apt-xapian-index: adaptive quality cutoff

apt-xapian-index: adaptive quality cutoff

I've recently posted:

Note that I've rewritten all the old posts to only show the main code snippets: if you were put off by the large lumps of code, you may want to give it another go.

Today I'll show how to implement an adaptive cutoff to get rid of the worse results.

Recall that Xapian shows results by decreasing order of quality, and as we pull more an more results out, we reach a point where the matches are so approximated that they look random.

This can be a problem if we want to change the order of the result, for example we may want to sort by package size, or by popcon popularity. There are many scenarios in which a really bad match could end up at the top of the results.

For most cases, you just want to say "discard all results whose quality is less than 70%". But sometimes you have queries that OR lots of terms, and even your top result, while still being a very good result, may be below the cutoff you decided.

Implementing an adaptive cutoff is extremely simple: first, you get the quality estimate of the top result:

# Retrieve the first result, and check its relevance
matches = enquire.get_mset(0, 1)
topWeight = matches[0].weight

Then you tell Xapian that you want a cutoff value that is, for example, 70% of that:

# Tell Xapian that we only want results that are at least 70% as good as that
enquire.set_cutoff(0, topWeight * 0.7)

Finally, you repeat the query. If you want, you can go for bigger result sets, as the cutoff will make it so that if you have lots of results, they will very likely be all good results:

matches = enquire.get_mset(0, 200)

This is it.

You can use the wsvn interface to get to the full source code and the module it uses.

Next in the series: smart way of querying tags.

Posted Sat 27 Oct 2007 21:51:43 CEST Tags: debtags
apt-xapian-index: performing a simple query

apt-xapian-index: performing a simple query

I've recently posted an introduction of apt-xapian-index.

Today I'll show how to make simple queries to apt-xapian-index. If you feel like reimplementing my examples in another language, let me know and I'll include it to the post.

What I'm going to build is a replacement for apt-cache search that:

This is just a beginning: in future blog posts I'll show how to enhance a search with interesting advanced features.

First thing, we need to import the Xapian module, found in the package python-xapian. Documentation on the Python Xapian API can be found in /usr/share/doc/python-xapian.

import xapian

Then we open the apt-xapian-index database:

# Instantiate a xapian.Database object for read only access to the index
db = xapian.Database("/var/lib/apt-xapian-index/index")

Now we build a query from the command line arguments. We'll assume that if an argument is in the form foo::bar, then it's a Debtags tag instead of a normal keyword.

For normal keywords, we also search for the stemmed version, so that we can, for example, find "editing" when the user searches for "edit".

# Stemmer function to generate stemmed search keywords
stemmer = xapian.Stem("english")

# Build the terms that will go in the query
terms = []
for word in args:
    if word.islower() and word.find("::") != -1:
        # According to /var/lib/apt-xapian-index/README, Debtags tags are
        # indexed with the 'XT' prefix.
        terms.append("XT"+word)
    else:
        # If it is not a Debtags tag, then we consider it a normal keyword.
        word = word.lower()  # The index stores keyword all in lowercase
        terms.append(word)
        # If the word has a stemmed version, add it to the query.
        # /var/lib/apt-xapian-index/README tells us that stemmed terms have a
        # 'Z' prefix.
        stem = stemmer(word)
        if stem != word:
            terms.append("Z"+stem)

Now we have the terms for the query, and we can create a query that ORs them together.

One may ask, why OR and not AND? The reason is that, contrarily to apt-cache, Xapian scores results according to how well they matched.

Matches that match all the terms will score higher than the others, so if we build an OR query what we really have is an AND query that gracefully degenerates to closer matches when they run out of perfect results.

This allows stemmed searches to work nicely: if you look for 'editing', then the query will be 'editing OR Zedit'. Packages with the word 'editing' will match both and score higher, and packages with the word 'edited' will still match 'Zedit' and get included in the results.

# OR the terms together into a Xapian query.
#
query = xapian.Query(xapian.Query.OP_OR, terms)

We then run the query. Queries are run through a xapian.Enquire object:

# Perform the query
enquire = xapian.Enquire(db)
enquire.set_query(query)

The Enquire object returns results as an mset. An mset represents a view of the result set, and can be iterated to access the resulting documents. Here we iterate the mset and output the result of the query, looking up short descriptions with apt:

# Display the top 20 results, sorted by how well they match
matches = enquire.get_mset(0, 20)
print "%i results found." % matches.get_matches_estimated()
print "Results 1-%i:" % matches.size()
for m in matches:
    # /var/lib/apt-xapian-index/README tells us that the Xapian document data
    # is the package name.
    name = m[xapian.MSET_DOCUMENT].get_data()

    # Get the package record out of the Apt cache, so we can retrieve the short
    # description
    pkg = cache[name]

    # Print the match, together with the short description
    print "%i%% %s - %s" % (m[xapian.MSET_PERCENT], name, pkg.summary)

This is it.

You can use the wsvn interface to get to the full source code.

You can run the code passing keywords and Debtags tags. Try running it as:

    ./axi-query-simple.py role::program image edit
    ./axi-query-simple.py role::program game::arcade
    ./axi-query-simple.py kernel image

You can search Debtags tags using debtags tagsearch. In a later blog post, I'll show how to use apt-xapian-index to implement a better-than-you-would-ever-have-thought-possible tag search.

Next in the series: adding simple result filters to the query.

Posted Fri 26 Oct 2007 16:43:21 CEST Tags: debtags
apt-xapian-index: searching for similar packages

apt-xapian-index: searching for similar packages

I've recently posted:

Today I'll show how to abuse Xapian to show a list of packages similar to a given one.

This time I'll try just linking to the code in wsvn and showing in the blog only show the most important bits.

So, we have a package name, and we want to show what are the packages similar to that one.

To do it, we simply build a big OR query with all the terms indexed for that package: Xapian will show us the packages whose terms are most similar, and that does the trick.

This works because Xapian gives us the best results first, therefore even if no package except the given one will give an exact match, we still get the nearest matches first.

In order to get the list of indexed terms given a package name we need to do two things:

  1. Get the Xapian document for the package.
  2. Get the termlist of the document.

To get the Xapian document we search for a term that only that document can have. In the index, the package name is indexed with the special prefix "XP", so we can search for that:

def docForPackage(pkgname):
    "Get the document corresponding to the package with the given name"
    # Query the term with the package name
    query = xapian.Query("XP"+pkgname)
    enquire = xapian.Enquire(db)
    enquire.set_query(query)
    # Get the top result only
    matches = enquire.get_mset(0, 1)
    if matches.size() == 0:
        return None
    else:
        m = matches[0]
        return m[xapian.MSET_DOCUMENT]

Then we build the big term list, by iterating the termlist of the document:

# Build a term list with all the terms in the given packages
terms = []
# Get the document corresponding to the package name
doc = docForPackage(pkgname)
if not doc: continue
# Retrieve all the terms in the document
for t in doc.termlist():
    if len(t.term) < 2 or t.term[:2] != 'XP':
        terms.append(t.term)

Note that it's trivial to fetch terms from more than one document, if you want to query "all packages a bit like this one and a bit like that one", although that's less of a useful feature.

Lastly, we build the final query:

# Build the big OR query
query = xapian.Query(xapian.Query.OP_AND_NOT,
            # Terms we want
            xapian.Query(xapian.Query.OP_OR, terms),
            # AND NOT the input packages
            xapian.Query("XP"+pkgname))

I add an AND_NOT part with the input package name so that we don't get in the output the package that we asked for.

This is it:

$ ./axi-query-similar.py debtags
20309 results found.
Results 1-20:
33% debtags-edit - GUI browser and editor for Debian Package Tags
27% tagcolledit - GUI editor for tagged collections
25% libtagcoll2-dev - Functions used to manipulate tagged collections (development version)
24% tagcoll - Commandline tool to perform operations on tagged collections
19% packagesearch - GUI for searching packages and viewing package information
18% doodle - Desktop Search Engine (client)
18% doodled - Desktop Search Engine (daemon)
18% libept0 - High-level library for managing Debian package information
18% upgrade-system - system upgrader from Konflux
18% libept-dev - High-level library for managing Debian package information
17% ept-cache - Commandline tool to search the package archive
16% tracker-utils - metadata database, indexer and search tool - commandline tools
[...]

You can use the wsvn interface to get to the full source code and the module it uses.

Next in the series: adaptive quality cutoff.

Posted Fri 26 Oct 2007 16:40:33 CEST Tags: debtags
Introducing apt-xapian-index

Introducing apt-xapian-index

apt-xapian-index has just been approved into experimental, and in the next days I'm going to blog more about it.

The package contains a tool called update-apt-xapian-index that indexes Debian package metadata into a Xapian index located at /var/lib/apt-xapian-index/index.

The index is read-only, except for update-apt-xapian-index; however, it is world-readable: every user can query it, all the time, at the same time, even during index updates.

The index can contain more than package descriptions. update-apt-xapian-index indexes data using plugins located in /usr/share/apt-xapian-index/plugins, and any package can add their own. For example, debtags will provide a plugin to index tags.

Since Xapian can index numeric values as well, if anyone makes a popcon package that downloads popcon information, they can provide a plugin to index popcon values. If anyone makes an iterating package that downloads ratings, they can provide a plugin to index ratings.

Another plugin could be a specialised Debian stemmer that generates token such as foo out of libfoo-dev.

I think you get the idea: it's very extensible. You can have a look at the initial set of plugins in subversion.

The index is also self-documenting, so that one can keep track of all the intresting things that can be found in it. update-apt-xapian-index does not only maintain the index, but also the file /var/lib/apt-xapian-index/README that aggregates documentation provided by the plugins.

To query the index, you just use Xapian. Debian contains Xapian bindings for various languages:

In the next days I am going to post various example queries and interesting tricks that the index allows you to do.

It's going to be fun.

Next in the series: performing a simple query.

Posted Fri 26 Oct 2007 15:52:32 CEST Tags: debtags
apt-xapian-index: suggesting new terms for improving the query

apt-xapian-index: suggesting new terms for improving the query

I've recently posted an introduction of apt-xapian-index, an example of how to query it and a way to add simple result filters to the query.

Today I'll show how to use a very interesting feature of Xapian: computing a list of the best terms to use to improve the query. I'll expand apt-query-pkgtype.py to write, after the results, the suggested terms.

If you feel like reimplementing my examples in another language, let me know and I'll include it to the post.

Let's start with a few definitions taken from the Xapian glossary:

RSet (Relevance Set)

The Relevance Set (RSet) is the set of documents which have been marked by the user as relevant. They can be used to suggest terms that the user may want to add to the query (these terms form an ESet), and also to adjust term weights to reorder query results.

ESet (Expand Set)

The Expand Set (ESet) is a ranked list of terms that could be used to expand the original query. These terms are those which are statistically good differentiators between relevant and non-relevant documents.

What I'm doing now is: after printing the normal result, I build an rset with all the results. For better results one can ask the user to pick the results they like (make a GUI with that: it should rock!), but starting with the top results in the RSet is a good enough default:

# Now, we ask Xapian what are the terms in the index that are most relevant to
# this search.  This can be used to suggest to the user the most useful ways of
# refining the search.

# Select the first 10 documents as the key ones to use to compute relevant
# terms (matches is the mset returned by a normal query)
rset = xapian.RSet()
for m in matches:
    rset.add_document(m[xapian.MSET_DID])

Then I use the RSet to compute the ESet, and display the results to the user:

# This is the "Expansion set" for the search: the 10 most relevant terms
eset = enquire.get_eset(10, rset)

# Print it out.  Note that some terms have a prefix from the database: can we
# filter them out?  Indeed: Xapian allow to give a filter to get_eset.
# Read on...
print
print "Terms that could improve the search:",
print ", ".join(["%s (%.2f%%)" % (res.term, res.weight) for res in eset])

You can also abuse this feature to show what are the tags that are most related to the search results. This allows you to turn a search based on keywords to a search based on semantic attributes, which would be an absolutely stunning feature in a GUI.

We can do it thanks to Xapian allowing to specify a filter for the output of get_eset. This filter filters out all the keywords that are not tags, or that were in the list of query terms:

class Filter(xapian.ExpandDecider):
    def __call__(self, term):
        "Return true if we want the term, else false"
        return term[:2] == "XT"

I just rerun get_eset like above, but adding the filter, to show only a suggestion of tags:

# This is the "Expansion set" for the search: the 10 most relevant terms that
# match the filter
eset = enquire.get_eset(10, rset, Filter())

# Print out the resulting tags
print
print "Tags that could improve the search:",
print ", ".join(["%s (%.2f%%)" % (res.term[2:], res.weight) for res in eset])

sys.exit(0)

This is it.

You can use the wsvn interface to get to the full source code and the module it uses. Try running it as:

axi-query-expand.py --type=gui image editor
axi-query-expand.py --type=cmdline image editor
axi-query-expand.py --type=game image editor

you will see that as the results change, the suggestions change as well.

I'm happy to see that Xapian often suggests tags: this means that tags are useful discriminators in search results, and that all the work done on Debtags has been a good work.

Notice that the suggestions change as you change any of the terms, the custom filter, and the tags you put in the search. Try to imagine how such a thing could improve Synaptic. What if I told you that Xapian is fast enough that all of this can happen live while you type?

Next in the series: searching for similar packages.

Posted Fri 26 Oct 2007 15:50:18 CEST Tags: debtags
apt-xapian-index: adding simple result filters to the query

apt-xapian-index: adding simple result filters to the query

I've recently posted an introduction of apt-xapian-index and an example of how to query it.

Today I'll show how to add simple Debtags-based result filters to a query. I'll expand axi-query-simple.py with a --type comman line option that allows to filter for package type. As a starter I'll implement filtering for "game", "gui", "cmdline" or "editor".

If you feel like reimplementing my examples in another language, let me know and I'll include it to the post.

The filters are implemented using Xapian queries. Xapian allows to build queries as well as combine existing queries using a large amount of operators.

This allows us to create a little database of ready made queries:

# This is our little database of simple Debtags filters we provide: the name
# entered by the user in "--type" maps to a piece of Xapian query
filterdb = dict(
    # We can do simple AND queries...
    game = xapian.Query(xapian.Query.OP_AND, ('XTuse::gameplaying', 'XTrole::program')),
    # Or we can do complicate binary expressions...
    gui = xapian.Query(xapian.Query.OP_AND, xapian.Query('XTrole::program'),
        xapian.Query(xapian.Query.OP_OR, 'XTinterface::x11', 'XTinterface::3d')),
    cmdline = xapian.Query(xapian.Query.OP_AND, 'XTrole::program', 'XTinterface::commandline'),
    editor = xapian.Query(xapian.Query.OP_AND, 'XTrole::program', 'XTuse::editing'))
    # Feel free to invent more

It's now trivial to lookup a query and AND it to the main query:

# Build the query by ORing together various terms
query = xapian.Query(xapian.Query.OP_OR, termlist)

# If a filter was requested, AND it with the query
if extrafilter:
    query = xapian.Query(xapian.Query.OP_AND, query, filterdb[extrafilter])

That's it. Simple, trivial feature to add to a search, but very useful and simple to use.

You can use the wsvn interface to get to the full source code and the module it uses. Try running it as:

axi-query-pkgtype.py --type=gui image editor
axi-query-pkgtype.py --type=cmdline image editor
axi-query-pkgtype.py --type=game image editor

you will see that the results change radically.

What are the games that match "image editor", you would ask?

74% xjig - An X11 jigsaw puzzle
69% wmpuzzle - WindowMaker dock app 4x4 puzzle
66% xflip - programs to mirror-image or melt your display

Jigsaws, of course :)

The code is now colourised thanks to Simon Ward who pointed me at the syntax plugin for ikiwiki.

Next in the series: suggesting new terms for improving the query.

Posted Fri 26 Oct 2007 15:50:18 CEST Tags: debtags
Filtering Debian Packages file

Filtering Debian Packages file

pkgprune - Filter a Packages file

Usage: pkgprune Packages > Packages.filtered

Filters a Debian Packages file by removing either a list of packages (and
their reverse dependencies) or the packages that have given tags.

Dependencies in the filtered Packages file will still be satisfied.

Available via:

wget http://people.debian.org/~enrico/2007-08/pkgprune/pkgprune

or:

hg clone static-http://people.debian.org/~enrico/2007-08/pkgprune

The top of the script has a configuration section:

#
# Configuration
#

# Set of packages you don't want to have (the reverse dependencies will be
# removed as far as it's possible)
blacklist_pkgs = set(['gconf2'])

# List of sets of tags; the packages whose tag sets contain one of these tag
# sets will be removed
blacklist_tags = [ set(['implemented-in::php']) ]

Edit that before running the script. The idea is to be able to configure a system to run such a script after apt-get update, both to make dpkg faster on slow arches and to reduce the amount of packages displayed by an interface.

Posted Mon 27 Aug 2007 12:17:04 CEST Tags: debtags
Debtags prod pages

Debtags prod pages

Are you a Debian maintainer?

Have a look at the Debtags prod page: it lists debtags statistics by maintainer, and you can sort it:

Look at yourself in the list: are you doing well? You want to improve? Click on your name and you'll end up in your own "Go tagging!" page, where you can tag all your packages and get suggestions by our various happy AI engines.

If you're not in the list it means that all your packages are somehow tagged.

If you still want to review the tags of your packages, hit the "Debtags" link in your DDPO page.

The prod pages are updated every 6 hours. The "Go tagging!" page is updated in realtime.

Posted Thu 05 Jul 2007 00:06:54 CEST Tags: debtags