Pages about Debtags.

fuss-launcher: an application launcher built on apt-xapian-index

Long ago I blogged about using apt-xapian-index to write an application launcher.

Now I just added a couple of new apt-xapian-index plugins that look like they have been made just for that.

In fact, they have indeed been made just for that.

After my blog post in 2008, people from Truelite and the FUSS project took up the challenge and wrote a launcher applet around my example engine.

The prototype has been quite successful in FUSS, and as a consequence I've been asked (and paid) to bring in some improvements.

The result, which I have just uploaded to NEW, is a package called fuss-launcher:

* New upstream release
   - Use newer apt-xapian-index: removed need of local index
   - Dragging a file in the launcher shows the applications that can open it
   - Remembers the applications launched more frequently
   - Allow to set a list of favourite applications

To get it:

  • apt-get install fuss-launcher (after it passed NEW);
  • or git clone http://git.fuss.bz.it/git/launcher.git/ and apt-get install python-gtk2 python-xapian python-xdg apt-xapian-index app-install-data

It requires apt-xapian-index >= 0.35.

To try it:

  1. Make sure your index is up to date, especially if you just installed app-install-data: just run update-apt-xapian-index as root.
  2. Run fuss-launcher.
  3. Click on the new tray icon to open the launcher dialog.
  4. Type some keywords and see the list of matching applications come to life as you type.

It's worth mentioning again that all this work was sponsored by Truelite and the FUSS project, which rocks.

Some screenshots:

When you open the launcher, by default it shows the most frequently started applications and the favourite applications:

launcher just opened

When you type some keywords, you get results as you type, and context-sensitive completion:

keyword search

When you drag a file on the launcher you only see the applications that can open that file:

drag files to the launcher

Posted Mon May 17 10:41:09 2010 Tags: debtags

New apt-xapian-index plugins

Besides a fair bit of refactoring and cleanup, I've recently added two new plugins to apt-xapian-index:

app-install

If app-install-data is installed, information about .desktop files will now enter the index.

This makes it possible, for example, to limit query results to only those packages that contain .desktop files, which is quite useful for building desktop-oriented package managers.

aliases

It reads term -> aliases mappings from files in /etc/apt-xapian-index/aliases/ or /usr/share/apt-xapian-index/aliases/, and feeds them into the index as synonyms.

apt-xapian-index ships an example alias file, to give people who know the wrong software names a chance to find the right ones:

# Aliases expanding names of popular applications

excel       XToffice::spreadsheet
powerpoint  XToffice::presentation
photoshop   XTworks-with::image:raster
coreldraw   XTworks-with::image:vector
autocad     XTworks-with::3dmodel

Notice how it is possible to use index terms that happen to be Debtags tags as synonyms, which yields better results, language independence and extra coolness.
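
As a rough illustration of the file format, here is a minimal Python sketch that reads alias lines into a term -> synonyms mapping. It assumes that the first word of each line is the term, that the remaining words are its aliases, and that lines starting with # are comments; parse_aliases is a hypothetical helper, not the plugin's actual code.

```python
def parse_aliases(lines):
    """Map each term to the list of aliases that follow it on the same line.

    Hypothetical helper: the parsing rules are an assumption based on the
    example alias file, not the plugin's real implementation.
    """
    aliases = {}
    for line in lines:
        line = line.strip()
        # Skip empty lines and comments
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) < 2:
            continue  # a term without aliases carries no information
        aliases.setdefault(fields[0], []).extend(fields[1:])
    return aliases


sample = """\
# Aliases expanding names of popular applications

excel       XToffice::spreadsheet
photoshop   XTworks-with::image:raster
"""

synonyms = parse_aliases(sample.splitlines())
for term, expansions in sorted(synonyms.items()):
    print(term, "->", ", ".join(expansions))
```

With a mapping like this, a search for a term such as excel could also match packages indexed under the corresponding Debtags term.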

Posted Mon May 17 00:17:55 2010 Tags: debtags

apt-xapian-index now comes with a query tool

I've just uploaded a new version of apt-xapian-index to unstable. Now it comes with a little query tool called axi-cache.

You can search this way:

axi-cache search foo bar baz facet::tag sec:section

In fact, you can use most of the things described here.

You can then say axi-cache more to get more results, or axi-cache again to retry a search, or axi-cache again wibble wabble to add keywords to the last search.

This lets you start with a search and then tweak it. For this to work, axi-cache needs to save the last search so that again or more can amend it. Searches are saved in ~/.cache/axi-cache.state.

You can search tags instead of packages by adding --tags.

It will suggest extra terms for the search, and also suggest extra tags. It can even correct spelling mistakes in the query terms once the index has been rebuilt with this new version of update-apt-xapian-index.

I need to thank Carl Worth who, with notmuch, reminded me that if I just build a nice interface on top of Xapian's query parser I go quite a long way towards making a Xapian database extremely useful indeed.

axi-cache also integrates with bash-completion so that tab completion is context-sensitive to the command line being typed:

$ axi-cache search image pro
probability      process          processors       programmability  provides         
problem          processing       production       pronounced       proving  
$ axi-cache search kernel pro
problems     processor    production   proved       provided     
processing   processors   programming  provide      provides     

Thanks to David Paleino who wrote the bash completion script.

Just for reference, this is the command line help:

$ axi-cache help
Usage: axi-cache [options] command [args]

Query the Apt Xapian index.

Commands:
  axi-cache help            show a summary of commands
  axi-cache search [terms]  start a new search
  axi-cache again [query]   repeat the last search, possibly adding query terms
  axi-cache more [count]    show more terms from the last search

Options:
  --version             show program's version number and exit
  -h, --help            show this help message and exit
  -s SORT, --sort=SORT  sort by the given value, as listed in /var/lib/apt-
            xapian-index/values
  --tags                show matching tags, rather than packages
  --tabcomplete=TYPE    suggest words for tab completion of the current
            command line (type is 'plain' or 'partial')

If you install the package for the first time, you may need to rebuild the index by running update-apt-xapian-index as root before using axi-cache.

Posted Sat Apr 10 00:23:18 2010 Tags: debtags

Introducing apt-xapian-index

apt-xapian-index has just been approved into experimental, and in the next few days I'm going to blog more about it.

The package contains a tool called update-apt-xapian-index that indexes Debian package metadata into a Xapian index located at /var/lib/apt-xapian-index/index.

The index is read-only, except for update-apt-xapian-index; however, it is world-readable: every user can query it, all the time, at the same time, even during index updates.

The index can contain more than package descriptions. update-apt-xapian-index indexes data using plugins located in /usr/share/apt-xapian-index/plugins, and any package can add their own. For example, debtags will provide a plugin to index tags.

Since Xapian can index numeric values as well, if anyone makes a popcon package that downloads popcon information, they can provide a plugin to index popcon values. If anyone makes an iterating package that downloads ratings, they can provide a plugin to index ratings.

Another plugin could be a specialised Debian stemmer that generates tokens such as foo out of lib*foo*-dev.

I think you get the idea: it's very extensible. You can have a look at the initial set of plugins in subversion.

The index is also self-documenting, so that one can keep track of all the interesting things that can be found in it. update-apt-xapian-index not only maintains the index, but also the file /var/lib/apt-xapian-index/README, which aggregates the documentation provided by the plugins.

To query the index, you just use Xapian. Debian contains Xapian bindings for various languages:

In the next few days I am going to post various example queries and interesting tricks that the index allows you to do.

It's going to be fun.

Next in the series: performing a simple query.

Posted Sat Jun 6 00:57:39 2009 Tags: debtags

Debtags interesting times

An unbelievable amount of interesting, fun and bleeding-edge things to do are coming out with Debtags.

Erich did a JavaScript tag cloud.

Interactive tag clouds are probably the coolest thing to provide as a link for packages.debian.org when people click on a tag. They can also replace the current navigator we have at http://debtags.alioth.debian.org/cgi-bin/index.cgi

I started using Xapian, which is cool. It does fast text search with all the cool things like stemming and whatnot. Check out a prototype package search interface using it.

Xapian can also find packages similar to a given one by looking at the description, which is cool. I can already use it to suggest tags, until we manage to put some of the bayesian code we are accumulating into production.

Plus, I had an IRC chat with a Xapian developer: he was really nice. We had one of those neat chats in which really cool ideas keep coming out and it won't be so hard to implement even and wow, that'd be so cool, let's do it!

Then there's the new web-based tag editing interface to write, to replace the old one. I can add the smart search idea to it, so that it displays tags matching the given keywords: that would give suggestions that automatically follow existing tagging practices, which rocks.

The C++ daemon backend that we now have for the website is great because it lets us quickly intermix tag updates and queries, and still hope to scale. Maybe it means that soon we can finally start handling debtags mail submissions on Alioth as well.

Distro-wise I have a plan to finally get rid of apt-index-watcher. The plan is already working on my computer, and just needs some more tests before hitting unstable.

I'd also like to have the Python Debtags interface packaged somewhere, so that people who don't grok C++ can have fun with the Debtags data. Work is underway.

Then there's the debtags-updatecontrol experiment, which didn't go quite as I expected, but still went well. I've already sent my first tag override update to aj incorporating data from control files.

Other ideas? For example, implementing Xapian indexing in libept, and finally have a package manager with an amazingly fast search-as-you-type.

And that, as a side effect, would create a Xapian package description in /var/something that is ready for any installed software to use.

Then there's popcon. That's another piece of data to make available to package managers, and I already know how to do it.

With Phil Hands at Debconf6 we had also started designing a way to use the unaggregated popcon database to implement an Amazon-style functionality that would go like "users who have a system similar to yours also have packages X and Y installed: would you like to have a look at them?"

Joey Hess suggested filtering popcon analysis data with debtags, so that one can get suggestions limited, for example, to game packages only. Oh so cool!

And this all feels like just a beginning...

Posted Sat Jun 6 00:57:39 2009 Tags: debtags

apt-xapian-index: search as you type

I've recently posted:

Note that I've rewritten all the old posts to only show the main code snippets: if you were put off by the large lumps of code, you may want to give it another go.

Today I'll show how to implement a very attractive feature for a user interface: search as you type. The idea is that you don't need to press enter to fire up a query: instead, the results materialise in front of your eyes as you type.

The example I created uses curses, but the idea is good on any interactive user interface.

The main thing to keep in mind with search as you type is that the last word is likely to be partially typed, unless maybe some timeout expired since the user's last keystroke.

Xapian comes to our help here, as it allows us to expand the partially typed word into an OR query with all the terms that start with it. This means that if we are typing, for example, "progr", we can turn the query into "program OR programmer OR programming OR programmed [...and so on...]".
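
The expansion can be pictured without Xapian at all. This is a minimal sketch, assuming the index terms are available as a sorted list: expand_prefix and index_terms are made-up stand-ins, and a real implementation would ask the Xapian database for the terms instead.

```python
import bisect


def expand_prefix(sorted_terms, prefix):
    """Return every term of a sorted list that starts with the given prefix."""
    # All terms sharing a prefix are contiguous in a sorted list, so we can
    # jump to the first candidate and stop at the first mismatch
    start = bisect.bisect_left(sorted_terms, prefix)
    matches = []
    for term in sorted_terms[start:]:
        if not term.startswith(prefix):
            break
        matches.append(term)
    return matches


# Stand-in for the terms stored in the index
index_terms = sorted([
    "image", "process", "program", "programmed",
    "programmer", "programming", "provide",
])

words = "image progr".split()
# Complete words are used as they are; only the last one is expanded
terms = words[:-1] + expand_prefix(index_terms, words[-1])
print(" OR ".join(terms))
```

Only the last word is expanded, since the earlier words are assumed to be complete.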

I won't show the UI code, except a simple input loop that triggers the query at every keystroke:

    def mainloop(self):
        while True:
            c = self.win.getch()
            self.line += chr(c)
            self.results.update(self.line)

The interesting part is in the update function.

First we split the line into words and convert the words into a query:

        # Split the line in words
        args = self.splitline.split(line)
        # Convert the words into terms for the query
        terms = termsForSimpleQuery(args)

Then we expand the last word with all possible completions:

        # Since the last word can be partially typed, we add all words that
        # begin with the last one.
        terms.extend([x.term for x in db.allterms(args[-1])])

Now we can build the query. Of course you can add all other sorts of things to the query, for example a boolean expression of tag filter like in axi-query-pkgtype.py; Xapian will cope.

        # Build the query
        query = xapian.Query(xapian.Query.OP_OR, terms)

Then we run the query. For bonus points you can do the adaptive cutoff trick to discard bad results.

In my case, since I don't implement scrolling of results, I also limit them to what fits in the window:

        # Retrieve as many results as we can show
        mset = enquire.get_mset(0, self.size - 1)
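
The details of the adaptive cutoff trick are not shown here. One plausible reading, sketched below on plain (name, score) pairs rather than on a Xapian mset, is to drop every result that scores below a fraction of the best match. The adaptive_cutoff helper, the 0.7 fraction and the scores are assumptions for illustration only.

```python
def adaptive_cutoff(results, fraction=0.7):
    """Keep the results scoring at least `fraction` of the best score.

    `results` is a list of (name, score) pairs sorted by decreasing score.
    The default fraction is an arbitrary choice for this sketch.
    """
    if not results:
        return []
    threshold = results[0][1] * fraction
    return [(name, score) for name, score in results if score >= threshold]


# Made-up scores, best match first
results = [("gimp", 9.5), ("krita", 8.1), ("inkscape", 6.9), ("linux-image", 2.3)]
for name, score in adaptive_cutoff(results):
    print(name, score)
```

The threshold adapts to each query: a query with one excellent match discards more of the tail than a query where every match is mediocre.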

Finally, draw the results on screen:

        # Redraw the window
        self.win.clear()

        # Header
        self.win.addstr(0, 0, "%i results found." % mset.get_matches_estimated(), curses.A_BOLD)

        # Results
        for y, m in enumerate(mset):
            # /var/lib/apt-xapian-index/README tells us that the Xapian document data
            # is the package name.
            name = m[xapian.MSET_DOCUMENT].get_data()

            # Get the package record out of the Apt cache, so we can retrieve the short
            # description
            pkg = cache[name]

            # Print the match, together with the short description
            self.win.addstr(y+1, 0, "%i%% %s - %s" % (m[xapian.MSET_PERCENT], name, pkg.summary))

        self.win.refresh()

That's it, try it out.

You can use the wsvn interface to get to the full source code and the module it uses.

You can see a similar technique working in goplay, where it is also integrated with an interactive tag filter.

Next in the series: dynamic tag cloud.

Posted Sat Jun 6 00:57:39 2009 Tags: debtags

apt-xapian-index: dynamically generated tag clouds

About apt-xapian-index, I have already posted:

Today I'll show how to create tag clouds. Not only that, but I'll show how to implement tag clouds that change as the user types a query.

This example uses python-gtk, and has been created together with Matteo Zandi.

axi-searchcloud screenshot

Generating a tag cloud out of any Xapian query is simple: it is just a matter of presenting as a tag cloud the information that you get with the technique shown in a smart way of querying tags. You get the tags related to the query, and you lay out their names with a font size proportional to their Xapian rank.

For the presentation, we can load pretty names from the Debtags vocabulary in /var/lib/debtags/vocabulary:

from debian_bundle import deb822

# Facet name -> Short description
facets = dict()
# Tag name -> Short description
tags = dict()
for p in deb822.Deb822.iter_paragraphs(open("/var/lib/debtags/vocabulary", "r")):
    if "Description" not in p: continue
    desc = p["Description"].split("\n", 1)[0]
    if "Tag" in p:
        tags[p["Tag"]] = desc
    elif "Facet" in p:
        facets[p["Facet"]] = desc

The query then goes on as usual, and when we get the tags from the eset we also record their score and normalise it between 0 and 1. I found that computing the logarithm of scores helps to avoid having a tag cloud with a few huge tags and a lot of tiny tiny tags:

class Filter(xapian.ExpandDecider):
    def __call__(self, term):
        return term[:2] == "XT"

def format(k):
    if k in tags:
        facet = k.split("::", 1)[0]
        if facet in facets:
            return "<i>%s: %s</i>" % (facets[facet], tags[k])
        else:
            return "<i>%s</i>" % tags[k]
    else:
        return k

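import math

taglist = []
maxscore = None
for res in enquire.get_eset(15, rset, Filter()):
    # Normalise the score in the interval [0, 1]: the eset is sorted by
    # decreasing weight, so the first result carries the maximum score
    weight = math.log(res.weight)
    if maxscore is None: maxscore = weight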
    tag = res.term[2:]
    taglist.append(
        (tag, format(tag), float(weight) / maxscore)
    )
taglist.sort(key=lambda x:x[0])
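
To see why the logarithm helps, here is a small standalone comparison of linear and logarithmic normalisation. It is only a sketch: the normalise helper and the weights are made up, and it assumes every weight is greater than 1 so that its logarithm is positive.

```python
import math


def normalise(weights, use_log=True):
    """Scale weights into (0, 1] by dividing them by the largest one.

    Assumes every weight is greater than 1, so that logarithms are positive.
    """
    values = [math.log(w) if use_log else float(w) for w in weights]
    top = max(values)
    return [v / top for v in values]


# Made-up weights: one dominant tag and a long tail
weights = [400.0, 50.0, 12.0, 5.0]
linear = normalise(weights, use_log=False)
logarithmic = normalise(weights)
for w, lin, lg in zip(weights, linear, logarithmic):
    print("%6.1f  linear %.2f  log %.2f" % (w, lin, lg))
```

With linear scaling the smallest tag gets barely 1% of the largest font size, while with the logarithm it keeps more than a quarter of it, which gives a more readable cloud.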

Finally, you mark up a gtkhtml2.Document to display in a gtkhtml2 widget:

def mark_text_up(result_list):
    # result_list items: key (facet::tag), description, score in [0, 1]
    document = gtkhtml2.Document()
    document.clear()
    document.open_stream("text/html")
    document.write_stream("""<html><head>
<style type="text/css">
a { text-decoration: none; color: black; }
</style>
</head><body>""")
    for tag, desc, score in result_list:
        document.write_stream('<a href="%s" style="font-size: %d%%">%s</a> ' % (tag, score*150, desc))
    document.write_stream("</body></html>")
    document.close_stream()
    return document

That's it, try it out.

You can use the git web interface to get to the full source code and the module it uses.

Posted Sat Jun 6 00:57:39 2009 Tags: debtags

Evaluating programming languages for playing with Debtags

Since having workable bindings for the C++ Debtags libraries seems to be still a bit in the future, I'm planning to build a bit of native infrastructure in some higher level language. First step is seeing what language I could start playing with.

The problem

At the most basic level, in Debtags we have a number of packages, each of which has a set of tags.

The way I usually save tags is a file with the format:

package1, package2: tag1, tag2, tag3
package3: tag1, tag2

That is, every line has a list of packages with the same tags, and the list of their tags.
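
As an illustration of the grouping, this minimal sketch renders a package -> tags mapping in that format, putting packages with identical tag sets on the same line. The format_collection name and the sample data are made up for the example.

```python
from collections import defaultdict


def format_collection(db):
    """Render a package -> set of tags dict, one line per distinct tag set."""
    # Packages sharing exactly the same tags end up in the same group
    groups = defaultdict(list)
    for pkg, tags in db.items():
        groups[frozenset(tags)].append(pkg)
    lines = []
    for tags, pkgs in groups.items():
        lines.append("%s: %s" % (", ".join(sorted(pkgs)), ", ".join(sorted(tags))))
    return sorted(lines)


db = {
    "package1": {"tag1", "tag2", "tag3"},
    "package2": {"tag3", "tag2", "tag1"},
    "package3": {"tag1", "tag2"},
}
print("\n".join(format_collection(db)))
```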

Any script I'm going to write has to at least be able to parse the data into something like a package -> tags hash, and then print it out.

Let's see how Perl, Python and Ruby perform.

Tests

C++

The reference point for the experiment will be the C++ implementation, tagcoll:

$ time tagcoll copy package-tags > /dev/null
real    0m0.421s
user    0m0.412s
sys     0m0.000s

Perl

First attempt is with Perl, creating the script that parses into a hash of package => set of tags and prints the result.

There are set modules for Perl on CPAN, but I have none handy at the moment. However, since they are implemented using hashes, I can approximate them by using a hash.

Note that I also want to have a different copy of the tag set for every package, so that I can manipulate them in the future without unwanted side effects.

Here is the code:

#!/usr/bin/perl -w

use strict;

my %db;

# Read the tag database
while (<>)
{
    chomp();
    my ($pkgs, $tags) = split(': ');
    # Create the tagset using keys of a hash
    my %tags = map { $_ => undef } split(', ', $tags);
    for my $p (split(', ', $pkgs))
    {
        # Make a copy of the tagset
        $db{$p} = {%tags};
    }
}

# Write the tag database
while (my ($pkg, $tags) = each %db)
{
    print $pkg, ': ', join(', ', keys %$tags), "\n";
}

Here is the running time:

$ time ./parse.pl package-tags > /dev/null
real    0m0.448s
user    0m0.436s
sys     0m0.008s

Not so bad, comparable with tagcoll.

Python

Then comes Python. I'm not much of a Python fancier, but I'm rather attracted by the new set native type introduced with Python 2.4, which seems to have most of what I need nice and done.

Here is the script:

#!/usr/bin/python

import sys

input = sys.stdin
if len(sys.argv) > 1:
    input = open(sys.argv[1],"r")

# Read the tag database
db = {}
for line in input:
    # Is there a way to remove the last character of a line that does not
    # make a copy of the entire line?
    line = line.rstrip("\n")
    pkgs, tags = line.split(": ")
    # Create the tag set using the native set
    tags = set(tags.split(", "))
    for p in pkgs.split(", "):
        db[p] = tags.copy()

# Write the tag database
for pkg, tags in db.items():
    # Using % here seems awkward to me, but if I use calls to
    # sys.stdout.write it becomes a bit slower
    print "%s:" % (pkg), ", ".join(tags)

Here is the running time:

$ time ./parse.py  package-tags  > /dev/null
real    0m0.418s
user    0m0.376s
sys     0m0.036s

I'm pleased, very pleased. Using the native set seems to be not only handy, but efficient.

Ruby

Finally, Ruby. I like to use Ruby. In this case, however, it lacks a native set implementation, although it has a set module which is implemented using a hash.

Here is the script:

#!/usr/bin/ruby

require 'set'

infile = ARGV[0] ? File.new(ARGV[0]) : $stdin

# Read the tag database
db = {}
infile.each_line do |line|
    line = line.chomp
    pkgs, tags = line.split(": ")
    # Create the set using the Set module
    tags = Set.new(tags.split(", "))
    pkgs.split(", ").each do |p|
        # Plain assignment would share one Set between packages: store a copy
        db[p] = tags.dup
    end
end

# Write the tag database
db.each do |key, tags|
    # Ouch, Set does not do join by itself
    print key, ": ", tags.to_a.join(", "), "\n"
end

Here is the running time:

$ time ./parse.rb package-tags > /dev/null
real    0m1.637s
user    0m1.572s
sys     0m0.052s

I hope I got something wrong in the script, but I can't see what.

Results

As much as I don't fancy Python, it looks like it's currently the best choice for playing around with Debtags. I hope the native sets will bring me joy.

If in the future I'm asked "how come you chose Python for this Debtags thing?", I can point to this page.

Posted Sat Jun 6 00:57:39 2009 Tags: debtags

libept 0.5.3 hit unstable

I prepared a new toy to play with at Debconf and uploaded it to unstable:

Package: libept-dev
Description: High-level library for managing Debian package information
 The library defines a very minimal framework in which many sources of data
 about Debian packages can be implemented and queried together.
 .
 The library includes four data sources:
 .
  * APT: access the APT database
  * Debtags: access the Debtags tag information
  * Popcon: access Popcon package scores
  * TextSearch: fast Xapian-based full text search on package description
 .
 This is the development library.

Package: ept-cache
Description: Commandline tool to search the package archive
 ept-cache is a simple commandline interface to the functions of libept.
 .
 It can currently search and display data from four sources:
 .
  * The APT database
  * The Debtags tag information
  * Popcon package scores
  * A fast Xapian-based full text index on package descriptions

Yes, this finally brings lots of very cool data sources about packages together.

Try this one:

# Check if all data providers are active and give instructions on how
# to activate those that aren't
ept-cache info

# Follow the instructions to activate everything

# Show all GUI image editors, sorted by popularity, in reverse order
ept-cache search image editor -t gui -s p-

If you have the Xapian data provider enabled, the results of a search are given in relevance order, the most relevant first. And also, searches are done with proper stemming, so if you look for image editor it will also find image editing, although it would score image editor higher.

It's also quite lovely to work with it in C++. I'll improvise here a few examples:

Print name and short description of every package

#include <ept/apt/apt.h>
#include <ept/apt/packagerecord.h>
#include <iostream>

using namespace std;
using namespace ept::apt;

void playWithApt()
{
    // Apt data source
    Apt apt;

    // Parser of package records
    PackageRecord rec;

    // Iterate all package records
    for (Apt::record_iterator i = apt.recordBegin();
        i != apt.recordEnd(); ++i) 
    {
        rec.scan(*i);
        cout << rec.package() << " - " << rec.shortDescription() << endl;
    }
}

Show all image editors

#include <ept/debtags/debtags.h>
#include <set>

using namespace ept::debtags;

void playWithDebtags()
{
    // Apt data source
    Apt apt;
    // Parser of package records
    PackageRecord rec;
    // Debtags data source
    Debtags debtags;

    if (!debtags.hasData())
        return;

    set<Tag> tags;
    tags.insert(debtags.vocabulary().tagByName("works-with::image:raster"));
    tags.insert(debtags.vocabulary().tagByName("use::editing"));
    tags.insert(debtags.vocabulary().tagByName("role::program"));
    set<string> results = debtags.getItemsHavingTags(tags);
    for (set<string>::const_iterator i = results.begin();
        i != results.end(); ++i)
    {
        rec.scan(apt.rawRecord(*i));
        cout << rec.package() << " - " << rec.shortDescription() << endl;
    }
}

Print all package names, sorted by popularity

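#include <ept/popcon/popcon.h>
#include <algorithm>
#include <iterator>
#include <vector>

using namespace ept::popcon;

// STL comparator: orders package names by their popcon score
struct PopconCompare
{
    Popcon& popcon;
    PopconCompare(Popcon& popcon) : popcon(popcon) {}
    bool operator()(const std::string& pkg1, const std::string& pkg2) const
    {
        return popcon[pkg1] < popcon[pkg2];
    }
};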

void playWithPopcon()
{
    // Apt data source
    Apt apt;
    // Popcon data source
    Popcon popcon;
    vector<string> sorted;

    if (!popcon.hasData())
        return;

    // Get all package names in the vector
    copy(apt.begin(), apt.end(), back_inserter(sorted));

    // Sort it by popularity
    sort(sorted.begin(), sorted.end(), PopconCompare(popcon));

    // Print it out
    for (vector<string>::const_iterator i = sorted.begin();
        i != sorted.end(); ++i)
        cout << *i << endl;
}

Search for image viewer, but we don't want to view kernel images

#include <xapian.h>

using namespace ept::textsearch;

void playWithXapian()
{
    TextSearch textsearch;
    vector<string> wanted;
    vector<string> notwanted;

    Xapian::Enquire enq(textsearch.db());
    // This will tokenise the search query into terms, stem them
    // and OR them together in a query.  Xapian will score higher
    // those results in which more ORed terms match, which is what
    // we want.
    Xapian::Query want = textsearch.makeOrQuery("image viewer");
    Xapian::Query dontWant = textsearch.makeOrQuery("linux kernel");

    enq.set_query(Xapian::Query(Xapian::Query::OP_AND_NOT, want, dontWant));

    // Print the top 20 results, with their relevance percentage
    Xapian::MSet matches = enq.get_mset(0, 20);
    for (Xapian::MSetIterator i = matches.begin(); i != matches.end(); ++i)
    {
        // The get_data() of a document is the package name
        cout << i.get_document().get_data() << " ("
             << i.get_percent() << "%)" << endl;
    }
}
Posted Sat Jun 6 00:57:39 2009 Tags: debtags

Improving package managers

I noticed two posts on improving package managers, neither of which mentions Debtags.

Daniel Burrows mentions various issues:

  • the current sections in Synaptic are useless
  • there are better keyword search technologies than strstr()
  • we could use popularity contest data to sort results
  • it would be cool to do amazon-like things using popcon data

David Nusinov mentions that the ideal package manager should look like Google, where you search for things using just a simple one line text entry and pick from the results what you want to install.

I should probably do a bit of recap of things that have been going on.

I'll go through that list again:

  • The current sections in Synaptic are useless

Agreed. There used to be a bug about this, which was closed by Debtags more than a year ago. We now have much more useful category data for about 73% of the archive (including experimental), but what we lack is software using it.

Here's a quick trick to try:

  1. install debtags, and this gives you an easy to read text file in /var/lib/debtags/package-tags.
  2. from that file, pick packages that have the tags role::program, scope::application and interface::x11.
  3. display the results, and use the tags works-with::* and use::* to navigate the results.
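
The three steps above can be sketched in a few lines of Python. This is a minimal sketch, assuming the file uses the "pkg1, pkg2: tag1, tag2" line format; the function names, package names and tag assignments in the sample are made up for illustration.

```python
WANTED = {"role::program", "scope::application", "interface::x11"}


def parse_tags(lines):
    """Parse 'pkg1, pkg2: tag1, tag2' lines into a package -> set of tags dict."""
    db = {}
    for line in lines:
        line = line.strip()
        if ": " not in line:
            continue  # skip empty lines and packages without tags
        pkgs, tags = line.split(": ", 1)
        for pkg in pkgs.split(", "):
            db[pkg] = set(tags.split(", "))
    return db


def desktop_apps(db):
    """Step 2: keep only the packages that carry all the wanted tags."""
    return {pkg: tags for pkg, tags in db.items() if WANTED <= tags}


def navigation(apps):
    """Step 3: group the selected packages by works-with::* and use::* tag."""
    groups = {}
    for pkg, tags in apps.items():
        for tag in tags:
            if tag.startswith(("works-with::", "use::")):
                groups.setdefault(tag, []).append(pkg)
    return {tag: sorted(pkgs) for tag, pkgs in groups.items()}


# Made-up sample lines; on a real system you would read
# /var/lib/debtags/package-tags instead
sample = [
    "paint-app: role::program, scope::application, interface::x11, use::editing, works-with::image:raster",
    "viewer-app: role::program, scope::application, interface::x11, use::viewing, works-with::image:raster",
    "shell-tool: role::program, scope::utility, interface::commandline, use::editing",
]

apps = desktop_apps(parse_tags(sample))
for tag, pkgs in sorted(navigation(apps).items()):
    print("%s: %s" % (tag, ", ".join(pkgs)))
```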

There is a python-debian package in experimental that has a debtags module you could play with.

Why is it that so far no one has written a simple package manager just for gamers, which uses only the game::* tags?

Do you think Debtags gives you too many tags? Then check out:

  • The Debtags smart search, and especially how it does not show you all the tags, but it is able to infer the tags you want from your google-like query (hi David!).
  • The Debtags tag editor, and especially the search-as-you-type feature on all the tags and the tag search (analogous to the Debtags smart search, but it only searches tags).
  • The Debtags tag cloud, and if you don't like that one try to make your own: there are countless ways of generating tag clouds from Debtags data.

To summarise so far, we not only have better categories, but also a number of cool algorithms to use them, and some interface prototypes as well. Just don't expect me to write a package manager as well: that's a job that so far I decided to leave to someone else. adept gave it a try, with positive results.

  • there are better keyword search technologies than strstr()

Indeed, Xapian for example. I use it as part of the backend of the Debtags smart search, and here's our Xapian-powered normal keyword based package search interface which does stemming, indexing and all you want to ask from a serious full text index.

In that page you don't see all the nice features of Xapian, but only the ones that I needed for my Debtags evil plans. Have a look at the documentation and give it a try.

Here is a way to see Xapian's similarity matching in action:

  1. go to the Go tagging! page
  2. click on a random untagged package
  3. the system gives you a rather relevant selection of tags
  4. look at it again: the package was untagged: how could the web engine possibly figure those tags out?

What is happening behind the scenes is that:

  1. I ask Xapian: "what packages are similar to this one?".
  2. I aggregate the tags of the resulting packages.
  3. I rank the tags by how many resulting packages have them.
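
Steps 2 and 3 can be sketched as follows, taking the list of similar packages from step 1 as given. This is a minimal sketch and not the code running on the website: suggest_tags, the package names and their tags are made up for illustration.

```python
from collections import Counter


def suggest_tags(similar_packages, package_tags, limit=5):
    """Rank tags by how many of the similar packages carry them."""
    counts = Counter()
    for pkg in similar_packages:
        # Untagged packages simply contribute nothing
        counts.update(package_tags.get(pkg, ()))
    # Most frequent tags first; ties are broken alphabetically
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:limit]


# Made-up data: tags of the packages that the similarity search returned
package_tags = {
    "editor-a": {"use::editing", "works-with::image:raster", "role::program"},
    "editor-b": {"use::editing", "works-with::image:raster"},
    "viewer-c": {"use::viewing", "works-with::image:raster"},
}
similar = ["editor-a", "editor-b", "viewer-c", "untagged-d"]

for tag, count in suggest_tags(similar, package_tags, limit=3):
    print("%d %s" % (count, tag))
```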

While we are on this topic, why don't we decide that we maintain a Xapian index of our package descriptions in, for example, /var/lib/apt/fulltext/, so that various applications can share it?

  • we could use popularity contest data to sort results

Indeed. Would anyone like to implement this little "popcon" tool? Having the data easily accessible locally can encourage people to use it.

The Debtags Go tagging! page already uses popcon data to show the most common untagged packages at the top, for two reasons: it shows packages that more people are likely to know (and therefore likely to categorise) and it pushes for the most common packages to be tagged more urgently.

  • it would be cool to do amazon-like things using popcon data

Indeed. Does anyone volunteer to implement a prototype? The full unaggregated (but anonymised) popcon data are accessible to every Debian Developer on the host gluck.debian.org in the directory /org/popcon.debian.org/popcon-mail/popcon-entries.

Ideally one can do many interesting things with this concept: besides tag suggestions, one could identify the packages that are most representative of an installed system, and also offer negative suggestions like: "people who have packages like yours usually don't have this package: would you like to remove it?".
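
A toy version of the idea, working on plain sets of package names, could look like the sketch below. It is only one possible approach: the Jaccard similarity measure, the 0.5 threshold, the majority rule and the sample systems are all assumptions made for this example, not something taken from the popcon data or from an existing tool.

```python
from collections import Counter


def jaccard(a, b):
    """Similarity of two package sets: shared packages over all packages."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def suggestions(mine, others, min_similarity=0.5):
    """Return (to_install, to_remove) based on systems similar to mine."""
    similar = [s for s in others if jaccard(mine, s) >= min_similarity]
    if not similar:
        return [], []
    counts = Counter()
    for system in similar:
        counts.update(system)
    # Suggest what most similar systems have and I lack
    to_install = sorted(p for p, n in counts.items()
                        if n > len(similar) / 2 and p not in mine)
    # Negative suggestion: what I have and no similar system has
    to_remove = sorted(p for p in mine if counts[p] == 0)
    return to_install, to_remove


# Made-up systems, each one a set of installed package names
mine = {"pkg-a", "pkg-b", "pkg-c", "pkg-x"}
others = [
    {"pkg-a", "pkg-b", "pkg-c", "pkg-d"},
    {"pkg-a", "pkg-b", "pkg-c", "pkg-d", "pkg-e"},
    {"pkg-q", "pkg-r"},
]

install, remove = suggestions(mine, others)
print("you may like:", ", ".join(install))
print("you may not need:", ", ".join(remove))
```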

There is more that could be done. Recently, almost by accident, I had the idea of querying packages by example, like pointing to a file and finding packages that can work with it. I've asked Jeroen to have Mole collect info on all files that could possibly get installed in /usr/lib/mime/packages/ (as suggested by Bernhard R. Link), to see if that prototype can be made more accurate.

Query by similarity would be nice: I don't like this program, but what else do we have that does the same job? This is best implemented using Debtags data, since it directly maps to semantic properties. Note that you don't have to show a single tag to the user to implement this kind of interface. Do we have a way to point at the X window of an application and get the name of the package that installed it? Wouldn't it be about time to have it?

Why don't we have a system updater utility that shows the Debian weather?

Why aren't more people playing with semantic web?

But more generally, the problem with package managers is that we seem to be irrationally compulsive in wanting to make the one and only big easy and complete interface for everyone. Other more reasonable people would tell you that if you have two very different kinds of users you may want to consider having two different user interfaces.

Ubuntu for example installs by default 3 package manager interfaces: Synaptic; the thing that you access from the application menu to add applications to it; and the update manager. Does it sound like a waste? To me it makes lots of sense.

We have lots of interesting, usable metadata; we have algorithms; we have prototypes; we have ideas for lots of cool, implementable features. The question is, are we able to write applications that just combine what is needed from all this treasure to provide the right interface(s) for our base(s) of users?

Even if my English in 2004 wasn't easy to understand, a read here might still be useful.

There is so much really cool stuff to be written, just within reach.

Posted Sat Jun 6 00:57:39 2009 Tags: debtags