A pet problem of mine is that when you do a keyword search for "image editor", gimp does not show up. This is because the description of gimp does not include the word "editor". And if you are a gimp developer, please don't change this, otherwise I won't have a good example to point at anymore.

Now that I have a Xapian engine in ept-cache, I have already managed to make gimp appear somewhere in the search results through approximate matches.

But I know one can do better: debtags is very good at providing data that brings gimp back to the world of image editors. So, how can one use tags to improve the Xapian results? Today the Xapian developers told me how.

I've implemented it in ept-cache 0.5.6, which I've just uploaded to unstable. After you install it, run ept-cache reindex to rebuild the index with debtags in it.

The system is very clever: here is how it works.

First you prepare the query as usual: tokenize, stem and so on:

// This is the nice ept interface to Xapian
TextSearch textsearch;
Xapian::Enquire enquire(textsearch);

// Set up the base query
Xapian::Query query = textsearch.makeORQuery(keywords.begin(), keywords.end());
enquire.set_query(query);

// Get a set of tag-based tokens that can be used to expand the query
vector<string> expand = textsearch.expand(enquire);

// Build the expanded query
Xapian::Query expansion(Xapian::Query::OP_OR, expand.begin(), expand.end());
enquire.set_query(Xapian::Query(Xapian::Query::OP_OR, query, expansion));

// Get the results as usual
Xapian::MSet matches = enquire.get_mset(pos, 20);
for (Xapian::MSetIterator i = matches.begin(); i != matches.end(); ++i)
    ...

And this is how expand() picks the tag terms that go into the expanded query:

// This functor filters out all tokens that are not tags
// (tags are indexed with a 'T' prefix)
struct TagFilter : public Xapian::ExpandDecider
{
    virtual bool operator()(const std::string &term) const { return term[0] == 'T'; }
};

static TagFilter tagFilter;

vector<string> TextSearch::expand(Xapian::Enquire& enq) const
{
    // A Xapian RSet is a set of documents marked as relevant; it can be
    // used to 'expand' the search to show more documents like those
    Xapian::RSet rset;

    // Select the top 5 result documents as the 'good ones' to
    // use to expand the search
    Xapian::MSet mset = enq.get_mset(0, 5);
    for (Xapian::MSet::iterator i = mset.begin(); i != mset.end(); ++i)
        rset.add_document(i);

    // Get the expansion terms, but only those that are tags
    Xapian::ESet eset = enq.get_eset(5, rset, &tagFilter);
    vector<string> res;
    for (Xapian::ESetIterator i = eset.begin(); i != eset.end(); ++i)
        res.push_back(*i);

    // Pass the tags to the caller, who will OR them with their
    // normal keyword search
    return res;
}
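
If you want to play with the idea without a Xapian index at hand, here is a minimal self-contained sketch of the same flow: search, take the top results as the 'good ones', collect their tag terms, and search again with the tags ORed in. It is not the ept-cache code. The Doc struct, the match-counting score and the sample packages with their terms are all made up for illustration; only the 'T' prefix for tags comes from the code above.

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for an indexed package: a name plus its index
// terms. Tag terms carry a 'T' prefix, as in the post.
struct Doc {
    std::string name;
    std::set<std::string> terms;
};

// Made-up sample data: note that gimp has no "editor" term.
std::vector<Doc> sample_index()
{
    return {
        {"kolourpaint", {"image", "editor", "Tuse::editing", "Tworks-with::image"}},
        {"vim", {"text", "editor", "Tuse::editing", "Tworks-with::text"}},
        {"gimp", {"image", "manipulation", "program", "Tuse::editing", "Tworks-with::image"}},
        {"xterm", {"terminal", "emulator", "Tinterface::x11"}},
    };
}

// OR search: score each document by the number of query terms it
// contains (a crude replacement for real weighting), best first.
// Documents matching nothing are dropped.
std::vector<const Doc*> search(const std::vector<Doc>& index,
                               const std::vector<std::string>& query)
{
    typedef std::pair<int, const Doc*> Scored;
    std::vector<Scored> scored;
    for (const Doc& d : index) {
        int score = 0;
        for (const std::string& t : query)
            if (d.terms.count(t)) ++score;
        if (score > 0) scored.emplace_back(score, &d);
    }
    std::stable_sort(scored.begin(), scored.end(),
        [](const Scored& a, const Scored& b) { return a.first > b.first; });
    std::vector<const Doc*> res;
    for (const Scored& s : scored) res.push_back(s.second);
    return res;
}

// Treat the first top_n results as relevant and return the tag terms
// found in them, most frequent first, at most max_terms of them.
std::vector<std::string> expand(const std::vector<const Doc*>& results,
                                std::size_t top_n, std::size_t max_terms)
{
    std::map<std::string, int> counts;
    for (std::size_t i = 0; i < results.size() && i < top_n; ++i)
        for (const std::string& t : results[i]->terms)
            if (!t.empty() && t[0] == 'T') ++counts[t];

    typedef std::pair<int, std::string> Ranked;
    std::vector<Ranked> ranked;
    for (const auto& c : counts) ranked.emplace_back(c.second, c.first);
    std::stable_sort(ranked.begin(), ranked.end(),
        [](const Ranked& a, const Ranked& b) { return a.first > b.first; });

    std::vector<std::string> res;
    for (std::size_t i = 0; i < ranked.size() && i < max_terms; ++i)
        res.push_back(ranked[i].second);
    return res;
}
```

With this toy data, searching for "image" and "editor" ranks gimp last among the matches, tied with vim. Expanding with the tags of the single top result and searching again with keywords and tags together moves gimp up to second place, ahead of vim.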

debian debtags eng pdo

2009-06-06 00:57:39+02:00