apt-xapian-index: dynamically generated tag clouds
About apt-xapian-index, I have already posted:
- an introduction of apt-xapian-index;
- an example of how to query it;
- a way to add simple result filters to the query;
- a way to suggest keywords and tags to use to improve the query;
- a way to search for similar packages;
- a way to implement an adaptive cutoff on result quality;
- a smart way of querying tags;
- how to implement search as you type.
Today I'll show how to create tag clouds. Not only that, but I'll show how to implement tag clouds that change as the user types a query.
This example uses python-gtk, and has been created together with Matteo Zandi.
Generating a tag cloud out of any Xapian query is simple: it is just a matter of presenting as a tag cloud the information that you get with the technique shown in "a smart way of querying tags". You get the tags related to the query, and you lay out their names with a font size that grows with their Xapian score.
For the presentation, we can load pretty names from the Debtags
vocabulary in /var/lib/debtags/vocabulary:

    from debian_bundle import deb822

    # Facet name -> Short description
    facets = dict()
    # Tag name -> Short description
    tags = dict()
    for p in deb822.Deb822.iter_paragraphs(open("/var/lib/debtags/vocabulary", "r")):
        if "Description" not in p: continue
        desc = p["Description"].split("\n", 1)[0]
        if "Tag" in p:
            tags[p["Tag"]] = desc
        elif "Facet" in p:
            facets[p["Facet"]] = desc

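If you want to try the same logic without python-debian or a Debtags installation, here is a minimal sketch that runs it on an in-memory sample. The paragraph parser is a deliberately simplified stand-in for deb822, the names `iter_paragraphs`, `load_vocabulary` and `SAMPLE` are hypothetical, and the sample text is made up for illustration rather than copied from the real vocabulary.

```python
# Made-up sample in the general shape of a vocabulary file:
# paragraphs separated by blank lines, continuation lines start with a space.
SAMPLE = """\
Facet: role
Description: Role
 Role performed by the package

Tag: role::program
Description: Program
 Executable computer program

Tag: role::dummy
"""


def iter_paragraphs(text):
    """Yield one dict per paragraph (simplified stand-in for deb822)."""
    para, last = {}, None
    for line in text.splitlines():
        if not line.strip():
            # Blank line: the current paragraph is complete
            if para:
                yield para
            para, last = {}, None
        elif line[0] in " \t" and last is not None:
            # Continuation line: append to the previous field
            para[last] += "\n" + line.strip()
        elif ":" in line:
            key, value = line.split(":", 1)
            last = key.strip()
            para[last] = value.strip()
    if para:
        yield para


def load_vocabulary(text):
    """Return (facets, tags), each mapping a name to its short description."""
    facets, tags = {}, {}
    for p in iter_paragraphs(text):
        if "Description" not in p:
            continue
        # The short description is the first line of the field
        desc = p["Description"].split("\n", 1)[0]
        if "Tag" in p:
            tags[p["Tag"]] = desc
        elif "Facet" in p:
            facets[p["Facet"]] = desc
    return facets, tags


facets_demo, tags_demo = load_vocabulary(SAMPLE)
```

Paragraphs without a Description field, like the last one in the sample, are skipped, so a tag missing from the result simply falls back to its raw name when displayed.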
The query then goes on as usual, and when we get the tags from
the eset we also record their score and normalise it
between 0 and 1. I found that computing the logarithm of scores
helps to avoid having a tag cloud with a few huge tags and a lot of
tiny tiny tags:

    import math
    import xapian

    class Filter(xapian.ExpandDecider):
        def __call__(self, term):
            # Keep only Debtags tag terms, which carry the "XT" prefix
            return term[:2] == "XT"

    def format(k):
        # Pretty-print a tag using the vocabulary loaded earlier
        if k in tags:
            facet = k.split("::", 1)[0]
            if facet in facets:
                return "<i>%s: %s</i>" % (facets[facet], tags[k])
            else:
                return "<i>%s</i>" % tags[k]
        else:
            return k

    taglist = []
    maxscore = None
    for res in enquire.get_eset(15, rset, Filter()):
        # Normalise the score in the interval [0, 1]
        weight = math.log(res.weight)
        # The eset comes sorted by decreasing weight, so the first is the maximum
        if maxscore is None: maxscore = weight
        tag = res.term[2:]
        taglist.append((tag, format(tag), float(weight) / maxscore))
    # Sort alphabetically by tag name for display
    taglist.sort(key=lambda x: x[0])

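To see what the logarithm does in isolation, here is a small sketch that compares linear and logarithmic normalisation. The function name `normalise` and the weights are made up for illustration. It assumes the weights arrive sorted from highest to lowest and are all greater than 1, since the logarithm of a weight of 1 or less is zero or negative and the scaling would no longer land in [0, 1].

```python
import math


def normalise(weights, use_log=True):
    """Scale weights relative to the first (highest) one.

    Assumes `weights` is sorted in decreasing order and every value is > 1.
    """
    if not weights:
        return []
    scaled = [math.log(w) if use_log else w for w in weights]
    top = scaled[0]
    return [s / top for s in scaled]


# Made-up weights with one dominant term and a long tail
weights = [100.0, 50.0, 10.0, 2.0]
linear = normalise(weights, use_log=False)
logged = normalise(weights)
```

With these numbers the third weight maps to 0.1 on the linear scale but to 0.5 on the logarithmic one, so low-ranked tags stay readable next to the top one.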
Finally, you mark up a gtkhtml2.Document to display
in a gtkhtml2 widget:

    import gtkhtml2

    def mark_text_up(result_list):
        # result_list items: key (facet::tag), description, 0-1 score
        document = gtkhtml2.Document()
        document.clear()
        document.open_stream("text/html")
        document.write_stream("""<html><head>
    <style type="text/css">
    a { text-decoration: none; color: black; }
    </style>
    </head><body>""")
        for tag, desc, score in result_list:
            # A score of 1 renders at 150% of the base font size
            document.write_stream('<a href="%s" style="font-size: %d%%">%s</a> ' % (tag, score*150, desc))
        document.write_stream("</body></html>")
        document.close_stream()
        return document

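The link markup can also be built and checked without gtkhtml2. The following is a sketch of a variant, not the code used in the example: the function `cloud_html` and its `max_percent` parameter are hypothetical, and it assumes that labels are plain text, so it escapes them instead of passing through the `<i>` markup that `format` produces.

```python
from xml.sax.saxutils import escape, quoteattr


def cloud_html(result_list, max_percent=150):
    """Build the <a> elements of a tag cloud as a single string.

    result_list holds (tag, label, score) tuples with score in [0, 1].
    """
    parts = []
    for tag, label, score in result_list:
        parts.append('<a href=%s style="font-size: %d%%">%s</a>' % (
            quoteattr(tag),        # quoted and escaped attribute value
            score * max_percent,   # a score of 1 gives max_percent
            escape(label),         # escape &, < and > in the visible text
        ))
    return " ".join(parts)


# Made-up entries for illustration
sample_html = cloud_html([
    ("role::program", "Role: Program", 1.0),
    ("use::a&b", "A & B", 0.5),
])
```

The resulting string can be written between the header and footer with a single `write_stream` call. Escaping matters mostly for terms that do not come from the vocabulary, since those are displayed as they are.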
That's it, try it out.
You can use the git web interface to get to the full source code and the module it uses.
