KM 5433 Blog/Joe Colannino

A blog discussing knowledge management and library science issues.

Saturday, November 04, 2006

Gimme’ The Context: Context-driven Automatic Semantic Annotation with C-PANKOW / Overview and Opinion / J. Colannino

By Philipp Cimiano, Gunter Ladwig, and Steffen Staab, Institute AIFB, University of Karlsruhe, Germany

In an earlier post I gave a coarsely articulated recipe for an automated semantic engine. I also spoke of Schmoogle, my word for a hypothetical semantic search engine that is as user-friendly as Google. Talk is cheap, of course. The glory is in the doing, not the dreaming. To wit, now there is C-PANKOW (Context-sensitive Pattern-based ANotation through Knowledge On the Web). It is hosted here, though the server was unavailable each time I tried it.

The actual algorithms are a complex integration of propositional calculus, probability and Bayesian statistics, and analytic geometry. However, heuristically the automated metatagging follows this process:

  1. scan the text of the subject web page that one desires to semantically tag,
  2. match the text to a pattern library, creating “instances,”
  3. generate an automated query to Google for each instance,
  4. download the first n abstracts for each query (typically the first ten hits),
  5. assess the similarity (pattern match) between the downloaded abstracts and the subject page,
  6. weight the pattern matching in the abstracts by number and similarity (these are presumed to be contextually relevant),
  7. tag the subject page with the annotations contextually relevant as voted by the filtered web results.
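
Steps 5 through 7 are the heart of the scheme, so here is a minimal sketch of how the weighting and voting might look. It is my own illustration, not the authors' code. The bag-of-words cosine similarity, the single "X is a Y" pattern, the threshold value, and every function name are assumptions made for the example; the abstracts are passed in as plain strings rather than fetched from Google.

```python
import math
import re
from collections import Counter, defaultdict


def bag_of_words(text):
    """Lower-case word counts; a deliberately crude text representation."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def vote_for_concept(page_text, instance, abstracts, threshold=0.05):
    """Pick the concept that contextually similar abstracts vote for.

    Each abstract that resembles the page (step 5) contributes its
    similarity score as a weighted vote (step 6) to every concept it
    names for the instance. The winner becomes the tag (step 7).
    """
    page_vec = bag_of_words(page_text)
    # Hypothetical single pattern: "<instance> is a/an <concept>".
    pattern = re.compile(
        rf"\b{re.escape(instance)}\s+is\s+an?\s+(\w+)", re.IGNORECASE
    )
    votes = defaultdict(float)
    for abstract in abstracts:
        similarity = cosine_similarity(page_vec, bag_of_words(abstract))
        if similarity < threshold:
            continue  # abstract is out of context, so it gets no vote
        for concept in pattern.findall(abstract):
            votes[concept.lower()] += similarity
    return max(votes, key=votes.get) if votes else None


page = "Niger is a country in West Africa with a river and a capital city Niamey."
hits = [
    "Niger is a country in West Africa.",
    "Niger is a river flowing through West Africa to the sea.",
    "Niger is a band that plays loud music every weekend.",
]
tag = vote_for_concept(page, "Niger", hits)
```

The point of weighting by similarity is that an abstract about a band named Niger still matches the pattern, but it looks less like the subject page, so its vote counts for less.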

I think the scheme, especially step 7, is clever.

For instance recognition, C-PANKOW uses the part-of-speech tagger hosted at Ohio State University (the other OSU). It then parses the parts of speech, creating a kind of ordered paraphrase or pattern string. (Identifying parts of speech is the only way I can think of to begin automatically constructing semantics.) C-PANKOW then generates an automated Google query for each pattern string, typically downloading the first ten hits for each instance and counting the pattern strings in them. It does this through one of Google’s application programming interfaces (APIs); I presume it is the AJAX Search API, but this is not clear from the article. At any rate, these hits provide the context for creating and interpreting pattern strings.
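
To make the idea of pattern strings concrete, here is a rough sketch of turning a page into quoted search queries. It is an assumption-laden toy: a capitalization heuristic stands in for the real part-of-speech tagger, the four templates are hypothetical examples of a pattern library rather than the ones C-PANKOW actually ships with, and no search API is called.

```python
import re

# Hypothetical pattern library; the real one is not listed in this post.
PATTERN_TEMPLATES = [
    "{instance} is a",
    "such as {instance}",
    "{instance} and other",
    "especially {instance}",
]


def candidate_instances(text):
    """Find runs of capitalized words, skipping each sentence's first word.

    This is a crude stand-in for part-of-speech tagging: it treats
    mid-sentence capitalized runs as proper nouns worth annotating.
    """
    instances = []
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        words = re.findall(r"[A-Za-z]+", sentence)
        run = []
        for word in words[1:]:
            if word[0].isupper():
                run.append(word)
            else:
                if run:
                    instances.append(" ".join(run))
                run = []
        if run:
            instances.append(" ".join(run))
    # Remove duplicates while keeping first-seen order.
    return list(dict.fromkeys(instances))


def build_queries(instance):
    """Fill each template and quote it so it is searched as an exact phrase."""
    return ['"' + t.format(instance=instance) + '"' for t in PATTERN_TEMPLATES]


found = candidate_instances("We flew to Niger and then drove to Lake Chad.")
queries = build_queries("Niger")
```

Each query string would then be handed to whatever search interface is available, and the returned abstracts supply the text that completes the pattern.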

So, does it work? Yes and no.

First of all, it can take up to twenty minutes to semantically tag a page. Obviously, this could never be used on-the-fly. To be fair, the authors do not envision it as an on-the-fly tool, but rather as a way for creators to tag their web pages. But even so, twenty minutes is a lifetime in computer years. No doubt the algorithm will continue to be refined and the tagging time will be reduced. However, the bigger problem, as we learned from the Mohammed article, is that no one trusts metatagging; indeed, search engines ignore it.

So, C-PANKOW or something like it will not be a player in enabling semantic search, ever, unless some authoritative body undertakes to do the tagging. Even then, tagging times measured in minutes require careful selection of resources (e.g., scholarly work). Hasn’t this been the pattern with library resources anyway, whether metadata (cringe) is automated or hand crafted? Maybe the focus should be on creating trustworthy data asylums that are free, or at least inexpensive. Hey! We could call it a library!
