Gimme’ The Context: Context-driven Automatic Semantic Annotation with C-PANKOW / Overview and Opinion / J. Colannino
By Philipp Cimiano, Gunter Ladwig, and Steffen Staab, Institute AIFB,
In an earlier blog I gave a coarsely articulated recipe for an automated semantic engine. I also spoke of Schmoogle, my word for a hypothetical semantic search engine that is as user-friendly as Google. Talk is cheap, of course. The glory is in the doing, not the dreaming. To wit, now there is C-PANKOW (Context-sensitive Pattern-based ANnotation through Knowledge On the Web). It is hosted here, though the server was unavailable each time I tried it.
The actual algorithms are a complex integration of propositional calculus, probability and Bayesian statistics, and analytic geometry. However, heuristically the automated metatagging follows this process:
- scan the text of the subject web page that one desires to semantically tag,
- match the text to a pattern library, creating “instances,”
- generate an automated query to Google for each instance,
- download the first n abstracts (n <>
- assess the similarity (pattern match) between the downloaded abstracts and the subject page,
- weight the pattern matching in the abstracts by number and similarity (these are presumed to be contextually relevant),
- tag the subject page with the annotations contextually relevant as voted by the filtered web results.
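To make the voting idea concrete, here is a minimal sketch of the last few steps. It is my own toy illustration, not the authors' code: the single "X is a Y" pattern, the bag-of-words cosine similarity, the `threshold` value, and the in-memory `abstracts` list (standing in for downloaded search results) are all assumptions chosen for brevity.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercased word counts; a crude stand-in for a real context vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def vote_on_concept(instance, page_text, abstracts, threshold=0.1):
    """Rank candidate concepts for `instance`.

    Each abstract that matches the (hypothetical) pattern "<instance> is a <concept>"
    casts a vote weighted by its similarity to the subject page. Abstracts below
    `threshold` are treated as out of context and ignored.
    """
    page_vec = bag_of_words(page_text)
    pattern = re.compile(
        r"\b" + re.escape(instance) + r"\s+is\s+an?\s+(\w+)", re.IGNORECASE
    )
    scores = Counter()
    for abstract in abstracts:
        sim = cosine_similarity(page_vec, bag_of_words(abstract))
        if sim < threshold:
            continue  # filtered out: not contextually relevant
        for concept in pattern.findall(abstract):
            scores[concept.lower()] += sim
    return scores.most_common()


# Made-up data: a page about a river, and four pretend search abstracts.
page_text = (
    "The Niger flows through Mali and Nigeria. "
    "The river delta is vast and the river supports fishing."
)
abstracts = [
    "Niger is a river in West Africa that flows through Mali to its delta.",
    "The Niger is a river whose delta supports fishing in Nigeria.",
    "Niger is a country with a growing economy and a young population.",
    "Buy cheap flights today.",
]

ranked = vote_on_concept("Niger", page_text, abstracts)
print(ranked)
```

In this toy run the "river" reading outvotes the "country" reading because the abstracts that support it look more like the subject page. A real system would need a proper pattern library and a far better similarity measure.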
I think the scheme, especially step 7 (letting the filtered web results vote on the tags), is clever.
For the instance recognition, C-PANKOW uses the part-of-speech tagger hosted at
So, does it work? Yes and no.
First of all, it can take up to twenty minutes to semantically tag a page. Obviously, this could never be used on the fly. To be fair, the authors do not envision it as an on-the-fly tool, but rather as a way for creators to tag their web pages. But even so, twenty minutes is still a lifetime in computer years. No doubt the algorithm will continue to be refined and the tagging time will be reduced. However, the bigger problem, as we learned from the Mohammed article, is that no one believes metatagging – indeed, search engines ignore it.
So, C-PANKOW or something like it will not be a player in enabling semantic search, ever, unless some authoritative body undertakes to do the tagging. Even so, tagging times of minutes require careful selection of resources (e.g., scholarly work). Hasn’t this been the pattern with library resources anyway, whether metadata (cringe) is automated or handcrafted? Maybe the focus should be on creating trustworthy data asylums that are free, or at least inexpensive. Hey! We could call it a library!
Labels: Annotation, Information Extraction, Metadata, semantic web