KM 5433 Blog/Joe Colannino

A blog discussing knowledge management and library science issues.

Sunday, December 03, 2006

Indexing the invisible web: a survey by Yanbo Ru and Ellis Horowitz/ Overview (and strained theological implications), J. Colannino

The invisible web (also known as the deep, hidden, or dark web) is the part of the web that is not indexed by search engines such as Google or Yahoo. It is orders of magnitude larger than the part of the web we can see (called the visible or surface web). For example, Yanbo’s article exists in many places, none of which are accessible to search engines. The closest you can get is here, and that is after some trying. Other document warehouses will also hold this article, but all of them charge a fee. Fee-for-service websites deliberately keep most of their content invisible. However, even free databases suffer from the same malady, because search engines cannot index databases or the contents of query sites where records are generated on the fly.

The subject article is all about indexing these documents. This is done in two main ways. One method is to index only an interface site. This is how Google finds this paper when given the search string “Indexing the invisible web” Yanbo. Because of the variety of databases, protocols, forms to fill out, etc., automated indexing of the invisible web has resisted progress; however, this content can often be reached by a person who fills out the forms by hand. This leads to the second main way of indexing hidden web documents: professional (manual) indexing. Human indexers cannot possibly reach every document, so this strategy indexes only a portion of interest within the hidden web.
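
For the programmers in the audience, here is a minimal sketch of why a link-following crawler ends up indexing only the interface site. Nothing in it comes from the survey itself: the toy site, the record keys, and the function names (crawl, query) are all made up for illustration, and a real search engine is far more elaborate. The point is only that the crawler can reach the page holding the search form, but the records behind that form exist only as answers to queries.

```python
from collections import deque

# Hypothetical toy site: each page maps to the static links it contains.
SITE_LINKS = {
    "/": ["/about", "/search"],
    "/about": ["/"],
    "/search": [],  # the interface page: a query form, no links to records
}

# Hypothetical records held in a database and shown only in answer to a query.
DATABASE = {
    "record-1": "Indexing the invisible web: a survey",
    "record-2": "Some other document",
}

def crawl(start):
    """Follow static links breadth-first, as a simple crawler would."""
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for link in SITE_LINKS.get(page, []):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return seen

def query(term):
    """Simulate filling out the form: results are generated on the fly."""
    term = term.lower()
    return [key for key, title in DATABASE.items() if term in title.lower()]

indexed = crawl("/")            # pages the crawler can see
found = query("invisible web")  # what a person using the form can see
```

In this sketch, indexed holds the three static pages, including /search, while the record keys turn up only in found. That gap is the invisible web in miniature, and it is the gap a human indexer closes by hand.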

The article goes on to vet the various methods. If you are interested in an overview of how some search engines attempt to access the invisible web, this 17-page article is well-referenced and helpful as a starting point. If you are interested in how to obtain information from the hidden web, visit a librarian. Information indexing and retrieval is their unequaled profession.

How would you end this article?
I have always wanted to write an article that uses the word “alas” and ends with the phrase “…and the theological implications alone are staggering!” Yanbo and Horowitz’ article ends with this: “A technique that can more comprehensively index the data in an invisible web site, and that will not get swamped by the size of the data, is required.” Okay.... But I was hoping for something less anemic.

In my view, we will never fully index the hidden web. Well, I suppose that is a universal negative akin to “no two snowflakes are alike.” And of course, universal negatives are often not provable, but we believe them anyway. That should give pause to the empiricists among us; apparently, er… alas, the world’s “greatest” religion is pessimism.

… and the theological implications alone are staggering!