Monday, June 2, 2008

Drupal and The Future of News - O'Reilly XML Blog

Drupal and The Future of News - O'Reilly XML Blog: "stochastic taxonomic analyzers"

I was reading this article about Drupal (linked above), and came across the idea of a stochastic taxonomic analyzer. Basically, the idea is that you have a predefined taxonomy based on some example corpus of stuff and you use Bayesian analysis to work out the proper taxonomic categorization of another, previously uncategorized article.
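To make that concrete, here is a minimal sketch of what such an analyzer might look like, assuming a naive Bayes model over whitespace-separated words with add-one smoothing. The class name, the category labels, and the tiny training corpus are all made up for illustration; the article doesn't say how a real implementation would work.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Crude tokenizer (an assumption): lowercase and split on whitespace.
    return text.lower().split()


class NaiveBayesCategorizer:
    """Hypothetical categorizer: learns word counts per category from examples."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # category -> word -> count
        self.doc_counts = Counter()              # category -> number of examples
        self.vocabulary = set()

    def train(self, category, text):
        words = tokenize(text)
        self.word_counts[category].update(words)
        self.doc_counts[category] += 1
        self.vocabulary.update(words)

    def classify(self, text):
        words = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        best_category, best_score = None, float("-inf")
        for category, n_docs in self.doc_counts.items():
            # log P(category) + sum of log P(word | category)
            score = math.log(n_docs / total_docs)
            counts = self.word_counts[category]
            denom = sum(counts.values()) + len(self.vocabulary)
            for word in words:
                # Add-one smoothing so unseen words don't zero out the score.
                score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_category, best_score = category, score
        return best_category


# Made-up example corpus with a two-term taxonomy.
categorizer = NaiveBayesCategorizer()
categorizer.train("politics", "senate vote election campaign governor")
categorizer.train("politics", "election results senate debate")
categorizer.train("sports", "game score team season playoff")
categorizer.train("sports", "team wins championship game")

print(categorizer.classify("the senate election debate"))
```

The "representative subsample" is whatever gets passed to `train`; everything after that is just counting words and comparing probabilities.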

I think this idea is interesting, because it sort of mimics what a human would do in learning how to categorize things. You start out with a "representative subsample", develop some sort of taxonomic relationship, and then apply that to other things. Eventually, though, the taxonomic system will probably break (think of trying to classify a duck-billed platypus), which is an interesting problem.

I was thinking about this in the context of the Semantic Web. Cory Doctorow's famous "Metacrap" missive gives some compelling reasons why the average person can't be trusted to semantically mark up everything they produce. But it seems to me that the problem isn't with our inability to determine the proper semantic information, but with our inability to deal with the boring, meticulous manner in which a computer must be fed this info.

For example, take DocBook. DocBook has lots of fun semantic tags like <title> and whatnot. Semantic information. But what happens to these when you use the XSL for print output? This semantic information gets translated into "formatting" information. The kind of information that is supposed to somehow be separate from semantics. But why do we bother with this formatting? Precisely because it is the formatting that gives humans the visual cues necessary to derive the very same DocBook tags in the first place. So in the case of DocBook, one could imagine an OCR system that scans a book or magazine article and does a fairly decent job of doing the markup based purely on the formatting.
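As a toy illustration of going from formatting back to markup, here is a rough sketch that guesses tags from font size and boldness. Everything in it is an assumption: the `TextRun` structure stands in for whatever a hypothetical OCR step would emit, the thresholds are arbitrary, and it only knows about `<title>` and `<para>`.

```python
from collections import Counter
from dataclasses import dataclass
from xml.sax.saxutils import escape


@dataclass
class TextRun:
    """Hypothetical OCR output: a chunk of text plus its visual formatting."""
    text: str
    font_size: float
    bold: bool = False


def guess_tag(run, body_size):
    # Heuristic (an assumption): larger-than-body text, or a short bold line,
    # is treated as a title; everything else is a paragraph.
    if run.font_size > body_size or (run.bold and len(run.text.split()) <= 10):
        return "title"
    return "para"


def markup(runs):
    # Treat the most common font size as the body text size.
    body_size = Counter(r.font_size for r in runs).most_common(1)[0][0]
    lines = []
    for run in runs:
        tag = guess_tag(run, body_size)
        lines.append(f"<{tag}>{escape(run.text)}</{tag}>")
    return "\n".join(lines)


runs = [
    TextRun("Drupal and the Future of News", 18.0, bold=True),
    TextRun("Body text one & two.", 10.0),
    TextRun("More body text.", 10.0),
]
result = markup(runs)
print(result)
```

A real system would need far more cues (indentation, position on the page, numbering), but the point stands that the formatting carries enough signal to make a first guess.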

I mean, isn't semantics why humans invented grammar in the first place? It seems like if we could teach computers to recognize titles and diagram sentences, then they may be able to derive semantic information from "unstructured" documents. Obviously, this isn't easy, or the field of natural language processing would have nothing left to do... but it is an interesting line of thought.