I was reading this article about Drupal (linked above), and came across the idea of a stochastic taxonomic analyzer. Basically, the idea is that you start with a predefined taxonomy built from some example corpus of stuff, then use Bayesian analysis to work out the proper taxonomic category for a new, previously uncategorized article.
I think this idea is interesting, because it sort of mimics what a human would do in learning how to categorize things. You start out with a "representative subsample", develop some sort of taxonomic scheme, and then apply it to other things. Eventually, though, the taxonomic system will probably break down on something that doesn't fit (like trying to classify a duck-billed platypus), which is an interesting problem.
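To make that concrete, here's a minimal sketch of that kind of Bayesian categorizer in Python. The category names and training snippets are invented purely for illustration; whatever the Drupal module actually does is presumably more sophisticated than this.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    return text.lower().split()

class NaiveBayesCategorizer:
    """Categorize documents against a predefined taxonomy using naive Bayes."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # category -> word frequencies
        self.doc_counts = Counter()              # category -> training doc count
        self.vocab = set()

    def train(self, text, category):
        words = tokenize(text)
        self.word_counts[category].update(words)
        self.doc_counts[category] += 1
        self.vocab.update(words)

    def categorize(self, text):
        words = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        best_category, best_score = None, float("-inf")
        for category in self.doc_counts:
            # log P(category) + sum of log P(word | category), Laplace-smoothed
            score = math.log(self.doc_counts[category] / total_docs)
            total_words = sum(self.word_counts[category].values())
            for word in words:
                count = self.word_counts[category][word]
                score += math.log((count + 1) / (total_words + len(self.vocab)))
            if score > best_score:
                best_category, best_score = category, score
        return best_category

# Train on a few pre-categorized snippets (the "representative subsample"),
# then categorize a new, unseen one.
c = NaiveBayesCategorizer()
c.train("install the module and configure the database settings", "drupal")
c.train("the platypus lays eggs but is still a mammal", "zoology")
print(c.categorize("how do I configure a drupal module"))  # -> "drupal"
```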
I was thinking about this in the context of the Semantic Web. Cory Doctorow's famous "Metacrap" missive gives some compelling reasons why the average person can't be trusted to semantically mark up everything they produce. But it seems to me that the problem isn't our inability to determine the proper semantic information; it's our inability to put up with the boring, meticulous way a computer has to be fed that information.
For example, take DocBook. DocBook has lots of fun semantic tags like <command>, <filename>, and <keycap>, but wrapping every last word of a document in the right tag is exactly the kind of boring, meticulous work that most people will never do by hand.
I mean, isn't semantics why humans invented grammar in the first place? It seems like if we could teach computers to recognize titles and diagram sentences, then they might be able to derive semantic information from "unstructured" documents. Obviously, this isn't easy, or the field of natural language processing would have nothing left to do... but it is an interesting line of thought.
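As a toy illustration of the "recognize titles" half of that, here's a Python sketch that guesses which lines of an unstructured document are titles using nothing but surface cues (length, capitalization, end punctuation). Real natural language processing goes far deeper, and the thresholds here are just guesses, but it shows the flavor of the inference.

```python
def looks_like_title(line):
    """Guess whether a line is a title from surface cues alone."""
    words = line.strip().split()
    if not words or len(words) > 10:          # titles tend to be short
        return False
    ends_like_sentence = line.rstrip().endswith((".", "!", "?"))
    mostly_capitalized = sum(w[0].isupper() for w in words) / len(words) > 0.5
    return mostly_capitalized and not ends_like_sentence

document = """A Field Guide to the Platypus

The platypus is a semiaquatic mammal found in eastern Australia.
It is one of the few mammals that lays eggs."""

for line in document.splitlines():
    if looks_like_title(line):
        print("probable title:", line)  # -> "A Field Guide to the Platypus"
```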