Monday, June 2, 2008

Drupal and The Future of News - O'Reilly XML Blog

Drupal and The Future of News - O'Reilly XML Blog: "stochastic taxonomic analyzers"

I was reading this article about Drupal (linked above), and came across the idea of a stochastic taxonomic analyzer. Basically, the idea is that you have a predefined taxonomy based on some example corpus of stuff and you use Bayesian analysis to work out the proper taxonomic categorization of another, previously uncategorized article.
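To make the idea a bit more concrete, here is a minimal sketch of one simple kind of Bayesian categorization, a naive Bayes classifier. It is not the analyzer the article describes. The categories, vocabulary, word counts, and the function names `vocab_index` and `classify` are all made up for illustration. A real implementation would learn the counts from the example corpus and would sum log-probabilities rather than multiplying raw ones.

```c
#include <assert.h>
#include <string.h>

#define NUM_CATS 2
#define VOCAB 6

/* Toy "example corpus", already reduced to word counts per category.
   Every name and number here is invented for illustration. */
static const char *cat_names[NUM_CATS] = {"politics", "sports"};
static const char *vocab[VOCAB] = {
    "election", "vote", "senate", "goal", "match", "league"
};
static const int word_counts[NUM_CATS][VOCAB] = {
    {8, 6, 5, 1, 0, 0},   /* politics */
    {0, 1, 0, 7, 6, 5},   /* sports */
};
static const int docs_per_cat[NUM_CATS] = {10, 10};

/* Returns the index of word in the vocabulary, or -1 if it is unknown. */
static int vocab_index(const char *word)
{
    for (int i = 0; i < VOCAB; i++)
        if (strcmp(vocab[i], word) == 0)
            return i;
    return -1;
}

/* Returns the name of the category with the highest naive Bayes score
   for an uncategorized article, given as an array of its words. */
static const char *classify(const char *words[], int nwords)
{
    assert(words != NULL && nwords >= 0);

    int total_docs = 0;
    for (int c = 0; c < NUM_CATS; c++)
        total_docs += docs_per_cat[c];

    int best = 0;
    double best_score = -1.0;   /* scores are never negative */
    for (int c = 0; c < NUM_CATS; c++) {
        int cat_total = 0;
        for (int i = 0; i < VOCAB; i++)
            cat_total += word_counts[c][i];

        /* Prior: the fraction of example documents in this category. */
        double score = (double)docs_per_cat[c] / total_docs;

        for (int w = 0; w < nwords; w++) {
            int i = vocab_index(words[w]);
            if (i < 0)
                continue;       /* skip words outside the vocabulary */
            /* Likelihood with add-one smoothing, so a word never seen
               in a category does not force the score to zero. */
            score *= (word_counts[c][i] + 1.0) / (cat_total + VOCAB);
        }
        if (score > best_score) {
            best_score = score;
            best = c;
        }
    }
    return cat_names[best];
}
```

With these made-up counts, an article containing "vote", "senate", and "match" comes out as politics, while one containing "goal" and "league" comes out as sports.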

I think this idea is interesting, because it sort of mimics what a human would do in learning how to categorize things. You start out with a "representative subsample", develop some sort of taxonomic relationship, and then apply that to other things. Eventually, though, the taxonomic system will probably break (like when you try to classify a duck-billed platypus), which is an interesting problem.

I was thinking about this in the context of the Semantic Web. Cory Doctorow's famous "Metacrap" missive gives some compelling reasons why the average person can't be trusted to semantically mark up everything they produce. But it seems to me that the problem isn't with our inability to determine the proper semantic information, but our inability to deal with the boring, meticulous manner in which a computer must be fed this info.

For example, take DocBook. DocBook has lots of fun semantic tags like <title> and whatnot. Semantic information. But what happens to these when you use the XSL for print output? This semantic information gets translated into "formatting" information, the kind of information that is supposed to be somehow separate from semantics. But why do we bother with this formatting? Precisely because it is the formatting that gives humans the visual cues necessary to derive the very same DocBook tags in the first place. So in the case of DocBook, one could imagine an OCR system that scans a book or magazine article and does a fairly decent job of doing the markup based purely on the formatting.

I mean, isn't semantics why humans invented grammar in the first place? It seems like if we could teach computers to recognize titles and diagram sentences, then they may be able to derive semantic information from "unstructured" documents. Obviously, this isn't easy, or the field of natural language processing would have nothing left to do... but it is an interesting line of thought.

Friday, May 30, 2008

Long Now: Views: Essays

Long Now: Views: Essays: "Every great man that I have known has had a certain time and place in their life that they use as a reference point; a time when things worked as they were supposed to and great things were accomplished. For Richard, that time was at Los Alamos during the Manhattan Project. Whenever things got 'cockeyed,' Richard would look back and try to understand how now was different than then."

The Richard in question here is Richard Feynman. I wonder how you know you've found that reference point? Does the lack of such a reference point limit one's "greatness"?

Bob Sutton: Strong Opinions, Weakly Held

Bob Sutton: Strong Opinions, Weakly Held

Pure genius. I think this is really what the scientific method is all about -- the ability to be both confident and questioning at the same time. I need to work on the former.

Friday, April 4, 2008

Google NewsBot Humor


Sometimes I think that the Google bot that assembles the news page has a sense of humor when it chooses the image to show. I think this is an example.

Friday, March 28, 2008

Progress

Ecology, Art, and Technology | Environmental Risk Assessment Rover - AT Version 1.0: "Why has modernity, which was supposed to create a sense of security, produced more anxiety and threats than ever? Can scientific data and research help us understand the “riskiness” of contemporary life?"

Good question. I bet old Ned Ludd would have a theory on it.

Friday, March 7, 2008

Computer Jokes

Const-Correctness in C: "In C, you merely shoot yourself in the foot.

In C++, you accidentally create a dozen instances of yourself and shoot them all in the foot. Providing emergency medical care is impossible, because you can't tell which are bitwise copies and which are just pointing at others and saying, 'That's me, over there.'"

Haha.. I thought that was funny. I'm a nerd.

Generating N contiguous bits

So say you want to generate a bitmap that is N contiguous bits, without using some kind of crazy inefficient for loop. Here's a way that I figured out how to do that (not to say I invented it or anything, but I couldn't find it when I searched for it). Obviously this has compiler and machine dependencies, and probably doesn't work on the Ides of March, so use at your own risk.

In this example, I want to generate 6 bits of 1's (so 0b111111):

#include <stdio.h>

int main(void)
{
    unsigned short a;
    short b;
    int n = 6;

    b = 1 << n;               // a single 1 bit at position n
    a = (unsigned short) -b;  // negate: sets every bit from position n upward
    printf("a is %hx\n", a);
    a = ~a;                   // complement: now a should be 0b111111
    printf("~a is %hx\n", a);
    return 0;
}


That's probably like 4 or 5 operations, so for 6 bits it probably sucks. But if you use 64-bit long longs and want to generate 47 ones, then it will save some cycles over a loop.
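For the 64-bit case, a minimal sketch of the same negate-then-complement trick might look like the following. The helper name `ones_mask64` is made up, and it uses `uint64_t` from `<stdint.h>` instead of `long long` so that the negation is done on an unsigned value and wraps in a well-defined way. The one special case is n = 64, since shifting a 64-bit value by 64 is undefined in C.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: returns a value whose low n bits are all 1.
   Valid for 0 <= n <= 64. */
static uint64_t ones_mask64(unsigned n)
{
    assert(n <= 64);
    if (n == 64)
        return UINT64_MAX;            /* 1 << 64 would be undefined */

    uint64_t b = UINT64_C(1) << n;    /* a single 1 bit at position n */
    uint64_t a = 0 - b;               /* negate: bits n and above are set */
    return ~a;                        /* complement: only bits below n are set */
}
```

So `ones_mask64(6)` gives 0x3F and `ones_mask64(47)` gives 0x7FFFFFFFFFFF. Negating and then complementing is the same as subtracting one, so `(UINT64_C(1) << n) - 1` produces the same mask in one fewer step.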

Isn't playing with bits fun?