Hobby-hacking Eric

2007-03-31

instant dada

Just a small project idea in case anybody is bored: write a small web toy that generates random blog entries in the style of a given blog.

My mental image is basically a form into which the user pastes a URL. What the user gets back is a formatted blog post (maybe even using the right templates), along with HTML code that they can paste back into their own blog.

You'll also want to figure out how to retrieve the individual blog posts, perhaps by using a feed-discovery mechanism. The result should be a single post with title, text and maybe even comments. Haskellers, especially newbies, might be particularly interested because it'll give you a chance to play with the Haskell XML stuff for parsing the RSS, PFP (for the text generation?) and HAppS (for the web front-end). Many of the pieces are there. I remember seeing an RSS parser and a random text generator floating about, so it might just be a matter of cobbling things together.
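To make the "retrieve the individual blog posts" step concrete, here is a minimal sketch of pulling titles and text out of an RSS 2.0 feed. It is written in Python for brevity rather than Haskell, and it works on a made-up inline feed (`SAMPLE_RSS` and `posts_from_rss` are hypothetical names, not part of any existing tool). Fetching the feed and discovering its URL are left out.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample feed; a real one would be fetched from the blog's URL.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example blog</title>
    <item>
      <title>First post</title>
      <description>&lt;p&gt;Hello &lt;b&gt;world&lt;/b&gt;&lt;/p&gt;</description>
    </item>
    <item>
      <title>Second post</title>
      <description>More text here.</description>
    </item>
  </channel>
</rss>"""

def posts_from_rss(xml_text):
    """Return a list of (title, plain_text) pairs, one per <item>."""
    root = ET.fromstring(xml_text)
    posts = []
    for item in root.iter("item"):
        title = item.findtext("title", default="")
        body = item.findtext("description", default="")
        # Crude tag stripping: good enough for building a word corpus.
        text = re.sub(r"<[^>]+>", " ", body)
        posts.append((title.strip(), " ".join(text.split())))
    return posts

for title, text in posts_from_rss(SAMPLE_RSS):
    print(title, "|", text)
```

The plain text of all the posts, split into words, would then be the corpus for the text generator. Atom feeds use different element names, so a real toy would need a second code path for them.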

If you're successful, you might even kick off one of those memes where people dada their own blogs as a post.


3 comments:

Miles said...

A quick google shows up no Haskell dada code, but it shouldn't be too hard to port/interface to Jamie Zawinski's DadaDodo (exterminate all rational thought!). It's just Markov chains, anyway.

kowey said...

I guess what I had in mind was the word-bigram-based generator by Mikael Johansson. Thanks for the link!

Miles said...

Nifty! I should really read Michi's work/code blog more often. I don't think what he's doing is quite the same, though - if I'm reading Michi's code aright (which I might not be), he takes n/2 randomly-picked bigrams and concatenates them together, whereas I think DadaDodo works by saying "given the current word, what's the probability distribution of the next word?" Markov chains, as I said.
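The "given the current word, what's the probability distribution of the next word?" idea can be sketched in a few lines. This is a generic first-order word-level Markov chain in Python, not DadaDodo's actual code; `build_chain` and `generate` are names made up for the illustration.

```python
import random
from collections import Counter, defaultdict

def build_chain(words):
    """Map each word to a Counter of the words that follow it."""
    chain = defaultdict(Counter)
    for current, following in zip(words, words[1:]):
        chain[current][following] += 1
    return chain

def generate(chain, start, length, rng=random):
    """Walk the chain: sample each next word given only the current one."""
    out = [start]
    while len(out) < length:
        followers = chain.get(out[-1])
        if not followers:  # dead end: word only seen at the end of the corpus
            break
        choices, weights = zip(*followers.items())
        out.append(rng.choices(choices, weights=weights)[0])
    return out

words = "the cat sat on the mat and the cat ran".split()
chain = build_chain(words)
print(" ".join(generate(chain, "the", 8)))
```

Every adjacent pair of words in the output occurs somewhere in the corpus, because each step samples from the words that followed the current word there. Concatenating independently picked bigrams gives no such guarantee at the joins.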

I did some playing around with this stuff a while ago, using a program called NIALL (Non-intelligent algorithmic language learner - can't find it on the web at the moment, alas), and we found that the best results for a reasonable-sized corpus came when you considered the last (two or) three words rather than just the last one. Considering the last four meant that the program would just dump out big chunks of verbatim corpus text. But as JWZ says, the optimum lookback (and thus the memory requirement) probably increases with the size of the corpus.
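The lookback effect is easy to see on a toy corpus. The sketch below (Python, with hypothetical helper names; it is not NIALL or DadaDodo code) keys the chain on the last `order` words and measures how many states have only one possible continuation. When that fraction reaches 1.0 the generator has no choices left and can only replay the corpus.

```python
from collections import defaultdict

def build_chain(words, order):
    """Map each run of `order` consecutive words to the set of words seen after it."""
    chain = defaultdict(set)
    for i in range(len(words) - order):
        state = tuple(words[i:i + order])
        chain[state].add(words[i + order])
    return chain

def forced_fraction(words, order):
    """Fraction of states with exactly one possible continuation."""
    chain = build_chain(words, order)
    forced = sum(1 for followers in chain.values() if len(followers) == 1)
    return forced / len(chain)

words = "the cat sat on the mat and the cat ran to the mat".split()
for order in (1, 2, 3):
    print(order, forced_fraction(words, order))
```

On this 13-word corpus the fraction is 0.75 for a lookback of one word, 0.9 for two and 1.0 for three. A larger corpus pushes that saturation point out to longer lookbacks, which is consistent with the observation that the best lookback grows with corpus size.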