Wednesday, July 07, 2010

Notes on Hadoop and Elastic MapReduce

We've been busily integrating Hadoop into our distributed processing architecture at PeerIndex. Here is a list of some items that I've run across that made the whole process easier.

  1. Do some training (seriously). Hadoop and MapReduce aren't your grandfather's programming, and many of the ideas and principles you would otherwise use don't suit MapReduce/Hadoop. Those coming from a large-scale scientific data background will have a head start. Cloudera has some great training videos available on their site and also offers a training course, which I attended and found good.
  2. Use a local version to test. I found Karmasphere to be very useful for testing the scripts I had written and working out the bugs. Cloudera also has a virtual machine that is very, very useful for doing Pig and Hive testing.
  3. Streaming jobs are your friend. Using streaming jobs is a great way to get Hadoop-based processing off the ground.
  4. Follow these pointers from Pete Warden: increase the allocated memory size (particularly if using PHP, via ini_set), use -jobconf stream.recordreader.compression=gzip, etc.
  5. Delete outputs. MapReduce jobs will fail if the output directory from a previous run still exists, so delete it before resubmitting. This one will get you into trouble all the time.
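
To make tips 2 and 3 a bit more concrete, here is a minimal sketch of a streaming job written in Python. Word count is my own choice of example (it is not something from our pipeline), and the names mapper, reducer and run_local are made up for illustration. It assumes the usual streaming contract: records arrive on stdin one per line, key and value are separated by a tab, and the reducer sees its input sorted by key. run_local fakes the map, sort and reduce steps in a single process so the logic can be checked before going anywhere near a cluster.

```python
import sys
from itertools import groupby


def mapper(lines):
    """Emit one tab-separated "word<TAB>1" record per word in the input."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(sorted_records):
    """Sum the counts per key. Input must already be sorted by key."""
    pairs = (record.rstrip("\n").split("\t", 1) for record in sorted_records)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


def run_local(lines):
    """Simulate map -> sort -> reduce in one process for local testing."""
    return list(reducer(sorted(mapper(lines))))


# Hypothetical convention: pass "map" or "reduce" as the only argument.
if __name__ == "__main__" and len(sys.argv) == 2:
    stage = {"map": mapper, "reduce": reducer}[sys.argv[1]]
    sys.stdout.writelines(record + "\n" for record in stage(sys.stdin))
```

Saved under a hypothetical name such as wc.py, the same file can be tried from a shell with `cat input.txt | python wc.py map | sort | python wc.py reduce`, and then handed to Hadoop Streaming through its -mapper and -reducer options.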
