Horizon of Stars: July 2010

Wednesday, July 07, 2010

Notes on Hadoop and Elastic MapReduce

We've been busily integrating Hadoop into our distributed processing architecture at PeerIndex. Here are some things I've run across that made the whole process easier.

Do some training (seriously). Hadoop and MapReduce aren't your grandfather's programming, and many of the ideas and principles you would otherwise use don't suit MapReduce/Hadoop. Those coming from a large-scale scientific data background will have a head start. Cloudera has some great training vids available on their site and also offers a training course, which I attended and found good.
Use a local version to test. I found Karmasphere very useful for testing scripts and working out the bugs. Cloudera also has a virtual machine that is very, very useful for doing Pig and Hive testing.
Streaming jobs are your friend. Using streaming jobs is a great way to get Hadoop-based processing off the ground.
Follow these pointers from Pete Warden. Increase the allocated memory size (particularly if using PHP, via ini_set), use the -jobconf stream.recordreader.compression=gzip option for gzipped input, etc.
Delete outputs. MapReduce jobs will fail if you don't delete the output directory from a previous run. This one will get you into trouble all the time.
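
To make the streaming and local-testing tips above a bit more concrete, here is a minimal sketch of a word-count streaming job in Python. It is only an illustration, not the code we run at PeerIndex: the word-count task, the function names and the map/reduce command-line argument are all assumptions made for the example. The one thing it relies on from Hadoop is that streaming feeds the reducer its input sorted by key.

```python
import sys
from itertools import groupby


def map_lines(lines):
    """Mapper: emit one 'word<TAB>1' record for every word in the input."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reduce_lines(lines):
    """Reducer: sum the counts for each key.

    Assumes the input lines arrive sorted by key, which is what
    Hadoop streaming (or a local `sort`) provides.
    """
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines if line.strip())
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


# Hypothetical convention for this sketch: the first argument picks the step.
if __name__ == "__main__" and sys.argv[1:2] in (["map"], ["reduce"]):
    step = map_lines if sys.argv[1] == "map" else reduce_lines
    for record in step(sys.stdin):
        print(record)
```

Saved as, say, wordcount.py (a made-up name), the script can be tested without a cluster by piping a sample file through "python wordcount.py map", then sort, then "python wordcount.py reduce". On a cluster the same two commands would be passed to the streaming jar via -mapper and -reducer, and the directory given to -output has to be removed before each rerun, as noted above.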

