We've been busily integrating Hadoop into our distributed processing architecture at PeerIndex. Here are a few things I ran across that made the whole process easier.
- Do some training (seriously). Hadoop and MapReduce aren't your grandfather's programming, and many of the ideas and principles you would otherwise use don't suit MapReduce/Hadoop. Those coming from a large-scale scientific data background will have a head start. Cloudera has some great training videos available on its site and also offers a training course, which I attended and found good.
- Use a local version to test. I found Karmasphere very useful for testing the scripts I wrote and working out the bugs. Cloudera also provides a virtual machine that is very, very useful for Pig and Hive testing.
- Streaming jobs are your friend. Using streaming jobs is a great way to get Hadoop-based processing off the ground (see the first sketch after this list).
- Follow these pointers from Pete Warden. Increase the allocated memory size (particularly if you're using PHP, via ini_set), use the -jobconf stream.recordreader.compression=gzip option, and so on; the first sketch after this list shows where a -jobconf flag fits into a streaming invocation.
- Delete outputs. MapReduce jobs will fail if you don't delete the output directory from a previous run, and this one will get you into trouble all the time; the second sketch after this list clears it before each run.
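
To illustrate the streaming points above, here is a minimal sketch of a word-count streaming job in Python. The file name, input/output paths, and the streaming jar location are illustrative assumptions, not our actual jobs; adjust them to your own cluster and distribution.

```python
#!/usr/bin/env python
# wordcount_streaming.py -- minimal Hadoop streaming sketch: mapper and reducer
# in one file, selected by a command-line argument. All paths and names here
# are assumptions for illustration.
#
# A typical invocation (jar path varies by distribution) looks roughly like:
#   hadoop jar /usr/lib/hadoop/contrib/streaming/hadoop-streaming.jar \
#     -jobconf stream.recordreader.compression=gzip \
#     -input  /data/raw \
#     -output /data/wordcounts \
#     -mapper  "wordcount_streaming.py map" \
#     -reducer "wordcount_streaming.py reduce" \
#     -file wordcount_streaming.py
import sys


def mapper():
    # Emit "word<TAB>1" for every word on stdin; Hadoop sorts by key
    # before the reducer sees the stream.
    for line in sys.stdin:
        for word in line.strip().split():
            print("%s\t%d" % (word, 1))


def reducer():
    # Input arrives grouped by key, so a running total per word is enough.
    current_word, count = None, 0
    for line in sys.stdin:
        word, value = line.rstrip("\n").split("\t", 1)
        if word != current_word:
            if current_word is not None:
                print("%s\t%d" % (current_word, count))
            current_word, count = word, 0
        count += int(value)
    if current_word is not None:
        print("%s\t%d" % (current_word, count))


if __name__ == "__main__":
    if len(sys.argv) > 1 and sys.argv[1] == "reduce":
        reducer()
    else:
        mapper()
```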
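
And for the output-directory gotcha, a tiny wrapper along these lines, which clears the output path before submitting the job, saves a lot of failed runs. The path is an assumption to match the sketch above; note that older releases use hadoop fs -rmr while newer ones use hadoop fs -rm -r.

```python
#!/usr/bin/env python
# run_job.py -- hypothetical wrapper that removes the previous run's output
# directory before launching a job, so Hadoop doesn't abort because the
# directory already exists. OUTPUT_DIR is an illustrative assumption.
import subprocess

OUTPUT_DIR = "/data/wordcounts"  # should match the job's -output argument

# Recursively remove the old output; the call simply fails harmlessly if the
# directory doesn't exist yet. On newer Hadoop use ["hadoop", "fs", "-rm", "-r", ...].
subprocess.call(["hadoop", "fs", "-rmr", OUTPUT_DIR])

# ...then submit the streaming job as usual, e.g. with another subprocess.call([...])
```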