MongoDB Blog Contest – I Won!
I entered my post on Why (and How) I Replaced Amazon SQS with MongoDB into the MongoDB Blog Contest and won!
As part of the grand prize, I went out to OSCON in Portland, OR and had a great time! Kristina Chodorow’s new book on MongoDB was on display, and I spoke to lots of people who were interested in what MongoDB is and what it can do for them. I also went to a lot of talks, the most interesting of which (in my opinion) were about node.js and the analytics done at Twitter using Hadoop and Pig.
If you haven’t checked out node.js yet, you should. It’s a really cool technology that provides an event-driven I/O system on top of Google’s V8 JavaScript engine. Now you can take all of that JavaScript hacking that you’ve been doing and write shell scripts in JavaScript (sort of like how your C# knowledge helps with PowerShell) or even write full-fledged web servers!
Node.js is incredibly fast, and it stays fast with large numbers of connections by following an event-loop I/O strategy similar to NGINX, or even to the message loop of a Windows application. Rather than posting work to a thread pool or creating a new thread per connection, it uses a single thread and a message queue of operations (I’m not sure if there are multiple threads with their own queues, but you get the idea). Also, node.js is designed so that essentially every I/O operation takes a callback. This can make for some messy-looking nested code, but it ensures that you think about how to parallelize your application as much as possible. It’s really easy to get started, and it has a great community behind it. My favorite starting point was Blog rolling with mongoDB, express and Node.js.
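To make the callback style concrete, here is a minimal sketch using node's built-in fs module. The writeThenRead helper, the events array, and the temp file name are made up for illustration; the point is only that each I/O call returns immediately and the rest of the work happens later inside a nested callback.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Records the order in which things happen (illustrative only).
const events = [];

// Hypothetical helper: write a file, then read it back.
// Each step starts only inside the callback of the previous one.
function writeThenRead(file, text, done) {
  events.push('write requested');
  fs.writeFile(file, text, (writeErr) => {
    if (writeErr) return done(writeErr);
    events.push('write finished');
    fs.readFile(file, 'utf8', (readErr, contents) => {
      if (readErr) return done(readErr);
      events.push('read finished');
      done(null, contents);
    });
  });
  // Reached before either callback has run: the I/O is still in flight.
  events.push('function returned');
}

const demoFile = path.join(os.tmpdir(), `callback-demo-${process.pid}.txt`);

writeThenRead(demoFile, 'hello from node', (err, contents) => {
  if (err) throw err;
  console.log(contents);
  console.log(events.join(' -> '));
  fs.unlink(demoFile, () => {}); // clean up the temp file
});
```

When this runs, 'function returned' is recorded before 'write finished', because the single thread moves on while the write is pending and only comes back to the callback once the operation completes. The nesting is the messy part; the fact that nothing blocks is the payoff.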
I haven’t personally used Pig and I’m not currently using Hadoop, but if I were to start using Hadoop, Pig would be the first thing I would install. Hadoop allows you to store and process large amounts of data in a distributed fashion. To process the data, you write map and reduce methods in Java and then let Hadoop take over. The problem comes when you’re rewriting large chunks of Java code and compiling over and over again to figure out how you want to view your data and how to process it. At Twitter, with over 75 billion tweets, a simple map-reduce can take a long time! Then you need to debug your code, compile, and run. Then again, and again. And if you’re trying to do a distributed join, the pain really starts. Kevin Weil from Twitter showed the Java code for one, which is over 200 lines and needs to be custom-written for each map-reduce you perform. That’s slow and painful! In comes Pig. Pig is a scripting language that rides on top of Hadoop and allows you to quickly perform operations on your data without having to worry about writing the Java code and compiling. It’s much more terse and readable, and it takes away a lot of the debugging headaches of writing your own joins or other complexities. It seems like a great addition to the map-reduce world!
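To give a feel for what a hand-written join involves, here is a small in-memory sketch of the usual reduce-side join idea, written in plain JavaScript. This is not Hadoop’s API and not the code from the talk; the function names and the sample users and tweets are all made up. Each record is tagged with where it came from, grouped by the join key, and then matched up in the reduce step.

```javascript
// Map phase: emit [joinKey, taggedRecord] for each input record.
function mapUser(user) {
  return [user.id, { source: 'user', name: user.name }];
}
function mapTweet(tweet) {
  return [tweet.userId, { source: 'tweet', text: tweet.text }];
}

// Shuffle phase: group all emitted values by key
// (a real cluster does this across machines).
function shuffle(pairs) {
  const groups = new Map();
  for (const [key, value] of pairs) {
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(value);
  }
  return groups;
}

// Reduce phase: for one key, pair every user record with every tweet record.
function reduceJoin(key, values) {
  const users = values.filter((v) => v.source === 'user');
  const tweets = values.filter((v) => v.source === 'tweet');
  const rows = [];
  for (const u of users) {
    for (const t of tweets) {
      rows.push({ userId: key, name: u.name, text: t.text });
    }
  }
  return rows;
}

// Drive the three phases in order; keys with no user or no tweet yield nothing.
function joinUsersWithTweets(users, tweets) {
  const pairs = [...users.map(mapUser), ...tweets.map(mapTweet)];
  const result = [];
  for (const [key, values] of shuffle(pairs)) {
    result.push(...reduceJoin(key, values));
  }
  return result;
}

// Hypothetical sample data.
const sampleUsers = [{ id: 1, name: 'alice' }, { id: 2, name: 'bob' }];
const sampleTweets = [
  { userId: 1, text: 'first' },
  { userId: 2, text: 'second' },
  { userId: 1, text: 'third' },
  { userId: 3, text: 'no matching user' },
];

const joined = joinUsersWithTweets(sampleUsers, sampleTweets);
console.log(joined);
```

Even this toy version needs tagging, grouping, and a nested loop, and it ignores everything that makes the distributed case hard. As far as I understand it, Pig lets you express the same thing with a single JOIN ... BY ... statement and generates the map-reduce jobs for you.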