Why Not Hadoop?
We are flying back from Boston after an excellent week at the Architecture Technology Review. This was my first interaction with the Nokia architecture community at large, and I was really pleased (and, I have to admit, somewhat surprised) to see how awesome many of the developments coming down the pipe are. Ville gave a talk on what we have been doing with Disco, and we also gave a demo during one of the 'speed geeking' sessions. One of the most common questions we were asked was "why not Hadoop?", so I thought I'd give my opinion on the subject.
Prior to coming to the NRC, I used Hadoop for about a year and a half (doing bioinformatics), and I must say that it served me quite well. To be sure, there were problems along the way, but Hadoop enabled me to do analyses that I would not otherwise have done. That is not because they would have been impossible without Hadoop, but because mapreduce makes it so easy to parallelize a huge class of problems that the overhead of doing things with big data becomes amazingly small.
Even when using Hadoop, I always used Python (with Hadoop Streaming) to write map/reduce functions, because Python is such a pleasure to write, and because I am much more productive writing Python than Java (or pretty much any other language). Because of my love for Python, I often wondered why no one had yet written a Python implementation of mapreduce, and even considered writing my own. I think it is natural for anyone who thinks about the design of systems to question the validity of architecture decisions and to wonder how those designs might be improved. Of course, actually implementing a new design is a whole other story, and finding the impetus to do so, especially when a reasonably good implementation (with lots of high-profile developers) already exists, is not always easy.
When I discovered the Disco project, which is part Erlang, part Python, I was deeply intrigued. I questioned the choice of Erlang (not knowing much about it), but Ville's argument was extremely pragmatic: Erlang is really good at distributed stuff (that's what it was built to do), and Python is awesome for high-level programming (i.e. it's fun, easy to read/write, expressive, etc.).
But I guess the question remains: why not Hadoop? This question is hard to answer because it is largely a matter of taste. The bottom line is that neither Hadoop nor Disco is really a mature project (Hadoop IS more highly developed than Disco, though), while it seems to me the choice of framework is a long-term question.
For me, wanting to use Python to improve the framework itself is a no-brainer (additionally, Jython is currently too far behind CPython for me to consider it a replacement). Why Disco? Because of its philosophy: massive data - "minimal code". Being lightweight is a design goal in Disco, and we really, truly care about programmer overhead. If we are trying to optimize programmer productivity, framework development should be as agile as possible. My vision of Disco is a framework that can be shaped to the needs of its users (including myself), by its users. For me, the reality of Hadoop was quite different.


