David DeWitt and Michael Stonebraker write about MapReduce in "The Database Column". Now, I usually like what Michael Stonebraker writes (e.g. his piece on the RDBMS demise, which I also wrote about myself). However, I can't say that this time around.
David and Michael write that MapReduce is a big step backwards. Before I talk about what they write, here is a (very high level) reminder of what MapReduce is.
MapReduce, as Google's Jeffrey Dean and Sanjay Ghemawat explain, is a way to get automatic parallelization and distribution, along with fault tolerance, monitoring and I/O scheduling, for tasks that need to work on complete datasets. MapReduce uses two functions:
  • Map - multiple instances of which run in parallel, each processing a key/value pair and producing a set of grouping keys and intermediate values.
  • Reduce - which runs per grouping key and merges the intermediate values into a set of merged outputs (usually one).
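To make the two functions concrete, here is a minimal single-process sketch of the classic word count. It only illustrates the shape of Map and Reduce; it is not Google's implementation, and it has none of the parallelism or fault tolerance. The names `map_words`, `reduce_counts` and `run_mapreduce` are made up for this example.

```python
from collections import defaultdict


def map_words(doc_id, text):
    """Map: take one key/value pair, emit (grouping key, intermediate value) pairs."""
    for word in text.lower().split():
        yield word, 1


def reduce_counts(word, counts):
    """Reduce: merge all intermediate values of one grouping key into one output."""
    return sum(counts)


def run_mapreduce(records, map_fn, reduce_fn):
    """Hypothetical toy driver: runs sequentially, where a real framework
    would spread the map and reduce calls over many machines."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)  # group intermediate values by key
    return {key: reduce_fn(key, values) for key, values in groups.items()}


docs = [("d1", "apples and oranges"), ("d2", "apples and databases")]
word_counts = run_mapreduce(docs, map_words, reduce_counts)
print(word_counts)
```

Note that every call to `map_words` is independent of the others, and so is every call to `reduce_counts`. That independence is what lets a real framework parallelize the work.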
David and Michael claim that MapReduce is:
1. a step backwards because it doesn't build on Schema
2. a poor implementation because it doesn't use indexes
3. not new
4. missing features - like bulk load, indexing, updates, transactions, integrity constraints, referential integrity, views
5. incompatible with DBMS tools - like report writers, BI tools, replication tools, design tools

Well, if anything, it seems that David and Michael don't really understand what MapReduce is. As I noted above, MapReduce is a way to go over complete datasets in an efficient, distributed manner. In fact, it can even be used to build the index of a traditional RDBMS. It isn't really competing with databases, relational or otherwise. Yep, comparing MapReduce and a database is the apples and oranges thing...
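As a rough illustration of the index-building point, here is a toy sketch that builds a value-to-row-ids lookup in map/reduce style. It is an assumption-laden example, not how any real RDBMS builds its indexes (those are typically B-trees on disk). The table, the `city` column and the function names are all hypothetical.

```python
from collections import defaultdict


def map_row(row_id, row):
    """Map: emit (indexed column value, row id) for one table row."""
    yield row["city"], row_id


def reduce_postings(value, row_ids):
    """Reduce: merge all row ids that share one column value into a sorted list."""
    return sorted(row_ids)


def build_index(rows):
    """Hypothetical sequential driver standing in for a distributed framework."""
    groups = defaultdict(list)
    for row_id, row in rows:
        for key, value in map_row(row_id, row):
            groups[key].append(value)
    return {key: reduce_postings(key, ids) for key, ids in groups.items()}


rows = [
    (1, {"name": "Ann", "city": "Oslo"}),
    (2, {"name": "Bob", "city": "Rome"}),
    (3, {"name": "Cy", "city": "Oslo"}),
]
city_index = build_index(rows)
print(city_index)
```

The MapReduce job only produces the index. Serving lookups against it is still the database's job, which is why the two are complementary rather than competing.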

I guess they might have meant to talk about another Google tool called BigTable, which is at least sort of a column database (Michael's company also makes a column database) for storing structured data in a highly distributed, high-performance way. However, David and Michael would still be wrong: BigTable is proprietary and targeted at a specific purpose, so it isn't supposed to solve the same problems as a general-purpose database. Not to mention that it is highly scalable (ever heard of Google's search engine? ;) ) and does support things like indexes, updates etc.

Also, as I mentioned in the "RDBMS is dead" post, the internet proved that RDBMS features (like transactions etc.) can only scale so much. While databases focus on the Consistency and Availability parts of the CAP conjecture and on the ACID tenets, internet-scale systems pick Partition tolerance and Availability, and the BASE tenets, instead.


 