David DeWitt and Michael Stonebraker
write about MapReduce in
"The Database Column". Now I usually like what Michael Stonebraker writes (e.g. his piece on the
RDBMS demise which
I also wrote about myself). However, I can't say that this time around.
David and Michael write that MapReduce is a big step backwards. Before I
talk about what they write, here is a (very high-level) reminder of what
MapReduce is.
MapReduce, as Google's Jeffrey Dean and Sanjay Ghemawat explain,
is a way to get automatic parallelization and distribution, along with
fault tolerance, monitoring and I/O scheduling, for tasks that need to
work on complete datasets. MapReduce uses two functions:
- Map - multiple instances of which run in parallel, each processing a key/value pair and producing a set of grouping key(s) and intermediate values.
- Reduce - which runs per grouping key and merges the intermediate values into a set of merged outputs (usually one).
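To make the two functions concrete, here is a minimal single-process sketch in Python. It only illustrates the shape of the programming model; it is not Google's implementation, and it has none of the parallelization, distribution or fault tolerance that the real thing provides. The names `map_fn`, `reduce_fn` and `run_mapreduce`, as well as the word-count task itself, are my own picks for illustration.

```python
from collections import defaultdict


def map_fn(key, value):
    """Map: take one (document name, text) pair and emit (word, 1) pairs."""
    for word in value.lower().split():
        yield word, 1


def reduce_fn(key, values):
    """Reduce: merge all intermediate values of one grouping key into one output."""
    return sum(values)


def run_mapreduce(records, map_fn, reduce_fn):
    """Toy in-memory driver (an assumption, not a real framework).

    A real system would run many map and reduce instances on different
    machines; here everything runs sequentially in one process.
    """
    groups = defaultdict(list)
    # Map phase: every input pair can produce any number of (key, value) pairs.
    for key, value in records:
        for out_key, out_value in map_fn(key, value):
            # Grouping step: collect intermediate values per grouping key.
            groups[out_key].append(out_value)
    # Reduce phase: one call per grouping key.
    return {key: reduce_fn(key, values) for key, values in groups.items()}


docs = [
    ("doc1", "the quick brown fox"),
    ("doc2", "the lazy dog and the fox"),
]
counts = run_mapreduce(docs, map_fn, reduce_fn)
print(counts)
```

Note that neither `map_fn` nor `reduce_fn` knows anything about where the data lives or how many machines are involved, which is exactly what lets a framework distribute them.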
David and Michael claim that MapReduce is:
1. a step backwards because it doesn't build on a schema
2. a poor implementation because it doesn't use indexes
3. not new
4. missing features - like bulk load, indexing, updates, transactions, integrity constraints, referential integrity, views
5. incompatible with DBMS tools - like report writers, BI tools, replication tools, design tools
Well, if anything, it seems that David and Michael don't really understand
what MapReduce is. As I noted above, MapReduce is a way to go over
complete datasets in an efficient, distributed manner. In fact, it can even be used to build the index of a traditional
RDBMS. It isn't really competing with databases, relational or otherwise. Yep, comparing MapReduce and a database is the apples and oranges thing...
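As a rough illustration of the index-building point, the sketch below uses the same map/reduce shape to turn a set of rows into a lookup from a column value to the ids of the rows that contain it. Everything here is hypothetical: the `city` column, the sample rows and the function names are made up for the example, and a real RDBMS index (a B-tree, say) is far more involved than a Python dictionary.

```python
from collections import defaultdict


def index_map(row_id, row):
    """Map: emit (indexed column value, row id) for one table row."""
    yield row["city"], row_id


def index_reduce(value, row_ids):
    """Reduce: merge the row ids seen for one column value into a sorted list."""
    return sorted(row_ids)


def build_index(rows):
    """Toy single-process driver; a real job would spread the work over many machines."""
    groups = defaultdict(list)
    for row_id, row in rows:
        for key, rid in index_map(row_id, row):
            groups[key].append(rid)
    return {key: index_reduce(key, rids) for key, rids in groups.items()}


# Hypothetical table contents: (primary key, row) pairs.
rows = [
    (1, {"name": "Ann", "city": "Oslo"}),
    (2, {"name": "Bob", "city": "Rome"}),
    (3, {"name": "Cy", "city": "Oslo"}),
]
index = build_index(rows)
print(index["Oslo"])
```

The point is that the index is the output of the job: MapReduce does the full pass over the data, and the database (or whatever serves the queries) uses the result.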
I guess they might have meant to talk about another Google tool called
BigTable,
which is at least sort of a column database (Michael's company also
makes a column database) for storing structured data in a highly
distributed, high-performance way. However, David and Michael would
still be wrong: BigTable is proprietary and targeted at a specific
purpose, so it isn't supposed to solve the same problems as a
general-purpose database. Not to mention that it is highly scalable (ever heard
of Google's search engine ;) ) and does support things like indexes,
updates etc.
Also, as I mentioned in the
"RDBMS is dead"
post, the internet proved that RDBMS features (like transactions etc.)
can only scale so much. While databases focus on the Consistency
and Availability parts of the
CAP conjecture and the ACID tenets, internet-scale systems pick Partition tolerance and Availability and the BASE tenets instead.