Microsoft recently released SP1 for .NET. While the SP brings some nice stuff it seems it also has some bugs and a few less than inspiring components
Another example for a less than stellar idea is the "ADO.NET data services" component. Before I go on to explain why I think that. I should probably mention that this isn't just a Microsoft thing as IBM also mentions similar ideas as part of their (broader and sometimes even worse) view of "Information as a Service"

So why is exposing the database through a web-service (RESTful or otherwise) is wrong? let me count the ways
  1. It circumvents the whole idea about "Services" - there's no business logic
  2. It makes for CRUD resources/services
  3. It is exposing internal database structure or data rather than a thought-out contract
  4. It encourages bypassing real services and going straight to their data
  5. It creates a blob service (the data source)
  6. It encourages minuscule demi-serices (the multiple "interfaces" of said blob) that disregard few of the fallacies of distributed computing
  7. It is just client-server in sheep's clothing
When it comes for ADO.NET data services you can add a few other problems like
  1. it isn't really RESTful - you can also "enhance" the services with operations like example 18 in "Using ADO.NET data services" : http://host/vdir/northwind.svc/CustomersByCity?city=London (though it does support caching and hypermedia )
  2. Also it doesn't really externalize a state machine it externalizes a relational model
  3. It is built on Entity Services.

 
Tags: .NET | data | REST | SOA

January 26, 2008
@ 11:31 PM
David J. DeWitt and Michael Stonebraker are at it again. There was a lot of buzz on the internet after their previous post (here is what I had to say about it).
Their first point on the new post tries to counter the claim that MapReduce is not a database so it shouldn't be judged as one. They claim that it isn't a matter of apples and oranges but rather
 " We are judging two approaches to analyzing massive amounts of information, even for less structured information."

The problem with that is they continue from there to define a problem in database terms and then show how MapReduce will not be as good as a database in solving it - well, duh.
The fact that isolated queries may run better in a pre-indexed database should come as no great surprise. As I noted in the previous post on the subject - MapReduce can be used to create the appropriate index or partition the data into smaller chunks that would be easier to use to answer the type of queries David and Michael mention.
As Mark Chu-Carroll explains Map/Reduce and databased don't solve the same kind of problems

Also what happens when the database is constantly updated ?!  - I don't mind how scientifically accurate are the measurements that say database scale like no other things. I am more comfortable with the empiric experience by companies like Amazon, Diggs, Google and ebay who found they have to shard their data to support their scalability needs and not use distributed transactions/distributed databased.


 
David DeWitt and Michael Stonebraker write about MapReduce in "The Database Column". Now I usually like what Michael Stonebraker writes (e.g. his piece on the RDBMS demise which I also wrote about myself). However I can't say that this time around.
David and Michael write that MapReduce is a big step backwards. before I'll talk about what they write, here is a (very high level) reminder what Map/Reduce is
MapReduce as Google's Jeffery Dean and Sanjay Ghemawat explain is a way to get automatic parallelization and distribution along with fault tolerance, monitoring and I/O scheduleing for tasks that need to work on complete datasets. MapReduce uses two functions:
  • Map - multiple instances of which run in parallel  to process a key/value pair and produce  produce a set of  grouping key(s) and intermediate values.
  • Reduce - which runs per grouping key and merge the intermediate values to a a set of merged outputs (usually one)
David and Michael claims that MapReduce is
1. a step backwards because it doesn't build on Schema
2. a poor implementation because it doesn't use indexes
3. not new
4. missing features - like bulk load, indexing, updates, transactions, integrity constraints, referential integrity, views
5. incompatible with DBMS tools - like report writers, BI tools, replication tools, design tools

Well, if anything, it seems that David and Michael don't really understand what MapReduce is. As I noted above MapReduce is a way to go over complete sets in an efficient distributed manner. In fact it can even be used to build the index of a traditional RDBMS. It isn't really competing wit databases Relational or other. Yep, comparing MapReduce and databse is the  apples and oranges thing...

I guess they might have meant to talk about another Google tool called BigTable - which is at least sort of a column database (Michael's company also makes a column database) for storing structured data in a highly distributed , high performance way. However David and Michael would still be wrong as BigTable is proprietary and targeted at a specific purpose so it isn't supposed to solve the same problems as a general purpose  database not to mention that it is highly scalable (ever heard of google's search engine ;) ) and does support things like indexes, updates etc.

Also as I mentioned in the "RDBMS is dead" post, the internet proved that RDBMS features (like transactions etc.)  can only only scale so much.  While Databases focus on the Consistency and Availability parts of the CAP conjecture and ACID tenets , internet scale systems pick Partitioning and Availability and BASE tenets instead.


 
September 7, 2007
@ 02:07 PM
In the previous post on the subject I wrote that the RDBMS is dead. I didn't mean that it is dead dead, but rather that it isn't well build to meet some of the newer challenges like linear scalability, high availability etc.
Well, it is one thing hearing it from me - and it is another thing hearing it from someone like Michael Stonebraker.
Michael, was the main architect for the Ingres prototype project at UC Berkely just one year after Codd's paper and (9 years before Oracle was released  and more than a decade before the commercial version of Ingress was released).

Well that was in 1970 - in 2007 Michael recently wrote :
 
In short, the world of 2007 is radically different from the world of the late 1970s. However, none of the major vendors have performed a complete redesign to deal with this changed landscape. As such they should be considered legacy technology, more than a quarter of century in age and "long in the tooth".
Among the new needs Michael cites are intelligence DBMSs (needs a lot of relations), textual and semi-structural data etc. He also said (promoting his own product) that 2007 customers expect high availability, linear scalability.
Michael's main point is that specialization can provide significant performance enhancements vs. the one-size-fits-all approach of RDBMSs. He gives his product (Vertica) as an example for how a column oriented database (vs. the RDBMS row orientation) can outperform RDBMs by a factor of 50. Google's Big table is another example.

Interesting...


 
August 21, 2007
@ 02:58 PM
Ok now, that I got your attention, that it isn't dead yet - but we can see a whole class of applications (maybe a couple of classes) where the importance of the RDBMS as we know it today is greatly diminished.
In an article I posted recently on InfoQ, (which I also mentioned in the post on eBay architecture last week ) I discussed the notion of database denormalization on internet-scale sites (such as Amazon, eBay, Flickr etc.). One point of denormalization is immutable data where there isn't a lot of gain in normalization to begin with.
The other thing is entity representation vs. speed. The problem is that joins are slow and sometimes you get to corners where if we want any type of scent speed we need to denormalize. Todd Hoff notes that as well:
The problem is joins are relatively slow, especially over very large data sets, and if they are slow your website is slow. It takes a long time to get all those separate bits of information off disk and put them all together again. Flickr decided to denormalize because it took 13 Selects to each Insert, Delete or Update.
This point is, however, that these "corner cases" get more and more prevalent even in smaller scale application - especially when you have complex entities (as is the case with defense systems for example). Mats Helander, recently wrote a post about saving to Blob, and only adding fields as needed for indexing and identity purposes. Mats also suggest the semi-transparent way of using XML columns where the database can do something with the otherwise opaque data.
This point in fact, demonstrate that the relational data future is indeed not totally secures as we  do see that that leading databases  begin to treat XML data (which is hierarchical and not relational)  as a native citizen - to the point we can even index XML data.

So far we've seen a trend to denormalize more, handle non-relational data, what else? ah transactions
Ive worked on several systems where the data was constantly updated and actually gave the system's representation of the world out-side (of the system) the focus was on availability and latency. Which is again also aligned with the approach taken by the large internet sites which emphasis eventual consistency over immediate consistency.
In distributed systems crashes happen. The RDBMS is show-stopper when it comes to crashes - if we can't commit, we need to stop,roll back. now maybe we can start-over. Is this acceptable? there are many scenarios where it is not. I've seen it in defense systems, in communications systems and even in e-commerce systems (if you are not responsive, I'll just go to the competition).
What do you do in the presence of error? Joe Armstrong suggest the following as the basis for Erlang in his thesis:
To make a fault-tolerant software system which behaves reasonably in the presence of software errors we proceed as follows:

1. We organize the software into a hierarchy of tasks that the system has to perform. Each task corresponds to the achievement of a number of goals. The software for a given task has to try and achieve the goals associated with the task. Tasks are ordered by complexity. The top level task is the most complex, when all the goals in the top level task can be achieved then the system should function perfectly. Lower level tasks should still allow the system to function in an acceptable manner, though it may offer a reduced level of service.The goals of a lower level task should be easier to achieve than the goals of a higher level task in the system.

2. We try to perform the top level task.

3. If an error is detected when trying to achieve a goal, we make an attempt to correct the error. If we cannot correct the error we immediately abort the current task and start performing a simpler task.

On top of that we try to keep any update local i.e. within a task boundary on the hardware where the task occurred - distributing the transactions is not a good option. I outlined why when I talked about SOA and cross-services transactions but the reasoning holds.

Well, truth be said the RDBMS is not dead, its demise probably not even around the corner. Also this does not mean that there aren't any uses for a database. But that's true for other architectural choices. Who ever said that a single tier solution is not the right one for very specific types of system...
RDBMS succeeded to to become the de-facto standard to building system because they offer some very compelling attributes - ACID brings a lot of piece of mind. Large scale systems,low-latency system and fault tolerant systems opt for another set of compelling attributes  (BASE). The point is that  when you design your next solution maybe the conventional database thinking is something that you should at least give another thought to and instead of just following dogma