[Last week was a little rough, between a looming milestone @ work and my son fracturing his elbow @ home, but hopefully I'll be back to the regular schedule this week]

Stateless services are da bomb, right? They are easy to scale (since they have no state you can deploy as many as you like), they are easy to reuse (no state - no baggage) and whatnot.
The only problem with that is that the state doesn't really go away. Stateless services just suffer from NIMBYism ("Not in my back yard") when it comes to state. A stateless service still needs state when it performs its action, and since the state is not there, it has to get it from somewhere.

There are basically two approaches to getting the state into the stateless service.
The common way is to make the state someone else's problem (usually that would spell a database). With this approach the stateless service performs queries (database or otherwise) to get the state from the 3rd party. This is problematic in many ways, e.g.
  • You need to pay network tax for getting the state (remember the fallacies of distributed computing..)
  • If that someone else is a single source (such as a database) it can easily become a barrier to scalability (I wrote about the RDBMS problem in the RDBMS is dead). If it isn't a single source you need to go to multiple sources, so the network problem multiplies
  • You need to pay network tax again for putting the state back in the state repository
The other way to get the state is to put the state on the message - or the "document" approach. This approach is superior to the previous one as you get to piggyback the data on the request. This is a good example of stateless communications*, which, as a side effect, can save the stateless service the problems mentioned above.
The "state on the message" approach works when the handling of messages is serialized, i.e. only one "station" in the flow can make changes to the state at any one time. Unfortunately this only works for a subset of the interactions you can have. In most cases multiple consumers need to get to the same data or coordinate.
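As a rough illustration of the two approaches, here is the same made-up order-pricing service written both ways. Everything in it (`StateStore`, `Order`, `price_from_store`, `price_from_message`) is hypothetical, and the in-memory dictionary only stands in for a database sitting across the network.

```python
from dataclasses import dataclass, replace


class StateStore:
    """Hypothetical stand-in for a remote database; each call would be a network hop."""

    def __init__(self):
        self._rows = {}
        self.round_trips = 0

    def load(self, key):
        self.round_trips += 1
        return self._rows[key]

    def save(self, key, value):
        self.round_trips += 1
        self._rows[key] = value


@dataclass(frozen=True)
class Order:
    order_id: str
    quantity: int
    unit_price: float
    total: float = 0.0


def price_from_store(store, order_id):
    """Approach 1: the state is someone else's problem, so fetch it and put it back."""
    order = store.load(order_id)
    priced = replace(order, total=order.quantity * order.unit_price)
    store.save(order_id, priced)
    return priced


def price_from_message(order):
    """Approach 2: the state rides on the message and the reply carries the new state."""
    return replace(order, total=order.quantity * order.unit_price)


store = StateStore()
store.save("o-1", Order("o-1", quantity=3, unit_price=2.5))
store.round_trips = 0  # count only the service's own traffic from here on

via_store = price_from_store(store, "o-1")
via_message = price_from_message(Order("o-1", quantity=3, unit_price=2.5))
```

Both calls compute the same total, but the first pays two round trips to the store for it while the second pays none. The second version is only safe as long as nobody else changes the order while the message is in flight, which is exactly the serialization limit described above.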

You can also combine the two approaches and sometimes get good results.
Another way altogether is to look at stateful services, which I'll talk about in the next post.



* Many times people fail to make the distinction between stateless services and stateless communications - I'll expand on that in another post.


 
Tags: scalability | SOA | Software Architecture

January 26, 2008
@ 11:31 PM
David J. DeWitt and Michael Stonebraker are at it again. There was a lot of buzz on the internet after their previous post (here is what I had to say about it).
Their first point in the new post tries to counter the claim that MapReduce is not a database so it shouldn't be judged as one. They claim that it isn't a matter of apples and oranges but rather
 "We are judging two approaches to analyzing massive amounts of information, even for less structured information."

The problem with that is they continue from there to define a problem in database terms and then show how MapReduce will not be as good as a database in solving it - well, duh.
The fact that isolated queries may run better in a pre-indexed database should come as no great surprise. As I noted in the previous post on the subject - MapReduce can be used to create the appropriate index or partition the data into smaller chunks that would be easier to use to answer the type of queries David and Michael mention.
As Mark Chu-Carroll explains, MapReduce and databases don't solve the same kind of problems.

Also, what happens when the database is constantly updated?! I don't care how scientifically accurate the measurements are that say databases scale like nothing else. I am more comfortable with the empirical experience of companies like Amazon, Digg, Google and eBay, who found they have to shard their data to support their scalability needs rather than use distributed transactions/distributed databases.
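For readers who haven't met the term, sharding just means that each piece of data lives on exactly one of several independent databases, chosen by some function of its key. The sketch below shows one simple way to do that (a hash of the key, modulo the number of shards). It is an illustration only: `shard_for` and the shard count are made up, and I'm not claiming this is the scheme any of the companies above actually use.

```python
import hashlib


def shard_for(key, shard_count):
    """Map a key to a shard number in a stable, evenly spread way."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


SHARD_COUNT = 4  # assumed number of independent databases
users = ["alice", "bob", "carol", "dave"]
placement = {user: shard_for(user, SHARD_COUNT) for user in users}
```

Every read or write for a given user goes to one shard only, so no transaction ever has to span databases. The price is that queries across users now have to visit several shards, and that changing `SHARD_COUNT` moves most keys around.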


 
Tags: data | scalability | Software Architecture | Trends

David DeWitt and Michael Stonebraker write about MapReduce in "The Database Column". Now, I usually like what Michael Stonebraker writes (e.g. his piece on the RDBMS demise, which I also wrote about myself). However, I can't say that this time around.
David and Michael write that MapReduce is a big step backwards. Before I talk about what they write, here is a (very high level) reminder of what MapReduce is.
MapReduce, as Google's Jeffrey Dean and Sanjay Ghemawat explain, is a way to get automatic parallelization and distribution along with fault tolerance, monitoring and I/O scheduling for tasks that need to work on complete datasets. MapReduce uses two functions:
  • Map - multiple instances of which run in parallel, each processing a key/value pair and producing a set of grouping key(s) and intermediate values.
  • Reduce - which runs per grouping key and merges the intermediate values into a set of merged outputs (usually one)
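To make the two functions concrete, here is the usual word-count example as a minimal single-process sketch. The `run_mapreduce` helper is a toy stand-in I made up for the framework; it does the grouping step in memory and has none of the parallelization, distribution or fault tolerance that the real thing is about.

```python
from collections import defaultdict


def map_fn(key, value):
    """Map: take one (document name, text) pair and emit (word, 1) pairs."""
    for word in value.lower().split():
        yield word, 1


def reduce_fn(key, values):
    """Reduce: merge all the intermediate values of one grouping key."""
    return sum(values)


def run_mapreduce(records, map_fn, reduce_fn):
    """Toy framework: map every record, group by key, then reduce each group."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)
    return {key: reduce_fn(key, values) for key, values in groups.items()}


docs = [("doc1", "the quick brown fox"), ("doc2", "the lazy dog and the fox")]
counts = run_mapreduce(docs, map_fn, reduce_fn)
```

Note that `map_fn` only ever sees one record and `reduce_fn` only ever sees one key, which is what lets a real framework spread them over many machines.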
David and Michael claim that MapReduce is
1. a step backwards because it doesn't build on Schema
2. a poor implementation because it doesn't use indexes
3. not new
4. missing features - like bulk load, indexing, updates, transactions, integrity constraints, referential integrity, views
5. incompatible with DBMS tools - like report writers, BI tools, replication tools, design tools

Well, if anything, it seems that David and Michael don't really understand what MapReduce is. As I noted above, MapReduce is a way to go over complete sets in an efficient distributed manner. In fact it can even be used to build the index of a traditional RDBMS. It isn't really competing with databases, relational or other. Yep, comparing MapReduce and databases is the apples and oranges thing...

I guess they might have meant to talk about another Google tool called BigTable, which is at least sort of a column database (Michael's company also makes a column database) for storing structured data in a highly distributed, high-performance way. However, David and Michael would still be wrong: BigTable is proprietary and targeted at a specific purpose, so it isn't supposed to solve the same problems as a general-purpose database. Not to mention that it is highly scalable (ever heard of Google's search engine ;) ) and does support things like indexes, updates etc.

Also, as I mentioned in the "RDBMS is dead" post, the internet proved that RDBMS features (like transactions etc.) can only scale so much. While databases focus on the Consistency and Availability parts of the CAP conjecture and on the ACID tenets, internet-scale systems pick Partition tolerance and Availability and the BASE tenets instead.


 
Tags: data | scalability | Software Architecture

From time to time I read about the magic that is RESTful services and how they solve everything and anything, like scalability, idempotency, simplicity etc. For instance, in "RESTful Web Services" Sam Ruby and Leonard Richardson say
 "PUT and DELETE operations are idempotent. if you DELETE a resource, it's gone. If you DELETE it again, it's still gone..." (p.103)
or
"the safe methods, GET and HEAD, are automatically idempotent as well" (p.219)

Another example comes from Anne Thomas Manes who said

"The REST architectural style defines a number of basic rules (constraints), and if you adhere to these rules, your applications will exhibit a number of desirable characteristics, such as simplicity, scalability, performance, evolvability, visibility, portability, and reliability.

The basic rules are:
  • Everything that's interesting is named via a URI and becomes an addressable resource
  • Every resource exposes a uniform interface (e.g., GET, PUT, POST, DELETE)
  • You interact with the resource by exchanging representations of the resource's state using the standard methods in the uniform interface
"

I think such claims are plainly wrong and misleading.
 
Don't get me wrong, I like the REST approach, since it encourages better service design - e.g. document-oriented message exchange vs. the RPC-like message exchange which the so-called "WS-death-*s" (or actually the tools that support them) encourage.

It also encourages the above-mentioned traits - however, that's exactly the point - REST encourages this thinking; it does not solve scalability or other problems out of the box. You still need to design your services properly.

For instance, if you follow Anne's rules you can still end up with a service which is stateful and performs heavy distributed transactions against multiple databases and systems - i.e. a service that is neither simple, scalable, nor performant.

DELETE will only be idempotent if the resource is idempotent (e.g. a specific version of a resource) or the message is idempotent (e.g. requesting the deletion of a specific version). If you are deleting the "recent version", it might have been recreated between your calls, and you are now deleting something completely different. Heck, even a GET (read) message with a single reader can be made non-idempotent if you decide to code something that alters the state of a resource significantly whenever it is read. When you have multiple readers and writers, GET will not be idempotent "automatically", as two consecutive reads can give you two different representations because the resource might have changed in between (again, unless the resources are idempotent).
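Here is a small sketch of the DELETE case, with an in-memory class standing in for the service. `VersionedStore` and the URI shapes in its docstrings are made up for the illustration; the only point is to compare a delete that names a specific version with a delete that names "whatever is latest right now", when the client retries the same request.

```python
class VersionedStore:
    """Hypothetical in-memory resource store that keeps every version of a document."""

    def __init__(self):
        self._versions = {}  # name -> {version number: body}
        self._last = {}      # name -> last version number handed out (never reused)

    def put(self, name, body):
        version = self._last.get(name, 0) + 1
        self._last[name] = version
        self._versions.setdefault(name, {})[version] = body
        return version

    def delete_version(self, name, version):
        """DELETE /docs/{name}/versions/{version}: a repeat changes nothing further."""
        self._versions.get(name, {}).pop(version, None)

    def delete_latest(self, name):
        """DELETE /docs/{name}/latest: the target depends on the state at call time."""
        versions = self._versions.get(name, {})
        if versions:
            del versions[max(versions)]

    def versions(self, name):
        return sorted(self._versions.get(name, {}))


def fresh_store():
    store = VersionedStore()
    for body in ("draft", "review", "final"):
        store.put("doc", body)
    return store


specific = fresh_store()
specific.delete_version("doc", 3)
specific.delete_version("doc", 3)  # the retry is harmless

latest = fresh_store()
latest.delete_latest("doc")
latest.delete_latest("doc")  # the retry removes version 2 as well

remaining_specific = specific.versions("doc")
remaining_latest = latest.versions("doc")
```

Same HTTP verb in both cases, but only the first one is safe to retry. The idempotency comes from how the resource and the message were designed, not from the verb.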

REST is no different from other styles in this respect - for instance, you can do object orientation in C, but working in an OO language encourages object orientation (the opposite is also true - using an object-oriented language does not guarantee that you get an object-oriented design).

At the end of the day, architects should still think about the design if they want to ensure the results match the quality attributes they want to achieve - some environments/styles/tools will make some quality attributes easier to achieve, but nothing will solve the problems for you.



 
Tags: Everything | OO | scalability | SOA | Software Architecture | REST

September 7, 2007
@ 02:07 PM
In the previous post on the subject I wrote that the RDBMS is dead. I didn't mean that it is dead dead, but rather that it isn't well built to meet some of the newer challenges like linear scalability, high availability etc.
Well, it is one thing hearing it from me - and it is another thing hearing it from someone like Michael Stonebraker.
Michael was the main architect of the Ingres prototype project at UC Berkeley, just one year after Codd's paper (9 years before Oracle was released and more than a decade before the commercial version of Ingres was released).

Well, that was in the 1970s - in 2007 Michael wrote:
 
In short, the world of 2007 is radically different from the world of the late 1970s. However, none of the major vendors have performed a complete redesign to deal with this changed landscape. As such they should be considered legacy technology, more than a quarter of century in age and "long in the tooth".
Among the new needs Michael cites are intelligence DBMSs (which need a lot of relations), textual and semi-structured data etc. He also said (promoting his own product) that 2007 customers expect high availability and linear scalability.
Michael's main point is that specialization can provide significant performance enhancements vs. the one-size-fits-all approach of RDBMSs. He gives his product (Vertica) as an example of how a column-oriented database (vs. the RDBMS row orientation) can outperform RDBMSs by a factor of 50. Google's BigTable is another example.

Interesting...


 
Tags: BI | data | Everything | scalability | Software Architecture

August 21, 2007
@ 02:58 PM
OK, now that I've got your attention: it isn't dead yet - but we can see a whole class of applications (maybe a couple of classes) where the importance of the RDBMS as we know it today is greatly diminished.
In an article I posted recently on InfoQ (which I also mentioned in the post on eBay architecture last week) I discussed the notion of database denormalization on internet-scale sites (such as Amazon, eBay, Flickr etc.). One point of denormalization is immutable data, where there isn't a lot of gain in normalization to begin with.
The other thing is entity representation vs. speed. The problem is that joins are slow, and sometimes you get to corners where, if you want any kind of decent speed, you need to denormalize. Todd Hoff notes that as well:
The problem is joins are relatively slow, especially over very large data sets, and if they are slow your website is slow. It takes a long time to get all those separate bits of information off disk and put them all together again. Flickr decided to denormalize because it took 13 Selects to each Insert, Delete or Update.
The point is, however, that these "corner cases" get more and more prevalent even in smaller-scale applications - especially when you have complex entities (as is the case with defense systems, for example). Mats Helander recently wrote a post about saving to a Blob and only adding fields as needed for indexing and identity purposes. Mats also suggests the semi-transparent way of using XML columns, where the database can do something with the otherwise opaque data.
This point, in fact, demonstrates that the future of relational data is indeed not totally secure, as we do see that leading databases begin to treat XML data (which is hierarchical and not relational) as a native citizen - to the point that we can even index XML data.
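A minimal sketch of the save-to-blob idea mentioned above, using SQLite and JSON simply because both ship with Python. The table layout, the `kind` field and the sample entities are my own assumptions for the illustration and are not taken from Mats' post.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE entity (id TEXT PRIMARY KEY, kind TEXT NOT NULL, body TEXT NOT NULL)"
)
conn.execute("CREATE INDEX entity_kind ON entity (kind)")


def save(entity):
    """Store the whole entity as one opaque blob; copy out only identity and lookup fields."""
    conn.execute(
        "INSERT OR REPLACE INTO entity (id, kind, body) VALUES (?, ?, ?)",
        (entity["id"], entity["kind"], json.dumps(entity)),
    )


def load(entity_id):
    """A single primary-key read returns the complete entity, with no joins to reassemble it."""
    row = conn.execute("SELECT body FROM entity WHERE id = ?", (entity_id,)).fetchone()
    return json.loads(row[0]) if row else None


def find_ids_by_kind(kind):
    """Lookups still work, but only on the fields that were deliberately pulled out."""
    rows = conn.execute("SELECT id FROM entity WHERE kind = ? ORDER BY id", (kind,))
    return [row[0] for row in rows]


save({"id": "t-1", "kind": "track", "position": {"lat": 32.1, "lon": 34.8}, "sensors": ["radar", "eo"]})
save({"id": "t-2", "kind": "track", "position": {"lat": 31.9, "lon": 35.0}, "sensors": []})
save({"id": "s-1", "kind": "sensor", "model": "radar"})
```

The trade-off is visible right in the schema: a nested entity comes back in one read, but the database can no longer answer questions about anything inside `body` unless you promote that field to a column of its own.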

So far we've seen a trend to denormalize more and to handle non-relational data. What else? Ah, transactions.
I've worked on several systems where the data was constantly updated and actually gave the system's representation of the world outside (of the system). There the focus was on availability and latency, which again is aligned with the approach taken by the large internet sites, which emphasize eventual consistency over immediate consistency.
In distributed systems crashes happen. The RDBMS is a show-stopper when it comes to crashes - if we can't commit, we need to stop and roll back. Now maybe we can start over. Is this acceptable? There are many scenarios where it is not. I've seen it in defense systems, in communications systems and even in e-commerce systems (if you are not responsive, I'll just go to the competition).
What do you do in the presence of errors? Joe Armstrong suggests the following as the basis for Erlang in his thesis:
To make a fault-tolerant software system which behaves reasonably in the presence of software errors we proceed as follows:

1. We organize the software into a hierarchy of tasks that the system has to perform. Each task corresponds to the achievement of a number of goals. The software for a given task has to try and achieve the goals associated with the task. Tasks are ordered by complexity. The top level task is the most complex, when all the goals in the top level task can be achieved then the system should function perfectly. Lower level tasks should still allow the system to function in an acceptable manner, though it may offer a reduced level of service. The goals of a lower level task should be easier to achieve than the goals of a higher level task in the system.

2. We try to perform the top level task.

3. If an error is detected when trying to achieve a goal, we make an attempt to correct the error. If we cannot correct the error we immediately abort the current task and start performing a simpler task.
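The three steps above can be sketched roughly as follows. This is Python rather than Erlang, it skips the "attempt to correct the error" part of step 3, and the task names (a personalized page falling back to a cached page and then to a static one) are made up to keep the example small.

```python
def run_with_fallback(tasks):
    """Try tasks from the most complex down; on an error, abort and start a simpler one.

    `tasks` is a list of (name, callable) pairs ordered from the top-level task down.
    Returns the name of the task that succeeded, its result and the errors met on the way.
    """
    errors = []
    for name, task in tasks:
        try:
            return name, task(), errors
        except Exception as exc:  # a real system would catch narrower error types
            errors.append((name, str(exc)))
    raise RuntimeError(f"all tasks failed: {errors}")


def personalized_page():
    raise TimeoutError("recommendation service unavailable")


def cached_page():
    return "catalog page from local cache"


def static_page():
    return "static 'we will be right back' page"


level, result, errors = run_with_fallback(
    [("personalized", personalized_page), ("cached", cached_page), ("static", static_page)]
)
```

The system keeps answering, at a reduced level of service, instead of stopping until the top-level task can succeed again.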

On top of that, we try to keep any update local, i.e. within a task boundary on the hardware where the task occurred - distributing the transactions is not a good option. I outlined why when I talked about SOA and cross-service transactions, and the same reasoning holds here.

Well, truth be told, the RDBMS is not dead, and its demise is probably not even around the corner. Also, this does not mean that there aren't any uses for a database. But that's true for other architectural choices as well. Who ever said that a single-tier solution is not the right one for very specific types of systems...
RDBMSs succeeded in becoming the de facto standard for building systems because they offer some very compelling attributes - ACID brings a lot of peace of mind. Large-scale systems, low-latency systems and fault-tolerant systems opt for another set of compelling attributes (BASE). The point is that when you design your next solution, maybe the conventional database thinking is something that you should at least give another thought to, instead of just following dogma.


 
Tags: data | Design | Everything | scalability | Software Architecture