January 25, 2009
@ 11:42 PM
If you read this blog regularly you've probably heard/read about the 8 fallacies of distributed computing once or twice ... you know, the assumptions architects and designers tend to make when designing distributed systems which prove to be wrong down the road, causing pain and havoc in the project. (Indeed, my paper explaining them is the second most popular download on my site, with just about 50K downloads.)
Originally drafted in 1994 by Peter Deutsch (with one more added by James Gosling in 1997), these fallacies still hold true today. I still see designers make these same old mistakes in modern SOAs, RESTful designs and whatnot - but that's not the reason for this post.
What I want to talk about is the second fallacy "Latency is zero".

The more I think about it the more I think this fallacy should be updated to "Latency is zero or constant" (or add another fallacy for "latency is constant" on its own).

What's the difference?

Well, the "latency is zero" fallacy means treating remote "things" as if they are the same as local "things". We can't do that - we need to design the API of remote things to take into account the fact that information takes time to get there (e.g. chunky interfaces rather than chatty ones). You can see more on that in a post called "Why arbitrary tier-splitting is bad" I wrote about a year ago.

The "latency is constant" fallacy means thinking that if we send several batches of "stuff" to a remote "thing", they may arrive late but at least they'll arrive in order. Or, to move from "things" and "stuff" to more concrete terms: if you send messages over a network from one service to another, they won't necessarily arrive in the order you sent them.

But wait, isn't that only true for asynchronous messages? If we make synchronous calls we don't really care about this, now do we? That's only true if you and the service you are consuming are alone in the world. In all other cases (i.e. most of the time), even if you make all your calls synchronous, you can't know what other messages (from other senders) will arrive in between your messages - and how they will affect the service's state.

Unreliable latency can also mean we retry a message because we think it is lost, only to find out later that the receiver got it multiple times.

These are things you really have to take into account when you make multiple related calls - like, say, in a saga. One thing you can do to help is make messages idempotent (which also helps with the "network is reliable" fallacy). You can also accept even more latency and reorder the messages on the receiving side, which is what happens, for example, when streaming video or audio.
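To make the two techniques above a bit more concrete, here is a minimal sketch of a receiver that ignores duplicate deliveries and holds messages back until they can be processed in sender order. The Message shape (an Id for de-duplication and a Sequence number starting at 1) and the class name are assumptions made for illustration, not something taken from a real system; the sketch is single-threaded and keeps everything in memory.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical message shape: Id identifies retries of the same message,
// Sequence is the order in which the sender produced the messages (1, 2, 3, ...).
public record Message(Guid Id, long Sequence, string Body);

public class OrderedIdempotentReceiver
{
    private readonly HashSet<Guid> seen = new();
    private readonly SortedDictionary<long, Message> pending = new();
    private long nextSequence = 1;

    // Bodies in the order they were actually processed.
    public List<string> Processed { get; } = new();

    public void Receive(Message message)
    {
        // Idempotency: a retried message that already arrived is ignored.
        if (!seen.Add(message.Id))
            return;

        pending[message.Sequence] = message;

        // Ordering: trade latency for order by buffering until the next
        // expected sequence number shows up.
        while (pending.TryGetValue(nextSequence, out var next))
        {
            pending.Remove(nextSequence);
            Processed.Add(next.Body);
            nextSequence++;
        }
    }
}
```

A real receiver would also need to bound the buffer and decide what to do when an expected message never arrives; that is exactly the extra latency (and complexity) you pay for ordering.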

What you really need to think about is ACID 2. No, I am not talking about the ACID of database transactions but rather about another term I first saw in "Building on Quicksand" (paper (pdf)/ppt) by Pat Helland. In this paper Pat talks about some of the implications of unreliable conditions (such as inconsistent latency, failure etc.) on fault tolerance. ACID 2 (which apparently was coined by Shel Finkelstein) stands for Associative, Commutative, Idempotent and Distributed, i.e. messages can be processed at least once, anywhere (same machine or across several machines), in any order.
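As a small illustration of what these properties buy you, consider an operation that is naturally associative, commutative and idempotent: adding items to a set. The sketch below is my own toy example (the class and method names are hypothetical and not from Pat's paper); it shows that the final state is the same no matter how the deliveries are reordered or repeated.

```csharp
using System;
using System.Collections.Generic;

public static class Acid2Demo
{
    // Folds a stream of deliveries into a set. Because set union is
    // associative, commutative and idempotent, the result does not depend
    // on delivery order or on how many times each item is delivered.
    public static HashSet<string> Apply(IEnumerable<string> deliveries)
    {
        var state = new HashSet<string>();
        foreach (var item in deliveries)
            state.Add(item); // adding an existing item is a no-op
        return state;
    }
}
```

For example, Acid2Demo.Apply(new[] { "a", "b", "c" }) and Acid2Demo.Apply(new[] { "c", "a", "c", "b", "a" }) contain the same three items. Most business operations are not this forgiving out of the box, which is why designing messages this way takes deliberate effort.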

That's harsh, but I think that if you are building distributed systems today (SOA or otherwise) you can't ignore it.






 
Tags: REST | SOA | Software Architecture

January 16, 2009
@ 07:08 PM
In a post called "Rhino Service Bus: Saga and State" Ayende said
"In a messaging system, a saga orchestrate a set of messages. The main benefit of using a saga is that it allows us to manage the interaction in a stateful manner (easy to think and reason about) while actually working in a distributed and asynchronous environment."

I really don't agree with this definition of a saga. The saga provides a context for a set of messages, to allow managing an effort to reach distributed consensus. It does not "orchestrate" messages (that's what workflows are for) - you can read more on sagas in an excerpt from my SOA patterns book: Saga pattern.

Here's the comment I left on Ayende's site:
"What you describe is nice except it isn't a Saga it is more of a workflow. The notion of Saga which is originated from databases relates to the overall coordination of state between the different services - or the context for the whole business process.
In the coffee shop example you use that would be the whole "transaction" from the point the customer orders her coffee until she either gets it or the transaction is canceled (e.g. it took too long and the customer leaves or the coffee shop is out of milk etc.)
Unlike database (or distributed) transaction when/if a saga is aborted the different component of the system might not return to their previous state e.g. if the customer complains that the coffee is not good and gets her money back. the milk is not separated back from the coffee beans and returned to the bottle - rather the coffee cup goes to the trash.

Workflow is one strategy a service can take to handle the long running interaction within a saga. In your case the BristaSaga class (which I think should be BristaWF) orchestrate the internal state transitions depending on the different messages that arrive within the saga. In your case you have a hardcoded workflow - but it is also possible to use a workflow engine for the job.

By the way, in the above example you could also use a statemachine instead of a WF to manage the process "
In another comment Kristofer asked me:

Arnon: I'm not 100% sure of how you distinguish a Saga from a Workflow, could you elaborate some more on this?

A Saga involves a number of underlying workflows?
A Saga might as well contain a number of underlying Sagas?

Isn't it just a question of at what level it is initiated?

If a Saga should represent the whole transaction / business process, then who should handle it? Couldn't it be implemented as a Saga, exactly as Ayende describes it, by the initiating service (in this case the ordering)?, which then also is given the responsibility to handle restoring the total state etc of underlying/involved services if the transaction is aborted? The possibility to restore state does of course depend on what the specific Saga is handling, some processes might not be able to "rollback" completely, it's rather a question of rolling back all involved parties to a known/acceptable state."

The answer is that, again, a saga is similar to a transaction in the sense that it provides a shared context for an attempt to reach distributed consensus. Unlike a transaction, which ensures ACID properties, a saga does not.
The concept of disseminating that shared context, and having each party (service) affect whether the saga should be aborted or successful etc., is what I call a saga.
When a saga is aborted the only thing the coordinator can do is pass the status to the participants. Each of the services is responsible for making its best effort to handle the abort (by rolling back, compensating or whatever).
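A minimal sketch of that division of responsibility, using the coffee shop example, might look like the following. All the type names here (SagaCoordinator, ISagaParticipant, Barista, Cashier) are hypothetical and invented for this illustration; this is not the code of our event broker. The coordinator only passes the outcome along, and each participant decides for itself what "handling an abort" means.

```csharp
using System;
using System.Collections.Generic;

public enum SagaStatus { Committed, Aborted }

public interface ISagaParticipant
{
    void OnSagaEnded(Guid sagaId, SagaStatus status);
}

public class SagaCoordinator
{
    private readonly List<ISagaParticipant> participants = new();

    public void Join(ISagaParticipant participant) => participants.Add(participant);

    // The coordinator does not roll anything back itself;
    // it only tells every participant how the saga ended.
    public void End(Guid sagaId, SagaStatus status)
    {
        foreach (var participant in participants)
        {
            try { participant.OnSagaEnded(sagaId, status); }
            catch (Exception) { /* one participant failing must not stop the others */ }
        }
    }
}

public class Barista : ISagaParticipant
{
    public string LastAction { get; private set; } = "none";

    // The milk can't be separated from the coffee: compensation, not rollback.
    public void OnSagaEnded(Guid sagaId, SagaStatus status) =>
        LastAction = status == SagaStatus.Aborted ? "throw the cup away" : "serve the cup";
}

public class Cashier : ISagaParticipant
{
    public string LastAction { get; private set; } = "none";

    public void OnSagaEnded(Guid sagaId, SagaStatus status) =>
        LastAction = status == SagaStatus.Aborted ? "refund the customer" : "keep the payment";
}
```

Note that after an abort the participants end up in an acceptable state, not necessarily in the state they were in before the saga started.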

Workflow is another thing altogether. A workflow keeps a context between calls and means externalizing the decisions on the logic flow from the business logic (usually with a workflow engine). You can use workflows within a service (a pattern I call workflodize) or you can use them externally (a pattern I call orchestrated choreography, e.g. BPM).
You can use either form of workflow to support the implementation of a saga, but you can also implement sagas without workflows.
In our system we use an "event broker" (see www.rgoarchitects.com/.../EventingInWCF.aspx). The event broker infrastructure disseminates the saga context when you raise a saga event. A service that initiated a saga (by sending the first event) can choose to close the saga (commit) or abort it, etc. We don't currently have any workflow-driven services (but some of them use a state machine as an alternative).

(I think the term Saga does not describe Ayende's class, since the "barista" is just one of the participants in the saga; there are other participants.)



 
Tags: SOA | SOA Patterns | Software Architecture

A couple of days ago, one of our team members saw in the log that we were getting index out of bound exceptions in a piece of code I wrote (part of the EventBroker). Taking a look, he saw the following:
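    private void EventSender(object state)
    {
        var id = (Guid) state;
        try
        {
            // No lock is held here: Dispatch can take a long time to return.
            sagas[id].Dispatch();
        }
        catch (Exception e)
        {
            Logger.DebugFormat("EventSender: Exception Details={0}", e);

            // Only now take a short read lock, to check whether the saga was closed.
            var isClosed = true;
            rwl.TryRLock();
            try
            {
                if (sagas.ContainsKey(id))
                    isClosed = sagas[id].IsClosed;
            }
            finally
            {
                // Release the lock even if the lookup throws.
                rwl.ExitReadLock();
            }

            if (!isClosed)
                HandleSagaFault(id);
        }
    }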

It is well known that one of the guidelines for exception throwing is "Do not use exceptions for normal or expected errors, or for normal flow of control". So, naturally, he raised an eyebrow and asked me why I don't check that the id exists before I try to call the Dispatch method.
e.g. something like:
    rwl.TryRLock();
    try
    {
        // The read lock is held for the whole duration of Dispatch.
        if (sagas.ContainsKey(id))
            sagas[id].Dispatch();
    }
    finally
    {
        rwl.ExitReadLock();
    }

Well, as it happens the EventSender method runs on a thread pool thread (something you could have guessed from the object state parameter). Its role is to get the saga to disseminate the event, i.e. the Dispatch method on the Saga object handles actually sending events to the event's subscribers. And while it does its work in parallel, it also waits for all the events to arrive before returning (collecting any communication exceptions etc.). What this means is that the Dispatch method takes a while to return.

Ahh, there's the rub - Contrary to the guideline mentioned above, it is actually more worthwhile to pay the performance penalty of throwing an exception rather than pay the more severe penalty of holding a lock for a long time. The lock affects all the threads running whereas the exception only affects the isolated thread.

Indeed - on another MSDN page there is a better version of the guideline: "Do not use exceptions for normal flow of control, if possible. Except for system failures and operations with potential race conditions"

The lesson here (again) is that we need to think before blindly following guidelines. Guidelines are, well, er, guidelines, not commandments.

PS
If you release the lock before calling Dispatch you don't gain anything: since this is multi-threaded code, the id can still be removed between the check and the call itself.

PPS
In case you are wondering, TryRLock() is an extension method which tries to obtain a read lock with a given (short) timeout and throws (instead of deadlocking) if the timeout is reached.
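For the curious, an extension method along these lines could be sketched as follows. This is a reconstruction under assumptions rather than the actual EventBroker code: I am assuming rwl is a ReaderWriterLockSlim (which is what the ExitReadLock call suggests), and both the 500 ms default timeout and the choice of TimeoutException are placeholders.

```csharp
using System;
using System.Threading;

public static class ReaderWriterLockExtensions
{
    // Assumed value; the real timeout is only described as "short".
    private static readonly TimeSpan DefaultTimeout = TimeSpan.FromMilliseconds(500);

    // Tries to take the read lock and throws, rather than blocking forever,
    // when the lock cannot be obtained within the timeout.
    public static void TryRLock(this ReaderWriterLockSlim rwl)
    {
        if (!rwl.TryEnterReadLock(DefaultTimeout))
            throw new TimeoutException("Could not obtain the read lock within the timeout.");
    }
}
```

Because the method throws on failure, the caller should only call ExitReadLock after TryRLock has returned successfully.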


 
Tags: .NET | Design | OO

January 12, 2009
@ 08:42 PM
When describing the "known exceptions" to the Knot anti-pattern, I wrote the following:
Starting out on a large project, such as moving an enterprise to SOA, is difficult enough as it is. You can’t figure everything in advance; you need to deliver something – so as Nike says “just do it”. Get something done. You do need to be prepared to let go and redesign further down the road

In a comment to that post, Derrick Gibson wrote:
I have concerns about a "just do it" approach; it belies an assumption that at some point in the future the opportunity will be there to do things a "right way", whereas today time does not permit adherence to this mythical "right way".

One cannot put off til tomorrow that which should be done today. There is no guarantee of any future work to do "enhancements" or "architecture" and there is certainly no guarantee that even if there is a future project, you will be around to work on it. The next team will be starting from scratch and they will be left literally scratching their heads asking, "why did Team Alpha make this decision?"

So, if you make the first assumption that your team has to implement the best architecture it can with the time it has allotted, then will that not lead to other discussions along the way that prevent laying the seeds for this anti-pattern?

For instance, would not the use of a service bus and an approach that says each application makes calls to and receives responses from a service bus, free you from having services that call each other? Now, your services are no longer dependent upon other services or even other back-end data stores, so as new processes are defined and/or new systems are implemented (or others retired), your services remain agnostic to those changes.

This requires your service bus to have the logic which says, "this message needs to be routed here, while that message needs to be routed there." Wouldn't this approach resolve the knot anti-pattern before it ever originates?
The concrete answer to this comment is that a service bus is one of the candidate solutions to solve/circumvent the Knot anti-pattern (as I also mentioned when I described the anti-pattern). The question it begets, however, is how do you know at the outset that the service bus is the right architectural decision for the project?! This question has much wider implications.

In "Who needs an Architect?" (a worthwhile read in itself) Martin Fowler mentions that we can look at architecture as "things that people perceive as hard to change". The conclusion from that is that an architect can do her work better if she doesn't impose these "hard to change things", or does so as late as possible.
My experience is that when you start a "new grounds" project (such as moving an enterprise to SOA) there are a lot of moving parts. What I mean by that is that the uncertainty levels are very high, e.g. the requirements are not set, the understanding of the technology and/or domain is partial, the team is new and whatnot. Making a definitive architectural decision which is "hard to change", has a lot of effect on how you design your system and/or has substantial costs (in licensing, training, adoption etc.) is not necessarily the right decision. In fact, chances are your initial architectural decision will be flawed.

A phrase I once heard from Ivar Jacobson is "plan to throw one away, you will anyway" - this is something I try to take with me, deferring costly decisions if possible, especially considering that initial releases usually suffer from "time-to-market" constraints. To use a cliche - sometimes you need to go slow to go fast. By the way, this is one place where I don't agree with Uncle Bob who recently said "When is redesign the right strategy? ... Here's the answer. Never."

Like all guidance, this isn't always true. For instance, if this is your n-th similar project and you already know enough about it to say that architectural pattern X (say, a service bus) or technology Y (say, Hibernate) is a good fit then, yeah, go ahead and use that. You still want to consider the "cost to change" though, since you can still be wrong.



 
Tags: SOA | Software Architecture

January 5, 2009
@ 08:02 PM
We are going to use some of our test code in production. Yes, you read that right: test code in production. Here are the details.
In our system, among other things, we support visual search in video calls, i.e. an end user calls the system, points the camera at something she is interested in, and (hopefully :) ) gets relevant information. Basically the system is made of several resources (image extraction, identification etc.) that collaborate via an event broker. We have a blogjecting watchdog that makes sure everything is up and running, and we have an applicative recovery service to handle failures.
The watchdog makes sure resources/services are up, and resources report their liveness and wellness, so we know more about the resources than the fact that they are up. However, we still need a way to make sure that resource instances can collaborate to provide the service.

Enter our automated acceptance tests. Part of our development effort included building a test runner for automated test scenarios, e.g. load tests, verifying the correctness of algorithms etc. One of these tests is the smoke test (run after each successful build), which includes a sunny-day scenario of a video call, as described above. What we're going to do now is build, on top of the test runner and the sunny-day scenario, a "keep-alive" tester that will periodically make test calls to the system (depending on the current load etc.) and make sure that everything is still working correctly.
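The shape of such a keep-alive tester could be sketched roughly as below. Everything in it is hypothetical: the class name, the two injected delegates (one standing in for the sunny-day scenario from the test runner, one for a load probe) and the load threshold are assumptions for illustration, not our actual test runner API. Scheduling is left out; a timer would simply call RunOnce periodically.

```csharp
using System;

public enum KeepAliveResult { Skipped, Healthy, Failed }

public class KeepAliveTester
{
    private readonly Func<bool> sunnyDayScenario; // returns true when the test call succeeded
    private readonly Func<double> currentLoad;    // e.g. 0.0 (idle) to 1.0 (saturated)
    private readonly double maxLoadForTesting;

    public KeepAliveTester(Func<bool> sunnyDayScenario, Func<double> currentLoad, double maxLoadForTesting)
    {
        this.sunnyDayScenario = sunnyDayScenario;
        this.currentLoad = currentLoad;
        this.maxLoadForTesting = maxLoadForTesting;
    }

    public KeepAliveResult RunOnce()
    {
        // Don't add test calls on top of a system that is already busy.
        if (currentLoad() > maxLoadForTesting)
            return KeepAliveResult.Skipped;

        try
        {
            return sunnyDayScenario() ? KeepAliveResult.Healthy : KeepAliveResult.Failed;
        }
        catch (Exception)
        {
            // A scenario that blows up counts as a failed keep-alive check.
            return KeepAliveResult.Failed;
        }
    }
}
```

A Failed result is the interesting one: it means every resource may be individually up while the collaboration between them is broken, which is exactly what the watchdog alone can't tell us.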


So there you have it, an unexpected benefit of automated acceptance tests, who would have thunk it :)



 
Tags: .NET | SOA | Software Architecture | TDD | WCF