If you read this blog regularly you've probably heard or read about the
8 fallacies of distributed computing
once or twice. You know, the assumptions architects and designers
tend to make when designing distributed systems, which prove to be wrong
down the road and cause pain and havoc in the project. (Indeed,
my paper explaining them is the second most popular download on my site, with just about 50K downloads.)
The fallacies were originally
drafted in 1994 by Peter Deutsch (with one more added by James Gosling
in 1997), and they still hold true today. I still see designers
make these same old mistakes in modern SOAs, RESTful designs and
whatnot - but that's not the reason for this post.
What I want to talk about is the second fallacy "Latency is zero".
The
more I think about it the more I think this fallacy should be updated
to "Latency is zero or constant" (or add another fallacy for "latency
is constant" on its own).
What's the difference?
Well, the
"latency is zero" fallacy means treating remote "things" as if they were
the same as local "things". We can't do that - we need to design the API
of remote things to take into account the fact that information takes time
to get there (e.g. chatty interfaces vs. chunky interfaces). You
can see more on that in a post called
"Why arbitrary tier-splitting is bad", which I wrote about a year ago.
The
"latency is constant" fallacy means thinking that if we send several
batches of "stuff" to a remote "thing", they may arrive late but at
least they'll arrive in order. Or, to move from "things" and "stuff" to
more concrete terms: if you send messages over a network from one
service to another, they won't necessarily arrive in order.
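To make the reordering concrete, here is a minimal sketch of my own (the Message type, the arrival_order function and all the timings are made up for illustration). Each message gets its own latency, and the receiver sees messages sorted by send time plus latency:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    seq: int          # order in which the sender emitted the message
    sent_at: float    # send time, in milliseconds
    latency: float    # network delay for this particular message, in ms


def arrival_order(messages):
    """Return the sequence numbers in the order the receiver sees them."""
    by_arrival = sorted(messages, key=lambda m: m.sent_at + m.latency)
    return [m.seq for m in by_arrival]


# Same send times in both cases; only the per-message latency differs.
constant = [Message(1, 0, 50), Message(2, 10, 50), Message(3, 20, 50)]
variable = [Message(1, 0, 80), Message(2, 10, 20), Message(3, 20, 45)]

print(arrival_order(constant))  # [1, 2, 3]
print(arrival_order(variable))  # arrival times 80, 30, 65 -> [2, 3, 1]
```

With constant latency the send order survives the trip; with variable latency the first message sent is the last one to arrive.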
But wait, isn't that only true for asynchronous messages? If we make
synchronous calls we don't really care about this, now do we? That's
only true if you and the service you are consuming are alone in the
world. In all other cases (i.e. most of the time), even
if you make all your calls synchronous, you can't know what other
messages (from other senders) will arrive in between your messages -
and how they will affect the service's state.
Unreliable latency can also mean we'll retry a message because we think
it is lost, only to find out later that the receiver got it multiple times.
These are things you really have to take into account when you make multiple related calls - like, say, in a
saga. One thing you can do to help is make messages
idempotent
(which also helps with the "network is reliable" fallacy). You can also
increase latency even more and reorder the messages on the receiving side, something that
happens, for example, when streaming video or audio.
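Both remedies can be sketched in a few lines. This is only an illustration under my own assumptions: every message carries a unique id and a sequence number, and the names IdempotentAccount and Resequencer are hypothetical, not taken from any library.

```python
class IdempotentAccount:
    """Receiver that ignores redelivered messages by remembering their ids."""

    def __init__(self):
        self.balance = 0
        self._seen = set()

    def handle(self, message_id, amount):
        """Apply the message once; return False if it was a duplicate."""
        if message_id in self._seen:
            return False
        self._seen.add(message_id)
        self.balance += amount
        return True


class Resequencer:
    """Buffers out-of-order messages and releases them in sequence order."""

    def __init__(self, first_seq=1):
        self._next = first_seq
        self._pending = {}

    def accept(self, seq, payload):
        """Store one message; return the payloads that are now deliverable."""
        if seq < self._next or seq in self._pending:
            return []  # already delivered or already buffered
        self._pending[seq] = payload
        ready = []
        while self._next in self._pending:
            ready.append(self._pending.pop(self._next))
            self._next += 1
        return ready


account = IdempotentAccount()
account.handle("m1", 100)
account.handle("m1", 100)        # retry of the same message, ignored
print(account.balance)           # 100

buffer = Resequencer()
print(buffer.accept(2, "b"))     # [] - still waiting for message 1
print(buffer.accept(1, "a"))     # ['a', 'b']
```

Note the price of the resequencer: message 2 is held back until message 1 shows up, which is exactly the extra latency mentioned above. A real implementation would also need a timeout for messages that never arrive and a bound on the set of remembered ids.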
What you really need to think about is ACID 2. No, I am not talking
about the ACID of database transactions but rather about another term I first
saw in "Building on Quicksand" (
paper (pdf)/
ppt)
by Pat Helland. In this paper Pat talks about some of the implications
of unreliable conditions (such as variable latency, failures etc.) on
fault tolerance. ACID 2 (which apparently was coined by Shel
Finkelstein) stands for Associative, Commutative, Idempotent and
Distributed, i.e. messages can be processed at least once, anywhere
(same machine or across several machines), in any order.
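Here is a small sketch of what such an operation can look like. It is my own illustration, not something taken from Pat's paper: the state is simply the set of (message id, amount) pairs seen so far, and applying a message is a set union, which is associative, commutative and idempotent. The names apply, merge, replay and total are hypothetical, and the sketch assumes a given message id always carries the same amount.

```python
from itertools import permutations


def apply(state, message):
    """Add one (message_id, amount) pair to the state; duplicates are no-ops."""
    return state | frozenset([message])


def merge(a, b):
    """Combine the states of two replicas that saw different messages."""
    return a | b


def replay(messages, state=frozenset()):
    """Apply a sequence of messages one after the other."""
    for message in messages:
        state = apply(state, message)
    return state


def total(state):
    return sum(amount for _, amount in state)


messages = [("m1", 100), ("m2", -30), ("m3", 5)]
expected = replay(messages)

# Any order, with the first message of each order delivered twice.
any_order = all(
    replay(order + order[:1]) == expected for order in permutations(messages)
)

# Two machines each see part of the traffic (with overlap), then merge.
replica_a = replay(messages[:2])
replica_b = replay(messages[1:])
distributed = merge(replica_a, replica_b) == merge(replica_b, replica_a) == expected

print(any_order, distributed, total(expected))  # True True 75
```

Plain addition on a running balance would be associative and commutative but not idempotent, which is why the sketch keeps the message ids around instead of just a number.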
That's harsh, but I think that if you are building distributed systems today (SOA or otherwise) you can't ignore it.