Scaling up Reconciliation in Eventual Consistency

Eventually consistent distributed stores have followed different strategies for reconciling data that has diverged. A discussion last week led me to think a little about what can be done without knowing the semantics of the diverged data, and what possibly cannot be done.

A first step is considering whether a given system detects divergence at all. Systems that use Last Writer Wins (LWW), such as Cassandra, do not detect divergence: they simplify the process by deciding up front how divergence is reconciled, and that decision is LWW. In these systems data is tagged with a timestamp that follows a total order consistent with causality (ideally close to real time), so that causally related data appears in the expected order, but data that evolved concurrently will also appear to be ordered.
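
As a minimal sketch of the idea (illustrative code, not Cassandra's actual implementation), an LWW register can be modeled as a timestamped value whose merge simply keeps the value with the larger timestamp:

```python
# Minimal LWW register sketch: each write carries a timestamp, and
# merge keeps the write with the larger one. Concurrent writes are
# silently ordered by timestamp, so one of them is lost.
from typing import Any, NamedTuple

class LWWRegister(NamedTuple):
    timestamp: float  # ideally close to real time, totally ordered
    value: Any

def merge(a: LWWRegister, b: LWWRegister) -> LWWRegister:
    # No divergence is detected: the "last" writer simply wins.
    return a if a.timestamp >= b.timestamp else b

r1 = LWWRegister(10.0, "meet today")
r2 = LWWRegister(12.5, "better meet tomorrow")  # concurrent, later clock
print(merge(r1, r2).value)  # -> "better meet tomorrow"
```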

Another option is to track the causal evolution of data and accurately detect when concurrent writes have led to divergent states (as is possible, e.g., in Riak). These states can then be presented to the application logic, or to the user, which can try to resolve the conflict and create a new state that causally overwrites the concurrent ones. This option is also followed in revision control systems, where the programmer is responsible for resolving conflicts.
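
A common way to detect divergence is with version vectors. Here is a sketch (illustrative code, not Riak's API): two versions are concurrent exactly when neither vector dominates the other.

```python
# Version vector sketch: map each writer (actor) to a counter.
# v1 dominates v2 when it is >= on every actor's entry; if neither
# dominates, the states are concurrent and both must be kept.

def dominates(v1: dict, v2: dict) -> bool:
    return all(v1.get(actor, 0) >= n for actor, n in v2.items())

def compare(v1: dict, v2: dict) -> str:
    d12, d21 = dominates(v1, v2), dominates(v2, v1)
    if d12 and d21:
        return "equal"
    if d12:
        return "descends"    # v1 causally overwrites v2
    if d21:
        return "precedes"    # v2 causally overwrites v1
    return "concurrent"      # divergence detected: present both

print(compare({"a": 2, "b": 1}, {"a": 1, "b": 1}))  # descends
print(compare({"a": 2}, {"b": 1}))                  # concurrent
```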

Marek Zawirski remembered me that LWW can be quite adequate if there is a single user updating the data. In this case, and if the system’s registered total order is not too far from real time, the user is actually making an external causal link across the data. In simpler terms, if I wrote “meet today” in a partition and I later wrote “better meet tomorrow” in another partition, I know that converging to “better meet tomorrow” is the correct option, although the two states would be presented as concurrent from a causal perspective. However, in general LWW leads to lost updates when multiple users do changes.

Now we come to the second option: track causally divergent states and present them for reconciliation. If the rate of updates on a given data item is not very high, we can still expect the data to converge. Once two or more concurrent states occur, we only need some user (say, a programmer merging two versions of code) to read them and submit a merged state that subsumes the concurrent ones. However, if the merge is non-deterministic, and other important properties are not met, we might run into a problem:

Each reconciliation decision is itself an update, and if multiple users try to reconcile in parallel, while other users make new updates, the system might never reduce the concurrent states to a single copy; the needed reconciliation becomes more and more confusing, so that in practical terms a converged state might not be reachable. This is a problem in what we can refer to as syntactic reconciliation, and it stems from relying only on causality, where we have no information on the internal structure of the data or on how it is reconciled. This approach only tracks how states relate in causal terms and which states safely subsume other preceding states.
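
To make the problem concrete, here is an illustrative continuation of the version vector sketch above (actor names are made up): each reconciliation is a new write whose vector joins the siblings' vectors and bumps the reconciler's own entry, so two parallel reconciliations are again concurrent with each other.

```python
# Parallel syntactic reconciliation sketch: merging is itself an
# update, so two users merging the same siblings in parallel just
# produce two new siblings.

def dominates(v1: dict, v2: dict) -> bool:
    return all(v1.get(a, 0) >= n for a, n in v2.items())

def join(v1: dict, v2: dict) -> dict:
    return {a: max(v1.get(a, 0), v2.get(a, 0)) for a in set(v1) | set(v2)}

def reconcile(sibling_vvs: list, actor: str) -> dict:
    vv = {}
    for v in sibling_vvs:
        vv = join(vv, v)
    vv[actor] = vv.get(actor, 0) + 1  # the merge is itself an update
    return vv

s1, s2 = {"a": 1}, {"b": 1}           # two concurrent states
m1 = reconcile([s1, s2], "user1")     # {"a": 1, "b": 1, "user1": 1}
m2 = reconcile([s1, s2], "user2")     # {"a": 1, "b": 1, "user2": 1}
print(not dominates(m1, m2) and not dominates(m2, m1))  # True: still diverged
```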

So although tracking divergence and reducing concurrency by syntactic reconciliation is in general better than LWW, there is probably a scalability limitation in this line of approach. When the data types permit it, a way around is to use information on the semantics of the data to automate reconciliation, making it behave more like an aggregation process, where intermediate reconciliations do not increase the complexity but instead help progress towards a convergent state. To that end, reconciliation needs to be deterministic and exhibit certain properties: it must be associative, commutative and idempotent, so that the states form a join semi-lattice under a given partial order, with merge computing the least upper bound (LUB).
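
As one concrete example (a sketch, not a library implementation), a grow-only counter has exactly these properties: its merge is the pointwise maximum, which is the LUB of vectors ordered pointwise, so replicas converge no matter how often or in what order merges happen.

```python
# G-Counter sketch: one entry per replica; increment bumps the local
# entry, merge takes the pointwise maximum. Merge is associative,
# commutative and idempotent, so every merge order converges.

def increment(counter: dict, replica: str) -> dict:
    return {**counter, replica: counter.get(replica, 0) + 1}

def merge(c1: dict, c2: dict) -> dict:
    # Pointwise max is the least upper bound under pointwise order.
    return {r: max(c1.get(r, 0), c2.get(r, 0)) for r in set(c1) | set(c2)}

def value(counter: dict) -> int:
    return sum(counter.values())

a = increment(increment({}, "a"), "a")   # replica a incremented twice
b = increment({}, "b")                   # replica b incremented once
assert merge(a, b) == merge(b, a)                 # commutative
assert merge(a, merge(a, b)) == merge(a, b)       # idempotent
print(value(merge(a, b)))  # -> 3, regardless of merge order
```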

Such amenable data types, captured in Conflict-free Replicated Data Type (CRDT) designs and principled eventual consistency, still allow several options when considering viable system designs. One option is to do merges at the server side and try to keep divergence to a minimum (see riak-dt). Another is to do them at the client side, something achievable with appropriate libraries (see toolbox, meangirls, knockbox and statebox) on top of any system that detects and presents concurrent states.
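
A hypothetical sketch of the client-side option (the store API here is invented for illustration, not the interface of any of those libraries): the client fetches whatever siblings the store detected, applies the data type's merge (here, set union for a grow-only set), and writes back a state that subsumes them.

```python
# Hypothetical client-side reconciliation on top of a store that
# detects and returns concurrent siblings (FakeStore and its methods
# are made-up stand-ins for such a store's interface).

class FakeStore:
    """In-memory stand-in for a store that surfaces siblings."""
    def __init__(self, siblings):
        self.siblings = siblings
    def fetch(self, key):
        return self.siblings, "opaque-causal-context"
    def put(self, key, value, context):
        self.siblings = [value]  # new write subsumes the siblings

def resolve(store, key):
    siblings, context = store.fetch(key)   # all concurrent values
    merged = set().union(*siblings)        # grow-only set merge: union
    store.put(key, merged, context)
    return merged

store = FakeStore([{"milk"}, {"eggs"}])    # two concurrent writes
print(resolve(store, "groceries"))         # -> {'milk', 'eggs'}
```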

And certainly there is an interesting middle ground to explore, where semantic merging is done both at the edge (close to the clients) and at the core (in the DCs).
