Author Archives: Carlos Baquero

Aggregation is not Replication

Both distributed aggregation and replication for high availability (yes, I am thinking of CRDTs) are techniques that can help tackle geo-replication, offline operation, and edge/fog computing. Distributed aggregation shares many properties with CRDT-style convergent replication, but they are not the same concept. I have witnessed the difficulty in separating the two concepts in many settings, and this prompted me to attempt a clarification.

The main difference is that in replication there is an abstraction of a single replicated state that can be updated in the multiple locations where a replica is present. This state is not owned by any given replica, but any replica can evolve it by applying operations that transform the shared state. This notion applies in both strong consistency and high availability settings, the difference being that in highly available replication the replicas are allowed to diverge and later reconcile. Another factor is that operations that lead to state changes are often the result of the activity of an external user that interacts with the system, e.g. adjusting the target room temperature up by 2 degrees. As such, different users can perform conflicting actions, either concurrently or in sequence (most of us had, in our childhood, on/off light-switching fights with other kids and adults).

Distributed data aggregation refers to several data aggregation techniques that are common in sensor network settings and datacenter infrastructure monitoring. In contrast to replication, each node/location has access to its own local data, e.g. CPU utilisation or a local measurement of humidity levels, and typically this data can evolve continuously. Also, the data to be aggregated is often not directly controlled by users; it usually results from an external physical process or from complex system evolution. Thus, each sensing node usually has exclusive access to a local input value that evolves in time. The aggregation process is then tasked with collecting and transforming this information, e.g. calculating the average or the maximum value, and making it available at a specified location (sink) or disseminating it back to the nodes (by broadcasting the aggregate result). In aggregation, the source of truth for each individual measurement is the actual node that provided it.
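As a toy sketch of this trait (my own illustration, not a protocol from the post; the function name and readings are made up), each node contributes a value it alone owns, and the aggregate is a reduction of those values computed at a sink:

```python
# Minimal sketch of sink-side aggregation: each node is the source of
# truth for its own local measurement; the sink only reduces them.

def aggregate_average(readings):
    """Reduce per-node readings (e.g. humidity) into their average."""
    total, count = 0.0, 0
    for node_id, value in readings.items():
        total += value
        count += 1
    return total / count

# Hypothetical local measurements, one per sensing node.
local_readings = {"n1": 41.0, "n2": 45.0, "n3": 43.0}
print(aggregate_average(local_readings))  # 43.0
```

The point is that no node ever updates another node's value; the summary is derived, not replicated.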

Sometimes the two concepts have in common the notion of data merging. In state-based CRDTs, operations are reflected in a semi-lattice state that can be combined with others via a join function. In data aggregation there is also often a notion of joining data together, but there is an additional aspect of data reduction and summarisation that is usually not present in CRDT designs. To add to the confusion, it is possible to combine the two concepts in a single system, as we did in the design of Scalable Eventually Consistent Counters, which combines a hierarchical CRDT design with a global aggregation and reporting facet.
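A minimal state-based G-Counter can illustrate the lattice join; this is a generic textbook design, not the Scalable Eventually Consistent Counters construction:

```python
# Sketch of a state-based G-Counter: the state maps replica id to a
# count, and join is a pointwise max (the semi-lattice least upper bound).

def join(a, b):
    """Join two G-Counter states; associative, commutative, idempotent."""
    return {r: max(a.get(r, 0), b.get(r, 0)) for r in set(a) | set(b)}

def value(state):
    """Read the counter by summing all per-replica contributions."""
    return sum(state.values())

a = {"A": 3, "B": 1}
b = {"A": 2, "B": 4}
merged = join(a, b)   # {"A": 3, "B": 4}
print(value(merged))  # 7
```

Note that the join keeps the per-replica detail intact, whereas an aggregation would reduce it to a summary and discard the individual contributions.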

However, ignoring corner cases, the difference can be quite clear, and recognising it can help in selecting the right tools. A final take-away example is the control of room temperature: the plus/minus control that sets the set-point temperature can be captured by a CRDT; the combining of different temperature sensors across the room to obtain the average temperature is distributed aggregation.

In the context of the H2020 LightKone project we are trying to advance the state of the art both in CRDT-based highly available replication solutions and in general-purpose distributed aggregation protocols. Below I leave some pointers to recent results in each of the categories.

CRDT based high availability:

Distributed data aggregation:

Post-Doc Position in Distributed Data Aggregation

We are opening a Post-Doc Position in HASLab supported by a one year grant.

The grant is for a PhD holder, to join our research line in Distributed Data Aggregation and Monitoring. The successful candidate is expected to conduct research on distributed aggregation, e.g. Flow Updating and Extrema Propagation, and to explore improvements that allow for faster and more reactive adaptation to changes in the network and in the monitored values. Prospective candidates should consult this announcement.

HASLab has a very active research team in Large-Scale Distributed Systems, working in both Systems and Theory. Our laboratory is located in the Braga Campus of Universidade do Minho, one of the top state-funded research universities in Portugal and in the top 3% of the Scimago IBE ranking. We are part of the Associate Laboratory INESC TEC. Braga is very close to the Gerês Natural Park and has good connections to Porto (and its vibrant cultural life) and to the International Airport (30-minute shuttle bus to Braga). It also has a very competitive cost of living in comparison to other European cities.

This call is open from June 16th to June 30th, 2014.

Any questions, contact Carlos Baquero at

Research grants in CRDTs and Distributed Aggregation

We are opening several new positions in HASLab, to be supported by (up to) two-year research grants.

– One of the grants is for a Post-Doc, to join the local team that is working on the evolution of CRDTs (Conflict-free Replicated Data Types). This is a very active research topic with growing impact in both industry and academia. We are looking for someone who can share our vision and work with us and our research associates. Possible research topics to be pursued by the successful candidate include the composition of CRDTs (with correctness proofs by construction), efficient update dissemination under different channel assumptions, and the unification of state-based and operation-based models. Prospective candidates should check this announcement and apply for research line RL3.

– Another grant is aimed at candidates holding an MSc degree, to join our research line in Distributed Data Aggregation. The successful candidate is expected to conduct research on distributed aggregation, e.g. Flow Updating and Extrema Propagation, and to explore improvements that allow for faster and more reactive adaptation to changes in the network and in the monitored values. These research activities would be compatible with part-time enrollment in our PhD program, which the successful candidate will be encouraged to pursue. Prospective candidates should consult this announcement and apply for research line RL3.1.

HASLab has a very active research team in Large-Scale Distributed Systems, working in both Systems and Theory. Our laboratory is located in the Braga Campus of Universidade do Minho, one of the top state-funded research universities in Portugal and in the top 3% of the Scimago IBE ranking. We are part of the Associate Laboratory INESC TEC. Braga is very close to the Gerês Natural Park and has good connections to Porto (and its vibrant cultural life) and to the International Airport (30-minute shuttle bus to Braga). It also has a very competitive cost of living in comparison to other European cities.

This call is open from June 12th to June 26th, 2013.

Any questions, contact Carlos Baquero at

Scaling up Reconciliation in Eventual Consistency

Eventually consistent distributed stores have followed different strategies when it comes to deciding how to reconcile data that has diverged. A discussion last week led me to think a little about what can be done without knowing the semantics of the diverged data, and what possibly cannot be done.

A first step is considering whether a given system detects divergence at all. Systems that use Last Writer Wins (LWW), such as Cassandra, do not detect divergence and simplify the process by deciding up front how divergence is reconciled: the last writer wins. In these systems data is tagged with a timestamp that follows a total order consistent with causality (ideally close to real time), so that related data is in the causally expected order, but concurrently evolved data will also appear to be ordered.
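A small sketch of the LWW rule (illustrative code, not Cassandra's implementation; it assumes timestamp ties are broken by an origin identifier, a common design):

```python
# Sketch of Last-Writer-Wins reconciliation: each write carries a
# (timestamp, origin) tag; merge keeps the write with the highest tag,
# so concurrent writes are silently ordered and one of them is lost.

def lww_merge(a, b):
    """a, b are (timestamp, origin, value) triples; highest tag wins."""
    return a if (a[0], a[1]) >= (b[0], b[1]) else b

w1 = (10, "node1", "meet today")
w2 = (12, "node2", "better meet tomorrow")
print(lww_merge(w1, w2)[2])  # better meet tomorrow -- w1 is discarded
```

Divergence is never surfaced: the merge is total and deterministic, which is exactly why concurrent updates from different users can be lost.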

Another option is to track the causal evolution of data and accurately detect when concurrent writes have led to divergent states (e.g. as is possible in Riak). These states can then be presented to the application logic, or the user, which can try to resolve the conflict and create a new state that causally overwrites the concurrent states. This option is also followed in revision control systems, where the programmer is responsible for resolving conflicts.

Marek Zawirski reminded me that LWW can be quite adequate if there is a single user updating the data. In this case, and if the system’s registered total order is not too far from real time, the user is actually making an external causal link across the data. In simpler terms, if I wrote “meet today” in one partition and later wrote “better meet tomorrow” in another partition, I know that converging to “better meet tomorrow” is the correct option, although the two states would be presented as concurrent from a causal perspective. However, in general LWW leads to lost updates when multiple users make changes.

Now we come to the second option: track causally divergent states and present them for reconciliation. If the rate of updates on a given data item is not very high, we can still expect the data to converge. Once two or more concurrent states occur, we only need some user (say, a programmer merging two versions of code) to read them and submit a merged state that subsumes the concurrent ones. However, if the merge is non-deterministic, and other important properties are not met, we might run into a problem:

Each reconciliation decision is like an update, and if we have multiple users trying to reconcile in parallel, and even other users making new updates, one might never reduce the concurrent states to a single copy; the needed reconciliation becomes more and more confusing, so that in practical terms a converged state might never be reached. This is a problem in what we can refer to as syntactic reconciliation, and it stems from relying only on causality, where we have no information on the internal structure of the data and how it is reconciled. This approach only tracks how states relate in causal terms and which states safely subsume other preceding states.

So although tracking divergence and reducing concurrency by syntactic reconciliation is in general better than LWW, there is probably a scalability limitation in this line of approach. When the data types permit it, a way around is to use information on the semantics of the data to automate reconciliation and make it behave more like an aggregation process, where intermediate reconciliations do not increase the complexity but instead help the progress towards a convergent state. To that end, reconciliation needs to be deterministic and exhibit certain properties: it must be associative, commutative and idempotent, i.e. form a join semi-lattice under a given partial order with a least upper bound (LUB).
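As a toy illustration of these properties (my own example, not code from any system mentioned here), set union is such a merge: reconciliations can be applied in any order, grouped any way, and repeated any number of times, and the replicas still converge to the same state.

```python
# Set union as a semantic merge: it is the join on the powerset lattice,
# hence associative, commutative and idempotent.

def merge(a, b):
    return a | b

s1, s2, s3 = {"x"}, {"y"}, {"x", "z"}

left = merge(merge(s1, s2), s3)   # one reconciliation order
right = merge(s1, merge(s3, s2))  # a different order and grouping
again = merge(left, left)         # repeating a merge changes nothing

print(left == right == again == {"x", "y", "z"})  # True
```

This is what lets intermediate reconciliations help rather than hurt: every merge moves the system up the lattice towards the convergent state.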

Such amenable data types, captured in Conflict-free Replicated Data Type (CRDT) designs and principled eventual consistency, still allow several options when considering viable system designs. One such option is to do merges at the server side and try to keep divergence at a minimum (see riak-dt). Another is to do it at the client side, something achievable with appropriate libraries (see toolbox, meangirls, knockbox and statebox) on top of any system that detects and presents concurrent states.

And certainly there is an interesting middle-ground to explore where semantic merging is done both at the edge (close to the clients) and at the core (in the DCs).

Conflict-free Replicated Data Types

A follow-up to the paper in this post was just accepted to SSS 2011. The focus of the new paper is on the following contributions:

  • A solution to the CAP problem, Strong Eventual Consistency (SEC).
  • Formal definitions of Strong Eventual Consistency (SEC) and of CRDTs.
  • Two sufficient conditions for SEC.
  • A strong equivalence between the two conditions.
  • We show that SEC is incomparable to sequential consistency.
  • Description of basic CRDTs, including integer vectors and counters.
  • More advanced CRDTs, including sets and graphs.

Regarding CAP, what we explore is how much consistency we can deliver without compromising Availability and Partition Tolerance. This is captured by the notion of Strong Eventual Consistency (SEC), which is formalized in the paper.

Any subset of replicas of a SEC object eventually converges, independently of the fate of the remaining replicas; they only need to deliver messages. Therefore a SEC object tolerates up to n − 1 simultaneous crashes. A SEC object thus fills an interesting space given by the CAP theorem: it is (eventually) consistent without sacrificing availability and partition tolerance. Remarkably, SEC does not require solving consensus.

We also introduce a SEC-compliant Directed-Graph CRDT that allows concurrent vertex and edge additions and removals, with a semantics compatible with concurrent web crawling.

The associated TR is available here.

Version Vectors are not Vector Clocks

Most people still confuse Version Vectors and Vector Clocks. Well, I did confuse them for several years. In fact they look the same and the update rules are almost the same.

Both of these mechanisms are practical ways of capturing causality in distributed systems. Causality (in distributed systems) is an abstract concept that was formalized in 1978 by Leslie Lamport in one of the most cited articles in computer science. In 1983, Version Vectors were developed to track the causal relations among data items that can be concurrently updated. Some years later, around 1988, Vector Clocks were developed to track the causality between events in a distributed computation. In both cases a vector of integers, one per source of concurrency, is used. There is, however, a fundamental difference.

First, in order to simplify a little bit, let's consider that we have a fixed number of processes and a fixed number of replicas.

  • Vector Clocks need to establish a partial order among a potentially ever-growing set of events occurring in the processes: the set of events that are generated in the distributed computation. Naturally, since the set can grow unbounded, the tracking mechanism also needs to grow unbounded. Vectors of integers are fine since, at least in theory, we don’t run out of integers. In 1991, Charron-Bost showed in this article that Vector Clocks are the smallest mechanism that can track causality among distributed events.
  • Version Vectors need to establish a partial order (more precisely, a pre-order) among the replicas in the distributed system. Notice that although the state in these replicas changes as a consequence of ever-growing sets of update events, here we want to relate the replica states and not the update events. Using vectors of integers is over-expressive in this setting. In 2004, we noticed this and constructed Bounded Version Vectors, where integers are substituted by a limited set of symbols, depending on the number of replicas. Naturally, Charron-Bost’s result does not apply to Version Vectors.

Let's consider a simple example that shows that vectors of integers are over-expressive. One has two recently synchronized replicas A and B with identical state and vectors A[2,3] and B[2,3]. Now, replica A suffers an update and its new vector is A[3,3]. We now see that A is more updated than B, since [3,3] > [2,3] (here each integer on the left is greater than or equal to its counterpart in the same position). Now replica A suffers 7 more updates, yielding A[10,3]. Still we have [10,3] > [2,3], and this increase in the integers does not convey more information to the task of tracking the causal order among the two replicas. The number of individual updates is typically not important, especially in systems where you can change either a lot or a little in a single update. What is important is how they change the causal order relation among the replicas.

In the example, at this point, we can compare A[10,3] and B[2,3] and notice that they are easy to synchronize, since A has the most updated state and it can simply be copied into B. However, suppose B issues an update, so that we now have a system with A[10,3] and B[2,4]. Now the replicas are divergent, and a synchronization will typically have to look at the state (often with user assistance) to figure out the correct synchronized state, which can be tagged with [10,4].
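The comparison used in this example can be sketched as a pointwise test (illustrative code, not any production mechanism):

```python
# Version vector dominance: v subsumes w iff every entry of v is >= the
# corresponding entry of w. If neither dominates, the replicas diverged.

def dominates(v, w):
    return all(vi >= wi for vi, wi in zip(v, w))

print(dominates([10, 3], [2, 3]))  # True: just copy A's state into B

A, B = [10, 3], [2, 4]
print(dominates(A, B) or dominates(B, A))  # False: concurrent states

# After a (possibly user-assisted) merge, the new state is tagged with
# the pointwise maximum, which subsumes both predecessors.
print([max(a, b) for a, b in zip(A, B)])  # [10, 4]
```

Note that replacing 10 by any value greater than 2 would give the same answers, which is exactly the over-expressiveness the example points out.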

The message is: use the right tools and know the differences. Vector Clocks are great for implementing causal delivery middleware, consistent snapshots, and the like. But for replicated data, Version Vectors are the right concept, and several mechanisms can make use of the subtle differences. Our own work on Dotted Version Vectors explores this and a specific system interaction that occurs in present data-stores. We also point to Concise Version Vectors, which explores the synchronization of multiple replicas, and to early work on dynamic version vector maintenance.



Convergent and Commutative Replicated Data Types

We have just published in the Bulletin of the European Association for Theoretical Computer Science (EATCS) a contribution to the management of eventual consistency. The current activity on cloud systems design, NoSQL approaches, and the implications of the CAP Theorem is bringing increasing attention to eventual consistency.

In this paper we establish the conditions for eventually consistent datatypes (with no rollback allowed) and present two equivalent implementation flavors: state-based, with a defined merge operation and constraints that ensure that mutating operations are compatible with merge; and operation-based, building on top of a causal delivery middleware and ensuring that concurrent operations commute.
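As a rough illustration of the operation-based condition (my own sketch, not the paper's formalism), concurrent increments on a counter commute, so any causal delivery order yields the same converged value:

```python
# Operation-based replica sketch: the state is rebuilt by folding
# delivered increment operations. Because addition is commutative, every
# delivery order of the same concurrent operations converges.

import itertools

def apply_ops(ops):
    value = 0
    for amount in ops:
        value += amount
    return value

ops = [1, 2, 3]  # three concurrent increments
results = {apply_ops(p) for p in itertools.permutations(ops)}
print(results)  # {6} -- a single converged value for all orders
```

Non-commuting operations (say, assignment) would make `results` contain several values, which is exactly what the commutation condition rules out.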

An extended version of this article is also available via INRIA and includes a larger portfolio of datatypes.