Category Archives: Distributed Systems

Post-Doc Position in Distributed Data Aggregation

We are opening a Post-Doc position at HASLab, supported by a one-year grant.

The grant is for a PhD holder, to join our research line in Distributed Data Aggregation and Monitoring. The successful candidate is expected to do research on distributed aggregation (e.g. Flow Updating and Extrema Propagation) and explore improvements that allow faster and more reactive adaptation to changes in the network and in the monitored values. Prospective candidates should consult this announcement.

HASLab has a very active research team in Large Scale Distributed Systems, working in both Systems and Theory. Our laboratory is located in the Braga Campus of Universidade do Minho, one of the top state-funded research universities in Portugal and in the top 3% of the Scimago IBE ranking. We are part of the Associate Laboratory INESC TEC. Braga is very close to the Gerês Natural Park and has good connections to Porto (and its vibrant cultural life) and to the international airport (a 30-minute shuttle bus ride to Braga). It also has a very competitive cost of living in comparison to other European cities.

This call is open from June 16th to June 30th, 2014.

For any questions, contact Carlos Baquero at


Reasoning About Eventual Consistency with Replicated Data Types (Alexey Gotsman – IMDEA Software Institute)

HASLab Seminar

Abstract: Modern databases underlying large-scale Internet services
guarantee immediate availability and tolerate network partitions at the
expense of providing only weak forms of consistency, commonly dubbed
eventual consistency. Even though the folklore notion of eventual
consistency is very weak, in reality, the semantics and programming
interfaces of systems implementing it can be very subtle. In particular,
such systems can resolve conflicting updates using complex protocols,
called replicated data types (aka CRDTs), and can allow varying kinds of
anomalous behaviour.

So far the semantics provided by eventually consistent systems have been
poorly understood. I will present a novel framework for their formal
specification and discuss how recent nontrivial replicated data type
implementations fit into it. I will then illustrate the utility of the
framework by presenting techniques for proving the correctness of
replicated data type implementations.

This is joint work with Sebastian Burckhardt (Microsoft Research),
Hongseok Yang (University of Oxford) and Marek Zawirski (INRIA &
UPMC-LIP6). The associated paper appeared in POPL’14:

Research grants in CRDTs and Distributed Aggregation

We are opening several new positions at HASLab, to be supported by (up to) two-year research grants.

– One of the grants is for a Post-Doc, to join the local team working on the evolution of CRDTs (Conflict-free Replicated Data Types). This is a very active research topic with growing impact in both industry and academia. We are looking for someone who shares our vision and can work with us and our research associates. Possible research topics to be pursued by the successful candidate include the composition of CRDTs (with correctness proofs by construction), efficient update dissemination under different channel assumptions, and the unification of state-based and operation-based models. Prospective candidates should check this announcement and apply for research line RL3.

– Another grant is aimed at candidates holding an MSc degree, to join our research line in Distributed Data Aggregation. The successful candidate is expected to do research on distributed aggregation (e.g. Flow Updating and Extrema Propagation) and explore improvements that allow faster and more reactive adaptation to changes in the network and in the monitored values. These research activities would be compatible with part-time enrollment in our PhD program, which the successful candidate will be encouraged to pursue. Prospective candidates should consult this announcement and apply for research line RL3.1.

HASLab has a very active research team in Large Scale Distributed Systems, working in both Systems and Theory. Our laboratory is located in the Braga Campus of Universidade do Minho, one of the top state-funded research universities in Portugal and in the top 3% of the Scimago IBE ranking. We are part of the Associate Laboratory INESC TEC. Braga is very close to the Gerês Natural Park and has good connections to Porto (and its vibrant cultural life) and to the international airport (a 30-minute shuttle bus ride to Braga). It also has a very competitive cost of living in comparison to other European cities.

This call is open from June 12th to June 26th, 2013.

For any questions, contact Carlos Baquero at

Scaling up Reconciliation in Eventual Consistency

Eventually consistent distributed stores have followed different strategies when it comes to deciding how to reconcile data that has diverged. A discussion last week led me to think a little about what can be done without knowing the semantics of the diverged data, and what possibly cannot be done.

A first step is to consider whether a given system detects divergence at all. Systems that use Last Writer Wins (LWW), such as Cassandra, do not detect divergence; they simplify the process by deciding up front how divergence is reconciled, and that decision is LWW. In these systems data is tagged with a timestamp that follows a total order consistent with causality (ideally close to real time), so that causally related data appears in the expected order, but concurrently evolved data will also appear to be ordered.
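
As a rough sketch (in Python, with a hypothetical `LWWRegister` class of my own, not any particular store's API), LWW reconciliation looks like this:

```python
import time

class LWWRegister:
    """Last-Writer-Wins register: each write is tagged with a timestamp,
    and reconciliation simply keeps the value with the highest tag.
    Concurrent writes are silently ordered, so one of them is lost."""

    def __init__(self, value=None, timestamp=0.0):
        self.value = value
        self.timestamp = timestamp

    def write(self, value, timestamp=None):
        ts = time.time() if timestamp is None else timestamp
        if ts > self.timestamp:
            self.value, self.timestamp = value, ts

    def merge(self, other):
        # Reconciliation is decided up front: the highest timestamp wins,
        # whether or not the writes were causally related.
        if other.timestamp > self.timestamp:
            self.value, self.timestamp = other.value, other.timestamp

# Two replicas diverge during a partition:
a, b = LWWRegister(), LWWRegister()
a.write("meet today", timestamp=1)
b.write("better meet tomorrow", timestamp=2)
a.merge(b)
assert a.value == "better meet tomorrow"  # the later timestamp wins
```

Note that if the two writes had come from different users, the write tagged with timestamp 1 would simply be lost.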

Another option is to track the causal evolution of data and accurately detect when concurrent writes have led to divergent states (e.g. as is possible in Riak). These states can then be presented to the application logic, or to the user, which can try to resolve the conflict and create a new state that causally overwrites the concurrent states. This option is also followed in revision control systems, where the programmer is responsible for resolving conflicts.

Marek Zawirski reminded me that LWW can be quite adequate if there is a single user updating the data. In this case, and if the system’s registered total order is not too far from real time, the user is actually making an external causal link across the data. In simpler terms, if I wrote “meet today” in one partition and later wrote “better meet tomorrow” in another partition, I know that converging to “better meet tomorrow” is the correct option, although the two states would be presented as concurrent from a causal perspective. However, in general, LWW leads to lost updates when multiple users make changes.

Now we come to the second option: track causally divergent states and present them for reconciliation. If the rate of updates on a given data item is not very high, we can still expect the data to converge. Once two or more concurrent states occur, we only need some user (say, a programmer merging two versions of code) to read them and submit a merged state that subsumes the concurrent ones. However, if the merge is nondeterministic, and other important properties are not met, we might run into a problem:

Each reconciliation decision is itself an update; if multiple users try to reconcile in parallel, while other users make new updates, the system might never reduce the concurrent states to a single copy, and the needed reconciliation becomes more and more confusing, so that in practical terms a converged state might never be reached. This is a problem in what we can refer to as syntactic reconciliation, and it stems from relying only on causality: we have no information about the internal structure of the data or how it is reconciled. This approach only tracks how states relate in causal terms and which states safely subsume other, preceding states.

So although tracking divergence and reducing concurrency by syntactic reconciliation is in general better than LWW, there is probably a scalability limitation in this line of approach. When the data types permit it, a way around this is to use information about the semantics of the data to automate reconciliation and make it behave more like an aggregation process, where intermediate reconciliations do not increase complexity but instead help progress towards a convergent state. To that end, reconciliation needs to be deterministic and exhibit certain properties: it must be associative, commutative and idempotent; i.e., states must form a join semi-lattice under a given partial order, with reconciliation as the least upper bound (LUB).
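
A minimal illustration of these properties, using a grow-only set (one of the simplest join semi-lattices) as an example of my own choosing:

```python
# A grow-only set forms a join semi-lattice under subset ordering:
# merge is set union, which is associative, commutative and idempotent,
# so reconciliations can happen in any order, in parallel, and
# repeatedly, and the replicas still converge to the same state.

def merge(a, b):
    return a | b  # least upper bound (LUB) under subset ordering

r1 = {"x", "y"}
r2 = {"y", "z"}
r3 = {"w"}

# Grouping and order of merges do not matter (associativity, commutativity).
assert merge(merge(r1, r2), r3) == merge(r1, merge(r2, r3))
assert merge(r1, r2) == merge(r2, r1)
# Re-applying a merge changes nothing (idempotence).
assert merge(r1, r1) == r1
```

With these properties, an intermediate reconciliation never adds new conflicts; it only moves the system closer to the converged state.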

Such amenable data types, captured in Conflict-free Replicated Data Type (CRDT) designs and principled eventual consistency, still allow several options when considering viable system designs. One such option is to do merges at the server side and try to keep divergence at a minimum (see riak-dt). Another is to do it at the client side, something achievable with appropriate libraries (see toolbox, meangirls, knockbox and statebox) on top of any system that detects and presents concurrent states.

And certainly there is an interesting middle-ground to explore where semantic merging is done both at the edge (close to the clients) and at the core (in the DCs).

Model-checking graph properties using Alloy

While developing an algorithm for fast distributed computation of a graph diameter and radius, some of my colleagues had a conjecture about graph topology that, if true, would enable an aggressive optimization of the algorithm.

In order to understand the conjecture we must first recall some (more or less) standard graph theory definitions:

  • The distance between two nodes is the length of the shortest path between them;
  • The eccentricity of a node is the greatest distance between that node and all others;
  • The diameter of a graph is the maximum eccentricity found in the graph;
  • A node is peripheral if its eccentricity is equal to the diameter;
  • The antipode of a node is the set of nodes at distance equal to its eccentricity.
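
These definitions translate directly into code; a small sketch in Python (using BFS over an adjacency-dict graph representation of my own choosing) makes them concrete:

```python
from collections import deque

def distances(graph, source):
    """BFS distances from source in an undirected, connected graph
    given as an adjacency dict {node: set(neighbours)}."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def eccentricity(graph, node):
    return max(distances(graph, node).values())

def diameter(graph):
    return max(eccentricity(graph, n) for n in graph)

def antipode(graph, node):
    d = distances(graph, node)
    ecc = max(d.values())
    return {v for v, dv in d.items() if dv == ecc}

def peripheral(graph):
    diam = diameter(graph)
    return {n for n in graph if eccentricity(graph, n) == diam}

# A path graph a-b-c: diameter 2, peripheral nodes {a, c}.
g = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
assert diameter(g) == 2
assert antipode(g, "b") == {"a", "c"}
assert peripheral(g) == {"a", "c"}
```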

Given a non-directed connected graph the initial conjecture was:

For every node, all nodes in its antipode are peripheral.

They quickly found a counter-example to this conjecture and relaxed it to the following:

For every node, some node in its antipode is peripheral.

For a long time they were not able to prove or refute this conjecture. At some point I was involved in the discussion and decided to model-check it using Alloy before attempting any proof.


Conflict-free Replicated Data Types

A follow-up to the paper in this post was just accepted to SSS 2011. The focus of the new paper is on the following contributions:

  • A solution to the CAP problem, Strong Eventual Consistency (SEC).
  • Formal definitions of Strong Eventual Consistency (SEC) and of CRDTs.
  • Two sufficient conditions for SEC.
  • A strong equivalence between the two conditions.
  • A proof that SEC is incomparable to sequential consistency.
  • Description of basic CRDTs, including integer vectors and counters.
  • More advanced CRDTs, including sets and graphs.

Regarding CAP, what we explore is how much consistency we can deliver without compromising Availability and Partition Tolerance. This is captured by the notion of Strong Eventual Consistency (SEC), which is formalized in the paper.

Any subset of replicas of a SEC object eventually converges, independently of the fate of the remaining replicas; they only need to deliver messages. Therefore a SEC object tolerates up to n − 1 simultaneous crashes. A SEC object thus fills an interesting space given by the CAP theorem: it is (eventually) consistent without sacrificing availability and partition tolerance. Remarkably, SEC does not require solving consensus.
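
A minimal sketch, in Python, using a grow-only counter (one of the basic counter CRDTs mentioned above; class and method names are my own):

```python
class GCounter:
    """Grow-only counter CRDT: one integer slot per replica; merge takes
    the pointwise maximum, which is a join in the lattice of vectors.
    Any subset of replicas that exchange states converges, with no
    coordination or consensus needed."""

    def __init__(self, n, i):
        self.n, self.i = n, i          # total replicas, my replica index
        self.slots = [0] * n

    def inc(self):
        self.slots[self.i] += 1        # a replica only bumps its own slot

    def value(self):
        return sum(self.slots)

    def merge(self, other):
        self.slots = [max(x, y) for x, y in zip(self.slots, other.slots)]

a, b = GCounter(2, 0), GCounter(2, 1)
a.inc(); a.inc(); b.inc()
a.merge(b); b.merge(a)                 # exchange states after a partition
assert a.value() == b.value() == 3     # both replicas converge
```

Because merge is associative, commutative and idempotent, the two replicas here converge no matter how often, or in which order, states are exchanged.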

We also introduce a SEC-compliant Directed-Graph CRDT that allows concurrent vertex and edge additions and removals, with semantics compatible with concurrent web crawling.

The associated TR is available here.

Version Vectors are not Vector Clocks

Most people still confuse Version Vectors and Vector Clocks. Well, I did confuse them for several years. In fact they look the same and the update rules are almost the same.

Both of these mechanisms are practical ways of capturing causality in distributed systems. Causality (in distributed systems) is an abstract concept that was formalized in 1978 by Leslie Lamport in one of the most cited articles in computer science. In 1983, Version Vectors were developed to track the causal relations among data items that can be concurrently updated. Some years later, around 1988, Vector Clocks were developed to track the causality between events in a distributed computation. In both cases a vector of integers, one entry per source of concurrency, is used. There is, however, a fundamental difference.

First, in order to simplify a little, let’s consider that we have a fixed number of processes and a fixed number of replicas.

  • Vector Clocks need to establish a partial order among a potentially ever-growing set of events occurring in the processes: the events generated by the distributed computation. Naturally, since the set can grow unbounded, the tracking mechanism also needs to grow unbounded. Vectors of integers are fine since, at least in theory, we don’t run out of integers. In 1991, Charron-Bost showed in this article that Vector Clocks are the smallest mechanism that can track causality among distributed events.
  • Version Vectors need to establish a partial order (more precisely, a pre-order) among the replicas in the distributed system. Notice that although the state in these replicas changes as a consequence of ever-growing sets of update events, here we want to relate the replica states, not the update events. Using vectors of integers is over-expressive in this setting. In 2004, we noticed this and constructed Bounded Version Vectors, where integers are substituted by a limited set of symbols, depending on the number of replicas. Naturally, Charron-Bost’s result does not apply to Version Vectors.

Let’s consider a simple example that shows vectors of integers are over-expressive. We have two recently synchronized replicas A and B with identical state and vectors A[2,3] and B[2,3]. Now replica A suffers an update and its new vector is A[3,3]. We see that A is more updated than B, since [3,3] > [2,3] (here each integer on the left is greater than or equal to its counterpart in the same position). Now replica A suffers 7 more updates, giving A[10,3]. Still we have [10,3] > [2,3], and this increase in the integer does not convey more information for the task of tracking the causal order among the two replicas. The number of individual updates is typically not important, especially in systems where a single update can change either a lot or a little. What is important is how updates change the causal order relation among the replicas.

In the example, at this point, we can compare A[10,3] and B[2,3] and notice that they are easy to synchronize, since A has the most updated state and it can simply be copied into B. However, if B issues an update, we now have a system with A[10,3] and B[2,4]. Now the replicas are divergent, and a synchronization will typically have to look at the state (often with user assistance) to figure out the correct synchronized state, which can then be tagged with [10,4].
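
The comparison used in this example can be sketched as a small Python function (names are my own):

```python
def compare(va, vb):
    """Compare two version vectors of equal length: 'descends' if va
    dominates vb pointwise, 'precedes' if vb dominates va, otherwise
    the replicas hold concurrent (divergent) states."""
    geq = all(a >= b for a, b in zip(va, vb))
    leq = all(a <= b for a, b in zip(va, vb))
    if geq and leq:
        return "equal"
    if geq:
        return "descends"
    if leq:
        return "precedes"
    return "concurrent"

# The scenario from the text:
assert compare([10, 3], [2, 3]) == "descends"    # A can simply be copied to B
assert compare([10, 3], [2, 4]) == "concurrent"  # divergent; reconciliation needed
```

Note that only the outcome of the comparison matters here; the magnitude of the integers (3 vs 10) adds nothing, which is exactly why bounded symbol sets can replace them.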

The message is: use the right tools and know the differences. Vector Clocks are great for implementing causal delivery middleware, consistent snapshots, and the like. But for replicated data, Version Vectors are the right concept, and several mechanisms can make use of the subtle differences. Our own work on Dotted Version Vectors explores this, together with a specific system interaction that occurs in present data stores. We also point to Concise Version Vectors, which explores the synchronization of multiple replicas, and to early work on dynamic version vector maintenance.