Category Archives: Distributed Systems

Aggregation is not Replication

Both distributed aggregation and replication for high availability (yes, I am thinking of CRDTs) are techniques that can help tackle geo-replication, offline operation and edge/fog computing. Distributed aggregation shares many properties with CRDT-style convergent replication, but they are not the same concept. I have seen the two concepts conflated in many settings, and this prompted me to attempt a clarification.

The main difference is that in replication there is an abstraction of a single replicated state that can be updated in the multiple locations where a replica is present. This state is not owned by any given replica, but any replica can evolve it by applying operations that transform the shared state. This notion applies both in strong consistency and in high availability settings, the difference being that in highly available replication the replicas are allowed to diverge and later reconcile. Another factor is that the operations that lead to state changes are often the result of the activity of an external user interacting with the system, e.g. adjusting the target room temperature up by 2 degrees. As such, different users can perform conflicting actions, either concurrently or in sequence (most of us had, in our childhood, on/off light-switching fights with other kids and adults).

Distributed data aggregation refers to several data aggregation techniques that are common in sensor networks and datacenter infrastructure monitoring. In contrast to replication, each node/location has access to its own local data, e.g. CPU utilisation or a local measurement of humidity levels, and typically this data can evolve continuously. Also, the data to be aggregated is often not directly controlled by users; it usually results from an external physical process or from complex system evolution. Thus, each sensing node usually has exclusive access to a local input value that evolves in time. The aggregation process is then tasked with collecting and transforming this information, e.g. calculating the average or the maximum value, and making it available at a specified location (sink) or disseminating it back to the nodes (by broadcasting the aggregate result). In aggregation, the source of truth for each individual measurement is the node that provided it.

Sometimes the two concepts have in common the notion of data merging. In state-based CRDTs, operations are reflected in a semi-lattice state that can be combined with others via a join function. In data aggregation there is also often a notion of joining data together, but there is an additional aspect of data reduction and summarisation that is usually not present in CRDT designs. To add to the confusion, it is possible to combine the two concepts in a single system, as we did in the design of Scalable Eventually Consistent Counters, which combines a hierarchical CRDT design with a global aggregation and reporting facet.

However, ignoring corner cases, the difference can be quite clear, and recognising it can help in selecting the right tools. A final take-away example is the control of room temperature: the plus/minus control that sets the set point temperature can be captured by a CRDT; combining the readings of the different temperature sensors across the room to obtain the average temperature is distributed aggregation.
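To make the take-away concrete, below is a minimal Python sketch; the counter design and the numbers are illustrative, not taken from any particular system. The shared set point is handled as a small state-based PN-counter-like CRDT (per-replica increment/decrement totals, merged by pointwise maximum), while the room average is just a summary computed over readings that each sensor owns.

```python
# Illustrative sketch: replication of a shared set point (CRDT) vs.
# aggregation of sensor readings that each sensor owns.

class PNCounter:
    """State-based PN-counter: per-replica increment and decrement totals."""

    def __init__(self, replica_id):
        self.id = replica_id
        self.p = {}  # increments performed at each replica
        self.n = {}  # decrements performed at each replica

    def adjust(self, delta):
        # A user action, e.g. "+2 degrees" on the thermostat panel.
        if delta >= 0:
            self.p[self.id] = self.p.get(self.id, 0) + delta
        else:
            self.n[self.id] = self.n.get(self.id, 0) - delta

    def merge(self, other):
        # Join of the semi-lattice: pointwise maximum, hence associative,
        # commutative and idempotent.
        for r, v in other.p.items():
            self.p[r] = max(self.p.get(r, 0), v)
        for r, v in other.n.items():
            self.n[r] = max(self.n.get(r, 0), v)

    def value(self):
        # Net adjustment applied to the shared set point.
        return sum(self.p.values()) - sum(self.n.values())


# Replication: two users adjust the *same* set point at different replicas.
a, b = PNCounter("a"), PNCounter("b")
a.adjust(+2)              # user at replica a
b.adjust(-1)              # concurrent user at replica b
a.merge(b); b.merge(a)
assert a.value() == b.value() == 1   # replicas converge on the shared state

# Aggregation: each sensor owns its reading; the sink only summarises them.
readings = {"sensor1": 20.5, "sensor2": 21.0, "sensor3": 19.5}
average = sum(readings.values()) / len(readings)   # 20.33..., a summary
```

The difference shows in ownership: every replica may change the shared counter, but only each sensor can ever change its own reading.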

In the context of the H2020 LightKone project we are trying to advance the state of the art both in CRDT-based highly available replication solutions and in general-purpose distributed aggregation protocols. Below I leave some pointers to recent results in each of the categories.

CRDT based high availability:

Distributed data aggregation:


Post-Doc Position in Distributed Data Aggregation

We are opening a Post-Doc Position in HASLab supported by a one year grant.

The grant is for a PhD holder, to join our research line in Distributed Data Aggregation and Monitoring. The successful candidate is expected to do research on distributed aggregation, e.g. Flow Updating and Extrema Propagation, and to explore improvements that allow faster and more reactive adaptation to changes in the network and in the monitored values. Prospective candidates should consult this announcement.

HASLab has a very active research team in Large Scale Distributed Systems, working in both Systems and Theory. Our Laboratory is located in the Braga Campus of Universidade do Minho, one of the top state-funded research universities in Portugal and in the top 3% of the Scimago IBE ranking. We are part of the Associate Laboratory INESC TEC. Braga is very close to the Gerês Natural Park, and has good connections to Porto (and its vibrant cultural life) and the International Airport (30 min shuttle bus to Braga). It also has a very competitive cost of living in comparison to other European cities.

This call is open from June, 16th to June, 30th, 2014.

Any questions, contact Carlos Baquero at cbm@di.uminho.pt

Reasoning About Eventual Consistency with Replicated Data Types (Alexey Gotsman – IMDEA Software Institute)

HASLab Seminar

Abstract: Modern databases underlying large-scale Internet services guarantee immediate availability and tolerate network partitions at the expense of providing only weak forms of consistency, commonly dubbed eventual consistency. Even though the folklore notion of eventual consistency is very weak, in reality, the semantics and programming interfaces of systems implementing it can be very subtle. In particular, such systems can resolve conflicting updates using complex protocols, called replicated data types (aka CRDTs), and can allow varying kinds of anomalous behaviour.

So far the semantics provided by eventually consistent systems have been poorly understood. I will present a novel framework for their formal specification and discuss how recent nontrivial replicated data type implementations fit into it. I will then illustrate the utility of the framework by presenting techniques for proving the correctness of replicated data type implementations.

This is joint work with Sebastian Burckhardt (Microsoft Research), Hongseok Yang (University of Oxford) and Marek Zawirski (INRIA & UPMC-LIP6). The associated paper appeared in POPL’14: http://software.imdea.org/~gotsman/papers/distrmm-popl14.pdf

Research grants in CRDTs and Distributed Aggregation

We are opening several new positions in HASLab, to be supported by (up to) two-year research grants.

– One of the grants is for a Post-Doc, to join the local team working on the evolution of CRDTs (Conflict-free Replicated Data Types). This is a very active research topic with growing impact in both industry and academia. We are looking for someone who can share our vision and work with us and our research associates. Possible research topics to be pursued by the successful candidate include the composition of CRDTs (with correctness proofs by construction), efficient update dissemination under different channel assumptions, and the unification of state-based and operation-based models. Prospective candidates should check this announcement and apply for research line RL3.

– Another grant is aimed at candidates holding an MSc degree, to join our research line in Distributed Data Aggregation. The successful candidate is expected to do research on distributed aggregation, e.g. Flow Updating and Extrema Propagation, and to explore improvements that allow faster and more reactive adaptation to changes in the network and in the monitored values. These research activities would be compatible with part-time enrollment in our PhD program, which the successful candidate will be encouraged to pursue. Prospective candidates should consult this announcement and apply for research line RL3.1.

HASLab has a very active research team in Large Scale Distributed Systems, working in both Systems and Theory. Our Laboratory is located in the Braga Campus of Universidade do Minho, one of the top state-funded research universities in Portugal and in the top 3% of the Scimago IBE ranking. We are part of the Associate Laboratory INESC TEC. Braga is very close to the Gerês Natural Park, and has good connections to Porto (and its vibrant cultural life) and the International Airport (30 min shuttle bus to Braga). It also has a very competitive cost of living in comparison to other European cities.

This call is open from June, 12th to June, 26th, 2013.

Any questions, contact Carlos Baquero at cbm@di.uminho.pt

Scaling up Reconciliation in Eventual Consistency

Eventually consistent distributed stores have followed different strategies when it comes to deciding how to reconcile data that has diverged. A discussion last week led me to think a bit about what can be done without knowing the semantics of the diverged data, and what possibly cannot be done.

A first step is to consider whether a given system detects divergence at all. Systems that use Last Writer Wins (LWW), such as Cassandra, do not detect divergence and simplify the process by deciding up front how divergence is reconciled: the last writer wins. In these systems data is tagged with a timestamp that follows a total order consistent with causality (ideally close to real time), so that causally related data is in the expected order, but concurrently evolved data will also appear to be ordered.
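A minimal sketch of the LWW rule follows (illustrative only, not Cassandra's actual implementation): every write carries a timestamp drawn from that total order, and reconciliation simply keeps the highest-timestamped value, whether the writes were causally related or concurrent.

```python
# Illustrative LWW register: each value carries a (timestamp, replica_id)
# tag; ties on the timestamp are broken by replica id to keep the order total.

def lww_merge(a, b):
    """a and b are (timestamp, replica_id, payload) tuples."""
    return a if (a[0], a[1]) >= (b[0], b[1]) else b

v1 = (10, "r1", "meet today")
v2 = (12, "r2", "better meet tomorrow")  # concurrent write, later timestamp
print(lww_merge(v1, v2))                 # v1 is silently discarded
```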

Another option is to track the causal evolution of data and accurately detect when concurrent writes have led to divergent states (as is possible, e.g., in Riak). These states can then be presented to the application logic, or to the user, which can try to resolve the conflict and create a new state that causally overwrites the concurrent states. This option is also followed in revision control systems, where the programmer is responsible for resolving conflicts.

Marek Zawirski reminded me that LWW can be quite adequate if there is a single user updating the data. In this case, and if the system’s registered total order is not too far from real time, the user is actually making an external causal link across the data. In simpler terms, if I wrote “meet today” in one partition and later wrote “better meet tomorrow” in another partition, I know that converging to “better meet tomorrow” is the correct option, although the two states would be presented as concurrent from a causal perspective. However, in general LWW leads to lost updates when multiple users make changes.

Now we come to the second option: track causally divergent states and present them for reconciliation. If the rate of updates on a given data item is not very high, we can still expect the data to converge. Once two or more concurrent states occur, we only need to expect some user (say, a programmer merging two versions of code) to read them and submit a merged state that subsumes the concurrent ones. However, if the merge is non-deterministic, and other important properties are not met, we might run into a problem:

Each reconciliation decision is itself an update, and if multiple users try to reconcile in parallel, while other users keep issuing new updates, the system might never reduce the concurrent states to a single copy; the needed reconciliation becomes more and more confusing, so that in practical terms a converged state might not be reachable. This is a problem in what we can refer to as syntactic reconciliation, and it stems from relying only on causality: we have no information on the internal structure of the data and how it is reconciled. This approach only tracks how states relate in causal terms and which states safely subsume other preceding states.

So although tracking divergence and reducing concurrency by syntactic reconciliation is in general better than LWW, there is probably a scalability limitation in this line of approach. When the data types permit it, a way around this is to use information about the semantics of the data and automate reconciliation, making it behave more like an aggregation process, where intermediate reconciliations do not increase the complexity but instead help the progress towards a convergent state. To that end, reconciliation needs to be deterministic and exhibit certain properties: it must be associative, commutative and idempotent, so that states form a join semi-lattice under a given partial order and reconciliation computes the least upper bound (LUB).
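As a minimal illustration of those properties (a generic sketch, not tied to any particular store), a grow-only set is about the simplest such design: the merge is set union, which is associative, commutative and idempotent, so states can be exchanged and combined in any order, any number of times, and still reach the same least upper bound.

```python
# Grow-only set (G-Set): states are sets ordered by inclusion; merge is set
# union, which computes the least upper bound of the two states.

def merge(s1, s2):
    return s1 | s2

r1 = {"x", "y"}   # state at replica 1
r2 = {"y", "z"}   # state at replica 2
r3 = {"w"}        # state at replica 3

# Associativity, commutativity and idempotence: grouping, ordering and
# repeated delivery of states do not change the outcome.
assert merge(merge(r1, r2), r3) == merge(r1, merge(r2, r3))
assert merge(r1, r2) == merge(r2, r1)
assert merge(r1, r1) == r1

print(merge(merge(r1, r2), r3))   # {'w', 'x', 'y', 'z'} under any schedule
```

Because the outcome is independent of where and in what order merges happen, the same merge function can run at the servers or at the clients, which is exactly the design choice discussed next.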

Such amenable data types, captured in Conflict-Free Replicated Data Type (CRDT) designs and in principled eventual consistency, still allow several options when considering viable system designs. One such option is to do merges at the server side and try to keep divergence to a minimum (see riak-dt). Another is to do it at the client side, something achievable with appropriate libraries (see toolbox, meangirls, knockbox and statebox) on top of any system that detects and presents concurrent states.

And certainly there is an interesting middle-ground to explore where semantic merging is done both at the edge (close to the clients) and at the core (in the DCs).

Model-checking graph properties using Alloy

While developing an algorithm for fast distributed computation of a graph diameter and radius, some of my colleagues had a conjecture about graph topology that, if true, would enable an aggressive optimization of the algorithm.

In order to understand the conjecture we must first recall some (more or less) standard graph theory definitions:

  • The distance between two nodes is the length of the shortest path between them;
  • The eccentricity of a node is the greatest distance between that node and all others;
  • The diameter of a graph is the maximum eccentricity found in the graph;
  • A node is peripheral if its eccentricity is equal to the diameter;
  • The antipode of a node is the set of nodes at distance equal to its eccentricity.

Given a non-directed connected graph, the initial conjecture was:

For every node, all nodes in its antipode are peripheral.

They quickly found a counter-example to this conjecture and relaxed it to the following:

For every node, some node in its antipode is peripheral.

For a long time they were not able to prove or refute this conjecture. At some point I was involved in the discussion and decided to model-check it using Alloy before attempting any proof.
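Alloy performs a bounded exhaustive check over all graphs up to a given scope. Purely as an illustration of what is being checked (this is not the Alloy model from the post), the same kind of bounded check can be brute-forced in a few lines of Python over all connected graphs with up to six nodes:

```python
# Brute-force check of "for every node, some node in its antipode is
# peripheral" over all connected undirected graphs with up to MAX_N nodes.
from itertools import combinations, chain
from collections import deque

MAX_N = 6

def distances(adj, src):
    # BFS shortest-path distances from src.
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def holds(adj, nodes):
    dist = {u: distances(adj, u) for u in nodes}
    if any(len(dist[u]) < len(nodes) for u in nodes):
        return True                     # not connected: out of scope
    ecc = {u: max(dist[u].values()) for u in nodes}
    diameter = max(ecc.values())
    peripheral = {u for u in nodes if ecc[u] == diameter}
    for u in nodes:
        antipode = {v for v in nodes if dist[u][v] == ecc[u]}
        if not (antipode & peripheral):
            return False                # counter-example found
    return True

for n in range(2, MAX_N + 1):
    nodes = list(range(n))
    all_edges = list(combinations(nodes, 2))
    for edges in chain.from_iterable(
            combinations(all_edges, k) for k in range(len(all_edges) + 1)):
        adj = {u: set() for u in nodes}
        for u, v in edges:
            adj[u].add(v)
            adj[v].add(u)
        if not holds(adj, nodes):
            print("counter-example with", n, "nodes:", edges)
```

As with Alloy, a clean run only says that no counter-example exists within the bound; it is not a proof.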


Conflict-free Replicated Data Types

A follow-up to the paper in this post was just accepted to SSS 2011. The new paper focuses on the following contributions:

  • A solution to the CAP problem, Strong Eventual Consistency (SEC).
  • Formal definitions of Strong Eventual Consistency (SEC) and of CRDTs.
  • Two sufficient conditions for SEC.
  • A strong equivalence between the two conditions.
  • We show that SEC is incomparable to sequential consistency.
  • Description of basic CRDTs, including integer vectors and counters.
  • More advanced CRDTs, including sets and graphs.

Regarding CAP, what we explore is how much consistency we can deliver without compromising Availability and Partition Tolerance. This is captured by the notion of Strong Eventual Consistency (SEC), which is formalized in the paper.

Any subset of replicas of a SEC object eventually converges, independently of the fate of the remaining replicas; the members of the subset only need to be able to deliver messages to each other. Therefore a SEC object tolerates up to n − 1 simultaneous crashes. A SEC object thus fills an interesting space in the trade-off given by the CAP theorem: it is (eventually) consistent without sacrificing availability and partition tolerance. Remarkably, SEC does not require solving consensus.
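As a rough operational illustration (a sketch under my own assumptions, not code from the paper), consider a state-based grow-only counter with three replicas, one of which crashes after its state has reached a single surviving replica; the survivors still converge just by exchanging states between themselves.

```python
# Illustrative G-Counter: each replica increments only its own entry and
# merge takes the pointwise maximum, so states form a join semi-lattice.

def merge(s1, s2):
    return {r: max(s1.get(r, 0), s2.get(r, 0)) for r in s1.keys() | s2.keys()}

def value(state):
    return sum(state.values())

a = {"a": 3}
b = {"b": 1}
c = {"c": 5}     # replica c crashes right after gossiping its state to a

a = merge(a, c)  # c's contribution reached a before the crash
a, b = merge(a, b), merge(b, a)      # the survivors exchange states
assert value(a) == value(b) == 9     # the surviving subset converged
```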

We also introduce a SEC-compliant directed-graph CRDT that allows concurrent vertex and edge addition and removal, with a semantics compatible with concurrent web crawling.

The associated TR is available here.