DHTs based of Amazon’s Dynamo, such as Cassandra and Riak, claim to be “Eventually Consistent” DHTs. In the following post I attempt to visually represent exactly what that means to me.
I came across the blog post by Werner Vogels, where he explains the significance of N, W, and W variables in zero-hop DHT. Roughly speaking, N is the number of nodes data will be replicated to.W is simply the number of nodes that need to respond to a client when it tries to write data into DHT, whereas R is the number of nodes that need to respond when reading data from a DHT. The goal of this post is also to make more understandable to me exactly what N, R, and W variables are. For a DHT with 8 nodes, labeled A – H, certain portion of the data is stored on each of the nodes. The node that stores the data is determined by the hash of the portion of the data. If the hash has a value range between 0 – 256, each node is responsible for an equal sub portion of the data corresponding to the hash space.
Additionally, DHTs allow for the data to be replicated to more than one node in case of failure. In the example, N is set to 3, so the data is replicated in three places. So when W < N, how do the N – W nodes receive the data? According to Werner Vogels article, “a mechanism is in place that applies the updates in a lazy manner to the remaining nodes in the replica’s set”. The time it takes for the update to propagate through is the inconsistency window and is the period during which it is dangerous to read from the DHT, since you may get a different value from the one that was previously written.
However, Werner Vogel claims that this may only happen when W + R <= N. I wanted to understand better what this dynamic relationship between N, R, and W is, so with some help from my professor I drew a graph of what exactly happens. In order to do that, I took some liberties in order to simplify the example. I assumed that events, such as write to DHT, read from DHT, receive a response, etc., only happen at specific time ticks, where a clock tick has predefined time period. I also no two events are concurrent, which means that two nodes cannot receive a multicast message from a client at the same time.
If we rewrite the equation, then R <= N – W. So this means that the system is in danger only when the amount of reads necessary is less than or equal to the number of nodes that are being propagated to lazily. There are three possibilities, either R is less than W, R is equal to W, or R is greater than W. I will focus on the case when R is less than W, although all three cases work the same way.
The following example shows how the graph looks when R < W. Time is represented on the horizontal line, and the vertical line represents the amount of nodes the request has propagated to. So, at a certain time tick period, tick 1, client ta adds data to the DHT. At another time tick period, tick 2, W nodes have responded to ta’s request and ta thinks that the data is written into the DHT. However, it takes some time for the data to get out to all N nodes. We call the time for the data to propagate to all N nodes tick 3. Now let’s imagine that a different client, tb executes a get on the data item because it wants to know its content, which can happen either between tick 1 and tick 2, or between tick 2 and tick 3. If the get happens before tick 1 or after tick 3, it is the same as what it was before or after the add. The data is in danger of being different in the read than from the write only if the dispatch from the read starts and ends in the danger area. As an example, let’s say than we have N = 5, W = 4, and R = 1. According to the equation, it is possible to get an inconsistent result since R <= N – W. If it takes one clock tick for the request to propagate to each node, then we are in the danger area for 1 clock tick. At exactly this clock tick it is possible for a node to read from the single incorrect node and get the stale data. This example is illustrated in the graph below, where the purple area represents the time that it takes for the read to propagate to the incorrect node.
However if the read happened before tick 2, then the read would be correct since the read finished before the write finished. However if we set N = 5, W = 4, and R = 2, then there is no possibility for the read to land in the danger area. If the read occurs between tick 2 and tick 3, then it could only finish after tick 3, at which point the client would get two conflicting results. Similarly if the read starts before tick 2, it could end after tick 2 and before tick 3, but then the results would conflict. Here is a different example.
Here, a read operation starts in between a write operation, and ends after the write operation has completed. Let’s say that N = 5, R = 3, and W = 3. In the following example we should be safe and have no inconsistencies. However if the read operation starts before the write finishes, and ends in the danger area, it is possible to get the old value. Let’s say the read request lands on the node that has the old value in the safe area. In the danger area, the read request lands on the lazily updated nodes before the data has propagated. Thus, tb reads the old value after the write has finished. The time that tb takes to perform the read is illustrated by the purple area in the graph above. The data is still considered correct, however, since the read has started before the write has finished. That verdict implies that, assuming R, W, and N are configured correctly, ta does not offer any guarantees about what the correct data should be before it has finished its write. Thus any read operation started before that may get either the old or the new data.
So, with DHTs, you are essentially able to toggle that window of the danger area. You can set your DHTs to be always consistent with W = N, or you can leave them to be inconsistent for a brief period of time with W < N, but guaranteeing that you will always know about it with R <= N – W. The data will become eventually consistent when it propagates to all N – W nodes. This inconsistency window is an example of practice diverging from theory, where it is theoretically possible to get stale results, but in practice it seldom matters. When dealing with entities such as “Big Data”, sometimes it might be more advantageous to provide availability and eventual consistency, rather than strong consistency but poor availability.