Dynamo: Amazon’s Highly Available Key-value Store — A Summary (Part-2)
4. System Architecture
This section focuses on the core distributed systems techniques used in Dynamo: partitioning, replication, versioning, membership, failure handling and scaling.
4.1 System Interface
The system interface is the front door to Dynamo: it is how clients (applications or users) interact with the system to store and retrieve data. Dynamo’s interface is designed to be simple and easy to use, while still providing powerful functionality. It exposes two operations: get() and put().
- get() Operation: When you want to find a specific object (like a book) stored in Dynamo, you use the get() operation. You provide a key (like a book's title), and Dynamo locates all the copies of that object across its storage system. It then returns either a single copy of the object or a list of copies if there are conflicting versions, along with some extra information called a context.
- put() Operation: When you want to store a new object in Dynamo, you use the put() operation. You provide a key, the context returned by an earlier get() (which encodes version metadata about the object), and the object itself (the actual content of the book). Dynamo figures out where to store copies of the object based on the key and writes them to disk.
Dynamo applies an MD5 hash to the key to generate a 128-bit identifier, which is used to determine the storage nodes responsible for serving the key (see consistent hashing below).
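To make the interface concrete, here is a minimal sketch of what a client-facing wrapper could look like; the class and method names (DynamoClient, _hash) are illustrative and not part of the actual system:

```python
import hashlib

# Minimal sketch of Dynamo's two-operation interface (names are illustrative).
class DynamoClient:
    def get(self, key: str):
        """Return either a single object or a list of conflicting versions,
        together with an opaque context describing the version(s)."""
        key_id = self._hash(key)   # 128-bit position used for placement
        ...                        # locate the responsible replicas and read

    def put(self, key: str, context, obj: bytes):
        """Store obj under key; context carries the version metadata
        obtained from an earlier get()."""
        key_id = self._hash(key)
        ...                        # write copies to the responsible replicas

    @staticmethod
    def _hash(key: str) -> int:
        # MD5 of the key yields the 128-bit identifier mentioned above.
        return int.from_bytes(hashlib.md5(key.encode()).digest(), "big")
```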
4.2 Partitioning Algorithm
Partitioning in Dynamo is like organizing books into different sections of a library based on their titles. The Partitioning Algorithm decides which section (or partition) each piece of data belongs to. This ensures that data is evenly distributed across storage nodes, allowing Dynamo to scale effectively as more data is added.
- Consistent Hashing: Dynamo uses a technique called consistent hashing for partitioning. Imagine each storage node is like a shelf in the library, and each piece of data (or key) has a unique identifier. Instead of assigning keys to shelves based on their names, Dynamo uses consistent hashing to map keys to nodes in a circular “ring” structure.
- 128-bit Identifier: To determine which node should store a particular key, Dynamo applies a mathematical function called an MD5 hash to the key. This generates a 128-bit identifier for the key, which is then mapped onto the ring structure. Each node in the ring is responsible for storing keys that fall within a certain range of identifiers.
- Virtual Nodes: To make the partitioning process more flexible and efficient, Dynamo introduces the concept of virtual nodes. Instead of mapping each physical node to a single point on the ring, Dynamo assigns multiple virtual nodes to each physical node. This allows for finer-grained control over data distribution and makes it easier to add or remove nodes from the system without disrupting the overall balance of data.
You can watch this video to understand it better: https://www.youtube.com/watch?v=zaRkONvyGr8
- Load Balancing: By evenly distributing keys across the ring, consistent hashing helps balance the load on storage nodes. If one node becomes overloaded or fails, its workload is automatically distributed among neighboring nodes, ensuring that no single node becomes a bottleneck in the system.
- Scalability: Because keys are distributed randomly across the ring, consistent hashing allows Dynamo to scale seamlessly as the number of nodes in the system grows. New nodes can join the ring and take on responsibility for a portion of the keys without needing to reorganize the entire system.
Overall, the Partitioning Algorithm in Dynamo ensures that data is distributed evenly across storage nodes, allowing for efficient storage and retrieval of data while enabling the system to scale gracefully as it grows.
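For intuition, here is a minimal, self-contained sketch of consistent hashing with virtual nodes, assuming MD5 placement as described above; it is an illustration, not Dynamo’s actual implementation:

```python
import bisect
import hashlib

def md5_ring_position(value: str) -> int:
    """Map a string to a 128-bit position on the ring."""
    return int.from_bytes(hashlib.md5(value.encode()).digest(), "big")

class ConsistentHashRing:
    def __init__(self, nodes, vnodes_per_node=8):
        # Each physical node owns several virtual nodes (tokens) on the ring.
        self._ring = sorted(
            (md5_ring_position(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes_per_node)
        )
        self._positions = [pos for pos, _ in self._ring]

    def node_for_key(self, key: str) -> str:
        """Walk clockwise from the key's position to the first token."""
        pos = md5_ring_position(key)
        idx = bisect.bisect_right(self._positions, pos) % len(self._ring)
        return self._ring[idx][1]

ring = ConsistentHashRing(["A", "B", "C", "D"])
print(ring.node_for_key("shopping-cart:42"))   # e.g. 'C'
```

Because each physical node contributes several tokens, adding or removing a host only shifts the key ranges adjacent to its own tokens rather than reshuffling the whole ring.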
4.3 Replication
Replication plays a crucial role in ensuring high availability and durability of data. When you store a piece of data (let’s call it “key k”) in Dynamo, it’s not just saved on one node but replicated across multiple nodes for redundancy.
Here’s how it works:
- Replication Factor (N): Dynamo lets you specify a parameter called “N,” which determines how many copies of each data item will be made. For example, if N is set to 3, then key k will be replicated on 3 different nodes.
- Coordinator Node: Each data item is assigned to a coordinator node, which oversees the replication process for that item. This coordinator node is responsible for managing the replication of data items within its assigned range.
- Replication Process: Once a data item is stored on the coordinator node, it’s also replicated on N-1 other nodes in a clockwise direction around the consistent hashing ring. So, if N is 3, the coordinator node will replicate key k on the next two nodes in the ring.
- Preference List: The list of nodes responsible for storing a particular key is called the preference list. This list ensures that even if a node fails, there are still enough copies of the data available on other nodes. To account for potential node failures, the preference list contains more than N nodes.
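Building on the ring sketch from Section 4.2 (it reuses ConsistentHashRing and md5_ring_position), a preference list can be derived by continuing clockwise until N distinct physical nodes have been collected; skipping extra virtual nodes of the same host ensures the replicas land on different machines:

```python
def preference_list(ring: ConsistentHashRing, key: str, n: int = 3):
    """Return the first n distinct physical nodes encountered clockwise
    from the key's position; the first one acts as the coordinator."""
    pos = md5_ring_position(key)
    start = bisect.bisect_right(ring._positions, pos)
    nodes, i = [], 0
    while len(nodes) < n and i < len(ring._ring):
        _, node = ring._ring[(start + i) % len(ring._ring)]
        if node not in nodes:        # skip duplicate virtual nodes of one host
            nodes.append(node)
        i += 1
    return nodes

print(preference_list(ring, "shopping-cart:42", n=3))  # e.g. ['C', 'A', 'D']
```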
4.4 Data Versioning
Dynamo operates on the principle of eventual consistency, which means that updates to data are not instantly propagated to all copies of that data on other nodes. When you add or change something in Dynamo, the update might not immediately appear everywhere. Because of this delay, if you try to retrieve data immediately after updating it, you might not see the latest changes.
Some applications, like shopping carts on Amazon, can handle these inconsistencies. For example, if you add something to your cart but the system hasn’t updated everywhere yet, it still remembers your addition. Even if you change an older version of your cart, those changes are kept and merged with the latest version when possible.
Dynamo treats every change to data as a new, immutable version. This means that even if there are multiple versions of the same data floating around due to network issues, the system can keep track of them.
When you make a change to a data item, whether it’s an addition, modification, or deletion, Dynamo doesn’t alter the existing version. Instead, it creates a brand new version of the data with the updated changes. This preserves the history of changes, and each version is distinct from the others. Because updates never overwrite one another, concurrent writes cannot clobber each other; they simply produce distinct versions that may later need to be reconciled.
Conflict Resolution
In normal operation there are no conflicts: new versions subsume the previous ones, and Dynamo can determine the authoritative version itself (syntactic reconciliation). However, in the presence of failures combined with concurrent updates, version branching may occur, resulting in conflicting versions of an object. In that case, the system cannot merge the changes automatically and the client must perform the resolution (semantic reconciliation).
A typical example of collapsing divergent versions is “merging” different versions of a customer’s shopping cart. With this reconciliation mechanism, an “add to cart” operation is never lost; however, deleted items can resurface.
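As a toy illustration of why an “add to cart” survives but a deletion can resurface, consider a simple set-union merge of two divergent cart versions (a simplification of the application-level reconciliation Amazon actually performs):

```python
# Two divergent cart versions produced by concurrent updates.
cart_v1 = {"book", "pen", "lamp"}
cart_v2_add    = cart_v1 | {"mug"}       # one branch added "mug"
cart_v2_delete = cart_v1 - {"lamp"}      # the other branch deleted "lamp"

# A union merge keeps every addition ever made...
merged = cart_v2_add | cart_v2_delete
print(merged)  # {'book', 'pen', 'lamp', 'mug'} -> the deleted "lamp" resurfaces
```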
Vector Clocks
Dynamo uses vector clocks to keep track of the order of changes to data. Each version of an object has a vector clock associated with it, which helps the system understand which changes happened when and in what order.
A vector clock is essentially a list of (node, counter) pairs associated with each version of an object. These pairs represent the history of updates to the object across different nodes in the system. By examining the vector clocks of two versions of an object, Dynamo can determine whether one version causally precedes the other, or whether they are concurrent.
When a node performs an operation on a data object, it increments its counter in the vector clock associated with that object.
By comparing the vector clocks of different versions of an object, Dynamo can determine the causal ordering of updates. If every counter in one vector clock is less than or equal to the corresponding counter in the other, the first version is an ancestor of the second, meaning its updates happened before those of the second version and it can be safely discarded.
When two versions of an object have concurrent (conflicting) updates, both versions are preserved. All such versions are returned to the client on the next read operation, and the client performs semantic reconciliation.
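Here is a minimal sketch of vector-clock comparison, assuming each clock is stored as a dictionary mapping node IDs to counters; the node names Sx, Sy, Sz are illustrative:

```python
def descends(a: dict, b: dict) -> bool:
    """True if clock `a` descends from (or equals) clock `b`, i.e. every
    counter in `b` is <= the corresponding counter in `a`."""
    return all(a.get(node, 0) >= counter for node, counter in b.items())

def compare(a: dict, b: dict) -> str:
    if descends(a, b) and descends(b, a):
        return "equal"
    if descends(a, b):
        return "a supersedes b"    # b is an ancestor and can be discarded
    if descends(b, a):
        return "b supersedes a"
    return "concurrent"            # conflict: the client must reconcile

# Node Sx wrote twice; Sy and Sz each wrote once based on Sx's second write.
d2 = {"Sx": 2}
d3 = {"Sx": 2, "Sy": 1}
d4 = {"Sx": 2, "Sz": 1}
print(compare(d3, d2))   # 'a supersedes b'
print(compare(d3, d4))   # 'concurrent' -> both versions kept for the client
```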
4.5 Execution of get() and put() Operations
Any storage node in Dynamo can receive get() and put() operations for any key. There are two strategies for a client to select a node:
- Route requests through a generic load balancer that selects a node based on load information. If the selected node is not among the top N nodes in the key’s preference list, it forwards the request to one that is.
- Use a partition-aware client library to route requests directly to the appropriate coordinator nodes. This strategy can achieve lower latency by skipping a potential forwarding step.
The node handling a read or write operation is known as the coordinator. Typically, this is the first among the top N nodes in the preference list for the key being accessed.
Dynamo uses a consistency protocol similar to quorum systems, configurable with two key values:
- R: The minimum number of nodes that must respond for a successful read operation.
- W: The minimum number of nodes that must respond for a successful write operation.
Setting R and W such that R + W > N yields a quorum-like system.
put() Operation
- Generate Vector Clock: Upon receiving a put() request, the coordinator generates a vector clock for the new version of the object.
- Local Write: The coordinator writes the new version locally.
- Replicate: The coordinator sends the new version along with the vector clock to the N highest-ranked reachable nodes.
- Acknowledgment: If at least W-1 nodes acknowledge the write (together with the coordinator’s local write, this makes W), it is considered successful.
get() Operation
- Request Versions: Upon receiving a get() request, the coordinator requests all existing versions of the object for the key from the N highest-ranked reachable nodes.
- Wait for Responses: The coordinator waits for R responses before returning the result to the client.
- Multiple Versions: If multiple versions of the object are found, the coordinator returns all versions that are causally unrelated.
- Reconciliation: The divergent versions are reconciled, and the reconciled version is written back to the appropriate nodes.
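A simplified sketch of the coordinator’s quorum logic for both operations is shown below; send_write and send_read are hypothetical transport callables, and real Dynamo additionally handles timeouts, hinted handoff, and vector-clock bookkeeping:

```python
# Quorum configuration: R + W > N gives quorum-like consistency.
N, R, W = 3, 2, 2

def coordinator_put(other_replicas, key, value, clock, send_write):
    """Write locally, then replicate to the remaining highest-ranked
    reachable nodes; succeed once W acknowledgements arrive in total."""
    acks = 1                                      # the coordinator's local write
    for node in other_replicas[:N - 1]:
        if send_write(node, key, value, clock):   # hypothetical transport call
            acks += 1
        if acks >= W:
            return True
    return acks >= W

def coordinator_get(replicas, key, send_read):
    """Collect R responses and keep every causally-unrelated version."""
    responses = []
    for node in replicas[:N]:
        version = send_read(node, key)            # hypothetical transport call
        if version is not None:
            responses.append(version)
        if len(responses) >= R:
            break
    # Real Dynamo drops versions subsumed by others (via vector clocks) and
    # returns any remaining concurrent versions to the client to reconcile.
    return responses
```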
By using these methods, Dynamo ensures that data remains highly available and consistent, even in a distributed and potentially unreliable environment.
4.6 Handling Failures: Hinted Handoff
If Dynamo used a traditional quorum approach it would be unavailable during server failures and network partitions, and would have reduced durability even under the simplest of failure conditions. To maintain high availability and durability, Dynamo implements a technique called Sloppy Quorum.
Sloppy Quorum: In a traditional quorum system, strict membership is enforced, meaning that a quorum of the designated nodes must be available for the system to function. This strict requirement can lead to unavailability during failures. Dynamo avoids the issue with a “sloppy quorum”: all read and write operations are performed on the first N healthy nodes from the preference list, which may not always be the first N nodes encountered while walking the consistent hashing ring.
Example Scenario
Consider a Dynamo configuration with a replication factor N = 3. If node A is temporarily down or unreachable during a write operation, the system will send the replica that would normally live on node A to node D instead, along with metadata (a hint) indicating that node A was the intended recipient. This approach maintains the system’s availability and durability guarantees even when a node is down.
Because node D was not the intended recipient, it keeps such hinted replicas in a separate local database.
Nodes periodically scan their local databases to check for hinted replicas. When node D detects that node A has recovered, it will attempt to transfer the replica back to node A. Once the transfer is successful, node D can safely delete the replica from its local store, ensuring that the total number of replicas remains consistent.
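A minimal sketch of this hinted-handoff bookkeeping, assuming a toy in-memory store per node (the names hinted_store and handoff are illustrative):

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.store = {}          # normal replicas
        self.hinted_store = {}   # replicas held on behalf of other nodes
        self.alive = True

    def write(self, key, value, intended_for=None):
        if intended_for is not None and intended_for.name != self.name:
            # Keep the replica plus a hint naming the intended recipient.
            self.hinted_store[key] = (value, intended_for)
        else:
            self.store[key] = value

    def handoff(self):
        """Periodically try to return hinted replicas to their owners."""
        for key, (value, owner) in list(self.hinted_store.items()):
            if owner.alive:
                owner.store[key] = value     # deliver to the recovered node
                del self.hinted_store[key]   # safe to drop the local copy

# Node A is down, so the replica meant for A is written to D with a hint.
a, d = Node("A"), Node("D")
a.alive = False
d.write("cart:42", b"items", intended_for=a)
a.alive = True          # A recovers
d.handoff()             # D returns the replica and deletes its copy
```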
4.7 Handling Permanent Failures: Replica Synchronization
There are scenarios under which hinted replicas become unavailable before they can be returned to the original replica node. (Yes, they thought of this too!) To address these challenges, Dynamo employs an anti-entropy protocol to keep replicas synchronized and maintain consistency across the system. This process uses Merkle trees to detect and resolve inconsistencies efficiently.
Merkle Trees
A Merkle tree is a hash tree where:
- Leaves: The leaf nodes are hashes of individual data values (keys).
- Parent Nodes: The parent nodes are hashes of their respective child nodes.
Advantages of Merkle Trees
- Independent Branch Checking: Each branch of the tree can be checked independently. This means nodes do not need to exchange the entire tree or dataset, just the relevant branches where discrepancies might exist.
- Reduced Data Transfer: By comparing hashes at different levels of the tree, nodes can quickly identify where data discrepancies occur, minimizing the amount of data transferred for synchronization.
- Efficient Inconsistency Detection: If the root hashes of two Merkle trees are identical, the data in both trees is consistent. If the root hashes differ, nodes compare hashes at subsequent levels, narrowing down to the exact keys that are out of sync.
Anti-Entropy Process in Dynamo
Here’s how Dynamo uses Merkle trees for anti-entropy:
- Merkle Tree Maintenance: Each node maintains a separate Merkle tree for each key range it hosts. A key range is the set of keys managed by a virtual node.
- Root Exchange: Nodes periodically exchange the root hashes of the Merkle trees corresponding to the key ranges they have in common.
- Tree Traversal: If the root hashes differ, the nodes traverse the tree, comparing hashes at each level until they reach the leaves.
- Identifying Inconsistencies: This traversal helps nodes identify specific keys that are out of sync.
- Data Synchronization: Once discrepancies are identified, nodes synchronize the out-of-date keys, ensuring consistency across replicas.
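The sketch below builds a small Merkle tree per replica and finds the stale keys once the roots disagree; it is a simplification (a binary tree of MD5 hashes, with a leaf-level shortcut instead of the full branch-by-branch traversal real nodes perform):

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.md5(data).digest()

def build_merkle(values):
    """Build a hash tree: leaves are hashes of individual key/value entries,
    parents are hashes of their children. Returns levels, leaves first."""
    level = [h(v) for v in values]
    levels = [level]
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level), 2):
            pair = level[i] + (level[i + 1] if i + 1 < len(level) else b"")
            nxt.append(h(pair))
        level = nxt
        levels.append(level)
    return levels

def differing_leaves(tree_a, tree_b):
    """If the roots match, the replicas agree and nothing is exchanged.
    Otherwise compare leaf hashes to find stale keys (real nodes descend
    level by level, exchanging only the branches whose hashes differ)."""
    if tree_a[-1] == tree_b[-1]:
        return []
    return [i for i, (x, y) in enumerate(zip(tree_a[0], tree_b[0])) if x != y]

replica_1 = build_merkle([b"k1=v1", b"k2=v2", b"k3=v3", b"k4=v4"])
replica_2 = build_merkle([b"k1=v1", b"k2=STALE", b"k3=v3", b"k4=v4"])
print(differing_leaves(replica_1, replica_2))   # [1] -> only k2 needs repair
```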
Challenges and Solutions
- Key Range Changes: When nodes join or leave the system, the key ranges they manage can change, necessitating recalculation of the Merkle trees. This recalculation can be resource-intensive.
- Partitioning Scheme: Dynamo addresses this challenge with a refined partitioning scheme (described in Section 6.2), which minimizes the frequency and impact of key range changes, thus reducing the need for Merkle tree recalculations.
The anti-entropy protocol using Merkle trees allows Dynamo to efficiently detect and resolve inconsistencies between replicas. This process ensures data durability and consistency, even in the presence of permanent failures or when nodes are added or removed from the system. By leveraging the hierarchical nature of Merkle trees, Dynamo minimizes the data transfer and computational overhead involved in keeping replicas synchronized, maintaining high availability and reliability in a distributed environment.
4.8 Membership and Failure Detection
Membership and failure detection in Dynamo are critical for maintaining a stable and consistent view of the system, ensuring that data replication and partitioning remain effective even as nodes join, leave, or fail.
In Dynamo, nodes can experience outages due to failures or maintenance tasks. These outages are typically transient but can last for extended periods. To manage these scenarios effectively:
- Explicit Membership Changes: Rather than automatically rebalancing the system with each transient failure, Dynamo uses explicit commands to add or remove nodes from the ring. This prevents unnecessary rebalancing and replica repairs.
- Administrator Control: An administrator can use a command-line tool or a browser to connect to a Dynamo node and issue commands to add or remove nodes.
- Persistent Store: The node handling the request records the membership change and its timestamp in persistent storage, creating a history of changes.
- Gossip-Based Protocol: Membership changes are propagated throughout the system using a gossip-based protocol. Each node randomly contacts a peer every second to reconcile their membership change histories, ensuring an eventually consistent view across the ring. (As seen earlier, each node knows which nodes store the data for a particular key.)
When a node starts for the first time, it selects its tokens (representing virtual nodes in the consistent hash space) and maps nodes to these tokens. The mapping is saved on disk and initially includes only the local node and its tokens. These mappings are also reconciled through the gossip-based protocol, allowing each node to know the token ranges managed by other nodes. This enables direct forwarding of read/write operations to the appropriate nodes.
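A toy sketch of gossip-based reconciliation of membership histories, assuming each node keeps a timestamped change log and merges by keeping the most recent entry per node (a simplification of the real reconciliation rule):

```python
import random

class Member:
    def __init__(self, name):
        self.name = name
        # membership history: node name -> (timestamp, "added" | "removed")
        self.view = {name: (0, "added")}

    def gossip_with(self, peer):
        """Reconcile histories: keep the most recent change for each node."""
        merged = dict(self.view)
        for node, (ts, change) in peer.view.items():
            if node not in merged or ts > merged[node][0]:
                merged[node] = (ts, change)
        self.view = peer.view = merged

nodes = [Member("A"), Member("B"), Member("C")]
nodes[0].view["D"] = (5, "added")    # admin added node D via node A
for _ in range(10):                  # each round, a random pair gossips
    x, y = random.sample(nodes, 2)
    x.gossip_with(y)
print(sorted(nodes[2].view))         # with high probability now includes 'D'
```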
External Discovery
The mechanism described above could temporarily result in a logically partitioned Dynamo ring. For example, the administrator could contact node A to join A to the ring, then contact node B to join B to the ring. In this scenario, nodes A and B would each consider itself a member of the ring, yet neither would be immediately aware of the other. To prevent logical partitions, some Dynamo nodes play the role of seeds.
To prevent logical partitions where nodes might not be aware of each other:
- Seed Nodes: Certain Dynamo nodes act as seeds. These nodes are known to all other nodes through an external mechanism and help maintain connectivity.
- Seed Nodes’ Role: Seeds ensure that each node eventually reconciles its membership with at least one common node, preventing partitions.
- Discovery: Seeds can be configured statically or discovered via a configuration service. They are typically fully functional Dynamo nodes that participate in the ring.
Failure Detection
Failure detection is essential to ensure that nodes do not waste time attempting to communicate with unreachable peers. This is crucial for maintaining high availability during get() and put() operations and for transferring partitions and hinted replicas.
Local Failure Detection: Each node maintains a local view of failure:
- Message Responsiveness: If node A fails to receive a response from node B, node A marks node B as failed and reroutes requests to other nodes. (Each node knows the list of nodes that might have information about a particular key).
- Periodic Retries: Node A will periodically retry communicating with node B to check for recovery.
A node reaches out to another node only when it has requests to serve. In the absence of client traffic, a down node may go unnoticed, and that is acceptable: neither node really needs to know whether the other is reachable until there is something to exchange.
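A small sketch of this purely local failure detection with periodic re-probing, assuming a hypothetical send() transport that raises TimeoutError for an unresponsive peer:

```python
import time

class FailureDetector:
    def __init__(self, send, retry_interval=10.0):
        self.send = send                  # hypothetical transport callable
        self.failed = {}                  # peer -> time it was marked failed
        self.retry_interval = retry_interval

    def request(self, peer, message):
        # Skip peers recently marked failed, unless it is time to re-probe.
        if peer in self.failed and time.time() - self.failed[peer] < self.retry_interval:
            return None                   # caller reroutes to another replica
        try:
            reply = self.send(peer, message)
            self.failed.pop(peer, None)   # peer responded: consider it alive
            return reply
        except TimeoutError:
            self.failed[peer] = time.time()   # mark failed, retry later
            return None
```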
By combining explicit membership management, a gossip-based protocol, and effective local failure detection, Dynamo ensures robust handling of both temporary and permanent node failures, maintaining the system’s availability and consistency.
4.9 Adding/Removing Storage Nodes
When a new node (let’s call it X) is added to the Dynamo ring, the following steps occur:
- Token Assignment: Node X is assigned a number of tokens that are randomly scattered on the consistent hashing ring. These tokens determine the key ranges for which node X will be responsible.
- Current Handlers: For each key range assigned to node X, there are typically existing nodes that are currently handling those keys.
- Key Transfer: These existing nodes will transfer the relevant key ranges to node X. This transfer ensures that the data is redistributed to maintain balance across the ring.
- Load Distribution: This method of transferring key ranges ensures that the load is uniformly distributed across storage nodes. This is crucial for meeting latency requirements and ensuring fast bootstrapping of the new node.
When a node is removed from the system, keys are reallocated in the reverse process: the departing node’s key ranges are handed back to the remaining nodes.
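As a quick illustration of why this rebalancing is cheap, the snippet below (reusing the ConsistentHashRing sketch from Section 4.2) counts how many keys change owner when a hypothetical node X joins; only the keys that now fall under X’s tokens move:

```python
# Only keys whose clockwise successor changed need to be transferred to X.
before = ConsistentHashRing(["A", "B", "C", "D"])
after = ConsistentHashRing(["A", "B", "C", "D", "X"])

keys = [f"key-{i}" for i in range(1000)]
moved = [k for k in keys if before.node_for_key(k) != after.node_for_key(k)]
print(f"{len(moved)} of {len(keys)} keys move, all of them to:",
      {after.node_for_key(k) for k in moved})   # expected: {'X'}
```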
It is impressive how Dynamo’s architecture anticipates all of these scenarios.
Let’s Get in Touch
You are most welcome to follow me here on Medium. In addition, feel free to check out:
- My portfolio
- My LinkedIn Profile: Let’s connect!
- My Twitter Profile