Dynamo: Amazon’s Highly Available Key-value Store — A Summary (Part-1)

Abhishek Kedia
6 min read · May 26, 2024


Introduction to DynamoDB

Imagine a database that scales with your application’s demands, handles massive amounts of data, and stays available even when parts of it fail. That is the goal behind AWS DynamoDB, Amazon’s fully managed NoSQL database service. DynamoDB is designed to provide high availability and performance, making it a good fit for applications that need consistent, low-latency data access at any scale. Whether you’re building real-time bidding platforms, gaming leaderboards, or IoT applications, DynamoDB offers a robust solution with its distributed, key-value store architecture.

About This Article

In this article, we’ll explore the ideas behind DynamoDB by walking through the influential paper “Dynamo: Amazon’s Highly Available Key-Value Store.” Dynamo, the system the paper describes, was built for Amazon’s internal services; DynamoDB is a separate managed service that draws on many of the same principles. We’ll break down each section of the paper into easy-to-understand terms, complete with illustrative images. Whether you’re a newcomer to distributed systems or an experienced developer, this guide covers the design principles behind Dynamo’s high availability, scalability, and fault tolerance, and how they are applied in real-world scenarios.

1. Introduction

In the world of online retail, downtime is not an option. Amazon, a leader in e-commerce, needed a storage system that could handle massive amounts of traffic, scale effortlessly, and remain available even in the face of hardware failures. Enter Dynamo, Amazon’s highly available key-value store designed to meet these demanding requirements.

The Challenge

Traditional relational databases struggled to keep up with Amazon’s need for high availability, especially during peak shopping seasons. Systems had to be robust enough to handle failures without impacting the customer experience. To address this, Amazon engineers needed a solution that could:

  • Ensure Availability: The system should be operational 24/7, even when some parts fail.
  • Handle Scalability: As the business grows, the database should easily scale without major overhauls.
  • Maintain Performance: Speedy and consistent performance is crucial for a smooth customer experience.

With so many servers, a small but significant number of server and network components are failing at any given time. Amazon’s software systems therefore need to treat failure handling as the normal case, without impacting availability or performance.

The Solution

Dynamo was engineered as a distributed, decentralized key-value store that prioritizes availability and partition tolerance over strong consistency. The resulting model, eventual consistency, means that replicas may briefly disagree but converge once updates have propagated, while the service stays available throughout.
Dynamo brought together several techniques that have since influenced many modern distributed systems, including:

  • Consistent Hashing: Efficiently distributes data across nodes, minimizing the impact of node additions or removals.
  • Vector Clocks: Track data versions so that conflicts from concurrent updates can be detected.
  • Quorum-Based Replication: Requires a configurable number of replicas to take part in each read and write, balancing consistency against availability.
  • Merkle Trees: Efficiently identify and reconcile data inconsistencies across replicas.

We will see all of these in detail as we proceed.
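
As a small taste of one of these ideas, here is a minimal sketch of how a Merkle tree lets two replicas check whether they hold the same data by comparing a single root hash. The function name `merkle_root`, the use of SHA-256, and the sample data are assumptions made for illustration, not details taken from the paper. A fuller implementation would also walk down the tree to find exactly which key ranges differ; this sketch only compares roots.

```python
import hashlib


def _h(data: bytes) -> bytes:
    """Hash helper (SHA-256 is an illustrative choice)."""
    return hashlib.sha256(data).digest()


def merkle_root(items: dict[str, bytes]) -> bytes:
    """Return the root hash of a Merkle tree built over key-sorted items."""
    # Leaves: one hash per key/value pair, in sorted key order so that two
    # replicas holding the same data always build the same tree.
    level = [_h(k.encode() + b"\x00" + v) for k, v in sorted(items.items())]
    if not level:
        return _h(b"")
    # Repeatedly hash adjacent pairs until a single root remains.
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])  # odd count: pair the last hash with itself
        level = [_h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]


# Hypothetical replica contents.
replica_a = {"cart:1": b"book", "cart:2": b"pen"}
replica_b = {"cart:1": b"book", "cart:2": b"pen"}
replica_c = {"cart:1": b"book", "cart:2": b"pencil"}

in_sync = merkle_root(replica_a) == merkle_root(replica_b)
diverged = merkle_root(replica_a) != merkle_root(replica_c)
```

Two replicas with identical contents produce the same root, so they can confirm they are in sync by exchanging one hash instead of every key.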

2. Background

Traditional relational databases, while powerful, were not well-suited to the demands of Amazon’s highly available, large-scale service-oriented architecture. Amazon’s e-commerce platform required a system that could handle extreme loads, guarantee uptime, and support diverse services.
An RDBMS was not the ideal choice. Although it offers complex queries and rich management functionality, it requires expensive hardware and skilled personnel, and most of Amazon’s services did not need those features.
Dynamo has a simple key/value interface, is highly available with a clearly defined consistency window, is efficient in its resource usage, and has a simple scale-out scheme to address growth in data set size or request rates.

2.1 System Assumptions and Requirements

Creating Dynamo required a clear set of assumptions and requirements to ensure it could meet the needs of Amazon’s ever-growing e-commerce platform. This section outlines the critical assumptions and requirements that guided its development.

Query Model: Dynamo uses a simple key-value interface. Each item is uniquely identified by a key, allowing for straightforward queries. It does not support complex queries or operations like joins, which are typical in relational databases. State is stored as binary objects (i.e., blobs) identified by unique keys. It targets applications that need to store objects that are relatively small (usually less than 1 MB).
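
To make the query model concrete, here is a minimal in-memory sketch of such an interface. The class name `TinyKeyValueStore`, the `MAX_OBJECT_BYTES` constant, and the decision to reject oversized values are assumptions for illustration; Dynamo’s real interface and its handling of versions are covered in the architecture section of the paper.

```python
from typing import Optional

MAX_OBJECT_BYTES = 1024 * 1024  # roughly the 1 MB object size mentioned above


class TinyKeyValueStore:
    """In-memory stand-in for a key-value store of opaque blobs (hypothetical)."""

    def __init__(self) -> None:
        self._data: dict[str, bytes] = {}

    def put(self, key: str, value: bytes) -> None:
        """Store an opaque blob under a single key."""
        if not isinstance(value, bytes):
            raise TypeError("values are opaque binary blobs")
        if len(value) > MAX_OBJECT_BYTES:
            raise ValueError("object is larger than this store targets")
        self._data[key] = value

    def get(self, key: str) -> Optional[bytes]:
        """Look up a blob by its key; there are no joins or range queries."""
        return self._data.get(key)


store = TinyKeyValueStore()
store.put("session:42", b'{"user": "alice"}')
session = store.get("session:42")
```

Everything the application needs must be reachable through a single key, which is why this model suits data such as sessions, shopping carts, and preferences.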

ACID Properties: Dynamo trades consistency for availability. It targets applications that can tolerate weaker consistency in exchange for high availability. Each update operation in Dynamo is confined to a single key. You can read or write an individual key-value pair but cannot perform operations that span multiple keys within a single transaction.
In the context of databases, isolation is one of the ACID properties that ensures that transactions are executed in such a way that they do not interfere with each other. In other words, even if multiple transactions are occurring simultaneously, each transaction should be unaware of others, providing a consistent view of the data.
If two clients write to the same key at the same time, Dynamo does not isolate these operations from each other. This lack of isolation means that conflicting updates can happen simultaneously, and it is up to the system to resolve these conflicts later.
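
A minimal sketch of how such conflicting writes can be detected uses vector clocks, one of the techniques listed earlier. The function `compare` and the node names `sx`, `sy`, and `sz` are hypothetical; each clock maps a node id to a counter of the updates that node has coordinated.

```python
def compare(a: dict[str, int], b: dict[str, int]) -> str:
    """Compare two vector clocks (node id -> counter)."""
    nodes = set(a) | set(b)
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"      # a is an ancestor of b, so b supersedes a
    if b_le_a:
        return "after"       # b is an ancestor of a, so a supersedes b
    return "concurrent"      # neither descends from the other: a conflict


base = {"sx": 1}                   # original version, written via node sx
client_1 = {"sx": 1, "sy": 1}      # client 1 updates it via node sy
client_2 = {"sx": 1, "sz": 1}      # client 2 updates it via node sz

ancestry = compare(base, client_1)        # "before"
conflict = compare(client_1, client_2)    # "concurrent"
```

When the result is `"concurrent"`, neither version can safely replace the other, so both are kept until something reconciles them.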

Efficiency: Dynamo is designed to handle a large number of read and write requests per second. The system aims to provide consistently low response times for operations to meet the real-time needs of an e-commerce platform. By running on commodity hardware, Dynamo keeps costs low while maintaining performance.

2.2 Service Level Agreements (SLA)

In the world of cloud services and distributed systems, Service Level Agreements (SLAs) are crucial. They define the expected performance and availability standards that a system must meet to ensure a satisfactory user experience. For Dynamo, these SLAs were particularly important to meet the high demands of Amazon’s e-commerce platform. The key SLA requirements for Dynamo include:

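  1. High availability, with failures handled without interrupting service
  2. Low latency for read and write operations, measured at the 99.9th percentile rather than the average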
  3. High throughput
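
The Dynamo paper expresses its SLAs at the 99.9th percentile of the latency distribution rather than as a mean or median; one example it gives is a service responding within 300 ms for 99.9% of requests at a peak load of 500 requests per second. The sketch below shows why that matters: a handful of slow requests barely moves the average but dominates the tail. The `percentile` helper (nearest-rank method) and the latency numbers are made up for illustration.

```python
import math


def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))  # 1-based rank
    return ordered[max(rank, 1) - 1]


# Hypothetical latencies in milliseconds: 9,980 fast requests and 20 slow ones.
latencies_ms = [10.0] * 9980 + [900.0] * 20

mean_ms = sum(latencies_ms) / len(latencies_ms)   # about 11.78 ms
p50_ms = percentile(latencies_ms, 50)             # 10.0 ms
p999_ms = percentile(latencies_ms, 99.9)          # 900.0 ms
```

Judged by the mean or median, this service looks fast, yet two in every thousand requests take close to a second. A percentile-based SLA is what exposes that tail.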

2.3 Design Considerations

Data replication algorithms used in commercial systems traditionally perform synchronous replica coordination (every replica is updated before a write is acknowledged) in order to provide a strongly consistent data access interface. To achieve this level of consistency, these algorithms are forced to trade off the availability of the data under certain failure scenarios. For instance, rather than dealing with the uncertainty of the correctness of an answer, the data is made unavailable until it is absolutely certain that it is correct.

An optimistic alternative is to propagate changes to replicas in the background. The challenge with this approach is that it can lead to conflicting changes, which must be detected and resolved. Conflict resolution raises two questions: when to resolve conflicts and who resolves them. Dynamo is designed to be an eventually consistent data store; that is, all updates reach all replicas eventually.
Dynamo is an “always writeable” data store. It accepts a write even if the node handling it has not yet received the latest data. This can result in inconsistencies, which are resolved during the read operation.

We will see how these inconsistencies are resolved in later parts of the article.
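
As a small preview, here is a minimal sketch of an “always writeable” store whose conflicts are reconciled at read time. It borrows the paper’s shopping cart example, in which divergent carts are merged so that an added item is never lost. The `Replica` class, the `reconciled_read` function, and the set-union merge rule are simplifying assumptions, not Dynamo’s actual implementation.

```python
class Replica:
    """A replica that accepts writes locally without coordinating with peers."""

    def __init__(self) -> None:
        self.carts: dict[str, set[str]] = {}

    def add_to_cart(self, cart_id: str, item: str) -> None:
        self.carts.setdefault(cart_id, set()).add(item)

    def read(self, cart_id: str) -> set[str]:
        return set(self.carts.get(cart_id, set()))


def reconciled_read(replicas: list[Replica], cart_id: str) -> set[str]:
    """Read from every replica and merge divergent versions by set union."""
    merged: set[str] = set()
    for replica in replicas:
        merged |= replica.read(cart_id)
    return merged


replica_1, replica_2 = Replica(), Replica()
replica_1.add_to_cart("cart:7", "book")   # accepted by replica 1
replica_2.add_to_cart("cart:7", "lamp")   # accepted by replica 2, which has not seen "book"

cart = reconciled_read([replica_1, replica_2], "cart:7")   # {"book", "lamp"}
```

Neither write was rejected, and the read still returns both items. The trade-off of a union merge, which the paper also notes, is that a deleted item can occasionally resurface.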
The other key design principles are:

  1. Incremental Scalability: The system can be scaled out by adding more nodes, without significant reconfiguration or downtime. The architecture allows Dynamo to handle the growth of Amazon’s services by incrementally adding resources as needed.
  2. Symmetry: Dynamo’s design ensures that every node has the same set of responsibilities as any other node, promoting a symmetric system structure where nodes can easily take over the responsibilities of any failed or removed node.
  3. Decentralization: Extending symmetry, the design favors decentralized peer-to-peer techniques over centralized control. Failure detection and recovery, for example, do not rely on a central coordinator, so individual node failures do not lead to system-wide outages.
  4. Heterogeneity: Dynamo must accommodate the heterogeneity in the infrastructure, meaning it must efficiently operate on nodes with varying capacities and performance characteristics. This design allows it to utilize hardware upgrades and additions seamlessly.

3. Related Work

The “Related Work” section discusses various systems and concepts that influenced Dynamo’s design. It references traditional distributed databases like Tandem’s NonStop for their fault tolerance, and Distributed Hash Tables (DHTs) such as Chord and Pastry for their efficient data distribution using consistent hashing. It also highlights the importance of eventual consistency from systems like Bayou, contrasting it with strong consistency models like Paxos. Fault-tolerant systems such as the Google File System (GFS) are noted for their replication strategies. Additionally, peer-to-peer (P2P) systems like Napster, Gnutella, and BitTorrent are mentioned for their decentralized approaches to resource sharing and node responsibility management. These diverse influences helped shape Dynamo into a highly available and scalable key-value store.

It will be helpful to have a good understanding of consistent hashing before you proceed with the next part.
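
For readers who want a head start, here is a minimal sketch of a consistent hashing ring. The `HashRing` class, the node names, and the use of MD5 to place names on the ring are illustrative assumptions, and the sketch omits refinements such as virtual nodes and replication. It demonstrates the property that matters: when a node joins, the only keys that move are the ones the new node takes over.

```python
import bisect
import hashlib


def _position(name: str) -> int:
    """Map a string to a point on the ring (MD5 is an illustrative choice)."""
    return int.from_bytes(hashlib.md5(name.encode()).digest(), "big")


class HashRing:
    def __init__(self, nodes: list[str]) -> None:
        # Sorted list of (position, node name) pairs forms the ring.
        self._ring = sorted((_position(n), n) for n in nodes)

    def add(self, node: str) -> None:
        bisect.insort(self._ring, (_position(node), node))

    def owner(self, key: str) -> str:
        """Return the first node at or after the key's position, wrapping around."""
        idx = bisect.bisect_left(self._ring, (_position(key), ""))
        return self._ring[idx % len(self._ring)][1]


keys = [f"key-{i}" for i in range(1000)]
ring = HashRing(["node-a", "node-b", "node-c"])
before = {k: ring.owner(k) for k in keys}

ring.add("node-d")
after = {k: ring.owner(k) for k in keys}

moved = [k for k in keys if before[k] != after[k]]
# Every key that moved now belongs to the new node; nothing shuffles between old nodes.
only_to_new_node = all(after[k] == "node-d" for k in moved)
```

With a naive `hash(key) % number_of_nodes` scheme, adding a fourth node would reassign most keys. On the ring, only the keys in the new node’s range change owner.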

In the next part, we will look into the System Architecture of Dynamo.
