DynamoDB is a powerful data persistence offering in the AWS suite that allows for highly scalable data access. It’s quite simple to get started using DynamoDB, and there is a good amount of documentation on the topic, including Lambda and AppSync integration. While it’s easy to get started, modeling complex data can be challenging to visualize, especially when coming from other systems like relational databases.
What we will cover:
- Conceptual data modelling
- Dealing with relational data
- Comparison to relational databases and SQL
What we will not cover:
- Specific DynamoDB APIs or SDKs (you can use CLI, JS, C#, and others)
- Authentication, authorization or access control
The Relational Way
Let’s cover some of the relational database concepts we know to help us better contrast how DynamoDB changes many of the patterns we are used to.
SQL / Normalization
RDBMS (Relational Database Management Systems) generate or materialize views dynamically from a normalized and optimized version of the data. SQL is the language of choice when making these queries. Normalization is designed to keep the data consistent and reduce redundancy. This often means spreading data across multiple tables to ensure the data is only entered in one place, then linked using complex SQL queries. This helps avoid insertion, update, and deletion anomalies, allows flexible redesign and extensibility, and helps conserve storage space.
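To make the idea concrete, here is a minimal sketch using Python's built-in sqlite3 module. The customers and orders tables and their columns are hypothetical, chosen only for illustration. Each fact lives in one place, and the view the application wants is assembled by a join at query time.

```python
import sqlite3

# Hypothetical schema for illustration: customers and their orders live in
# separate tables, linked by a foreign key, so each fact is stored once.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(id),
        total REAL NOT NULL
    );
    INSERT INTO customers (id, name) VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO orders (id, customer_id, total) VALUES
        (10, 1, 25.0), (11, 1, 40.0), (12, 2, 15.0);
""")

# The "view" the application wants is materialized at query time by a join.
rows = conn.execute("""
    SELECT c.name, COUNT(o.id) AS order_count, SUM(o.total) AS spent
    FROM customers AS c
    JOIN orders AS o ON o.customer_id = c.id
    GROUP BY c.id
    ORDER BY c.id
""").fetchall()

# Renaming a customer touches exactly one row; every order sees the change.
conn.execute("UPDATE customers SET name = 'Ada L.' WHERE id = 1")
renamed = conn.execute("""
    SELECT DISTINCT c.name
    FROM customers AS c
    JOIN orders AS o ON o.customer_id = c.id
    WHERE c.id = 1
""").fetchall()
```

The single-row rename is the payoff of normalization; the join on every read is the compute cost that comes with it.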
Where the ORM (Object Relational Mapper) comes in
When looking at normalized data within a database, it becomes very hard to understand the relationships at a glance without some way to group things together that makes sense to the human mind. Objects simply make more sense when approaching a problem. ORMs were designed to help us bridge that gap and hide much of the normalization and SQL under an object grouping that makes sense to the application developer.
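As a rough illustration of what an ORM does for us, the sketch below hand-rolls the mapping with Python's sqlite3 and dataclasses modules. The Customer and Order classes, the load_customer helper, and the table layout are all hypothetical; a real ORM generates this kind of plumbing rather than having you write it.

```python
import sqlite3
from dataclasses import dataclass, field


@dataclass
class Order:
    id: int
    total: float


@dataclass
class Customer:
    id: int
    name: str
    orders: list = field(default_factory=list)


def load_customer(conn, customer_id):
    """Hand-rolled stand-in for an ORM: rows go in, an object graph comes out."""
    row = conn.execute(
        "SELECT id, name FROM customers WHERE id = ?", (customer_id,)
    ).fetchone()
    if row is None:
        return None
    customer = Customer(id=row[0], name=row[1])
    # A second query follows the foreign key and attaches the child objects.
    for order_id, total in conn.execute(
        "SELECT id, total FROM orders WHERE customer_id = ? ORDER BY id",
        (customer_id,),
    ):
        customer.orders.append(Order(id=order_id, total=total))
    return customer


# Hypothetical sample data so the sketch runs on its own.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(id),
        total REAL NOT NULL
    );
    INSERT INTO customers (id, name) VALUES (1, 'Ada');
    INSERT INTO orders (id, customer_id, total) VALUES (10, 1, 25.0), (11, 1, 40.0);
""")

ada = load_customer(conn, 1)
order_totals = [order.total for order in ada.orders]
```

The application code works with ada.orders and never sees the two tables or the foreign key behind them.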
When dealing with RDBMS, we tend to think of large, monolithic data stores. When working with these stores, we can control the transaction from beginning to end. However, as we start to expand into more distributed systems and scale out, it becomes much harder to efficiently maintain transactions the same way we used to with the ACID paradigm. As a refresher:
A: Atomicity – Tasks are all performed or none of them are. If any one fails, the entire transaction fails.
C: Consistency – Every transaction moves the database from one valid state to another, so no rule or constraint is left violated by a half-completed transaction.
I: Isolation – No transaction can see another transaction's unfinished changes, and each transaction runs independently of the others.
D: Durability – Once the transaction is complete, it will persist such that power loss or system breakdowns don’t affect the data.
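Atomicity and consistency are the easiest of the four to see in action. The sketch below uses Python's sqlite3 module with a hypothetical accounts table and transfer helper: a CHECK constraint forbids negative balances, and a transfer that would break that rule is rolled back in full, including the credit that had already been applied.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move funds between accounts; both updates commit or neither does."""
    try:
        # Used as a context manager, the connection commits on success and
        # rolls back if an exception escapes the block.
        with conn:
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint fired, so the credit above was rolled back too.
        return False


def balances(conn):
    return dict(conn.execute("SELECT name, balance FROM accounts"))


ok = transfer(conn, "alice", "bob", 30)
after_ok = balances(conn)
rejected = transfer(conn, "alice", "bob", 500)
after_rejected = balances(conn)
```

After the rejected transfer the balances match those from before it started, which is the all-or-nothing behaviour described above.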
When we have a large, distributed, and scalable system, it becomes very challenging to follow all the rules above. As distributed systems became more popular, the CAP theorem was formulated to describe their limitations. In a nutshell, of the three properties below, you get only two, not all three.
C: Consistency – Do all nodes in your cluster return the same, most recent data? Does every read reflect the latest successful write?
A: Availability – Is the service available upon request? Does each request get either a failure or success response?
P: Partition Tolerance – The system continues to operate even when messages between nodes are lost or delayed, or when some nodes fail.
So, according to the CAP theorem, you can have consistency and availability, availability and partition tolerance, or consistency and partition tolerance, but you can never have all three at once. With the CAP theorem in mind, a counterpart to ACID was coined for the other end of the data spectrum, named in a nod to the pH scale: BASE.
BA: Basically Available – There will be a response to every request, but that response could contain data in an inconsistent state or be a failure.
S: Soft State – Data may change over time, even without input, due to the eventually consistent patterns used to propagate data.
E: Eventual Consistency – Once the system stops receiving input, it will eventually become consistent. This does not happen immediately, which is why it is called eventually consistent.
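A toy simulation can make these three properties easier to picture. The Replica and ToyCluster classes below are entirely hypothetical and are not a description of how DynamoDB or any real database is implemented; they only show a write being acknowledged before every replica has it, a stale read in the meantime, and convergence once propagation finishes.

```python
class Replica:
    """A toy replica holding one value and the version that produced it."""

    def __init__(self):
        self.version = 0
        self.value = None

    def apply(self, version, value):
        # Last writer wins: only accept updates newer than what we hold.
        if version > self.version:
            self.version, self.value = version, value


class ToyCluster:
    def __init__(self, size=3):
        self.replicas = [Replica() for _ in range(size)]
        self.pending = []  # (replica_index, version, value) not yet delivered
        self.clock = 0

    def write(self, value):
        # Basically available: the write returns once the first replica has
        # it; updates for the other replicas are only queued.
        self.clock += 1
        self.replicas[0].apply(self.clock, value)
        for index in range(1, len(self.replicas)):
            self.pending.append((index, self.clock, value))

    def read(self, replica_index):
        # Soft state: the answer depends on which replica is asked, and when.
        return self.replicas[replica_index].value

    def propagate(self):
        # Eventual consistency: once every queued update is delivered,
        # all replicas agree.
        while self.pending:
            index, version, value = self.pending.pop(0)
            self.replicas[index].apply(version, value)


cluster = ToyCluster()
cluster.write("v1")
stale = cluster.read(2)   # replica 2 has not received the update yet
cluster.propagate()
fresh = cluster.read(2)   # after propagation it matches replica 0
```

Here stale is None and fresh is "v1": the same question asked of the same replica gives different answers over time, even though nothing new was written in between.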
These paradigms are at odds with one another, and they provide context to different kinds of systems and their needs.
In the past, storage was very expensive, and relational databases optimized for that constraint. Today, storage is relatively cheap, while compute is where most of the cost goes. SQL queries can be quite computationally expensive when they join complex data together into views our application can use. DynamoDB turns many of these concepts on their head: it recommends storing data in ways that limit computational cost, such as duplicating data and keeping it less normalized.
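Here is a small sketch of that trade-off using plain Python dictionaries. The attribute names (order_id, customer_id, customer_name, total) and both helper functions are hypothetical, and no DynamoDB API is involved; the point is only the shape of the data.

```python
# Normalized shape: each fact is stored once and joined at read time.
customers = {"c1": {"name": "Ada"}}
orders = [
    {"order_id": "o1", "customer_id": "c1", "total": 25},
    {"order_id": "o2", "customer_id": "c1", "total": 40},
]


def denormalize(order, customers):
    """Copy the customer fields the read path needs into the order item."""
    customer = customers[order["customer_id"]]
    return {**order, "customer_name": customer["name"]}


# Denormalized shape: the "join" is paid once, at write time...
items = [denormalize(order, customers) for order in orders]


def order_summary(item):
    # ...so a read needs only the single item, with no lookup in `customers`.
    return f'{item["order_id"]}: {item["customer_name"]} spent {item["total"]}'


summaries = [order_summary(item) for item in items]
```

The cost is the duplication itself: the customer's name now lives in every order item, so renaming the customer means rewriting each of them.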
DynamoDB is designed to scale to the level of the multinational, always-on systems that Amazon runs. The main focus of DynamoDB is availability and scalability. To meet these needs, DynamoDB works very differently from a relational database.
DynamoDB in a Nutshell
- You can store a lot of items (no limit)
- It’s fast (response times in the single-digit milliseconds are typical)
- Very scalable (options to scale automatically)
- Good for most apps where we know the kind of business questions we will ask ahead of time and the aggregated structures are well-known
- NoSQL, so beyond the table's primary key there are no schemas to maintain and update
Capacity and Scaling
DynamoDB’s pricing model revolves around capacity rather than instance hours. You still pay for storage beyond the first 25GB, but there are no hourly charges for running servers such as EC2 instances; you pay only for the requests made or the capacity provisioned. Let’s take a moment to break down the two capacity units.
WCU: Write Capacity Unit
One write per second for an item up to 1KB
Example: A 3KB document would take 3 WCUs
Calculation: (# of docs * ceil(avg size / 1KB)), since each item's size is rounded up to the next 1KB
RCU: Read Capacity Unit
Two eventually consistent reads per second for an item up to 4KB
One strongly consistent read per second for an item up to 4KB
Example: An 8KB document would take 2 RCUs for a strongly consistent read, or 1 RCU for an eventually consistent read
Calculation: (# of docs * ceil(avg size / 4KB) * ( ec == true ? 0.5 : 1 ))
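The two calculations can be sketched in a few lines of Python. The function names are hypothetical, 1KB is assumed to mean 1024 bytes, and rounding the eventually consistent total up to a whole unit is a simplifying assumption for capacity planning, not a statement about how billing is computed.

```python
import math


def write_capacity_units(item_size_bytes, writes_per_second=1):
    """WCUs needed: each write is rounded up to the next 1KB (assumed 1024 bytes)."""
    units_per_write = max(1, math.ceil(item_size_bytes / 1024))
    return units_per_write * writes_per_second


def read_capacity_units(item_size_bytes, reads_per_second=1, strongly_consistent=True):
    """RCUs needed: each read is rounded up to the next 4KB (assumed 4096 bytes).

    An eventually consistent read costs half of a strongly consistent one.
    """
    units_per_read = max(1, math.ceil(item_size_bytes / 4096))
    total = units_per_read * reads_per_second
    if strongly_consistent:
        return total
    # Assumption: round the halved total up to a whole unit for planning.
    return math.ceil(total / 2)


wcu_3kb = write_capacity_units(3 * 1024)
rcu_8kb_strong = read_capacity_units(8 * 1024)
rcu_8kb_eventual = read_capacity_units(8 * 1024, strongly_consistent=False)
wcu_odd_size = write_capacity_units(1536)  # 1.5KB rounds up to 2KB
```

These reproduce the examples above: 3 WCUs for the 3KB write, and 2 RCUs (strongly consistent) or 1 RCU (eventually consistent) for the 8KB read. Because sizes round up per item, the 1.5KB write costs 2 WCUs, not 1.5.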
Scaling and Replication
Automatically replicates to 3 availability zones within a region
Replication is ensured on at least 2 AZs before a write is considered complete
Automatic scaling is an option