System design fundamentals: What is the CAP theorem?

System design fundamentals: What is the CAP theorem?

As you progress through your career as a developer, you’ll be required to think more and more about software architecture and system design. It’s important to be able to design efficient systems and make tradeoffs at scale. System design is a vast field that incorporates many important concepts. A fundamental concept within system design is the CAP theorem. Understanding the CAP theorem is key to understanding how to design strong distributed systems. Today, we’ll dive deeper into the CAP theorem, explaining its meaning, its components, and more.

Let’s get started!

We’ll cover:

  • What is the CAP theorem?
  • Consistency, availability, and partition tolerance
  • CAP theorem NoSQL databases
  • CAP theorem and microservices
  • Wrapping up and next steps

What is the CAP theorem?

The CAP theorem, or Brewer’s theorem, is a fundamental theorem within the field of system design. It was first presented in 2000 by Eric Brewer, a computer science professor at U.C. Berkeley, during a talk on principles of distributed computing. In 2002, MIT professors Nancy Lynch and Seth Gilbert published a proof of Brewer’s Conjecture. The CAP theorem states that a distributed system can only provide two of three properties simultaneously: consistency, availability, and partition tolerance. The theorem formalizes the tradeoff between consistency and availability when there’s a partition.

A distributed system is a collection of computers that work together to form a single computer for end users. All of the distributed machines have one shared state and operate concurrently. With distributed systems, users must be able to communicate with any of the distributed machines without knowing it’s only one machine. The distributed system network stores its data on more than just a single node, using multiple physical or virtual machines at the same time.

CAP theorem proof

Let’s look at a simple proof of the CAP theorem. Imagine a distributed system consisting of two nodes:

cap theorem proof.png

The distributed system acts as a plain register with the value of variable X. There’s a network failure that results in a network partition between the two nodes in the system. An end-user performs a write request, and then a read request. Let’s examine a case where a different node of the system processes each request. In this case, our system has two options:

  • It can fail at one of the requests, breaking the system’s availability
  • It can execute both requests, returning a stale value from the read request and breaking the system’s consistency

The system can’t process both requests successfully while also ensuring that the read returns the latest value written by the write. This is because the results of the write operation can’t be propagated from node A to node B because of the network partition.

Consistency, availability, and partition tolerance explained

Now that we have a basic understanding of the CAP theorem, let’s break down the acronym and discuss the meanings of consistency, availability, and partition tolerance.

Consistency

In a consistent system, all nodes see the same data simultaneously. If we perform a read operation on a consistent system, it should return the value of the most recent write operation. The read should cause all nodes to return the same data. All users see the same data at the same time, regardless of the node they connect to. When data is written to a single node, it is then replicated across the other nodes in the system.

Availability

When availability is present in a distributed system, it means that the system remains operational all of the time. Every request will get a response regardless of the individual state of the nodes. This means that the system will operate even if there are multiple nodes down. Unlike a consistent system, there’s no guarantee that the response will be the most recent write operation.

Partition tolerance

When a distributed system encounters a partition, it means that there’s a break in communication between nodes. If a system is partition-tolerant, the system does not fail, regardless of whether messages are dropped or delayed between nodes within the system. To have partition tolerance, the system must replicate records across combinations of nodes and networks.

CAP theorem NoSQL databases

cap databases.png

NoSQL databases are great for distributed networks. They allow for horizontal scaling, and they can quickly scale across multiple nodes. When deciding which NoSQL database to use, it’s important to keep the CAP theorem in mind. NoSQL databases can be classified based on the two CAP features they support:

CA databases

CA databases enable consistency and availability across all nodes. Unfortunately, CA databases can’t deliver fault tolerance. In any distributed system, partitions are bound to happen, which means this type of database isn’t a very practical choice. That being said, you still can find a CA database if you need one. Some relational databases, such as PostgreSQL, allow for consistency and availability. You can deploy them to nodes using replication.

CP databases

CP databases enable consistency and partition tolerance, but not availability. When a partition occurs, the system has to turn off inconsistent nodes until the partition can be fixed. MongoDB is an example of a CP database. It’s a NoSQL database management system (DBMS) that uses documents for data storage. It’s considered schema-less, which means that it doesn’t require a defined database schema. It’s commonly used in big data and applications running in different locations. The CP system is structured so that there’s only one primary node that receives all of the write requests in a given replica set. Secondary nodes replicate the data in the primary nodes, so if the primary node fails, a secondary node can stand-in.

AP databases

AP databases enable availability and partition tolerance, but not consistency. In the event of a partition, all nodes are available, but they’re not all updated. For example, if a user tries to access data from a bad node, they won’t receive the most up-to-date version of the data. When the partition is eventually resolved, most AP databases will sync the nodes to ensure consistency across them. Apache Cassandra is an example of an AP database. It’s a NoSQL database with no primary node, meaning that all of the nodes remain available. Cassandra allows for eventual consistency because users can resync their data right after a partition is resolved.

CAP theorem and microservices

Microservices are defined as loosely coupled services that can be independently developed, deployed, and maintained. They include their own stack, database, and database model, and communicate with each other through a network. Microservices have become especially popular in hybrid cloud and multi-cloud environments, and they are also widely used in on-premises data centers. If you want to create a microservices application, you can use the CAP theorem to help you determine a database that will best fit your needs.

Wrapping up and next steps

Congrats on taking your first step with the CAP theorem and distributed systems! Distributed systems allow for lower latency, scalability, increased interconnectivity, and more. The CAP theorem is very important within distributed systems and system design as a whole. While we covered a lot today, there’s still so much more to learn about distributed systems. Some recommended topics to learn next include:

  • Distributed data stores
  • Distributed system algorithms
  • Atomicity

To get started learning these concepts and more, check out Educative’s course Distributed Systems for Practitioners. In this hands-on course, you’ll explore the design of real-world distributed systems, how to efficiently design large-scale systems, and what concepts you need to know to get the most out of your distributed system.

Happy learning!

Continue learning about system design