Distribution System Practice Series

Introduction

A distribution system is a collection of independent computers in terms of hardware but communicating and coordinating with each other through a computer network, making end users feel like a single centralized system. In the era of big data and cloud computing, distributed architecture plays a backbone role for almost all large-scale online services.

Core Characteristics

For a system to be considered truly distributed, it must face and solve the following technical characteristics:

No Shared Clock: Each node has its own physical clock and there is always a certain clock drift. Therefore, you cannot rely on absolute real time to determine the order of events happening between nodes.
No Shared Memory: Nodes cannot directly read or write to each other's RAM. Every communication activity and state synchronization must be done through message passing such as protocols like HTTP/REST, gRPC or message brokers like Kafka, RabbitMQ.
Concurrency: Multiple nodes process independent tasks or the same part of a task at the same time. Managing race conditions and state consistency in a distributed environment is much more complex than in a single-machine environment (multithreading on one machine).
Independent Failures: One dead node does not mean the entire system is dead. The system must be designed according to the Design for Failure philosophy to be ready to accept errors. A congested network part (network partition), a crashed node, or a bad hard drive... the system must still continue to serve.

CAP Theorem and The Trade-offs

When designing a distributed system, any engineer must follow the CAP theorem. The system can only choose a maximum of 2 out of the following 3 elements when a network partition (P) occurs:

Consistency (C): Every node sees the same data at the same time (reading anywhere yields the latest result).
Availability (A): Every request sent to the system receives a response (success or failure), without timeout, even if a node crashes.
Partition Tolerance (P): The system still operates normally even when the network connection between nodes is broken.

Since P is mandatory in a real network environment, you only have 2 architectural choices:

CP System: Prioritize data consistency. If a network error occurs, the system will refuse to accept requests (sacrificing availability) to avoid data discrepancies between partitions.
AP System: Prioritize availability. The system still accepts read and write requests, data between zones can be temporarily mismatched and will be synchronized later.

Outstanding Advantages

Scalability: Allows easy system expansion by adding new servers (horizontal scaling) to meet increased load without upgrading excessively expensive hardware.
High Availability: Thanks to data backup mechanisms and flexible coordination, if one or a few nodes encounter issues, the system still operates normally, minimizing service downtime.
Superior Performance: Workloads are divided and processed in parallel across multiple different machines, helping optimize speed and reduce response times for users.
Fault Tolerance: The system is designed to automatically detect errors and recover data from backups, ensuring integrity and avoiding information loss.

Real-world Architecture

In modern enterprise and cloud-native systems, distributed systems appear in every corner constituting the infrastructure:

Microservices Architecture: Instead of a monolith block, the system is divided into many independent services (order service, payment service, inventory service) running on K8s clusters (EKS/GKE) or ECS Fargate. They communicate via gRPC (synchronous) or event-driven using Kafka (asynchronous).
Distributed Databases & Storage: Sharding & replication with excessively large data being divided into multiple database nodes and copied into multiple versions for backup and fast reading.
Distributed Caching: To reduce load on the database and increase throughput (processing bandwidth), Redis clusters are set up so that cached data is distributed and keys are distributed evenly to nodes.
Load Balancing & CDN: Using a load balancer like Nginx in front to distribute millions of requests from users down to the instances behind properly. Like CDNs (CloudFront, Cloudflare) helping distribute static assets (images, video, js) to edge servers globally for users to download data with the lowest latency.
Distributed Computing: When data exceeds the processing capability of a supercomputer, frameworks like Apache Spark, Hadoop MapReduce will split the dataset, distribute it to a cluster consisting of hundreds of worker nodes to process in parallel, then aggregate the results.