Reading List of Distributed System

Posted by Nickolas on March 27, 2022

分布式问题在不同的性能(Avaliable), 一致性时间约束(Consistency)下, 有不同的理论和实践方案, 是在CAP中设计取舍, 本身有足够复杂又有趣问题. 分布式理论在1970-2000年有非常快速的发展. 比如CAP, 2PC, Paxos.

当下的互联网是构建在精妙设计的分布式系统之上的, 比如GFS, MapReduce, Spanner, Zookeeper, Cassandra, BigTable, Dynamo, Kafka. 这些系统在2000-2013年诞生. 当下这些分布式系统技术已经相对成熟, 虽然现在已经很少有机会去设计这些系统, 但熟悉其原理和取舍, 对于商业系统架构设计非常重要.

所以梳理了领域中的经典理论和系统, 以及近年对理论的新的解读. 整理出来, 抽时间慢慢阅读消化, 希望对大家有帮助.

Consensus Theory

notes 2PC. Gray described 2PC in “Notes on Database Operating Systems” (1979). 在Consensus Protocols: Two-Phase Commit(2008)这篇文章中有更straight forward的介绍. Thought works这篇文章two-phase-commit(2022)通过一个分布式kv store作为例子来解释2pc理论.

notes 3PC. NonBlocking Commit Protocols” (1981)

notes SAGA Pattern - Hector Garcia Molina & Keneth Salem, Princeton University, 1987. Denis在Saga Pattern Application Transactions Using Microservices这篇文章简明扼要的介绍了Saga Pattern以及2种实现方式.

Paxos Made Simple, a more terse readable Paxos paper by Lamport himself. Shorter and more easier compared to the original.

Raft Consensus Algorithm An alternative to Paxos for distributed consensus, that is much simpler to understand. Do checkout an interesting visualization of raft

Replicated Log

Distributed Storage System

notes Dynamo: Amazon’s Highly Available Key Value Store Paraphrasing @fogus from their blog, it is very rare for a paper describing an active production system to influence the state of active research in any industry; this is one of those seminal distributed systems paper that solves the problem of a highly available and fault tolerant database in an elegant way, later paving the way for systems like Cassandra, and many other AP systems using a consistent hashing.

Cassandra - A Decentralized Structured Storage System

BigTable

Scaling Memcache at Facebook

Spanner: Google’s Globally Distributed Database - Google 2012. Google’s Spanner - Kleppmann 2020

Calvin: fast distributed transactions for partitioned database systems - Thomson et al., SIGMOD’12

No compromises: distributed transactions with consistency, availability, and performance FaRM - main memory distributed computing platform

Concurrency/Network Theory

The C10K problem

A High-Speed Load-Balancer Design with Guaranteed Per-Connection-Consistency - Barbette et. al., NSDI 2020

TCP Thoughput Fast Incast - Amar et. al. 2008

Consistency/Lock System

The Chubby lock service for loosely-coupled distributed systems

Other Distributed System

Kafka: a Distributed Messaging System for Log Processing

Bigtable: A Distributed Storage System for Structured Data

The Google File System

Spark

Mesos

Other List & Resource

Awesone Distributed Systems

Readings in distributed systems

MIT 6.824 Distributed Systems

Google Research

Read List From Paper Trail