From the beginning of this year, I started to take lecture courses of undergrad distributed systems course at UC Santa Cruz (CSE 138) by Lindsey Kuper. It consists of 23 lectures (you can see the schedule of topics from here) and recently I’ve finished all of them. I’m not a student at UCSC but due to the COVID-19 situation, all these lectures were delivered online and are available on YouTube. So I was able to take them (Read about the background of online lecture and setup of it on Twitch plays CSE138).
Now, nearly every application is a distributed system running on multiple machines or containers. If you structure it properly, it gives us more reliability and leads us to more scalable organizational models. But, at the same time, it’s very complicated to design and build. Because of this, and as a platform engineer and SRE, “distributed system” is always one of the important topics to learn for me.
There are good books to learn about distributed system like “Designing Data-Intensive Applications” by Martin Kleppmann or “Designing Distributed Systems” by Brendan Burns. While these are great, I wanted to explore more basic algorithms and the core theory behind. For this, one way is reading fundamental papers but I didn’t know where to start… After searching on the internet, I found Distributed Systems Course made by Chris Colohan and, on this site, he recommends CSE 138. The topics are what I wanted to learn so decided to take them.
In the 23 lectures, you can lean the following:
- Time and asynchrony: Causality and happens-before, Partial orders, Total orders, Lamport clocks, Vector clocks, FIFO delivery, Causal delivery, Totally-ordered delivery, Consistent snapshots, and Chandy-Lamport snapshot algorithm
- Fault tolerance and replication: Fault classification and fault models, Two generals problem, Reliable delivery, At-least-once/at-most-once/exactly-once delivery, Reliable broadcast, Primary-backup replication, and Chain replication
- Consistency and consensus: FLP result, Paxos, Multi-Paxos, Eventual consistency, Strong convergence, Application-specific conflict resolution, Quorum consistency, Consistent hashing, Upper bounds, Least upper bounds, Join-semilattices, and CRDT
- Parallelism: Online systems vs. offline systems, Raw data vs. derived data, and MapReduce
The main papers you read are followings (each paper is described deeply in the lecture):
- Dynamo: Amazon’s Highly Available Key-value Store. Giuseppe DeCandia, Deniz Hastorun et al. 2007.
- MapReduce: Simplified Data Processing on Large Clusters. Jeffrey Dean and Sanjay Ghemawat.2004.
The lecture is very easy to understand for me. The drawing and the examples the professor uses helps to grab the overview of the complex algorithms (e.g., Paxos). The students asked questions via Slack and she repeated it on the video and answered them. It was also very helpful (even I didn’t in the Slack). I was very impressed that she opens YouTube comments and answers questions from anyone on the internet as well.
For the next, I’m thinking to take MIT 6.824: Distributed Systems. The topic is similar to CSE138 but some is different and it introduces more recent topics. So it’s very interesting, too (seems more papers I need to read before each lecture but I will try!).
The followings are other papers mentioned in the lecture (I didn’t read all but I’m thinking to read later to lean more):
- Detecting Causal Relationships in Distributed Computations: In Search of the Holy Grail. Reinhard Schwarz and Friedemann Mattern. 1994 (Lecture 4, Total orders and Lamport clocks)
- Distributed snapshots: determining global states of distributed systems. K. Mani Chandy and Leslie Lamport. 1985 (Lecture 8, Chandy-Lamport Snapshot Algorithm)
- Atomic Broadcast: From Simple Message Diffusion to Byzantine Agreement. Flaviu Cristian and Houtan Aghili et al. 1994 (Lecture 10, Fault classification and fault models)
- Chain replication for supporting high throughput and availability. Robbert van Renesse and Fred B. Schneider. 2004. (Lecture 12, Chain Replication)
- Impossibility of distributed consensus with one faulty process. Michael J. Fischer, Nancy A. Lynch, and Michael S. Paterson. 1985 (Lecture 14, FLP Result)
- Paxos Made Simple. Leslie Lamport. 2001 (Lecture 14, Paxos)
- Vive La Difference: ´Paxos vs. Viewstamped Replication vs. Zab. Robbert van Renesse, Nicolas Schiper, Fred B. Schneider. 2015 (Lecture 16, Other Consensus Protocols)
- A Conflict-Free Replicated JSON Datatype. Martin Kleppmann and Alastair R. Beresford. 2016 (Lecture 22, CRDTs)
- Practical Byzantine Fault Tolerance. Miguel Castro and Barbara Liskov. 1999 (Lecture 23)
- Consensus on Transaction Commit. Jim Gray and Leslie Lamport. 2004 (Lecture 23)
- Time, Clocks, and the Ordering of Events in a Distributed System. Leslie Lamport. 1978 (Lecture 23, Vector Clock)
- RADOS: A Scalable, Reliable Storage Service for Petabyte-scale Storage Clusters. Sage A. Weil and Andrew W. Leung et al. 2007. (Lecture 23, Replication)
- In Search of an Understandable Consensus Algorithm. Diego Ongaro and John Ousterhout. 2014. (Lecture 23, Consensus)
- Viewstamped Replication: A New Primary Copy Method to Support Highly-Available Distributed Systems. Brian M. Oki and Barbara H. Liskov. 1988 (Lecture 23, Consensus)
- Paxos vs Raft: Have we reached consensus on distributed consensus?. Heidi Howard and Richard Mortier. 2020 (Lecture 23, Consensus)