CSE 138: Distributed Systems

by Taichi Nakashima,

From the beginning of this year, I started to take lecture courses of undergrad distributed systems course at UC Santa Cruz (CSE 138) by Lindsey Kuper. It consists of 23 lectures (you can see the schedule of topics from here) and recently I’ve finished all of them. I’m not a student at UCSC but due to the COVID-19 situation, all these lectures were delivered online and are available on YouTube. So I was able to take them (Read about the background of online lecture and setup of it on Twitch plays CSE138).

Now, nearly every application is a distributed system running on multiple machines or containers. If you structure it properly, it gives us more reliability and leads us to more scalable organizational models. But, at the same time, it’s very complicated to design and build. Because of this, and as a platform engineer and SRE, “distributed system” is always one of the important topics to learn for me.

There are good books to learn about distributed system like “Designing Data-Intensive Applications” by Martin Kleppmann or “Designing Distributed Systems” by Brendan Burns. While these are great, I wanted to explore more basic algorithms and the core theory behind. For this, one way is reading fundamental papers but I didn’t know where to start… After searching on the internet, I found Distributed Systems Course made by Chris Colohan and, on this site, he recommends CSE 138. The topics are what I wanted to learn so decided to take them.

In the 23 lectures, you can lean the following:

  • Time and asynchrony: Causality and happens-before, Partial orders, Total orders, Lamport clocks, Vector clocks, FIFO delivery, Causal delivery, Totally-ordered delivery, Consistent snapshots, and Chandy-Lamport snapshot algorithm
  • Fault tolerance and replication: Fault classification and fault models, Two generals problem, Reliable delivery, At-least-once/at-most-once/exactly-once delivery, Reliable broadcast, Primary-backup replication, and Chain replication
  • Consistency and consensus: FLP result, Paxos, Multi-Paxos, Eventual consistency, Strong convergence, Application-specific conflict resolution, Quorum consistency, Consistent hashing, Upper bounds, Least upper bounds, Join-semilattices, and CRDT
  • Parallelism: Online systems vs. offline systems, Raw data vs. derived data, and MapReduce

The main papers you read are followings (each paper is described deeply in the lecture):

The lecture is very easy to understand for me. The drawing and the examples the professor uses helps to grab the overview of the complex algorithms (e.g., Paxos). The students asked questions via Slack and she repeated it on the video and answered them. It was also very helpful (even I didn’t in the Slack). I was very impressed that she opens YouTube comments and answers questions from anyone on the internet as well.

Chris Whealy opens his lecture notes on GitHub. It is very well written and also helps you to understand the course.

For the next, I’m thinking to take MIT 6.824: Distributed Systems. The topic is similar to CSE138 but some is different and it introduces more recent topics. So it’s very interesting, too (seems more papers I need to read before each lecture but I will try!).

The followings are other papers mentioned in the lecture (I didn’t read all but I’m thinking to read later to lean more):