Saral Shiksha Yojna
Courses/Distributed Systems

Distributed Systems

CS3.401
Prof. Kishore KothapalliMonsoon 2025-264 credits

Cheatsheet

Ultra-condensed. Revise a chapter in minutes.

Unit 1 — Introduction, Challenges & CAP Theorem

Foundations — Definition, Motivation, Challenges, CAP
One-liners
  • Tanenbaum: 'many computers, appear as one'.
  • Five features: many + concurrent + independent failure + no global clock + no shared memory.
  • CAP: pick 2; P is unavoidable → trade C vs A.
  • CP examples: Spanner, HBase. AP: Dynamo, Cassandra, DNS.
  • Two-Generals: deterministic consensus over unreliable channels is impossible.
Formulas
  • CAP triangle: {C, A, P}
  • Real systems: pick CP or AP (P forced)
Definitions
  • Distributed system: appears as one computer to the user.
  • Consistency: all replicas show the same recent value.
  • Availability: every non-failing node responds.
  • Partition tolerance: works through network partitions.
Algorithms
  • Two-Generals: send msg → ack → ack-of-ack → ... never terminates with certainty.
Comparisons
  • Parallel vs Distributed: Parallel: shared memory + single clock (tightly coupled). Distributed: no shared memory + no global clock (loosely coupled, message passing only).
  • CP system vs AP system: CP blocks during partition to keep replicas in sync (Spanner). AP keeps serving stale data and reconciles later (Dynamo).
Keywords
TanenbaumCAPconsistencyavailabilitypartition toleranceTwo-GeneralsFLPhorizontal scalingBrewerPACELC

Unit 2 — Models, Events & Logical Time

Logical Clocks (Scalar / Vector / Matrix) + Physical Time Sync
One-liners
  • Lamport : same-process / send-recv / transitive. Concurrent = neither precedes.
  • Scalar R2: .
  • Vector R2: componentwise max, then ++.
  • Strong consistency: vector ✓, matrix ✓, scalar ✗.
  • Singhal-Kshemkalyani: send changed entries; needs FIFO.
  • Matrix clock GC: → discard.
  • Cristian's: ; Berkeley: average; NTP: stratum tree.
Formulas
  • Scalar R1/R2:
  • Vector R1/R2: ++; componentwise max then ++
  • Comparison:
  • Matrix GC:
Definitions
  • Happened-before: same-process / send-recv / transitive.
  • Strong consistency: (both directions).
  • Vector clock entry: causal-event count for that process.
  • Matrix entry : i's knowledge of j's knowledge of k's clock.
Algorithms
  • Scalar Lamport: R1 then R2 + tie-break (t, i).
  • Vector clock: R1 ++ own; R2 max-then-++.
  • Singhal-Kshemkalyani: send (k, V[k]) for k where LU[k] > LS[j].
  • Cristian's: RTT measure → .
  • Berkeley: poll slaves → average → send deltas.
Comparisons
  • Scalar clock vs Vector clock: Scalar: 1 int, consistent only (forward). Vector: n ints, strongly consistent. Vector detects concurrent events; scalar can't.
  • Vector clock vs Matrix clock: Vector: causal precedence. Matrix: causal + second-order knowledge → enables obsolete-message GC.
  • Cristian's vs Berkeley: Cristian's: single server, UTC, . Berkeley: master polls all, computes average, no UTC, internal LAN agreement.
  • NTP vs Cristian's: NTP: hierarchical strata; Internet-wide; uses 4 timestamps for offset+delay. Cristian's: 1-hop polling; single server; simple but vulnerable.
Keywords
happened-beforeLamport clockvector clockmatrix clockstrong consistencySinghal-KshemkalyaniFIFOCristian'sBerkeleyNTPstratumUTCRTT

Unit 3 — Global Snapshots

Chandy-Lamport, Lai-Yang, Acharya-Badrinath + Consistent Cuts
One-liners
  • Consistent cut: no message arrow future → past.
  • C1: send recorded ⇒ msg in channel XOR received.
  • C2: send NOT recorded ⇒ msg NOT in channel AND NOT received.
  • Chandy-Lamport: FIFO, markers, msgs.
  • Lai-Yang: non-FIFO, white/red, heavy history.
  • Acharya-Badrinath: causal, msgs, SENT/RECD.
Formulas
  • C1: send recorded ⇒ in-channel ⊕ received
  • C2: send NOT recorded ⇒ NOT in channel AND NOT received
  • Acharya-Badrinath channel:
Definitions
  • Snapshot = global state recorded without stopping the system.
  • Marker (CL) = control msg separating pre/post on FIFO channel.
  • White/red (Lai-Yang) = colour-based pre/post on non-FIFO channels.
  • SENT/RECD (Acharya-B) = per-process message counters.
Algorithms
  • Chandy-Lamport: initiator records → marker on outgoing; on first marker: record + propagate; on later marker: stop recording.
  • Lai-Yang: white turns red on snapshot; red msg forces receiver's snapshot; channel state from history.
  • Acharya-Badrinath: initiator broadcasts token; each replies with (LS, SENT, RECD); channel state computed.
Comparisons
  • Chandy-Lamport vs Lai-Yang: CL: FIFO required, marker-based, light storage. Lai-Yang: non-FIFO OK, colour-based, heavy history.
  • Chandy-Lamport vs Acharya-Badrinath: CL: FIFO, msgs. Acharya-B: causal, msgs — much cheaper but stronger channel assumption.
Keywords
snapshotconsistent cutC1C2Chandy-LamportmarkerFIFOLai-YangAcharya-Badrinathcausal channel

Unit 4 — Causal Order Message Delivery

BSS Algorithm + Causal vs FIFO vs Total Order
One-liners
  • Causal: delivered before .
  • BSS (a): .
  • BSS (b): .
  • FIFO ⊊ Causal ⊊ Total order.
Formulas
  • Cond (a):
  • Cond (b):
Definitions
  • Causal order delivery: cause-then-effect at every receiver.
  • BSS: vector-clock-based causal delivery; buffer if conditions fail.
  • Total order: all msgs in same global sequence at every receiver.
Algorithms
  • BSS receive: check (a) AND (b); deliver or buffer; re-check buffered msgs after each delivery.
Comparisons
  • FIFO vs Causal: FIFO orders only same-sender pairs. Causal orders all pairs related by happened-before.
  • Causal vs Total: Causal: orders causally-related pairs. Total: orders ALL pairs (including concurrent). Total requires consensus.
Keywords
causal orderBSSBirman-Schiper-Stephensonvector clockFIFOtotal orderstate-machine replication

Unit 5 — Distributed Mutual Exclusion

Lamport, Ricart-Agrawala, Maekawa, Suzuki-Kasami, Raymond — Complete Comparison
One-liners
  • Lamport: 3(N-1) msgs, FIFO, L1+L2 entry.
  • Ricart-Agrawala: 2(N-1) msgs, no FIFO, deferred REPLY.
  • Maekawa: . V1 deadlocks; V2 adds FAILED/INQUIRE/YIELD; SD = 2T.
  • Suzuki-Kasami: 0 or N msgs; freshness .
  • Raymond: O(log N), Holder pointer, root bottleneck.
  • Lamport needs FIFO; R-A doesn't.
Formulas
  • Lamport:
  • R-A:
  • Maekawa:
  • S-K: or
  • Raymond:
Definitions
  • Safety = ≤1 in CS; Liveness = eventual entry; Fairness = timestamp order.
  • SD = time from one exit to next entry.
  • Throughput = .
  • Quorum (Maekawa) = set with pairwise intersection.
  • Token freshness (S-K) = .
Algorithms
  • Lamport entry: L1 (later msg from every site) ∧ L2 (own at top).
  • R-A defer: in CS, OR requesting with smaller ts.
  • Maekawa V2: FAILED (already higher replied), INQUIRE (higher arrived; ask), YIELD (relinquish).
  • S-K: broadcast REQ; token to iff fresh.
  • Raymond: Holder up, token down, requests aggregate.
Comparisons
  • Lamport vs Ricart-Agrawala: Lamport: 3(N-1) msgs, FIFO. R-A: 2(N-1) msgs, no FIFO, deferred REPLY replaces RELEASE.
  • Maekawa V1 vs Maekawa V2: V1: 3√N msgs, deadlocks. V2: up to 5√N, three new messages (FAILED, INQUIRE, YIELD) break deadlock.
  • Suzuki-Kasami vs Raymond: S-K: broadcast REQ, 0 or N msgs, scales worse. Raymond: tree-routed REQ, O(log N) msgs, root bottleneck.
  • Non-token vs Token-based: Non-token: each request asks permission; no token to lose; more msgs per CS. Token: single privilege passes around; vulnerable to token loss.
Keywords
Lamport DMERicart-AgrawalaRoucairol-CarvalhoMaekawaquorumFAILEDINQUIREYIELDSuzuki-KasamiRaymondHolder pointersafetylivenessfairnesssynchronisation delaythroughput

Unit 6 — Distributed Deadlock Detection

Resource Models, WFG, CMH Probe, Mitchell-Merritt, Chandy Diffusion
One-liners
  • Cycle in WFG ⇒ deadlock (Single, AND).
  • Knot (SCC, no outgoing) ⇒ deadlock (OR).
  • Detection is the dominant strategy in DS.
  • CMH probe ; on return ⇒ deadlock.
  • Mitchell-Merritt probes go OPPOSITE WFG edges.
  • Chandy diffusion handles OR via query/reply waves.
Formulas
  • Single/AND: cycle ⇒ deadlock
  • OR: knot ⇒ deadlock
  • CMH msgs:
Definitions
  • WFG: iff blocks on .
  • Knot: SCC with no outgoing edges to non-knot.
  • Phantom: detected cycle that never existed simultaneously.
  • CMH probe = .
Algorithms
  • CMH: blocked → probe → forwarded along WFG → if on return, deadlock.
  • Mitchell-Merritt: blocked → new label → transmit backward → own label returns ⇒ deadlock.
  • Chandy diffusion: query forward + reply wave; all-branch reply ⇒ deadlock.
  • Ho-Ramamoorthy: collect twice (2-phase) or use cross-table consistency (1-phase).
Comparisons
  • Single/AND model vs OR model: Single/AND: cycle in WFG ⇒ deadlock. OR: only a KNOT (SCC with no outgoing edges) ⇒ deadlock.
  • CMH probe direction vs Mitchell-Merritt direction: CMH probes follow WFG edges. Mitchell-Merritt probes go OPPOSITE — from waiter back toward holder.
  • Prevention/Avoidance vs Detection: Prevention/Avoidance need global state — expensive in DS. Detection runs on-demand — dominant strategy.
Keywords
WFGcycleknotSingle resourceAND modelOR modelAND-ORphantomCMH probeMitchell-MerrittChandy diffusionHo-Ramamoorthyedge-chasingdiffusion computation

Unit 7 — Consensus & Byzantine Agreement

Crash Consensus + Byzantine Agreement (OM(m), Phase King) + FLP
One-liners
  • Crash: rounds, msgs.
  • Byzantine: , rounds.
  • FLP: no deterministic consensus in async even with 1 crash.
  • OM(m): , exponential msgs, recursive.
  • Phase King: , polynomial msgs, rounds.
  • N=3, f=1 Byzantine impossible (indistinguishability).
Formulas
  • Crash msgs:
  • Byz bounds:
  • Phase King:
  • PK decision: mult > ⇒ keep majority; else king
Definitions
  • Crash failure: silent halt.
  • Byzantine failure: arbitrary including lies & conflicting msgs.
  • Consensus: each has input, agree on one.
  • Interactive consistency: agree on a vector.
  • BA: source broadcasts; others agree on source value.
  • FLP: no async deterministic consensus.
Algorithms
  • Crash: rounds, each min over received.
  • OM(m): recursive, each lieutenant runs OM(m-1), majority at end.
  • Phase King: phases × 2 rounds; multiplicity threshold > .
Comparisons
  • Crash failure vs Byzantine failure: Crash: silent halt; n > f. Byzantine: arbitrary lies; n ≥ 3f+1 + f+1 rounds + async impossible.
  • OM(m) vs Phase King: OM: , rounds, exponential msgs, complex recursion. PK: , rounds, polynomial msgs, simple two-round phases.
  • Consensus vs Byzantine Agreement: Consensus: each has own input. BA: source broadcasts. BA solves consensus + interactive consistency.
Keywords
consensusByzantinecrash failureFLPOM(m)Lamport-Shostak-PeasePhase Kinginteractive consistencyagreementvalidityterminationindistinguishability

Unit 8 — Distributed Transactions, 2PC & 3PC

ACID + 2PC + 3PC + Blocking & In-Doubt States
One-liners
  • ACID = Atomicity + Consistency + Isolation + Durability.
  • 2PC = PREPARE + DECIDE; coord drives.
  • 2PC blocks when: all <ready> AND coord crashed.
  • <commit> redo / <abort> undo / <ready> ask / nothing abort.
  • 3PC = PREPARE + PRE-COMMIT (K acks) + COMMIT.
  • 3PC unused in practice: no-partition assumption unrealistic.
Formulas
  • Point of no return: coord's <commit T> on stable log
  • Blocking: all <ready> + coord crashed
  • 3PC recovery: any <pre-commit> ⇒ commit, else abort
Definitions
  • Fail-stop: silent halt; no lies.
  • <ready T>: participant voted yes (forced stable).
  • In-doubt: <ready T> only; must hold all locks.
  • <pre-commit T>: replicated decision intent in 3PC.
Algorithms
  • 2PC: coord <prepare> → PREPARE → READY/NO → coord <commit/abort> → COMMIT/ABORT → ack.
  • 3PC: 2PC + PRE-COMMIT phase with K acks.
Comparisons
  • 2PC vs 3PC: 2PC blocks when coord crashes after all <ready>. 3PC's PRE-COMMIT replicates decision at K+1 sites → non-blocking under ≤K failures; assumes no partition.
  • Blocking scenario (2PC) vs Network partition (2PC): Blocking: coord crashed + all <ready>. Partition: cut-off sites act as if coord crashed → may block, but no incorrect outcome.
Keywords
2PC3PCACIDfail-stopPREPAREREADYCOMMITABORTPRE-COMMITin-doubtblockingcoordinatorparticipantstable log

Unit 9 — Raft Consensus

Leader Election + Log Replication + Safety
One-liners
  • Raft = Paxos rewritten for understandability.
  • Fail-stop only (not Byzantine). Majority must be up.
  • Three sub-problems: Election + Replication + Safety.
  • Three states: Follower / Candidate / Leader.
  • Higher term observed → step down.
  • Random election timeouts prevent split votes.
  • Election restriction: candidate's log ≥ voter's log.
  • No direct commit of prior-term entries.
Formulas
  • Majority:
  • Election: random timeout → Candidate → majority vote → Leader
  • Election restriction: last-term wins; tie → longer log
Definitions
  • Term: monotonic logical period of one election.
  • AppendEntries: leader → follower (entries + prev index/term).
  • Commit: replicated to majority in current term.
  • Election restriction: up-to-date log requirement for voting.
  • Joint consensus: transitional membership-change config.
Algorithms
  • Election: timeout → ++term → vote self → RequestVote → majority → Leader.
  • Log replication: leader appends → AppendEntries → majority ack → commit → apply.
  • AppendEntries consistency: prevIndex+term match → accept; mismatch → back up.
  • Safety: prior-term entry committed only via new current-term entry on top.
Comparisons
  • Raft vs 2PC: Raft: replicated state machine, leader-based, tolerates leader crash via re-election. 2PC: atomic commit across sites; blocks on coordinator crash.
  • Raft vs Paxos: Same correctness, same fault tolerance. Raft simpler decomposition, more understandable; Paxos older and more theoretical.
Keywords
replicated state machinefail-stopleader electiontermAppendEntriesmajority commitelection restrictionFigure 8joint consensusetcdConsulCockroachDB

Unit 10 — Distributed Minimum Spanning Tree (GHS Algorithm)

GHS Rules A/B/C + Fragment Levels + Test/Accept/Reject
One-liners
  • Cut property ⇒ MWOE ∈ MST.
  • Rule A: → absorb.
  • Rule B: + mutual MWOE → merge level+1.
  • Rule C: else WAIT.
  • Max level ≤ .
  • Complexity msgs.
  • Test reply: reject (same F) / accept (diff F, ) / defer (diff F, ).
Formulas
  • Cut property: lightest crossing edge ∈ MST
  • Rule B name:
  • Max level:
  • Complexity:
Definitions
  • Fragment: subtree of some MST.
  • MWOE: lightest edge with one endpoint outside.
  • Core edge: most recent Rule-B merger edge.
  • Test reply: reject / accept / defer.
Algorithms
  • GHS round: each fragment finds MWOE; apply A/B/C rules; repeat until single fragment.
  • Test/Accept/Reject: probe basic edges in weight order; classify based on level/fragment.
Comparisons
  • Prim vs GHS: Prim: single growing fragment, centralised. GHS: many parallel fragments (distributed Kruskal), each grows via MWOE.
  • Rule A vs Rule B: A: → smaller absorbs into bigger, level unchanged. B: + mutual MWOE → merge into new fragment at level .
Keywords
GHSGallager-Humblet-SpiraMSTMWOEfragmentlevelcore edgeRule ARule BRule Ccut propertycycle propertytestacceptrejectdeferinitiateconnectreportchangeroot

Unit 11 — Google File System (GFS)

GFS — Architecture, Reads, Writes, Consistency, Recovery
One-liners
  • Three components: single master + chunkservers + clients.
  • 64 MB chunks, 3× replicas across racks.
  • Master metadata in MEMORY; locations NOT logged (heartbeat-rebuilt).
  • Lease (60 s) grants primary serialisation; data pipelined + control star.
  • Four states: defined / consistent / undefined / inconsistent.
  • Atomic record append at-least-once → defined interspersed with inconsistent.
  • Snapshots: copy-on-write; chunk copy on first subsequent write.
  • Stale replica detection: chunk version number via heartbeat.
  • Deleted files retained 3 days hidden before GC.
Formulas
  • Chunk size: 64 MB
  • Replication: 3× (configurable per file)
  • Lease: 60 s, renewable
  • Hidden retention: 3 days
  • Checksum: per 64 KB block
Definitions
  • Chunk: 64 MB unit, 64-bit handle, 3× replicas.
  • Lease: master → primary for 60 s; primary serialises writes.
  • Atomic record append: GFS picks offset; at-least-once.
  • Defined: consistent + reflects mutation entirely.
  • Pipelined data flow: linear chain along closest replica path.
Algorithms
  • Read: client → master (file, idx) → handle+replicas → client cache → nearest replica → data.
  • Write: client → master (lease) → push data pipelined → client → primary (write req) → primary serialises + forwards → secondaries apply + ACK → primary → client SUCCESS.
  • Snapshot: master duplicates metadata + ++refcount; chunk copy on first subsequent write.
  • Stale GC: chunk version mismatch via heartbeat → master commands chunkserver to delete.
Comparisons
  • Defined region vs Consistent-but-undefined region: Defined: same bytes everywhere AND reflects a single writer's mutation. Undefined: same bytes everywhere BUT mingled fragments from concurrent writers.
  • Data flow vs Control flow: Data flow: pipelined linearly along closest replica chain (bandwidth optimal). Control flow: star, client → primary → secondaries (small msgs).
  • GFS vs HDFS: GFS: 64 MB chunks, atomic record append. HDFS: 128 MB chunks, write-once-read-many (early). HDFS NameNode = GFS master; DataNode = chunkserver.
Keywords
GFSGoogle File Systemchunkchunkservermasterleaseatomic record appendpipelined data flowconsistency statedefinedconsistentundefinedinconsistentcopy-on-write snapshotstale replicachunk versionheartbeatgarbage collectionHDFS