Distributed Algorithms

Lecturers:	Prof. Jeff Magee & Prof. Jeff Kramer
Office:	Huxley 572A & 568
Contact by e-mail:	j.magee@imperial.ac.uk & j.kramer@imperial.ac.uk

The Distributed Algorithms course is concerned with the algorithmic aspects of distributed computing. In particular, we focus on Distributed Systems which are prone to hardware and/or software failures. Two classes of subjects are discussed:

What distributed algorithms cannot do in certain system models - impossibility results.
How do algorithms solve certain classes of computational problems in certain system models (even in the presence of failures).

Reading List

“Distributed Algorithms”, Nancy Lynch, Morgan Kaufmann, 1996.

“Distributed Computing”, Hagit Attiya and Jennifer Welch, McGraw-Hill, 1998.

“Distributed Systems”, S. Mullender (Ed.), 2nd ed., Addison-Wesley, 1993.

“Concurrency Control and Recovery in Database Systems”, Philip Bernstein, Vassos Hadzilacos, Nathan Goodman, Addison-Wesley, 1987.

Software

The Labelled Transition System Analyser (LTSA) tool is used throughout the course for modelling and demonstrating the execution of various algorithms. The ZIP file contains the LTSA and the demonstrations mentioned below.

Course Outline

	Models of distributed computing (Introduction) Lecture Notes: 4-up printable in PDF References: J. Turek and D. Shasha, “The Many Faces of Consensus in Distributed Systems”, IEEE Computer, June 1992. R. Schlichting and F. Schneider, "Fail-Stop processors: an approach to designing fault-tolerant computing", ACM Transactions on Computer Systems, 1(3), August 1983. F. Cristian, "Understanding Fault-Tolerant Distributed Systems", Communications of the ACM, 34(2), February 1991.
	Synchronous message-passing distributed systems Leader Election and Breadth-First Search algorithms Lecture Notes: HTML (requires a browser that can handle frames); 4-up printable in PDF and Postscript format. Demos: A zip file containing the LTSA and models for the following algorithms: LCR, LCR + halt, FloodMax, FloodMaxOpt, SyncBFS, SyncBFS + termination. Tutorial Questions: Tutorial 1 : Questions (PDF, Postscript); solutions (PDF, Postscript). Tutorial 2 : Questions (PDF, Postscript); solutions (PDF, Postscript). References: “Distributed Algorithms”, Nancy Lynch, Morgan Kaufmann, 1996. Chapters 2, 3 and 4. The Atomic Commitment Problem Lecture Notes: HTML with animations which also includes a PowerPoint show (works well only with MS Internet Explorer 5 - blaim Microsoft for that); 4-up handouts in PDF and Postscript format. The handouts do not include the animated slides of the PowerPoint presentation. Fluent Linear Temporal Logic: 4-up notes handout in PDF and "Analyzing Synchronous Distributed Algorithms" PDF. Additional notes on modelling link failure- PDF. Demos: A zip file containing the LTSA and models for the following algorithms: 2PC, 3PC, 2PC -linkfail, 3PC - linkfail Errata: In slide 14 of the handouts, in the very last line of the slide, the process must block and not abort! See also the corrected handouts above. The material in this section is based on earlier notes by Vassos Hadzilacos. Additional notes: Impossibility of Atomic Commitment in the presence of link failures -- proof for 2 processes. (PDF, Postscript) Tutorial 3 : Questions (PDF, Postscript); solutions (PDF, Postscript). References: J.N. Gray, "Notes on Database Operating Systems: An Advanced Course", Lecture Notes in Computer Science 60:393-481, Springer-Verlag, Berlin, 1978. D. Skeen, "Nonblocking Commit Protocols", In Proc. ACM SIGMOD Conf. on Management of Data, pp. 133-147, Orlando FL, June, 1982. D. Cheung and T. Kameda, "Site-Optimal Termination Protocols for a Distibuted Database under Networking Partitioning", In Proc. 4th ACM SIGACT-SIGOPS Symp. on Principles of Distributed Computing, pp. 111-121, Minaki, Ontario, August, 1985. The Byzantine Generals Problem Lecture Notes: 4-up handouts in PDF. Tutorial 4: Questions (PDF, Postscript); solutions (PDF, Postscript). References: L. Lamport, R. Shostak, M. Pease, "The Byzantine Generals Problem", ACM TOPLAS 4(3), pp. 382-401, July 1982. D. Dolev, H. R. Strong, "Authenticated Algorithms for Byzantine Agreement", SIAM Journal of Computer, 12(4), pp. 656-666, 1983.
	Asynchronous message-passing distributed systems Impossibility of Consensus Lecture Notes: HTML with animations which also includes a PowerPoint show; 4-up handouts in PDF and Postscript format. Additional notes: Top-level induction of the FLP impossibility proof (PDF, Postscript): Inductive construction of a schedule S, such that S is infinite, admissible, applicable to an initial bivalent configuration C", and for every prefix S' of S, S'(C") is also bivalent. The material in this section is based on earlier notes by Vassos Hadzilacos. Tutorial 5 :* Question (PDF, Postscript). Sample solutions: PDF, Postscript. References: M. Fischer, N. Lynch, M. Peterson, "Impossibility of Distributed Consensus with one Faulty Process", Journal of the ACM, Vol. 32, No. 2, April 1985. J. Turek and D. Shasha, “The Many Faces of Consensus in Distributed Systems”, IEEE Computer, June 1992. Logical Time Lecture Notes: HTML with animations which also includes a PowerPoint show; 4-up handouts in PDF and Postscript format. Tutorial 6 (21/02/00): Questions (PDF, Postscript). References: L. Lamport, "Time, Clocks and Ordering of Events in a Distributed System", Communications of the ACM, Vol. 21, No. 7, July 1978, pp. 558-565. G. Ricart, A.K. Agrawala, "An Optimal Algorithm for Mutual Exclusion in computer Networks", Communications of the ACM, Vol. 24, No. 1, January 1981. F. Mattern, "Virtual Time and Global States of Distributed Systems", in Proceedings of the International Conference on Parallel and Distributed Algorithms, pp. 215-216, 1989. C. Fidge, "Tmestamps in message-passing systems that preserve the partial ordering", in Proceedings of the 11th Australian Computer Science Conference, February 1988. Fault-Tolerant Broadcasts Lecture Notes: HTML with animations which also includes a PowerPoint show; 4-up handouts in PDF and Postscript format. Tutorial 7 (06/03/00): Questions (PDF, Postscript); solutions (PDF, Postscript). References: Vassos Hadzilacos and Sam Toueg, "A modular approach to fault tolerant broadcasts and related problems", Technical Report # 94-1425, Dept of Computer Science, University of Toronto and Dept of Computer Science, Cornell University, May 1994. Available as http://www.cs.utoronto.ca/~vassos/research/publications/HT94/paper.ps.gz Ken Birman, "The process group approach to reliable distributed computing", Communications of the ACM, 36(12), pp.37-53, 1993. K. Birman, A. Schiper and P. Stephenson, "Lightweight causal and atomic group multicast", ACM Transactions on Computer Systems, 9(3), August 1991.
	Partially synchronous message-passing distributed systems Failure Detectors Lecture Notes: HTML with animations which also includes a PowerPoint show; 4-up handouts in Postscript format. Additional notes: Informal correctness arguments (PDF, Postscript) for the algorithm that solves Consensus using the Perfect Failure Detector (P). The algorithm is presented in the Lecture Notes, page 14. Tutorial 8 (13/03/00): Questions (PDF, Postscript); solutions (PDF, Postscript). References: T.D. Chandra and S. Toueg “Unreliable failure detectors for reliable distributed systems”, Journal of the ACM, 43(2), March 1996. T.D. Chandra, V. Hadzilacos and S. Toueg “The weakest failure detector for solving consensus”, Journal of the ACM, 43(4), July 1996.

Maintained by Jeff Magee. Last modified: 5th January 2004 Based on material originally prepared by Christos Karamanolis

Distributed Algorithms

Reading List

Software

Course Outline

Models of distributed computing (Introduction)

Synchronous message-passing distributed systems

Leader Election and Breadth-First Search algorithms

The Atomic Commitment Problem

The Byzantine Generals Problem

Asynchronous message-passing distributed systems

Impossibility of Consensus

Logical Time

Fault-Tolerant Broadcasts

Partially synchronous message-passing distributed systems

Failure Detectors