Resilience of mutual exclusion algorithms to transient memory faults

Thomas Moscibroda, Rotem Oshman

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

We study the behavior of mutual exclusion algorithms in the presence of unreliable shared memory subject to transient memory faults. It is well known that classical 2-process mutual exclusion algorithms, such as Dekker's and Peterson's algorithms, are not fault-tolerant; in this paper we ask what degree of fault tolerance can be achieved using the same restricted resources as Dekker's and Peterson's algorithms, namely, three binary read/write registers. We show that if one memory fault can occur, it is not possible to guarantee both mutual exclusion and deadlock-freedom using three binary registers; more generally, the same holds whenever fewer than 2f + 1 binary registers are used and up to f of them may be faulty. Hence we focus on algorithms that guarantee (a) mutual exclusion and starvation-freedom in fault-free executions, and (b) only mutual exclusion in faulty executions. We show that using only three binary registers it is possible to design a 2-process mutual exclusion algorithm which tolerates a single memory fault in this manner. Further, by replacing one read/write register with a test&set register, we can guarantee mutual exclusion in executions where one variable experiences unboundedly many faults. In the more general setting where up to f registers may be faulty, we show that it is not possible to guarantee mutual exclusion using 2f + 1 binary read/write registers if each faulty register can exhibit unboundedly many faults. On the positive side, we show that an n-variable single-fault tolerant algorithm satisfying certain conditions can be transformed into an ((n-1)f + 1)-variable f-fault tolerant algorithm with the same progress guarantee as the original. In combination with our three-variable algorithm (for n = 3, (n-1)f + 1 = 2f + 1), this implies that there is a (2f + 1)-variable mutual exclusion algorithm tolerating a single fault in each of up to f variables without violating mutual exclusion.
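The abstract takes as its starting point that Peterson's algorithm is not fault-tolerant. The following is a minimal Python sketch, not taken from the paper, that illustrates why with one hand-picked interleaving of Peterson's entry protocol over three binary registers (named `flag0`, `flag1` and `turn` here; the names are our own). Process 0 enters its critical section alone, then a single transient fault resets `flag0` to 0, and process 1 is admitted while process 0 is still inside. The generator-based scheduler and the helpers `peterson`, `run_until_cs` and `scenario` are hypothetical, and the sketch treats the two reads in the wait condition as a single step for simplicity.

```python
from typing import Dict, Generator, List

Memory = Dict[str, int]


def peterson(i: int, mem: Memory, in_cs: List[bool]) -> Generator[None, None, None]:
    """Entry section of Peterson's algorithm for process i in {0, 1}.

    Yields after each step so a scheduler can interleave processes and
    inject memory faults between steps (a simplified step model).
    """
    j = 1 - i
    mem[f"flag{i}"] = 1  # announce intent to enter
    yield
    mem["turn"] = j  # give priority to the other process
    yield
    # Spin while the other process is interested and has priority.
    while mem[f"flag{j}"] == 1 and mem["turn"] == j:
        yield
    in_cs[i] = True  # process i is now in its critical section
    yield


def run_until_cs(proc: Generator[None, None, None], i: int,
                 in_cs: List[bool], max_steps: int = 100) -> bool:
    """Step process i until it enters the critical section or max_steps elapse."""
    for _ in range(max_steps):
        if in_cs[i]:
            return True
        next(proc)
    return in_cs[i]


def scenario(inject_fault: bool) -> List[bool]:
    """Run P0 into its critical section, optionally fault flag0, then run P1."""
    mem: Memory = {"flag0": 0, "flag1": 0, "turn": 0}
    in_cs = [False, False]
    p0 = peterson(0, mem, in_cs)
    p1 = peterson(1, mem, in_cs)
    run_until_cs(p0, 0, in_cs)  # P0 enters while P1 is idle
    if inject_fault:
        mem["flag0"] = 0  # transient fault: P0's flag silently flips to 0
    run_until_cs(p1, 1, in_cs)  # P1 tries to enter while P0 is still inside
    return in_cs


if __name__ == "__main__":
    print(scenario(inject_fault=False))  # [True, False]: P1 keeps spinning
    print(scenario(inject_fault=True))   # [True, True]: mutual exclusion violated
```

Without the fault, process 1 sees `flag0 == 1` and `turn == 0` and spins until the step budget runs out; with the fault, its wait condition is immediately false, so both processes end up in the critical section.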

Original language: English
Title of host publication: PODC'11 - Proceedings of the 2011 ACM Symposium Principles of Distributed Computing
Pages: 69-78
Number of pages: 10
State: Published - 2011
Externally published: Yes
Event: 30th Annual ACM SIGACT-SIGOPS Symposium on Principles of Distributed Computing, PODC'11, Held as Part of the 5th Federated Computing Research Conference, FCRC - San Jose, CA, United States
Duration: 6 Jun 2011 to 8 Jun 2011

Publication series

Name: Proceedings of the Annual ACM Symposium on Principles of Distributed Computing

Conference

Conference: 30th Annual ACM SIGACT-SIGOPS Symposium on Principles of Distributed Computing, PODC'11, Held as Part of the 5th Federated Computing Research Conference, FCRC
Country/Territory: United States
City: San Jose, CA
Period: 6/06/11 to 8/06/11

Keywords

  • fault tolerance
  • mutual exclusion
  • transient memory faults

All Science Journal Classification (ASJC) codes

  • Software
  • Hardware and Architecture
  • Computer Networks and Communications
