Fault Tolerance with High Performance for Fast Matrix Multiplication.

Noam Birnbaum, Roy Nissim, Oded Schwartz

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

The increase in machine size and the decrease in operating voltage have made errors more common, both soft (bit flips) and hard (component failures). Checkpoint-restart solutions address the problem of hard errors, but incur significant costs in time and hardware. For some algorithms, classical matrix multiplication included, tailored solutions incur much lower overhead. Existing efficient algorithmic fault tolerance techniques typically target iterative algorithms, and thus cannot be applied to recursive algorithms such as fast matrix multiplication. By utilizing combinatorial aspects of the computational graphs of these algorithms, we obtain fault resilience with small overhead costs for Strassen's and other recursive fast matrix multiplication algorithms. To the best of our knowledge, this is the first fault-tolerant solution tailored for fast matrix multiplication algorithms. Our solution is asymptotically better than any of the previous (classical-based) fault-tolerant solutions, unless the error rate is extremely high. Our technique can be used to obtain fault tolerance for other recursive algorithms. In addition, we show how to reduce communication costs using additional processors, and discuss inherent fault tolerance capabilities of fast matrix multiplication algorithms.
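As a small illustration of what a "tailored solution" for classical matrix multiplication can look like, the sketch below uses the well-known checksum idea: append a row of column sums to A and a column of row sums to B, multiply, and compare the checksums of the product to locate and repair a single corrupted entry. This is not the scheme proposed in the paper for fast (Strassen-like) algorithms; the function names are hypothetical, and integer entries are assumed so that checksum comparisons are exact.

```python
def matmul(a, b):
    """Classical (cubic-time) matrix product of two lists of lists."""
    inner = len(b)
    return [[sum(a[i][t] * b[t][j] for t in range(inner))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def with_column_checksum(a):
    """Append one row holding the sum of each column of a."""
    return a + [[sum(col) for col in zip(*a)]]


def with_row_checksum(b):
    """Append one column holding the sum of each row of b."""
    return [row + [sum(row)] for row in b]


def checked_product(a, b):
    """Product of the encoded inputs: data block plus checksum row and column."""
    return matmul(with_column_checksum(a), with_row_checksum(b))


def locate_single_error(c_full):
    """Return (i, j) of one corrupted data entry, or None if checksums agree."""
    n, m = len(c_full) - 1, len(c_full[0]) - 1
    bad_rows = [i for i in range(n) if sum(c_full[i][:m]) != c_full[i][m]]
    bad_cols = [j for j in range(m)
                if sum(c_full[i][j] for i in range(n)) != c_full[n][j]]
    if not bad_rows and not bad_cols:
        return None
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        return bad_rows[0], bad_cols[0]
    raise ValueError("more than one data entry corrupted, or a checksum is corrupted")


def repair_single_error(c_full):
    """Fix one corrupted data entry in place using its row checksum."""
    pos = locate_single_error(c_full)
    if pos is not None:
        i, j = pos
        m = len(c_full[0]) - 1
        others = sum(c_full[i][:m]) - c_full[i][j]
        c_full[i][j] = c_full[i][m] - others
    return c_full
```

The encoding adds only one extra row and one extra column to the inputs, which is why the overhead is low for the classical algorithm. The abstract's point is that recursive fast algorithms need a different, tailored construction.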
Original language: English
Title of host publication: CSC 2020
Pages: 106-117
Number of pages: 12
ISBN (Electronic): 978-1-61197-622-9
State: Published - 2020
Event: SIAM Workshop on Combinatorial Scientific Computing, CSC20 - Seattle, United States
Duration: 11 Feb 2020 - 13 Feb 2020
https://epubs.siam.org/doi/10.1137/1.9781611976229

Conference

Conference: SIAM Workshop on Combinatorial Scientific Computing, CSC20
Abbreviated title: CSC20
Country/Territory: United States
City: Seattle
Period: 11/02/20 - 13/02/20
