Abstract
The increase in machine size and the decrease in operating voltage have made errors more common, both soft (bit flip) and hard (component failure). Checkpoint-restart solutions address the problem of hard errors, but incur significant costs in time and hardware. For some algorithms, classical matrix multiplication included, tailored solutions incur much lower overhead. Existing efficient algorithmic tolerance techniques typically target iterative algorithms, and thus cannot be applied to recursive algorithms, such as fast matrix multiplication algorithms. By utilizing combinatorial aspects of the computational graphs of these algorithms, we obtain fault resilience with small overhead costs for Strassen's and other recursive fast matrix multiplication algorithms. To the best of our knowledge, this is the first fault-tolerant solution tailored for fast matrix multiplication algorithms. Our solution is asymptotically better than any of the previous (classical-based) fault-tolerant solutions, unless the error rate is extremely high. Our technique can be used to obtain fault tolerance for other recursive algorithms. In addition, we show how to reduce communication costs using additional processors, and discuss inherent fault tolerance capabilities of fast matrix multiplication algorithms.
Original language | English |
---|---|
Title of host publication | CSC 2020 |
Pages | 106-117 |
Number of pages | 12 |
ISBN (Electronic) | 978-1-61197-622-9 |
State | Published - 2020 |
Event | SIAM Workshop on Combinatorial Scientific Computing, CSC20, Seattle, United States. Duration: 11 Feb 2020 → 13 Feb 2020. https://epubs.siam.org/doi/10.1137/1.9781611976229 |
Conference
Conference | SIAM Workshop on Combinatorial Scientific Computing, CSC20 |
---|---|
Abbreviated title | CSC20 |
Country/Territory | United States |
City | Seattle |
Period | 11 Feb 2020 → 13 Feb 2020 |