TY - GEN
T1 - Accelerating Distributed Matrix Multiplication with 4-Dimensional Polynomial Codes
AU - Nissim, Roy
AU - Schwartz, Oded
N1 - Publisher Copyright: © 2023 Copyright for this paper is retained by the authors.
PY - 2023
Y1 - 2023
N2 - A single straggler worker may delay an entire distributed system. The state-of-the-art strategies for mitigating delays in large-scale distributed matrix multiplication are polynomial-based coded computations such as the Polynomial Codes and Entangled Polynomial Codes. While such strategies deal with stragglers efficiently, they discard partial computations performed by stragglers. Hence, they are sub-optimal. Here, we present the Multi Entangled Polynomial Codes, a straggler mitigation strategy that utilizes the computations performed by all workers and significantly reduces the running time. Furthermore, it allows the final output to be decoded before any worker completes its tasks, thereby breaking the lower bound of Yu, Maddah-Ali, and Avestimehr (2020). Previous studies that utilize partial computations performed by stragglers require large Maximal Distance Separable codes, resulting in high overhead costs. In contrast, our strategy requires short codes comparable to Entangled Polynomial Codes. Thus, we preserve efficient encoding and decoding complexity and reduce the arithmetic overhead of previous solutions by a factor of (Formula presented), where N and W are the matrix dimension and the number of workers, respectively. We provide experimental results on an Amazon EC2 cluster that demonstrate up to 15% speedup over previous strategies. Moreover, we show that our strategy is optimal up to a factor of (1 + o(1)).
UR - http://www.scopus.com/inward/record.url?scp=85168972323&partnerID=8YFLogxK
M3 - Conference contribution
T3 - SIAM Conference on Applied and Computational Discrete Algorithms, ACDA 2023
SP - 134
EP - 146
BT - SIAM Conference on Applied and Computational Discrete Algorithms, ACDA 2023
A2 - Berry, Jonathan
A2 - Shmoys, David
A2 - Cowen, Lenore
A2 - Naumann, Uwe
PB - Society for Industrial and Applied Mathematics Publications
T2 - 2nd SIAM Conference on Applied and Computational Discrete Algorithms, ACDA 2023
Y2 - 31 May 2023 through 2 June 2023
ER -