TY - GEN
T1 - Multiplying 2 × 2 Sub-Blocks Using 4 Multiplications
AU - Moran, Yoav
AU - Schwartz, Oded
N1 - Publisher Copyright: © 2023 Owner/Author.
PY - 2023/6/17
Y1 - 2023/6/17
AB - Fast parallel and sequential matrix multiplication algorithms switch to the cubic-time classical algorithm on small sub-blocks, because the classical algorithm requires fewer operations on small blocks. We obtain a new algorithm that can outperform the classical one even on small blocks by trading multiplications for additions. This algorithm contradicts the common belief that the classical algorithm is the fastest algorithm for small blocks. To this end, we introduce commutative algorithms that generalize Winograd's folding technique (1968) and combine it with fast matrix multiplication algorithms. Thus, when a single scalar multiplication takes ρ times as many clock cycles as an addition (e.g., for 16-bit integers on Intel's Skylake microarchitecture, ρ is between 1.5 and 5), our technique reduces the computation cost of multiplying the small sub-blocks by a factor of (ρ + 3)/(2(ρ + 1)) compared to using the classical algorithm, at the price of a low-order-term communication overhead in both the sequential and the parallel cases, thus reducing the total runtime of the algorithm. Our technique also reduces the energy cost of the algorithm; the ρ values for energy costs are typically larger than those for arithmetic costs. For example, we obtain an algorithm for multiplying 2 × 2 blocks using only four multiplications. This algorithm seemingly contradicts the lower bound of Winograd (1971) on multiplying 2 × 2 matrices; however, we obtain it by bypassing the lower bound's implicit assumptions. We provide a new lower bound matching our algorithm for 2 × 2 block multiplication, thus showing our technique is optimal.
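To make the multiplication-for-addition trade concrete, here is a minimal Python sketch of the Winograd (1968) folding technique that the abstract's commutative algorithms generalize; it assumes an even inner dimension, all identifiers are illustrative rather than taken from the paper, and it is a sketch of the classical folding trick, not of the paper's 2 × 2 block algorithm.

```python
# Winograd's 1968 folding for n x n matrix multiplication: the row
# correction terms xi[i] and column correction terms eta[j] are each
# computed once with n/2 multiplications and reused across all n^2
# entries, so the multiplication count drops from n^3 to roughly
# n^3/2 + n^2 at the cost of extra additions. The identity
#   sum_k (a[i][2k] + b[2k+1][j]) * (a[i][2k+1] + b[2k][j])
#     = xi[i] + (A @ B)[i][j] + eta[j]
# relies on commutativity of scalar multiplication, which is why the
# trick does not apply recursively to block (matrix) entries.

def winograd_fold_matmul(A, B):
    n = len(A)
    assert n % 2 == 0, "sketch assumes an even inner dimension"
    # Precompute correction terms: n/2 multiplications per row/column.
    xi = [sum(A[i][2 * k] * A[i][2 * k + 1] for k in range(n // 2))
          for i in range(n)]
    eta = [sum(B[2 * k][j] * B[2 * k + 1][j] for k in range(n // 2))
           for j in range(n)]
    C = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # Only n/2 multiplications of folded cross terms per entry.
            s = sum((A[i][2 * k] + B[2 * k + 1][j]) *
                    (A[i][2 * k + 1] + B[2 * k][j])
                    for k in range(n // 2))
            C[i][j] = s - xi[i] - eta[j]
    return C

# Example: [[1, 2], [3, 4]] @ [[5, 6], [7, 8]] == [[19, 22], [43, 50]].
print(winograd_fold_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```

Each entry now costs n/2 multiplications instead of n, which is exactly the regime where a large multiplication-to-addition cost ratio ρ pays off.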
KW - commutative matrix multiplication
KW - matrix multiplication
UR - http://www.scopus.com/inward/record.url?scp=85164294243&partnerID=8YFLogxK
U2 - 10.1145/3558481.3591083
DO - 10.1145/3558481.3591083
M3 - Conference contribution
T3 - Annual ACM Symposium on Parallelism in Algorithms and Architectures
SP - 379
EP - 390
BT - SPAA 2023 - Proceedings of the 35th ACM Symposium on Parallelism in Algorithms and Architectures
T2 - 35th ACM Symposium on Parallelism in Algorithms and Architectures, SPAA 2023
Y2 - 17 June 2023 through 19 June 2023
ER -