TY - GEN
T1 - Inter-thread communication in multithreaded, reconfigurable coarse-grain arrays
AU - Voitsechov, Dani
AU - Port, Oron
AU - Etsion, Yoav
N1 - Publisher Copyright: © 2018 IEEE.
PY - 2018/12/12
Y1 - 2018/12/12
AB - Traditional von Neumann GPGPUs only allow threads to communicate through memory on a group-to-group basis. In this model, a group of producer threads writes intermediate values to memory, which are read by a group of consumer threads after a barrier synchronization. To alleviate the memory bandwidth pressure imposed by this method of communication, GPGPUs provide a small scratchpad memory that prevents intermediate values from overloading DRAM bandwidth. In this paper we introduce direct inter-thread communication for massively multithreaded CGRAs, where intermediate values are communicated directly through the compute fabric on a point-to-point basis. This method avoids the need to write values to memory, eliminates the need for a dedicated scratchpad, and avoids workgroup global barriers. We introduce our proposed extensions to the programming model (CUDA) and execution model, as well as the hardware primitives that facilitate the communication. Our simulations of Rodinia benchmarks running on the new system show that direct inter-thread communication provides an average speedup of 2.8x (10.3x max) and reduces system power by an average of 5x (22x max), when compared to an equivalent Nvidia GPGPU.
KW - CGRA
KW - Dataflow
KW - GPGPU
KW - Inter-thread communication
KW - MPI
KW - Non-von Neumann architectures
KW - Reconfigurable architectures
KW - SIMD
UR - http://www.scopus.com/inward/record.url?scp=85060018171&partnerID=8YFLogxK
U2 - 10.1109/MICRO.2018.00013
DO - 10.1109/MICRO.2018.00013
M3 - Conference contribution
T3 - Proceedings of the Annual International Symposium on Microarchitecture, MICRO
SP - 42
EP - 54
BT - Proceedings - 51st Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 2018
T2 - 51st Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 2018
Y2 - 20 October 2018 through 24 October 2018
ER -