TY - GEN
T1 - Practical challenges in delivering the promises of real processing-in-memory machines
AU - Talati, Nishil
AU - Ali, Ameer Haj
AU - Ben Hur, Rotem
AU - Wald, Nimrod
AU - Ronen, Ronny
AU - Gaillardon, Pierre Emmanuel
AU - Kvatinsky, Shahar
N1 - Publisher Copyright: © 2018 EDAA.
PY - 2018/4/19
Y1 - 2018/4/19
N2 - Processing-in-Memory (PiM) machines promise to overcome the von Neumann bottleneck in order to further scale performance and energy efficiency of computing systems by reducing the extent of data transfer and offering ample parallelism. In this paper, we take the memristive Memory Processing Unit (mMPU) as a case study of a PiM machine and scrutinize it in practical scenarios. Specifically, we explore the limitations of parallelism and data transfer elimination. We argue that lack of operand locality and arrangement might make data transfer inevitable in the mMPU. We then devise techniques to move data within the mMPU, without transferring it off-chip, and quantify their costs. Additionally, we present electrical parameters that might limit the parallelism offered by the mMPU and evaluate their impact. Using benchmarks from the LGsynth91 suite, their vector extensions, and a few synthetic data-parallel workloads, we show that the internal data transfer results in an increase of up to 1.5× in the execution time, while the parallelism can be limited in some cases to 256 gates, resulting in an increase in execution time by 1.1× to 2×.
KW - mMPU
KW - memristors
KW - von Neumann bottleneck
UR - http://www.scopus.com/inward/record.url?scp=85048778402&partnerID=8YFLogxK
U2 - 10.23919/DATE.2018.8342275
DO - 10.23919/DATE.2018.8342275
M3 - Conference contribution
T3 - Proceedings of the 2018 Design, Automation and Test in Europe Conference and Exhibition, DATE 2018
SP - 1628
EP - 1633
BT - Proceedings of the 2018 Design, Automation and Test in Europe Conference and Exhibition, DATE 2018
T2 - 2018 Design, Automation and Test in Europe Conference and Exhibition, DATE 2018
Y2 - 19 March 2018 through 23 March 2018
ER -