TY - GEN
T1 - Proactive Runtime Detection of Aging-Related Silent Data Corruptions
T2 - 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS 2024
AU - Ma, Jiacheng
AU - Ganaiem, Majd
AU - Burbage, Madeline
AU - Gregersen, Theo
AU - McAmis, Rachel
AU - Gabbay, Freddy
AU - Kasikci, Baris
N1 - Publisher Copyright: © 2024 Copyright is held by the owner/author(s). Publication rights licensed to ACM.
PY - 2025/4/10
Y1 - 2025/4/10
N2 - Recent advancements in semiconductor process technologies have unveiled the susceptibility of hardware circuits to reliability issues, especially those related to transistor aging. Transistor aging gradually degrades gate performance, eventually causing hardware to behave incorrectly. Such misbehaving hardware can result in silent data corruptions (SDCs) in software - -a type of failure that comes without logs or exceptions, but causes miscomputing instructions, bitflips, and broken cache coherency. Alas, while design efforts can be made to mitigate transistor aging, complete elimination of this problem during design and fabrication cannot be guaranteed. This emerging challenge calls for a mechanism that not only detects potentially aged hardware in the field, but also triggers software mitigations at application runtime.We propose Vega, a novel workflow that allows efficient detection of aging-related failures at software runtime. Vega leverages the well-studied gate-level modeling of aging effects to identify susceptible signal propagation paths that could fail due to transistor aging. It then utilizes formal verification techniques to generate short test cases that activate these paths and detect any failure within them. Vega integrates the test cases into a user application by directly fusing them together, or by packaging the test cases into a library that the application can invoke. We demonstrate our proposed techniques on the arithmetic logic unit and floating-point unit of a RISC-V CPU. We show that Vega generates effective test cases and integrates them into applications with an average of 0.8% performance overhead.
AB - Recent advancements in semiconductor process technologies have unveiled the susceptibility of hardware circuits to reliability issues, especially those related to transistor aging. Transistor aging gradually degrades gate performance, eventually causing hardware to behave incorrectly. Such misbehaving hardware can result in silent data corruptions (SDCs) in software - -a type of failure that comes without logs or exceptions, but causes miscomputing instructions, bitflips, and broken cache coherency. Alas, while design efforts can be made to mitigate transistor aging, complete elimination of this problem during design and fabrication cannot be guaranteed. This emerging challenge calls for a mechanism that not only detects potentially aged hardware in the field, but also triggers software mitigations at application runtime.We propose Vega, a novel workflow that allows efficient detection of aging-related failures at software runtime. Vega leverages the well-studied gate-level modeling of aging effects to identify susceptible signal propagation paths that could fail due to transistor aging. It then utilizes formal verification techniques to generate short test cases that activate these paths and detect any failure within them. Vega integrates the test cases into a user application by directly fusing them together, or by packaging the test cases into a library that the application can invoke. We demonstrate our proposed techniques on the arithmetic logic unit and floating-point unit of a RISC-V CPU. We show that Vega generates effective test cases and integrates them into applications with an average of 0.8% performance overhead.
UR - http://www.scopus.com/inward/record.url?scp=105007015516&partnerID=8YFLogxK
U2 - 10.1145/3622781.3674182
DO - 10.1145/3622781.3674182
M3 - منشور من مؤتمر
T3 - International Conference on Architectural Support for Programming Languages and Operating Systems - ASPLOS
SP - 220
EP - 235
BT - ASPLOS 2024 - Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems
Y2 - 27 April 2024 through 1 May 2024
ER -