
CSR: Core surprise removal in commodity operating systems

  • Noam Shalev
  • Eran Harpaz
  • Hagar Porat
  • Idit Keidar
  • Yaron Weinsberg

Research output: Chapter in Book/Report/Conference proceeding; Conference contribution; peer-reviewed

Abstract

One of the adverse effects of shrinking transistor sizes is that processors have become increasingly prone to hardware faults. At the same time, the number of cores per die rises. Consequently, core failures can no longer be ruled out, and future operating systems for many-core machines will have to incorporate fault tolerance mechanisms. We present CSR, a strategy for recovery from unexpected permanent processor faults in commodity operating systems. Our approach overcomes surprise removal of faulty cores, and also tolerates cascading core failures. When a core fails in user mode, CSR terminates the process executing on that core and migrates the remaining processes in its run-queue to other cores. We further show how hardware transactional memory may be used to overcome failures in critical kernel code. Our solution is scalable, incurs low overhead, and is designed to integrate into modern operating systems. We have implemented it in the Linux kernel, using Haswell's Transactional Synchronization Extensions (TSX), and tested it on a real system.
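
The user-mode recovery step described in the abstract (terminate the process running on the failed core, then migrate the rest of its run-queue to other cores) can be illustrated with a toy user-space simulation. This is only a sketch and not the authors' kernel implementation: the `Core` class, the `handle_core_failure` function, and the least-loaded placement policy are all assumptions made for illustration, since the abstract says only that the processes move "to other cores".

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Core:
    """Hypothetical model of a core: one running process plus a run-queue."""
    core_id: int
    alive: bool = True
    running: Optional[str] = None
    run_queue: List[str] = field(default_factory=list)


def handle_core_failure(cores: Dict[int, Core], failed_id: int) -> Optional[str]:
    """Simulate recovery from a user-mode failure of core `failed_id`.

    Returns the name of the terminated process (or None if the core was idle).
    """
    failed = cores[failed_id]
    failed.alive = False

    # The process that was executing on the failed core is terminated.
    terminated = failed.running
    failed.running = None

    survivors = [c for c in cores.values() if c.alive]
    if not survivors:
        raise RuntimeError("no live cores left to migrate to")

    # Migrate the remaining queued processes to other cores.
    # Assumption: pick the least-loaded live core, ties broken by core id.
    orphans, failed.run_queue = failed.run_queue, []
    for proc in orphans:
        target = min(survivors, key=lambda c: (len(c.run_queue), c.core_id))
        target.run_queue.append(proc)
    return terminated


cores = {
    0: Core(0, running="a", run_queue=["b", "c"]),
    1: Core(1, running="d", run_queue=["e"]),
    2: Core(2, running="f"),
}
first = handle_core_failure(cores, 0)   # core 0 fails
second = handle_core_failure(cores, 1)  # cascading failure of core 1
```

Calling the function twice in a row mimics the cascading-failure case the abstract mentions: each call only ever migrates work to cores that are still marked alive.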

Original language: English
Title of host publication: ASPLOS 2016 - 21st International Conference on Architectural Support for Programming Languages and Operating Systems
Pages: 773-787
Number of pages: 15
ISBN (Electronic): 9781450340915
State: Published - 25 Mar 2016
Event: 21st International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS 2016 - Atlanta, United States
Duration: 2 Apr 2016 to 6 Apr 2016

Publication series

Name: International Conference on Architectural Support for Programming Languages and Operating Systems - ASPLOS
Volume: 02-06-April-2016

Conference

Conference: 21st International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS 2016
Country/Territory: United States
City: Atlanta
Period: 2/04/16 to 6/04/16

Keywords

  • CSR
  • Core Surprise Removal
  • Hotplug
  • Operating Systems
  • Reliability
  • Transactional Memory

ASJC Scopus subject areas

  • Software
  • Information Systems
  • Hardware and Architecture
