UBC Theses and Dissertations

UBC Theses Logo

UBC Theses and Dissertations

Tolerating intermittent hardware errors : characterization, diagnosis and recovery Rashid, Layali

Abstract

Over three decades of continuous scaling in CMOS technology has led to tremendous improvements in processor performance. At the same time, the scaling has led to an increase in the frequency of hardware errors due to high process variations, extreme operating conditions and manufacturing defects. Recent studies have found that 40% of the processor failures in real-world machines are due to intermittent hardware errors. Intermittent hardware errors are non-deterministic bursts of errors that occur in the same physical location. Intermittent errors have characteristics that are different from transient and permanent errors, which makes it challenging to devise efficient fault tolerance techniques for them. In this dissertation, we characterize the impact of intermittent hardware faults on programs using fault injection experiments at the micro-architecture level. We find that intermittent errors are likely to generate software visible effects when they occur. Based on our characterization results, we build intermittent error tolerance techniques with focus on error diagnosis and recovery. We first evaluate the impact of different intermittent error recovery scenarios on a processor's performance and availability. We then propose DIEBA (Diagnose Intermittent hardware Errors in microprocessors by Backtracing Application), a software-based technique to diagnose the fault-prone functional units in a processor.

Item Media

Item Citations and Data

Rights

Attribution-NonCommercial-NoDerivatives 4.0 International