Speaker
Dr. Matthew Curry (Sandia National Laboratories)
Description
When designing large-scale storage systems, failure is a serious concern that demands constant attention. However, system designers have little ability to objectively evaluate the risk of data loss for a given storage system. Instead, they must enumerate possible failure modes, estimate their relative probabilities, identify possible mitigations, and decide whether the expense of each mitigation is worthwhile. This process relies on folk wisdom, rules of thumb, and anecdotal experience. For systems that grow larger and more complex year by year, this methodology is too imprecise to guarantee safety while ensuring efficiency.
This talk will detail some of the progress in the SIMS^2 project, a collaboration between Sandia National Laboratories, the University of Wisconsin-Madison, and Los Alamos National Laboratory charged with bringing more science and rigor to the evaluation of system designs. It will cover some of the pitfalls of current evaluation methods, methods for determining the complex behavior of aggregated components, the evaluation of different types of failure modes, and some interesting inflection points for different system designs.
HPC content
Contained in Abstract
Primary author
Dr. Matthew Curry (Sandia National Laboratories)