Reliability of SSD Storage Systems
Loading...
Date
2015-12-07
Authors
Journal Title
Journal ISSN
Volume Title
Publisher
Abstract
Solid-state drives (SSDs) are attractive storage components due to their many attractive properties, however, concerns about their reliability still remain and this delays the wider deployment of the SSDs.
Many protection schemes have been proposed to improve the reliability of SSDs.
For example, some techniques like error correction codes (ECC), log-like writing of ash translation layer (FTL), garbage collection and wear leveling improve the reliability of SSD at the device level. Composing an array of SSDs and employing system level parity protection is one of the popular protection schemes at the system level. Enterprise class (high-end) SSDs are faster and more resilient than client class (low-end) SSDs but they are expensive to be deployed in large scale storage systems. It is an attractive and practical alternative to exploit the high-end SSDs as a cache and low-end SSDs as main storage. The high-end SSD cache equipped on a low-end SSD array enhances both latency and reduces write count of the SSD storage system at the same time.
This work analyzes the effectiveness of protection schemes originally designed for HDDs but applied to SSD storage systems. We find that different characteristics of
HDDs and SSDs make integration of those solutions in SSD storage systems not so straight-forward.
This work, at first, analyzes the effectiveness of the device level protection schemes such as ECC and scrubbing. A Markov model based analysis of the protection schemes is presented. Our model considers time varying nature of the reliability of ash memory as well as write amplification of various device level protection schemes. Our study shows that write amplification from these various sources can significantly affect the benefits of protection schemes in improving the lifetime. Based on the results from our analysis, we propose that bit errors within an SSD page be left uncorrected until a threshold of errors are accumulated. We show that such an approach can significantly improve lifetimes by up to 40%.
This work also analyzes the effectiveness of parity protection over SSD arrays, a widely used protection scheme for SSD arrays at system level. The parity protection is typically employed to compose reliable storage systems. However, careful consideration is required when SSD based systems employ parity protection. Additional writes are required for parity updates. Also, parity consumes space on the device, which results in write amplification from less efficient garbage collection at higher space utilization. We present a Markov model to estimate the lifetime of SSD based RAID systems in different environments. In a small array, our results show that parity protection provides benefit only with considerably low space utilizations and low data access rates. However, in a large system, RAID improves data lifetime even when we take write amplification into account.
This work explores how to optimize a mixed SSD array in terms of performance and lifetime. We show that simple integration of different classes of SSDs in traditional caching policies results in poor reliability. We also reveal that caching policies with static workload classifiers are not always efficient. We propose a sampling based adaptive approach that achieves fair workload distribution across the cache and the storage. The proposed algorithm enables fine-grained control of the workload distribution which minimizes latency over lifetime of mixed SSD arrays. We show that our adaptive algorithm is very effective in improving the latency over lifetime metric, on an average, by up to 2.36 times over LRU, across a number of workloads.
Description
Keywords
Storage, Reliability, SSD