TY - GEN
T1 - SDGEN
T2 - 13th USENIX Conference on File and Storage Technologies, FAST 2015
AU - Gracia-Tinedo, Raúl
AU - Harnik, Danny
AU - Naor, Dalit
AU - Sotnikov, Dmitry
AU - Toledo, Sivan
AU - Zuck, Aviad
N1 - Funding Information: We first thank our shepherd Xiaosong Ma and anonymous reviewers. This work has been partly funded by the EU projects CloudSpaces (FP7-317555) and IOStack (H2020-644182) and Spanish research projects DELFIN (TIN-2010-20140-C03-03) and Cloud Services and Community Clouds (TIN2013-47245-C2-2-R) funded by the Ministry of Science and Innovation.
PY - 2015
Y1 - 2015
N2 - Storage system benchmarks either use samples of proprietary data or synthesize artificial data in simple ways (such as using zeros or random data). However, many storage systems behave completely differently on such artificial data than they do on real-world data. This is the case with systems that include data reduction techniques, such as compression and/or deduplication. To address this problem, we propose a benchmarking methodology called mimicking and apply it in the domain of data compression. Our methodology is based on characterizing the properties of real data that influence the performance of compressors. Then, we use these characterizations to generate new synthetic data that mimics the real one in many aspects of compression. Unlike current solutions that only address the compression ratio of data, mimicking is flexible enough to also emulate compression times and data heterogeneity. We show that these properties matter to the system’s performance. In our implementation, called SDGen, characterizations take at most 2.5KB per data chunk (e.g., 64KB) and can be used to efficiently share benchmarking data in a highly anonymized fashion; sharing it carries few or no privacy concerns. We evaluated our data generator’s accuracy on compressibility and compression times using real-world datasets and multiple compressors (lz4, zlib, bzip2 and lzma). As a proof-of-concept, we integrated SDGen as a content generation layer in two popular benchmarks (LinkBench and Impressions).
UR - http://www.scopus.com/inward/record.url?scp=84969821497&partnerID=8YFLogxK
M3 - Conference contribution
T3 - Proceedings of the 13th USENIX Conference on File and Storage Technologies, FAST 2015
SP - 317
EP - 330
BT - Proceedings of the 13th USENIX Conference on File and Storage Technologies, FAST 2015
Y2 - 16 February 2015 through 19 February 2015
ER -