Accelerating content-defined-chunking based data deduplication by exploiting parallelism

Wen Xia, Dan Feng, Hong Jiang, Yucheng Zhang, Victor Chang, Xiangyu Zou

Research output: Contribution to journal › Article

1 Citation (Scopus)

Abstract

Data deduplication, a data reduction technique that efficiently detects and eliminates redundant data chunks and files, has been widely applied in large-scale storage systems. Most existing deduplication-based storage systems employ content-defined chunking (CDC) and secure-hash-based fingerprinting (e.g., SHA1) to remove redundant data at the chunk level (e.g., 4 KB/8 KB chunks), both of which are extremely compute-intensive and thus time-consuming for storage systems. Therefore, we present P-Dedupe, a pipelined and parallelized data deduplication system that accelerates the deduplication process by dividing it into four stages (i.e., chunking, fingerprinting, indexing, and writing), pipelining these four stages with chunks and files (the processing data units for deduplication), and then parallelizing the CDC and secure-hash-based fingerprinting stages to further alleviate the computation bottleneck. More importantly, to efficiently parallelize CDC under the requirements of both maximal and minimal chunk sizes, and inspired by the MapReduce model, we first split the data stream into several segments (i.e., “Map”), where each segment runs CDC in parallel on an independent thread, and then re-chunk and join the boundaries of these segments (i.e., “Reduce”) to preserve the chunking effectiveness of the parallelized CDC. Experimental results of P-Dedupe with eight datasets on a quad-core Intel i7 processor suggest that P-Dedupe is able to accelerate deduplication throughput nearly linearly by exploiting parallelism in the CDC-based deduplication process, at the cost of only a 0.02% decrease in the deduplication ratio. Our work contributes to big data science by ensuring that all files go through the deduplication process quickly and thoroughly, so that the same file is processed and analyzed only once rather than multiple times.
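To illustrate the segment-and-rejoin idea described above, the following minimal Python sketch splits an input into segments (“Map”), runs a simple CDC pass over each segment in its own thread, and then re-chunks the spans that straddle segment joins (“Reduce”) so that cut points away from the joins match a sequential pass. This is not the authors' P-Dedupe code: the chunk-size limits, mask, window width, and the md5-based window hash are illustrative assumptions standing in for a Rabin- or Gear-style rolling hash.

# Minimal sketch of MapReduce-style parallel content-defined chunking (CDC).
# NOT the P-Dedupe implementation: all constants and the window hash below
# are illustrative assumptions.

import hashlib
from concurrent.futures import ThreadPoolExecutor

MIN_CHUNK = 2 * 1024          # assumed minimum chunk size (2 KB)
MAX_CHUNK = 64 * 1024         # assumed maximum chunk size (64 KB)
MASK = 0x1FFF                 # boundary-test mask, ~8 KB average chunk size
WINDOW = 48                   # sliding-window width for the boundary test

def window_hash(window: bytes) -> int:
    """Toy stand-in for a Rabin/Gear rolling hash over the sliding window."""
    return int.from_bytes(hashlib.md5(window).digest()[:4], "big")

def cdc(data: bytes) -> list[int]:
    """Sequential CDC: return cut offsets (chunk end positions) within data."""
    cuts, start, i = [], 0, 0
    while i < len(data):
        i += 1
        if i - start < MIN_CHUNK:
            continue                               # enforce the minimum chunk size
        hit = i >= WINDOW and (window_hash(data[i - WINDOW:i]) & MASK) == 0
        if hit or i - start >= MAX_CHUNK:          # boundary found, or maximum reached
            cuts.append(i)
            start = i
    if start < len(data):                          # the tail becomes the last chunk
        cuts.append(len(data))
    return cuts

def parallel_cdc(data: bytes, segment_size: int = 4 * 1024 * 1024,
                 workers: int = 4) -> list[int]:
    """'Map': chunk segments independently; 'Reduce': re-chunk across the joins."""
    offsets = range(0, len(data), segment_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        per_segment = list(pool.map(
            lambda off: [off + c for c in cdc(data[off:off + segment_size])],
            offsets))

    cuts: list[int] = []
    for idx, seg_cuts in enumerate(per_segment):
        if idx == 0:
            cuts.extend(seg_cuts)
            continue
        # The previous segment's last cut sits on an artificial join, so re-chunk
        # sequentially from the previous real cut up to the first cut of this
        # segment and splice the corrected boundaries in ("Reduce").
        prev_real_cut = cuts[-2] if len(cuts) >= 2 else 0
        rejoined = [prev_real_cut + c
                    for c in cdc(data[prev_real_cut:seg_cuts[0]])]
        cuts = cuts[:-1] + rejoined + seg_cuts[1:]
    return cuts

if __name__ == "__main__":
    import os
    blob = os.urandom(4 * 1024 * 1024)             # 4 MB of random sample data
    cuts = parallel_cdc(blob, segment_size=1024 * 1024)
    assert cuts[-1] == len(blob)                   # cuts cover the whole input
    print(f"{len(cuts)} chunks, average size {len(blob) // len(cuts)} bytes")

Re-anchoring at the first content-defined cut inside each segment is what keeps the loss in chunking effectiveness small, which is consistent with the 0.02% decrease in deduplication ratio reported above. ThreadPoolExecutor is used only to mirror the thread-per-segment structure; under CPython's GIL a real speedup would require native threads or processes, and P-Dedupe additionally pipelines chunking with the fingerprinting, indexing, and writing stages, which this sketch omits.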

Original language: English
Pages (from-to): 406-418
Number of pages: 13
Journal: Future Generation Computer Systems
Volume: 98
DOI: 10.1016/j.future.2019.02.008
Publication status: Published - 29 Mar 2019

Fingerprint

Data reduction
Throughput
Big data

Cite this

Xia, Wen; Feng, Dan; Jiang, Hong; Zhang, Yucheng; Chang, Victor; Zou, Xiangyu. / Accelerating content-defined-chunking based data deduplication by exploiting parallelism. In: Future Generation Computer Systems. Vol. 98, 2019, pp. 406-418.
@article{3cbe3c7ab83342bbaa3ce80cf54349a5,
title = "Accelerating content-defined-chunking based data deduplication by exploiting parallelism",
author = "Wen Xia and Dan Feng and Hong Jiang and Yucheng Zhang and Victor Chang and Xiangyu Zou",
year = "2019",
month = "3",
day = "29",
doi = "10.1016/j.future.2019.02.008",
language = "English",
volume = "98",
pages = "406--418",
journal = "Future Generation Computer Systems",
issn = "0167-739X",
publisher = "Elsevier",

}


TY - JOUR

T1 - Accelerating content-defined-chunking based data deduplication by exploiting parallelism

AU - Xia, Wen

AU - Feng, Dan

AU - Jiang, Hong

AU - Zhang, Yucheng

AU - Chang, Victor

AU - Zou, Xiangyu

PY - 2019/3/29

Y1 - 2019/3/29

UR - http://www.scopus.com/inward/record.url?scp=85063748545&partnerID=8YFLogxK

U2 - 10.1016/j.future.2019.02.008

DO - 10.1016/j.future.2019.02.008

M3 - Article

AN - SCOPUS:85063748545

VL - 98

SP - 406

EP - 418

JO - Future Generation Computer Systems

JF - Future Generation Computer Systems

SN - 0167-739X

ER -