TY - JOUR
T1 - A crowdsourced set of curated structural variants for the human genome
AU - Chapman, Lesley M
AU - Spies, Noah
AU - Pai, Patrick
AU - Lim, Chun Shen
AU - Carroll, Andrew
AU - Narzisi, Giuseppe
AU - Watson, Christopher M.
AU - Proukakis, Christos
AU - Clarke, Wayne E.
AU - Nariai, Naoki
AU - Dawson, Eric
AU - Jones, Garan
AU - Blankenberg, Daniel
AU - Brueffer, Christian
AU - Xiao, Chunlin
AU - Kolora, Sree Rohit Raj
AU - Alexander, Noah
AU - Wolujewicz, Paul
AU - Ahmed, Azza E.
AU - Smith, Graeme
AU - Shehreen, Saadlee
AU - Wenger, Aaron M.
AU - Salit, Marc
AU - Zook, Justin M.
PY - 2020/6/19
Y1 - 2020/6/19
N2 - A high-quality benchmark for small variants encompassing 88 to 90% of the reference genome has been developed for seven Genome in a Bottle (GIAB) reference samples. However, a reliable benchmark for large indels and structural variants (SVs) is more challenging. In this study, we manually curated 1235 SVs, which can ultimately be used to evaluate SV callers or train machine learning models. We developed a crowdsourcing app, SVCurator, to help GIAB curators manually review large indels and SVs within the human genome, and report their genotype and size accuracy. SVCurator displays images from short, long, and linked read sequencing data from the GIAB Ashkenazi Jewish Trio son [NIST RM 8391/HG002]. We asked curators to assign labels describing SV type (deletion or insertion), size accuracy, and genotype for 1235 putative insertions and deletions sampled from different size bins between 20 and 892,149 bp. 'Expert' curators were 93% concordant with each other, and 37 of the 61 curators had at least 78% concordance with a set of 'expert' curators. The curators were least concordant for complex SVs and SVs that had inaccurate breakpoints or size predictions. After filtering events with low concordance among curators, we produced high-confidence labels for 935 events. The SVCurator crowdsourced labels were 94.5% concordant with the heuristic-based draft benchmark SV callset from GIAB. We found that curators can successfully evaluate putative SVs when given evidence from multiple sequencing technologies.
DO - 10.1371/journal.pcbi.1007933
M3 - Article
C2 - 32559231
SN - 1553-734X
VL - 16
JO - PLoS Computational Biology
JF - PLoS Computational Biology
IS - 6
M1 - e1007933
ER -