Curated variation benchmarks for challenging medically relevant autosomal genes

Justin Wagner1, Nathan D Olson1, Lindsay Harris1, Jennifer McDaniel1, Haoyu Cheng2, Arkarachai Fungtammasan3, Yih-Chii Hwang3, Richa Gupta3, Aaron M Wenger4, William J Rowell4, Ziad M Khan5, Jesse Farek5, Yiming Zhu5, Aishwarya Pisupati5, Medhat Mahmoud5, Chunlin Xiao6, Byunggil Yoo7, Sayed Mohammad Ebrahim Sahraeian8, Danny E Miller9,10, David Jáspez11, José M Lorenzo-Salazar11, Adrián Muñoz-Barrera11, Luis A Rubio-Rodríguez11, Carlos Flores11,12,13, Giuseppe Narzisi14, Uday Shanker Evani14, Wayne E Clarke14, Joyce Lee15, Christopher E Mason16, Stephen E Lincoln17, Karen H Miga18, Mark T W Ebbert19,20,21, Alaina Shumate22,23, Heng Li2, Chen-Shan Chin24, Justin M Zook25, Fritz J Sedlazeck26

  1. Material Measurement Laboratory, National Institute of Standards and Technology, Gaithersburg, MD, USA.
  2. Department of Data Science, Dana-Farber Cancer Institute, Boston, MA, USA.
  3. DNAnexus, Inc., Mountain View, CA, USA.
  4. Pacific Biosciences, Menlo Park, CA, USA.
  5. Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, USA.
  6. National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA.
  7. Genomic Medicine Center, Children's Mercy Kansas City, Kansas City, MO, USA.
  8. Roche Sequencing Solutions, Santa Clara, CA, USA.
  9. Department of Pediatrics, Division of Genetic Medicine, University of Washington and Seattle Children's Hospital, Seattle, WA, USA.
  10. Department of Genome Sciences, University of Washington, Seattle, WA, USA.
  11. Genomics Division, Instituto Tecnológico y de Energías Renovables (ITER), Santa Cruz de Tenerife, Spain.
  12. CIBER de Enfermedades Respiratorias, Instituto de Salud Carlos III, Madrid, Spain.
  13. Research Unit, Hospital Universitario N.S. de Candelaria, Santa Cruz de Tenerife, Spain.
  14. New York Genome Center, New York, NY, USA.
  15. Bionano Genomics, San Diego, CA, USA.
  16. Department of Physiology and Biophysics, Weill Cornell Medicine, New York, NY, USA.
  17. Invitae, San Francisco, CA, USA.
  18. UC Santa Cruz Genomics Institute, University of California, Santa Cruz, Santa Cruz, CA, USA.
  19. Sanders-Brown Center on Aging, University of Kentucky, Lexington, KY, USA.
  20. Department of Internal Medicine, Division of Biomedical Informatics, University of Kentucky, Lexington, KY, USA.
  21. Department of Neuroscience, University of Kentucky, Lexington, KY, USA.
  22. Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA.
  23. Center for Computational Biology, Whiting School of Engineering, Johns Hopkins University, Baltimore, MD, USA.
  24. DNAnexus, Inc., Mountain View, CA, USA. jchin@dnanexus.com.
  25. Material Measurement Laboratory, National Institute of Standards and Technology, Gaithersburg, MD, USA. jzook@nist.gov.
  26. Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, USA. Fritz.Sedlazeck@bcm.edu.

Abstract

The repetitive nature and complexity of some medically relevant genes pose a challenge for their accurate analysis in a clinical setting. The Genome in a Bottle Consortium has provided variant benchmark sets, but these exclude nearly 400 medically relevant genes due to their repetitiveness or polymorphic complexity. Here, we characterize 273 of these 395 challenging autosomal genes using a haplotype-resolved whole-genome assembly. This curated benchmark reports over 17,000 single-nucleotide variations, 3,600 insertions and deletions and 200 structural variations for HG002 on each of the human reference genomes GRCh37 and GRCh38. We show that false duplications in either GRCh37 or GRCh38 result in reference-specific missed variants for short- and long-read technologies in medically relevant genes, including CBS, CRYAA and KCNE1. When these false duplications are masked, variant recall can improve from 8% to 100%. Forming benchmarks from a haplotype-resolved whole-genome assembly may become a prototype for future benchmarks covering the whole genome.

Presented by Justin Wagner