
Cytosine Methylation Variant Calling with MinION Nanopore Sequencing

Publication Date: 2016-05-17
Journal: eScholarship (California Digital Library)
Authors: Arthur C Rand

Arthur C. Rand, Miten Jain, Jordan Eizenga, Audrey Musselman-Brown, Hugh E. Olsen, Mark Akeson and Benedict Paten
Department of Biomolecular Engineering, University of California, Santa Cruz

Abstract

Chemical modifications to DNA regulate cellular state and function. The Oxford Nanopore MinION is a portable single-molecule DNA sequencer that can sequence long fragments of genomic DNA. Here we show that the MinION can be used to detect and map three cytosine variants: cytosine, 5-methylcytosine, and 5-hydroxymethylcytosine. We present a probabilistic method that enables expansion of the nucleotide alphabet to include bases containing chemical modifications. Our results on synthetic DNA show that individual cytosine base modifications can be classified with accuracy up to 95% in a three-way comparison and 98% in a two-way comparison. We also demonstrate that 5-methylcytosine can be accurately mapped in E. coli genomic DNA.

Base modification calling accuracy results on synthetic oligonucleotides

[Figure, panels A-D. Recoverable labels: Accuracy; Strand (Template, Complement); MLE and HDP for each of C, mC, and hmC; True Label.]

Nanopore Sequencing

[Figure: ionic current trace (pA versus time) for the example sequence ATGCACTGAACA, read as the overlapping 6-mers ATGCAC, TGCACT, GCACTG, CACTGA, ACTGAA, CTGAAC.]

A nanometer-sized protein pore is embedded in a membrane. The membrane separates two chambers containing an ionic solution. A voltage is applied, and the ionic current through the pore is recorded. DNA is threaded through the pore and partially blocks the ionic current. The level of the ionic current (e_j) is determined by six-nucleotide words (x_i).
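The relationship between a DNA sequence and its six-nucleotide words can be illustrated with a short sketch. It slides a window of length 6 across the example sequence from the figure; the function name `kmers` is a hypothetical helper, not part of the authors' software.

```python
def kmers(sequence, k=6):
    """Return the overlapping k-mers of `sequence`, in order along the strand."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


# Example sequence taken from the poster's nanopore figure.
words = kmers("ATGCACTGAACA")
for index, word in enumerate(words):
    print(index, word)
```

Each successive word shares five bases with its predecessor, which is why a single modified cytosine can influence up to six consecutive current levels.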
[Figure: two-level and three-level hierarchical Dirichlet process graphical models. Node labels: H, G_0, G_σn, G_σni, θ_ji; concentration parameters γ_B, γ_M, γ_L.]

[Figure: variable-order HMM meta-structure over an example reference containing two C* positions. Match states are drawn as 4-mers (TGTA, GTAC, TACG, ACGC, CGCT, GCTA, CTAA, TAAG) and are expanded with m and h superscripts on each C*.]

[Axis label: Mean pairwise Hellinger Distance.]

[Figure, panels A and B. Recoverable labels: PCR Reads; Genomic Reads; True Positive Rate.]

A. Data partitioning for HDP training on E. coli. 1,709 high-confidence methylated CCWGG sites (pins) were divided into training (unstarred) and test (starred) sets. The HDP is trained on reads from PCR-amplified DNA (orange lines) and on events aligned to the training sites from genomic DNA reads (magenta lines). These combined data constitute the training dataset (dashed box). The trained model is then tested on genomic and PCR DNA reads aligned to the test sites from separate flow-cells.

B. The ROC plot shows HMM-HDP two-way classification performance on cytosines in the test group (A, starred pins). Methylation calls are made by combining marginal probabilities from template and complement reads. Genomic reads were used to assess the true positive rate, and PCR reads were used to assess the false positive rate.

Comparison of different HDP topologies

[Table: Three-Way Accuracy. Columns: Model; Mean Accuracy (read); Median Accuracy (read); Mean Accuracy (site); Median Accuracy (site). Models: MLE, singlelevel, multiset, composition, middleNts, group.]

[Table: Two-Way Accuracy. Columns: Model; Mean Accuracy (read); Median Accuracy (read); Mean Accuracy (site); Median Accuracy (site). Models: singlelevel, multiset.]

MLE is the maximum likelihood estimate of a normal distribution. ‘Two-level’ is an HDP model with no subgroupings of 6-mers. ‘Multiset’, ‘Composition’, ‘MiddleNucleotides’, and ‘GroupMultiset’ are three-level HDP models. Three-way classification was performed between cytosine, 5-methylcytosine, and 5-hydroxymethylcytosine. Two-way classification was performed between cytosine and 5-methylcytosine.
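The ROC construction described in panel B, with the true positive rate taken from genomic reads and the false positive rate from PCR reads, can be sketched as follows. This is a minimal illustration under stated assumptions: the function names `roc_points` and `auc` are hypothetical, the scores are made-up numbers rather than results from the study, and how template and complement probabilities are combined into one score is not modeled here.

```python
def roc_points(genomic_scores, pcr_scores):
    """Return (false positive rate, true positive rate) pairs, one per threshold.

    genomic_scores: methylation probabilities at sites assumed methylated.
    pcr_scores: methylation probabilities at sites assumed unmethylated.
    """
    thresholds = sorted(set(genomic_scores) | set(pcr_scores), reverse=True)
    points = [(0.0, 0.0)]
    for threshold in thresholds:
        tpr = sum(s >= threshold for s in genomic_scores) / len(genomic_scores)
        fpr = sum(s >= threshold for s in pcr_scores) / len(pcr_scores)
        points.append((fpr, tpr))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Made-up scores for illustration only (not data from the poster).
genomic = [0.95, 0.90, 0.80, 0.40]
pcr = [0.10, 0.20, 0.30, 0.85]
points = roc_points(genomic, pcr)
print(points)
print(auc(points))
```

Using PCR-amplified DNA as the negative set works because amplification does not copy methyl marks, so any methylation call on a PCR read counts as a false positive.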
Mapping 5-methylcytosine in E. coli genomic DNA

[ROC plot axis label: False Positive Rate.]

The HDP more realistically models ionic current distributions

[Figure: ionic current distributions for the 6-mer AGCTAA. Row labels: KDE, MLE.]

A and B. The accuracy distribution by read (A) and by context (B) is shown for the MLE emission distributions and the ‘Multiset’ HDP model on synthetic oligonucleotides. The triangles represent the mean of the distribution. C. Confusion matrix showing HMM-HDP three-way cytosine classification performance on template reads of synthetic oligonucleotides. D. Scatter plot showing the correlation between the log-odds of correct classification and the mean pairwise Hellinger distance between the methylation statuses of the 6-mer distributions overlapping a cytosine.

A. Architecture of the hidden Markov model used in this study. The match state ‘M’ (square) emits an event-6-mer pair and proceeds along the reference, Insert-Y ‘Iy’ (diamond) emits a pair but stays in place, and Insert-X ‘Ix’ (circle) proceeds along the reference but does not emit a pair. Two-level (B) and three-level (C) hierarchical Dirichlet processes are shown in graphical form. Circles represent random variables. The base distribution ‘H’ is a normal-inverse-gamma distribution for both models. The Dirichlet processes ‘G_0’, ‘G_σn’, and ‘G_σni’ are parameterized by their parent distribution and the shared concentration parameters ‘γ_B’, ‘γ_M’, and ‘γ_L’. The factors ‘θ_ji’ specify the parameters of the normal distribution mixture component that generates observation ‘x_ji’. D. Variable-order HMM meta-structure over an example reference sequence. Each C in the reference represents a potentially methylated cytosine. The structure expands around the C* base to accommodate all possible methylation states. Each cell contains the three states shown in A, and transitions span between cells. The transitions are restricted so that methylation states are labeled consistently within a path.
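The expansion around C* bases can be illustrated by enumerating every consistent labeling of the starred cytosines and listing the k-mers each labeling produces. This is only a sketch of the bookkeeping, not the HMM itself. The reference string, the token spellings `Cm` and `Ch`, and the function names are assumptions chosen to resemble the poster's 4-mer drawing.

```python
from itertools import product

# Illustrative spellings: cytosine, 5-methylcytosine, 5-hydroxymethylcytosine.
VARIANTS = ("C", "Cm", "Ch")


def tokenize(reference):
    """Split a reference into base tokens, keeping '*' attached to its base."""
    tokens = []
    for char in reference:
        if char == "*":
            tokens[-1] += "*"
        else:
            tokens.append(char)
    return tokens


def expand(reference, k=4):
    """Map each labeling of the starred cytosines to its k-mer path."""
    tokens = tokenize(reference)
    starred = [i for i, token in enumerate(tokens) if token.endswith("*")]
    paths = {}
    for labels in product(VARIANTS, repeat=len(starred)):
        labeled = [token.rstrip("*") for token in tokens]
        for position, label in zip(starred, labels):
            labeled[position] = label
        # One label per cytosine is reused in every k-mer that overlaps it,
        # so each path is labeled consistently.
        paths[labels] = ["".join(labeled[j:j + k])
                         for j in range(len(labeled) - k + 1)]
    return paths


paths = expand("TGTAC*GC*TAAG")
for labels, path in paths.items():
    print(labels, path)
```

With two starred cytosines and three variants each, the sketch yields nine paths, and a k-mer that spans both cytosines appears in nine labeled forms.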
The match states are drawn with 4-mers for simplicity, but the model is implemented with 6-mers.

Modeling Ionic Current with a hidden Markov model

[Figure: HMM states and their emissions: M (x_i, e_j); Ix (x_i, -); Iy (-, e_j). Each event e_j carries μ, σ, t.]

[Axis labels from the accuracy figure: Predicted Label; HDP (Multiset); Log-odds of correct classification.]

[Figure: ionic current distributions for the 6-mers TTGCTG and GAACTT. Legend: C, mC, hmC.]

Probability distributions for three representative 6-mers by multiple methods. The first row shows the kernel density estimate (KDE). The middle row shows maximum likelihood estimated (MLE) normal distribution probability density functions. The bottom row shows probability density functions from the ‘Multiset’ hierarchical Dirichlet process (HDP). All data shown are from template reads.
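The difference between the first two rows of this figure can be sketched in a few lines: a single MLE normal fit versus a Gaussian kernel density estimate of the same event currents. The current values, the bandwidth of 0.5, and the function names are assumptions for illustration, and the HDP row is not reproduced here.

```python
import math
import statistics


def normal_pdf(x, mu, sigma):
    """Probability density of N(mu, sigma^2) at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))


def mle_normal(samples):
    """MLE of a normal: sample mean and population (1/n) standard deviation."""
    return statistics.fmean(samples), statistics.pstdev(samples)


def gaussian_kde(samples, bandwidth):
    """Return a density function that averages one Gaussian kernel per sample."""
    def density(x):
        return sum(normal_pdf(x, s, bandwidth) for s in samples) / len(samples)
    return density


# Made-up, deliberately bimodal event currents in pA (not data from the poster).
currents = [52.1, 52.4, 52.9, 53.0, 53.3, 57.8, 58.1, 58.4]
mu, sigma = mle_normal(currents)
kde = gaussian_kde(currents, bandwidth=0.5)

for x in (53.0, 55.5, 58.0):
    print(x, round(normal_pdf(x, mu, sigma), 4), round(kde(x), 4))
```

On bimodal data like this, the single normal places substantial density in the gap between the two modes while the KDE does not, which is the kind of mismatch a more flexible mixture model is meant to address.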