DNA Methylation Analysis Isn’t Just Another Genomic Assay - Here’s What Often Goes Wrong (and How to Fix It)

DNA methylation is one of the major epigenetic modifications. It influences development, aging, disease progression, and even cellular response to environmental exposure. However, when it comes to actual data analysis, the process is often more complicated than many people expect.

We have helped many collaborators working with WGBS, RRBS, EPIC arrays, and other methylation assays. The challenges they face are not always the same, but they often follow similar patterns.

This article is a summary of common issues in methylation data analysis. Hopefully it can be helpful for researchers who already have data, or plan to generate it.

Which Methylation Assay Are We Talking About? (It Matters)

There are several popular methods to measure DNA methylation, and they are not interchangeable. Each assay has its own resolution, coverage bias, and analysis workflow. Some measure genome-wide methylation with base-pair precision, others focus on selected CpG sites. This distinction becomes very important when choosing alignment, normalization, and downstream modeling tools.

Assay	Description	Typical Issues
WGBS (Whole-Genome Bisulfite Sequencing)	Genome-wide, base-resolution methylation	Sparse signal in some regions; over-smoothing risk
RRBS (Reduced Representation Bisulfite Sequencing)	Targets CpG-dense fragments	Coverage not uniform; some regulatory regions missed
EPIC array (850K CpG)	Array covering selected regulatory regions	Batch effect, probe bias, dye-channel correction needed
MeDIP-seq	Antibody enrichment for methylated DNA	Lower resolution; not ideal for quantification
oxBS-seq / TAB-seq	Separate 5mC from 5hmC	Difficult protocol; error-prone in low-coverage sites
scBS-seq / snmC-seq	Single-cell bisulfite sequencing	Extreme sparsity; needs advanced imputation methods

Assay choice determines not only data quality, but also which bioinformatics pipeline is appropriate. Applying a method that does not fit the data type can lead to an incorrect biological interpretation.

Common Challenges in Methylation Data Analysis

1. Normalization May Change Your Result Too Much

Whether you are using arrays or sequencing, normalization is always a difficult issue. If you normalize too aggressively, the biological difference may disappear. But if you don’t normalize, technical noise can dominate.

For EPIC arrays, quantile normalization is commonly used. But it assumes that the overall distribution of methylation values is similar across samples, which may not be true in cancer or developmental studies, where global methylation shifts can be real biology. For WGBS or RRBS, coverage normalization is still debated.

Suggestion: always look at the distribution of methylation beta values before and after normalization. If group-level differences shrink, think carefully about whether that is what you want.
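As a rough illustration of that check, the sketch below simulates beta values for three controls and three cases with a global methylation loss, then applies a naive quantile normalization written from scratch. The data, the helper names (quantile_normalize, group_gap, noisy), and the 0.8 scaling factor are all assumptions for illustration; this is not the procedure any particular array package implements.

```python
import random
import statistics

def quantile_normalize(samples):
    """Naive quantile normalization: map each sample onto the rank-wise mean distribution."""
    n_probes = len(samples[0])
    sorted_samples = [sorted(s) for s in samples]
    # Reference distribution: mean across samples at each rank.
    reference = [statistics.fmean(s[r] for s in sorted_samples) for r in range(n_probes)]
    normalized = []
    for s in samples:
        order = sorted(range(n_probes), key=lambda i: s[i])
        out = [0.0] * n_probes
        for rank, idx in enumerate(order):
            out[idx] = reference[rank]
        normalized.append(out)
    return normalized

def group_gap(samples, n_control):
    """Difference in mean beta between controls (first n_control samples) and cases."""
    control = statistics.fmean(statistics.fmean(s) for s in samples[:n_control])
    case = statistics.fmean(statistics.fmean(s) for s in samples[n_control:])
    return control - case

def noisy(values, scale):
    """Scale betas, add small technical noise, and clamp to [0, 1]."""
    return [min(1.0, max(0.0, v * scale + random.gauss(0, 0.02))) for v in values]

random.seed(0)
n_probes = 2000
# Simulated bimodal beta values (assumption: a Beta(0.5, 0.5) shape).
base = [random.betavariate(0.5, 0.5) for _ in range(n_probes)]

controls = [noisy(base, 1.0) for _ in range(3)]
cases = [noisy(base, 0.8) for _ in range(3)]  # simulated global hypomethylation
samples = controls + cases

gap_before = group_gap(samples, 3)
gap_after = group_gap(quantile_normalize(samples), 3)
print(f"mean beta gap before: {gap_before:.3f}, after: {gap_after:.3f}")
```

Because every normalized sample ends up with the same set of values, the global difference between groups is removed by construction. Whether that is a correction or a loss of signal depends on whether the shift was technical or biological.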

2. Wrong Choice of Differential Testing Method

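Some users try to use DESeq2 or edgeR on methylation counts. This habit usually carries over from RNA-seq, but these methods were not designed for bisulfite data. Each CpG yields a pair of counts (methylated reads and total reads), and the quantity of interest is their ratio, so the variance structure and the role of read coverage are quite different.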

There are better tools available, like bsseq, DSS, and methylKit. But they also have assumptions - about smoothing, dispersion, or spatial correlation - that may not be valid for every dataset.

Differential methylation is often more reliable at the region level (DMRs) than at individual CpG sites, especially when coverage is low or the sample size is small.
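One way to see why region-level estimates are more stable at low coverage is to compare confidence intervals for single CpGs against an interval for the pooled counts. The sketch below uses a Wilson score interval on made-up read counts for five neighbouring CpGs in one sample. It assumes the CpGs share a common methylation level and ignores variation between replicates, which dedicated tools model explicitly, so it illustrates the intuition and is not a DMR caller.

```python
import math

def wilson_interval(meth, total, z=1.96):
    """Wilson score interval for the proportion meth/total (about 95% for z=1.96)."""
    if total == 0:
        return (0.0, 1.0)
    p = meth / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return (center - half, center + half)

# Hypothetical (methylated_reads, total_reads) for five neighbouring CpGs.
region = [(3, 4), (2, 5), (4, 5), (3, 6), (5, 6)]

site_widths = []
for meth, total in region:
    lo, hi = wilson_interval(meth, total)
    site_widths.append(hi - lo)
    print(f"CpG {meth}/{total}: level={meth / total:.2f}, interval=({lo:.2f}, {hi:.2f})")

# Pool counts across the region before estimating the methylation level.
pooled_meth = sum(m for m, _ in region)
pooled_total = sum(t for _, t in region)
pooled_lo, pooled_hi = wilson_interval(pooled_meth, pooled_total)
pooled_width = pooled_hi - pooled_lo
print(f"Region {pooled_meth}/{pooled_total}: interval=({pooled_lo:.2f}, {pooled_hi:.2f})")
```

With only 4 to 6 reads per site, each single-CpG interval spans more than half of the 0 to 1 range, while the pooled interval is considerably narrower.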

3. Smoothing Helps - But Can Also Hide Patterns

For low-coverage WGBS, smoothing across nearby CpGs can reduce noise. But over-smoothing can hide sharp regulatory boundaries, such as those at TSSs or enhancer borders.

Some pipelines apply smoothing without considering CpG density or chromatin structure. This can create artifacts or cause real biological signal to be missed.

In our opinion, adaptive smoothing (or no smoothing) is better in sparse regions. Visual inspection is still important.
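The effect of bandwidth on a sharp boundary can be sketched with a simple coverage-weighted triangular kernel. This is a stand-in chosen for brevity, not the smoother used by any specific package, and the positions, coverage, and bandwidths below are invented: 20 CpGs spaced 100 bp apart, unmethylated on the left and fully methylated on the right.

```python
def smooth(positions, meth_reads, coverage, bandwidth):
    """Coverage-weighted triangular-kernel smoothing of per-CpG methylation levels."""
    smoothed = []
    for x in positions:
        num = den = 0.0
        for pos, m, c in zip(positions, meth_reads, coverage):
            # Kernel weight falls linearly to zero at `bandwidth` bp, scaled by coverage.
            w = max(0.0, 1 - abs(pos - x) / bandwidth) * c
            num += w * (m / c)
            den += w
        smoothed.append(num / den)
    return smoothed

# Hypothetical region with a sharp boundary between 900 bp and 1000 bp.
positions = [i * 100 for i in range(20)]
coverage = [10] * 20
meth_reads = [0] * 10 + [10] * 10

narrow = smooth(positions, meth_reads, coverage, bandwidth=150)
wide = smooth(positions, meth_reads, coverage, bandwidth=1000)

# Size of the step across the boundary (raw data: 1.0).
jump_narrow = narrow[10] - narrow[9]
jump_wide = wide[10] - wide[9]
print(f"step at boundary: narrow={jump_narrow:.2f}, wide={jump_wide:.2f}")
```

In this toy case the narrow bandwidth keeps most of the step, while the wide bandwidth flattens it to a small fraction of its true size. Plotting raw and smoothed values together over a few known boundaries is a quick way to pick a bandwidth that does not erase them.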

4. Naive DMR-to-Gene Assignment Can Mislead

In many papers, people assign DMRs to the nearest TSS. This is convenient, but can be misleading. Not all methylation changes affect the closest gene. Enhancers often regulate distal targets, and chromatin looping creates interactions that linear genomic distance does not capture.

Also, intergenic methylation is not necessarily non-functional - sometimes it reflects transposon regulation, or early developmental programming.

It is better to consider chromatin interaction data or public annotations such as FANTOM5 enhancers when assigning function to DMRs.
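To make the difference concrete, the sketch below compares a nearest-TSS lookup with a window-based lookup that returns every candidate gene near a DMR. The TSS coordinates, gene names, and the 50 kb window are hypothetical. A real analysis would replace the window with interaction or enhancer annotations, which this sketch does not attempt.

```python
import bisect

# Hypothetical TSS annotation for one chromosome, sorted by position.
tss = [(10_000, "GENE_A"), (55_000, "GENE_B"), (60_000, "GENE_C"), (250_000, "GENE_D")]
tss_positions = [pos for pos, _ in tss]

def nearest_gene(dmr_mid):
    """Return the single gene whose TSS is closest to the DMR midpoint."""
    i = bisect.bisect_left(tss_positions, dmr_mid)
    neighbours = tss[max(0, i - 1): i + 1]  # closest TSS on each side
    return min(neighbours, key=lambda t: abs(t[0] - dmr_mid))[1]

def genes_within(dmr_mid, window):
    """Return every gene with a TSS within `window` bp of the DMR midpoint."""
    lo = bisect.bisect_left(tss_positions, dmr_mid - window)
    hi = bisect.bisect_right(tss_positions, dmr_mid + window)
    return [gene for _, gene in tss[lo:hi]]

dmr_mid = 57_000
nearest = nearest_gene(dmr_mid)
candidates = genes_within(dmr_mid, 50_000)
print(f"nearest TSS: {nearest}; candidates within 50 kb: {candidates}")
```

The nearest-TSS rule reports one gene, although a second TSS lies only 3 kb away and a third sits within the window. Keeping the full candidate list makes it easier to check each one against expression or interaction data later.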

5. Array-Specific Pitfalls You Shouldn’t Ignore

The EPIC array is popular because of its cost and simplicity. But it also has well-known issues:

- Some probes overlap SNPs
- Some probes map to multiple genome locations
- Dye-bias correction can change values substantially

Ignoring these can cause false positives or false negatives. Tools like minfi or ChAMP can help, but you still need to check their assumptions.
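The first two issues are usually handled by filtering probes against an annotation table before any testing. The sketch below shows the idea with a tiny made-up annotation. The probe IDs and the flag names overlaps_snp and multi_mapped are hypothetical and do not correspond to columns in any real manifest or package.

```python
# Hypothetical probe annotation; probe IDs and flag names are invented.
probe_annotation = {
    "cg_example_01": {"overlaps_snp": False, "multi_mapped": False},
    "cg_example_02": {"overlaps_snp": True, "multi_mapped": False},
    "cg_example_03": {"overlaps_snp": False, "multi_mapped": True},
    "cg_example_04": {"overlaps_snp": True, "multi_mapped": True},
    "cg_example_05": {"overlaps_snp": False, "multi_mapped": False},
}

def filter_probes(annotation):
    """Split probes into kept IDs and a dict of dropped IDs with the reasons."""
    kept, dropped = [], {}
    for probe_id, flags in annotation.items():
        reasons = [name for name, flagged in flags.items() if flagged]
        if reasons:
            dropped[probe_id] = reasons
        else:
            kept.append(probe_id)
    return kept, dropped

kept, dropped = filter_probes(probe_annotation)
print(f"kept {len(kept)} probes, dropped {len(dropped)}")
for probe_id, reasons in dropped.items():
    print(f"  {probe_id}: {', '.join(reasons)}")
```

Recording why each probe was dropped, not just how many, makes it possible to report the filter in the methods section and to revisit it if a probe of interest disappears.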

6. Difficulties in Multi-Omics Integration

When people integrate methylation with RNA-seq or ATAC-seq, they often expect a direct correlation. But the biology is not that simple.

- Promoter methylation may repress gene expression - but not always
- Methylation at enhancers may only matter in specific cell types
- Bulk methylation may not reflect the true state in heterogeneous tissue

Some studies show strong anti-correlation between methylation and expression, but others don’t. That is not always an error; sometimes it reflects real complexity.

Best practice is to visualize each locus carefully, and not rely only on statistics or heatmaps.
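A per-locus correlation is a useful companion to visual inspection, as long as it is read one gene at a time. The sketch below computes a Spearman correlation from scratch for two hypothetical genes across six samples. All values are invented, and the helper names (ranks, spearman) are illustrative only.

```python
import math

def ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)

# Hypothetical promoter methylation (beta) for six samples, shared by both genes.
promoter_meth = [0.10, 0.20, 0.35, 0.50, 0.70, 0.90]
# Hypothetical expression (log scale) for two genes in the same samples.
expr_gene_1 = [9.0, 8.1, 7.5, 5.2, 3.3, 1.0]  # falls as methylation rises
expr_gene_2 = [5.0, 5.3, 4.8, 5.1, 4.9, 5.2]  # no clear relationship

rho_1 = spearman(promoter_meth, expr_gene_1)
rho_2 = spearman(promoter_meth, expr_gene_2)
print(f"gene 1 rho={rho_1:.2f}, gene 2 rho={rho_2:.2f}")
```

A near-zero coefficient like the second one does not by itself indicate a failed experiment. It is a prompt to look at the locus, the cell-type composition, and the region of the gene where the methylation change sits.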

A Few Words to End

We have seen many datasets from different labs - WGBS, RRBS, arrays, and sometimes single-cell methylation. Even when the protocol is similar, the results can be very different.

Sometimes the problem is technical: a poor library, a low conversion rate, or incomplete trimming. But many times, it is the analysis pipeline that causes confusion, either by making hidden assumptions or by applying RNA-seq-style methods that are not valid here.

Methylation analysis is not always hard. But it is very easy to make mistakes quietly. That’s why we think people should not just follow a standard pipeline without understanding what each step is doing.

We hope this post helps other researchers avoid common issues and gain more confidence in interpreting their methylation data.

About the author: Justin T. Li received his Ph.D. in Neurobiology from the University of Wisconsin–Madison and an M.S. in Computer Science from the University of Houston. He has published more than 50 articles in bioinformatics, computational biology, and neuroscience. Between 2004 and 2009, he served as an Assistant Professor at the University of Minnesota Medical School. Since 2013, he has been a Lead Bioinformatician at AccuraScience, where he has contributed to dozens of DNA methylation and epigenomics projects across both academic and industrial settings. His work emphasizes careful data interpretation and method adaptation - especially for complex or non-standard datasets.
