Single-cell RNA-seq (scRNA-seq) has transformed how we study cell types, states, and transitions. But it’s also surprisingly easy to get wrong. What looks like beautiful biology may be an artifact of a careless threshold, a wrong annotation, or tool misuse. We've seen this many times - in grant submissions, conference talks, even published papers.
This blog series is written for those who already use scRNA-seq, not for beginners. We’re not going to teach what Seurat or Scanpy does. Instead, we’ll walk through 12 common - but critical - mistakes we’ve seen even experienced groups make, and how top analysts approach things differently. Every section reflects painful lessons learned, including some from our own early projects. We hope they’ll save you from re-learning them the hard way.
Need help avoiding common scRNA-seq pitfalls? Our experienced bioinformatics team has rescued dozens of projects from review disasters, hidden artifacts, and analysis missteps. Request a free consultation →
Mistake 1: Trusting fixed QC thresholds

The problem:
Filtering cells with simple QC cutoffs - like mitochondrial percentage or gene count - seems harmless. But in many datasets, especially those containing stressed or rare cell types, these rules cut out real biology.
People apply fixed thresholds, like “remove cells with >10% MT” or “fewer than 500 genes,” thinking they’re cleaning the data. But these cutoffs may delete genuine cells undergoing stress, activation, or terminal differentiation. They may even remove the exact population of interest without anyone noticing.
What actually happens:
- A reviewer notices that a stress-related cell population was lost.
- A QC plot shows a bimodal distribution, but only the “high-quality” mode was kept.
- An important downstream cluster only appears when less strict thresholds are used.
- The most informative cells - those reacting to treatment - are exactly the ones that got filtered out.
Why this happens:
- Blind reuse of pipeline defaults: Many tools suggest generic thresholds, which don't generalize well across tissues or conditions.
- Failure to visualize distributions: Analysts skip plotting QC metrics in a cell-type-aware manner, so they don’t see when subpopulations cluster around “bad” QC ranges (see the sketch after this list).
- Lack of domain knowledge: Some tissues - like placenta, tumors, or bronchoalveolar lavage - naturally contain cells with high mitochondrial content.
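A minimal Scanpy sketch of cluster-aware QC inspection - the file path, cluster resolution, and mitochondrial-gene prefix are assumptions to adapt to your own data:

```python
import scanpy as sc

# Placeholder path - any AnnData with raw counts works here.
adata = sc.read_10x_h5("filtered_feature_bc_matrix.h5")
adata.var_names_make_unique()

# Flag mitochondrial genes ("MT-" assumes human; use "mt-" for mouse).
adata.var["mt"] = adata.var_names.str.startswith("MT-")
sc.pp.calculate_qc_metrics(adata, qc_vars=["mt"], inplace=True)

# Quick, permissive clustering BEFORE any hard filtering, so QC metrics
# can be inspected per putative population rather than globally.
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, n_top_genes=2000)
sc.pp.pca(adata, n_comps=30)
sc.pp.neighbors(adata)
sc.tl.leiden(adata, resolution=0.5)

# A cluster that sits entirely in a "bad" QC range may be real biology.
sc.pl.violin(adata, ["pct_counts_mt", "n_genes_by_counts"], groupby="leiden")
```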
What experienced analysts do differently:
- Always check QC distributions by cluster or sample, not globally.
- Use flexible, data-driven cutoffs - e.g., MAD-based outlier detection (see the sketch below) - not fixed numbers borrowed from tutorials.
- Consider the biology: if stress or mitochondrial activity is part of the hypothesis, don't remove such cells automatically.
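Here is a minimal sketch of MAD-based (median absolute deviation) outlier flagging, assuming the `adata` object and QC columns from the sketch above; the MAD multipliers are starting points, not recommendations:

```python
import numpy as np

def is_outlier(adata, metric: str, nmads: float = 5.0):
    """Flag cells more than `nmads` MADs from the median of a QC metric."""
    x = adata.obs[metric].to_numpy()
    med = np.median(x)
    mad = np.median(np.abs(x - med))
    return np.abs(x - med) > nmads * mad

# The columns below are produced by sc.pp.calculate_qc_metrics().
adata.obs["qc_outlier"] = (
    is_outlier(adata, "log1p_total_counts")
    | is_outlier(adata, "log1p_n_genes_by_counts")
    | is_outlier(adata, "pct_counts_mt", nmads=3.0)
)

# Inspect before dropping: if outliers pile up in one cluster,
# they may be biology, not junk.
print(adata.obs.groupby("leiden")["qc_outlier"].mean())
adata = adata[~adata.obs["qc_outlier"]].copy()
```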
Hard lesson learned:
QC is not just about removing “junk” - it’s about deciding which biology is real. We've seen cases where entire cell states vanished just because someone followed Seurat defaults without thinking. Real experts interrogate their QC metrics instead of blindly accepting them.
Mistake 2: Ignoring doublets and ambient RNA

The problem:
Doublets and ambient RNA are technical artifacts, but they masquerade as real biological signals. If not handled properly, they distort everything from clustering to differential expression, leading to very confident, very wrong conclusions.
Some cell types are more likely to form doublets. And in droplet-based platforms like 10x, ambient RNA - free-floating transcripts released from lysed cells - is captured in every droplet, cell-containing or not, contaminating the real signal.
What actually happens:
- A “new” cluster turns out to be B cells plus epithelial cell doublets.
- Marker genes seem confusing: one cluster expresses both T-cell and monocyte genes.
- A reviewer questions why lung surfactant genes show up in every single cluster.
- The DEGs in treated vs untreated samples reflect ambient cytokine transcripts, not a true response.
Why this happens:
- No doublet removal: Analysts forget - or skip - doublet detection tools like DoubletFinder or Scrublet, or they run them with unrealistic parameters (a minimal Scrublet sketch follows this list).
- No ambient RNA correction: SoupX and related tools can help, but many teams don’t use them, especially under time pressure.
- Misinterpretation of mixed signatures: People see strange combinations and assume novel cell types rather than technical noise.
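A minimal Scrublet sketch, run per sample on raw counts; the `expected_doublet_rate` depends on how many cells you loaded, so 0.06 here is only an assumption:

```python
import scanpy as sc
import scrublet as scr

# Placeholder path; run Scrublet on each sample separately, not on merged data.
adata = sc.read_10x_h5("filtered_feature_bc_matrix.h5")
adata.var_names_make_unique()

# Scrublet simulates doublets from your own counts and scores each
# barcode against them.
scrub = scr.Scrublet(adata.X, expected_doublet_rate=0.06)
doublet_scores, predicted_doublets = scrub.scrub_doublets(
    min_counts=2, min_cells=3, min_gene_variability_pctl=85, n_prin_comps=30
)
adata.obs["doublet_score"] = doublet_scores
adata.obs["predicted_doublet"] = predicted_doublets

# Check the score histogram before trusting the automatic threshold:
# the simulated-doublet mode should be clearly separable.
scrub.plot_histogram()
```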
What experienced analysts do differently:
- Always run at least one doublet detection tool - before clustering.
- Use biological knowledge to flag impossible combinations of marker genes.
- Correct ambient RNA if there's strong reason to suspect its presence - especially in inflamed or necrotic samples (see the sketch below).
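Dedicated tools (SoupX, CellBender, decontX) do the actual correction, but the core diagnostic is simple enough to sketch: estimate the ambient profile from nearly-empty droplets in the raw matrix and check whether it explains suspicious genes. The path and the 100-UMI cutoff below are assumptions:

```python
import numpy as np
import scanpy as sc

# The RAW (unfiltered) matrix contains the empty droplets we need.
raw = sc.read_10x_h5("raw_feature_bc_matrix.h5")  # placeholder path
raw.var_names_make_unique()

total = np.asarray(raw.X.sum(axis=1)).ravel()

# Barcodes with very few UMIs are assumed to carry only ambient RNA.
empty = (total > 0) & (total < 100)
ambient = np.asarray(raw.X[empty].sum(axis=0)).ravel()
ambient = ambient / ambient.sum()

# Genes dominating the soup - e.g., surfactant genes in lung samples.
# If a marker is high here, distrust its presence "in every cluster".
top = np.argsort(ambient)[::-1][:20]
print(raw.var_names[top].tolist())
```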
Hard lesson learned:
Many beautiful clusters are fakes. What makes them dangerous is that they’re reproducible - and reviewers may not catch them. You must.
Mistake 3: Running default pipelines on autopilot

The problem:
Default pipelines make scRNA-seq analysis accessible. But they also hide critical assumptions. Too many teams run tools like Seurat, Scanpy, or Harmony with out-of-the-box settings and never question what those settings mean.
Each step - normalization, HVG selection, PCA, neighbor graph, clustering - is filled with decisions. Defaults reflect someone’s guess, not your biology. If you don't understand them, you're not analyzing your data. You're watching someone else do it.
What actually happens:
- Changing “dims=30” to “dims=20” makes a cluster disappear.
- A reviewer asks why SCTransform was used instead of log-normalization.
- The same pipeline applied to tumor and blood samples fails on one of them.
- Interpretation hinges on a cluster that only appears with one clustering resolution.
Why this happens:
- Black-box use: Analysts run scripts without understanding what each function does.
- No sensitivity analysis: Pipelines are brittle, but teams never test how stable results are under different parameter choices.
- Tool mismatch: Some tools perform poorly on small sample sizes or high heterogeneity but are used anyway.
What experienced analysts do differently:
- Read the documentation, then test key parameters on your dataset.
- Don’t blindly trust the clustering result - try different resolutions, or use complementary methods (e.g., Leiden vs Louvain); a sensitivity sketch follows this list.
- Understand what your normalization method does to the variance structure before running DE tests.
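A minimal sensitivity sketch in Scanpy, assuming a preprocessed `adata` (normalized, log-transformed, PCA computed); the parameter grids are illustrative, not recommendations:

```python
import scanpy as sc
from sklearn.metrics import adjusted_rand_score

# Sweep the two knobs that most often move clusters around:
# the number of PCs behind the neighbor graph, and the resolution.
labels = {}
for n_pcs in (20, 30, 50):
    sc.pp.neighbors(adata, n_pcs=n_pcs)
    for res in (0.3, 0.5, 1.0):
        key = f"leiden_pcs{n_pcs}_res{res}"
        sc.tl.leiden(adata, resolution=res, key_added=key)
        labels[key] = adata.obs[key].to_numpy()

# Pairwise agreement: clusters worth building a story on should
# survive reasonable parameter changes (ARI close to 1).
keys = list(labels)
for i, a in enumerate(keys):
    for b in keys[i + 1:]:
        print(f"{a} vs {b}: ARI = {adjusted_rand_score(labels[a], labels[b]):.2f}")
```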
Hard lesson learned:
Default pipelines give you reproducibility - but not necessarily truth. They’re a starting point, not an end. We’ve seen clients build entire stories on Seurat defaults, only to watch them collapse under reviewer questions.
Mistake 4: Over-interpreting UMAP

The problem:
UMAP is powerful. It makes high-dimensional data look beautiful. But people forget - UMAP is not a microscope. It distorts. It simplifies. And it often misleads.
Too many teams interpret UMAP plots as “truth”: this cluster is close to that one, or this gradient means lineage. But UMAP reflects local structure and sampling density - not global geometry, and not necessarily biology.
What actually happens:
- A reviewer says: “Why do activated T cells lie closer to monocytes than naïve T cells?”
- UMAP shows a nice linear gradient, but known differentiation stages don’t match it.
- Cells from different tissues appear to “merge,” but batch labels explain everything.
- Entire cluster interpretations rest on inter-cluster distances that UMAP does not preserve.
Why this happens:
- Over-interpretation of UMAP: People use UMAP distance or orientation as if they reflect biology.
- No check with PCA or other embeddings: Teams don’t compare UMAP to raw PC space, where global structure is better preserved.
- Failure to test stability: UMAP is stochastic. Without fixing seeds and comparing runs, people get fooled by layout artifacts.
What experienced analysts do differently:
- Always interpret UMAP as qualitative, not quantitative.
- Validate spatial relationships using multiple embeddings - PCA, diffusion maps, force-directed graphs (see the sketch after this list).
- Be skeptical of clusters that rely only on UMAP shape without marker gene support.
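A minimal sketch of both checks, assuming `adata` already carries a neighbors graph and Leiden labels from earlier steps; `sc.tl.draw_graph` additionally needs the `fa2` package:

```python
import scanpy as sc

# 1) Stability: rerun UMAP with different seeds; real structure should
#    survive, layout artifacts will not.
for seed in (0, 1, 2):
    sc.tl.umap(adata, random_state=seed)
    sc.pl.umap(adata, color="leiden", title=f"UMAP, seed={seed}")

# 2) Complementary embeddings that preserve different kinds of structure:
sc.pl.pca(adata, color="leiden")      # linear, better for global variance
sc.tl.diffmap(adata)                  # diffusion map, for continuous processes
sc.pl.diffmap(adata, color="leiden")
sc.tl.draw_graph(adata)               # force-directed graph layout
sc.pl.draw_graph(adata, color="leiden")
```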
Hard lesson learned:
UMAP can be beautiful and deadly. It shows you patterns - true or not. Experienced analysts know that UMAP plots sell a story, but deeper analysis must test whether that story holds up.