The problem:
Assigning cell types is a basic task - but a dangerous one. A single wrong label can distort every downstream result, from subcluster interpretation to trajectories to DEGs. Unfortunately, most annotations are done using marker genes alone, without robust verification or accounting for dataset-specific quirks.
In some tissues, markers are ambiguous. In others, they show up in unexpected cell states. Still, teams run SingleR or scType or a manual script and call it done.
What actually happens:
- A “fibroblast” cluster expresses CD45 (a pan-leukocyte marker) and CXCR4.
- A reviewer points out that the “neuron” cluster came from peripheral blood.
- Label transfer assigns thymocyte labels to cells in a bone marrow sample.
- Downstream interpretation collapses because annotations were off by one or two key clusters.
Why this happens:
- Over-reliance on markers: Teams use one or two genes to define cell identity - ignoring the broader expression profile.
- Poor reference data: Many annotation tools rely on public references that don’t match the tissue, age, or condition of your data.
- Transfer across incompatible platforms: Smart-seq2 vs 10X, human vs mouse, inflamed vs healthy - all create mismatches in expression scale and context.
What experienced analysts do differently:
- Use multiple marker panels, not single genes.
- Validate annotations with both gene-level and module-level expression (a minimal check is sketched after this list).
- When using automated tools, always manually check their assignments - no tool is perfect.
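A minimal sketch of that kind of check, assuming a Scanpy AnnData object `adata` with Leiden clusters and log-normalized expression; the marker panels below are illustrative placeholders, not a curated reference for any particular tissue:

```python
import scanpy as sc

# Illustrative marker panels - swap in panels curated for your tissue.
marker_panels = {
    "T cell":     ["CD3D", "CD3E", "TRAC", "IL7R"],
    "B cell":     ["CD79A", "MS4A1", "CD19"],
    "Fibroblast": ["COL1A1", "PDGFRA", "DCN", "LUM"],
    "Myeloid":    ["LYZ", "CD68", "ITGAM"],
}

# Gene-level view: one dot per marker per cluster, so a "fibroblast"
# cluster lighting up for CD3E or PTPRC is immediately visible.
sc.pl.dotplot(adata, marker_panels, groupby="leiden")

# Module-level view: score each panel as a signature and compare
# scores across clusters instead of trusting any single gene.
for cell_type, genes in marker_panels.items():
    sc.tl.score_genes(adata, gene_list=genes, score_name=f"{cell_type}_score")

score_cols = [f"{ct}_score" for ct in marker_panels]
print(adata.obs.groupby("leiden")[score_cols].mean().round(2))
```

If a cluster's single "defining" gene and its module score point in different directions, that is exactly the kind of ambiguity worth resolving before any downstream analysis.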
Hard lesson learned:
Annotation seems easy, but it’s one of the most error-prone steps in the whole workflow. Mislabeling early means wrong conclusions later. Experts spend more time verifying than labeling.
The problem:
Batch correction is essential - but also risky. Done improperly, it erases not just technical artifacts but also real biological variation. And once it’s removed, you can’t get it back.
Too often, teams apply strong correction methods - like Harmony or Seurat CCA - without testing what signal got flattened. Worse, they rely on “good-looking” UMAPs as evidence of success when, in fact, meaningful differences are gone.
What actually happens:
- Samples from different time points align perfectly - but lose treatment response.
- Tissue-specific subtypes become indistinguishable.
- A reviewer says: “Why does your disease cluster look identical to control?”
- Trajectory fails because real transitions were aligned away.
Why this happens:
- Overcorrection: Algorithms remove variance that’s actually biological, mistaking it for batch.
- No uncorrected comparison: Teams don’t compare pre- and post-correction to see what changed.
- Ignoring experimental design: If all cases were run in one batch and controls in another, then correction is almost guaranteed to erase the disease signal (a quick design check is sketched after this list).
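A one-line design check, assuming `adata.obs` carries `batch` and `condition` columns (both names are placeholders for whatever your metadata uses):

```python
import pandas as pd

# If each condition appears in only one batch, batch and biology are
# confounded and no correction method can separate them.
print(pd.crosstab(adata.obs["batch"], adata.obs["condition"]))
```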
What experienced analysts do differently:
- Always check correction impact by plotting marker genes and DEGs before/after (see the sketch after this list).
- Use lighter correction (e.g., MNN or BBKNN) when the risk of erasing biology is high.
- Be transparent - show reviewers what you did and what was lost or gained.
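One possible sketch of that before/after comparison, using Scanpy with Harmony (via harmonypy) as the example corrector; the `batch` and `condition` keys and the marker genes are placeholders for your own metadata:

```python
import scanpy as sc

markers = ["PTPRC", "EPCAM", "COL1A1"]  # placeholder genes of interest

# Uncorrected embedding: keep it around as the reference point.
sc.pp.pca(adata)
sc.pp.neighbors(adata, use_rep="X_pca")
sc.tl.umap(adata)
adata.obsm["X_umap_uncorrected"] = adata.obsm["X_umap"].copy()

# Harmony-corrected embedding on the same object.
sc.external.pp.harmony_integrate(adata, key="batch")
sc.pp.neighbors(adata, use_rep="X_pca_harmony")
sc.tl.umap(adata)

# Compare the same markers (and condition labels) on both embeddings;
# if disease vs. control structure disappears only after correction,
# that may be biology being erased, not a batch effect being removed.
sc.pl.embedding(adata, basis="X_umap_uncorrected", color=markers + ["condition"])
sc.pl.umap(adata, color=markers + ["condition"])
```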
Hard lesson learned:
Batch correction is not a cosmetic fix. It’s a blunt tool. When misused, it destroys the very signal you’re trying to study.
The problem:
Trajectory inference sounds deceptively simple: map out how cells transition from one state to another using static single-cell snapshots. Tools like Monocle, Slingshot, PAGA, and scVelo promise to reconstruct dynamic processes - like differentiation, activation, or treatment response - by “ordering” cells along a pseudotime axis.
But in real datasets, these arrows often point the wrong way.
What actually happens:
- In a paper draft, reviewers question whether the trajectory reflects true biology or an artifact of batch alignment.
- The direction of pseudotime contradicts known gene expression patterns or experimental expectations.
- Multiple tools produce different, even contradictory, trajectories - leaving the team unsure which to believe.
- A “branch” looks like a fate decision but turns out to reflect cell cycle or stress.
Why this happens:
- False continuity: Many methods assume that cells form a smooth manifold in gene space, but biology isn't always continuous - especially in immune cells or stress responses, which can flip expression programs rapidly.
- Inappropriate root choice: Often, analysts pick a “start point” arbitrarily, without verifying that marker genes support that origin. In tools like Monocle3, this flips the direction of the entire trajectory (a root-selection sketch follows this list).
- Improper cell selection: Including unrelated cell types, or cells from divergent experimental conditions, can distort the geometry of the manifold. Trajectory methods will still try to connect them.
- RNA velocity failure: Velocity-based methods (e.g., scVelo) rely on spliced/unspliced ratios, but these are notoriously noisy in certain cell types (e.g., epithelial, neurons) or poorly captured in 10X v3 libraries.
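As a sketch of deliberate root selection - shown here with Scanpy's diffusion pseudotime rather than Monocle3, purely for illustration, and with assumed progenitor markers and a `leiden` clustering column - the idea is to pick the root from a cluster that actually expresses the expected origin program, not from an arbitrary cell:

```python
import numpy as np
import scanpy as sc

# Score assumed progenitor markers and pick the root from the cluster
# where that signature is highest - not from whichever cell comes first.
sc.tl.score_genes(adata, gene_list=["CD34", "SPINK2"], score_name="progenitor_score")
root_cluster = adata.obs.groupby("leiden")["progenitor_score"].mean().idxmax()
adata.uns["iroot"] = int(np.flatnonzero(adata.obs["leiden"] == root_cluster)[0])

# Diffusion pseudotime from that root; choosing a different root can
# flip the ordering of the whole trajectory.
sc.pp.neighbors(adata)
sc.tl.diffmap(adata)
sc.tl.dpt(adata)
```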
What experienced analysts do differently:
- Subset carefully: Only include cell populations for which a plausible, testable progression exists - and separate trajectories by lineage if needed.
- Validate against known markers: Plot canonical genes along pseudotime and ask whether their expression increases or decreases as expected (see the sketch after this list).
- Compare multiple methods: Use pseudotime (e.g., Slingshot, Monocle3), graph-based (PAGA), and velocity-based (scVelo) approaches. If they disagree dramatically, that’s a red flag.
- Don’t force it: Some processes simply don’t form reliable trajectories. If the structure is too sparse or branching is ambiguous, abandon it - or switch to more robust approaches like gene module scoring or diffusion maps.
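A minimal version of that marker check, assuming a pseudotime column (`dpt_pseudotime` here) already exists in `adata.obs` and using illustrative early/late genes:

```python
import matplotlib.pyplot as plt
import scanpy as sc

# Genes whose direction of change along the trajectory is known a priori
# (illustrative: early progenitor vs. late erythroid markers).
genes = ["CD34", "SPINK2", "GATA1", "KLF1"]

# Pull pseudotime and expression into one table.
df = sc.get.obs_df(adata, keys=["dpt_pseudotime"] + genes)

fig, axes = plt.subplots(1, len(genes), figsize=(4 * len(genes), 3), sharex=True)
for ax, gene in zip(axes, genes):
    ax.scatter(df["dpt_pseudotime"], df[gene], s=2, alpha=0.3)
    ax.set_title(gene)
    ax.set_xlabel("pseudotime")
plt.tight_layout()
plt.show()
```

If the "early" genes rise and the "late" genes fall along pseudotime, the root or the trajectory itself deserves another look before any biological claim is built on it.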
Hard lesson learned:
Trajectory tools will always give you an output - even when the structure doesn’t support one. The hardest part is recognizing when that output is biologically meaningful versus geometrically imposed. In real projects, we’ve seen elegant pseudotime paths that reversed known biology or failed to reproduce under small parameter changes. The most skilled teams don’t just run trajectory analysis - they know when not to trust it.
The problem:
Differential expression (DE) in scRNA-seq is seductive. The tools are fast and the results arrive as tidy gene lists. But many teams forget a critical fact: cells are not independent samples. Treating them as such - what statisticians call pseudoreplication - leads to inflated p-values, false confidence, and irreproducible biology.
What actually happens:
- DE analysis between case and control finds 1,200 “significant” genes - even with only 2 mice per group.
- A reviewer asks: “Are you comparing mice, or cells?”
- Re-analysis using patient-level aggregation wipes out 95% of reported DEGs.
- Genes validated in vitro don’t match the published list at all.
Why this happens:
- Treating cells as replicates: Most scRNA-seq tools assume every cell is a sample. This breaks down when you have only a few biological units.
- No aggregation or mixed models: Analysts skip pseudobulk or fail to use tools like MAST or muscat that can account for donor effects.
- Misunderstanding of statistics: People report p-values without realizing they violate model assumptions.
What experienced analysts do differently:
- Always model the donor, not just the cell.
- Use pseudobulk approaches or mixed-effect models for hypothesis testing (a minimal pseudobulk sketch follows this list).
- Be honest about statistical power. If you have 2 donors per group, don’t pretend you’re doing real DE analysis.
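A minimal pseudobulk sketch, assuming raw counts live in `adata.layers["counts"]` and that `adata.obs` has `donor` and `cell_type` columns (all of these names are assumptions about your object): sum counts per donor within one cell type and hand the resulting table to a bulk framework such as edgeR or DESeq2, so the donor - not the cell - is the unit of replication.

```python
import numpy as np
import pandas as pd

def pseudobulk_counts(adata, cell_type, groupby="donor", layer="counts"):
    """Sum raw counts per donor for one cell type -> genes x donors table."""
    sub = adata[adata.obs["cell_type"] == cell_type]
    cols = {}
    for donor in sub.obs[groupby].unique():
        x = sub[sub.obs[groupby] == donor].layers[layer]
        cols[donor] = np.asarray(x.sum(axis=0)).ravel()  # sparse or dense
    return pd.DataFrame(cols, index=sub.var_names)

# One column per donor; pass this to edgeR/DESeq2 with a donor-level design,
# so n is the number of animals or patients, not the number of cells.
counts = pseudobulk_counts(adata, cell_type="T cell")
print(counts.shape)
```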
Hard lesson learned:
The illusion of high power comes from many cells, but it’s fake. True replication happens at the biological level. Real experts know this - and reviewers do too.