Long-read sequencing has changed how we think about genome assembly and structural variant detection. Platforms like PacBio HiFi and Oxford Nanopore now give us reads long enough to span complex repeats, detect large insertions or inversions, and resolve regions that short reads simply can’t touch. But these same technologies also create a new set of traps - some subtle, some devastating.
We’ve seen long-read genome projects that seemed successful on paper - high N50, great coverage, lots of SVs called - yet fell apart during interpretation, publication, or integration with other datasets. Poor tool choices, untested assumptions, and false confidence in default settings often cause real damage - sometimes without the team even noticing.
This article is not an introduction to long-read sequencing. It’s a hard-earned set of observations from real-world projects - on both PacBio and Nanopore: where they break, why they fail, and how the best teams prevent those mistakes. We also share our own lessons - including the ones learned the hard way.
Long-read sequencing is powerful - but unforgiving. Our team identifies weak spots in your assembly and SV analysis before they compromise your results. Request a free consultation →
The Problem
Too many people finish their long-read assembly and stop at the N50. If the number looks high - say 8 Mb for a mammalian genome - they assume success. But then problems appear: genes missing, alignments broken, duplicated scaffolds that shouldn’t be there. This is especially common in plant, amphibian, or hybrid genomes.
Why It Happens
- N50 rewards length, not accuracy or completeness. Assemblers can inflate N50 by erroneously joining regions.
- Teams rarely inspect alignment to known references or gene sets.
- Small misassemblies may go unnoticed - but have huge biological impact later.
Real Example
One group came to us after reviewers asked why a critical gene cluster showed up twice. They had a “good” HiFi assembly - 99.9% BUSCO, 9 Mb N50 - but we found a false duplication due to graph simplification. It affected two entire figures in their manuscript.
What We Do Differently
We never evaluate assemblies just by metrics. We align to reference genomes (if available), annotate gene models, and visually inspect tricky regions (like MHC, rDNA, or telomeric repeats). We also cross-check contigs using both long-read and short-read coverage to find suspicious duplications or collapses.
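As a rough illustration of that coverage cross-check, here is a minimal Python sketch using pysam. It assumes two indexed BAMs of long and short reads aligned to the same assembly; the file names and the 0.6x / 1.6x thresholds are placeholders, and for very large contigs you would sample windows rather than whole sequences.

```python
# Sketch: compare per-contig depth from a long-read BAM and a short-read BAM,
# both aligned to the same assembly, and flag contigs whose ratio to the
# assembly-wide median looks suspicious. Paths and thresholds are illustrative.
import statistics
import pysam

def contig_depths(bam_path):
    """Mean per-base depth for every contig in an indexed BAM.
    Note: count_coverage loads whole-contig arrays; for large contigs,
    restrict this to sampled windows instead."""
    depths = {}
    with pysam.AlignmentFile(bam_path) as bam:
        for contig, length in zip(bam.references, bam.lengths):
            acgt = bam.count_coverage(contig, 0, length)      # four arrays (A, C, G, T)
            total = sum(sum(track) for track in acgt)
            depths[contig] = total / length
    return depths

long_cov = contig_depths("assembly.longreads.bam")
short_cov = contig_depths("assembly.shortreads.bam")

long_median = statistics.median(long_cov.values())
short_median = statistics.median(short_cov.values())

for contig in long_cov:
    lr = long_cov[contig] / long_median if long_median else 0
    sr = short_cov.get(contig, 0) / short_median if short_median else 0
    # ~0.5x in both datasets suggests a false duplication; ~2x suggests a collapse.
    if lr < 0.6 and sr < 0.6:
        print(f"{contig}\tpossible false duplication (long {lr:.2f}x, short {sr:.2f}x of median)")
    elif lr > 1.6 and sr > 1.6:
        print(f"{contig}\tpossible collapsed repeat (long {lr:.2f}x, short {sr:.2f}x of median)")
```

Contigs flagged this way are exactly the ones we take into a browser for visual inspection, alongside the alignment and annotation checks described above.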
The Problem
Structural variant (SV) detection is one of the key advantages of long reads - but most SV callsets are unreliable. We’ve seen projects with 20,000 “SVs” in human samples - but no controls, no visual checks, and no evidence of accuracy. In many cases, the most interesting variants were missed, while artifacts made it into the final tables.
Why It Happens
- Tools like Sniffles, SVIM, or cuteSV are often used with default settings.
- Teams don’t compare across tools or verify with orthogonal data.
- Insertion/deletion breakpoints can shift substantially between callers and runs, making callset merging and integration difficult.
- Reference bias and alignment errors inflate or misclassify SVs. With Nanopore data in particular, error-prone homopolymer stretches and alignment ambiguity can exacerbate false positives in SV detection.
Real Example
A cancer genome project identified a novel 8 kb deletion in a known oncogene. The paper was submitted. Later, short-read alignment showed the deletion was never real - it was a soft-clipped alignment artifact caused by low-complexity sequence.
What We Do Differently
We use multiple SV callers and compare overlaps. We verify key SVs in IGV or Ribbon. We require support from both ends of the breakpoint when possible. For population-scale projects, we build custom filters based on sample metadata and coverage statistics. And when needed, we run short-read SV callers (like Manta or LUMPY) in parallel to test long-read predictions.
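To make the caller-concordance step concrete, here is a simplified sketch of how calls from two VCFs can be intersected by breakpoint proximity, SV type, and size. It is not a substitute for dedicated merging and benchmarking tools such as SURVIVOR or Truvari - the file names, 500 bp window, and 30% size tolerance are illustrative choices.

```python
# Sketch: keep only SVs that a second caller supports at a nearby breakpoint
# with a similar size. VCF paths and thresholds are placeholders.
import pysam

def load_svs(vcf_path):
    svs = []
    with pysam.VariantFile(vcf_path) as vcf:
        for rec in vcf:
            svtype = rec.info.get("SVTYPE")
            svlen = rec.info.get("SVLEN")
            if isinstance(svlen, tuple):          # pysam may return SVLEN as a tuple, depending on the header
                svlen = svlen[0]
            svs.append((rec.chrom, rec.pos, svtype, abs(svlen) if svlen else None))
    return svs

calls_a = load_svs("sniffles.vcf.gz")
calls_b = load_svs("cutesv.vcf.gz")

def supported(sv, others, window=500, size_tol=0.3):
    chrom, pos, svtype, svlen = sv
    for c, p, t, l in others:
        if c != chrom or t != svtype or abs(p - pos) > window:
            continue
        if svlen is None or l is None or abs(svlen - l) <= size_tol * max(svlen, l):
            return True
    return False

concordant = [sv for sv in calls_a if supported(sv, calls_b)]
print(f"{len(concordant)}/{len(calls_a)} Sniffles calls supported by cuteSV")
```

The concordant set is a starting point, not an endpoint: key variants still go to IGV or Ribbon for manual review.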
Long-read pipelines can mislead - if unchecked. Our experts identify hidden errors in genome assembly and SV detection before they derail your findings. Request a free consultation →
The Problem
Assemblers are not interchangeable. Flye, HiCanu, Raven, Shasta, and Miniasm all use different assumptions. Choosing the wrong one for your platform, genome size, or read quality guarantees trouble. Yet most teams just pick one based on speed or popularity. This is especially risky for Nanopore genome assembly, where tool behavior varies widely with read length and basecalling quality.
Why It Happens
- Documentation is unclear or outdated.
- Few teams benchmark with simulated reads or references.
- Many users treat assembler choice as a one-time decision.
Real Example
A fungal genome project used Flye for Illumina-corrected Nanopore reads. The assembly looked okay - until the team noticed that mitochondrial contigs were missing. Flye had filtered them out due to size thresholds. The authors didn't notice until journal proofs.
What We Do Differently
We don’t assume one assembler fits all. For PacBio HiFi, we favor HiCanu or Hifiasm. For ultra-long Nanopore, we test Flye, Shasta, or NECAT. We downsample and simulate where needed. And we always check assembler logs, alignment to trusted loci, and overlap with previous studies.
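Downsampling itself is simple; the point is to re-run the same assembler at several depths and compare the results. A minimal sketch, assuming a gzipped FASTQ and a fixed random seed so the subsample is reproducible between assembler runs (names and the 50% fraction are placeholders):

```python
# Sketch: keep a random fraction of reads from a FASTQ so the same assembler
# can be benchmarked at several depths. Seeding keeps the subsample stable.
import gzip
import random

random.seed(42)
fraction = 0.5

with gzip.open("nanopore.fastq.gz", "rt") as fin, \
     gzip.open("nanopore.50pct.fastq.gz", "wt") as fout:
    while True:
        record = [fin.readline() for _ in range(4)]   # FASTQ records are 4 lines
        if not record[0]:                             # end of file
            break
        if random.random() < fraction:
            fout.writelines(record)
```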
The Problem
Polishing is supposed to fix errors. But we’ve seen polishing steps that introduce more problems than they solve: frame-shifted genes, broken start codons, missing stop sites. These issues are especially dangerous because they’re hard to spot unless you’re looking.
Why It Happens
- Short-read polishing can rely on misaligned or low-confidence read mappings.
- Racon or Medaka is run for too many rounds, degrading the consensus instead of refining it.
- Polishing is done before proper repeat masking or error-aware alignment.
- Indel errors near homopolymers confuse aligners and tools.
Real Example
One team submitted a polished genome with unusually low CDS counts. It turned out that a polishing step had replaced the original coding sequence of dozens of genes with soft-clipped junk - because the reads were mapped with loose settings.
What We Do Differently
We align short reads with stringent filters before polishing. We compare gene annotations before and after polishing to identify shifts. We run BUSCO, but also inspect frame-retention across known protein-coding genes. And we limit polishing to one or two rounds, with careful version tracking.
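Here is one way the frame-retention check can be sketched with Biopython: translate the same CDS set extracted before and after polishing and report problems that appear only afterwards. The FASTA paths are placeholders, and matching record IDs between the two files are assumed.

```python
# Sketch: compare CDS integrity before and after polishing. Any new internal
# stop, lost start/stop codon, or broken frame is a red flag for the polisher.
from Bio import SeqIO

def cds_problems(fasta_path):
    problems = {}
    for rec in SeqIO.parse(fasta_path, "fasta"):
        protein = rec.seq.translate()                  # standard genetic code
        issues = []
        if not str(rec.seq).upper().startswith("ATG"):
            issues.append("no ATG start")
        if "*" in str(protein)[:-1]:
            issues.append("internal stop")
        if not str(protein).endswith("*"):
            issues.append("no terminal stop")
        if len(rec.seq) % 3 != 0:
            issues.append("length not divisible by 3")
        problems[rec.id] = issues
    return problems

before = cds_problems("cds_before_polish.fasta")
after = cds_problems("cds_after_polish.fasta")

for gene, issues in after.items():
    new_issues = set(issues) - set(before.get(gene, []))
    if new_issues:
        print(f"{gene}\tintroduced by polishing: {', '.join(sorted(new_issues))}")
```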
The Problem
Assemblers often collapse tandem repeats, rDNA arrays, centromeres, or large gene families. This can eliminate key biology - like immune receptor genes, transcription factor expansions, or mobile elements - without any warning in the QC report.
Why It Happens
- Many assemblers simplify graphs by collapsing similar sequences.
- Long but noisy reads - especially in Nanopore genome assembly projects - often fail to distinguish tandem copies.
- Repeat-rich regions are excluded from alignment-based QC, so collapse is not obvious.
- Tools like BUSCO can still report near-complete scores even when major gene families are missing.
Real Example
A team studying immune gene evolution had what seemed like a clean PacBio assembly. But when they ran VDJ annotation tools, most of the expected T-cell receptor gene segments were missing. The assembler had collapsed multiple paralogs into one - and nobody noticed until the biological results didn’t make sense.
What We Do Differently
We don’t rely only on global metrics. For any region known to harbor repeats or gene families, we map raw reads back and inspect coverage profiles. We use tools like RepeatMasker, Tandem Repeats Finder, and coverage peak analysis to flag possible collapses. When needed, we use phased assemblies or targeted local reassembly to recover the missing segments.
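A minimal sketch of the read-back coverage check for one suspect locus, again using pysam. The BAM, region coordinates, window size, and 1.75x threshold are hypothetical - roughly doubled depth over a window is the classic signature of two paralogs collapsed into one assembled copy.

```python
# Sketch: map raw reads back to the assembly, then report windows within a
# region of interest whose depth far exceeds the regional median.
import statistics
import pysam

bam_path = "raw_reads_vs_assembly.bam"                    # indexed BAM (placeholder)
contig, start, end = "contig_42", 1_000_000, 1_400_000    # hypothetical locus
window = 5_000

depths = []
with pysam.AlignmentFile(bam_path) as bam:
    for w_start in range(start, end, window):
        w_end = min(w_start + window, end)
        acgt = bam.count_coverage(contig, w_start, w_end)  # four per-base arrays
        depths.append((w_start, sum(sum(t) for t in acgt) / (w_end - w_start)))

median_depth = statistics.median(d for _, d in depths)
for w_start, depth in depths:
    if depth > 1.75 * median_depth:
        print(f"{contig}:{w_start}-{w_start + window}\t{depth:.1f}x "
              f"({depth / median_depth:.2f}x of regional median)")
```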
Assembly and SV tools aren’t magic - and errors are easy to miss. Our experts bring cross-platform experience to catch what automated tools can’t. Request a free consultation →
The Problem
Hybrid assemblies - combining long reads with short reads, or reads with Hi-C/optical maps - promise the best of all worlds. But in practice, they often inherit the worst. We’ve seen hybrid assemblies that introduce more misjoins, chimeras, or polishing errors than they fix.
Why It Happens
- Scaffolding tools can introduce misjoins across low-support regions.
- Short-read polishing can undo the benefits of long-read consensus.
- Inconsistent repeat resolution between data types causes artifacts.
- Teams don’t validate at each assembly stage.
Real Example
A group used long reads plus 10x Genomics linked reads and Hi-C to build a “chromosome-level” plant genome. It looked perfect - until linkage analysis showed that two chromosomes had been fused incorrectly. The Hi-C scaffolder had followed a spurious signal, and nobody verified it until the paper was nearly published. Similar errors can occur when layering HiFi with Nanopore, if scaffold junctions are not carefully validated.
What We Do Differently
We treat each data type as complementary - but not absolute. We scaffold in stages and inspect each link manually. We validate scaffolds using synteny with close species, genetic maps if available, and coverage plots. And we maintain assembly versions at each step so we can revert when needed.
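One of those link inspections can be partly automated. The sketch below locates N-gaps introduced by the scaffolder and counts long reads that fully span each junction; joins with little or no spanning-read support rest entirely on the Hi-C (or linked-read) signal and go on the manual review list. File names and the flank/read-count thresholds are placeholders.

```python
# Sketch: for every N-gap in the scaffolds, count raw long reads that align
# across the junction with generous flanks on both sides.
import re
import pysam
from Bio import SeqIO

MIN_FLANK = 2_000   # a read must extend this far on both sides of the gap

gaps = []
for rec in SeqIO.parse("hybrid_scaffolds.fasta", "fasta"):
    for match in re.finditer(r"N{10,}", str(rec.seq).upper()):
        gaps.append((rec.id, match.start(), match.end()))

with pysam.AlignmentFile("longreads_vs_scaffolds.bam") as bam:   # indexed BAM
    for scaffold, gap_start, gap_end in gaps:
        spanning = sum(
            1
            for read in bam.fetch(scaffold, gap_start, gap_end)
            if not read.is_secondary and not read.is_supplementary
            and read.reference_start < gap_start - MIN_FLANK
            and read.reference_end is not None
            and read.reference_end > gap_end + MIN_FLANK
        )
        flag = "" if spanning >= 3 else "\tREVIEW: little independent long-read support"
        print(f"{scaffold}:{gap_start}-{gap_end}\tspanning_reads={spanning}{flag}")
```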
The Problem
In microbial or organellar genomes, even small amounts of contamination can ruin conclusions. We’ve seen assemblies that accidentally include vector sequences, host nuclear fragments, or even entire bacterial contigs from other samples.
Why It Happens
- Teams don’t perform pre-assembly filtering rigorously.
- Contaminants may have higher coverage and get assembled preferentially.
- Assemblers don’t inherently distinguish host vs. foreign DNA.
- Post-assembly filtering is often skipped.
Real Example
One team submitted a high-quality mitochondrial genome for publication - with coverage depth, gene order, everything looking clean. But later someone noticed a region that matched a plasmid from E. coli. It turned out the lab used the same extraction column for both species. No filtering step had been applied.
What We Do Differently
For small genomes, we perform aggressive pre-assembly filtering using Kraken2, BMTagger, and manual taxonomic inspection. We also align final contigs to multiple databases (RefSeq, nt, UniVec). Any suspicious contigs - unexpected GC, coverage, or taxonomy - are flagged and investigated. And we use host genome subtraction if the sample source is known.
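Taxonomic classifiers do the heavy lifting here, but a quick GC screen often surfaces the same contigs first. A small sketch, assuming a final assembly FASTA; the 8-percentage-point deviation cutoff is an arbitrary illustration, and flagged contigs still need BLAST or Kraken2 confirmation.

```python
# Sketch: flag contigs whose GC content sits far from the assembly-wide
# median - a cheap first pass before taxonomic classification.
import statistics
from Bio import SeqIO

gc = {}
for rec in SeqIO.parse("final_assembly.fasta", "fasta"):
    seq = str(rec.seq).upper()
    if not seq:
        continue
    gc[rec.id] = 100.0 * (seq.count("G") + seq.count("C")) / len(seq)

median_gc = statistics.median(gc.values())
for contig, value in sorted(gc.items(), key=lambda kv: abs(kv[1] - median_gc), reverse=True):
    if abs(value - median_gc) > 8.0:
        print(f"{contig}\tGC {value:.1f}% vs median {median_gc:.1f}%\tcheck taxonomy")
```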
The Problem
Without ground truth - from simulated data, known standards, or orthogonal platforms - it’s hard to know if your assembly or SV calls are trustworthy. Many projects skip validation altogether, assuming that QC metrics and nice figures are enough.
Why It Happens
- Lack of access to validated truth sets or benchmark genomes.
- Simulation is time-consuming and often overlooked.
- Teams confuse QC metrics with true biological validation.
- Reviewers don’t always ask for orthogonal checks.
Real Example
A structural variant catalog of a model organism looked clean and was used to infer evolutionary trajectories. But another lab tried to reproduce the SVs using optical mapping - and only about 50% matched. The project had never validated SV calls outside of the long-read alignments.
What We Do Differently
We use both internal and public truth sets - like GIAB for human, or curated microbial genomes. We simulate reads using PBSIM, NanoSim, or wgsim to test pipeline performance. For SVs, we run cross-platform validation (e.g., short-read, optical map, Hi-C). And for every new species or sample type, we expect that part of the budget goes to validation - not just generation.
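As a toy example of the truth-set comparison, the sketch below computes recall and precision for SV calls against a benchmark set, matching by breakpoint window, type, and size. The TSV layout (chrom, pos, svtype, svlen columns), file names, and thresholds are assumptions - dedicated benchmarking tools do this more carefully, but the logic is the same.

```python
# Sketch: recall = fraction of truth SVs recovered; precision = fraction of
# calls supported by the truth set. Matching uses a breakpoint window plus a
# relative size tolerance.
import csv

def load_tsv(path):
    with open(path) as fh:
        return [(r["chrom"], int(r["pos"]), r["svtype"], abs(int(r["svlen"])))
                for r in csv.DictReader(fh, delimiter="\t")]

def matches(a, b, window=500, size_tol=0.3):
    return (a[0] == b[0] and a[2] == b[2]
            and abs(a[1] - b[1]) <= window
            and abs(a[3] - b[3]) <= size_tol * max(a[3], b[3]))

truth = load_tsv("truth_svs.tsv")
calls = load_tsv("longread_calls.tsv")

tp = sum(1 for t in truth if any(matches(t, c) for c in calls))
recall = tp / len(truth) if truth else 0.0
supported = sum(1 for c in calls if any(matches(c, t) for t in truth))
precision = supported / len(calls) if calls else 0.0
print(f"recall={recall:.2%}  precision={precision:.2%}")
```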
Long-read sequencing truly reshapes genome biology - but only when used with care. We’ve seen firsthand how easy it is to generate beautiful assemblies and striking SV catalogs that fall apart under scrutiny. Too often, teams treat long-read pipelines as black boxes, or trust default outputs without deeper checks. Metrics like N50, BUSCO, or variant counts are not guarantees of success. They are just starting points.
Our approach is rigorous because we’ve seen the consequences of shortcuts. We validate across platforms - including both PacBio and Nanopore - compare tools, inspect visually, and test biological plausibility. And we always keep in mind that an assembly is not just a file - it’s the foundation for everything that follows: annotation, modeling, drug targets, publication, and policy decisions.
We don’t claim perfection. But we’ve learned where the traps are, and we’ve built a process that catches many of them before they cause damage. If your long-read assembly or SV project feels “almost right” - or if the biology doesn’t make sense - it may be worth a second look.
Sometimes the biggest errors are the ones that look the cleanest on paper.
Your genome is only as good as its weakest step. We review every stage - from assembly to SV calls - to catch issues before they cause damage. Request a free consultation →