
Feature: Add discovery mode for de novo barcode identification #9

Open
eos-jin wants to merge 39 commits into phipsonlab:main from eos-jin:main

eos-jin (Collaborator) commented Mar 30, 2026

Summary

Adds a two-pass discovery mode that allows NextClone to identify barcodes without requiring a pre-defined whitelist. Also fixes conda environment compatibility for WEHI HPC.

New Parameters

  • discovery_mode = false — set to true to enable barcode discovery
  • filter_discovered_barcodes = true — set to false to skip knee-plot filtering (recommended for datasets with a low expected number of clones)

How It Works

  1. Pass 1 (Discovery): Run Flexiplex without the -k flag, using -f 0 for a strict flanking-sequence match
  2. Filter (optional): Use flexiplex-filter to identify high-quality barcodes via the knee-plot inflection method. Can be disabled with --filter_discovered_barcodes false.
  3. Pass 2 (Mapping): Run Flexiplex with the discovered barcode list

Usage

# Discovery mode with knee-plot filtering (default)
nextflow run phipsonlab/Nextclone -r main --discovery_mode true

# Discovery mode without filtering (low clone count datasets)
nextflow run phipsonlab/Nextclone -r main --discovery_mode true --filter_discovered_barcodes false

# Whitelist mode (original behaviour — unchanged)
nextflow run phipsonlab/Nextclone -r main --clone_barcodes_reference /path/to/barcodes.txt

Files Changed

  • main.nf — workflow logic for two-pass discovery + parameter validation
  • nextflow.config — new discovery_mode and filter_discovered_barcodes parameters
  • modules/extract_sc_clone_barcodes.nf — discovery, filter, and no-filter merge processes
  • modules/extract_dnaseq_barcodes.nf — same for DNA-seq mode
  • conda_env/extract_sc_env.yaml — added flexiplex-filter pip dependency; removed defaults channel
  • conda_env/extract_dnaseq_env.yaml — removed defaults channel (HPC compatibility)
  • README.md — documentation for discovery mode parameters

Testing

  • ✅ Tested on WEHI HPC (scRNAseq, ZR751 senescence dataset): 218 processes, 24min 41s, 12 CPU hours
  • ✅ Discovery mode with filtering: 346,184 entries (vs 386,621 whitelist — ~10% fewer, expected from knee-plot conservatism)
  • --filter_discovered_barcodes false available for low clone count datasets

Backward Compatibility

Original whitelist mode (discovery_mode = false) is unchanged — this is purely additive.

eos-jin and others added 30 commits March 27, 2026 17:39
This adds a two-pass barcode discovery approach when clone_barcodes_reference
is not available or when users want to discover barcodes de novo:

Pass 1 (Discovery):
- Run Flexiplex WITHOUT -k flag to discover all potential barcodes
- Use -f 0 for strict flanking sequence match to reduce errors
- Outputs barcode counts file

Filtering:
- Use flexiplex-filter to select high-quality barcodes
- Applies knee-plot inflection point method
- Optionally intersect with 10x whitelist if tenx_whitelist is provided

Pass 2 (Mapping):
- Run Flexiplex WITH the discovered/filtered barcode list
- Uses standard edit distance parameters (-f and -e)

New parameters:
- discovery_mode (default: false) - enables two-pass discovery workflow
- tenx_whitelist (default: null) - optional 10x barcode whitelist for filtering

Backward compatibility:
- When discovery_mode=false, pipeline behaves exactly as before
- clone_barcodes_reference is required in whitelist mode (default)

This addresses reviewer feedback requesting support for experiments where
the barcode whitelist is not known in advance.
- Error if discovery_mode=false and no whitelist provided
- Warning if discovery_mode=true but whitelist also provided (ignored)
- Clear error messages with actionable guidance
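
The validation rules above can be sketched in Python as follows. This is only an illustration of the logic: the real checks live in main.nf (Nextflow/Groovy), and the function name `validate_params` and the message wording here are hypothetical.

```python
import warnings


def validate_params(discovery_mode, clone_barcodes_reference=None):
    """Hypothetical mirror of the checks described in this commit.

    Returns the mode that would run ("discovery" or "whitelist").
    """
    if not discovery_mode:
        if not clone_barcodes_reference:
            # Whitelist mode cannot run without a barcode reference.
            raise ValueError(
                "clone_barcodes_reference is required when discovery_mode=false; "
                "provide a whitelist or set --discovery_mode true"
            )
        return "whitelist"
    if clone_barcodes_reference:
        # Discovery mode ignores any whitelist that is also supplied.
        warnings.warn(
            "discovery_mode=true: clone_barcodes_reference will be ignored"
        )
    return "discovery"
```

Raising on the missing whitelist and only warning on the redundant one keeps the original whitelist behaviour strict while letting discovery runs proceed.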
Tests:
- test_synthetic_data_structure: Validates FASTQ format matches expected structure
- test_parameter_validation: Validates error/warning messages (requires Nextflow)
- test_flexiplex_discovery: End-to-end discovery mode test (requires Flexiplex)

Test data:
- whitelist_test.fastq.gz: 60 reads with 6 known barcodes (for whitelist mode)
- discovery_test.fastq.gz: 105 reads with 7 barcodes including novel ones (for discovery mode)
- expected_discovered_barcodes.txt: Ground truth for discovery mode validation

Run with: python tests/test_discovery_mode.py
Flexiplex requires both flanking sequences in the read:
  [5' adapter][BARCODE][3' adapter]
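
To make the read layout concrete, here is a minimal Python sketch that counts barcodes found between two flanks. It uses exact string matching as a rough stand-in for a strict (zero-edit) flank match and is not how Flexiplex itself searches reads; the function name, flank sequences, and barcode length are illustrative assumptions.

```python
from collections import Counter


def count_barcodes(reads, left_flank, right_flank, barcode_length):
    """Count barcodes sitting between two exactly matching flanks.

    Exact string matching is a simplification standing in for a strict
    (zero-edit) flank match; it is not Flexiplex's search algorithm.
    """
    counts = Counter()
    for read in reads:
        start = read.find(left_flank)
        if start == -1:
            continue  # 5' flank missing: read is skipped
        bc_start = start + len(left_flank)
        bc_end = bc_start + barcode_length
        # The 3' flank must follow the barcode immediately.
        if read[bc_end:bc_end + len(right_flank)] != right_flank:
            continue
        counts[read[bc_start:bc_end]] += 1
    return counts
```

A read that carries only one of the two flanks contributes nothing to the counts, which is the behaviour the synthetic test data has to respect.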

Test results:
- synthetic_data: ✅ PASS
- flexiplex_discovery: ✅ PASS (5/5 barcodes discovered)
- whitelist_mode: ✅ PASS (60/60 reads matched)
- param_validation: ⚠️ SKIP (Nextflow install issue on test machine)
Workflow fixes:
- Fix DSL2 compatibility (remove process reuse issue)
- Restructure DNAseq discovery mode to avoid calling trim/filter twice

Test improvements:
- Test all 4 workflow combinations (DNAseq/scRNAseq × discovery/whitelist)
- All Nextflow workflows validate successfully
- Flexiplex discovery finds 5/5 test barcodes

Test results (local):
- synthetic_data: ✅ PASS
- param_validation: ✅ PASS (all 4 workflows validated)
- flexiplex_discovery: ✅ PASS (5/5 barcodes)
Discovery mode now uses the knee-plot inflection method only (via flexiplex-filter).
This is a cleaner architecture, since the 10x cell barcode whitelist is unrelated to
clone barcode discovery.

Removed:
- tenx_whitelist parameter from nextflow.config
- Whitelist arguments from filter processes
- References in README

All 4 workflows still validated (DNAseq/scRNAseq × discovery/whitelist)
Previously, discovery mode always ran flexiplex-filter with knee-plot
filtering, silently discarding singleton and low-count clones. This is
incorrect for lineage tracing experiments where rare clones are
biologically meaningful.

Changes:
- Add filter_discovered_barcodes parameter (default: false)
- When false: pass --no-inflection to flexiplex-filter → keep ALL barcodes
- When true: apply knee-plot filtering (previous behaviour)
- Apply consistently to both scRNAseq and DNAseq discovery paths
- Document in README with rationale

Affects: sc_merge_discovered_barcodes, dnaseq_filter_discovered_barcodes
Two pure-Python (stdlib only) scripts that generate self-contained
interactive HTML reports from NextClone clone_barcodes.csv output.

reports/generate_report.py
- Single-run dashboard: sample overview table, ranked clone abundance
  (log scale), size distribution, top 20 clones, edit distance QC,
  cross-sample clonality comparison
- Usage: python3 generate_report.py clone_barcodes.csv --output report.html

reports/generate_comparison_report.py
- Side-by-side comparison of two runs (e.g. reference vs discovery mode)
- Shows Δ reads/cells/clones, ranked abundance overlay, clone overlap,
  clonality metrics, cell recovery validation
- Usage: python3 generate_comparison_report.py a.csv b.csv --label-a X --label-b Y

No pip installs required. Chart.js is loaded from a CDN.
- Add <select> dropdown above the detail section (consistent with
  single-run report UX)
- Auto-selects first sample on load
- Clicking a table row also syncs the dropdown
- Dropdown change selects sample and scrolls to detail charts
- Both interaction methods kept for flexibility
- Add generate_report process to sc_clone_barcodes module
- Calls reports/generate_report.py on clone_barcodes.csv output
- Runs automatically after both discovery and whitelist mode
- Output: nextclone_report.html published to params.publish_dir
- Add report_title param (optional, defaults to date-stamped title)
- Update README with:
  - Standard report: auto-generated, what's in it, how to customise title
  - Comparison report: manual step, full usage instructions, what's in it
Canvas elements inside display:none sections have zero dimensions when
Chart.js tries to render, resulting in blank charts. Fix: move the
auto-select from inline script execution to window.addEventListener('load')
so Chart.js is ready and the DOM is fully laid out before rendering.

Also skip scrollIntoView on initial auto-select (page doesn't jump).
- Remove empty 10x whitelist code block (feature was removed)
- Fix: filtering description now correctly says default keeps all barcodes
- Add Whitelist mode vs Discovery mode sections side by side
- Add full parameters table (was missing report_title, adapter params etc.)
- Unify discovery mode + barcode filtering into one cohesive section
- Clean up structure: Modes → Parameters → HTML Reports
…into single process

- sc_merge_discovered_barcodes already handles filter_discovered_barcodes param
- Fixes 'Cannot find component' error on WEHI HPC
- Simplifies discovery mode workflow
…ensity plot

New features (v2, 2026-04-09):
- Clone overlap table: shared clones across samples at ≥5,10,15,20,50,100 cells
- Heterogeneity metrics: Gini coefficient + Shannon index per sample
- Clone size density plot: KDE-style curve (log scale)
- Reversed top 20 clones: largest at top (easier to read)
- Updated sample table: added Gini + Shannon columns
- Summary bar: added average Gini + Shannon
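
One way to read the clone overlap table is sketched below in Python. It assumes a clone counts as shared between two samples at a threshold when it has at least that many cells in both; the commit does not spell out the definition, so treat this and the name `clone_overlap` as assumptions rather than the code in reports/generate_report.py.

```python
from itertools import combinations

THRESHOLDS = (5, 10, 15, 20, 50, 100)


def clone_overlap(cells_per_clone, thresholds=THRESHOLDS):
    """cells_per_clone: {sample: {barcode: n_cells}}.

    Returns {(sample_a, sample_b): {threshold: n_shared_clones}}, where a
    clone is shared at a threshold if it has >= threshold cells in both
    samples (an assumed definition).
    """
    result = {}
    for a, b in combinations(sorted(cells_per_clone), 2):
        row = {}
        for t in thresholds:
            big_a = {bc for bc, n in cells_per_clone[a].items() if n >= t}
            big_b = {bc for bc, n in cells_per_clone[b].items() if n >= t}
            row[t] = len(big_a & big_b)
        result[(a, b)] = row
    return result
```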

Implementation:
- compute_gini(): inequality metric (0=equal, 1=unequal)
- compute_shannon(): diversity metric (higher=more diverse)
- compute_clone_overlap(): cross-sample clone sharing at thresholds
- clone_size_density: binned log-scale distribution for KDE plot
- Updated HTML template with new sections and Chart.js visualizations

Backwards compatible: same CLI interface, enhanced output.
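
For readers unfamiliar with the two heterogeneity metrics, the following Python sketch shows the textbook definitions applied to per-clone cell counts. It is not copied from compute_gini() or compute_shannon() in the report script, and the use of the natural logarithm for the Shannon index is an assumption.

```python
import math


def gini(counts):
    """Gini coefficient of clone sizes: 0 = all equal, near 1 = one clone dominates."""
    values = sorted(c for c in counts if c > 0)
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        return 0.0
    # Standard formula on ascending-sorted values x_1..x_n:
    # G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n
    weighted = sum(i * x for i, x in enumerate(values, start=1))
    return 2.0 * weighted / (n * total) - (n + 1.0) / n


def shannon(counts):
    """Shannon index H = -sum(p * ln p) over clone proportions."""
    values = [c for c in counts if c > 0]
    total = sum(values)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in values)
```

A sample with four equally sized clones gives a Gini of 0 and a Shannon index of ln 4, while a sample dominated by one clone pushes Gini up and Shannon down.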
- Quick start examples (basic + custom output/title)
- NextClone integration example (from results directory)
- Full command-line options reference
- Multiple usage examples

Makes it clear how users can generate reports from CLI.
- Add v2 feature highlights (overlap table, Gini/Shannon, density plot)
- Add manual CLI report generation examples
- Link to reports/README.md for full documentation
- Keep auto-generation info (Nextflow integration)
Changes:
- Remove AVERAGE GINI COEFFICIENT and AVERAGE SHANNON INDEX from summary
- Keep per-sample Gini/Shannon in table (still useful)
- Parse run info from CSV header (#mode:, #command:, #parameters:)
- Fix run mode detection: show 'Run Mode Unknown' if not specified
- Fix Clone Size Density chart: set x-axis minimum = 0

CSV header format for run info:
  #mode: discovery
  #command: nextflow run main.nf --discovery_mode true
  #discovery_mode: true
  #barcode_edit_distance: 3
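
Reading that header back can be sketched in Python as below. The function names are hypothetical and the actual parsing in the report script may differ; the sketch assumes the comment lines sit at the top of the CSV and follow the `#key: value` shape shown above.

```python
def parse_run_info(lines):
    """Collect leading '#key: value' comment lines into a dict.

    Stops at the first line that is not a comment, so the CSV body is untouched.
    """
    info = {}
    for line in lines:
        if not line.startswith("#"):
            break
        key, sep, value = line[1:].partition(":")
        if sep:  # ignore comment lines without a colon
            info[key.strip()] = value.strip()
    return info


def run_mode_label(info):
    """Fall back to an explicit 'unknown' label when no mode is recorded."""
    return info.get("mode", "Run Mode Unknown")
```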
- Chart E (Cells per Sample): alphabetical order
- Chart F (Clonality Comparison): alphabetical order
- Overlap table: alphabetical column order
Changes:
- Add all_barcodes.txt: Contains ALL discovered barcodes (no filtering)
  - Useful for debugging and QC
  - Header: #barcode\tcount
- Add run_log.txt: Run parameters and command line for reproducibility
  - Includes all parameters used
  - Shows exact nextflow command
  - Documents output files
- Fix filtering bug: when filter_discovered_barcodes=false, no filtering is applied at all
  - Previously: flexiplex-filter --no-inflection still applied some filtering
  - Now: all_barcodes.txt is simply copied to filtered_barcodes.txt
- Add header to filtered_barcodes.txt: #barcode\tcount
- Update README: Document all output files

Recommended usage for lineage tracing:
  nextflow run main.nf --discovery_mode true --filter_discovered_barcodes false
  → Retains all barcodes including singletons/rare clones
…analysis)

ROOT CAUSE:
flexiplex-filter has DEFAULT BOUNDS even with --no-inflection:
- Default min-rank: 50 (only keeps top 50 barcodes by count!)
- Default max-rank: 95th percentile by count

From flexiplex docs:
> 'This automatic inflection will, by default, use:
>  - Lower bound (smallest rank to be searched): 50
>  - Upper bound (highest rank to be searched): the 95th percentile'

So even with --no-inflection, it was filtering out barcodes ranked >50!

FIX:
- When filter_discovered_barcodes=false: DON'T call flexiplex-filter at all
- Just copy combined_barcodes_counts.txt directly to filtered_barcodes.txt
- This preserves ALL barcodes including singletons and rare clones

TESTING:
With filter_discovered_barcodes=false, filtered_barcodes.txt should now contain
ALL barcodes (same as all_barcodes.txt), not just top 50.

Recommended for lineage tracing:
  nextflow run main.nf --discovery_mode true --filter_discovered_barcodes false
eos-jin added 9 commits April 10, 2026 11:35
Changes:
1. Report: Gini/Shannon to 2 decimal places (was 4)
   - fmt4() → fmt2() for heterogeneity metrics
   - Cleaner display, sufficient precision

2. Barcode files: Add explanatory header
   - all_barcodes.txt: Added 3-line header explaining columns
   - filtered_barcodes.txt: Same header
   - Header format:
     #barcode	count
     # barcode: lineage tracing barcode sequence
     # count: number of reads supporting this barcode

3. run_log.txt: Enhanced with versions + git info
   - Nextflow version
   - Flexiplex version
   - Python version
   - Git commit hash
   - Git branch
   - Full command line
   - All parameters
   - Output file descriptions

These changes address Alistair's feedback for reproducibility
and clarity in output files.
- useMamba = true (was false)
- Mamba is faster and more reliable than conda for env creation
- Fixes 'trim_galore: command not found' error on WEHI HPC

For Alistair to test:
1. Clear conda cache: rm -rf /vast/scratch/users/chalk.a/nextflow_local/conda_cache/
2. Clear work dir: rm -rf work/
3. Re-run: nextflow run main.nf --mode DNAseq ...

Mamba will create fresh conda envs with all tools properly in PATH.
- Recommended usage with timestamped publish_dir
- Example commands for DNA-seq and scRNA-seq modes
- Output file structure
- When to clear work/ directory
- No resume feature (per user request)
- Check if combined_barcodes_counts.txt is empty before proceeding
- Add -e flag to echo in filter_discovered_barcodes=false branch
- Add debug logging to diagnose filtered_barcodes.txt generation
- Exit with error if no barcodes discovered (fail fast)
…isabled

- Replace 'cat combined >> filtered' with 'cp all_barcodes.txt filtered_barcodes.txt'
- This ensures filtered_barcodes.txt is identical to all_barcodes.txt when filter_discovered_barcodes=false
- More reliable than append operation, avoids potential file descriptor issues
- Add validation to fail fast if copy fails
- Log input chunk counts and file sizes
- Track barcode counts at each processing step
- Show first 5 barcodes for verification
- Validate all intermediate files
- Report final file sizes and confirm identity
- Use set -e to fail fast on errors
- Clear [SC_MERGE] prefixed logs for easy grepping

This will help diagnose why filtered_barcodes.txt was empty
in previous runs despite all_barcodes.txt having content.
Nextflow triple-quoted strings treat $ as Groovy interpolation.
All bash variables and $(command) substitutions must be escaped as \$.
This caused the compilation error on the HPC (Nextflow 23.10.0).
BUG 1 - Wrong channel passed to Pass 2 (ROOT CAUSE):
- sc_merge_discovered_barcodes outputs TWO files: all_barcodes.txt [0] and filtered_barcodes.txt [1]
- Old code: ch_filtered_barcodes.first() defaulted to channel [0] = all_barcodes.txt
- Fix: Use named emit (filtered_barcodes) to explicitly select the correct channel
- This means Pass 2 was using all_barcodes.txt (with comment headers) instead of filtered_barcodes.txt

BUG 2 - Comment headers in barcode reference file:
- all_barcodes.txt had '#barcode\tcount' comment headers (fine for QC)
- filtered_barcodes.txt ALSO had comment headers - flexiplex cannot parse these as -k reference
- flexiplex expects raw 'barcode\tcount' format, no comments
- Fix: filtered_barcodes.txt now contains raw barcodes only (no headers)
- all_barcodes.txt keeps headers since it's only for QC/debugging

Also: added emit names to process outputs for clarity
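
The header fix under BUG 2 amounts to dropping comment lines before the file is handed to Pass 2. A minimal Python sketch of that step follows; the pipeline itself does this in the process's shell script, and the helper name `strip_comment_headers` is hypothetical.

```python
def strip_comment_headers(lines):
    """Return only the data lines, dropping '#' comments and blank lines."""
    return [
        line.rstrip("\n")
        for line in lines
        if line.strip() and not line.lstrip().startswith("#")
    ]
```

Applied to the annotated all_barcodes.txt content, this leaves only the tab-separated barcode and count rows.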
This process was never imported or used in main.nf.
sc_merge_discovered_barcodes handles both filter modes.