Feature: Add discovery mode for de novo barcode identification by eos-jin · Pull Request #9 · phipsonlab/NextClone

eos-jin · 2026-03-30T07:58:33Z

Summary

Adds a two-pass discovery mode that allows NextClone to identify barcodes without requiring a pre-defined whitelist. Also fixes conda environment compatibility for WEHI HPC.

New Parameters

discovery_mode = false — set to true to enable barcode discovery
filter_discovered_barcodes = true — set to false to skip knee-plot filtering (recommended for datasets with a low expected number of clones)

How It Works

Pass 1 (Discovery): Run Flexiplex without -k flag, using -f 0 for strict flanking match
Filter (optional): Use flexiplex-filter to identify high-quality barcodes via knee-plot inflection method. Can be disabled with --filter_discovered_barcodes false.
Pass 2 (Mapping): Run Flexiplex with the discovered barcode list
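
To illustrate why knee-plot filtering can drop low-count barcodes, here is a minimal sketch of one generic knee heuristic: on the log-rank versus log-count curve, pick the point farthest from the straight line joining the first and last points. This is an assumption made for illustration only and is not claimed to be the algorithm flexiplex-filter implements; the function names `knee_rank` and `filter_by_knee` are hypothetical.

```python
import math


def knee_rank(counts):
    """Return the 1-based rank of the knee in a list of barcode counts.

    Illustrative heuristic only: the knee is the point on the
    log10(rank) vs log10(count) curve that lies farthest from the
    straight line joining the first and last points.
    """
    ordered = sorted((c for c in counts if c > 0), reverse=True)
    if len(ordered) < 3:
        return len(ordered)
    xs = [math.log10(rank) for rank in range(1, len(ordered) + 1)]
    ys = [math.log10(count) for count in ordered]
    x0, y0, x1, y1 = xs[0], ys[0], xs[-1], ys[-1]
    norm = math.hypot(x1 - x0, y1 - y0)
    best_rank, best_dist = len(ordered), 0.0
    for rank, (x, y) in enumerate(zip(xs, ys), start=1):
        # Perpendicular distance from (x, y) to the end-to-end line.
        dist = abs((y1 - y0) * (x - x0) - (x1 - x0) * (y - y0)) / norm
        if dist > best_dist:
            best_rank, best_dist = rank, dist
    # A flat curve has no knee, so every barcode is kept.
    return best_rank


def filter_by_knee(barcode_counts):
    """Keep only the barcodes ranked at or above the knee."""
    ranked = sorted(barcode_counts.items(), key=lambda kv: kv[1], reverse=True)
    cutoff = knee_rank([count for _, count in ranked])
    return dict(ranked[:cutoff])
```

With four abundant barcodes and six rare ones, a heuristic like this keeps only the abundant group, which is the behaviour that motivates turning filtering off for datasets where rare clones matter.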

Usage

# Discovery mode with knee-plot filtering (default)
nextflow run phipsonlab/Nextclone -r main --discovery_mode true

# Discovery mode without filtering (low clone count datasets)
nextflow run phipsonlab/Nextclone -r main --discovery_mode true --filter_discovered_barcodes false

# Whitelist mode (original behaviour — unchanged)
nextflow run phipsonlab/Nextclone -r main --clone_barcodes_reference /path/to/barcodes.txt

Files Changed

main.nf — workflow logic for two-pass discovery + parameter validation
nextflow.config — new discovery_mode and filter_discovered_barcodes parameters
modules/extract_sc_clone_barcodes.nf — discovery, filter, and no-filter merge processes
modules/extract_dnaseq_barcodes.nf — same for DNA-seq mode
conda_env/extract_sc_env.yaml — added flexiplex-filter pip dependency; removed defaults channel
conda_env/extract_dnaseq_env.yaml — removed defaults channel (HPC compatibility)
README.md — documentation for discovery mode parameters

Testing

✅ Tested on WEHI HPC (scRNAseq, ZR751 senescence dataset): 218 processes, 24min 41s, 12 CPU hours
✅ Discovery mode with filtering: 346,184 entries (vs 386,621 whitelist — ~10% fewer, expected from knee-plot conservatism)
✅ --filter_discovered_barcodes false available for low clone count datasets

Backward Compatibility

Original whitelist mode (discovery_mode = false) is unchanged — this is purely additive.

This adds a two-pass barcode discovery approach when clone_barcodes_reference is not available or when users want to discover barcodes de novo:

Pass 1 (Discovery):
- Run Flexiplex WITHOUT -k flag to discover all potential barcodes
- Use -f 0 for strict flanking sequence match to reduce errors
- Outputs barcode counts file

Filtering:
- Use flexiplex-filter to select high-quality barcodes
- Applies knee-plot inflection point method
- Optionally intersect with 10x whitelist if tenx_whitelist is provided

Pass 2 (Mapping):
- Run Flexiplex WITH the discovered/filtered barcode list
- Uses standard edit distance parameters (-f and -e)

New parameters:
- discovery_mode (default: false) - enables two-pass discovery workflow
- tenx_whitelist (default: null) - optional 10x barcode whitelist for filtering

Backward compatibility:
- When discovery_mode=false, pipeline behaves exactly as before
- clone_barcodes_reference is required in whitelist mode (default)

This addresses reviewer feedback requesting support for experiments where the barcode whitelist is not known in advance.

- Error if discovery_mode=false and no whitelist provided
- Warning if discovery_mode=true but whitelist also provided (ignored)
- Clear error messages with actionable guidance
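
The validation rules above can be sketched in Python as follows. The pipeline itself implements them in main.nf; the function name `validate_params` and the message wording below are hypothetical and only mirror the two rules as stated.

```python
import warnings


def validate_params(discovery_mode, clone_barcodes_reference):
    """Return the whitelist path to use, or None when running discovery mode.

    Mirrors the two rules from the commit message; message text is illustrative.
    """
    if not discovery_mode and not clone_barcodes_reference:
        raise ValueError(
            "No barcode whitelist provided. Either pass "
            "--clone_barcodes_reference <file> or set --discovery_mode true."
        )
    if discovery_mode and clone_barcodes_reference:
        warnings.warn(
            "discovery_mode is true, so clone_barcodes_reference is ignored."
        )
        return None
    # Whitelist mode returns the path; plain discovery mode returns None.
    return clone_barcodes_reference if not discovery_mode else None
```

Returning `None` in discovery mode makes it explicit to the caller that any supplied whitelist is not used.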

Tests:
- test_synthetic_data_structure: Validates FASTQ format matches expected structure
- test_parameter_validation: Validates error/warning messages (requires Nextflow)
- test_flexiplex_discovery: End-to-end discovery mode test (requires Flexiplex)

Test data:
- whitelist_test.fastq.gz: 60 reads with 6 known barcodes (for whitelist mode)
- discovery_test.fastq.gz: 105 reads with 7 barcodes including novel ones (for discovery mode)
- expected_discovered_barcodes.txt: Ground truth for discovery mode validation

Run with: python tests/test_discovery_mode.py

Flexiplex requires both flanking sequences in the read: [5' adapter][BARCODE][3' adapter]

Test results:
- synthetic_data: ✅ PASS
- flexiplex_discovery: ✅ PASS (5/5 barcodes discovered)
- whitelist_mode: ✅ PASS (60/60 reads matched)
- param_validation: ⚠️ SKIP (Nextflow install issue on test machine)
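
The [5' adapter][BARCODE][3' adapter] read structure can be illustrated with a small exact-match extractor. This is only a sketch: it uses plain string search, whereas Flexiplex aligns flanks with an edit-distance tolerance, so it corresponds loosely to the strict `-f 0` case. The adapter sequences in the usage example are made up for illustration.

```python
from collections import Counter


def extract_barcode(read, left_flank, right_flank):
    """Return the sequence between the two flanks, or None if either is missing."""
    start = read.find(left_flank)
    if start == -1:
        return None
    barcode_start = start + len(left_flank)
    end = read.find(right_flank, barcode_start)
    if end == -1:
        return None
    return read[barcode_start:end]


def count_barcodes(reads, left_flank, right_flank):
    """Count barcodes over all reads that contain both flanks."""
    counts = Counter()
    for read in reads:
        barcode = extract_barcode(read, left_flank, right_flank)
        if barcode:
            counts[barcode] += 1
    return counts
```

For example, with hypothetical flanks `CTTGGCT` and `ATCGATA`, a read `NNNNCTTGGCTGGTTAACCGGTTAACCATCGATANN` yields the barcode `GGTTAACCGGTTAACC`, while a read missing the 3' flank is skipped.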

Workflow fixes:
- Fix DSL2 compatibility (remove process reuse issue)
- Restructure DNAseq discovery mode to avoid calling trim/filter twice

Test improvements:
- Test all 4 workflow combinations (DNAseq/scRNAseq × discovery/whitelist)
- All Nextflow workflows validate successfully
- Flexiplex discovery finds 5/5 test barcodes

Test results (local):
- synthetic_data: ✅ PASS
- param_validation: ✅ PASS (all 4 workflows validated)
- flexiplex_discovery: ✅ PASS (5/5 barcodes)

Discovery mode now uses knee-plot inflection method only (via flexiplex-filter). This is cleaner architecture - the 10x cell barcode whitelist is unrelated to clone barcode discovery.

Removed:
- tenx_whitelist parameter from nextflow.config
- Whitelist arguments from filter processes
- References in README

All 4 workflows still validated (DNAseq/scRNAseq × discovery/whitelist)

Previously, discovery mode always ran flexiplex-filter with knee-plot filtering, silently discarding singleton and low-count clones. This is incorrect for lineage tracing experiments where rare clones are biologically meaningful.

Changes:
- Add filter_discovered_barcodes parameter (default: false)
- When false: pass --no-inflection to flexiplex-filter → keep ALL barcodes
- When true: apply knee-plot filtering (previous behaviour)
- Apply consistently to both scRNAseq and DNAseq discovery paths
- Document in README with rationale

Affects: sc_merge_discovered_barcodes, dnaseq_filter_discovered_barcodes

Two pure-Python (stdlib only) scripts that generate self-contained interactive HTML reports from NextClone clone_barcodes.csv output.

reports/generate_report.py
- Single-run dashboard: sample overview table, ranked clone abundance (log scale), size distribution, top 20 clones, edit distance QC, cross-sample clonality comparison
- Usage: python3 generate_report.py clone_barcodes.csv --output report.html

reports/generate_comparison_report.py
- Side-by-side comparison of two runs (e.g. reference vs discovery mode)
- Shows Δ reads/cells/clones, ranked abundance overlay, clone overlap, clonality metrics, cell recovery validation
- Usage: python3 generate_comparison_report.py a.csv b.csv --label-a X --label-b Y

No pip installs required. Chart.js loaded from CDN.

- Add <select> dropdown above the detail section (consistent with single-run report UX)
- Auto-selects first sample on load
- Clicking a table row also syncs the dropdown
- Dropdown change selects sample and scrolls to detail charts
- Both interaction methods kept for flexibility

- Add generate_report process to sc_clone_barcodes module
- Calls reports/generate_report.py on clone_barcodes.csv output
- Runs automatically after both discovery and whitelist mode
- Output: nextclone_report.html published to params.publish_dir
- Add report_title param (optional, defaults to date-stamped title)
- Update README with:
  - Standard report: auto-generated, what's in it, how to customise title
  - Comparison report: manual step, full usage instructions, what's in it

Canvas elements inside display:none sections have zero dimensions when Chart.js tries to render, resulting in blank charts. Fix: move the auto-select from inline script execution to window.addEventListener('load') so Chart.js is ready and the DOM is fully laid out before rendering. Also skip scrollIntoView on initial auto-select (page doesn't jump).

- Remove empty 10x whitelist code block (feature was removed)
- Fix: filtering description now correctly says default keeps all barcodes
- Add Whitelist mode vs Discovery mode sections side by side
- Add full parameters table (was missing report_title, adapter params etc.)
- Unify discovery mode + barcode filtering into one cohesive section
- Clean up structure: Modes → Parameters → HTML Reports

…into single process
- sc_merge_discovered_barcodes already handles filter_discovered_barcodes param
- Fixes 'Cannot find component' error on WEHI HPC
- Simplifies discovery mode workflow

…ensity plot

New features (v2, 2026-04-09):
- Clone overlap table: shared clones across samples at ≥5,10,15,20,50,100 cells
- Heterogeneity metrics: Gini coefficient + Shannon index per sample
- Clone size density plot: KDE-style curve (log scale)
- Reversed top 20 clones: largest at top (easier to read)
- Updated sample table: added Gini + Shannon columns
- Summary bar: added average Gini + Shannon

Implementation:
- compute_gini(): inequality metric (0=equal, 1=unequal)
- compute_shannon(): diversity metric (higher=more diverse)
- compute_clone_overlap(): cross-sample clone sharing at thresholds
- clone_size_density: binned log-scale distribution for KDE plot
- Updated HTML template with new sections and Chart.js visualizations

Backwards compatible: same CLI interface, enhanced output.
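
For readers unfamiliar with the two heterogeneity metrics, here is a minimal stdlib sketch using the textbook definitions. The function names match those listed above, but the bodies are an illustration and not necessarily the code in reports/generate_report.py; in particular, the natural-log base for Shannon and the exclusion of zero-sized clones are assumptions.

```python
import math


def compute_gini(clone_sizes):
    """Gini coefficient of clone sizes: 0 means all clones are equal in size."""
    values = sorted(size for size in clone_sizes if size > 0)
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        return 0.0
    # Standard formula over ascending-sorted values with 1-based ranks.
    weighted = sum(rank * value for rank, value in enumerate(values, start=1))
    return (2 * weighted) / (n * total) - (n + 1) / n


def compute_shannon(clone_sizes):
    """Shannon index (natural log) of clone sizes: higher means more diverse."""
    values = [size for size in clone_sizes if size > 0]
    total = sum(values)
    if total == 0:
        return 0.0
    return -sum((v / total) * math.log(v / total) for v in values)
```

Four equally sized clones give a Gini of 0 and a Shannon index of ln 4, while one dominant clone pushes Gini up and Shannon down. Note that with n clones this form of the Gini coefficient can reach at most (n - 1) / n, so it approaches 1 only for large n.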

- Quick start examples (basic + custom output/title)
- NextClone integration example (from results directory)
- Full command-line options reference
- Multiple usage examples

Makes it clear how users can generate reports from CLI.

- Add v2 feature highlights (overlap table, Gini/Shannon, density plot)
- Add manual CLI report generation examples
- Link to reports/README.md for full documentation
- Keep auto-generation info (Nextflow integration)

Changes:
- Remove AVERAGE GINI COEFFICIENT and AVERAGE SHANNON INDEX from summary
- Keep per-sample Gini/Shannon in table (still useful)
- Parse run info from CSV header (#mode:, #command:, #parameters:)
- Fix run mode detection: show 'Run Mode Unknown' if not specified
- Fix Clone Size Density chart: set x-axis minimum = 0

CSV header format for run info:
#mode: discovery
#command: nextflow run main.nf --discovery_mode true
#discovery_mode: true
#barcode_edit_distance: 3
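
A header in this `#key: value` form can be parsed with a few lines of Python. The sketch below is an assumption about how such parsing might look, not the code in the report script; `parse_run_info` and `run_mode_label` are hypothetical names, and only the 'Run Mode Unknown' fallback text comes from the commit message.

```python
def parse_run_info(lines):
    """Collect leading '#key: value' lines into a dict, stopping at the first data line."""
    info = {}
    for line in lines:
        if not line.startswith("#"):
            break
        # Split on the first colon only, so values may themselves contain colons.
        key, sep, value = line[1:].partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


def run_mode_label(info):
    """Return the recorded mode, or the fallback label when none was recorded."""
    return info.get("mode") or "Run Mode Unknown"
```

Applied to the example header above, `parse_run_info` returns four keys, with `mode` set to `discovery`; a CSV with no header lines yields an empty dict and the fallback label.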

- Chart E (Cells per Sample): alphabetical order
- Chart F (Clonality Comparison): alphabetical order
- Overlap table: alphabetical column order
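
One plausible way to build an overlap table with alphabetically ordered samples is sketched below. The exact definition of "shared" used by the report is not given in this PR, so the sketch assumes a clone counts as shared between two samples when it reaches the cell threshold in both. The function name matches compute_clone_overlap() from the v2 notes, but the signature and return shape are hypothetical.

```python
from itertools import combinations

# Thresholds listed in the v2 report notes.
THRESHOLDS = (5, 10, 15, 20, 50, 100)


def compute_clone_overlap(cells_per_clone, thresholds=THRESHOLDS):
    """Count clones shared by each sample pair at each cell-count threshold.

    cells_per_clone maps sample name -> {barcode: number of cells}.
    Returns {threshold: {(sample_a, sample_b): shared clone count}} with
    sample pairs in alphabetical order.
    """
    samples = sorted(cells_per_clone)
    table = {}
    for threshold in thresholds:
        kept = {
            sample: {bc for bc, n in cells_per_clone[sample].items() if n >= threshold}
            for sample in samples
        }
        table[threshold] = {
            (a, b): len(kept[a] & kept[b]) for a, b in combinations(samples, 2)
        }
    return table
```

Sorting the sample names once up front is what keeps rows and columns in a stable alphabetical order regardless of the order in which samples appear in the input.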

Changes:
- Add all_barcodes.txt: Contains ALL discovered barcodes (no filtering)
  - Useful for debugging and QC
  - Header: #barcode\tcount
- Add run_log.txt: Run parameters and command line for reproducibility
  - Includes all parameters used
  - Shows exact nextflow command
  - Documents output files
- Fix filtering bug: When filter_discovered_barcodes=false, truly no filtering
  - Previous: flexiplex-filter --no-inflection still applied some filtering
  - Now: Simply copy all_barcodes.txt to filtered_barcodes.txt
- Add header to filtered_barcodes.txt: #barcode\tcount
- Update README: Document all output files

Recommended usage for lineage tracing:
nextflow run main.nf --discovery_mode true --filter_discovered_barcodes false
→ Retains all barcodes including singletons/rare clones

…analysis)

ROOT CAUSE: flexiplex-filter has DEFAULT BOUNDS even with --no-inflection:
- Default min-rank: 50 (only keeps top 50 barcodes by count!)
- Default max-rank: 95th percentile by count

From flexiplex docs:
> 'This automatic inflection will, by default, use:
> - Lower bound (smallest rank to be searched): 50
> - Upper bound (highest rank to be searched): the 95th percentile'

So even with --no-inflection, it was filtering out barcodes ranked >50!

FIX:
- When filter_discovered_barcodes=false: DON'T call flexiplex-filter at all
- Just copy combined_barcodes_counts.txt directly to filtered_barcodes.txt
- This preserves ALL barcodes including singletons and rare clones

TESTING: With filter_discovered_barcodes=false, filtered_barcodes.txt should now contain ALL barcodes (same as all_barcodes.txt), not just top 50.

Recommended for lineage tracing:
nextflow run main.nf --discovery_mode true --filter_discovered_barcodes false

Changes:
1. Report: Gini/Shannon to 2 decimal places (was 4)
   - fmt4() → fmt2() for heterogeneity metrics
   - Cleaner display, sufficient precision
2. Barcode files: Add explanatory header
   - all_barcodes.txt: Added 3-line header explaining columns
   - filtered_barcodes.txt: Same header
   - Header format:
     #barcode count
     # barcode: lineage tracing barcode sequence
     # count: number of reads supporting this barcode
3. run_log.txt: Enhanced with versions + git info
   - Nextflow version
   - Flexiplex version
   - Python version
   - Git commit hash
   - Git branch
   - Full command line
   - All parameters
   - Output file descriptions

These changes address Alistair's feedback for reproducibility and clarity in output files.

- useMamba = true (was false)
- Mamba is faster and more reliable than conda for env creation
- Fixes 'trim_galore: command not found' error on WEHI HPC

For Alistair to test:
1. Clear conda cache: rm -rf /vast/scratch/users/chalk.a/nextflow_local/conda_cache/
2. Clear work dir: rm -rf work/
3. Re-run: nextflow run main.nf --mode DNAseq ...

Mamba will create fresh conda envs with all tools properly in PATH.

- Recommended usage with timestamped publish_dir
- Example commands for DNA-seq and scRNA-seq modes
- Output file structure
- When to clear work/ directory
- No resume feature (per user request)

- Check if combined_barcodes_counts.txt is empty before proceeding
- Add -e flag to echo in filter_discovered_barcodes=false branch
- Add debug logging to diagnose filtered_barcodes.txt generation
- Exit with error if no barcodes discovered (fail fast)

…isabled
- Replace 'cat combined >> filtered' with 'cp all_barcodes.txt filtered_barcodes.txt'
- This ensures filtered_barcodes.txt is identical to all_barcodes.txt when filter_discovered_barcodes=false
- More reliable than append operation, avoids potential file descriptor issues
- Add validation to fail fast if copy fails
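
The copy-then-validate behaviour described above is implemented in the pipeline's bash script block; the same idea can be sketched in Python for clarity. The function name `copy_unfiltered` and the error messages are hypothetical.

```python
import shutil
from pathlib import Path


def copy_unfiltered(all_barcodes, filtered_barcodes):
    """Copy the full barcode list to the filtered path, failing fast on problems."""
    src = Path(all_barcodes)
    dst = Path(filtered_barcodes)
    # Fail fast when discovery produced nothing to map against.
    if not src.is_file() or src.stat().st_size == 0:
        raise RuntimeError(f"No barcodes discovered: {src} is missing or empty")
    # Overwrite rather than append, so a stale destination cannot leak through.
    shutil.copyfile(src, dst)
    if dst.read_bytes() != src.read_bytes():
        raise RuntimeError(f"Copy validation failed: {dst} differs from {src}")
    return dst
```

Overwriting the destination, instead of appending to it, is what guarantees the two files are byte-identical when filtering is disabled.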

- Log input chunk counts and file sizes
- Track barcode counts at each processing step
- Show first 5 barcodes for verification
- Validate all intermediate files
- Report final file sizes and confirm identity
- Use set -e to fail fast on errors
- Clear [SC_MERGE] prefixed logs for easy grepping

This will help diagnose why filtered_barcodes.txt was empty in previous runs despite all_barcodes.txt having content.

Nextflow triple-quoted strings treat $ as Groovy interpolation. All bash variables and $(command) substitutions must be escaped as \$. This caused the compilation error on the HPC (Nextflow 23.10.0).

BUG 1 - Wrong channel passed to Pass 2 (ROOT CAUSE):
- sc_merge_discovered_barcodes outputs TWO files: all_barcodes.txt [0] and filtered_barcodes.txt [1]
- Old code: ch_filtered_barcodes.first() defaulted to channel [0] = all_barcodes.txt
- Fix: Use named emit (filtered_barcodes) to explicitly select the correct channel
- This means Pass 2 was using all_barcodes.txt (with comment headers) instead of filtered_barcodes.txt

BUG 2 - Comment headers in barcode reference file:
- all_barcodes.txt had '#barcode\tcount' comment headers (fine for QC)
- filtered_barcodes.txt ALSO had comment headers - flexiplex cannot parse these as -k reference
- flexiplex expects raw 'barcode\tcount' format, no comments
- Fix: filtered_barcodes.txt now contains raw barcodes only (no headers)
- all_barcodes.txt keeps headers since it's only for QC/debugging

Also: added emit names to process outputs for clarity
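
The second fix amounts to dropping '#'-prefixed header lines before the file is used as a reference. A minimal Python sketch of that step follows; the pipeline does this in its own script block, and `strip_comment_headers` is a hypothetical name.

```python
def strip_comment_headers(lines):
    """Return data lines only, dropping '#' comment headers and blank lines."""
    return [
        line.rstrip("\n")
        for line in lines
        if line.strip() and not line.startswith("#")
    ]
```

Given a QC-style file with the three-line header followed by barcode rows, only the raw tab-separated rows survive, which is the form the commit message says the reference file needs.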

This process was never imported or used in main.nf. sc_merge_discovered_barcodes handles both filter modes.

eos-jin and others added 30 commits March 27, 2026 17:39

Add parameter validation for discovery_mode and clone_barcodes_reference

f18a3e6

Add flexiplex-filter to extract_sc_env for discovery mode

d51e658

Merge feature/discovery-mode into main (fork)

e02c08c

Remove defaults channel from conda envs (WEHI HPC Anaconda policy)

45552e4

Sync defaults channel removal from main

2435207

Add filter_discovered_barcodes parameter for low-clone-count datasets

d679c14

Update README: document filter_discovered_barcodes parameter

ee1881f

Fix: remove duplicate 'Sample: xxx' heading below dropdown

c7dc04a

Merge feature/discovery-mode into main

ac21572

Rename auto-generated report to nextclone_qc_report.html

af19581

Fix: remove sc_merge_discovered_barcodes_nofilter - merge both modes …

9118e95

docs: Update main README with v2 report features + CLI usage

8e2630c

fix: Sort cross-sample charts alphabetically

396d218

chore: Remove backup file generate_report.py.bak

c5e33a4

eos-jin added 9 commits April 10, 2026 11:35

docs: Add Output Management section to README

e1bb4dd

fix: Escape all bash $ variables in Nextflow template string

f4cb150

chore: Remove dead sc_filter_discovered_barcodes process

f1e6753

eos-jin wants to merge 39 commits into phipsonlab:main from eos-jin:main