DNMBcluster

Dockerized pan-genome clustering pipeline with integrated visualization. A modern, Mac/Linux-native, single-command replacement for BPGA (Bacterial Pan Genome Analysis). Takes a folder of GenBank files, returns Parquet dataframes, a Roary-compatible CSV, and publication plots.

Status: M0–M7 complete. Verified end-to-end on 10 Geobacillus genomes (33,116 proteins → 6,508 clusters) in 65 s including plots.

What it does

Walks a folder of .gb / .gbk / .gbff files.
Extracts CDS features (translations by default, nucleotide with --level nucleotide).
Clusters with your choice of engine: MMseqs2 (default), DIAMOND deepclust, CD-HIT, or usearch12.
Produces a unified clusters.parquet with bidirectional identity (member→seed + seed→member) and both-direction coverage, plus an engine-native sidecar (CIGAR for usearch12, alignment positions for CD-HIT -p 1) when the engine emits extras.
Computes presence/absence bitmaps, pan/core curves, per-cluster summaries, and a long-format cluster_long.parquet that joins every CDS to its cluster + category + alignment metrics in one file.
Emits a Roary-compatible gene_presence_absence.csv for the whole Roary/Scoary/Phandango ecosystem, plus a BPGA-style single-sheet comparative_genomics_N.xlsx (merged-view with DEFINITION/accession headers, vertical separators every 3 columns, centered cells).
Generates 17 publication-quality plots via an internal R package (absorbed from BPGAconverter). See the Visualizations section below.
Optional phylogenomics (--phylo): single-copy core extraction → MAFFT alignment → optional trimAl → concatenation → IQ-TREE best tree with UFBoot + SH-aLRT support → ggtree visualization.
Optional ANI + POCP similarity (--ani): skani triangle ANI matrix + cluster-based POCP, both rendered as pheatmap heatmaps with hierarchical clustering.
Context-ribbon subcommand (dnmbcluster context): pick a genome + locus_tag and get a stacked gggenes neighborhood plot with shared-cluster colors showing ortholog context across every strain that carries the anchor cluster.

Install

Docker (recommended)

docker pull ghcr.io/jaeyoonsung/dnmbcluster:latest

From source (local dev)

git clone https://github.com/JAEYOONSUNG/DNMBcluster.git
cd DNMBcluster

# Create a conda env with all four clustering engines + Python + R
mamba env create -n dnmb -f env.yml

# Install the Python package in editable mode
~/miniforge3/envs/dnmb/bin/pip install -e .

# Install the R visualization package
~/miniforge3/envs/dnmb/bin/R CMD INSTALL R

usearch12 is not on bioconda — the Docker image compiles it from rcedgar/usearch12 at build time. For local dev without Docker, build it yourself and put the binary on $PATH as usearch12.

Quick start

Put your GenBank files in a folder, then run one command:

docker run --rm \
  -e HOST_UID=$(id -u) -e HOST_GID=$(id -g) \
  --tmpfs /tmp:size=8g,mode=1777 \
  -v $PWD:/data \
  ghcr.io/jaeyoonsung/dnmbcluster:latest \
  run /data/my_genbanks

Same command, local install:

dnmbcluster run ./my_genbanks

Outputs land in ./results/ by default.

The identity threshold — the most important parameter

--identity is the single knob that changes everything downstream — cluster count, core gene set, pan/core curves, flower plot values, and the Roary CSV. Pick it carefully for your biological question.

Defaults

`--level`	Default `--identity`	Default `--coverage`	Reasoning
`protein` (default)	0.6 (60%)	0.8 (80%)	Mid-point suitable for species-level pan-genome work across all supported engines; balanced between ortholog capture and paralog separation
`nucleotide`	0.6 (60%)	0.8 (80%)	Same single default so engine/level swaps don't silently change the clustering threshold; tighten to 0.7–0.8 for within-species

When to change it

Scenario	Suggested `--identity`	Why
Within-species core genome (e.g. same species, different strains)	0.90 – 0.95	Tighter grouping; splits out allelic variants
Species-level pan-genome (default BPGA use case)	0.50 – 0.70	Broad enough to capture orthologs, tight enough to separate paralog families
Genus-level comparison (mixed species)	0.40 – 0.50	Needed for cross-species orthology
Family-level or deeper	0.30 – 0.40	Below this, clustering tools lose sensitivity — use MMseqs2 or DIAMOND, not CD-HIT

CD-HIT word-size rule

CD-HIT imposes a hard mapping from --identity to the internal word size (-n):

Protein identity	CD-HIT word size
≥ 0.70	5
0.60 – 0.70	4
0.50 – 0.60	3
0.40 – 0.50	2 (unstable; prefer MMseqs2 below 0.5)

DNMBcluster picks the word size automatically from your --identity, but if you go below 0.4 on CD-HIT the tool becomes unreliable — switch engines.

How `--coverage` interacts with `--identity`

--coverage is the fraction of the shorter sequence that must be aligned for the pair to be clustered. It prevents long divergent proteins from being merged just because a short domain happens to match. Default 0.8 (80%) is the right choice for almost every pan-genome use case. Lower it to 0.5 only if you explicitly want domain-level families.

The bidirectional alignment pass

Running mmseqs2 without --fast triggers two separate alignment passes:

Forward — each member aligned as query against its cluster representative as target. Gives pct_identity_fwd, member_coverage, rep_coverage, alignment_length.
Reverse — each representative aligned as query against all its members as targets. Gives pct_identity_rev.

For length-disparate sequences the heuristic aligner can produce slightly different alignments depending on which sequence is the query, so reporting both directions catches asymmetries that a single-direction score would hide. If you only need cluster membership, pass --fast to skip both alignment passes — the extra columns land as NULL but wall time drops ~6.8× on the Geobacillus-10 benchmark.

Full CLI reference

dnmbcluster run INPUT_DIR [OPTIONS]

Arguments:
  INPUT_DIR                   Folder containing *.gb / *.gbk / *.gbff files

Options:
  -o, --output PATH           Output directory (default: ./results)
  --tool [mmseqs2|diamond|cd-hit|usearch12]
                              Clustering engine (default: mmseqs2)
  --level [protein|nucleotide]
                              Sequence level (default: protein)
  --identity FLOAT            Identity threshold 0.0-1.0 (default: 0.6 for both levels)
  --coverage FLOAT            Alignment coverage threshold (default: 0.8)
  --threads INT               Thread count, 0=auto-detect (default: 0)
  --max-ram SIZE              Memory cap, e.g. "8G" (default: container limit × 0.8)
  --alignment / --fast        Pipeline-wide bidirectional alignment pass (default: --alignment)
  --columns TEXT              Comma-separated middle block columns in comparative_genomics_N.xlsx
                              (default: "product"; see Comparative Excel section)
  --parse-only                Stop after GenBank parsing
  -h, --help                  Show this message and exit

Examples

Default species-level pan-genome (10 Geobacillus genomes):

dnmbcluster run ./geobacillus10

Runs with mmseqs2, protein level, identity 0.5, coverage 0.8, bidirectional alignment on. Outputs in ./results/.

Maximum speed, same clustering:

dnmbcluster run ./geobacillus10 --fast

Drops the alignment pass for ~6.8× speedup. Cluster membership is identical; only the identity/coverage metric columns are null.

Strain-level (tight threshold):

dnmbcluster run ./strains --identity 0.95

Nucleotide clustering of CDS DNA:

dnmbcluster run ./genomes --level nucleotide --identity 0.7

Explicit engine choice + thread cap + memory cap:

dnmbcluster run ./genomes \
    --tool diamond \
    --identity 0.5 --coverage 0.8 \
    --threads 16 --max-ram 32G

Parse only (debug GenBank ingestion):

dnmbcluster run ./genomes --parse-only -o /tmp/parse_check

Legacy CD-HIT path (for BPGA-style reproduction):

dnmbcluster run ./genomes --tool cd-hit --identity 0.5

Comparative genomics Excel — flexible column layout

The comparative_genomics_N.xlsx merged sheet gives every genome its own side-by-side block of columns. Two columns per block are fixed:

First column of every block: locus_tag (falls back to cds_key if the CDS has no locus_tag in the GenBank source)
Last column of every block: identity (%) (= pct_identity_fwd from the realignment stage)

Everything between them is controlled by the --columns CLI flag. The default produces a 3-column block locus_tag | product | identity (%), matching the BPGA triple. Add more attributes by listing them comma- separated:

dnmbcluster run ./genomes \
    --columns product,gene,protein_id,pct_identity_rev

yields a 6-column block per genome:

locus_tag | product | gene | protein_id | identity_rev (%) | identity (%)

Empty string produces the minimal 2-column block locus_tag | identity (%).

Selectable middle columns

Column	Source	Type	Agg rule
`product`	id_map (default)	string	first non-null
`gene`	id_map	string	first non-null
`protein_id`	id_map	string	first non-null
`ec_number`	id_map	string	first non-null
`contig`	id_map	string	first non-null
`aa_length`	id_map	uint32	first non-null
`pct_identity_rev`	clusters	float32 (0-100)	first non-null, 2dp
`member_coverage`	clusters	float32 (0-1)	first non-null, 2dp
`rep_coverage`	clusters	float32 (0-1)	first non-null, 2dp
`alignment_length`	clusters	uint32	first non-null

Unknown column names, duplicates, and the two fixed columns are rejected at the CLI layer before any data is loaded, so typos fail fast with a clear error message.

Column widths, borders, and numeric formats all adjust automatically to the block width. The DEFINITION + accession two-line merged header still spans the entire block regardless of how many middle columns you added.

Output structure

The output directory is split into three zones: inputs/ (engine- agnostic genome parsing artifacts, shared across reruns), raw/ (per-engine native output untouched), and processed/ (canonical Parquet every downstream consumer joins on). Top-level files are the user-facing reports.

results/
├── dnmb/
│   ├── inputs/                        # Engine-agnostic, rerun-stable
│   │   ├── id_map.parquet             # canonical identifiers, 1 row per CDS
│   │   ├── gene_table.parquet         # protein/DNA sequences keyed by protein_uid
│   │   ├── genome_meta.parquet        # 1 row per input file (sha256, GC%, n_cds,
│   │   │                              #   organism, strain, DEFINITION, ...)
│   │   └── proteins.faa (or cds.fna)  # integer-header FASTA fed to the engine
│   ├── raw/                           # Untouched engine-native output
│   │   ├── cd-hit/ | diamond/ | mmseqs2/ | usearch12/
│   │   │   └── (engine's own text / TSV / clstr / uc files)
│   │   └── realign/                   # Shared MMseqs2 realignment stage
│   │       ├── realign_fwd.tsv
│   │       └── realign_rev.tsv
│   └── processed/                     # Canonical Parquet dataframes
│       ├── clusters.parquet           # 1 row per CDS: cluster_id, is_centroid,
│       │                              #   pct_identity_fwd/_rev, member_coverage,
│       │                              #   rep_coverage, alignment_length
│       ├── cdhit_native.parquet       # (CD-HIT only) per-member alignment
│       │                              #   positions from -p 1
│       ├── usearch12_native.parquet   # (usearch12 only) CIGAR + strand
│       ├── presence_absence.parquet   # 1 row per cluster, uint64[] genome bitmap
│       ├── pan_core_curve.parquet     # N_permutations × N_genomes rows
│       ├── cluster_summary.parquet    # 1 row per cluster: category, rep metadata
│       └── cluster_long.parquet       # 1 row per CDS: everything joined
│                                      #   (cluster_id, category, genome_key,
│                                      #   locus_tag, gene, product, is_centroid,
│                                      #   pct_identity_fwd/_rev, coverages, aln_len)
├── comparative_genomics_N.xlsx        # BPGA-style merged sheet, 1 row per cluster
├── gene_presence_absence.csv          # Roary-compatible; Scoary/Phandango direct input
├── gene_presence_absence.Rtab         # Roary binary 0/1 companion
└── plots/                             # 13 R-rendered publication PDFs
    ├── flower_plot.pdf
    ├── pan_core_plot.pdf              # + Heaps' law gamma in subtitle
    ├── category_bar.pdf
    ├── presence_absence_heatmap.pdf
    ├── cluster_size_dist.pdf
    ├── identity_distribution.pdf
    ├── genome_jaccard_heatmap.pdf     # dual-metric, two color scales
    ├── coverage_scatter.pdf
    ├── gene_content_mds.pdf
    ├── centroid_length_distribution.pdf
    ├── core_gene_conservation.pdf
    ├── cumulative_pan_contribution.pdf
    └── singleton_top_n.pdf

Visualizations

Default pipeline runs render 13 plots from dnmb/processed/ + dnmb/inputs/. Opt-in stages (--phylo, --ani) add up to 4 more. The standalone dnmbcluster context subcommand produces per-anchor ortholog ribbon plots on demand.

Category	Plots
Overview	`flower_plot`, `category_bar`, `cluster_size_dist`, `ortho_euler`
Rarefaction	`pan_core_plot` (with Heaps' law γ in subtitle)
Matrix / relatedness	`presence_absence_heatmap`, `genome_jaccard_heatmap`, `gene_content_mds`
Per-genome contribution	`cumulative_pan_contribution`, `singleton_top_n`
QC	`identity_distribution`, `coverage_scatter`
Biology	`centroid_length_distribution`, `core_gene_conservation`
Opt-in — ANI/POCP (`--ani`)	`ani_heatmap`, `pocp_heatmap`
Opt-in — phylogenomics (`--phylo`)	`phylo_tree` (IQ-TREE best tree via ggtree)
Standalone subcommand	`context_ribbon` (gggenes, per-anchor)

The R layer is also importable directly — library(DNMBcluster); load_dnmb("results") returns the Parquet dataframes as tibbles if you want to script your own analyses on top.

Phylogenomics (`--phylo`)

Runs only when the user passes --phylo. Under the hood:

Select single-copy core clusters — category core and exactly one member per genome (multi-copy core is skipped).
Extract per-cluster FASTAs from gene_table.parquet.
Align each cluster with MAFFT --auto in a thread pool; optional trimAl -automated1 trimming when the binary is available.
Concatenate aligned clusters into a single supermatrix.
IQ-TREE with a fixed model (LG+G4 protein / GTR+G4 nucleotide)
- UFBoot 1000 + SH-aLRT 1000. ModelFinder is intentionally skipped to keep wall time under 5 min on 10-genome runs.
Newick tree lands at dnmb/processed/phylo_tree.nwk; ggtree renders plots/phylo_tree.pdf with support-colored internal nodes.

Scales: ~3 min on Geobacillus-10 with 300 core clusters capped via the internal max_clusters=300 default; remove the cap by editing the call in cli.py if you want the full core set.

ANI + POCP (`--ani`)

Runs only when --ani is set. Under the hood:

Lazily extract one contig FASTA per genome into dnmb/raw/genomes/{genome_key}.fna (cached; only touched if missing).
Run skani triangle --full-matrix once across all genomes → raw TSV at dnmb/raw/ani/skani_triangle.tsv.
Parse the matrix into long-form dnmb/processed/ani_matrix.parquet.
Compute POCP directly from clusters.parquet + genome_meta.n_cds using the Qin et al. 2014 ratio. Our approximation replaces the classical BLASTP 40%/50% threshold with "members share a cluster", so the POCP value reflects whatever --identity the clustering run used — document this in your methods.
plots/ani_heatmap.pdf and plots/pocp_heatmap.pdf render via pheatmap with average-linkage hierarchical clustering on 1 - metric/100. Species boundary at ANI ≥ 95%, genus boundary at POCP ≥ 60% are called out in the title.

Context ribbon (`dnmbcluster context`)

Standalone subcommand that works against an existing results directory. Example:

dnmbcluster context ./results \
    --genome GCF_000009785.1 \
    --locus GK_RS07590 \
    --flank 5

Finds the anchor CDS, walks every strain that shares the anchor's cluster, extracts the ±flank flanking CDSs on each strain's matching contig, and renders a stacked gggenes plot. Shared-cluster genes across strains get the same fill color (your visual ribbon), the anchor homolog in every strain is marked with a ★, and strain-unique genes fade to grey. Writes to <results>/plots/context_<genome>_<locus>.pdf by default.

Identifier scheme

DNMBcluster uses a two-level composite primary key for every CDS:

{genome_key}::{cds_key}     e.g.  "GCF_000009785.1::GK_RS07590"

The genome_key comes from a 6-step fallback chain (DBLINK Assembly accession → filename GCF_* → LOCUS/VERSION → organism slug → filename stem) and the cds_key from a 4-step chain (locus_tag → protein_id → {gene}__{ordinal} → coordinate hash). Both fallback rungs are recorded in genome_key_source / cds_key_source columns so you can see exactly how a label was derived. RefSeq WP_ protein accessions are deliberately NOT used as the primary key because they are shared across closely related strains by design.

A compact protein_uid:uint64 (run-scoped) is used as the hot-path join key. The stable primary key is the (genome_key, cds_key) pair. See SPEED.md Section 1 for the full design.

Analyzable scope

What the pipeline is designed to do, and what it's not.

Input scale

Parameter	Supported range	Hard cap	Notes
Genomes per run	2 – 65,535	`uint16 genome_uid`	Single-word genome bitmap up to 64; multi-word beyond, no code-level limit until 65,535
CDS per genome	1 – 4,294,967,295	`uint32 gene_uid`	Packed into the lower 32 bits of `protein_uid`
Total CDS across run	~10M benchmarked	unbounded in schema	Wall time grows roughly linearly; memory grows with clusters × genomes
Sequence level	`protein` (default), `nucleotide`	—	DIAMOND is protein-only; `--level nucleotide --tool diamond` is rejected up-front
Identity threshold	0.0 – 1.0	engine-specific (see below)	MMseqs2 / DIAMOND effective range ≥ 0.3 ; CD-HIT ≥ 0.4 (hard word-size floor); usearch12 ≥ 0.3
Coverage threshold	0.0 – 1.0	—	Applied as `--cov-mode 0` (coverage of shorter sequence)

What each engine covers

Engine	protein	nucleotide	Notes
MMseqs2 (default)	yes	yes	Fastest via linclust; default choice for bacterial pan-genomes
DIAMOND deepclust	yes	no	Up to 82× CD-HIT on large protein sets; rejects `--level nucleotide`
CD-HIT	yes (`cd-hit`)	yes (`cd-hit-est`)	Dispatches to the right binary by level; identity minimum 0.4 (protein)
usearch12	yes	yes	Source-only build; bundled in Docker, manual install for local dev

What you get out (post-run)

11 core Parquet dataframes under dnmb/inputs/ and dnmb/processed/, all schema-validated at stage boundaries.
1 BPGA-style Excel with merged-cell DEFINITION headers + centered cells + 3-column block borders.
2 Roary-compatible files for direct Scoary / Phandango / panGOLIN consumption.
13 publication PDFs covering overview, rarefaction, relatedness, per-genome contribution, QC, and biology (list above).
Uniform schema across all engines — clusters.parquet columns and types are identical regardless of --tool, so downstream code never branches on the clustering backend.

Tested engine × identity matrix

The pipeline is end-to-end tested on the Geobacillus-10 fixture across every combination of:

engines    : mmseqs2, diamond, cd-hit
identities : 0.3, 0.5, 0.7, 0.9   (cd-hit skips 0.3 per its 0.4 floor)

All 11 runs pass structural, schema, file-presence, validation, and plotting checks. Cluster counts scale monotonically with --identity within each engine; cross-engine cluster count spreads fall within the expected algorithmic-variation range.

Caveats & known limits

Things that will bite if you don't know about them.

Pipeline-wide

MMseqs2 is a hard runtime dependency even for non-MMseqs2 engines when --alignment (default) is on. The shared engines/realign.py stage uses MMseqs2 easy-search to populate the bidirectional alignment metric columns uniformly across every engine. Pass --fast to skip realignment entirely (alignment columns land null; cluster membership is unaffected).
Non-centroid members in clusters ≥ 2 must have both pct_identity_fwd and pct_identity_rev populated after realignment. The shared validator enforces this as a hard assertion and fails the run loudly if any row slips through. A short-peptide Biopython fallback covers the rare pairs MMseqs2's k-mer prefilter can't seed.
genome_uid is 16-bit. Single run ceiling is 65,535 genomes. Larger collections need to be split.
protein_uid is run-scoped, not stable across runs. Use the (genome_key, cds_key) composite from id_map.parquet as the stable cross-run key.
cluster_id is dense and run-scoped, re-emitted fresh per engine. Don't compare cluster_id values across different runs or different --tool choices.

Identifiers

Duplicate genome_keys crash early. If two input files resolve to the same composite key (commonly: two copies of the same assembly under different filenames), build_tables raises before clustering starts — disambiguate via a manifest or rename the files.
locus_tag is NOT unique across an assembly for all Prokka reruns. The parser catches duplicates in a single file and appends __dupN to keep identifiers unique, emitting a warning.
RefSeq WP_ protein accessions are intentionally not used as a primary key — RefSeq shares them across related strains by design, so using them would collapse orthologs into the same row.

Engine-specific

CD-HIT identity minimum is 0.4 (protein). Below that, the word-size heuristic becomes unstable — the pipeline keeps passing the threshold through but CD-HIT itself will refuse or produce unreliable clusters. Use MMseqs2 or DIAMOND for --identity < 0.4.
DIAMOND is protein-only. --tool diamond --level nucleotide is rejected at the CLI layer before parsing starts.
DIAMOND has no engine-native sidecar. deepclust emits a bare 2-column TSV and its --aln-out mode is unreliable in the 2.1.x series, so diamond_native.parquet is intentionally not produced. All alignment metrics come from the shared realign stage.
usearch12 is under GPL-3 and has no bioconda package. The Docker image compiles it from source at build time; for local dev you need to build it yourself and put the binary on $PATH as usearch12.

Downstream / reporting

Category definitions are strict 3-tier: core = present in all genomes, unique = present in exactly one, accessory = everything in between. The earlier Roary-style core/soft_core/shell/cloud split has been collapsed — if your downstream analysis depends on a 4-tier breakdown, reclassify off n_genomes in presence_absence.parquet.
R plotting requires Rscript on $PATH (or $DNMBCLUSTER_RSCRIPT). In a conda setup, the R package must be installed into the same env as the python CLI via R CMD INSTALL R; otherwise the Python pipeline will still succeed but the plot stage will warn and skip.
Heaps' law γ needs at least 2 data points on the pan curve. A single-genome run produces no meaningful openness estimate and the subtitle falls back to empty.
gene_content_mds.pdf needs ≥ 3 genomes. Smaller runs emit a warning and skip that specific plot.

Benchmark snapshot

10 Geobacillus genomes (33,116 proteins), Apple Silicon, 8 threads, identity 0.5, coverage 0.8:

Engine	Wall	RSS	Clusters
`mmseqs2 --fast`	9.9 s	329 MB	6508
`diamond`	13.6 s	324 MB	5709
`mmseqs2` default (bidirectional aln)	67.4 s	1004 MB	6508
`cd-hit`	70.6 s	323 MB	5862

Full methodology and per-engine analysis in BENCHMARK.md. usearch12 is bundled in the Docker image but not benchmarked in the local conda path (it requires a source compile from rcedgar/usearch12 under GPL-3).

Roadmap

M0 repo scaffold
M1 GenBank parser + parallel protein extraction
M2 MMseqs2 engine + unified cluster TSV
M3 DIAMOND, CD-HIT, usearch12 engines
M4 DNMB-native Parquet dataframes + Roary-compatible CSV
M5 R visualization layer (absorbed from BPGAconverter)
M6 Geobacillus-10 benchmark — see BENCHMARK.md
M7 Multi-arch CI (amd64/arm64) + GHCR release workflow

Repository layout

DNMBcluster/
├── src/dnmbcluster/        # Python pipeline (CLI, parser, engines, dataframes)
│   ├── cli.py              # Click entry point
│   ├── genbank.py          # Biopython GenBank parser
│   ├── ids.py              # genome_key + cds_key fallback chains
│   ├── schemas.py          # PyArrow schemas for every stage
│   ├── fasta.py            # integer-header FASTA writer
│   ├── engines/            # mmseqs2, diamond, cdhit, usearch12
│   ├── matrix.py           # presence/absence bitmap builder
│   ├── pancore.py          # numpy-vectorized pan/core curves
│   ├── summary.py          # per-cluster summary
│   ├── roary_export.py     # gene_presence_absence.csv / .Rtab
│   └── r_bridge.py         # subprocess bridge to the R package
├── R/                      # DNMBcluster R package (visualization)
│   ├── DESCRIPTION
│   ├── NAMESPACE
│   └── R/
│       ├── load_dnmb.R     # arrow::read_parquet loader
│       ├── flower_plot.R
│       ├── pan_core_plot.R             # + Heaps' law gamma
│       ├── category_bar.R
│       ├── presence_absence_heatmap.R
│       ├── cluster_size_dist.R
│       ├── identity_distribution.R
│       ├── genome_jaccard_heatmap.R    # dual-metric (ggnewscale)
│       ├── coverage_scatter.R
│       ├── gene_content_mds.R
│       ├── centroid_length_distribution.R
│       ├── core_gene_conservation.R
│       ├── cumulative_pan_contribution.R
│       ├── singleton_top_n.R
│       └── run_dnmb_plot.R # entry point called from Python
├── tests/
│   ├── fixtures/
│   │   └── geobacillus10/  # 10-genome benchmark GenBank set
│   └── test_*.py           # 85 unit + integration tests
├── Dockerfile              # multi-stage: usearch12 builder + conda runtime
├── docker-entrypoint.sh    # gosu + tini, HOST_UID auto-detect
├── env.yml                 # conda env spec
├── pyproject.toml          # Python package spec
├── SPEED.md                # speed + integrity design doc (load-bearing)
├── BENCHMARK.md            # Geobacillus-10 timing results
└── .github/workflows/      # ci.yml (pytest + buildx), release.yml (GHCR)

License

MIT for DNMBcluster itself. Bundled third-party tools retain their original licenses:

usearch12 — GPL-3.0 (compiled in Docker image from source)
DIAMOND — GPL-3.0
CD-HIT — GPL-2.0
MMseqs2 — MIT
Biopython — Biopython License (BSD-like)
R packages (arrow, ggplot2, dplyr, tidyr, tibble, magrittr, scales) — MIT / GPL-2+

Invoking GPL tools as subprocesses from DNMBcluster (MIT) is mere aggregation and does not affect the pipeline code's license.

Name		Name	Last commit message	Last commit date
Latest commit History 67 Commits
.github/workflows		.github/workflows
R		R
src/dnmbcluster		src/dnmbcluster
tests		tests
.gitignore		.gitignore
BENCHMARK.md		BENCHMARK.md
Dockerfile		Dockerfile
LICENSE		LICENSE
README.md		README.md
SPEED.md		SPEED.md
docker-entrypoint.sh		docker-entrypoint.sh
env.yml		env.yml
pyproject.toml		pyproject.toml

Folders and files

Latest commit

History

Repository files navigation

DNMBcluster

What it does

Install

Docker (recommended)

From source (local dev)

Quick start

The identity threshold — the most important parameter

Defaults

When to change it

CD-HIT word-size rule

How --coverage interacts with --identity

The bidirectional alignment pass

Full CLI reference

Examples

Comparative genomics Excel — flexible column layout

Selectable middle columns

Output structure

Visualizations

Phylogenomics (--phylo)

ANI + POCP (--ani)

Context ribbon (dnmbcluster context)

Identifier scheme

Analyzable scope

Input scale

What each engine covers

What you get out (post-run)

Tested engine × identity matrix

Caveats & known limits

Pipeline-wide

Identifiers

Engine-specific

Downstream / reporting

Benchmark snapshot

Roadmap

Repository layout

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

How `--coverage` interacts with `--identity`

Phylogenomics (`--phylo`)

ANI + POCP (`--ani`)

Context ribbon (`dnmbcluster context`)

Packages