Configuration Guide

This guide provides comprehensive documentation for all configuration parameters in biotoolsLLMAnnotate.

Overview

biotoolsLLMAnnotate can be configured through:

Command-line arguments (highest priority)
Configuration file (config.yaml)
Environment variables
Default values (lowest priority)

Configuration File

The main configuration file is config.yaml. You can generate a complete example using:

biotools-annotate --write-default-config

Parameter Reference

How to read this guide

Config key: YAML path under config.yaml (e.g., pipeline.min_bio_score).
CLI flag: Equivalent command-line option, when available.
Invocation priority: CLI flag > environment variable > config file > default.

Config key	CLI flag	Description
`pipeline.custom_pub2tools_biotools_json`	`--custom-pub2tools-json`	Load tool candidates from a custom Pub2Tools to_biotools.json export (overrides date-based fetch)
`pipeline.registry_path`	`--registry`	Load bio.tools registry snapshot to check for existing entries
`pipeline.from_date` / `pipeline.to_date`	`--from-date` / `--to-date`	Control date window for gathering candidates
`pipeline.resume_from_enriched`	`--resume-from-enriched`	Reuse cached enriched tool candidates
`pipeline.resume_from_pub2tools`	`--resume-from-pub2tools`	Reuse cached `to_biotools.json` export
`pipeline.resume_from_scoring`	`--resume-from-scoring`	Reapply thresholds without rerunning scoring
`pipeline.min_bio_score`	`--min-bio-score`	Minimum LLM bio score required
`pipeline.min_documentation_score`	`--min-doc-score`	Minimum documentation score required
`ollama.model`	`--model`	Ollama model to invoke
`ollama.concurrency`	`--concurrency`	Number of parallel scoring workers
`logging.level`	`--verbose` / `--quiet`	Adjust logging verbosity
`logging.file`	(config only)	Write logs to a specific file

Pub2Tools Configuration

Folder Behavior: When the Pub2Tools CLI is invoked, outputs are written directly to out/range_<from>_to_<to>/pub2tools/ within each run's folder. Existing outputs are overwritten on subsequent runs with the same date range.

`pub2tools.edam_owl`

Type: String (URL)
Default: http://edamontology.org/EDAM.owl
Description: URL to the EDAM ontology OWL file used for semantic annotation
CLI equivalent: --edam-owl

`pub2tools.idf`

Type: String (URL)
Default: https://github.com/edamontology/edammap/raw/master/doc/biotools.idf
Description: URL to the IDF (Inverse Document Frequency) file for Pub2Tools scoring
CLI equivalent: --idf

`pub2tools.idf_stemmed`

Type: String (URL)
Default: https://github.com/edamontology/edammap/raw/master/doc/biotools.stemmed.idf
Description: URL to the stemmed IDF file for Pub2Tools scoring
CLI equivalent: --idf-stemmed

`pub2tools.p2t_month`

Type: String (YYYY-MM) or null
Default: null
Description: Specific month to fetch from Pub2Tools using the -all command
CLI equivalent: --p2t-month
Example: "2024-09"

`pipeline.from_date` / `pipeline.to_date`

Type: String (relative window like 7d or ISO-8601) or null
Default: "7d" / null
Description: Date range for fetching candidates (alternative to p2t_month), applied across the entire pipeline.
CLI equivalent: --from-date, --to-date
Example: "2024-09-01" or "30d"

`pub2tools.selenium_firefox`

Type: Boolean or null
Default: null
Description: Enable Selenium Firefox for web scraping
CLI equivalent: --firefox-path (when provided)

`pub2tools.firefox_path`

Type: String (path) or null
Default: null
Description: Path to Firefox binary for Selenium web scraping
CLI equivalent: --firefox-path
Example: "/usr/bin/firefox"

`pub2tools.p2t_cli`

Type: String (path or command) or null
Default: null
Description: Path to Pub2Tools CLI executable or command string (overrides auto-detection)
CLI equivalent: --p2t-cli
Examples:
- File path: "/usr/local/bin/pub2tools"
- Java .jar: "java -jar /path/to/pub2tools-1.1.1.jar"
- Custom command: "java -Xmx4g -jar /opt/pub2tools.jar"
Note: If not set, the tool will auto-detect Pub2Tools CLI using environment variables and common installation paths

Pipeline Configuration

Note: All pipeline artifacts are written to a dedicated folder using fixed filenames: either the time-period path (out/range_YYYY-MM-DD_to_YYYY-MM-DD/…) when using date-based Pub2Tools fetching, or out/custom_tool_set/... when pipeline.custom_pub2tools_biotools_json/--custom-pub2tools-json/BIOTOOLS_ANNOTATE_INPUT is explicitly set. The directory structure is determined solely by the presence of a custom input file - resume flags do NOT affect this decision. After each run the active configuration file (or a generated snapshot) is copied into that folder for record keeping.

`pipeline.resume_from_enriched`

Type: Boolean
Default: false
Description: When true, the pipeline skips ingestion/enrichment and looks for the default cache file (out/<time-period>/cache/enriched_candidates.json.gz). No additional path configuration is required.
CLI equivalent: --resume-from-enriched

`pipeline.payload_version`

Type: String
Default: "__VERSION__"
Description: Version string stored alongside the updated entries payload; the placeholder resolves to the installed package version.

`pipeline.custom_pub2tools_biotools_json`

Type: String (path) or null
Default: null
Description: Path to a custom Pub2Tools to_biotools.json export file. When set, this overrides date-based Pub2Tools fetching and causes the pipeline to use out/custom_tool_set/ as the output directory. When null (default), date-based fetching is used with out/range_YYYY-MM-DD_to_YYYY-MM-DD/ as the output directory.
CLI equivalent: --custom-pub2tools-json
Example: "data/to_biotools.json"
Migration note: This parameter was previously named pipeline.input_path. The old name is deprecated but still supported with a warning.

`pipeline.registry_path`

Type: String (path) or null
Default: null
Description: Explicit bio.tools registry snapshot used to flag existing entries. Accepts either a JSON file (list or object with entries/list) or a directory containing files such as biotools.json.
CLI equivalent: --registry
Example: "data/biotools_snapshot.json"
Notes: When unset, the pipeline scans Pub2Tools export folders (current time-period cache and global cache). Use this option to supply a curated snapshot or to run the pipeline without Pub2Tools artifacts present.

`pipeline.resume_from_pub2tools`

Type: Boolean
Default: false
Description: When true, the pipeline skips the Pub2Tools CLI invocation and reuses the most recent to_biotools.json export found in the time-period or global pub2tools/ cache folders. No manual path configuration is required.
CLI equivalent: --resume-from-pub2tools

`pipeline.resume_from_scoring`

Type: Boolean
Default: false
Description: When true, the pipeline reuses the cached reports/assessment.jsonl for the time-period folder, reapplies the current score thresholds, and regenerates payload outputs without invoking the LLM scorer again. Requires the enriched candidates cache to be present (automatically handled when pipeline.resume_from_enriched is also true).
CLI equivalent: --resume-from-scoring

`pipeline.validate_biotools_api`

Type: Boolean
Default: false
Description: When true, the pipeline validates each payload entry against the live bio.tools API after scoring and registry checks. The validation adds biotools_api_status, api_name, and api_description columns to the CSV output. Use this to verify that tools marked as present in bio.tools actually exist in the live registry and to detect discrepancies with local snapshots. Disabled by default to avoid unnecessary API load during batch runs or offline mode.
CLI equivalent: --validate-biotools-api / --no-validate-biotools-api

`pipeline.min_bio_score`

Type: Float (0.0–1.0)
Default: 0.6
Description: Minimum biological relevance score required for a candidate to be included in the payload. Scores come from either heuristic scoring or the LLM rubric (A1–A5).
CLI equivalent: --min-bio-score

`pipeline.min_documentation_score`

Type: Float (0.0–1.0)
Default: 0.6
Description: Minimum documentation quality score required for inclusion, derived from rubric items B1–B5. Candidates that fail to meet this threshold are reported but excluded from the payload.
CLI equivalent: --min-doc-score
Compatibility: The legacy pipeline.min_score and --min-score options, when present, set both thresholds to the same value.

Ollama Configuration

`ollama.host`

Type: String (URL)
Default: "http://localhost:11434"
Description: Ollama server URL
Example: "http://localhost:11434"

`ollama.model`

Type: String
Default: "llama3.2"
Description: Default Ollama model name used for LLM assessment when the CLI flag is omitted.
CLI equivalent: --model

`ollama.max_retries`

Type: Integer
Default: 3
Description: Number of retry attempts for Ollama HTTP calls after the initial request. Setting 0 disables automatic retries.
Example: 5

`ollama.retry_backoff_seconds`

Type: Float (seconds)
Default: 2.0
Description: Fixed delay between Ollama retry attempts (applied to both HTTP session retries and LLM generation retries).
Example: 0.5

`ollama.timeout`

Type: Integer (seconds)
Default: 300
Description: Timeout for individual Ollama API requests. Increase this value if using slower models or complex prompts that exceed 5 minutes to generate.
Example: 600 (10 minutes), 120 (2 minutes for fast models)
Note: For qwen3:4b and similar lightweight models, consider reducing to 120-180 seconds to fail fast on stuck requests.

`ollama.temperature`

Type: Float
Default: 0.01
Description: Sampling temperature for LLM generation. Lower values (near 0) produce more deterministic outputs; higher values increase randomness.
Range: 0.0 to 2.0
Example: 0.0 (maximum determinism), 0.7 (creative), 1.0 (balanced)

`ollama.force_json_format`

Type: Boolean
Default: true
Description: When enabled, requests Ollama to operate in JSON mode ({"format": "json"}) so the model is restricted to emitting valid JSON. Disable only if using a model that does not support JSON mode.
Example: false

`ollama.schema_retries`

Type: Integer
Default: 1
Description: Number of extra attempts the scorer makes when the LLM output fails schema validation (total attempts = 1 + schema_retries). Validation errors are fed back into the prompt to encourage self-correction.
Example: 2

`ollama.concurrency`

Type: Integer
Default: 8
Description: Maximum number of concurrent scoring workers (shared by both heuristic and LLM scoring).
CLI equivalent: --concurrency
Example: 16

`ollama.num_ctx`

Type: Integer or null
Default: null (uses model default)
Description: Context window size in tokens for the Ollama model. When set, overrides the model's default context length. Required for models with large context windows to prevent prompt truncation.
Important: Ollama defaults to 4096 tokens if not specified, which may truncate prompts for models that support larger contexts (e.g., qwen2.5:4b supports 32K tokens).
Example: 8192, 16384, 32768
Recommended values:
- qwen2.5:4b, qwen2.5:7b: 16384 or 32768
- llama3.2:3b: 131072 (128K)
- gemma2:9b: 8192
Note: Higher values consume more memory. Monitor your system resources when increasing context size.

Logging Configuration

`logging.level`

Type: String
Default: "INFO"
Description: Logging level
Options: "DEBUG", "INFO", "WARNING", "ERROR"
CLI equivalent: --verbose (DEBUG), --quiet (ERROR)

`logging.file`

Type: String (path) or null
Default: null (console only)
Description: Log file path
Example: "logs/biotools-annotate.log"

`logging.llm_log`

Type: String (path)
Default: "out/logs/ollama/ollama.log"
Description: Human-readable append-only log that captures every Ollama request and response for post-run inspection.

`logging.llm_trace`

Type: String (path)
Default: "out/ollama/trace.jsonl"
Description: Machine-readable JSONL trace containing per-attempt metadata (prompt variants, request options, status, schema errors) for reproducible auditing and downstream analysis.

Europe PMC Enrichment

`enrichment.europe_pmc.enabled`

Type: Boolean
Default: true
Description: Toggle Europe PMC enrichment for publication metadata.
Effect: When disabled, abstracts and full text are not retrieved.

`enrichment.europe_pmc.include_full_text`

Type: Boolean
Default: true
Description: Fetch open-access full text (truncated) when a PMCID is available.
Note: Even when full text cannot be retrieved, the first available full-text URL is attached as publication_full_text_url.

Homepage Scraping Enrichment

`enrichment.homepage.enabled`

Type: Boolean
Default: true
Description: Toggle HTML scraping of each candidate homepage to discover documentation and repository links.
Effect: Disabled automatically when --offline or --resume-from-enriched is used.

`enrichment.homepage.timeout`

Type: Integer/Float (seconds)
Default: 8
Description: Network timeout applied to homepage requests.
Example: 5

`enrichment.homepage.user_agent`

Type: String
Default: "biotoolsllmannotate/__VERSION__ (+https://github.com/ELIXIR-Belgium/biotoolsLLMAnnotate)"
Description: Custom User-Agent header for homepage scraping requests. The placeholder is replaced with the package version during configuration load.

`enrichment.europe_pmc.max_publications`

Type: Integer
Default: 1
Description: Maximum number of publication records to enrich per candidate (to limit API calls).

`enrichment.europe_pmc.max_full_text_chars`

Type: Integer
Default: 4000
Description: Maximum number of characters retained from the Europe PMC full-text XML output.

`enrichment.europe_pmc.timeout`

Type: Integer
Default: 15
Description: Timeout (seconds) for Europe PMC HTTP requests.

Scoring Prompt Template

`scoring_prompt_template`

Type: String (multi-line)
Description: Custom prompt template for LLM scoring
Note: Advanced users can customize the scoring instructions
Variables: {title}, {description}, {homepage}, {homepage_status}, {homepage_error}, {documentation}, {documentation_keywords}, {repository}, {tags}, {published_at}, {publication_abstract}, {publication_full_text}, {publication_ids}
Expected response keys: bio_subscores, documentation_subscores, tool_name, homepage, publication_ids, concise_description, rationale
LLM contract: Return bio_subscores and documentation_subscores as JSON objects keyed by rubric IDs (A1–A5, B1–B5) with exactly one of {0, 0.5, 1} values. The pipeline computes the average from these subscores and clamps to [0.0, 1.0] before persisting the final scores. Normalize publication identifiers to DOI:..., PMID:..., PMCID:... format.

Environment Variables

Variable	Description	Example
`PUB2TOOLS_CLI`	Pub2Tools CLI command or path	`java -jar /path/to/pub2tools.jar`
`BIOTOOLS_CONFIG`	Custom config file path	`/etc/biotools-annotate/config.yaml`
`OLLAMA_HOST`	Ollama server URL	`http://localhost:11434`

Note: Configuration file parameters (like pub2tools.p2t_cli) take precedence over environment variables.

Configuration Precedence

Command-line arguments (highest priority)
Environment variables
Configuration file (config.yaml)
Default values (lowest priority)

Examples

Basic Configuration

pipeline:
  since: "7d"
  min_bio_score: 0.6
  min_documentation_score: 0.6
  concurrency: 16

logging:
  level: "INFO"

Advanced Configuration

pub2tools:
  p2t_month: "2024-09"
  p2t_cli: "java -jar /opt/pub2tools/pub2tools-cli-1.1.2.jar"
  firefox_path: "/usr/bin/firefox"

pipeline:
  since: "30d"
  min_bio_score: 0.8
  min_documentation_score: 0.75
  limit: 100
  model: "llama3.1:8b"
  concurrency: 8

enrichment:
  europe_pmc:
    enabled: true
    include_full_text: true

logging:
  level: "DEBUG"
  file: "logs/debug.log"

Minimal Configuration

pipeline:
  since: "2024-01-01"
  min_bio_score: 0.6
  min_documentation_score: 0.6

Troubleshooting

Common Issues

Pub2Tools not found: Set PUB2TOOLS_CLI environment variable or pub2tools.p2t_cli in config file
Firefox not found: Install Firefox or set firefox_path in config
Ollama not accessible: Check ollama.host configuration
Permission errors: Ensure write permissions for output directories

Debug Mode

Enable debug logging to troubleshoot issues:

biotools-annotate run --verbose --from-date 1d

Or in config:

logging:
  level: "DEBUG"

Migration from CLI Args

When migrating from command-line arguments to config file:

CLI Argument	Config Path
`--model llama3.1`	`ollama.model: "llama3.1"`
`--concurrency 16`	`ollama.concurrency: 16`
`--p2t-cli /path/to/pub2tools`	`pub2tools.p2t_cli: "/path/to/pub2tools"`
`--p2t-cli "java -jar /path/to/jar"`	`pub2tools.p2t_cli: "java -jar /path/to/jar"`
`--min-bio-score 0.7`	`pipeline.min_bio_score: 0.7`
`--min-doc-score 0.65`	`pipeline.min_documentation_score: 0.65`

Best Practices

Start simple: Use default config and override specific parameters
Use relative time: Prefer "7d" over specific dates for pipeline.from_date
Set reasonable limits: Use limit for testing with large datasets
Enable logging: Set logging.file for production use
Test configuration: Use --dry-run to validate settings
Version control: Keep config.yaml in version control for reproducibility

FilesExpand file tree

CONFIG.md

Latest commit

History

CONFIG.md

File metadata and controls

Configuration Guide

Overview

Configuration File

Parameter Reference

How to read this guide

Pub2Tools Configuration

pub2tools.edam_owl

pub2tools.idf

pub2tools.idf_stemmed

pub2tools.p2t_month

pipeline.from_date / pipeline.to_date

pub2tools.selenium_firefox

pub2tools.firefox_path

pub2tools.p2t_cli

Pipeline Configuration

pipeline.resume_from_enriched

pipeline.payload_version

pipeline.custom_pub2tools_biotools_json

pipeline.registry_path

pipeline.resume_from_pub2tools

pipeline.resume_from_scoring

pipeline.validate_biotools_api

pipeline.min_bio_score

pipeline.min_documentation_score