This guide provides comprehensive documentation for all configuration parameters in biotoolsLLMAnnotate.
biotoolsLLMAnnotate can be configured through:
- Command-line arguments (highest priority)
- Configuration file (`config.yaml`)
- Environment variables
- Default values (lowest priority)
The main configuration file is `config.yaml`. You can generate a complete example using:

```bash
biotools-annotate --write-default-config
```

- Config key: YAML path under `config.yaml` (e.g., `pipeline.min_bio_score`).
- CLI flag: Equivalent command-line option, when available.
- Invocation priority: CLI flag > environment variable > config file > default.
| Config key | CLI flag | Description |
|---|---|---|
| `pipeline.custom_pub2tools_biotools_json` | `--custom-pub2tools-json` | Load tool candidates from a custom Pub2Tools `to_biotools.json` export (overrides date-based fetch) |
| `pipeline.registry_path` | `--registry` | Load bio.tools registry snapshot to check for existing entries |
| `pipeline.from_date` / `pipeline.to_date` | `--from-date` / `--to-date` | Control date window for gathering candidates |
| `pipeline.resume_from_enriched` | `--resume-from-enriched` | Reuse cached enriched tool candidates |
| `pipeline.resume_from_pub2tools` | `--resume-from-pub2tools` | Reuse cached `to_biotools.json` export |
| `pipeline.resume_from_scoring` | `--resume-from-scoring` | Reapply thresholds without rerunning scoring |
| `pipeline.min_bio_score` | `--min-bio-score` | Minimum LLM bio score required |
| `pipeline.min_documentation_score` | `--min-doc-score` | Minimum documentation score required |
| `ollama.model` | `--model` | Ollama model to invoke |
| `ollama.concurrency` | `--concurrency` | Number of parallel scoring workers |
| `logging.level` | `--verbose` / `--quiet` | Adjust logging verbosity |
| `logging.file` | (config only) | Write logs to a specific file |
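
The "Invocation priority" rule stated above the table (CLI flag > environment variable > config file > default) can be sketched as a layered lookup. This is only an illustration under stated assumptions: `lookup` and `resolve` are hypothetical names, not functions of biotoolsLLMAnnotate, and unset CLI or environment values are assumed to be represented as `None`.

```python
from typing import Any, Dict

_MISSING = object()  # sentinel: distinguishes "not set" from an explicit null


def lookup(config: Dict[str, Any], dotted_key: str) -> Any:
    """Follow a dotted YAML path such as 'pipeline.min_bio_score'."""
    node: Any = config
    for part in dotted_key.split("."):
        if not isinstance(node, dict) or part not in node:
            return _MISSING
        node = node[part]
    return node


def resolve(dotted_key: str, cli_value: Any, env_value: Any,
            file_config: Dict[str, Any], defaults: Dict[str, Any]) -> Any:
    """Return the first value that is set: CLI, then env, then file, then default."""
    # Assumption: an unset CLI flag or environment variable is passed as None.
    for candidate in (cli_value, env_value):
        if candidate is not None:
            return candidate
    for layer in (file_config, defaults):
        value = lookup(layer, dotted_key)
        if value is not _MISSING:
            return value
    raise KeyError(dotted_key)
```

For example, with a default of `0.6` and a config file value of `0.7` for `pipeline.min_bio_score`, passing `--min-bio-score 0.9` would resolve to `0.9`, and omitting the flag would resolve to `0.7`.
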
Folder Behavior: When the Pub2Tools CLI is invoked, outputs are written directly to `out/range_<from>_to_<to>/pub2tools/` within each run's folder. Existing outputs are overwritten on subsequent runs with the same date range.

- Type: String (URL)
- Default: `http://edamontology.org/EDAM.owl`
- Description: URL to the EDAM ontology OWL file used for semantic annotation
- CLI equivalent: `--edam-owl`

- Type: String (URL)
- Default: `https://github.com/edamontology/edammap/raw/master/doc/biotools.idf`
- Description: URL to the IDF (Inverse Document Frequency) file for Pub2Tools scoring
- CLI equivalent: `--idf`

- Type: String (URL)
- Default: `https://github.com/edamontology/edammap/raw/master/doc/biotools.stemmed.idf`
- Description: URL to the stemmed IDF file for Pub2Tools scoring
- CLI equivalent: `--idf-stemmed`

- Type: String (YYYY-MM) or null
- Default: `null`
- Description: Specific month to fetch from Pub2Tools using the `-all` command
- CLI equivalent: `--p2t-month`
- Example: `"2024-09"`

- Type: String (relative window like `7d` or ISO-8601) or null
- Default: `"7d"` / `null`
- Description: Date range for fetching candidates (alternative to `p2t_month`), applied across the entire pipeline.
- CLI equivalent: `--from-date`, `--to-date`
- Example: `"2024-09-01"` or `"30d"`

- Type: Boolean or null
- Default: `null`
- Description: Enable Selenium Firefox for web scraping
- CLI equivalent: `--firefox-path` (when provided)

- Type: String (path) or null
- Default: `null`
- Description: Path to Firefox binary for Selenium web scraping
- CLI equivalent: `--firefox-path`
- Example: `"/usr/bin/firefox"`

- Type: String (path or command) or null
- Default: `null`
- Description: Path to Pub2Tools CLI executable or command string (overrides auto-detection)
- CLI equivalent: `--p2t-cli`
- Examples:
  - File path: `"/usr/local/bin/pub2tools"`
  - Java .jar: `"java -jar /path/to/pub2tools-1.1.1.jar"`
  - Custom command: `"java -Xmx4g -jar /opt/pub2tools.jar"`
- Note: If not set, the tool auto-detects the Pub2Tools CLI using environment variables and common installation paths
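
Because the Pub2Tools CLI setting accepts either a plain path or a full command string, whatever consumes it has to turn that string into an argument list. The sketch below shows one way to do this with Python's standard `shlex.split`. Whether biotoolsLLMAnnotate parses the value this way is an assumption, and `split_cli_command` is a hypothetical helper name.

```python
import shlex
from typing import List


def split_cli_command(p2t_cli: str) -> List[str]:
    """Split a path or command string into argv, honouring shell-style quotes."""
    # A bare path yields a one-element list; "java -jar x.jar" yields three.
    return shlex.split(p2t_cli)
```

Quoting matters when a path contains spaces: `'java -jar "/opt/my tools/pub2tools.jar"'` splits into three arguments, with the quoted path kept whole.
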

Note: All pipeline artifacts are written to a dedicated folder using fixed filenames: either the time-period path (`out/range_YYYY-MM-DD_to_YYYY-MM-DD/…`) when using date-based Pub2Tools fetching, or `out/custom_tool_set/...` when `pipeline.custom_pub2tools_biotools_json` / `--custom-pub2tools-json` / `BIOTOOLS_ANNOTATE_INPUT` is explicitly set. The directory structure is determined solely by the presence of a custom input file; resume flags do NOT affect this decision. After each run, the active configuration file (or a generated snapshot) is copied into that folder for record keeping.
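
The folder rule in the note above reduces to a single branch on whether a custom input file is set. A minimal sketch, assuming the dates are already formatted as `YYYY-MM-DD` strings; `output_dir` and its parameters are hypothetical names used only for illustration.

```python
from pathlib import PurePosixPath
from typing import Optional


def output_dir(custom_input: Optional[str], from_date: str, to_date: str,
               base: str = "out") -> PurePosixPath:
    """Pick the run folder; resume flags deliberately play no role here."""
    if custom_input:
        # A custom to_biotools.json export always maps to the fixed folder.
        return PurePosixPath(base) / "custom_tool_set"
    # Otherwise the folder name encodes the date window of the run.
    return PurePosixPath(base) / f"range_{from_date}_to_{to_date}"
```
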

- Type: Boolean
- Default: `false`
- Description: When `true`, the pipeline skips ingestion/enrichment and looks for the default cache file (`out/<time-period>/cache/enriched_candidates.json.gz`). No additional path configuration is required.
- CLI equivalent: `--resume-from-enriched`

- Type: String
- Default: `"__VERSION__"`
- Description: Version string stored alongside the updated entries payload; the placeholder resolves to the installed package version.

- Type: String (path) or null
- Default: `null`
- Description: Path to a custom Pub2Tools `to_biotools.json` export file. When set, this overrides date-based Pub2Tools fetching and causes the pipeline to use `out/custom_tool_set/` as the output directory. When null (default), date-based fetching is used with `out/range_YYYY-MM-DD_to_YYYY-MM-DD/` as the output directory.
- CLI equivalent: `--custom-pub2tools-json`
- Example: `"data/to_biotools.json"`
- Migration note: This parameter was previously named `pipeline.input_path`. The old name is deprecated but still supported with a warning.

- Type: String (path) or null
- Default: `null`
- Description: Explicit bio.tools registry snapshot used to flag existing entries. Accepts either a JSON file (list or object with `entries`/`list`) or a directory containing files such as `biotools.json`.
- CLI equivalent: `--registry`
- Example: `"data/biotools_snapshot.json"`
- Notes: When unset, the pipeline scans Pub2Tools export folders (current time-period cache and global cache). Use this option to supply a curated snapshot or to run the pipeline without Pub2Tools artifacts present.

- Type: Boolean
- Default: `false`
- Description: When `true`, the pipeline skips the Pub2Tools CLI invocation and reuses the most recent `to_biotools.json` export found in the time-period or global `pub2tools/` cache folders. No manual path configuration is required.
- CLI equivalent: `--resume-from-pub2tools`

- Type: Boolean
- Default: `false`
- Description: When `true`, the pipeline reuses the cached `reports/assessment.jsonl` for the time-period folder, reapplies the current score thresholds, and regenerates payload outputs without invoking the LLM scorer again. Requires the enriched candidates cache to be present (automatically handled when `pipeline.resume_from_enriched` is also `true`).
- CLI equivalent: `--resume-from-scoring`

- Type: Boolean
- Default: `false`
- Description: When `true`, the pipeline validates each payload entry against the live bio.tools API after scoring and registry checks. The validation adds `biotools_api_status`, `api_name`, and `api_description` columns to the CSV output. Use this to verify that tools marked as present in bio.tools actually exist in the live registry and to detect discrepancies with local snapshots. Disabled by default to avoid unnecessary API load during batch runs or in offline mode.
- CLI equivalent: `--validate-biotools-api` / `--no-validate-biotools-api`

- Type: Float (0.0–1.0)
- Default: `0.6`
- Description: Minimum biological relevance score required for a candidate to be included in the payload. Scores come from either heuristic scoring or the LLM rubric (`A1`–`A5`).
- CLI equivalent: `--min-bio-score`

- Type: Float (0.0–1.0)
- Default: `0.6`
- Description: Minimum documentation quality score required for inclusion, derived from rubric items `B1`–`B5`. Candidates that fail to meet this threshold are reported but excluded from the payload.
- CLI equivalent: `--min-doc-score`
- Compatibility: The legacy `pipeline.min_score` and `--min-score` options, when present, set both thresholds to the same value.
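
How the two thresholds and the legacy single threshold interact can be sketched as follows. This assumes a candidate must meet both minimums (inclusive comparison) to enter the payload, which is how the descriptions above read. `effective_thresholds` and `passes` are hypothetical names, not part of the tool.

```python
from typing import Optional, Tuple


def effective_thresholds(min_bio: float = 0.6, min_doc: float = 0.6,
                         legacy_min_score: Optional[float] = None) -> Tuple[float, float]:
    """Return (bio, documentation) thresholds, honouring the legacy option."""
    if legacy_min_score is not None:
        # The legacy option sets both thresholds to the same value.
        return legacy_min_score, legacy_min_score
    return min_bio, min_doc


def passes(bio_score: float, doc_score: float,
           min_bio: float, min_doc: float) -> bool:
    """A candidate enters the payload only if it meets both minimums."""
    return bio_score >= min_bio and doc_score >= min_doc
```

A candidate scoring 0.9 for biological relevance but 0.5 for documentation would be reported yet excluded under the default thresholds.
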

- Type: String (URL)
- Default: `"http://localhost:11434"`
- Description: Ollama server URL
- Example: `"http://localhost:11434"`

- Type: String
- Default: `"llama3.2"`
- Description: Default Ollama model name used for LLM assessment when the CLI flag is omitted.
- CLI equivalent: `--model`

- Type: Integer
- Default: `3`
- Description: Number of retry attempts for Ollama HTTP calls after the initial request. Setting `0` disables automatic retries.
- Example: `5`

- Type: Float (seconds)
- Default: `2.0`
- Description: Fixed delay between Ollama retry attempts (applied to both HTTP session retries and LLM generation retries).
- Example: `0.5`
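
The retry count and fixed delay described above combine into a simple loop: one initial request plus up to the configured number of retries, with the same pause between attempts. The sketch below is an illustration only; `call_with_retries` is a hypothetical helper, and the broad `except` stands in for whatever transport errors the real client handles.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(request: Callable[[], T], retries: int = 3,
                      delay: float = 2.0,
                      sleep: Callable[[float], object] = time.sleep) -> T:
    """Run `request` once, then up to `retries` more times with a fixed pause."""
    attempts = 1 + max(0, retries)  # retries=0 means a single attempt
    for attempt in range(attempts):
        try:
            return request()
        except Exception:  # a real client would catch only transport errors
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(delay)  # fixed delay, no exponential backoff
    raise AssertionError("unreachable")
```

With the defaults (`3` retries, `2.0` seconds), a request that keeps failing is attempted four times and waits six seconds in total before the error surfaces.
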

- Type: Integer (seconds)
- Default: `300`
- Description: Timeout for individual Ollama API requests. Increase this value if using slower models or complex prompts that exceed 5 minutes to generate.
- Example: `600` (10 minutes), `120` (2 minutes for fast models)
- Note: For qwen3:4b and similar lightweight models, consider reducing to `120`-`180` seconds to fail fast on stuck requests.

- Type: Float
- Default: `0.01`
- Description: Sampling temperature for LLM generation. Lower values (near 0) produce more deterministic outputs; higher values increase randomness.
- Range: `0.0` to `2.0`
- Example: `0.0` (maximum determinism), `0.7` (creative), `1.0` (balanced)

- Type: Boolean
- Default: `true`
- Description: When enabled, requests Ollama to operate in JSON mode (`{"format": "json"}`) so the model is restricted to emitting valid JSON. Disable only if using a model that does not support JSON mode.
- Example: `false`

- Type: Integer
- Default: `1`
- Description: Number of extra attempts the scorer makes when the LLM output fails schema validation (total attempts = `1 + schema_retries`). Validation errors are fed back into the prompt to encourage self-correction.
- Example: `2`
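
The schema-retry behaviour (total attempts = `1 + schema_retries`, with validation errors fed back into the prompt) can be sketched as below. The validation here is deliberately simplified to a JSON parse plus two required keys. The real schema, the feedback wording, and the names `validate` and `score_with_schema_retries` are all assumptions made for illustration.

```python
import json
from typing import Callable, List, Optional

# Simplified stand-in for the real schema: only two of the response keys.
REQUIRED_KEYS = ("bio_subscores", "documentation_subscores")


def validate(raw: str) -> List[str]:
    """Return a list of problems; an empty list means the reply is accepted."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value must be an object"]
    return [f"missing key: {key}" for key in REQUIRED_KEYS if key not in data]


def score_with_schema_retries(generate: Callable[[str], str], prompt: str,
                              schema_retries: int = 1) -> Optional[dict]:
    """Call `generate` up to 1 + schema_retries times, feeding errors back."""
    current = prompt
    for _ in range(1 + max(0, schema_retries)):
        raw = generate(current)
        errors = validate(raw)
        if not errors:
            return json.loads(raw)
        # Append the validation errors so the model can correct itself.
        current = prompt + "\n\nYour previous answer was rejected: " + "; ".join(errors)
    return None  # every attempt failed validation
```
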

- Type: Integer
- Default: `8`
- Description: Maximum number of concurrent scoring workers (shared by both heuristic and LLM scoring).
- CLI equivalent: `--concurrency`
- Example: `16`

- Type: Integer or null
- Default: `null` (uses model default)
- Description: Context window size in tokens for the Ollama model. When set, overrides the model's default context length. Required for models with large context windows to prevent prompt truncation.
- Important: Ollama defaults to 4096 tokens if not specified, which may truncate prompts for models that support larger contexts (e.g., qwen2.5:4b supports 32K tokens).
- Example: `8192`, `16384`, `32768`
- Recommended values:
  - `qwen2.5:4b`, `qwen2.5:7b`: `16384` or `32768`
  - `llama3.2:3b`: `131072` (128K)
  - `gemma2:9b`: `8192`
- Note: Higher values consume more memory. Monitor your system resources when increasing context size.
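
To show where the temperature, JSON mode, and context window settings end up, the sketch below builds a request body in the shape accepted by Ollama's `/api/generate` endpoint (`format`, and `options` with `temperature` and `num_ctx`). Whether biotoolsLLMAnnotate uses this endpoint or assembles the body this way is an assumption; `build_generate_payload` and its parameter names are hypothetical.

```python
from typing import Any, Dict, Optional


def build_generate_payload(model: str, prompt: str, temperature: float = 0.01,
                           force_json: bool = True,
                           num_ctx: Optional[int] = None) -> Dict[str, Any]:
    """Assemble a request body for Ollama's generate endpoint."""
    options: Dict[str, Any] = {"temperature": temperature}
    if num_ctx is not None:
        # Only override the context window when explicitly configured;
        # otherwise Ollama applies its own default.
        options["num_ctx"] = num_ctx
    payload: Dict[str, Any] = {
        "model": model,
        "prompt": prompt,
        "stream": False,  # ask for a single response object
        "options": options,
    }
    if force_json:
        payload["format"] = "json"  # JSON mode
    return payload
```
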

- Type: String
- Default: `"INFO"`
- Description: Logging level
- Options: `"DEBUG"`, `"INFO"`, `"WARNING"`, `"ERROR"`
- CLI equivalent: `--verbose` (DEBUG), `--quiet` (ERROR)

- Type: String (path) or null
- Default: `null` (console only)
- Description: Log file path
- Example: `"logs/biotools-annotate.log"`

- Type: String (path)
- Default: `"out/logs/ollama/ollama.log"`
- Description: Human-readable append-only log that captures every Ollama request and response for post-run inspection.

- Type: String (path)
- Default: `"out/ollama/trace.jsonl"`
- Description: Machine-readable JSONL trace containing per-attempt metadata (prompt variants, request options, status, schema errors) for reproducible auditing and downstream analysis.

- Type: Boolean
- Default: `true`
- Description: Toggle Europe PMC enrichment for publication metadata.
- Effect: When disabled, abstracts and full text are not retrieved.

- Type: Boolean
- Default: `true`
- Description: Fetch open-access full text (truncated) when a PMCID is available.
- Note: Even when full text cannot be retrieved, the first available full-text URL is attached as `publication_full_text_url`.

- Type: Boolean
- Default: `true`
- Description: Toggle HTML scraping of each candidate homepage to discover documentation and repository links.
- Effect: Disabled automatically when `--offline` or `--resume-from-enriched` is used.

- Type: Integer/Float (seconds)
- Default: `8`
- Description: Network timeout applied to homepage requests.
- Example: `5`

- Type: String
- Default: `"biotoolsllmannotate/__VERSION__ (+https://github.com/ELIXIR-Belgium/biotoolsLLMAnnotate)"`
- Description: Custom User-Agent header for homepage scraping requests. The placeholder is replaced with the package version during configuration load.

- Type: Integer
- Default: `1`
- Description: Maximum number of publication records to enrich per candidate (to limit API calls).

- Type: Integer
- Default: `4000`
- Description: Maximum number of characters retained from the Europe PMC full-text XML output.

- Type: Integer
- Default: `15`
- Description: Timeout (seconds) for Europe PMC HTTP requests.

- Type: String (multi-line)
- Description: Custom prompt template for LLM scoring
- Note: Advanced users can customize the scoring instructions
- Variables: `{title}`, `{description}`, `{homepage}`, `{homepage_status}`, `{homepage_error}`, `{documentation}`, `{documentation_keywords}`, `{repository}`, `{tags}`, `{published_at}`, `{publication_abstract}`, `{publication_full_text}`, `{publication_ids}`
- Expected response keys: `bio_subscores`, `documentation_subscores`, `tool_name`, `homepage`, `publication_ids`, `concise_description`, `rationale`
- LLM contract: Return `bio_subscores` and `documentation_subscores` as JSON objects keyed by rubric IDs (A1–A5, B1–B5), with each value being exactly one of {0, 0.5, 1}. The pipeline computes the average from these subscores and clamps it to `[0.0, 1.0]` before persisting the final scores. Normalize publication identifiers to DOI:..., PMID:..., PMCID:... format.
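
The score computation described in the LLM contract (average the rubric subscores, then clamp to `[0.0, 1.0]`) can be sketched as below. A plain unweighted mean is assumed, since the text does not mention weights, and `aggregate_subscores` is a hypothetical name.

```python
from typing import Mapping

ALLOWED_VALUES = (0, 0.5, 1)  # the only subscore values the contract permits


def aggregate_subscores(subscores: Mapping[str, float]) -> float:
    """Average rubric subscores (e.g. A1..A5) and clamp the result to [0.0, 1.0]."""
    if not subscores:
        raise ValueError("at least one subscore is required")
    invalid = {key: value for key, value in subscores.items()
               if value not in ALLOWED_VALUES}
    if invalid:
        raise ValueError(f"subscores outside {{0, 0.5, 1}}: {invalid}")
    mean = sum(subscores.values()) / len(subscores)
    return min(1.0, max(0.0, mean))  # clamp as a safety net
```

For instance, bio subscores of `A1=1, A2=0.5, A3=1, A4=0, A5=0.5` average to `0.6`, which just meets the default `pipeline.min_bio_score`.
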
| Variable | Description | Example |
|---|---|---|
| `PUB2TOOLS_CLI` | Pub2Tools CLI command or path | `java -jar /path/to/pub2tools.jar` |
| `BIOTOOLS_CONFIG` | Custom config file path | `/etc/biotools-annotate/config.yaml` |
| `OLLAMA_HOST` | Ollama server URL | `http://localhost:11434` |

Note: Configuration file parameters (like `pub2tools.p2t_cli`) take precedence over environment variables.

- Command-line arguments (highest priority)
- Environment variables
- Configuration file (`config.yaml`)
- Default values (lowest priority)

```yaml
pipeline:
  since: "7d"
  min_bio_score: 0.6
  min_documentation_score: 0.6
  concurrency: 16

logging:
  level: "INFO"
```

```yaml
pub2tools:
  p2t_month: "2024-09"
  p2t_cli: "java -jar /opt/pub2tools/pub2tools-cli-1.1.2.jar"
  firefox_path: "/usr/bin/firefox"

pipeline:
  since: "30d"
  min_bio_score: 0.8
  min_documentation_score: 0.75
  limit: 100
  model: "llama3.1:8b"
  concurrency: 8

enrichment:
  europe_pmc:
    enabled: true
    include_full_text: true

logging:
  level: "DEBUG"
  file: "logs/debug.log"
```

```yaml
pipeline:
  since: "2024-01-01"
  min_bio_score: 0.6
  min_documentation_score: 0.6
```

- Pub2Tools not found: Set the `PUB2TOOLS_CLI` environment variable or `pub2tools.p2t_cli` in the config file
- Firefox not found: Install Firefox or set `firefox_path` in config
- Ollama not accessible: Check the `ollama.host` configuration
- Permission errors: Ensure write permissions for output directories

Enable debug logging to troubleshoot issues:

```bash
biotools-annotate run --verbose --from-date 1d
```

Or in config:

```yaml
logging:
  level: "DEBUG"
```

When migrating from command-line arguments to config file:

| CLI Argument | Config Path |
|---|---|
| `--model llama3.1` | `ollama.model: "llama3.1"` |
| `--concurrency 16` | `ollama.concurrency: 16` |
| `--p2t-cli /path/to/pub2tools` | `pub2tools.p2t_cli: "/path/to/pub2tools"` |
| `--p2t-cli "java -jar /path/to/jar"` | `pub2tools.p2t_cli: "java -jar /path/to/jar"` |
| `--min-bio-score 0.7` | `pipeline.min_bio_score: 0.7` |
| `--min-doc-score 0.65` | `pipeline.min_documentation_score: 0.65` |
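
The mapping in the migration table can also be applied mechanically. The sketch below turns a set of CLI flag values into the nested structure that `config.yaml` expects, using only the flag-to-key pairs listed in the table. `set_dotted` and `flags_to_config` are hypothetical helpers, not part of the tool.

```python
from typing import Any, Dict

# Flag-to-key pairs taken from the migration table.
FLAG_TO_KEY = {
    "--model": "ollama.model",
    "--concurrency": "ollama.concurrency",
    "--p2t-cli": "pub2tools.p2t_cli",
    "--min-bio-score": "pipeline.min_bio_score",
    "--min-doc-score": "pipeline.min_documentation_score",
}


def set_dotted(config: Dict[str, Any], dotted_key: str, value: Any) -> None:
    """Store `value` at a dotted path, creating intermediate sections."""
    *parents, leaf = dotted_key.split(".")
    node = config
    for part in parents:
        node = node.setdefault(part, {})
    node[leaf] = value


def flags_to_config(flags: Dict[str, Any]) -> Dict[str, Any]:
    """Convert {flag: value} pairs into a nested config dictionary."""
    config: Dict[str, Any] = {}
    for flag, value in flags.items():
        set_dotted(config, FLAG_TO_KEY[flag], value)
    return config
```

The resulting dictionary mirrors the YAML layout, so it can be written out with any YAML serializer.
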
- Start simple: Use default config and override specific parameters
- Use relative time: Prefer `"7d"` over specific dates for `pipeline.from_date`
- Set reasonable limits: Use `limit` for testing with large datasets
- Enable logging: Set `logging.file` for production use
- Test configuration: Use `--dry-run` to validate settings
- Version control: Keep `config.yaml` in version control for reproducibility
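
The difference between a relative window such as `"7d"` and a fixed date for `pipeline.from_date` can be illustrated with a small parser. This sketch assumes that only the `<N>d` form and `YYYY-MM-DD` dates need handling, since those are the only forms shown in this guide; `parse_date_spec` is a hypothetical name and the tool's own parser may accept more.

```python
import re
from datetime import date, timedelta

_RELATIVE = re.compile(r"^(\d+)d$")  # e.g. "7d", "30d"


def parse_date_spec(spec: str, today: date) -> date:
    """Interpret '30d' as 30 days before `today`, otherwise as an ISO date."""
    text = spec.strip()
    match = _RELATIVE.match(text)
    if match:
        return today - timedelta(days=int(match.group(1)))
    return date.fromisoformat(text)  # raises ValueError on malformed input
```

A relative value moves with the run date, so the same `config.yaml` keeps fetching the most recent window, whereas a fixed date always pins the same starting point.
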