Roles for Monitoring, AI-Horde, Frontpage; CI linting, rendering and integration tests; local deploy (#9)

Draft
tazlin wants to merge 76 commits into main from prom-changes

Conversation


@tazlin tazlin commented Mar 27, 2026

Adds a Prometheus-driven, Grafana-based monitoring stack via the horde_monitoring role, introduces several new deployment roles for AI Horde services and storage backends, and builds out the CI and test infrastructure from scratch.

New roles

  • horde_monitoring - Deploys a full Grafana monitoring stack: Prometheus/Mimir, Grafana, Loki, Tempo, Pyroscope, Alertmanager, S3-compatible object storage, Memcached, and HAProxy as a unified entrypoint (including a Loki ingestion frontend with basic auth). Includes backup automation, alerting/recording rules, and Docker Compose orchestration.
  • horde_alloy - Deploys Grafana Alloy as a telemetry collector (metrics, logs, traces), intended for near-application deployment, especially on machines hosting applications that generate large volumes of telemetry.
  • horde_stats_exporter - Deploys the AI Horde Prometheus exporter as a systemd service. Supports multiple installation sources (git, wheel, local) with manifest resolution.
  • ai_horde - Deploys the AI Horde application natively or via Docker Compose, including fail-fast checks for PostgreSQL configurations.
  • aihorde_frontpage - Deploys the AI Horde frontpage (Docker or native mode).
  • garage - Deploys Garage as a standalone S3-compatible object storage backend with Docker Compose and systemd service management.
  • rustfs - Deploys a standalone RustFS S3-compatible object storage container for lightweight deployment scenarios.
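
To show how these roles compose, here is a minimal, hypothetical playbook sketch; the host group names are illustrative, and each role's README.md remains the authoritative source for required variables:

```yaml
# Hypothetical example playbook; host groups and the absence of role
# variables are illustrative, not tested defaults.
- hosts: monitoring
  become: true
  roles:
    - role: horde_monitoring     # Grafana/Mimir/Loki/Tempo/Pyroscope stack

- hosts: horde
  become: true
  roles:
    - role: ai_horde             # the AI Horde application
    - role: horde_stats_exporter # Prometheus exporter as a systemd service
    - role: horde_alloy          # near-application telemetry collection
```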

Test infrastructure

  • run_tests.sh - Docker-based test runner that builds a systemd-enabled container (Dockerfile.systemd), runs Ansible playbooks against it, and performs idempotency checks. Supports per-suite execution, --list mode, structured log output, and markers (# requires: docker-daemon, # idempotency: skip).
  • Test suites for monitoring, ai_horde, artbot, frontpage, full_stack, regen_worker, and integration. These cover render-only validation, runtime service health checks, fail-fast scenarios, external S3 deployment scenarios (including Garage), and end-to-end deployment.
  • Local deployment scripts - local_deploy.sh and Ansible-rendered configs for running the full monitoring stack locally, also via Docker.
    • This is suitable for development or test machines to verify deployability (in principle) and functionality, relying on the repository's default source mechanisms so forks deploy easily.
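
The comment markers mentioned above can be read with ordinary text tools. The helper below is a hypothetical sketch of how run_tests.sh-style markers could be parsed; the real script's implementation may differ:

```shell
# Hypothetical marker parser; the actual run_tests.sh may do this differently.
extract_marker() {
  # Usage: extract_marker <playbook> <marker-name>
  # Prints the value of a "# <marker-name>: <value>" header line, if present.
  sed -n "s/^# ${2}:[[:space:]]*//p" "$1" | head -n1
}

# Sample playbook carrying both markers:
cat > /tmp/test_example.yml <<'EOF'
# requires: docker-daemon
# idempotency: skip
- hosts: all
  tasks: []
EOF

extract_marker /tmp/test_example.yml requires     # prints: docker-daemon
extract_marker /tmp/test_example.yml idempotency  # prints: skip
```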

CI workflows

  • test.yml - Lint (yamllint + ansible-lint), integration tests (push-only, --privileged containers), and render-check (PR-only, syntax-check only).
  • role-tests.yml - Per-suite render and integration tests with a matrix strategy across roles. PRs get lightweight syntax-checks; push/dispatch gets full Docker-based render tests.
  • Both workflows include Docker Hub login to avoid rate limiting, and follow a security model where --privileged execution is restricted to trusted events (push/dispatch) and never runs for fork PRs.
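
The trusted-event restriction described above can be expressed with a job-level `if:` condition. This is an illustrative fragment, not the actual workflow; job and step names are hypothetical:

```yaml
# Sketch of the security model: the privileged integration job only runs
# for push/dispatch events, never for fork PRs.
jobs:
  integration:
    if: github.event_name == 'push' || github.event_name == 'workflow_dispatch'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run integration suite (uses a --privileged container internally)
        run: ./run_tests.sh --suite monitoring
```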

Linting & Best Practices

  • Applies ansible-lint rules, including correct role names and fully qualified collection names (FQCN) for Ansible modules. Configurable via the .ansible-lint file.
    • Of note, role names use prefixes (e.g., "artbot_revproxy") per the ansible-lint docs.
  • yamllint: Standardizes YAML file structure and whitespace.
  • Added strict checks in the monitoring tasks to catch stale configuration files before they can crash containers.
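
For reference, an .ansible-lint configuration of the kind described might look like the following; this is a sketch only, and the repository's actual file is authoritative:

```yaml
# Illustrative .ansible-lint fragment; the real file may select rules
# and paths differently.
profile: production   # one of ansible-lint's built-in strictness profiles
exclude_paths:
  - .venv/
skip_list: []         # no rules skipped; FQCN and role-name rules stay active
```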

Documentation

  • MONITORING.md - High-level monitoring stack overview.
  • monitoring - BACKUP.md, CREDENTIALS.md, MIGRATION.md, OBSERVABILITY.md, UPGRADING.md. Includes clarification on observability components.
  • Role-level README.md files clearly noting specific scopes, limitations, and requirements.
  • Example playbooks and inventories for all roles and deployment scenarios (e.g. decision trees).
  • QUICKSTART.md includes a deployment decision tree and clarifies local checkout usage.

Other

  • .gitignore for venv, retry files, and selective tracking of local-deploy static configs (network overlays, HAProxy config, fullstack compose).
  • .git-blame-ignore-rev for formatting-only commits.
  • Galaxy metadata updates (galaxy.yml, runtime.yml).
  • Collection-level requirements.yml for role dependencies.

tazlin added 30 commits March 22, 2026 14:04
Fixes:
- Service template: Jinja2 whitespace bug ({%- + trim_blocks) collapsed
  ExecStart arguments onto one line, breaking systemd parsing
- Service template: Add ExecReload=/bin/kill -HUP $MAINPID (was missing,
  caused reload handler failures)
- Retention default: Changed from '0' to '100y' ('0' omits the flag,
  giving Prometheus 15d default — not indefinite as documented)
- Alert rules: humanizeBytes → humanize1024 (humanizeBytes doesn't exist
  in Prometheus, caused crash on startup)
- Config template: Self-monitoring target uses {{ prometheus_port }}
  instead of hardcoded 9090
- Config template: Added rule_files glob for alert rules
- Download URL: Architecture-aware via prometheus_arch variable (amd64/arm64)
- Prometheus install: the creates: check points at /usr/local/bin/prometheus
  instead of the /tmp tarball for better idempotency
- Prometheus config notify: restart instead of reload to avoid handler
  ordering conflict on first deploy
- Backup tasks: Added missing become: yes on all privileged tasks
- Backup script: Fixed validation to use correct tar flags per
  compression type (gzip vs zstd)
- Firewall tasks: Port variables cast through | string filter
- InfluxDB wait_for: Uses {{ influxdb_port }} instead of hardcoded 8086
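
The ExecStart collapse is worth illustrating. When the Jinja2 environment enables trim_blocks, a `{%-` tag additionally strips the whitespace before the tag, including the newline that ends the previous backslash-continued line, so the arguments join into one line. The fragment below is illustrative, not the exact template:

```jinja
{# Broken: "{%-" plus trim_blocks also eats the newline after the "\", so
   all ExecStart arguments land on one line and systemd fails to parse it. #}
ExecStart=/usr/local/bin/prometheus \
{%- if prometheus_retention | default('') %}
  --storage.tsdb.retention.time={{ prometheus_retention }} \
{%- endif %}
  --config.file=/etc/prometheus/prometheus.yml

{# Fixed: plain "{%" tags; trim_blocks alone removes only the tag lines,
   keeping each argument on its own continued line. #}
ExecStart=/usr/local/bin/prometheus \
{% if prometheus_retention | default('') %}
  --storage.tsdb.retention.time={{ prometheus_retention }} \
{% endif %}
  --config.file=/etc/prometheus/prometheus.yml
```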

Additions:
- meta/main.yml: Galaxy metadata (was completely missing)
- .gitignore: Exclude .venv/ and *.retry
- Documentation: Updated all retention references and examples
- Dockerfile.systemd: Ubuntu 22.04 systemd-enabled container
- ansible.cfg: Test configuration with roles_path
- inventory_docker.ini: Docker connection inventory
- test_prometheus_only.yml: Prometheus-only deployment test
- test_full_stack.yml: Full stack (Prometheus + Grafana + Backup)
- test_backup_zstd.yml: Backup with zstd compression variant
…scripts

- Introduced `local_deploy.yml` for rendering the full monitoring stack configuration for local validation.
- Added `run_tests.sh` script to facilitate local Docker-based integration testing of the horde_monitoring role.
- Updated test playbooks to validate Mimir configuration, including retention policies, backup templates, and alerting rules.
- Enhanced `test_full_stack.yml` to cover Mimir, Grafana, and HAProxy integration with comprehensive assertions.
- Modified `test_prometheus_only.yml` to focus on Mimir-only deployment, ensuring proper configuration and permissions.
- Implemented checks for backup service rendering based on MinIO configuration in `test_backup_zstd.yml` and `test_prometheus_only.yml`.
- Added fail-fast checks for default passwords in `test_full_stack.yml` to enforce security best practices.
…ent scripts

- Introduced a new Jinja2 template for Grafana Tempo configuration in monolithic mode with S3 object storage (MinIO).
- Updated local_deploy.sh to streamline the rendering of configurations and remove hardcoded values, sourcing all necessary environment variables from a generated local-deploy.env file.
- Removed redundant local configuration generation logic and replaced it with post-tasks in the Ansible playbook to render Prometheus, Alertmanager, Exporter, Alloy, and Docker Compose overlay configurations.
- Added tests for the Alloy role to validate configuration rendering and ensure proper permissions.
- Enhanced full stack tests to include assertions for Loki and Tempo services, ensuring they are correctly configured and integrated.
- Implemented negative tests to verify that Loki and Tempo are not included in the Docker Compose file when disabled.
…ometheus exporter

- Created UPGRADING.md for monitoring stack upgrade procedures.
- Added horde_alloy role to deploy Grafana Alloy as a telemetry collector.
- Introduced horde_monitoring README.md for comprehensive monitoring stack details.
- Updated defaults in horde_monitoring to reflect new features and components.
- Added horde_stats_exporter role to deploy AI Horde Prometheus exporter as a systemd service.
- Enhanced horde_stats_exporter with detailed variable documentation and logging configuration.
- Included meta information for horde_stats_exporter role for Galaxy compatibility.
- Updated `run_tests.sh` to enhance systemd initialization checks with clearer state handling and increased sleep duration.
- Modified `test_alloy_role.yml` to include additional configuration variables and ensure proper assertions for rendered Alloy configurations.
- Enhanced `test_backup_zstd.yml` with improved assertions for backup service and environment file rendering.
- Improved `test_full_stack.yml` by refining assertions and ensuring proper permissions for critical files and directories.
- Updated `test_prometheus_only.yml` to assert the absence of Grafana datasource provisioning when Mimir is the only service installed.
- Standardized the use of `ansible.builtin` namespace for modules in all test files for consistency.
- Adjusted gather_facts settings across various test files for clarity and correctness.
- Introduced `test_prometheus_only.yml` to validate Mimir-only deployment without Grafana, ensuring correct configuration and absence of Grafana-related files.
- Added `test_runtime_services.yml` to perform runtime integration tests, verifying the health of Mimir, MinIO, Grafana, and Memcached services.
- Created `ansible.cfg` for the regen worker tests to specify roles path and disable host key checking.
- Implemented `test_regen_worker_render.yml` to test the rendering of templates for the horde_regen_worker role, asserting on generated file contents.
- Enhanced `run_tests.sh` to support listing discoverable tests, improved logging, and refined error handling for test execution.
Extract duplicated colour/log/prerequisite/ansible-discovery functions
from all local_deploy.sh scripts into a shared tests/lib.sh. Add a
shared tests/ansible.cfg for consistent Ansible settings across tests.

Remove the top-level tests/templates/local-deploy/ directory — each
test suite now owns its own templates under its own directory.
Add --jobs N flag for parallel test execution with per-container
isolation (test-container-<N>). Each parallel worker gets its own
Docker container and writes atomic result files.

Add is_localhost_only() to detect playbooks that use only localhost
connections and skip Docker container creation for them.

Update cleanup logic to handle multiple test containers.
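
The localhost-only detection could be sketched as follows; this is a hypothetical reimplementation, and the real tests/lib.sh helper may differ (for instance, by parsing YAML properly rather than grepping):

```shell
# Hypothetical sketch: treat a playbook as localhost-only when every
# "hosts:" line targets localhost, so no Docker container is needed.
is_localhost_only() {
  # Fail if any play targets something other than localhost.
  ! grep -E '^[[:space:]]*-?[[:space:]]*hosts:' "$1" \
    | grep -vqE 'hosts:[[:space:]]*localhost[[:space:]]*$'
}

# Sample playbooks to exercise both branches:
cat > /tmp/render_only.yml <<'EOF'
- hosts: localhost
  tasks: []
EOF
cat > /tmp/runtime.yml <<'EOF'
- hosts: all
  tasks: []
EOF

is_localhost_only /tmp/render_only.yml && echo "render-only: no container needed"
is_localhost_only /tmp/runtime.yml || echo "runtime: needs a container"
```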
Extract the copy→blockinfile→validate→promote HAProxy safe-edit
pipeline into a reusable internal role. Supports marker-bounded
insertion, timestamped backups with configurable retention, and
haproxy -c validation before promotion.

Consumers pass: _hse_config_path, _hse_backup_dir, _hse_marker,
_hse_block, _hse_backup_retention, and _hse_apply.
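
A consumer of the safe-edit role might look like the sketch below; the role name and the values are hypothetical, while the `_hse_*` variable names match those listed above:

```yaml
# Hypothetical invocation of the internal HAProxy safe-edit role.
- ansible.builtin.include_role:
    name: _haproxy_safe_edit   # illustrative name
  vars:
    _hse_config_path: /etc/haproxy/haproxy.cfg
    _hse_backup_dir: /var/backups/haproxy
    _hse_marker: "ANSIBLE MANAGED: loki ingest frontend"
    _hse_block: |
      frontend loki_ingest
        bind :3101
    _hse_backup_retention: 5
    _hse_apply: true   # validate with `haproxy -c`, then promote
```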
tazlin added 16 commits April 9, 2026 14:38
- Updated QUICKSTART.md to include the new decision tree and clarify local checkout usage.
- Improved CREDENTIALS.md by removing redundant information about credential rotation.
- Updated OBSERVABILITY.md to clarify observability stack components and configurations.
- Enhanced examples and scripts to support deploying from a fork and managing source repositories.
- Implemented cleanup for conflicting monitoring containers in local deployment scripts.
- Adjusted local deployment scripts to use a default repository URL for AI-Horde.
…for clarity

- Also expands alloy testing to capture additional corner cases
- Added backup functionality for PostgreSQL pg_hba.conf configuration.
- Introduced backend port overrides for Alertmanager and Stats Exporter in monitoring role.
- Updated monitoring stack restart handler to force recreation of containers.
- Improved Grafana provisioning logic to handle public organization creation and dashboard configuration.
- Added checks for stale configuration files in Loki, Mimir, Pyroscope, and Tempo tasks to prevent container crashes.
- Refactored service start logic to ensure all configurations are templated before starting the monitoring stack.
- Implemented fail-fast checks in AI-Horde role for PostgreSQL superuser management and system database name restrictions.
- Cleaned up test cases to ensure accurate assertions and added new tests for fail-fast scenarios.
- Introduced a vault example for managing sensitive information in the monitoring stack.
- Updated Mimir, Pyroscope, and Tempo configurations to replace MinIO references with S3-compatible storage options.
- Changed backup timer description to reflect S3 usage.
- Modified tests to validate S3 storage configurations and ensure proper handling of backup services.
- Introduced new tests for external S3 deployment scenarios, including Garage compatibility.
- Adjusted Docker Compose configurations to utilize S3 storage instead of MinIO.
- Ensured that all relevant variables and assertions are updated to reflect the transition from MinIO to S3.
- Added tasks to ensure dashboard directories are traversable and files are owned by Grafana in grafana.yml.
- Introduced Loki ingestion frontend with basic auth in haproxy.yml.
- Implemented pre-flight check for external S3 endpoint reachability in main.yml.
- Updated horde_stats_exporter role to support multiple installation sources (git, wheel, local) with validation and manifest resolution.
- Added rustfs role for deploying a standalone RustFS S3-compatible object storage container with Docker Compose.
- Improved test scripts to support Docker daemon tests and adjusted monitoring tests for source modes.
- Added S3 region configuration to Loki, Mimir, Pyroscope, and Tempo templates to support regional settings.
- Updated test cases to clean up stale recording rules and assert their absence when alerting rules are disabled.
- Modified full stack tests to ensure proper alerting rules are present and adjusted assertions for PostgreSQL alerts.
- Introduced a new Garage role for standalone S3-compatible object storage, including Docker Compose setup and systemd service management.
- Created templates for Garage configuration, Docker Compose, and systemd service files.
- Implemented validation checks for Garage secrets and access keys to ensure secure deployment.
- Added recording rules for the AI Horde monitoring stack to pre-compute aggregations for dashboards and alerts.
- Updated HAProxy configuration templates to use  dir for clarity and maintainability.
- Enhanced monitoring tests by adding new variables for Alloy role configurations.
- Removed obsolete test for host metrics source auto-selection.
- Ensured that monitoring configurations adhere to new standards for filesystem mount timeouts and exclusion patterns.
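
A recording rule of the kind described pre-computes an aggregation so dashboards and alerts query the cheap result instead of the raw series. The fragment below uses the standard Prometheus rule-file format, but the metric names and expression are hypothetical; the role's actual rules files are authoritative:

```yaml
# Hypothetical recording rule; actual AI Horde metric names differ.
groups:
  - name: ai_horde_aggregations
    interval: 1m
    rules:
      - record: job:horde_request_duration_seconds:rate5m_avg
        expr: |
          sum(rate(horde_request_duration_seconds_sum[5m]))
          /
          sum(rate(horde_request_duration_seconds_count[5m]))
```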
tazlin added 13 commits April 19, 2026 14:39
- Updated docker-compose template to conditionally publish OTLP ports for Tempo service.
- Added frontend configuration for Tempo ingestion in HAProxy with authentication.
- Enabled tenant federation in Mimir configuration.
- Modified source cloning function to initialize submodules recursively.
- Implemented embedded Garage bootstrap process in local deployment script.
- Added variables and tasks for managing Garage credentials and metadata in local deployment.
- Updated local Docker Compose template to expose OTLP ports.
- Adjusted environment variable template for embedded Garage settings.
- Enhanced tests for Alloy role to assert unique remote_write component labels.
- Added tests for stats exporter role to validate configuration rendering.
- Removed outdated update_alloy.py script as configuration is now managed through templates.
- Removes native AI-Horde deploy support
- Reworks AI-Horde instance scaling to a more Docker-native scheme
- Modify test_ai_horde_render.yml to enhance assertions for Docker Compose configurations and environment variable rendering.
- Change local_deploy.sh in frontpage tests to use the latest commit reference for the frontpage repository.
- Remove obsolete native deployment tests from test_frontpage_render.yml and delete test_native_deploy.yml.
- Adjust local_deploy.sh in full_stack and integration tests to support multiple AI-Horde instances and update pinned references.
- Update local_deploy.yml in full_stack and integration tests to set the log driver to local.
- Enhance assertions in test_integration_smoke.yml to verify host-range format for port bindings in Docker Compose.
- Moves high cardinality metrics/spans/etc to their own ai-horde-telemetry tenant, with a 3 day retention by default
- Enables experimental `early_head_compaction_min_in_memory_series` to force in-memory -> storage compaction sooner, preventing perpetual in-memory storage during extended periods of constant trace pushes
These changes ensure application and infrastructure stats flow into their correct sinks, sampled correctly and with the right retention, and that both the production layouts and the local-deploy environments have the correct wiring so everything talks together as intended.
- Added telemetry retention period overrides for Pyroscope in the runtime configuration.
- Updated local deployment YAML for AI-Horde to include trace sampling settings and local build context.
- Introduced a function to wire AI-Horde to Garage S3 for seamless integration.
- Implemented a smoke test for OTLP native histograms to ensure metrics are correctly ingested by Mimir.
- Enhanced local deployment scripts to support local source overrides for AI-Horde and Frontpage.
- Updated Alloy configuration to convert delta-temporality histograms to cumulative before sending to Mimir.
- Adjusted Docker Compose configurations to ensure proper service communication and container management.
- Added new environment variables for Garage S3 bucket configuration in local deployment.
- Improved test configurations for Alloy role to include trace sampling parameters.
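
The delta-to-cumulative conversion mentioned above could be expressed in Alloy configuration roughly as follows; this is a hedged sketch in which the component labels and endpoint URL are hypothetical, and the Alloy documentation should be consulted for the exact component arguments:

```alloy
// Illustrative pipeline fragment: convert delta-temporality histograms to
// cumulative before they reach Mimir. Labels and URL are hypothetical.
otelcol.processor.deltatocumulative "default" {
  output {
    metrics = [otelcol.exporter.prometheus.default.input]
  }
}

otelcol.exporter.prometheus "default" {
  forward_to = [prometheus.remote_write.mimir.receiver]
}

prometheus.remote_write "mimir" {
  endpoint {
    url = "http://mimir:9009/api/v1/push"   // hypothetical local endpoint
  }
}
```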
- Introduced Docker Compose configuration for the horde-model-reference service, including multi-worker support and health checks.
- Created HAProxy configuration to route traffic to the horde-model-reference service based on hostname.
- Added environment variable template for the horde-model-reference service, supporting various configurations and modes.
- Updated local deployment scripts to include the new horde-model-reference service, ensuring proper startup and logging.
- Enhanced full stack tests to validate the new service integration, including checks for Docker Compose and environment variable correctness.
- Implemented policy contracts to ensure defaults and security measures for the horde_model_reference role.
- Developed integration and rendering tests for the horde_model_reference role, covering various operational scenarios.