Roles for Monitoring, AI-Horde, Frontpage; CI linting, rendering and integration tests; local deploy (#9)

Draft
tazlin wants to merge 76 commits into main from prom-changes

Conversation


@tazlin tazlin commented Mar 27, 2026

Adds a Prometheus-driven, Grafana-based monitoring stack via the horde_monitoring role, introduces several new deployment roles for AI Horde services and storage backends, and builds out the CI and test infrastructure from scratch.

New roles

  • horde_monitoring - Deploys a full Grafana monitoring stack: Prometheus/Mimir, Grafana, Loki, Tempo, Pyroscope, Alertmanager, S3-compatible object storage, Memcached, and HAProxy as a unified entrypoint (including a Loki ingestion frontend with basic auth). Includes backup automation, alerting/recording rules, and Docker Compose orchestration.
  • horde_alloy - Deploys Grafana Alloy as a telemetry collector (metrics, logs, traces), intended for near-application deployment, especially on machines hosting applications that generate large volumes of telemetry.
  • horde_stats_exporter - Deploys the AI Horde Prometheus exporter as a systemd service. Supports multiple installation sources (git, wheel, local) with manifest resolution.
  • ai_horde - Deploys the AI Horde application natively or via Docker Compose, including fail-fast checks for PostgreSQL configurations.
  • aihorde_frontpage - Deploys the AI Horde frontpage (Docker or native mode).
  • garage - Deploys Garage as a standalone S3-compatible object storage backend with Docker Compose and systemd service management.
  • rustfs - Deploys a standalone RustFS S3-compatible object storage container for lightweight deployment scenarios.
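
To show how these roles compose, here is a minimal, hypothetical playbook sketch; the host group names are illustrative, and each role's README.md remains the authoritative source for required variables:

```yaml
# Hypothetical example playbook; host groups and the absence of role
# variables are illustrative, not tested defaults.
- hosts: monitoring
  become: true
  roles:
    - role: horde_monitoring     # Grafana/Mimir/Loki/Tempo/Pyroscope stack

- hosts: horde
  become: true
  roles:
    - role: ai_horde             # the AI Horde application
    - role: horde_stats_exporter # Prometheus exporter as a systemd service
    - role: horde_alloy          # near-application telemetry collection
```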

Test infrastructure

  • run_tests.sh - Docker-based test runner that builds a systemd-enabled container (Dockerfile.systemd), runs Ansible playbooks against it, and performs idempotency checks. Supports per-suite execution, --list mode, structured log output, and markers (# requires: docker-daemon, # idempotency: skip).
  • Test suites for monitoring, ai_horde, artbot, frontpage, full_stack, regen_worker, and integration. These cover render-only validation, runtime service health checks, fail-fast scenarios, external S3 deployment scenarios (including Garage), and end-to-end deployment.
  • Local deployment scripts - local_deploy.sh and Ansible-rendered configs for running the full monitoring stack locally, also via Docker.
    • This is suitable for development or test machines to verify deployability (in principle) and functionality, relying on the repository's default source mechanisms so forks deploy easily.
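
The comment markers mentioned above can be read with ordinary text tools. The helper below is a hypothetical sketch of how run_tests.sh-style markers could be parsed; the real script's implementation may differ:

```shell
# Hypothetical marker parser; the actual run_tests.sh may do this differently.
extract_marker() {
  # Usage: extract_marker <playbook> <marker-name>
  # Prints the value of a "# <marker-name>: <value>" header line, if present.
  sed -n "s/^# ${2}:[[:space:]]*//p" "$1" | head -n1
}

# Sample playbook carrying both markers:
cat > /tmp/test_example.yml <<'EOF'
# requires: docker-daemon
# idempotency: skip
- hosts: all
  tasks: []
EOF

extract_marker /tmp/test_example.yml requires     # prints: docker-daemon
extract_marker /tmp/test_example.yml idempotency  # prints: skip
```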

CI workflows

  • test.yml - Lint (yamllint + ansible-lint), integration tests (push-only, --privileged containers), and render-check (PR-only, syntax-check only).
  • role-tests.yml - Per-suite render and integration tests with a matrix strategy across roles. PRs get lightweight syntax-checks; push/dispatch gets full Docker-based render tests.
  • Both workflows include Docker Hub login to avoid rate limiting, and follow a security model where --privileged execution is restricted to trusted events (push/dispatch) and never runs for fork PRs.
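
The trusted-event restriction described above can be expressed with a job-level `if:` condition. This is an illustrative fragment, not the actual workflow; job and step names are hypothetical:

```yaml
# Sketch of the security model: the privileged integration job only runs
# for push/dispatch events, never for fork PRs.
jobs:
  integration:
    if: github.event_name == 'push' || github.event_name == 'workflow_dispatch'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run integration suite (uses a --privileged container internally)
        run: ./run_tests.sh --suite monitoring
```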

Linting & Best Practices

  • Applies ansible-lint rules, including correct role names and fully qualified collection names (FQCN) for Ansible modules. Configurable via the .ansible-lint file.
    • Of note, role names use prefixes (e.g., "artbot_revproxy") per the ansible-lint docs.
  • yamllint: Standardizes YAML file structure and whitespace.
  • Added strict checks in the monitoring tasks to catch stale configuration files before they can crash containers.
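
For reference, an .ansible-lint configuration of the kind described might look like the following; this is a sketch only, and the repository's actual file is authoritative:

```yaml
# Illustrative .ansible-lint fragment; the real file may select rules
# and paths differently.
profile: production   # one of ansible-lint's built-in strictness profiles
exclude_paths:
  - .venv/
skip_list: []         # no rules skipped; FQCN and role-name rules stay active
```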

Documentation

  • MONITORING.md - High-level monitoring stack overview.
  • monitoring - BACKUP.md, CREDENTIALS.md, MIGRATION.md, OBSERVABILITY.md, UPGRADING.md. Includes clarification on observability components.
  • Role-level README.md files clearly noting specific scopes, limitations, and requirements.
  • Example playbooks and inventories for all roles and deployment scenarios (e.g. decision trees).
  • QUICKSTART.md includes a deployment decision tree and clarifies local checkout usage.

Other

  • .gitignore for venv, retry files, and selective tracking of local-deploy static configs (network overlays, HAProxy config, fullstack compose).
  • .git-blame-ignore-rev for formatting-only commits.
  • Galaxy metadata updates (galaxy.yml, runtime.yml).
  • Collection-level requirements.yml for role dependencies.

tazlin added 30 commits March 22, 2026 14:04
Fixes:
- Service template: Jinja2 whitespace bug ({%- + trim_blocks) collapsed
  ExecStart arguments onto one line, breaking systemd parsing
- Service template: Add ExecReload=/bin/kill -HUP $MAINPID (was missing,
  caused reload handler failures)
- Retention default: Changed from '0' to '100y' ('0' omits the flag,
  giving Prometheus 15d default — not indefinite as documented)
- Alert rules: humanizeBytes → humanize1024 (humanizeBytes doesn't exist
  in Prometheus, caused crash on startup)
- Config template: Self-monitoring target uses {{ prometheus_port }}
  instead of hardcoded 9090
- Config template: Added rule_files glob for alert rules
- Download URL: Architecture-aware via prometheus_arch variable (amd64/arm64)
- Prometheus install: the creates: check points at /usr/local/bin/prometheus
  instead of the /tmp tarball for better idempotency
- Prometheus config notify: restart instead of reload to avoid handler
  ordering conflict on first deploy
- Backup tasks: Added missing become: yes on all privileged tasks
- Backup script: Fixed validation to use correct tar flags per
  compression type (gzip vs zstd)
- Firewall tasks: Port variables cast through | string filter
- InfluxDB wait_for: Uses {{ influxdb_port }} instead of hardcoded 8086
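
The ExecStart collapse is worth illustrating. When the Jinja2 environment enables trim_blocks, a `{%-` tag additionally strips the whitespace before the tag, including the newline that ends the previous backslash-continued line, so the arguments join into one line. The fragment below is illustrative, not the exact template:

```jinja
{# Broken: "{%-" plus trim_blocks also eats the newline after the "\", so
   all ExecStart arguments land on one line and systemd fails to parse it. #}
ExecStart=/usr/local/bin/prometheus \
{%- if prometheus_retention | default('') %}
  --storage.tsdb.retention.time={{ prometheus_retention }} \
{%- endif %}
  --config.file=/etc/prometheus/prometheus.yml

{# Fixed: plain "{%" tags; trim_blocks alone removes only the tag lines,
   keeping each argument on its own continued line. #}
ExecStart=/usr/local/bin/prometheus \
{% if prometheus_retention | default('') %}
  --storage.tsdb.retention.time={{ prometheus_retention }} \
{% endif %}
  --config.file=/etc/prometheus/prometheus.yml
```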

Additions:
- meta/main.yml: Galaxy metadata (was completely missing)
- .gitignore: Exclude .venv/ and *.retry
- Documentation: Updated all retention references and examples
- Dockerfile.systemd: Ubuntu 22.04 systemd-enabled container
- ansible.cfg: Test configuration with roles_path
- inventory_docker.ini: Docker connection inventory
- test_prometheus_only.yml: Prometheus-only deployment test
- test_full_stack.yml: Full stack (Prometheus + Grafana + Backup)
- test_backup_zstd.yml: Backup with zstd compression variant
…scripts

- Introduced `local_deploy.yml` for rendering the full monitoring stack configuration for local validation.
- Added `run_tests.sh` script to facilitate local Docker-based integration testing of the horde_monitoring role.
- Updated test playbooks to validate Mimir configuration, including retention policies, backup templates, and alerting rules.
- Enhanced `test_full_stack.yml` to cover Mimir, Grafana, and HAProxy integration with comprehensive assertions.
- Modified `test_prometheus_only.yml` to focus on Mimir-only deployment, ensuring proper configuration and permissions.
- Implemented checks for backup service rendering based on MinIO configuration in `test_backup_zstd.yml` and `test_prometheus_only.yml`.
- Added fail-fast checks for default passwords in `test_full_stack.yml` to enforce security best practices.
…ent scripts

- Introduced a new Jinja2 template for Grafana Tempo configuration in monolithic mode with S3 object storage (MinIO).
- Updated local_deploy.sh to streamline the rendering of configurations and remove hardcoded values, sourcing all necessary environment variables from a generated local-deploy.env file.
- Removed redundant local configuration generation logic and replaced it with post-tasks in the Ansible playbook to render Prometheus, Alertmanager, Exporter, Alloy, and Docker Compose overlay configurations.
- Added tests for the Alloy role to validate configuration rendering and ensure proper permissions.
- Enhanced full stack tests to include assertions for Loki and Tempo services, ensuring they are correctly configured and integrated.
- Implemented negative tests to verify that Loki and Tempo are not included in the Docker Compose file when disabled.
…ometheus exporter

- Created UPGRADING.md for monitoring stack upgrade procedures.
- Added horde_alloy role to deploy Grafana Alloy as a telemetry collector.
- Introduced horde_monitoring README.md for comprehensive monitoring stack details.
- Updated defaults in horde_monitoring to reflect new features and components.
- Added horde_stats_exporter role to deploy AI Horde Prometheus exporter as a systemd service.
- Enhanced horde_stats_exporter with detailed variable documentation and logging configuration.
- Included meta information for horde_stats_exporter role for Galaxy compatibility.
- Updated `run_tests.sh` to enhance systemd initialization checks with clearer state handling and increased sleep duration.
- Modified `test_alloy_role.yml` to include additional configuration variables and ensure proper assertions for rendered Alloy configurations.
- Enhanced `test_backup_zstd.yml` with improved assertions for backup service and environment file rendering.
- Improved `test_full_stack.yml` by refining assertions and ensuring proper permissions for critical files and directories.
- Updated `test_prometheus_only.yml` to assert the absence of Grafana datasource provisioning when Mimir is the only service installed.
- Standardized the use of `ansible.builtin` namespace for modules in all test files for consistency.
- Adjusted gather_facts settings across various test files for clarity and correctness.
- Introduced `test_prometheus_only.yml` to validate Mimir-only deployment without Grafana, ensuring correct configuration and absence of Grafana-related files.
- Added `test_runtime_services.yml` to perform runtime integration tests, verifying the health of Mimir, MinIO, Grafana, and Memcached services.
- Created `ansible.cfg` for the regen worker tests to specify roles path and disable host key checking.
- Implemented `test_regen_worker_render.yml` to test the rendering of templates for the horde_regen_worker role, asserting on generated file contents.
- Enhanced `run_tests.sh` to support listing discoverable tests, improved logging, and refined error handling for test execution.
Extract duplicated colour/log/prerequisite/ansible-discovery functions
from all local_deploy.sh scripts into a shared tests/lib.sh. Add a
shared tests/ansible.cfg for consistent Ansible settings across tests.

Remove the top-level tests/templates/local-deploy/ directory — each
test suite now owns its own templates under its own directory.
Add --jobs N flag for parallel test execution with per-container
isolation (test-container-<N>). Each parallel worker gets its own
Docker container and writes atomic result files.

Add is_localhost_only() to detect playbooks that use only localhost
connections and skip Docker container creation for them.

Update cleanup logic to handle multiple test containers.
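
The localhost-only detection could be sketched as follows; this is a hypothetical reimplementation, and the real tests/lib.sh helper may differ (for instance, by parsing YAML properly rather than grepping):

```shell
# Hypothetical sketch: treat a playbook as localhost-only when every
# "hosts:" line targets localhost, so no Docker container is needed.
is_localhost_only() {
  # Fail if any play targets something other than localhost.
  ! grep -E '^[[:space:]]*-?[[:space:]]*hosts:' "$1" \
    | grep -vqE 'hosts:[[:space:]]*localhost[[:space:]]*$'
}

# Sample playbooks to exercise both branches:
cat > /tmp/render_only.yml <<'EOF'
- hosts: localhost
  tasks: []
EOF
cat > /tmp/runtime.yml <<'EOF'
- hosts: all
  tasks: []
EOF

is_localhost_only /tmp/render_only.yml && echo "render-only: no container needed"
is_localhost_only /tmp/runtime.yml || echo "runtime: needs a container"
```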
Extract the copy→blockinfile→validate→promote HAProxy safe-edit
pipeline into a reusable internal role. Supports marker-bounded
insertion, timestamped backups with configurable retention, and
haproxy -c validation before promotion.

Consumers pass: _hse_config_path, _hse_backup_dir, _hse_marker,
_hse_block, _hse_backup_retention, and _hse_apply.
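
A consumer of the safe-edit role might look like the sketch below; the role name and the values are hypothetical, while the `_hse_*` variable names match those listed above:

```yaml
# Hypothetical invocation of the internal HAProxy safe-edit role.
- ansible.builtin.include_role:
    name: _haproxy_safe_edit   # illustrative name
  vars:
    _hse_config_path: /etc/haproxy/haproxy.cfg
    _hse_backup_dir: /var/backups/haproxy
    _hse_marker: "ANSIBLE MANAGED: loki ingest frontend"
    _hse_block: |
      frontend loki_ingest
        bind :3101
    _hse_backup_retention: 5
    _hse_apply: true   # validate with `haproxy -c`, then promote
```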
tazlin added 16 commits April 9, 2026 14:38
- Updated QUICKSTART.md to include the new decision tree and clarify local checkout usage.
- Improved CREDENTIALS.md by removing redundant information about credential rotation.
- Updated OBSERVABILITY.md to clarify observability stack components and configurations.
- Enhanced examples and scripts to support deploying from a fork and managing source repositories.
- Implemented cleanup for conflicting monitoring containers in local deployment scripts.
- Adjusted local deployment scripts to use a default repository URL for AI-Horde.
…for clarity

- Also expands alloy testing to capture additional corner cases
- Added backup functionality for PostgreSQL pg_hba.conf configuration.
- Introduced backend port overrides for Alertmanager and Stats Exporter in monitoring role.
- Updated monitoring stack restart handler to force recreation of containers.
- Improved Grafana provisioning logic to handle public organization creation and dashboard configuration.
- Added checks for stale configuration files in Loki, Mimir, Pyroscope, and Tempo tasks to prevent container crashes.
- Refactored service start logic to ensure all configurations are templated before starting the monitoring stack.
- Implemented fail-fast checks in AI-Horde role for PostgreSQL superuser management and system database name restrictions.
- Cleaned up test cases to ensure accurate assertions and added new tests for fail-fast scenarios.
- Introduced a vault example for managing sensitive information in the monitoring stack.
- Updated Mimir, Pyroscope, and Tempo configurations to replace MinIO references with S3-compatible storage options.
- Changed backup timer description to reflect S3 usage.
- Modified tests to validate S3 storage configurations and ensure proper handling of backup services.
- Introduced new tests for external S3 deployment scenarios, including Garage compatibility.
- Adjusted Docker Compose configurations to utilize S3 storage instead of MinIO.
- Ensured that all relevant variables and assertions are updated to reflect the transition from MinIO to S3.
- Added tasks to ensure dashboard directories are traversable and files are owned by Grafana in grafana.yml.
- Introduced Loki ingestion frontend with basic auth in haproxy.yml.
- Implemented pre-flight check for external S3 endpoint reachability in main.yml.
- Updated horde_stats_exporter role to support multiple installation sources (git, wheel, local) with validation and manifest resolution.
- Added rustfs role for deploying a standalone RustFS S3-compatible object storage container with Docker Compose.
- Improved test scripts to support Docker daemon tests and adjusted monitoring tests for source modes.
- Added S3 region configuration to Loki, Mimir, Pyroscope, and Tempo templates to support regional settings.
- Updated test cases to clean up stale recording rules and assert their absence when alerting rules are disabled.
- Modified full stack tests to ensure proper alerting rules are present and adjusted assertions for PostgreSQL alerts.
- Introduced a new Garage role for standalone S3-compatible object storage, including Docker Compose setup and systemd service management.
- Created templates for Garage configuration, Docker Compose, and systemd service files.
- Implemented validation checks for Garage secrets and access keys to ensure secure deployment.
- Added recording rules for the AI Horde monitoring stack to pre-compute aggregations for dashboards and alerts.
- Updated HAProxy configuration templates to use  dir for clarity and maintainability.
- Enhanced monitoring tests by adding new variables for Alloy role configurations.
- Removed obsolete test for host metrics source auto-selection.
- Ensured that monitoring configurations adhere to new standards for filesystem mount timeouts and exclusion patterns.
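
A recording rule of the kind described pre-computes an aggregation so dashboards and alerts query the cheap result instead of the raw series. The fragment below uses the standard Prometheus rule-file format, but the metric names and expression are hypothetical; the role's actual rules files are authoritative:

```yaml
# Hypothetical recording rule; actual AI Horde metric names differ.
groups:
  - name: ai_horde_aggregations
    interval: 1m
    rules:
      - record: job:horde_request_duration_seconds:rate5m_avg
        expr: |
          sum(rate(horde_request_duration_seconds_sum[5m]))
          /
          sum(rate(horde_request_duration_seconds_count[5m]))
```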
tazlin added 13 commits April 19, 2026 14:39
- Updated docker-compose template to conditionally publish OTLP ports for Tempo service.
- Added frontend configuration for Tempo ingestion in HAProxy with authentication.
- Enabled tenant federation in Mimir configuration.
- Modified source cloning function to initialize submodules recursively.
- Implemented embedded Garage bootstrap process in local deployment script.
- Added variables and tasks for managing Garage credentials and metadata in local deployment.
- Updated local Docker Compose template to expose OTLP ports.
- Adjusted environment variable template for embedded Garage settings.
- Enhanced tests for Alloy role to assert unique remote_write component labels.
- Added tests for stats exporter role to validate configuration rendering.
- Removed outdated update_alloy.py script as configuration is now managed through templates.
- Removes native AI-Horde deploy support
- Reworks AI-Horde instance scaling to a more Docker-native scheme
- Modify test_ai_horde_render.yml to enhance assertions for Docker Compose configurations and environment variable rendering.
- Change local_deploy.sh in frontpage tests to use the latest commit reference for the frontpage repository.
- Remove obsolete native deployment tests from test_frontpage_render.yml and delete test_native_deploy.yml.
- Adjust local_deploy.sh in full_stack and integration tests to support multiple AI-Horde instances and update pinned references.
- Update local_deploy.yml in full_stack and integration tests to set the log driver to local.
- Enhance assertions in test_integration_smoke.yml to verify host-range format for port bindings in Docker Compose.
- Moves high cardinality metrics/spans/etc to their own ai-horde-telemetry tenant, with a 3 day retention by default
- Enables experimental `early_head_compaction_min_in_memory_series` to force in-memory -> storage compaction sooner, preventing perpetual in-memory storage during extended periods of constant trace pushes
These changes ensure application and infrastructure stats flow into their correct sinks, sampled correctly and with the right retention, and that both the production layouts and the local-deploy environments have the correct wiring so everything talks together as intended.
- Added telemetry retention period overrides for Pyroscope in the runtime configuration.
- Updated local deployment YAML for AI-Horde to include trace sampling settings and local build context.
- Introduced a function to wire AI-Horde to Garage S3 for seamless integration.
- Implemented a smoke test for OTLP native histograms to ensure metrics are correctly ingested by Mimir.
- Enhanced local deployment scripts to support local source overrides for AI-Horde and Frontpage.
- Updated Alloy configuration to convert delta-temporality histograms to cumulative before sending to Mimir.
- Adjusted Docker Compose configurations to ensure proper service communication and container management.
- Added new environment variables for Garage S3 bucket configuration in local deployment.
- Improved test configurations for Alloy role to include trace sampling parameters.
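
The delta-to-cumulative conversion mentioned above could be expressed in Alloy configuration roughly as follows; this is a hedged sketch in which the component labels and endpoint URL are hypothetical, and the Alloy documentation should be consulted for the exact component arguments:

```alloy
// Illustrative pipeline fragment: convert delta-temporality histograms to
// cumulative before they reach Mimir. Labels and URL are hypothetical.
otelcol.processor.deltatocumulative "default" {
  output {
    metrics = [otelcol.exporter.prometheus.default.input]
  }
}

otelcol.exporter.prometheus "default" {
  forward_to = [prometheus.remote_write.mimir.receiver]
}

prometheus.remote_write "mimir" {
  endpoint {
    url = "http://mimir:9009/api/v1/push"   // hypothetical local endpoint
  }
}
```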
- Introduced Docker Compose configuration for the horde-model-reference service, including multi-worker support and health checks.
- Created HAProxy configuration to route traffic to the horde-model-reference service based on hostname.
- Added environment variable template for the horde-model-reference service, supporting various configurations and modes.
- Updated local deployment scripts to include the new horde-model-reference service, ensuring proper startup and logging.
- Enhanced full stack tests to validate the new service integration, including checks for Docker Compose and environment variable correctness.
- Implemented policy contracts to ensure defaults and security measures for the horde_model_reference role.
- Developed integration and rendering tests for the horde_model_reference role, covering various operational scenarios.