Add Wikipedia ingestion importer with canonical multi-model projections #5487

Merged
makr-code merged 3 commits into develop from copilot/develop-wikipedia-ingest on Jun 24, 2026

Conversation

Copilot AI commented Jun 24, 2026

Pull Request

Description

Adds a dedicated Wikipedia ingestion module under include/importers/ and src/importers/ with a runnable MVP for full import, incremental update, projection rebuild, validation, and portable export. The implementation models Wikipedia dumps around a canonical relational core and derives graph/vector/process/timeseries projections with checkpoint/resume and dirty-page refresh semantics.

  • Module surface

    • Added WikipediaIngestionPlugin and WikipediaIngestionPipeline
    • Added public config, types, transform, and checkpoint headers
    • Exposed explicit lifecycle and workflow APIs:
      • init()
      • shutdown()
      • runFullImport(...)
      • runIncrementalUpdate(...)
      • rebuildProjection(...)
      • validateDatabase()
      • exportPortable(...)
  • Canonical core + delta workflow

    • Introduced canonical page/revision/link/category/redirect state as the source of truth
    • Added idempotent page/revision upsert behavior
    • Added dirty-page tracking so projection rebuilds only refresh affected pages
    • Added JSON-backed checkpoint persistence for resume after interruption
    • Added dead-letter capture for best-effort parsing failures
  • Projection model wiring

    • Graph projection emits LINKS_TO, IN_CATEGORY, and REDIRECTS_TO
    • Vector projection keeps vendor-neutral embedding hooks and marks pending embeddings
    • Process projection emits page/revision lifecycle events
    • TimeSeries projection emits revision/day metrics per page
  • Portable artifact + verification

    • Added export of a portable wikipedia.db artifact plus manifest.json
    • Manifest includes dump source, importer version, row counts, checksums, and external tool references
    • Added verification report generation (wikipedia.db.verify.json)
  • Integration

    • Registered the importer in the importer/plugin registries
    • Wired the new sources into the existing CMake build
    • Added focused importer test coverage and importer-module docs
  • Trust boundary note

    • Runtime path crosses the importer/plugin boundary into importer execution (T5 -> T2)
    • Boundary controls in this MVP are source validation, strict/best-effort mode, checkpointed resume, and dead-letter isolation
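
The idempotent upsert and dirty-page tracking described under "Canonical core + delta workflow" can be sketched as follows. This is a minimal illustration, not the code from this PR: `Revision`, `CanonicalStore`, `upsertRevision`, and `takeDirtyPages` are hypothetical names, and the rule "a revision with an ID no greater than the stored one is a no-op" is an assumption about how idempotency might be decided.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <set>
#include <string>

// Hypothetical types for illustration; not the headers added by this PR.
struct Revision {
    std::uint64_t page_id;
    std::uint64_t revision_id;
    std::string   text;
};

class CanonicalStore {
public:
    // Idempotent upsert: re-applying a revision that is already stored (or an
    // older one) changes nothing and does not mark the page dirty again.
    // Returns true only when the stored state actually changed.
    bool upsertRevision(const Revision& rev) {
        auto it = latest_.find(rev.page_id);
        if (it != latest_.end() && it->second.revision_id >= rev.revision_id) {
            return false;  // assumption: the highest revision ID wins
        }
        latest_.insert_or_assign(rev.page_id, rev);
        dirty_.insert(rev.page_id);
        return true;
    }

    // Hands the current dirty set to a projection rebuild and clears it, so
    // the next rebuild only sees pages touched after this call.
    std::set<std::uint64_t> takeDirtyPages() {
        std::set<std::uint64_t> out;
        out.swap(dirty_);
        return out;
    }

    std::size_t pageCount() const { return latest_.size(); }

private:
    std::map<std::uint64_t, Revision> latest_;  // latest revision per page
    std::set<std::uint64_t> dirty_;             // pages awaiting projection refresh
};
```

In this sketch, replaying the same dump after an interruption leaves the dirty set empty for pages that were already imported, which is the property that lets a projection rebuild touch only affected pages.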
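
```cpp
// Construct the plugin and point it at a JSON checkpoint file used for resume.
themis::importers::WikipediaIngestionPlugin plugin;
plugin.initialize(R"({"checkpoint_path":"./wikipedia.checkpoint.json"})");

// Run a full import, then apply a newer dump as an incremental update.
// Designated initializers require C++20.
plugin.runFullImport({.source_path = "./pages-articles.xml", .source_id = "full-2026-06"});
plugin.runIncrementalUpdate({.source_path = "./pages-articles-next.xml", .source_id = "delta-2026-07"});

// Validate the database, then export the portable artifact and its manifest.
auto report = plugin.validateDatabase();
auto manifest = plugin.exportPortable("./wikipedia.db", "./manifest.json");
```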
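
The graph projection listed under "Projection model wiring" emits `LINKS_TO`, `IN_CATEGORY`, and `REDIRECTS_TO` edges. A rough sketch of how such edges might be derived from one canonical page record is shown below. `CanonicalPage`, `Edge`, and `projectGraphEdges` are hypothetical names chosen for illustration, and the field layout is an assumption; only the three edge labels come from the PR description.

```cpp
#include <optional>
#include <string>
#include <vector>

// Hypothetical canonical page record; the PR's real types may differ.
struct CanonicalPage {
    std::string                title;
    std::vector<std::string>   links;            // titles of linked pages
    std::vector<std::string>   categories;       // category names
    std::optional<std::string> redirect_target;  // set only for redirect pages
};

// One directed, labeled edge in the graph projection.
struct Edge {
    std::string from;
    std::string type;  // "LINKS_TO", "IN_CATEGORY", or "REDIRECTS_TO"
    std::string to;
};

// Derives all graph edges for a single page from canonical state. Because the
// result depends only on the page record, it can be recomputed for dirty pages
// without touching the rest of the graph.
std::vector<Edge> projectGraphEdges(const CanonicalPage& page) {
    std::vector<Edge> edges;
    if (page.redirect_target) {
        edges.push_back({page.title, "REDIRECTS_TO", *page.redirect_target});
    }
    for (const auto& target : page.links) {
        edges.push_back({page.title, "LINKS_TO", target});
    }
    for (const auto& category : page.categories) {
        edges.push_back({page.title, "IN_CATEGORY", category});
    }
    return edges;
}
```

Keeping the projection a pure function of canonical state is one way to make `rebuildProjection(...)` safe to rerun; whether the PR structures it this way is not stated in the description.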

Linked Issues

  • Managed by automation

Type of Change

  • Bug fix (non-breaking)
  • New feature (non-breaking)
  • Refactoring (non-breaking)
  • Documentation
  • Breaking change (requires MAJOR version bump — see VERSIONING.md)
  • Security fix
  • Other:

Breaking Change Checklist

  • MAJOR version bump planned in VERSION and CMakeLists.txt
  • Migration guide added in docs/migration/
  • Announcement prepared for GitHub Discussions (≥ 2 weeks before release)
  • CHANGELOG ### Removed / ### Changed section updated

Testing

  • Unit tests added/updated
  • Integration tests added/updated
  • Manual testing performed
  • Benchmarks run (if performance-sensitive change)

Security Tiering Impact (Required for Runtime Changes)

  • Impacted tier(s):

    • T0 Trusted Core
    • T1 Security & Platform Services
    • T2 Data Plane Engines
    • T3 Interface & Protocol Edge
    • T4 Managed Extension Runtime
    • T5 Plugin Boundary
    • N/A (docs-only / non-runtime)
  • Trust-boundary crossings documented in PR description (example: T3 -> T2, T5 -> T4 brokered call)

  • Boundary controls validated for affected T3/T4/T5 paths (AuthN/AuthZ, validation, rate limits, audit)

  • Boundary-focused tests added/updated or explicit N/A rationale provided

  • If trust level/privilege increased, security maintainer approval is attached

📚 Research & Knowledge (if applicable)

  • Is this PR based on scientific paper(s) or best practices?
    • If YES: research files created in /docs/research/?
    • If YES: linked in the module README under "Wissenschaftliche Grundlagen" (scientific foundations)?
    • If YES: recorded in /docs/research/implementation_influence/?

Relevant sources:

  • Paper:
  • Best Practice:
  • Architecture Decision:

AI-Generated Code

  • Symbol references checked with GetSymbolReferences_CppTools (see .github/instructions/cpp-language-service-tools.instructions.md)
  • No raw pointers and no new/delete introduced without explicit review
  • RAII and exception safety checked for new/changed paths
  • No unnecessarily complex AI-generated abstractions introduced
  • Performance metrics checked if a hot path is affected

AI Review Workflow (Required for AI-assisted PRs)

  • Findings-first review performed with .github/prompts/pr-diff-findings-review.prompt.md
  • Security hardening review performed for security-sensitive/runtime changes with .github/prompts/security-hardening-review.prompt.md (or N/A documented)
  • API impact review performed for API/contract changes with .github/prompts/api-change-impact-review.prompt.md (or N/A documented)
  • All Critical/High findings are resolved or explicitly accepted with rationale in PR description
  • Residual risks and follow-up actions documented in PR description
  • Severity policy applied according to .github/copilot/REVIEW_SEVERITY_POLICY.md

High-Finding Exception Record (only if High is accepted)

  • High-finding exception claimed in this PR

  • Finding reference:

  • Maintainer approver:

  • Mitigation in current release:

  • Target fix milestone:

  • Tracking issue:

  • Validation evidence:

Release Readiness Gate (Required for release-scoped changes)

  • Release readiness reviewed with .github/prompts/release-readiness-check.prompt.md for branch transition scope
  • Branch governance validated against BRANCHING_STRATEGY.md and RELEASE_STRATEGY.md
  • Versioning/changelog impact validated against VERSIONING.md and CHANGELOG.md

Checklist

  • Code follows project style guidelines (clang-format / clang-tidy)
  • Self-review completed
  • Documentation updated (if needed)
  • CHANGELOG.md updated under [Unreleased]
  • No new warnings introduced
  • Security-sensitive paths reviewed by security maintainer (if applicable)

Scanner and IntelliSense Gates

  • IntelliSense/Compiler: no new errors in changed files
  • clang-tidy/cppcheck: no new high-risk findings in changed files
  • Gap Scanner: no new critical findings in categories security, input_validation, query_correctness, distributed_consistency, concurrency, memory
  • Gap Scanner: no new high findings in the same categories (or explicitly approved)
  • Gap Scanner delta report attached (baseline vs current), not only absolute totals
  • New unknown scanner findings triaged (fixed, re-categorized, or justified)

Copilot AI changed the title from "[WIP] Add new plugin module for Wikipedia ingestion in ThemisDB" to "Add Wikipedia ingestion importer with canonical multi-model projections" Jun 24, 2026
Copilot AI requested a review from makr-code June 24, 2026 08:54
makr-code marked this pull request as ready for review June 24, 2026 10:04
github-actions bot added the type:documentation, type:test, area:storage, area:vector, area:graph, area:acceleration, and ai-generated labels Jun 24, 2026
makr-code merged commit 5d0c50b into develop Jun 24, 2026
15 of 23 checks passed
