Add Wikipedia ingestion importer with canonical multi-model projections #5487
Merged
Copilot (AI) changed the title from "[WIP] Add new plugin module for Wikipedia ingestion in ThemisDB" to "Add Wikipedia ingestion importer with canonical multi-model projections" on Jun 24, 2026
Pull Request
Description
Adds a dedicated Wikipedia ingestion module under include/importers/ and src/importers/ with a runnable MVP for full import, incremental update, projection rebuild, validation, and portable export. The implementation models Wikipedia dumps around a canonical relational core and derives graph/vector/process/timeseries projections with checkpoint/resume and dirty-page refresh semantics.
Module surface
WikipediaIngestionPlugin and WikipediaIngestionPipeline
init(), shutdown(), runFullImport(...), runIncrementalUpdate(...), rebuildProjection(...), validateDatabase(), exportPortable(...)
Canonical core + delta workflow
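The checkpoint/resume and dirty-page refresh semantics mentioned in the description can be illustrated with a toy pipeline. This is a minimal sketch, not the code in this PR: only the method names runFullImport, runIncrementalUpdate, and rebuildProjection come from the description, while the ToyPipeline class, the Page struct, all signatures, and the stand-in projection (text length per page) are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical page record; the real schema is not shown in the PR text.
struct Page {
    std::uint64_t id;
    std::string title;
    std::string text;
};

// Toy pipeline: method names follow the PR, everything else is assumed.
class ToyPipeline {
public:
    // Imports at most maxPages pages, starting at the stored checkpoint, so a
    // rerun after an interruption resumes instead of starting over.
    std::size_t runFullImport(const std::vector<Page>& dump, std::size_t maxPages) {
        std::size_t imported = 0;
        for (std::size_t i = checkpoint_; i < dump.size() && imported < maxPages; ++i) {
            core_[dump[i].id] = dump[i];   // write to the canonical core
            dirty_.insert(dump[i].id);     // projection for this page is stale
            checkpoint_ = i + 1;           // persisting this is left out here
            ++imported;
        }
        return imported;
    }

    // Applies changed pages to the canonical core and marks them dirty.
    void runIncrementalUpdate(const std::vector<Page>& delta) {
        for (const Page& p : delta) {
            core_[p.id] = p;
            dirty_.insert(p.id);
        }
    }

    // Refreshes the derived projection for dirty pages only and returns how
    // many pages were refreshed.
    std::size_t rebuildProjection() {
        std::size_t refreshed = 0;
        for (std::uint64_t id : dirty_) {
            projection_[id] = core_.at(id).text.size();  // stand-in projection
            ++refreshed;
        }
        dirty_.clear();
        return refreshed;
    }

    std::size_t checkpoint() const { return checkpoint_; }
    std::size_t coreSize() const { return core_.size(); }

private:
    std::map<std::uint64_t, Page> core_;               // canonical core
    std::map<std::uint64_t, std::size_t> projection_;  // derived data
    std::set<std::uint64_t> dirty_;                    // pages needing refresh
    std::size_t checkpoint_ = 0;                       // next dump index
};
```

In this sketch, calling runFullImport twice on the same dump continues from the checkpoint, and a later runIncrementalUpdate followed by rebuildProjection touches only the changed pages rather than the whole projection.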
Projection model wiring
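For the graph projection, the edge labels this section names (LINKS_TO, IN_CATEGORY, REDIRECTS_TO) could be derived from canonical page rows roughly as follows. This is a hedged sketch: the PageRow and Edge structs, the projectEdges function, and the rule that a redirect page emits only its redirect edge are assumptions for illustration and are not taken from the PR.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Edge labels as named in the PR description.
enum class EdgeType { LINKS_TO, IN_CATEGORY, REDIRECTS_TO };

// Hypothetical edge: source page id, target title, label.
struct Edge {
    std::uint64_t from;
    std::string to;  // target title; resolving titles to ids is left out
    EdgeType type;
};

// Hypothetical canonical row for one page.
struct PageRow {
    std::uint64_t id;
    std::string redirectTarget;           // empty if the page is not a redirect
    std::vector<std::string> links;       // titles of linked pages
    std::vector<std::string> categories;  // category names
};

// Derives graph edges from a single canonical row.
std::vector<Edge> projectEdges(const PageRow& row) {
    std::vector<Edge> edges;
    if (!row.redirectTarget.empty()) {
        // Assumption: a redirect page contributes only its redirect edge.
        edges.push_back({row.id, row.redirectTarget, EdgeType::REDIRECTS_TO});
        return edges;
    }
    for (const auto& target : row.links) {
        edges.push_back({row.id, target, EdgeType::LINKS_TO});
    }
    for (const auto& category : row.categories) {
        edges.push_back({row.id, category, EdgeType::IN_CATEGORY});
    }
    return edges;
}
```

Because each row is projected on its own, a projection of this shape can be recomputed per page, which fits the dirty-page refresh idea in the description.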
LINKS_TO, IN_CATEGORY, and REDIRECTS_TO
Portable artifact + verification
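The portable export consists of a wikipedia.db artifact with a manifest.json and a wikipedia.db.verify.json; their formats are not shown in this description. As a rough illustration of what a manifest-based integrity check can look like, the sketch below records a size and a digest for the artifact bytes and re-checks them later. The ManifestEntry struct and both helper functions are hypothetical, and FNV-1a is used only as a short stand-in for whatever digest the real manifest records (a real artifact check would normally use a cryptographic hash).

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// FNV-1a (64-bit): a simple non-cryptographic hash, used here only as a
// stand-in digest so the example stays short.
std::uint64_t fnv1a64(const std::string& bytes) {
    std::uint64_t h = 0xcbf29ce484222325ULL;  // FNV offset basis
    for (unsigned char c : bytes) {
        h ^= c;
        h *= 0x100000001b3ULL;                // FNV prime
    }
    return h;
}

// Hypothetical manifest entry: file name, size in bytes, and digest.
struct ManifestEntry {
    std::string file;
    std::uint64_t size;
    std::uint64_t digest;
};

// Builds the entry an export step might write for one artifact.
ManifestEntry describeArtifact(const std::string& file, const std::string& bytes) {
    return {file, static_cast<std::uint64_t>(bytes.size()), fnv1a64(bytes)};
}

// Returns true if the bytes still match the recorded size and digest.
bool verifyArtifact(const ManifestEntry& entry, const std::string& bytes) {
    return entry.size == static_cast<std::uint64_t>(bytes.size()) &&
           entry.digest == fnv1a64(bytes);
}
```

A verification step of this kind fails as soon as the artifact is truncated or modified after export, which is the property a manifest is meant to provide.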
wikipedia.db artifact plus manifest.json
(wikipedia.db.verify.json)
Integration
Trust boundary note
(T5 -> T2)
Linked Issues
Type of Change
Breaking Change Checklist
VERSION and CMakeLists.txt
docs/migration/
### Removed / ### Changed section updated
Testing
Security Tiering Impact (Required for Runtime Changes)
Impacted tier(s):
Trust-boundary crossings documented in PR description (example: T3 -> T2, T5 -> T4 brokered call)
Boundary controls validated for affected T3/T4/T5 paths (AuthN/AuthZ, validation, rate limits, audit)
Boundary-focused tests added/updated or explicit N/A rationale provided
If trust level/privilege increased, security maintainer approval is attached
📚 Research & Knowledge (if applicable)
/docs/research/ created?
/docs/research/implementation_influence/ entered?
Relevant sources:
AI-Generated Code
GetSymbolReferences_CppTools checked (see .github/instructions/cpp-language-service-tools.instructions.md)
new/delete introduced without explicit review
AI Review Workflow (Required for AI-assisted PRs)
.github/prompts/pr-diff-findings-review.prompt.md
.github/prompts/security-hardening-review.prompt.md (or N/A documented)
.github/prompts/api-change-impact-review.prompt.md (or N/A documented)
.github/copilot/REVIEW_SEVERITY_POLICY.md
High-Finding Exception Record (only if High is accepted)
High-finding exception claimed in this PR
Finding reference:
Maintainer approver:
Mitigation in current release:
Target fix milestone:
Tracking issue:
Validation evidence:
Release Readiness Gate (Required for release-scoped changes)
.github/prompts/release-readiness-check.prompt.md for branch transition scope
BRANCHING_STRATEGY.md and RELEASE_STRATEGY.md
VERSIONING.md and CHANGELOG.md
Checklist
[Unreleased]
Scanner and IntelliSense Gates
critical findings in categories security, input_validation, query_correctness, distributed_consistency, concurrency, memory
high findings in the same categories (or explicitly approved)
unknown scanner findings triaged (fixed, re-categorized, or justified)