Conversation
Signed-off-by: Uroš Marolt <uros@marolt.me>
Pull request overview
Introduces a new Temporal worker to export PCC project hierarchy data from Snowflake to S3 (Parquet) and sync it into CDP (segments + insightsProjects), while refactoring shared Snowflake/S3/metadata components into @crowd/snowflake and updating the existing snowflake_connectors app to consume them.
Changes:
- Add `pcc_sync_worker` app with Temporal schedules/workflows, export/cleanup activities, Parquet parsing, and a DB-sync consumer.
- Move/centralize Snowflake export job metadata + S3/Parquet consumption logic into `services/libs/snowflake` and update `snowflake_connectors` to use it.
- Add DB migration for PCC sync support (`segments.maturity` + `pcc_projects_sync_errors` table + dedup index), plus worker Docker/compose setup.
Reviewed changes
Copilot reviewed 33 out of 34 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| services/libs/snowflake/src/snowflakeExporter.ts | Fix internal import to avoid self-package import/cycles. |
| services/libs/snowflake/src/s3Service.ts | New S3 download/delete + Parquet row iteration utility. |
| services/libs/snowflake/src/metadataStore.ts | Add platform filtering + named params for export job bookkeeping. |
| services/libs/snowflake/src/index.ts | Export new Snowflake lib surface (metadata store, S3 service, exporter). |
| services/libs/snowflake/package.json | Add S3 + Parquet deps and DB dependency for the library. |
| services/apps/snowflake_connectors/src/consumer/transformerConsumer.ts | Use @crowd/snowflake MetadataStore/S3Service; add enabled-platform filtering. |
| services/apps/snowflake_connectors/src/activities/exportActivity.ts | Switch imports to @crowd/snowflake. |
| services/apps/snowflake_connectors/src/activities/cleanupActivity.ts | Use shared MetadataStore/S3Service and pass enabled platforms to cleanup. |
| services/apps/snowflake_connectors/package.json | Remove direct S3/Parquet deps (now come from @crowd/snowflake). |
| services/apps/pcc_sync_worker/tsconfig.json | New TS config for PCC worker app. |
| services/apps/pcc_sync_worker/src/workflows/index.ts | Workflow exports. |
| services/apps/pcc_sync_worker/src/workflows/exportWorkflow.ts | Temporal workflow to run PCC export activity. |
| services/apps/pcc_sync_worker/src/workflows/cleanupWorkflow.ts | Temporal workflow to run PCC cleanup activity. |
| services/apps/pcc_sync_worker/src/scripts/triggerExport.ts | Manual script to start export workflow. |
| services/apps/pcc_sync_worker/src/scripts/triggerCleanup.ts | Manual script to start cleanup workflow. |
| services/apps/pcc_sync_worker/src/schedules/pccS3Export.ts | Temporal schedule registration for daily PCC export. |
| services/apps/pcc_sync_worker/src/schedules/pccS3Cleanup.ts | Temporal schedule registration for daily PCC cleanup. |
| services/apps/pcc_sync_worker/src/schedules/index.ts | Schedule exports. |
| services/apps/pcc_sync_worker/src/parser/types.ts | Parquet-row + parsed-project types. |
| services/apps/pcc_sync_worker/src/parser/rowParser.ts | Pure PCC row parsing + hierarchy mapping rules. |
| services/apps/pcc_sync_worker/src/parser/index.ts | Parser exports. |
| services/apps/pcc_sync_worker/src/main.ts | ServiceWorker archetype configuration. |
| services/apps/pcc_sync_worker/src/index.ts | Worker entrypoint: init + schedule + start consumer + start Temporal worker. |
| services/apps/pcc_sync_worker/src/consumer/pccProjectConsumer.ts | PCC job polling + Parquet processing + DB sync + error recording. |
| services/apps/pcc_sync_worker/src/config/settings.ts | Re-export Temporal config helpers. |
| services/apps/pcc_sync_worker/src/activities/index.ts | Activity exports. |
| services/apps/pcc_sync_worker/src/activities/exportActivity.ts | Snowflake recursive CTE export into S3 + metadata insert. |
| services/apps/pcc_sync_worker/src/activities/cleanupActivity.ts | Cleanup exported S3 files + mark jobs cleaned + Slack alerting on failures. |
| services/apps/pcc_sync_worker/package.json | PCC worker package manifest + scripts. |
| scripts/services/pcc-sync-worker.yaml | Compose service definitions for PCC worker (prod/dev). |
| scripts/services/docker/Dockerfile.pcc_sync_worker.dockerignore | Docker ignore file for PCC worker build context. |
| scripts/services/docker/Dockerfile.pcc_sync_worker | Multi-stage build for PCC worker. |
| backend/src/database/migrations/V1775312770__pcc-sync-worker-setup.sql | Add segments.maturity + PCC sync errors table + dedup index. |
| backend/src/database/migrations/U1775312770__pcc-sync-worker-setup.sql | Rollback for PCC sync DB changes. |
Comments suppressed due to low confidence (2)
services/libs/snowflake/src/metadataStore.ts:77
- When `platforms` is provided as an empty array, this method falls back to no filter and will claim jobs for all platforms. That's risky if `CROWD_SNOWFLAKE_ENABLED_PLATFORMS` is accidentally empty/misconfigured. Consider treating an explicit empty `platforms` list as "match nothing" (return null early, or inject an `AND FALSE` filter).

services/libs/snowflake/src/metadataStore.ts:125
- `platforms` being an empty array currently results in no platform filter, so cleanup can target jobs for all platforms if the enabled-platforms list is empty/misconfigured. Consider returning `[]` early when `platforms` is provided but empty (or otherwise ensuring the filter matches nothing).
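The guard both suppressed comments ask for could look like the sketch below. This is hypothetical: the real `buildPlatformFilter` in `@crowd/snowflake` may have a different signature, and a production store should use named query parameters rather than the inline quoting shown here.

```typescript
// Hypothetical sketch of the suggested guard; not the actual
// @crowd/snowflake implementation.
function buildPlatformFilter(platforms?: string[]): string {
  if (platforms === undefined) {
    // No list supplied: the caller explicitly opted out of filtering.
    return ''
  }
  if (platforms.length === 0) {
    // An explicit empty list should match nothing, so a misconfigured
    // CROWD_SNOWFLAKE_ENABLED_PLATFORMS can never claim every platform's jobs.
    return 'AND FALSE'
  }
  // Inline quoting is for illustration only; prefer named parameters.
  const quoted = platforms.map((p) => `'${p.replace(/'/g, "''")}'`).join(', ')
  return `AND platform IN (${quoted})`
}
```

The key design point is distinguishing "no filter requested" (`undefined`) from "filter on nothing" (`[]`), so an accidentally empty enabled-platforms env var fails closed instead of open.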
mbani01
left a comment
Well done @themarolt 💪 left a couple comments
joanagmaia
left a comment
The main thing to update is the query to Snowflake, which in turn will affect how we detect hierarchy mismatches. Main changes are:
- Use PROJECTS_SPINE to rely on its depth mapping
- Don't hard-code depth levels up to 5; simply flatten depth into multiple rows
```ts
const PLATFORM = 'pcc'
const SOURCE_NAME = 'project-hierarchy'

function buildSourceQuery(): string {
```
Hey, I know this is the query I used before in the first data analysis, but it's not following the latest mapping rules of the new PROJECTS_SPINE table, which already has the hierarchy defined, so we don't need to compute it ourselves.
Proposed new query:
```sql
SELECT
  p.project_id,
  p.name,
  p.description,
  p.project_logo,
  p.project_status,
  p.project_maturity_level,
  ps.mapped_project_id,
  ps.mapped_project_name,
  ps.mapped_project_slug,
  ps.hierarchy_level,
  s.segment_id
FROM ANALYTICS.SILVER_DIM.PROJECTS p
LEFT JOIN ANALYTICS.SILVER_DIM.PROJECT_SPINE ps ON ps.base_project_id = p.project_id
LEFT JOIN ANALYTICS.SILVER_DIM.ACTIVE_SEGMENTS s
  ON s.source_id = p.project_id
  AND s.project_type = 'subproject'
WHERE p.project_id NOT IN (
  SELECT DISTINCT parent_id
  FROM ANALYTICS.SILVER_DIM.PROJECTS
  WHERE parent_id IS NOT NULL
)
ORDER BY p.name, ps.hierarchy_level ASC;
```
Important
Level/depth is no longer in a column but rather in new rows. E.g.

| project_id | name | mapped_project_id | mapped_project_name | mapped_project_slug | hierarchy_level | segment_id |
|---|---|---|---|---|---|---|
| kubectl_id | kubectl | kubectl_id | kubectl | kubectl | 1 | seg_123 |
| kubectl_id | kubectl | kubernetes_id | kubernetes | kubernetes | 2 | seg_123 |
| kubectl_id | kubectl | tlf_id | tlf | tlf | 3 | seg_123 |
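With the one-row-per-level shape, a consumer can rebuild each leaf project's ancestor chain by grouping on `project_id` and sorting by `hierarchy_level`. A minimal sketch (the `SpineRow` type and function name are illustrative, not the worker's actual `rowParser` code):

```typescript
// Illustrative row shape for the proposed query's output (a subset of columns).
interface SpineRow {
  project_id: string
  name: string
  mapped_project_id: string
  mapped_project_name: string
  hierarchy_level: number
  segment_id: string | null
}

// Group one-row-per-level output into an ordered ancestor chain per leaf,
// from level 1 (the leaf itself) up to the top of the hierarchy.
function buildChains(rows: SpineRow[]): Map<string, SpineRow[]> {
  const chains = new Map<string, SpineRow[]>()
  for (const row of rows) {
    const chain = chains.get(row.project_id) ?? []
    chain.push(row)
    chains.set(row.project_id, chain)
  }
  for (const chain of chains.values()) {
    chain.sort((a, b) => a.hierarchy_level - b.hierarchy_level)
  }
  return chains
}
```

This is what makes the normalized shape depth-agnostic: a sixth level is just one more row in the chain, not a new `depth_6` column.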
Changes:
1. No hardcoded depth limit
The recursive CTE hard-codes depth_1 through depth_5. If the hierarchy ever grows to 6+ levels, we need a schema change. The new query handles any depth automatically via additional rows.
2. Reuses existing model infrastructure
PROJECT_SPINE already computes and materializes the hierarchy. The recursive CTE duplicates that logic inline; any bug fix or change to hierarchy traversal would need to be made in two places.
3. Simpler, more readable SQL
The recursive CTE is ~20 lines of stateful logic that requires understanding how the recursion builds depth_N columns. The new query is a straightforward set of joins that's immediately understandable.
4. Normalized shape
Wide columns (depth_1..depth_5) are harder to work with downstream: filtering, aggregating, or displaying "what level is this?" requires knowing which column to look at. One row per level is easier to GROUP BY, FILTER, or JOIN against.
5. Removed repository_url on the leaf project
Since we're no longer using repository_url to onboard automatically (it's a free-text URL in the UI that we shouldn't rely on), we don't need to return it either.
Done, just testing it now before resolving the comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit 2c9ab8d.
Note
Medium Risk
Adds a new always-on worker that writes to production `segments`/`insightsProjects` and introduces new DB schema/table/indexes; mistakes in matching/mapping logic or platform filtering could cause incorrect updates or missed/incorrect job processing.

Overview
Adds a new `pcc_sync_worker` Temporal worker that schedules a daily PCC Snowflake export to S3, consumes the resulting Parquet jobs, maps PCC hierarchy depth to CDP (group/project/subproject), and updates matching `segments`/`insightsProjects` in a single DB transaction per job (with a `PCC_DRY_RUN` mode).

Introduces DB support for the sync by adding `segments.maturity` and a `pcc_projects_sync_errors` table with deduplicating indexes to persist schema/hierarchy/slug/name-conflict issues for manual review.

Refactors Snowflake job plumbing into `@crowd/snowflake` (exports `MetadataStore`, `S3Service`, `SnowflakeExporter`, adds `buildPlatformFilter` and platform-aware cleanup/claiming), updates `snowflake_connectors` to use the shared implementations, and adds Docker/compose setup plus workspace deps for the new worker.

Reviewed by Cursor Bugbot for commit 08be688. Bugbot is set up for automated code reviews on this repo.
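The "maps PCC hierarchy depth to CDP (group/project/subproject)" step could be sketched as below. The specific level-to-type assignment is an assumption for illustration; the worker's real rules live in its `rowParser` and may differ.

```typescript
type SegmentType = 'subproject' | 'project' | 'group'

// Assumed mapping from PCC hierarchy depth to CDP segment levels:
// the leaf (level 1) becomes a subproject, its parent a project, and
// anything higher a group. This is a sketch, not the worker's actual rule.
function segmentTypeForLevel(hierarchyLevel: number): SegmentType {
  if (hierarchyLevel === 1) return 'subproject'
  if (hierarchyLevel === 2) return 'project'
  return 'group'
}
```

Keeping this mapping a pure function of `hierarchy_level` is what the normalized PROJECT_SPINE output enables: no `depth_N` column inspection is needed.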