Skip to content

Commit d621bb7

Browse files
committed
Wire GitHub numeric IDs through sync + push for rename robustness
Capture GitHub's stable repo / owner / installation IDs on first successful sync, write them to bindings + content rows, and have the push event processor prefer ID-keyed binding lookup with a name-based fallback for un-synced rows. Internal matching now survives a GitHub repo or org rename without operator intervention; the public API remains name-keyed and exposes the captured IDs in the binding GET response for debugging. Key decisions: - ``InstallationAuth`` grew a required ``installation_id: int`` field rather than a separate ``get_installation_id`` round-trip from the syncer. The app client already resolves the installation id inside ``get_installation_auth`` and was discarding it; threading it onto the auth record is a no-cost change for production callers and lets the syncer write ``github_installation_id`` to the binding without doubling the GitHub round-trip count. The dataclass field is required (no default) so production code cannot accidentally construct an auth record without it; the test fixture's ``installation_auth`` helper sources it from the mock's seeded ``(owner, repo) -> id`` map (or ``default_installation_id``) so existing direct-construction tests keep their ergonomic shape. - The tree fetcher gained a ``_fetch_repo`` step that calls ``GET /repos/{owner}/{repo}`` rather than overloading the existing ``commits/{ref}`` call, because the commits response does not include the repository's numeric ID and adding the field to its schema is GitHub's prerogative. Repo + owner IDs land on ``FetchedTree`` as required ``int`` fields rather than ``int | None`` because the syncer is the only consumer in this slice and it always wants both — leaving them ``Optional`` would have leaked ``None`` into the store-layer guard for no benefit. The ``raise_for_status`` chain catches a missing repo deterministi- cally, so a real GitHub anomaly (e.g. installation lost mid-sync) surfaces as a recorded sync failure rather than a silent no-id capture. - ``DashboardGitHubTemplateBindingStore`` exposes two new methods rather than re-purposing ``list_by_repo_ref``: ``list_by_repo_id_and_ref`` for the ID-keyed primary lookup (backed by ``idx_dashboard_github_template_bindings_repo_id_ref`` from #240) and ``list_unsynced_by_repo_ref`` whose ``github_repo_id IS NULL`` filter encodes the rename-safety invariant directly. Without that filter, a binding whose name coincidentally matches a *different* repo now sitting at the old string would re-sync against the wrong upstream; pushing the filter into the store query (vs. dedup-after-the-fact) keeps the invariant load-bearing in code rather than only in tests. ``list_by_repo_ref`` stays in the API surface even though the push processor no longer calls it, because the binding store is a public storage primitive and external callers (admin tooling, future debug consoles) may legitimately want the unfiltered name-only view. - ``PushEventProcessor._lookup_bindings`` unions the two store calls with a ``seen: set[int]`` dedup guard. The two queries are disjoint by construction (the unsynced filter excludes IDs the primary lookup might find), so the dedup is defense-in-depth — but the ``test_process_dedupes_id_and_name_matches`` test pins the guard so a future loosening of either query (e.g. dropping the ``IS NULL`` filter "for symmetry") cannot regress to double- enqueueing the same ``dashboard_sync`` job. ``_coerce_int`` rejects ``bool`` explicitly because ``isinstance(True, int)`` is ``True`` in Python and a malformed payload that sent ``"id": true`` would otherwise leak ``1`` through as a real repo id. - The store-layer "only assign if non-None" guard on ``update_sync_state`` and ``upsert`` (already pinned by tests in #240) is what protects populated IDs from being overwritten on a failed re-sync. The syncer now always passes populated IDs on the success path, so the guard's job is to fence the *failure* path (where ``_record_failure`` calls ``update_sync_state`` without ID kwargs). The new ``test_sync_does_not_overwrite_populated_ids_ on_resync`` test pins the contract end-to-end so a future syncer refactor that started passing ``None`` explicitly cannot zero a binding's IDs. - ``GitHubMock.seed_tree`` now auto-seeds the ``GET /repos/{owner}/{repo}`` endpoint with default ``(repo_id=1, owner_id=1)`` so existing tree-fetcher tests that do not care about ID capture keep working with a single ``seed_tree(...)`` call. Tests that round-trip captured IDs into the database pass ``repo_id=`` / ``owner_id=`` to ``seed_tree`` and the auto-call uses those instead. ``seed_repo`` is exposed separately for tests that bypass ``seed_tree`` (the syncer's GitHub-error fixture needs the repo endpoint mocked but the ref endpoint to return 404, so it calls ``seed_repo`` then mocks ``commits/{ref}`` directly). Next-iteration notes: - Slice #242 ("rename + installation webhooks") will consume the ``github_owner_id`` / ``github_repo_id`` columns this slice now populates. When a ``repository.renamed`` event arrives, the handler can match the binding by stable ID (the rewrite target) and update ``github_owner`` / ``github_repo`` strings without re-syncing content. This slice intentionally does not register any rename-event handlers — name-keyed pushes still match un-synced bindings and ID-keyed pushes still match synced ones, so the system is functional without rename handlers; they're a staleness-fix, not a correctness requirement. - The blocker merge of ``tickets/DM-54689-github-webhook`` (PR #264) into this branch is a no-op once #264 lands on main first. GitHub auto-merge will flatten the merge commit. If #264 is rebased instead of merged cleanly, this branch needs a manual re-merge. - ``InstallationAuth.installation_id`` is now required, but the GitHub App secrets remain optional on ``Factory`` / ``WorkerFactoryBuilder``. The optionality is a feature-flag for the dashboard-template-from-github feature; once the feature graduates from optional, the three GitHub App fields can be tightened to non-Optional and ``startup`` raised earlier — the ``InstallationAuth`` change has no bearing on that decision. - The ``DashboardTemplateBinding`` client model now exposes the three numeric ID fields as informational. If a future client wants to assert on rename-robustness from the outside, it can observe ``github_repo_id`` flipping from ``null`` to populated after the first successful sync — no new endpoint needed. Closes #241
1 parent be3c5fd commit d621bb7

15 files changed

Lines changed: 849 additions & 25 deletions

File tree

client/src/docverse/client/models/dashboard_template.py

Lines changed: 28 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -151,6 +151,34 @@ class DashboardTemplateBinding(BaseModel):
151151
),
152152
)
153153

154+
github_owner_id: int | None = Field(
155+
default=None,
156+
description=(
157+
"GitHub's stable numeric ID for the owner. Captured on first "
158+
"successful sync; ``None`` for un-synced bindings. Informational "
159+
"only — the public API remains keyed on ``github_owner``."
160+
),
161+
)
162+
163+
github_repo_id: int | None = Field(
164+
default=None,
165+
description=(
166+
"GitHub's stable numeric ID for the repository. Captured on "
167+
"first successful sync; ``None`` for un-synced bindings. "
168+
"Informational only — the public API remains keyed on "
169+
"``github_repo``."
170+
),
171+
)
172+
173+
github_installation_id: int | None = Field(
174+
default=None,
175+
description=(
176+
"GitHub App installation ID for the repository. Captured on "
177+
"first successful sync; ``None`` for un-synced bindings or when "
178+
"the GitHub App is not installed. Informational only."
179+
),
180+
)
181+
154182
date_created: datetime = Field(
155183
description="Timestamp when the binding was created."
156184
)

src/docverse/handlers/orgs/models.py

Lines changed: 3 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -535,6 +535,9 @@ def from_domain(
535535
last_sync_status=domain.last_sync_status,
536536
last_sync_error=domain.last_sync_error,
537537
last_sync_queue_job_url=last_sync_queue_job_url,
538+
github_owner_id=domain.github_owner_id,
539+
github_repo_id=domain.github_repo_id,
540+
github_installation_id=domain.github_installation_id,
538541
date_created=domain.date_created,
539542
date_updated=domain.date_updated,
540543
)

src/docverse/services/dashboard_templates/push_processor.py

Lines changed: 68 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -9,6 +9,9 @@
99
import structlog
1010

1111
from docverse.client.models.dashboard_template import normalize_github_ref
12+
from docverse.domain.dashboard_github_template import (
13+
DashboardGitHubTemplateBinding,
14+
)
1215
from docverse.domain.queue import QueueJob
1316
from docverse.storage.dashboard_templates.github import (
1417
DashboardGitHubTemplateBindingStore,
@@ -95,10 +98,10 @@ async def process(self, payload: Mapping[str, Any]) -> list[QueueJob]:
9598
return []
9699

97100
normalized_ref = normalize_github_ref(ref)
98-
bindings = await self._binding_store.list_by_repo_ref(
99-
github_owner=owner,
100-
github_repo=repo_name,
101-
github_ref=normalized_ref,
101+
repo_id = _coerce_int(repo.get("id"))
102+
103+
bindings = await self._lookup_bindings(
104+
owner=owner, repo=repo_name, ref=normalized_ref, repo_id=repo_id
102105
)
103106
if not bindings:
104107
self._logger.info(
@@ -107,6 +110,7 @@ async def process(self, payload: Mapping[str, Any]) -> list[QueueJob]:
107110
github_repo=repo_name,
108111
github_ref_raw=ref,
109112
github_ref=normalized_ref,
113+
github_repo_id=repo_id,
110114
)
111115
return []
112116

@@ -143,6 +147,51 @@ async def process(self, payload: Mapping[str, Any]) -> list[QueueJob]:
143147
)
144148
return jobs
145149

150+
async def _lookup_bindings(
151+
self,
152+
*,
153+
owner: str,
154+
repo: str,
155+
ref: str,
156+
repo_id: int | None,
157+
) -> list[DashboardGitHubTemplateBinding]:
158+
"""Find bindings for a push, preferring ``github_repo_id``.
159+
160+
Two queries, unioned and deduplicated by binding id:
161+
162+
- ``list_by_repo_id_and_ref`` — when the payload carries a
163+
``repository.id``, this is the rename-robust primary lookup.
164+
Bindings that completed at least one successful sync have
165+
their ``github_repo_id`` populated and match here even when
166+
the GitHub display name has since changed.
167+
- ``list_unsynced_by_repo_ref`` — fallback for bindings that
168+
have never synced (and therefore lack a stable ID). The
169+
``github_repo_id IS NULL`` filter inside the store keeps
170+
synced bindings whose name happens to coincide with this
171+
push out of the result, so a different repo now sitting at
172+
the old name does not accidentally trigger a sync.
173+
174+
Dedup defends against any future loosening of either query.
175+
"""
176+
seen: set[int] = set()
177+
bindings: list[DashboardGitHubTemplateBinding] = []
178+
if repo_id is not None:
179+
for binding in await self._binding_store.list_by_repo_id_and_ref(
180+
github_repo_id=repo_id, github_ref=ref
181+
):
182+
if binding.id in seen:
183+
continue
184+
seen.add(binding.id)
185+
bindings.append(binding)
186+
for binding in await self._binding_store.list_unsynced_by_repo_ref(
187+
github_owner=owner, github_repo=repo, github_ref=ref
188+
):
189+
if binding.id in seen:
190+
continue
191+
seen.add(binding.id)
192+
bindings.append(binding)
193+
return bindings
194+
146195
async def _resolve_changed_paths(
147196
self, payload: Mapping[str, Any], *, owner: str, repo: str
148197
) -> list[str]:
@@ -180,6 +229,21 @@ def _split_full_name_tuple(full_name: object) -> tuple[str, str] | None:
180229
return None
181230

182231

232+
def _coerce_int(value: object) -> int | None:
233+
"""Return ``value`` as ``int`` when it is a non-bool int, else ``None``.
234+
235+
GitHub webhooks send numeric IDs as JSON ints, but defensive
236+
payload parsing prefers a strict guard: the ``not isinstance(bool)``
237+
clause keeps the ``isinstance(True, int)`` quirk from leaking a
238+
truth value through as the repo id.
239+
"""
240+
if isinstance(value, bool):
241+
return None
242+
if isinstance(value, int):
243+
return value
244+
return None
245+
246+
183247
def _normalize_root(root_path: str) -> str:
184248
return root_path.strip("/")
185249

src/docverse/services/dashboard_templates/sync.py

Lines changed: 5 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -219,13 +219,18 @@ async def sync(self, binding_id: int) -> DashboardSyncResult:
219219
etag=etag,
220220
template_toml=template_toml,
221221
files=other_files,
222+
github_owner_id=fetched.owner_id,
223+
github_repo_id=fetched.repo_id,
222224
)
223225

224226
updated = await self._binding_store.update_sync_state(
225227
binding_id=binding.id,
226228
last_sync_status="succeeded",
227229
last_sync_error=None,
228230
github_template_id=upsert.template.id,
231+
github_owner_id=fetched.owner_id,
232+
github_repo_id=fetched.repo_id,
233+
github_installation_id=auth.installation_id,
229234
)
230235
if updated is None:
231236
msg = (

src/docverse/storage/dashboard_templates/github/binding_store.py

Lines changed: 56 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -221,6 +221,62 @@ async def list_by_repo_ref(
221221
for binding_row, public_id, commit_sha in result
222222
]
223223

224+
async def list_by_repo_id_and_ref(
225+
self,
226+
*,
227+
github_repo_id: int,
228+
github_ref: str,
229+
) -> list[DashboardGitHubTemplateBinding]:
230+
"""List every binding pinned to a stable GitHub repo ID + ref.
231+
232+
Backed by the
233+
``idx_dashboard_github_template_bindings_repo_id_ref`` composite
234+
index. Used by the push event processor as the primary
235+
rename-robust lookup: bindings keep matching their upstream
236+
repository even after a GitHub rename or transfer because the
237+
numeric ID is invariant.
238+
"""
239+
result = await self._session.execute(
240+
select(SqlDashboardGitHubTemplateBinding).where(
241+
SqlDashboardGitHubTemplateBinding.github_repo_id
242+
== github_repo_id,
243+
SqlDashboardGitHubTemplateBinding.github_ref == github_ref,
244+
)
245+
)
246+
return [
247+
DashboardGitHubTemplateBinding.model_validate(row)
248+
for row in result.scalars().all()
249+
]
250+
251+
async def list_unsynced_by_repo_ref(
252+
self,
253+
*,
254+
github_owner: str,
255+
github_repo: str,
256+
github_ref: str,
257+
) -> list[DashboardGitHubTemplateBinding]:
258+
"""List ``(owner, repo, ref)`` matches that have no captured repo ID.
259+
260+
The push event processor unions this with
261+
:meth:`list_by_repo_id_and_ref` to cover bindings that have
262+
never completed a successful sync (and therefore lack a
263+
``github_repo_id``). Already-synced bindings whose name
264+
coincidentally matches a different repo are excluded — the
265+
ID lookup is the authoritative match for those.
266+
"""
267+
result = await self._session.execute(
268+
select(SqlDashboardGitHubTemplateBinding).where(
269+
SqlDashboardGitHubTemplateBinding.github_owner == github_owner,
270+
SqlDashboardGitHubTemplateBinding.github_repo == github_repo,
271+
SqlDashboardGitHubTemplateBinding.github_ref == github_ref,
272+
SqlDashboardGitHubTemplateBinding.github_repo_id.is_(None),
273+
)
274+
)
275+
return [
276+
DashboardGitHubTemplateBinding.model_validate(row)
277+
for row in result.scalars().all()
278+
]
279+
224280
async def list_project_overrides_for_org(
225281
self, org_id: int
226282
) -> list[DashboardGitHubTemplateBinding]:

src/docverse/storage/github/app_client.py

Lines changed: 8 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -29,9 +29,16 @@ class InstallationAuth:
2929
``base_url`` and attach ``Authorization: Bearer {token}`` on each
3030
request so the process-wide ``httpx.AsyncClient`` defaults stay
3131
untouched.
32+
33+
``installation_id`` is GitHub's stable identifier for the
34+
repository's app installation. The syncer captures it onto the
35+
binding so future push events can reference the installation
36+
directly without an extra ``/repos/{owner}/{repo}/installation``
37+
round-trip.
3238
"""
3339

3440
token: str
41+
installation_id: int
3542
base_url: str = GITHUB_API_BASE_URL
3643

3744

@@ -147,4 +154,4 @@ async def get_installation_auth(
147154
"""
148155
installation_id = await self.get_installation_id(owner, repo)
149156
token = await self.exchange_installation_token(installation_id)
150-
return InstallationAuth(token=token)
157+
return InstallationAuth(token=token, installation_id=installation_id)

src/docverse/storage/github/tree_fetcher.py

Lines changed: 32 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -55,6 +55,12 @@ class FetchedTree:
5555
``commit_sha`` is the resolved commit the ref pointed at when the
5656
tree was read, independent of whether ``ref`` was a branch, tag, or
5757
raw SHA.
58+
59+
``repo_id`` and ``owner_id`` are GitHub's stable numeric IDs for
60+
the repository and its owner. The syncer writes them to the
61+
binding + content rows so the push event processor can match
62+
incoming events by ID rather than by display name, which keeps
63+
matching robust across GitHub repo / org renames.
5864
"""
5965

6066
owner: str
@@ -65,6 +71,8 @@ class FetchedTree:
6571
tree_sha: str
6672
etag: str | None
6773
files: tuple[FetchedTreeFile, ...]
74+
repo_id: int
75+
owner_id: int
6876

6977

7078
def _normalize_root(root_path: str) -> str:
@@ -101,10 +109,13 @@ async def fetch(
101109
) -> FetchedTree:
102110
"""Fetch the tree under ``root_path`` at ``ref`` for ``owner/repo``.
103111
104-
Steps: resolve the ref to a commit (giving us ``commit_sha`` +
105-
the top-level tree SHA), list the recursive tree, keep only
106-
blobs inside ``root_path``, then fetch each blob's bytes.
112+
Steps: fetch the repository metadata (for the stable
113+
``repo_id`` / ``owner_id``), resolve the ref to a commit
114+
(giving us ``commit_sha`` + the top-level tree SHA), list the
115+
recursive tree, keep only blobs inside ``root_path``, then
116+
fetch each blob's bytes.
107117
"""
118+
repo_id, owner_id = await self._fetch_repo(owner, repo)
108119
commit_sha, tree_sha = await self._resolve_ref(owner, repo, ref)
109120
tree_entries, etag = await self._list_tree(owner, repo, tree_sha)
110121

@@ -153,6 +164,8 @@ async def _fetch_into(
153164
tree_sha=tree_sha,
154165
etag=etag,
155166
files=tuple(cast("list[FetchedTreeFile]", files)),
167+
repo_id=repo_id,
168+
owner_id=owner_id,
156169
)
157170

158171
def _auth_headers(self, accept: str) -> dict[str, str]:
@@ -161,6 +174,22 @@ def _auth_headers(self, accept: str) -> dict[str, str]:
161174
"Authorization": f"Bearer {self._auth.token}",
162175
}
163176

177+
async def _fetch_repo(self, owner: str, repo: str) -> tuple[int, int]:
178+
"""Fetch repository metadata, returning ``(repo_id, owner_id)``.
179+
180+
Reads ``GET /repos/{owner}/{repo}`` to capture GitHub's stable
181+
numeric IDs for the repository and its owner. The syncer
182+
writes these to the binding + content rows so push events
183+
match by ID even after a GitHub repo or org rename.
184+
"""
185+
response = await self._http_client.get(
186+
f"{self._auth.base_url}/repos/{owner}/{repo}",
187+
headers=self._auth_headers("application/vnd.github+json"),
188+
)
189+
response.raise_for_status()
190+
data = response.json()
191+
return int(data["id"]), int(data["owner"]["id"])
192+
164193
async def _resolve_ref(
165194
self, owner: str, repo: str, ref: str
166195
) -> tuple[str, str]:

tests/handlers/dashboard_template_test.py

Lines changed: 55 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -204,6 +204,61 @@ async def test_org_get_returns_existing_binding(client: AsyncClient) -> None:
204204
assert body["github_owner"] == "lsst-sqre"
205205
assert body["github_ref"] == "main"
206206
assert body["last_sync_status"] == "pending"
207+
# Pre-sync the GitHub numeric IDs are unset; the response surfaces
208+
# them as ``null`` so operators can see the binding has not yet
209+
# captured them.
210+
assert body["github_owner_id"] is None
211+
assert body["github_repo_id"] is None
212+
assert body["github_installation_id"] is None
213+
214+
215+
@pytest.mark.asyncio
216+
async def test_org_get_surfaces_captured_github_numeric_ids(
217+
client: AsyncClient,
218+
) -> None:
219+
"""A populated binding's GitHub numeric IDs are visible to operators.
220+
221+
The public API stays name-keyed; the IDs are informational only,
222+
surfaced for debugging the rename-robust internal lookup path.
223+
"""
224+
await _setup_org(client)
225+
await client.put(
226+
f"/docverse/orgs/{_ORG}/dashboard-template",
227+
json=_VALID_BODY,
228+
headers={"X-Auth-Request-User": _ADMIN},
229+
)
230+
231+
async for session in db_session_dependency():
232+
async with session.begin():
233+
org_store = OrganizationStore(
234+
session=session, logger=structlog.get_logger("test")
235+
)
236+
org = await org_store.get_by_slug(_ORG)
237+
assert org is not None
238+
binding_store = DashboardGitHubTemplateBindingStore(
239+
session=session, logger=structlog.get_logger("test")
240+
)
241+
existing = await binding_store.get_org_default(org.id)
242+
assert existing is not None
243+
await binding_store.update_sync_state(
244+
binding_id=existing.id,
245+
last_sync_status="succeeded",
246+
github_owner_id=111,
247+
github_repo_id=222,
248+
github_installation_id=333,
249+
)
250+
await session.commit()
251+
break
252+
253+
response = await client.get(
254+
f"/docverse/orgs/{_ORG}/dashboard-template",
255+
headers={"X-Auth-Request-User": _ADMIN},
256+
)
257+
assert response.status_code == 200
258+
body = response.json()
259+
assert body["github_owner_id"] == 111
260+
assert body["github_repo_id"] == 222
261+
assert body["github_installation_id"] == 333
207262

208263

209264
@pytest.mark.asyncio

0 commit comments

Comments
 (0)