[GH-2830] Improve Geography query support - core#2831

Draft
zhangfengcdt wants to merge 29 commits into apache:master from zhangfengcdt:feature/geography.support

zhangfengcdt commented Apr 8, 2026

Did you read the Contributor Guide?

Is this PR related to a ticket?

  • Yes, and the PR name follows the format [GH-XXX] my subject. Closes #<issue_number>

What changes were proposed in this PR?

Implements WKB-based Geography serialization (Option B: WKB with Cached S2) and a core set of Geography ST functions.

Core architecture:

  • WKBGeography — stores WKB bytes as primary representation with lazy-parsed JTS, S2, and ShapeIndex caches (double-checked locking for thread safety)
  • GeographyWKBSerializer — WKB serializer with 0xFF format byte, backward-compatible with legacy S2-native format
  • GeographyUDT, implicits.scala, GeometrySerde — switched to WKBSerializer for all serialization paths
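
The lazy, thread-safe caching described for WKBGeography can be illustrated with a minimal double-checked locking sketch. The class name `LazyParsedValue` and its constructor are hypothetical and are not taken from this PR; the sketch only shows the general pattern of keeping raw bytes as the primary representation and parsing a derived form at most once.

```java
import java.util.function.Function;

/**
 * Hypothetical sketch (not the PR's WKBGeography): raw bytes are the primary
 * representation, and an expensive parsed form is derived lazily, at most once.
 */
final class LazyParsedValue<T> {
    private final byte[] bytes;                // primary representation, e.g. WKB
    private final Function<byte[], T> parser;  // expensive parse step; must not return null
    private volatile T cached;                 // volatile gives safe publication across threads

    LazyParsedValue(byte[] bytes, Function<byte[], T> parser) {
        this.bytes = bytes.clone();            // defensive copy so callers cannot mutate our state
        this.parser = parser;
    }

    T get() {
        T local = cached;                      // first check, without taking the lock
        if (local == null) {
            synchronized (this) {
                local = cached;                // second check, under the lock
                if (local == null) {
                    local = parser.apply(bytes);
                    cached = local;            // publish the fully built value
                }
            }
        }
        return local;
    }
}
```

After the first call, readers take the fast path and never enter the `synchronized` block. A real implementation would presumably hold one such cache per derived form (JTS, S2, ShapeIndex), but that layout is an assumption here.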

Geography functions (3 new):

  • Level 1 (JTS): ST_NPoints
  • Level 2 (JTS + Spheroid): ST_Distance
  • Level 3 (S2): ST_Contains
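
To illustrate why a Geography distance differs from a planar Geometry distance, here is a simplified sketch using the haversine formula on a sphere. This is a stand-in for illustration only: it is neither the spheroidal computation nor the S2 code path used in this PR, and the class name, method names, and the mean Earth radius constant are assumptions.

```java
/** Hypothetical sketch: great-circle distance on a sphere versus planar distance in degrees. */
final class SphericalDistanceSketch {
    /** Mean Earth radius in meters (spherical approximation, assumed value). */
    static final double EARTH_RADIUS_M = 6_371_008.8;

    /** Haversine great-circle distance in meters between two lon/lat points given in degrees. */
    static double haversineMeters(double lon1, double lat1, double lon2, double lat2) {
        double phi1 = Math.toRadians(lat1);
        double phi2 = Math.toRadians(lat2);
        double sinHalfDPhi = Math.sin((phi2 - phi1) / 2);
        double sinHalfDLambda = Math.sin(Math.toRadians(lon2 - lon1) / 2);
        double a = sinHalfDPhi * sinHalfDPhi
                + Math.cos(phi1) * Math.cos(phi2) * sinHalfDLambda * sinHalfDLambda;
        double c = 2 * Math.atan2(Math.sqrt(a), Math.sqrt(1 - a));
        return EARTH_RADIUS_M * c;
    }

    /** Planar Euclidean distance in coordinate units (degrees), as a Geometry-style comparison. */
    static double planarDegrees(double lon1, double lat1, double lon2, double lat2) {
        return Math.hypot(lon2 - lon1, lat2 - lat1);
    }
}
```

One degree of longitude has the same planar length at every latitude, while the spherical distance shrinks toward the poles: at latitude 60 it is roughly half of its value at the equator.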

Docs: API docs for all 3 new functions in docs/api/sql/geography/

Note: Geography-aware spatial join partitioning using S2 cells will be in a separate PR.

How was this patch tested?

  • All unit tests in the common module pass (WKBGeographyTest, FunctionTest)
  • GeographyFunctionTest.scala — Spark SQL integration tests covering constructors, structural functions, metrics, predicates, DataFrame API, and serialization round-trips
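
A serialization round-trip with a leading format byte can be sketched as follows. The byte layout is an assumption for illustration (a single 0xFF marker followed by the raw WKB payload, with legacy payloads assumed never to start with 0xFF); the actual GeographyWKBSerializer layout may differ, and the class and method names are hypothetical.

```java
import java.util.Arrays;

/** Hypothetical sketch of format-byte framing with a legacy fallback. */
final class FormatByteFraming {
    /** Marker for WKB payloads (assumed to be the first serialized byte). */
    static final byte WKB_FORMAT = (byte) 0xFF;

    enum Format { WKB, LEGACY }

    /** Prefixes the WKB payload with the format marker. */
    static byte[] frameWkb(byte[] wkb) {
        byte[] out = new byte[wkb.length + 1];
        out[0] = WKB_FORMAT;
        System.arraycopy(wkb, 0, out, 1, wkb.length);
        return out;
    }

    /** Decides which deserialization path to take by inspecting the first byte. */
    static Format detect(byte[] serialized) {
        if (serialized.length == 0) {
            throw new IllegalArgumentException("empty payload");
        }
        return serialized[0] == WKB_FORMAT ? Format.WKB : Format.LEGACY;
    }

    /** Strips the marker and returns the WKB payload. */
    static byte[] unframeWkb(byte[] serialized) {
        if (detect(serialized) != Format.WKB) {
            throw new IllegalArgumentException("not a WKB-framed payload");
        }
        return Arrays.copyOfRange(serialized, 1, serialized.length);
    }
}
```

A reader would call `detect` first and route anything that is not marked as WKB to the legacy decoder, which is how older data could stay readable under this scheme.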

Did this PR include necessary documentation updates?

  • Yes, I have updated the documentation.

zhangfengcdt requested a review from jiayuasu as a code owner on April 8, 2026 15:28

jiayuasu commented Apr 9, 2026

@zhangfengcdt Is this PR ready for review? It has lots of unnecessary content (e.g., the benchmark folder). Please also break this huge PR into several smaller pieces so we can review it piece by piece.

zhangfengcdt marked this pull request as draft on April 9, 2026 22:12
zhangfengcdt (Member Author) commented

> @zhangfengcdt Is this PR ready for review? It has lots of unnecessary content (e.g., the benchmark folder). Please also break this huge PR into several smaller pieces so we can review it piece by piece.

I am still working on it. I will clean up the benchmark code, and this PR will focus on building the core architecture of the WKB-based Geography serialization with cached S2. I will keep a few core ST functions and tests and move the others to individual PRs after this one is merged.

zhangfengcdt changed the title from "[GH-2830] Improve Geography query support - functions" to "[GH-2830] Improve Geography query support - core" on Apr 11, 2026
zhangfengcdt (Member Author) commented

ST Function Performance: Geography vs Geometry (cached objects, ns/op; Ratio = Geography time / Geometry time)

  • ST_NPoints (Level 1 — JTS accessor)
  ┌─────────────────────┬───────────┬──────────┬───────┐
  │        Shape        │ Geography │ Geometry │ Ratio │
  ├─────────────────────┼───────────┼──────────┼───────┤
  │ Point               │         2 │        2 │    1x │
  ├─────────────────────┼───────────┼──────────┼───────┤
  │ LineString (16 vtx) │         2 │        2 │    1x │
  ├─────────────────────┼───────────┼──────────┼───────┤
  │ Polygon (16 vtx)    │         2 │        2 │    1x │
  ├─────────────────────┼───────────┼──────────┼───────┤
  │ Polygon (64 vtx)    │         2 │        2 │    1x │
  ├─────────────────────┼───────────┼──────────┼───────┤
  │ Polygon (500 vtx)   │         2 │        2 │    1x │
  └─────────────────────┴───────────┴──────────┴───────┘
  • ST_Distance (Level 2 — S2 geodesic distance)
  ┌─────────────────────┬───────────┬──────────┬───────┐
  │        Shape        │ Geography │ Geometry │ Ratio │
  ├─────────────────────┼───────────┼──────────┼───────┤
  │ Point               │       269 │       12 │   22x │
  ├─────────────────────┼───────────┼──────────┼───────┤
  │ LineString (16 vtx) │     1,576 │      373 │  4.2x │
  ├─────────────────────┼───────────┼──────────┼───────┤
  │ Polygon (16 vtx)    │     1,419 │      613 │  2.3x │
  ├─────────────────────┼───────────┼──────────┼───────┤
  │ Polygon (64 vtx)    │    69,279 │    3,874 │   18x │
  ├─────────────────────┼───────────┼──────────┼───────┤
  │ Polygon (500 vtx)   │   224,696 │  129,518 │  1.7x │
  └─────────────────────┴───────────┴──────────┴───────┘
  • ST_Contains (Level 3 — S2 predicate)
  ┌─────────────────────┬───────────┬──────────┬───────┐
  │        Shape        │ Geography │ Geometry │ Ratio │
  ├─────────────────────┼───────────┼──────────┼───────┤
  │ Point               │       284 │        8 │   36x │
  ├─────────────────────┼───────────┼──────────┼───────┤
  │ LineString (16 vtx) │       664 │        8 │   83x │
  ├─────────────────────┼───────────┼──────────┼───────┤
  │ Polygon (16 vtx)    │       684 │        8 │   86x │
  ├─────────────────────┼───────────┼──────────┼───────┤
  │ Polygon (64 vtx)    │       677 │        8 │   87x │
  ├─────────────────────┼───────────┼──────────┼───────┤
  │ Polygon (500 vtx)   │       703 │        8 │   88x │
  └─────────────────────┴───────────┴──────────┴───────┘

