[SPARK-55978][SQL][FOLLOWUP] Don't block V2 join pushdown when pushed Sample has fraction=1#56031
[SPARK-55978][SQL][FOLLOWUP] Don't block V2 join pushdown when pushed Sample has fraction=1#56031cloud-fan wants to merge 3 commits into
Conversation
… Sample has fraction=1 ### What changes were proposed in this pull request? Followup to apache#54972. Narrow the `V2ScanRelationPushDown` join-pushdown guard so it only blocks pushdown when at least one side has a pushed `Sample` with fraction < 1. At fraction = 1 the sample is a no-op on the result set, so dropping it inside the merged scan builder is safe. ### Why are the changes needed? The guard added in SPARK-55978 exists because the merged scan builder for `SupportsPushDownJoin` cannot carry a pushed sample and would silently discard it. The hazard is *silent result change*. At fraction = 1, no rows are excluded, so dropping the sample changes nothing observable. The current guard is therefore stricter than its rationale requires, and unnecessarily skips join pushdown for queries that land at `TABLESAMPLE SYSTEM (100 PERCENT)` (parameterized queries, query generators, environment-tuned fractions). ### Does this PR introduce _any_ user-facing change? No behavior change for queries with fraction < 1. For queries where the pushed sample has fraction = 1, join pushdown now proceeds — same result set, faster plan. ### How was this patch tested? - Existing `"join pushdown is skipped when a side has a pushed sample"` test moved from `100 PERCENT` to `50 PERCENT` so it keeps exercising the (still-active) fraction < 1 branch of the guard. - New `"100% SYSTEM sample does not block join pushdown"` test asserts the new fraction = 1 short-circuit, locking in the contract. ### Was this patch authored or co-authored using generative AI tooling? No.
| // fraction < 1, because the merged scan builder would silently | ||
| // discard it and change the result. At fraction = 1 the sample is | ||
| // a no-op on the result set, so dropping it is safe. | ||
| // TODO(SPARK-56504): Extend SupportsPushDownJoin to accept pushed |
There was a problem hiding this comment.
looks like this is @stanyao 's todo JIRA, can we make this pr target that and remove the TODO?
szehon-ho
left a comment
There was a problem hiding this comment.
Thanks for the follow-up — narrowing the guard to fraction < 1 matches the original SPARK-55978 rationale well (avoid silently discarding a sample that actually filters rows).
Assumption (full table scan): This PR relies on the assumption that a pushed sample with upperBound - lowerBound >= 1.0 is a no-op on the result row set — i.e. equivalent to reading the full table without sampling — so it is safe when join pushdown drops/discards the sample (merged scan builder cannot carry it; JDBC join subqueries omit TABLESAMPLE, etc.). That is reasonable for SQL TABLESAMPLE at 100% (withReplacement = false, lowerBound = 0) and for the motivating SYSTEM (100 PERCENT) case, but it is still an assumption: execution becomes "scan without sample," not "scan with 100% sample," and correctness depends on connectors/dialects treating 100% as including all rows/blocks (especially SYSTEM, which is implementation-defined).
Tests: Moving the blocking case to 50 PERCENT and adding the 100 PERCENT join-pushed test look good. Optional follow-ups: TABLESAMPLE (100 PERCENT) (Bernoulli), and asserting row results match the no-sample join.
Left an inline note on withReplacement.
| // TODO(SPARK-56504): Extend SupportsPushDownJoin to accept pushed | ||
| // samples so sources supporting both can handle the composition. | ||
| leftHolder.pushedSample.isEmpty && rightHolder.pushedSample.isEmpty && | ||
| leftHolder.pushedSample.forall(s => s.upperBound - s.lowerBound >= 1.0) && |
There was a problem hiding this comment.
upperBound - lowerBound >= 1.0 does not check withReplacement. With withReplacement = true, SampleExec uses PoissonSampler: even at rate 1.0, each input row can be emitted 0, 1, 2, … times, so discarding the pushed sample during join pushdown is not the same as a full scan. SQL TABLESAMPLE always sets withReplacement = false, but DataFrame.sample(withReplacement = true, fraction = 1.0) can be pushed to DSv2.
Suggest tightening the guard, e.g.:
!s.withReplacement && s.upperBound - s.lowerBound >= 1.0(optionally also s.lowerBound <= RandomSampler.roundingEpsilon for clarity).
A sample with withReplacement=true at fraction 1.0 uses PoissonSampler and is not a no-op (each row can emit 0, 1, 2, ... times). Tighten the guard so only !withReplacement, fraction >= 1.0 samples short-circuit the join-pushdown block.
…hdown Locks in the new !s.withReplacement guard via the DataFrame.sample API (SQL TABLESAMPLE can't set withReplacement=true).
What changes were proposed in this pull request?
Followup to #54972.
Narrow the
V2ScanRelationPushDownjoin-pushdown guard so it only blocks pushdown when at least one side has a pushedSamplewith fraction < 1. At fraction = 1 the sample is a no-op on the result set, so dropping it inside the merged scan builder is safe.Why are the changes needed?
The guard added in SPARK-55978 exists because the merged scan builder for
SupportsPushDownJoincannot carry a pushed sample and would silently discard it. The hazard is silent result change. At fraction = 1, no rows are excluded, so dropping the sample changes nothing observable. The current guard is therefore stricter than its rationale requires, and unnecessarily skips join pushdown for queries that land atTABLESAMPLE SYSTEM (100 PERCENT)(parameterized queries, query generators, environment-tuned fractions).Does this PR introduce any user-facing change?
No behavior change for queries with fraction < 1. For queries where the pushed sample has fraction = 1, join pushdown now proceeds — same result set, faster plan.
How was this patch tested?
"join pushdown is skipped when a side has a pushed sample"test moved from100 PERCENTto50 PERCENTso it keeps exercising the (still-active) fraction < 1 branch of the guard."100% SYSTEM sample does not block join pushdown"test asserts the new fraction = 1 short-circuit, locking in the contract.Was this patch authored or co-authored using generative AI tooling?
No.