Allow specifying an arrow schema for PartitionedFile #22360

Open
fpetkovski wants to merge 6 commits into apache:main from fpetkovski:partitioned-file-schema

Conversation

@fpetkovski
Contributor

Which issue does this PR close?

Rationale for this change

As described in the linked issue, parsing the arrow schema from parquet metadata can be expensive for point lookups, relative to the rest of the query execution pipeline. If the user knows the arrow schema of the file, they should be able to specify it explicitly.

What changes are included in this PR?

  • Add an arrow_schema: Option<SchemaRef> field to PartitionedFile
  • Use the arrow_schema field in the parquet opener to bypass schema inference from the ARROW:schema metadata field.
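The two changes above can be sketched with stand-in types. This is a minimal model of the idea, not the code in this PR: `Schema`, `infer_schema_from_metadata`, and `resolve_file_schema` are hypothetical names, and the real `PartitionedFile` and parquet opener carry many more fields and steps. Only `arrow_schema` and `with_arrow_schema` are names taken from this conversation.

```rust
use std::sync::Arc;

/// Stand-in for an Arrow schema (assumption): an ordered list of (name, type) pairs.
#[derive(Debug, Clone, PartialEq)]
pub struct Schema {
    pub fields: Vec<(String, String)>,
}

pub type SchemaRef = Arc<Schema>;

/// Simplified model of a file to scan; the real struct has many more fields.
#[derive(Debug, Clone, Default)]
pub struct PartitionedFile {
    pub path: String,
    /// A user-provided arrow schema for the file. `None` by default.
    pub arrow_schema: Option<SchemaRef>,
}

impl PartitionedFile {
    pub fn new(path: impl Into<String>) -> Self {
        Self { path: path.into(), arrow_schema: None }
    }

    /// Builder-style setter for the explicit schema.
    pub fn with_arrow_schema(mut self, schema: SchemaRef) -> Self {
        self.arrow_schema = Some(schema);
        self
    }
}

/// Hypothetical stand-in for the expensive path (decoding the schema from
/// file metadata). It counts invocations so callers can see whether it ran.
pub fn infer_schema_from_metadata(calls: &mut usize) -> SchemaRef {
    *calls += 1;
    Arc::new(Schema { fields: vec![("id".to_string(), "Int64".to_string())] })
}

/// Prefer the explicit schema; fall back to inference only when it is absent.
pub fn resolve_file_schema(file: &PartitionedFile, calls: &mut usize) -> SchemaRef {
    match &file.arrow_schema {
        Some(schema) => Arc::clone(schema),
        None => infer_schema_from_metadata(calls),
    }
}
```

In this model, a file built with `with_arrow_schema(...)` never reaches the inference function, while a file created with `new(...)` alone behaves as before.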

Are these changes tested?

Added unit tests for both matching and mismatching schemas.

Are there any user-facing changes?

There are no breaking changes: the new field is optional and is set to None by default.

@github-actions github-actions Bot added the catalog (Related to the catalog crate) and datasource (Changes to the datasource crate) labels May 19, 2026
Contributor

@kosiew kosiew left a comment

@fpetkovski
Thanks for working on this. I think there is one end-to-end serialization issue that needs to be addressed before this lands.

pub metadata_size_hint: Option<usize>,
pub table_reference: Option<TableReference>,
/// A user-provided arrow schema for the file.
pub arrow_schema: Option<SchemaRef>,
Contributor

I think this needs a follow-up before merge. PartitionedFile::arrow_schema introduces a new user-provided scan contract, but physical plan proto serialization currently appears to drop it. datafusion/proto/src/physical_plan/to_proto.rs builds protobuf::PartitionedFile without this field, and datafusion/proto/proto/datafusion.proto does not seem to have a schema field for it.

As a result, a Parquet scan that is serialized and deserialized would lose the explicit schema and fall back to parsing ARROW:schema, so the main guarantee from this change would not hold end to end.

Could you please add this field to the proto model and conversions, plus a roundtrip test showing that PartitionedFile::with_arrow_schema(...) survives physical plan or PartitionedFile proto serialization?
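A roundtrip check of the kind requested here might look like the following sketch. It uses stand-in types only: `ProtoPartitionedFile`, `to_proto`, and `from_proto` are hypothetical and do not reproduce the generated protobuf types or the conversion functions in `datafusion/proto`. The point it illustrates is that the optional schema has to be carried in both directions.

```rust
use std::sync::Arc;

/// Stand-in for an Arrow schema (assumption): ordered (name, type) pairs.
#[derive(Debug, Clone, PartialEq)]
pub struct Schema {
    pub fields: Vec<(String, String)>,
}

pub type SchemaRef = Arc<Schema>;

/// Simplified in-memory model of the scanned file.
#[derive(Debug, Clone, PartialEq)]
pub struct PartitionedFile {
    pub path: String,
    pub arrow_schema: Option<SchemaRef>,
}

/// Hypothetical wire form; the real message is generated from datafusion.proto.
#[derive(Debug, Clone, PartialEq)]
pub struct ProtoPartitionedFile {
    pub path: String,
    /// Absent when the user did not provide a schema.
    pub arrow_schema: Option<Vec<(String, String)>>,
}

/// Serialize, copying the optional schema instead of dropping it.
pub fn to_proto(file: &PartitionedFile) -> ProtoPartitionedFile {
    ProtoPartitionedFile {
        path: file.path.clone(),
        arrow_schema: file.arrow_schema.as_ref().map(|s| s.fields.clone()),
    }
}

/// Deserialize, restoring the optional schema when it is present.
pub fn from_proto(proto: &ProtoPartitionedFile) -> PartitionedFile {
    PartitionedFile {
        path: proto.path.clone(),
        arrow_schema: proto
            .arrow_schema
            .as_ref()
            .map(|fields| Arc::new(Schema { fields: fields.clone() })),
    }
}
```

A test would build a file with an explicit schema, pass it through `to_proto` and `from_proto`, and assert that the result equals the original; a second case with `arrow_schema: None` guards the default path.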

Contributor Author

Thanks, I updated the protos to serialize and deserialize the file arrow schema as well. There is a proto test now which verifies the round trip.

Comment thread datafusion/datasource/src/mod.rs Outdated
/// The estimated size of the parquet metadata, in bytes
pub metadata_size_hint: Option<usize>,
pub table_reference: Option<TableReference>,
/// A user-provided arrow schema for the file.
Contributor

Small doc suggestion: it would be helpful to make the public contract a bit more precise here. My read is that this is the physical Arrow file schema used by the Parquet opener, it should describe file columns rather than partition columns, and it is currently ignored by non-Parquet sources.

Calling that out explicitly should help avoid users passing a table schema that includes partitions, or expecting CSV and JSON readers to honor this field.
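One possible shape for such a doc comment is sketched below, on a stand-in struct rather than the real `PartitionedFile`. The wording is only a suggestion based on the reading above, and `partition_columns_in_schema` is a hypothetical helper, not something this PR adds; it shows how a caller could detect a table schema passed by mistake.

```rust
use std::sync::Arc;

/// Stand-in for an Arrow schema (assumption): just the column names.
#[derive(Debug, Clone, PartialEq)]
pub struct Schema {
    pub field_names: Vec<String>,
}

pub type SchemaRef = Arc<Schema>;

/// Simplified model; only the field under discussion is shown.
pub struct PartitionedFile {
    /// Optional, user-provided physical Arrow schema of the file.
    ///
    /// - Describes the columns stored in the file itself; partition columns
    ///   must not be included.
    /// - Used by the Parquet opener instead of the schema decoded from the
    ///   `ARROW:schema` metadata field.
    /// - Currently ignored by non-Parquet sources such as CSV and JSON.
    pub arrow_schema: Option<SchemaRef>,
}

/// Hypothetical helper: returns the partition columns that wrongly appear in
/// a schema meant to describe file columns only.
pub fn partition_columns_in_schema(schema: &Schema, partition_columns: &[&str]) -> Vec<String> {
    schema
        .field_names
        .iter()
        .filter(|name| partition_columns.contains(&name.as_str()))
        .cloned()
        .collect()
}
```

An empty result means the schema contains none of the given partition columns.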

Contributor Author

Thanks for the suggestion, I updated the docs to clarify how this field is used by various openers.

@fpetkovski fpetkovski force-pushed the partitioned-file-schema branch 6 times, most recently from 1bdaf0e to bf0c9ab Compare May 26, 2026 17:01
@github-actions github-actions Bot added the proto Related to proto crate label May 26, 2026
@fpetkovski fpetkovski force-pushed the partitioned-file-schema branch from 7feb2ce to 2c719c8 Compare May 27, 2026 13:08
@fpetkovski fpetkovski requested a review from kosiew May 27, 2026 13:53
Labels

catalog: Related to the catalog crate
datasource: Changes to the datasource crate
proto: Related to proto crate

Development

Successfully merging this pull request may close these issues.

Reading arrow schemas from parquet files is expensive

2 participants