Skip to content

GH-46369: [C++] Add key-value pairs to the FileSystemFactory interface#50044

Draft
raulcd wants to merge 10 commits into
apache:mainfrom
raulcd:GH-46369
Draft

GH-46369: [C++] Add key-value pairs to the FileSystemFactory interface#50044
raulcd wants to merge 10 commits into
apache:mainfrom
raulcd:GH-46369

Conversation

@raulcd

@raulcd raulcd commented May 26, 2026

Copy link
Copy Markdown
Member

Rationale for this change

From the issue description:

From discussion on PR #41559 (comment) and the mailing list, it's desirable to extend the filesystem factory interface to accept a set of key-value pairs in addition to a URL. This will draw a clearer distinction between filesystem options which are expected to be independent of the user (the URI, which should be reusable by multiple users) and options which are expected to be particular to a user (the key-value pairs, which may contain a user's keys/tokens/...). In particular it will remove the ambiguity of whether filesystem URIs must be guarded: URIs will not contain secrets and need not be guarded.

What changes are included in this PR?

  • A new options const std::vector<std::pair<std::string, std::any>>&, added to FileSystemFromUri which allows to pass via FileSystemFromUriReal through FileSystemFactory::function
  • The s3 and local/file factories are migrated to the new 4-arg signature.
  • Schemes that do not yet consume options reject non-empty options with NotImplemented rather than silently dropping them. The s3 factory rejects in its lambda.
  • examplefs now reads two typed options and appends them to the output path, exercised by arrow-filesystem-test, proving std::any survives the libarrow to arrow_filesystem_example.so boundary.
  • CI: ci/scripts/cpp_build.sh now forwards ARROW_S3_MODULE to CMake. The conda-cpp image was already setting ARROW_S3_MODULE=ON, but the flag was never passed through, so s3fs_module_test (added in GH-40343: [C++] Move S3FileSystem to the registry #41559) had not actually been built or run in CI. This activates it.

Are these changes tested?

Yes, new tests added to validate a dynamically-loaded filesystem is able to receive options via std::any and tests validating rejection if options passes to Filesystems that don't support it.

Are there any user-facing changes?

New public FileSystemFromUri overloads taking an options list. Existing ones are unchanged.

Copilot AI review requested due to automatic review settings May 26, 2026 14:04

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR extends the C++ filesystem factory plumbing to accept an additional set of backend-specific key/value options (in addition to the filesystem URI), aligning with GH-46369’s goal of separating reusable URIs from user-specific configuration.

Changes:

  • Extends FileSystemFactory to accept options and adds compatibility construction for existing factories.
  • Adds new FileSystemFromUri overloads that accept options, and threads them through the factory registry call path.
  • Updates registered filesystem modules/tests (examplefs, S3, localfs tests) to compile with the new factory signature.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
cpp/src/arrow/testing/examplefs.cc Updates example filesystem factory signature and forwards options to nested FileSystemFromUri call.
cpp/src/arrow/filesystem/s3fs.cc Updates S3 registered factory signature to include options (currently unused).
cpp/src/arrow/filesystem/localfs_test.cc Disambiguates nullptr factory registrations under the new overloaded FileSystemFactory constructors.
cpp/src/arrow/filesystem/filesystem.h Introduces options-aware FileSystemFactory + new FileSystemFromUri overloads and documentation.
cpp/src/arrow/filesystem/filesystem.cc Implements the new FileSystemFromUri overloads and forwards options into the registry-based factory dispatch.
Comments suppressed due to low confidence (1)

cpp/src/arrow/filesystem/filesystem.cc:925

  • The new options argument is only forwarded to registered factories; built-in scheme handling (abfs/gcs/hdfs/mock) ignores it entirely. This makes FileSystemFromUri(..., options, ...) silently drop backend-specific options for these schemes. Consider either (a) plumbing options into the built-in scheme option parsing / constructors, or (b) rejecting non-empty options for schemes that don’t support them yet to avoid surprising behavior.
Result<std::shared_ptr<FileSystem>> FileSystemFromUriReal(
    const Uri& uri, const std::string& uri_string,
    const std::vector<std::pair<std::string, std::any>>& options,
    const io::IOContext& io_context, std::string* out_path) {
  const auto scheme = uri.scheme();

  {
    ARROW_ASSIGN_OR_RAISE(
        auto* factory,
        FileSystemFactoryRegistry::GetInstance()->FactoryForScheme(scheme));
    if (factory != nullptr) {
      return factory->function(uri, options, io_context, out_path);
    }
  }

  if (scheme == "abfs" || scheme == "abfss") {
#ifdef ARROW_AZURE
    ARROW_ASSIGN_OR_RAISE(auto options, AzureOptions::FromUri(uri, out_path));
    return AzureFileSystem::Make(options, io_context);
#else
    return Status::NotImplemented(
        "Got Azure Blob File System URI but Arrow compiled without Azure Blob File "
        "System support");
#endif
  }
  if (scheme == "gs" || scheme == "gcs") {
#ifdef ARROW_GCS
    ARROW_ASSIGN_OR_RAISE(auto options, GcsOptions::FromUri(uri, out_path));
    return GcsFileSystem::Make(options, io_context);
#else

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread cpp/src/arrow/filesystem/filesystem.h Outdated
Comment thread cpp/src/arrow/filesystem/filesystem.h
Comment thread cpp/src/arrow/filesystem/filesystem.h Outdated
Comment thread cpp/src/arrow/filesystem/s3fs.cc
Comment thread cpp/src/arrow/filesystem/filesystem.cc
@github-actions github-actions Bot added awaiting changes Awaiting changes and removed awaiting committer review Awaiting committer review labels May 26, 2026
@raulcd

raulcd commented May 26, 2026

Copy link
Copy Markdown
Member Author

@pitrou @kou I've started working on this to hopefully move S3 filesystem out of libarrow and into its own shared library.
About the scope of this PR, would you push this plumbing PR in isolation (in which case I would add a couple of tests to validate options are passed), or should I add some S3 specific options as part of this PR?

For the scope of the S3 options (on this PR or a follow up one), would you mimic all S3Options as part of the new k/v pair or only a small subset, example initially only region and credentials?

@raulcd

raulcd commented Jun 4, 2026

Copy link
Copy Markdown
Member Author

@pitrou any thoughts about the scope for this PR, as shared above:

About the scope of this PR, would you push this plumbing PR in isolation (in which case I would add a couple of tests to validate options are passed), or should I add some S3 specific options as part of this PR?

For the scope of the S3 options (on this PR or a follow up one), would you mimic all S3Options as part of the new k/v pair or only a small subset, example initially only region and credentials?

@github-actions github-actions Bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Jun 8, 2026
Copilot AI review requested due to automatic review settings June 8, 2026 10:45

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.

Comment thread cpp/src/arrow/filesystem/filesystem.h Outdated
Comment thread cpp/src/arrow/filesystem/filesystem.h Outdated
@kou

kou commented Jun 9, 2026

Copy link
Copy Markdown
Member

How many S3 specific options do we have?
If we have only few options, how about implementing them in this PR for easy to review with real use case?
If we have many options, how about implementing only a few options in this PR and implementing others in a follow-up PR for easy to review?

BTW, do we need to use std::any? Can we reuse existing arrow::KeyValueMetadata?

@raulcd

raulcd commented Jun 9, 2026

Copy link
Copy Markdown
Member Author

Thanks @kou for the review, I really appreciate you taking the time!

How many S3 specific options do we have?

I counted around 13 that we probably don't want to express as part of the URI (credentials, assume-role params, custom providers, connect/request timeouts, a few behavior flags, default_metadata, retry_strategy, sse_customer_key).

If we have only few options, how about implementing them in this PR for easy to review with real use case? If we have many options, how about implementing only a few options in this PR and implementing others in a follow-up PR for easy to review?

I can do that, I basically started a follow up PR on top of this one where I started working towards adding them:

I wanted to have this in isolation for easier initial review but I am happy to push a single PR or add some of those cases here.

BTW, do we need to use std::any? Can we reuse existing arrow::KeyValueMetadata?

The std::any case is what @pitrou proposed on the mailing list and basically was to be able to cater for S3Options that are not string-encodable like retry_strategy with std::shared_ptr<S3RetryStrategy> , or default_metadata std::shared_ptr<const KeyValueMetadata>.

There's also credentials_provider std::shared_ptr<Aws::Auth::AWSCredentialsProvider> .

The cost is the type-erasure across the shared-library boundary, which I am trying to validate it won't be an issue for complex types on the other PR I pointed above. I've only tested std::string on Linux so far as I just started working on it. I should validate it with complex types and not only on Linux but also on other OS, as I've read cases where this might not behave as expected across DLL boundaries.

But now that I think more about the added complexity, I am unsure it is worth the hassle because:

  • credentials_provider won't be able to be built from libarrow without AWS headers (needs the SDK)
  • default_metadata is string-encodable (it's a str:str).
  • S3RetryStrategy is the only thing std::any buys us: letting a libarrow-only caller pass a custom S3RetryStrategy subclass, but only if we keep that interface in libarrow which I am unsure we will as this should be part of the s3fs module. We could come up with string encodable retry_strategy=standard, retry_max_attempts=5

@pitrou @kou thoughts about that?

Comment thread cpp/src/arrow/filesystem/filesystem.h Outdated
Comment thread cpp/src/arrow/filesystem/filesystem.h Outdated
Comment thread cpp/src/arrow/filesystem/filesystem.h Outdated
@pitrou

pitrou commented Jun 9, 2026

Copy link
Copy Markdown
Member

std::any will be easier to use for the caller than have to string-serialize options in a KeyValueMetadata.

Copilot AI review requested due to automatic review settings June 9, 2026 10:48

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 3 comments.

Comment thread cpp/src/arrow/filesystem/filesystem.h
Comment thread cpp/src/arrow/filesystem/filesystem.h Outdated
Comment thread cpp/src/arrow/testing/examplefs.cc Outdated
…f FileSystemFromUri, do not silently ignore non-empty options if underlying FS does not support them and use FileSystemFactoryOptions instead of std::vector<std::pair<std::string, std::any>> everywhere
Copilot AI review requested due to automatic review settings June 9, 2026 13:09

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 3 comments.

Comment thread cpp/src/arrow/testing/examplefs.cc
Comment thread cpp/src/arrow/filesystem/filesystem.h Outdated
Comment thread cpp/src/arrow/filesystem/filesystem.h Outdated
@raulcd

raulcd commented Jun 9, 2026

Copy link
Copy Markdown
Member Author

I've now added a test that passes a typed option (std::shared_ptr<ExampleTypedOption>) through the dynamically-loaded examplefs module (FileSystemFactory registered via the registry) and any_casts it back on the other side. It's green on Linux, macOS, and Windows CI, which seems to clear my concern about std::any.

Copilot AI review requested due to automatic review settings June 9, 2026 13:38

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 4 comments.

Comment thread cpp/src/arrow/testing/examplefs.cc Outdated
Comment thread cpp/src/arrow/filesystem/filesystem.h
Comment thread cpp/src/arrow/filesystem/s3fs.cc Outdated
Comment thread cpp/src/arrow/filesystem/localfs_test.cc Outdated
… tests for FileSystemFromUriAndOptions and add a new one to test old behaviour
@kou

kou commented Jun 10, 2026

Copy link
Copy Markdown
Member

std::any will be easier to use for the caller than have to string-serialize options in a KeyValueMetadata.

Can std::any be easier to use for our bindings? If so, I don't object the std::any approach.

@pitrou

pitrou commented Jun 10, 2026

Copy link
Copy Markdown
Member

KeyValueMetadata is meant for metadata serialized using Flatbuffers (or Thrift etc.).

I don't think it's a good idea to use it for things that do not get serialized. It imposes a string parsing or conversion, loses typing etc.

@raulcd

raulcd commented Jun 10, 2026

Copy link
Copy Markdown
Member Author

Can std::any be easier to use for our bindings? If so, I don't object the std::any approach.

Hi @kou I wasn't sure so I decided to take some time to investigate it. I've created this commit (acfc939) on top of this PR to test std::any bindings on R, Python and C GLib / Ruby. I've used Claude to help with the Ruby and R bindings but I've validated the test is exercised by producing a failing test first (TDD). All of the binding usages for std::any look quite ergonomic and easy to use. I've created a PR to exercise CI and demonstrate the bindings are green, can be seen here:

So, in my opinion it is easy to use for our bindings, what are your thoughts on it?

Note: A minor nit, we seem to have an arrow::dataset::FileSystemFactoryOptions and the new arrow::fs::FileSystemFactoryOptions forced me to alias on the Python bindings, maybe we want to rename the new one as this is the right time but not a big deal if we have to keep an alias.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants