Skip to content

GH-49689: [R][C++] Parquets do not support list-columns of ordered factors (ordered dictionaries)#49937

Open
thisisnic wants to merge 9 commits into
apache:mainfrom
thisisnic:GH-49689-ordered
Open

GH-49689: [R][C++] Parquets do not support list-columns of ordered factors (ordered dictionaries)#49937
thisisnic wants to merge 9 commits into
apache:mainfrom
thisisnic:GH-49689-ordered

Conversation

@thisisnic

@thisisnic thisisnic commented May 6, 2026

Copy link
Copy Markdown
Member

Rationale for this change

Ordered dictionaries inside nested types lose their ordered flag during construction, because DictionaryBuilder doesn't track it.

What changes are included in this PR?

Store the ordered flag in DictionaryBuilderBase and pass it through when reconstructing the DictionaryType in type() and FinishInternal().

Remove the R-side workaround that was patching this for top-level columns only.

Are these changes tested?

Yes

Are there any user-facing changes?

No

AI usage

Written by Claude, reviewed by me and Codex (via roborev). I had Claude create the tests first, to make sure they failed as expected, then had Claude make the fixes and checked the tests passed. I questioned the approach as I went.

@github-actions

github-actions Bot commented May 6, 2026

Copy link
Copy Markdown

⚠️ GitHub issue #49689 has been automatically assigned in GitHub to PR creator.

@thisisnic thisisnic changed the title GH-49689: [R] Parquets do not support list-columns of ordered factors (ordered dictionaries) GH-49689: [R][C++] Parquets do not support list-columns of ordered factors (ordered dictionaries) May 6, 2026
Comment on lines +1764 to +1840
// GH-49689: Ordered dictionary tests

TEST(TestDictionaryBuilderOrdered, TypePreservesOrderedFlag) {
auto ordered_type = dictionary(int8(), utf8(), /*ordered=*/true);
std::unique_ptr<ArrayBuilder> boxed_builder;
ASSERT_OK(MakeBuilder(default_memory_pool(), ordered_type, &boxed_builder));

auto builder_type = boxed_builder->type();
ASSERT_TRUE(checked_cast<const DictionaryType&>(*builder_type).ordered());
}

TEST(TestDictionaryBuilderOrdered, UnorderedTypeStaysUnordered) {
auto unordered_type = dictionary(int8(), utf8(), /*ordered=*/false);
std::unique_ptr<ArrayBuilder> boxed_builder;
ASSERT_OK(MakeBuilder(default_memory_pool(), unordered_type, &boxed_builder));

auto builder_type = boxed_builder->type();
ASSERT_FALSE(checked_cast<const DictionaryType&>(*builder_type).ordered());
}

TEST(TestDictionaryBuilderOrdered, FinishPreservesOrderedFlag) {
auto ordered_type = dictionary(int8(), utf8(), /*ordered=*/true);
std::unique_ptr<ArrayBuilder> boxed_builder;
ASSERT_OK(MakeBuilder(default_memory_pool(), ordered_type, &boxed_builder));
auto& builder = checked_cast<DictionaryBuilder<StringType>&>(*boxed_builder);

ASSERT_OK(builder.Append("a"));
ASSERT_OK(builder.Append("b"));
ASSERT_OK(builder.Append("a"));

std::shared_ptr<Array> result;
ASSERT_OK(builder.Finish(&result));

const auto& result_type = checked_cast<const DictionaryType&>(*result->type());
ASSERT_TRUE(result_type.ordered());

auto ex_dict = ArrayFromJSON(utf8(), R"(["a", "b"])");
auto ex_indices = ArrayFromJSON(int8(), "[0, 1, 0]");
DictionaryArray expected(ordered_type, ex_indices, ex_dict);
AssertArraysEqual(expected, *result);
}

TEST(TestDictionaryBuilderOrdered, ListOfOrderedDictionary) {
auto ordered_dict_type = dictionary(int8(), utf8(), /*ordered=*/true);
auto list_type = list(field("item", ordered_dict_type));

std::unique_ptr<ArrayBuilder> boxed_builder;
ASSERT_OK(MakeBuilder(default_memory_pool(), list_type, &boxed_builder));
auto& list_builder = checked_cast<ListBuilder&>(*boxed_builder);
auto& dict_builder =
checked_cast<DictionaryBuilder<StringType>&>(*list_builder.value_builder());

ASSERT_OK(list_builder.Append());
ASSERT_OK(dict_builder.Append("a"));
ASSERT_OK(dict_builder.Append("b"));
ASSERT_OK(list_builder.Append());
ASSERT_OK(dict_builder.Append("a"));

std::shared_ptr<Array> result;
ASSERT_OK(list_builder.Finish(&result));

const auto& result_list_type = checked_cast<const ListType&>(*result->type());
const auto& result_dict_type =
checked_cast<const DictionaryType&>(*result_list_type.value_type());
ASSERT_TRUE(result_dict_type.ordered());
}

TEST(TestDictionaryBuilderOrdered, MakeDictionaryBuilderPreservesOrdered) {
auto ordered_type = dictionary(int8(), utf8(), /*ordered=*/true);
std::unique_ptr<ArrayBuilder> builder;
ASSERT_OK(MakeDictionaryBuilder(default_memory_pool(), ordered_type,
/*dictionary=*/nullptr, &builder));

auto builder_type = builder->type();
ASSERT_TRUE(checked_cast<const DictionaryType&>(*builder_type).ordered());
}

@thisisnic thisisnic May 7, 2026

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is all AI-generated. I roughly understood what it does but am not confident in it, other than having checked it looks relatively idiomatic, and all of these tests except UnorderedTypeStaysUnordered failed before the PR and passed after.

@github-actions github-actions Bot added awaiting changes Awaiting changes and removed awaiting committer review Awaiting committer review labels May 7, 2026
Comment thread cpp/src/arrow/builder.cc
const std::shared_ptr<DataType>& index_type;
const std::shared_ptr<DataType>& value_type;
const std::shared_ptr<Array>& dictionary;
std::shared_ptr<Array> dictionary;

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was updated as without it, we ended up with a dangling reference when nullptr is passed in, like when we're creating a builder from a type and don't have a dictionary array to pass in.

@thisisnic thisisnic marked this pull request as ready for review May 7, 2026 20:38
@thisisnic thisisnic requested a review from jonkeane as a code owner May 7, 2026 20:38
@thisisnic

Copy link
Copy Markdown
Member Author

Would you mind giving this a look over some time @pitrou?

I did use Claude and have tried to go through all generated code methodically to understand what it's doing and hopefully make it idiomatic, but let me know if there is anywhere I could have been more diligent in this.

CI failures appear to be ones on main and not related to this PR.

@thisisnic thisisnic requested a review from pitrou May 7, 2026 20:42

@pitrou pitrou left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR @thisisnic , here are some comments

Comment thread cpp/src/arrow/array/array_dict_test.cc Outdated
}

TEST(TestDictionaryBuilderOrdered, FinishPreservesOrderedFlag) {
auto ordered_type = dictionary(int8(), utf8(), /*ordered=*/true);

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could have a for loop on the ordered flag to test both possibilities.

Comment thread cpp/src/arrow/array/array_dict_test.cc Outdated
}

TEST(TestDictionaryBuilderOrdered, ListOfOrderedDictionary) {
auto ordered_dict_type = dictionary(int8(), utf8(), /*ordered=*/true);

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here.

Comment thread cpp/src/arrow/array/array_dict_test.cc Outdated
}

TEST(TestDictionaryBuilderOrdered, MakeDictionaryBuilderPreservesOrdered) {
auto ordered_type = dictionary(int8(), utf8(), /*ordered=*/true);

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here?

Comment thread cpp/src/arrow/array/array_dict_test.cc Outdated
}

TEST(TestDictionaryBuilderOrdered, UnorderedTypeStaysUnordered) {
auto unordered_type = dictionary(int8(), utf8(), /*ordered=*/false);

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These two tests can be factored into a single one by using a for loop.

Comment thread cpp/src/arrow/array/builder_dict.h Outdated
indices_builder_(start_int_size, pool, alignment),
value_type_(value_type) {}
value_type_(value_type),
ordered_(false) {}

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since ordered_ is default-initialized to false in the class declaration (I mean the bool ordered_ = false; part), we don't need to reiterate in the constructors.

Comment thread cpp/src/arrow/array/builder_dict.h Outdated
return ::arrow::dictionary(indices_builder_.type(), null(), ordered_);
}

void set_ordered(bool ordered) { ordered_ = ordered; }

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of having this separate method, perhaps we can add a bool ordered = false parameter at the end of DictionaryBuilderBase constructors?

Copilot AI review requested due to automatic review settings May 28, 2026 09:56
@github-actions github-actions Bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels May 28, 2026

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes loss of the ordered flag for dictionary types during array construction (notably for nested/list columns), which caused schema/type mismatches when writing ordered factors (ordered dictionaries) to Parquet from R.

Changes:

  • Propagate DictionaryType::ordered() into created dictionary builders (MakeBuilder / MakeDictionaryBuilder) and preserve it in type() / FinishInternal().
  • Remove the R-side post-hoc patch that rewrote the resulting DictionaryType as ordered.
  • Add C++ and R tests covering ordered dictionaries in both top-level and nested (list/large_list/fixed_size_list) contexts.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated no comments.

Show a summary per file
File Description
r/tests/testthat/test-Array.R Adds R regression tests for ordered factors inside list-like nested arrays.
r/src/r_to_arrow.cpp Removes workaround that manually re-marked finished dictionary arrays as ordered.
cpp/src/arrow/builder.cc Passes ordered through builder creation paths for dictionary types (including nested builder creation).
cpp/src/arrow/array/builder_dict.h Stores ordered_ in dictionary builders and uses it when reconstructing DictionaryType in type() / FinishInternal().
cpp/src/arrow/array/array_dict_test.cc Adds C++ unit tests ensuring ordered flag is preserved through MakeBuilder, Finish, and nested list builders.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@stephenashton-dhsc

Copy link
Copy Markdown

I think this has caused an unintended consequence.

The following code runs under the CRAN version of arrow without issue:

library(arrow)
library(dplyr)

starwars |> mutate(sex = factor(sex, ordered = TRUE)) |> write_parquet(sink = tempfile())

However, under this PR, it returns the following error:

Error: Invalid: Column data for field 7 with type dictionary<values=string, indices=int8, ordered=0> is inconsistent with schema dictionary<values=string, indices=int8, ordered=1>

Would it be possible to resolve this?

@thisisnic

thisisnic commented Jun 8, 2026

Copy link
Copy Markdown
Member Author

I think this has caused an unintended consequence.

The following code runs under the CRAN version of arrow without issue:

library(arrow)
library(dplyr)

starwars |> mutate(sex = factor(sex, ordered = TRUE)) |> write_parquet(sink = tempfile())

However, under this PR, it returns the following error:

Error: Invalid: Column data for field 7 with type dictionary<values=string, indices=int8, ordered=0> is inconsistent with schema dictionary<values=string, indices=int8, ordered=1>

Would it be possible to resolve this?

Thanks for the review @stephenashton-dhsc .Would you mind sharing a bit more about the steps you took to build Arrow to get that error? I tried it myself locally but I can't reproduce your error, so it might be that you had a mismatch between R package and C++ library version, though it'd be good to be sure.

Copilot AI review requested due to automatic review settings June 8, 2026 10:38

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated no new comments.

@stephenashton-dhsc

Copy link
Copy Markdown

I think this has caused an unintended consequence.
The following code runs under the CRAN version of arrow without issue:

library(arrow)
library(dplyr)

starwars |> mutate(sex = factor(sex, ordered = TRUE)) |> write_parquet(sink = tempfile())

However, under this PR, it returns the following error:

Error: Invalid: Column data for field 7 with type dictionary<values=string, indices=int8, ordered=0> is inconsistent with schema dictionary<values=string, indices=int8, ordered=1>

Would it be possible to resolve this?

Thanks for the review @stephenashton-dhsc .Would you mind sharing a bit more about the steps you took to build Arrow to get that error? I tried it myself locally but I can't reproduce your error, so it might be that you had a mismatch between R package and C++ library version, though it'd be good to be sure.

I just did a build/install using pak::pkg_install("apache/arrow/r#49937").

I'll retest under 29dfbc7 now, just in case that has resolved the issue.

@thisisnic

thisisnic commented Jun 8, 2026

Copy link
Copy Markdown
Member Author

I think this has caused an unintended consequence.
The following code runs under the CRAN version of arrow without issue:

library(arrow)
library(dplyr)

starwars |> mutate(sex = factor(sex, ordered = TRUE)) |> write_parquet(sink = tempfile())

However, under this PR, it returns the following error:

Error: Invalid: Column data for field 7 with type dictionary<values=string, indices=int8, ordered=0> is inconsistent with schema dictionary<values=string, indices=int8, ordered=1>

Would it be possible to resolve this?

Thanks for the review @stephenashton-dhsc .Would you mind sharing a bit more about the steps you took to build Arrow to get that error? I tried it myself locally but I can't reproduce your error, so it might be that you had a mismatch between R package and C++ library version, though it'd be good to be sure.

I just did a build/install using pak::pkg_install("apache/arrow/r#49937").

I'll retest under 29dfbc7 now, just in case that has resolved the issue.

Aha, AFAIK pak::pkg_install() won't rebuild the C++ libraries and will only rebuild the R package but with existing C++ libraries you have installed. If you wanted to test this PR out like that you'd need to do a full rebuild of the Arrow C++ library before installing the R package, as I've modified C++ in this PR. The docs for that are here.

What is it that you're trying to do? If you're trying to use the functionality implemented here, if you wait for the PR to merge, it'll be there in the nightly Arrow binaries later, which you can install using arrow::install_arrow(nightly=TRUE) which is a lot easier than doing a full dev build.

@stephenashton-dhsc

Copy link
Copy Markdown

library(arrow)
library(dplyr)

starwars |> mutate(sex = factor(sex, ordered = TRUE)) |> write_parquet(sink = tempfile())

Yep - still getting the same issue:

> pak::pkg_install("apache/arrow/r#49937")

! Using bundled GitHub PAT. Please add your own PAT using `gitcreds::gitcreds_set()`.

 Found  1  deps for  0/1  pkgs [⠋] Resolving apache/arrow/r#49937

ℹ Loading metadata database                                      

✔ Loading metadata database ... done

 Found  36  deps for  1/1  pkgs [⠧] Resolving standard (CRAN/BioC) packages                                                                        
 
→ Package library at 'C:\Users\SAshton\AppData\Local\R\Win-library\4.4'.
→ Will update 1 package.
→ The package (0 B) is cached.
+ arrow 24.0.0.9000 → 24.0.0.9000 [bld][cmp] (GitHub: 29dfbc7)
? Do you want to continue (Y/n) y
ℹ No downloads are needed, 1 pkg is cached
✔ Got arrow 24.0.0.9000 (source) (21.46 MB)
ℹ Packaging arrow 24.0.0.9000               
✔ Installed arrow 24.0.0.9000 (github::apache/arrow@29dfbc7) (407ms)                                                                 
✔ 1 pkg + 14 deps: kept 14, upd 1, dld 1 (NA B) [5m 5.3s]
> library(arrow)

Attaching package: ‘arrow’

The following object is masked from ‘package:utils’:

    timestamp

> library(dplyr)

Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union

Warning message:
package ‘dplyr’ was built under R version 4.4.3 
> starwars |> mutate(sex = factor(sex, ordered = TRUE)) |> write_parquet(sink = tempfile())
Error: Invalid: Column data for field 7 with type dictionary<values=string, indices=int8, ordered=0> is inconsistent with schema dictionary<values=string, indices=int8, ordered=1>

This is the current session info:

> sessionInfo()
R version 4.4.1 (2024-06-14 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 11 x64 (build 26200)

Matrix products: default


locale:
[1] LC_COLLATE=English_United Kingdom.utf8  LC_CTYPE=English_United Kingdom.utf8    LC_MONETARY=English_United Kingdom.utf8 LC_NUMERIC=C                            LC_TIME=English_United Kingdom.utf8    

time zone: Europe/London
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] dplyr_1.2.1       arrow_24.0.0.9000

loaded via a namespace (and not attached):
 [1] assertthat_0.2.1 R6_2.6.1         bit_4.6.0        tidyselect_1.2.1 tzdb_0.5.0       magrittr_2.0.5   glue_1.8.1       tibble_3.3.1     pkgconfig_2.0.3  bit64_4.8.2      generics_0.1.4   lifecycle_1.0.5  cli_3.6.6        pak_0.10.0       vctrs_0.7.3      compiler_4.4.1   purrr_1.2.2      pillar_1.11.1    rlang_1.2.0     

@stephenashton-dhsc

Copy link
Copy Markdown

I think this has caused an unintended consequence.
The following code runs under the CRAN version of arrow without issue:

library(arrow)
library(dplyr)

starwars |> mutate(sex = factor(sex, ordered = TRUE)) |> write_parquet(sink = tempfile())

However, under this PR, it returns the following error:

Error: Invalid: Column data for field 7 with type dictionary<values=string, indices=int8, ordered=0> is inconsistent with schema dictionary<values=string, indices=int8, ordered=1>

Would it be possible to resolve this?

Thanks for the review @stephenashton-dhsc .Would you mind sharing a bit more about the steps you took to build Arrow to get that error? I tried it myself locally but I can't reproduce your error, so it might be that you had a mismatch between R package and C++ library version, though it'd be good to be sure.

I just did a build/install using pak::pkg_install("apache/arrow/r#49937").
I'll retest under 29dfbc7 now, just in case that has resolved the issue.

Aha, AFAIK pak::pkg_install() won't rebuild the C++ libraries and will only rebuild the R package but with existing C++ libraries you have installed. If you wanted to test this PR out like that you'd need to do a full rebuild of the Arrow C++ library before installing the R package, as I've modified C++ in this PR. The docs for that are here.

What is it that you're trying to do? If you're trying to use the functionality implemented here, if you wait for the PR to merge, it'll be there in the nightly Arrow binaries later, which you can install using arrow::install_arrow(nightly=TRUE) which is a lot easier than doing a full dev build.

Ah - my mistake then. Apologies!

@thisisnic thisisnic requested a review from pitrou June 8, 2026 11:21

@pitrou pitrou left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks mostly good to me, but I can't vet the R tests

Comment thread cpp/src/arrow/array/array_dict_test.cc Outdated
Comment thread cpp/src/arrow/array/array_dict_test.cc Outdated
ARROW_SCOPED_TRACE("ordered = ", ordered);
auto dict_type = dictionary(int8(), utf8(), ordered);
std::unique_ptr<ArrayBuilder> builder;
ASSERT_OK(MakeDictionaryBuilder(default_memory_pool(), dict_type,

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here: use Result-returning variant

Co-authored-by: Antoine Pitrou <pitrou@free.fr>
Copilot AI review requested due to automatic review settings June 10, 2026 10:42

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

Comment thread cpp/src/arrow/array/array_dict_test.cc
@thisisnic

Copy link
Copy Markdown
Member Author

@jonkeane - Mind giving this a review from the R point of view when you have a moment?

Copilot AI review requested due to automatic review settings June 10, 2026 12:01

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated no new comments.

@jonkeane jonkeane left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. Is there anything we need to update in our docs about this? I can't remember if we list what arrow types are (fully) supported somewhere?

@github-actions github-actions Bot added awaiting merge Awaiting merge and removed awaiting change review Awaiting change review labels Jun 10, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants