GH-49689: [R][C++] Parquets do not support list-columns of ordered factors (ordered dictionaries)#49937
GH-49689: [R][C++] Parquets do not support list-columns of ordered factors (ordered dictionaries)#49937thisisnic wants to merge 9 commits into
Conversation
|
|
| // GH-49689: Ordered dictionary tests | ||
|
|
||
| TEST(TestDictionaryBuilderOrdered, TypePreservesOrderedFlag) { | ||
| auto ordered_type = dictionary(int8(), utf8(), /*ordered=*/true); | ||
| std::unique_ptr<ArrayBuilder> boxed_builder; | ||
| ASSERT_OK(MakeBuilder(default_memory_pool(), ordered_type, &boxed_builder)); | ||
|
|
||
| auto builder_type = boxed_builder->type(); | ||
| ASSERT_TRUE(checked_cast<const DictionaryType&>(*builder_type).ordered()); | ||
| } | ||
|
|
||
| TEST(TestDictionaryBuilderOrdered, UnorderedTypeStaysUnordered) { | ||
| auto unordered_type = dictionary(int8(), utf8(), /*ordered=*/false); | ||
| std::unique_ptr<ArrayBuilder> boxed_builder; | ||
| ASSERT_OK(MakeBuilder(default_memory_pool(), unordered_type, &boxed_builder)); | ||
|
|
||
| auto builder_type = boxed_builder->type(); | ||
| ASSERT_FALSE(checked_cast<const DictionaryType&>(*builder_type).ordered()); | ||
| } | ||
|
|
||
| TEST(TestDictionaryBuilderOrdered, FinishPreservesOrderedFlag) { | ||
| auto ordered_type = dictionary(int8(), utf8(), /*ordered=*/true); | ||
| std::unique_ptr<ArrayBuilder> boxed_builder; | ||
| ASSERT_OK(MakeBuilder(default_memory_pool(), ordered_type, &boxed_builder)); | ||
| auto& builder = checked_cast<DictionaryBuilder<StringType>&>(*boxed_builder); | ||
|
|
||
| ASSERT_OK(builder.Append("a")); | ||
| ASSERT_OK(builder.Append("b")); | ||
| ASSERT_OK(builder.Append("a")); | ||
|
|
||
| std::shared_ptr<Array> result; | ||
| ASSERT_OK(builder.Finish(&result)); | ||
|
|
||
| const auto& result_type = checked_cast<const DictionaryType&>(*result->type()); | ||
| ASSERT_TRUE(result_type.ordered()); | ||
|
|
||
| auto ex_dict = ArrayFromJSON(utf8(), R"(["a", "b"])"); | ||
| auto ex_indices = ArrayFromJSON(int8(), "[0, 1, 0]"); | ||
| DictionaryArray expected(ordered_type, ex_indices, ex_dict); | ||
| AssertArraysEqual(expected, *result); | ||
| } | ||
|
|
||
| TEST(TestDictionaryBuilderOrdered, ListOfOrderedDictionary) { | ||
| auto ordered_dict_type = dictionary(int8(), utf8(), /*ordered=*/true); | ||
| auto list_type = list(field("item", ordered_dict_type)); | ||
|
|
||
| std::unique_ptr<ArrayBuilder> boxed_builder; | ||
| ASSERT_OK(MakeBuilder(default_memory_pool(), list_type, &boxed_builder)); | ||
| auto& list_builder = checked_cast<ListBuilder&>(*boxed_builder); | ||
| auto& dict_builder = | ||
| checked_cast<DictionaryBuilder<StringType>&>(*list_builder.value_builder()); | ||
|
|
||
| ASSERT_OK(list_builder.Append()); | ||
| ASSERT_OK(dict_builder.Append("a")); | ||
| ASSERT_OK(dict_builder.Append("b")); | ||
| ASSERT_OK(list_builder.Append()); | ||
| ASSERT_OK(dict_builder.Append("a")); | ||
|
|
||
| std::shared_ptr<Array> result; | ||
| ASSERT_OK(list_builder.Finish(&result)); | ||
|
|
||
| const auto& result_list_type = checked_cast<const ListType&>(*result->type()); | ||
| const auto& result_dict_type = | ||
| checked_cast<const DictionaryType&>(*result_list_type.value_type()); | ||
| ASSERT_TRUE(result_dict_type.ordered()); | ||
| } | ||
|
|
||
| TEST(TestDictionaryBuilderOrdered, MakeDictionaryBuilderPreservesOrdered) { | ||
| auto ordered_type = dictionary(int8(), utf8(), /*ordered=*/true); | ||
| std::unique_ptr<ArrayBuilder> builder; | ||
| ASSERT_OK(MakeDictionaryBuilder(default_memory_pool(), ordered_type, | ||
| /*dictionary=*/nullptr, &builder)); | ||
|
|
||
| auto builder_type = builder->type(); | ||
| ASSERT_TRUE(checked_cast<const DictionaryType&>(*builder_type).ordered()); | ||
| } | ||
|
|
There was a problem hiding this comment.
This is all AI-generated. I roughly understood what it does but am not confident in it, other than having checked it looks relatively idiomatic, and all of these tests except UnorderedTypeStaysUnordered failed before the PR and passed after.
| const std::shared_ptr<DataType>& index_type; | ||
| const std::shared_ptr<DataType>& value_type; | ||
| const std::shared_ptr<Array>& dictionary; | ||
| std::shared_ptr<Array> dictionary; |
There was a problem hiding this comment.
This was updated as without it, we ended up with a dangling reference when nullptr is passed in, like when we're creating a builder from a type and don't have a dictionary array to pass in.
|
Would you mind giving this a look over some time @pitrou? I did use Claude and have tried to go through all generated code methodically to understand what it's doing and hopefully make it idiomatic, but let me know if there is anywhere I could have been more diligent in this. CI failures appear to be ones on main and not related to this PR. |
pitrou
left a comment
There was a problem hiding this comment.
Thanks for the PR @thisisnic , here are some comments
| } | ||
|
|
||
| TEST(TestDictionaryBuilderOrdered, FinishPreservesOrderedFlag) { | ||
| auto ordered_type = dictionary(int8(), utf8(), /*ordered=*/true); |
There was a problem hiding this comment.
You could have a for loop on the ordered flag to test both possibilities.
| } | ||
|
|
||
| TEST(TestDictionaryBuilderOrdered, ListOfOrderedDictionary) { | ||
| auto ordered_dict_type = dictionary(int8(), utf8(), /*ordered=*/true); |
| } | ||
|
|
||
| TEST(TestDictionaryBuilderOrdered, MakeDictionaryBuilderPreservesOrdered) { | ||
| auto ordered_type = dictionary(int8(), utf8(), /*ordered=*/true); |
| } | ||
|
|
||
| TEST(TestDictionaryBuilderOrdered, UnorderedTypeStaysUnordered) { | ||
| auto unordered_type = dictionary(int8(), utf8(), /*ordered=*/false); |
There was a problem hiding this comment.
These two tests can be factored into a single one by using a for loop.
| indices_builder_(start_int_size, pool, alignment), | ||
| value_type_(value_type) {} | ||
| value_type_(value_type), | ||
| ordered_(false) {} |
There was a problem hiding this comment.
Since ordered_ is default-initialized to false in the class declaration (I mean the bool ordered_ = false; part), we don't need to reiterate in the constructors.
| return ::arrow::dictionary(indices_builder_.type(), null(), ordered_); | ||
| } | ||
|
|
||
| void set_ordered(bool ordered) { ordered_ = ordered; } |
There was a problem hiding this comment.
Instead of having this separate method, perhaps we can add a bool ordered = false parameter at the end of DictionaryBuilderBase constructors?
bcc614b to
f3eb5a0
Compare
There was a problem hiding this comment.
Pull request overview
Fixes loss of the ordered flag for dictionary types during array construction (notably for nested/list columns), which caused schema/type mismatches when writing ordered factors (ordered dictionaries) to Parquet from R.
Changes:
- Propagate
DictionaryType::ordered()into created dictionary builders (MakeBuilder/MakeDictionaryBuilder) and preserve it intype()/FinishInternal(). - Remove the R-side post-hoc patch that rewrote the resulting
DictionaryTypeas ordered. - Add C++ and R tests covering ordered dictionaries in both top-level and nested (list/large_list/fixed_size_list) contexts.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated no comments.
Show a summary per file
| File | Description |
|---|---|
| r/tests/testthat/test-Array.R | Adds R regression tests for ordered factors inside list-like nested arrays. |
| r/src/r_to_arrow.cpp | Removes workaround that manually re-marked finished dictionary arrays as ordered. |
| cpp/src/arrow/builder.cc | Passes ordered through builder creation paths for dictionary types (including nested builder creation). |
| cpp/src/arrow/array/builder_dict.h | Stores ordered_ in dictionary builders and uses it when reconstructing DictionaryType in type() / FinishInternal(). |
| cpp/src/arrow/array/array_dict_test.cc | Adds C++ unit tests ensuring ordered flag is preserved through MakeBuilder, Finish, and nested list builders. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
I think this has caused an unintended consequence. The following code runs under the CRAN version of arrow without issue: However, under this PR, it returns the following error: Would it be possible to resolve this? |
Thanks for the review @stephenashton-dhsc .Would you mind sharing a bit more about the steps you took to build Arrow to get that error? I tried it myself locally but I can't reproduce your error, so it might be that you had a mismatch between R package and C++ library version, though it'd be good to be sure. |
66d7c53 to
29dfbc7
Compare
I just did a build/install using I'll retest under 29dfbc7 now, just in case that has resolved the issue. |
Aha, AFAIK What is it that you're trying to do? If you're trying to use the functionality implemented here, if you wait for the PR to merge, it'll be there in the nightly Arrow binaries later, which you can install using |
Yep - still getting the same issue: This is the current session info: |
Ah - my mistake then. Apologies! |
pitrou
left a comment
There was a problem hiding this comment.
Looks mostly good to me, but I can't vet the R tests
| ARROW_SCOPED_TRACE("ordered = ", ordered); | ||
| auto dict_type = dictionary(int8(), utf8(), ordered); | ||
| std::unique_ptr<ArrayBuilder> builder; | ||
| ASSERT_OK(MakeDictionaryBuilder(default_memory_pool(), dict_type, |
There was a problem hiding this comment.
Same here: use Result-returning variant
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
|
@jonkeane - Mind giving this a review from the R point of view when you have a moment? |
jonkeane
left a comment
There was a problem hiding this comment.
Looks good to me. Is there anything we need to update in our docs about this? I can't remember if we list what arrow types are (fully) supported somewhere?
Rationale for this change
Ordered dictionaries inside nested types lose their
orderedflag during construction, becauseDictionaryBuilderdoesn't track it.What changes are included in this PR?
Store the
orderedflag inDictionaryBuilderBaseand pass it through when reconstructing theDictionaryTypeintype()andFinishInternal().Remove the R-side workaround that was patching this for top-level columns only.
Are these changes tested?
Yes
Are there any user-facing changes?
No
AI usage
Written by Claude, reviewed by me and Codex (via roborev). I had Claude create the tests first, to make sure they failed as expected, then had Claude make the fixes and checked the tests passed. I questioned the approach as I went.