Skip to content

Translate char as core::ffi::c_char instead of u8#194

Open
lucic71 wants to merge 17 commits into
Cpp2Rust:masterfrom
lucic71:u8-as-c_char
Open

Translate char as core::ffi::c_char instead of u8#194
lucic71 wants to merge 17 commits into
Cpp2Rust:masterfrom
lucic71:u8-as-c_char

Conversation

@lucic71

@lucic71 lucic71 commented Jun 13, 2026

Copy link
Copy Markdown
Contributor

Translates char as c_char instead of u8 in both unsafe and refcount. It uses core:ffi::c_char instead of libc::c_char so that refcount is not dependent on libc.

This contains:

  1. change all rules and libcc2rs to use c_char instead of u8
  2. use c_char instead of u8 in mapper and VisitBuiltinType
  3. use c_char in argv for both unsafe and refcount
  4. delete the IsCharPointerFieldFromLibc and IsCharArrayFieldFromLibc hacks
  5. use the edition 2024 c"" syntax to define c_char compatible string literals
  6. use libcc2rs::char_array to define c_char array of characters
  7. use getTypedLiteral to write 0 as c_char (correct) instead of 0_c_char (incorrect)
  8. use reinterpret_cast between char and unsigned char in VisitExplicitCastExpr and BuildFnAdapter
  9. reinterpret all remaining bytes instead of one object in PtrKind::Reinterpreted

I wanted to split each of them in a separate PR, however in isolation they don't work. So I had to pack them together.

9 is a bug that I encountered during the u8 -> c_char change. brunsli does reinterpret_cast<uint8_t>(string.data()). string.data() returns a char (c_char) and uint8_t is translated as u8. This means that the following fast-path in Ptr::reinterprete_cast is skipped and a real reinterpret takes place:

https://github.com/Cpp2Rust/cpp2rust/blob/master/libcc2rs/src/rc.rs#L389-L392

Later, brunsli passes the reinterpreted pointer to the unsafe brotli ffi. In the previous version of StrongPtr::deref/PtrKind::with_mut/PtrKind::with we only materialized one u8 from c_char, meaning that the brotli ffi received a single u8 and a size > 1, resulting in OOB. The current code materializes the entire remaining bytes of the source pointer.

This problem only appears because we interact with the unsafe ffi. In safe code, it's fine to materialize only one element because all our Ptr APIs work on element boundary, e.g. write/read. I expect 9 to be a temporary solution until we replace the unsafe brotli ffi with a safe brotli crate.

@lucic71 lucic71 marked this pull request as draft June 13, 2026 19:02
@lucic71 lucic71 marked this pull request as ready for review June 22, 2026 17:45
@lucic71

lucic71 commented Jun 22, 2026

Copy link
Copy Markdown
Contributor Author

@nunoplopes this is ready for review

@nunoplopes

Copy link
Copy Markdown
Contributor

What's the advantage of using c_char vs u8? The code seems a lot more verbose and less idiomatic. Refcount should not use c_char.

@lucic71

lucic71 commented Jun 24, 2026

Copy link
Copy Markdown
Contributor Author

What's the advantage of using c_char vs u8? The code seems a lot more verbose and less idiomatic. Refcount should not use c_char.

u8 is incompatible with char* or char[] from libc or FFI. We use u8 while libc and FFI use c_char (typedef over i8). The job of IsCharPointerFieldFromLibc and IsCharArrayFieldFromLibc was to workaround this discrepancy. I deleted both functions now.

We can use c_char instead of core:ffi::c_char to make the code less verbose and more idiomatic.

It's ok for refcount to use c_char. core::ffi::c_char does not depend on libc and c_char is the correct and portable solution, consider the following code:

#include <cassert>
int main() {
    char a = -2;
    char b = 1;
    assert(a + b == -1);
    return 0;
}

With the u8 translation this fails. With the c_char translation, this is ok.

@nunoplopes

Copy link
Copy Markdown
Contributor

I disagree. I don't think core:ffi::c_char is more idiomatic. Native Rust code does not use that.
The sign of char is platform dependent. It's true that in x86 it's signed (so equivalent to i8). On ARM it's often unsigned (equivalent to u8).

I don't know why the translation of that example fails, but c_char is not the answer, for recount at least.

@lucic71

lucic71 commented Jun 24, 2026

Copy link
Copy Markdown
Contributor Author

I disagree. I don't think core:ffi::c_char is more idiomatic. Native Rust code does not use that. The sign of char is platform dependent. It's true that in x86 it's signed (so equivalent to i8). On ARM it's often unsigned (equivalent to u8).

I don't know why the translation of that example fails, but c_char is not the answer, for recount at least.

Then for unsafe I will continue with c_char so that I can get rid of IsCharPointerFieldFromLibc and IsCharArrayFieldFromLibc. For refcount we keep u8 instead of c_char because we don't expect to interact with libc.

I don't know why the translation of that example fails

On a platform where char is signed, in a + b both arguments are promoted to signed int, -2 + 1 becomes -1 and the result is downgraded back to char which results in (char) -1. But becaue we translate char as u8, the result becomes: -2i32 as u8 + 1_u8 = 254 + 1 = 255

fn main_0() -> i32 {
    let a: Value<u8> = Rc::new(RefCell::new((-2_i32 as u8)));
    let b: Value<u8> = Rc::new(RefCell::new(1_u8));
    assert!(((((*a.borrow()) as i32) + ((*b.borrow()) as i32)) == -1_i32));
    return 0;
}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants