X-Git-Url: https://git.ralfj.de/web.git/blobdiff_plain/b6cc5bfc0c026d395f232a3cb5a6fdc429c295e0..bfa3f651654789ba700d2368cb7099b3af9c8132:/personal/_posts/2022-04-11-provenance-exposed.md?ds=inline diff --git a/personal/_posts/2022-04-11-provenance-exposed.md b/personal/_posts/2022-04-11-provenance-exposed.md index b37344c..6d0d7d3 100644 --- a/personal/_posts/2022-04-11-provenance-exposed.md +++ b/personal/_posts/2022-04-11-provenance-exposed.md @@ -1,6 +1,6 @@ --- title: "Pointers Are Complicated III, or: Pointer-integer casts exposed" -categories: rust research +categories: rust research programming license: CC BY-SA 4.0 license-url: https://creativecommons.org/licenses/by-sa/4.0/ reddit: /rust/comments/u1bbqn/pointers_are_complicated_iii_or_pointerinteger/ @@ -261,7 +261,7 @@ So what are the alternatives? Well, I would argue that the alternative is to treat the original program (after translation to Rust) as having Undefined Behavior. There are, to my knowledge, generally two reasons why people might want to transmute a pointer to an integer: - Chaining many `as` casts is annoying, so calling `mem::transmute` might be shorter. -- The code doesn't actually care about the *integer* per se, it just needs *some way* to hold arbitrary data in a container of a given time. +- The code doesn't actually care about the *integer* per se, it just needs *some way* to hold arbitrary data in a container of a given type. The first kind of code should just use `as` casts, and we should do what we can (via lints, for example) to identify such code and get it to use casts instead.[^compat] Maybe we can adjust the cast rules to remove the need for chaining, or add some [helper methods](https://doc.rust-lang.org/nightly/std/primitive.pointer.html#method.expose_addr) that can be used instead. 
@@ -277,13 +277,21 @@ Because of that, I think we should move towards discouraging, deprecating, or ev That means a cast is the only legal way to turn a pointer into an integer, and after the discussion above we got our casts covered. A [first careful step](https://github.com/rust-lang/rust/pull/95547) has recently been taken on this journey; the `mem::transmute` documentation now cautions against using this function to turn pointers into integers. +**Update (2022-09-14):** After a lot more discussion, the current model pursued by the Unsafe Code Guidelines WG is to say that pointer-to-integer transmutation is permitted, but just strips provenance without exposing it. +That means the program with the casts replaced by transmutation is UB, because the `ptr` it ends up dereferencing has invalid provenance. +However, the transmutation itself is not UB. +Basically, pointer-to-integer transmutation is equivalent to [the `addr` method](https://doc.rust-lang.org/nightly/std/primitive.pointer.html#method.addr), with all its caveats -- in particular, transmuting a pointer to an integer and back is like calling `addr` and then calling [`ptr::invalid`](https://doc.rust-lang.org/nightly/std/ptr/fn.invalid.html). +That is a *lossy* round-trip: it loses provenance information, making the resulting pointer invalid to dereference. +It is lossy even if we use a regular integer-to-pointer cast (or `from_exposed_addr`) for the conversion back to a pointer, since the original provenance might never have been exposed. +Compared to declaring the transmutation itself UB, this model has some nice properties that help compiler optimizations (such as removing unnecessary store-load round-trips). **/Update** + ## A new hope for Rust All in all, while the situation may be very complicated, I am actually more hopeful than ever that we can have both -- a precise memory model for Rust *and* all the optimizations we can hope for! 
The three core pillars of this approach are: - making pointer-integer casts "expose" the pointer's provenance, - offering `ptr.addr()` to learn a pointer's address *without* exposing its provenance, -- and disallowing pointer-integer transmutation. +- and making pointer-integer transmutation round-trips lossy (such that the resulting pointer cannot be dereferenced). Together, they imply that we can optimize "nice" code (that follows Strict Provenance, and does not "expose" or use integer-pointer casts) perfectly, without any risk of breaking code that does use pointer-integer round-trips. In the easiest possible approach, the compiler can simply treat pointer-integer and integer-pointer casts as calls to some opaque external function. @@ -328,7 +336,7 @@ Compositionality at its finest! I have talked a lot about my vision for "solving" pointer provenance in Rust. What about other languages? -As you might have heard, C is moving towards making [PNVI-ae-udi](http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2577.pdf) an official recommendation for how to interpret the C memory model. +As you might have heard, C is moving towards making [PNVI-ae-udi](http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2676.pdf) an official recommendation for how to interpret the C memory model. With C having so much more legacy code to care about and many more stakeholders than Rust does, this is an impressive achievement! How does it compare to all I said above? @@ -345,8 +353,8 @@ Here, too, my vision for Rust aligns very well with the direction C is taking. (The set of valid guesses in C is just a lot more restricted since they do not have `wrapping_offset`, and the model does not cover `restrict`. That means they can actually feasibly give an algorithm for how to do the guessing. 
They don't have to invoke scary terms like "angelic non-determinism", but the end result is the same -- and to me, the fact that it is equivalent to angelic non-determinism is what justifies this as a reasonable semantics. -Presenting this as a concrete algorithm to pick a suitable provenance is then just a stylistic choice. -Kudos go to Michael Sammler for opening my eyes to this interpretation of "user disambiguation".) +Presenting this as a concrete algorithm to pick a suitable provenance is then just a stylistic choice.) +Kudos go to Michael Sammler for opening my eyes to this interpretation of "user disambiguation", and arguing that angelic non-determinism might not be such a crazy idea after all. What is left is the question of how to handle pointer-integer transmutation, and this is where the roads are forking. PNVI-ae-udi explicitly says loading from a union field at integer type exposes the provenance of the pointer being loaded, if any. @@ -367,8 +375,8 @@ Because of all that, I think it is reasonable for Rust to make a different choic This was a long post, but I hope you found it worth reading. :) To summarize, my concrete calls for action in Rust are: -- Code that uses pointer-integer transmutation should migrate to regular casts or `MaybeUninit` transmutation ASAP. - I think we should declare pointer-integer transmutation Undefined Behavior and not accept such code as well-defined. +- Code that uses pointer-integer transmutation round-trips should migrate to regular casts or `MaybeUninit` transmutation ASAP. + I think we should declare pointer-integer transmutation as "losing" provenance, so code that assumes a lossless transmutation round-trip has Undefined Behavior. - Code that uses pointer-integer or integer-pointer *casts* might consider migrating to the Strict Provenance APIs. You can do this even on stable with [this polyfill crate](https://crates.io/crates/sptr). However, such code *is and remains* well-defined. 
It just might not be optimized as well as one could hope, it might not compile on CHERI, and Miri will probably miss some bugs. @@ -421,13 +429,17 @@ My personal stance is that we should not let the cast synthesize a new provenanc This would entirely lose the benefit I discussed above of making pointer-integer round-trips a *local* concern -- if these round-trips produce new, never-before-seen kinds of provenance, then the entire rest of the memory model has to define how it deals with those provenances. We already have no choice but treat pointer-integer casts as an operation with side-effects; let's just do the same with integer-pointer casts and remain sure that no matter what the aliasing rules are, they will work fine even in the presence of pointer-integer round-trips. +That said, under this model integer-pointer casts still have no side-effect, in the sense that just removing them (if their result is unused) is fine. +Hence, it *could* make sense to implicitly perform integer-pointer casts in some situations, like when an integer value (without provenance) is used in a pointer operation (due to an integer-to-pointer transmutation). +This breaks some optimizations like load fusion (turning two loads into one assumes the same provenance was picked both times), but most optimizations (in particular dead code elimination) are unaffected. + #### What about LLVM? I discussed above how my vision for Rust relates to the direction C is moving towards. What does that mean for the design space of LLVM? -Which changes need to be made to fix (potential) miscompilations in LLVM and to make it compatible with these ideas for C and/or Rust? +Which changes would have to be made to fix (potential) miscompilations in LLVM and to make it compatible with these ideas for C and/or Rust? 
Here's the list of open problems I am aware of: -- LLVM needs to stop [removing `inttoptr(ptrtoint(_))`](https://github.com/llvm/llvm-project/issues/33896) and stop doing [replacement of `==`-equal pointers](https://github.com/llvm/llvm-project/issues/34577). +- LLVM would have to stop [removing `inttoptr(ptrtoint(_))`](https://github.com/llvm/llvm-project/issues/33896) and stop doing [replacement of `==`-equal pointers](https://github.com/llvm/llvm-project/issues/34577). - As the first example shows, LLVM also needs to treat `ptrtoint` as a side-effecting operation that has to be kept around even when its result is unused. (Of course, as with everything I say here, there can be special cases where the old optimizations are still correct, but they need extra justification.) - I think LLVM should also treat `inttoptr` as a side-effecting (and, in particular, non-deterministic) operation, as per the last example. However, this could possibly be avoided with a `noalias` model that specifically accounts for new kinds of provenance being synthesized by casts. (I am being vague here since I don't know what that provenance needs to look like.)
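
For readers who want a source-level picture of the `inttoptr(ptrtoint(_))` pattern in the list above, here is a minimal Rust sketch of a pointer-integer round-trip, next to the transmutation variant discussed in the update earlier in this diff. The function names `roundtrip_via_casts` and `addr_via_transmute` are made up for illustration and are not part of the post or of any library. The sketch assumes the model described in the update: an `as` cast exposes the pointer's provenance, while a transmutation only yields the address.

```rust
use std::mem;

/// Hypothetical helper: round-trips a pointer through `usize` using `as` casts.
/// Under the model described in the post, the pointer-to-integer cast exposes
/// the provenance, so the integer-to-pointer cast can pick it up again and the
/// result may be dereferenced.
fn roundtrip_via_casts(ptr: *const i32) -> *const i32 {
    let addr = ptr as usize; // pointer-to-integer cast (exposes provenance)
    addr as *const i32 // integer-to-pointer cast
}

/// Hypothetical helper: reads a pointer's address via transmutation.
/// Under the updated model this is not UB by itself, but it does not expose
/// the provenance either, so a pointer rebuilt from this integer alone should
/// not be assumed to be dereferenceable.
fn addr_via_transmute(ptr: *const i32) -> usize {
    // SAFETY: `*const i32` and `usize` have the same size; only the numeric
    // address is used afterwards, and it is never turned back into a pointer.
    unsafe { mem::transmute::<*const i32, usize>(ptr) }
}
```

Both helpers produce the same numeric address. The difference is only in what the resulting integer may later be used for, which is exactly why a compiler cannot freely drop or rewrite the cast pair.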