X-Git-Url: https://git.ralfj.de/web.git/blobdiff_plain/2ba08ac89085aa3a0dbc7377278fcc001db48095..a1a7f9faa638583fea785ca998a301d4af0d8c17:/personal/_posts/2022-04-11-provenance-exposed.md?ds=sidebyside

diff --git a/personal/_posts/2022-04-11-provenance-exposed.md b/personal/_posts/2022-04-11-provenance-exposed.md
index 56b7203..44ccd86 100644
--- a/personal/_posts/2022-04-11-provenance-exposed.md
+++ b/personal/_posts/2022-04-11-provenance-exposed.md
@@ -17,7 +17,7 @@ There's a lot of information packed into this post, so better find a comfortable reading position.
 In case you don't know what I mean by "pointer provenance", you can either read that previous blog post or the [Strict Provenance documentation](https://doc.rust-lang.org/nightly/core/ptr/index.html#provenance).
 The gist of it is that a pointer consists not only of the address that it points to in memory, but also of its *provenance*: an extra piece of "shadow state" that is carried along with each pointer and that tracks which memory the pointer has permission to access and when.
-This is required to make sense of restrictions like "pointer arithmetic can never be used to construct a pointer that is valid for a different allocation than the one it started out in" (even with operations like Rust's [`wrapping_offset`](https://doc.rust-lang.org/std/primitive.pointer.html#method.wrapping_offset) that *do* allow out-of-bounds pointer arithmetic), or "use-after-free is Undefined Behavior, even if you checked that there is a new allocation at the same address as the old one".
+This is required to make sense of restrictions like "use-after-free is Undefined Behavior, even if you checked that there is a new allocation at the same address as the old one".
 Architectures like CHERI make this "shadow state" explicit (pointers are bigger than usual so that they can explicitly track which part of memory they are allowed to access), but even when compiling for AMD64 CPUs, compilers act "as if" pointers had such extra state -- it is part of the specification, part of the Abstract Machine, even if it is not part of the target CPU.
@@ -78,7 +78,7 @@ static int uwu(int *restrict x, int *restrict y) {
   int *y2 = y-1;
   uintptr_t y2addr = (uintptr_t)y2;
-  int *ptr = (int*)y2addr;
+  int *ptr = (int*)y2addr; // <-- using y2addr
   *ptr = 1;
   return *x;
@@ -104,7 +104,7 @@ static int uwu(int *restrict x, int *restrict y) {
   int *ptr = (int*)y2addr;
   *ptr = 1;
-  return 0;
+  return 0; // <-- hard-coded return value
 }

 int main() {
@@ -295,7 +295,8 @@ This is the entire reason why all of this "untagged pointer" mess exists.
 Under this brave new world, I can entirely ignore pointer-integer round-trips when designing memory models for Rust.
 Once that design is done, support for pointer-integer round-trips can be added as follows:
 - When a pointer is cast to an integer, its provenance (whatever information it is that the model attaches to pointers -- in Stacked Borrows, this is called the pointer's *tag*) is marked as "exposed".
-- When an integer is cast to a pointer, we *guess* the provenance that the new pointer should have from among all the provenances that have been previously marked as "exposed". (And I mean *all* of them, not just the ones that have been exposed "at the same address" or anything like that. People will inevitably do imperfect round-trips where the integer is being offset before being cast back to a pointer, and we should support that. As far as I know, this doesn't really cost us anything in terms of optimizations.)
+- When an integer is cast to a pointer, we *guess* the provenance that the new pointer should have from among all the provenances that have been previously marked as "exposed".
+  (And I mean *all* of them, not just the ones that have been exposed "at the same address" or anything like that. People will inevitably do imperfect round-trips where the integer is being offset before being cast back to a pointer, and we should support that. As far as I know, this doesn't really cost us anything in terms of optimizations.)

 This "guess" does not need to be described by an algorithm.
 Through the magic that is formally known as [angelic non-determinism](https://en.wikipedia.org/wiki/Angelic_non-determinism), we can just wave our hands and say "the guess will be maximally in the programmer's favor": if *any* possible choice of (previously exposed) provenance makes the program work, then that is the provenance the new pointer will get.
@@ -315,7 +316,8 @@ And who knows, maybe there *is* a clever way that Miri can actually get reasonab
 It doesn't have to be perfect to be useful.

 What I particularly like about this approach is that it makes pointer-integer round-trips a purely local concern.
-With an approach like Stacked Borrows "untagged pointers", *every* memory operation has to define how it handles such pointers -- complexity increases globally, and even when reasoning about Strict Provenance code we have to keep in mind that some pointers in other parts of the program might be "untagged".
+With an approach like Stacked Borrows "untagged pointers", *every* memory operation has to define how it handles such pointers.
+Complexity increases globally, and even when reasoning about Strict Provenance code we have to keep in mind that some pointers in other parts of the program might be "untagged".
 In contrast, this "guessing maximally in your favor"-based approach is entirely local; code that does not syntactically contain exposing pointer-integer or integer-pointer casts can literally forget that such casts exist at all.
 This is true both for programmers thinking about their `unsafe` code, and for compiler authors thinking about optimizations.
 Compositionality at its finest!
@@ -338,7 +340,10 @@ However, if/when a more precise model of C with `restrict` emerges, I don't thin
 The "udi" part of the name means "user disambiguation", and is basically the mechanism by which an integer-pointer cast in C "guesses" the provenance it has to pick up.
 The details of this are complicated, but the end-to-end effect is basically exactly the same as in the "best possible guess" model I have described above!
 Here, too, my vision for Rust aligns very well with the direction C is taking.
-(The set of valid guesses in C is just a lot more restricted since they do not have `wrapping_offset`. That means they can actually feasibly give an algorithm for how to do the guessing.)
+(The set of valid guesses in C is just a lot more restricted since they do not have `wrapping_offset`, and the model does not cover `restrict`.
+That means they can actually feasibly give an algorithm for how to do the guessing.
+They don't have to invoke scary terms like "angelic non-determinism", but the end result is the same -- and to me, the fact that it is equivalent to angelic non-determinism is what justifies this as a reasonable semantics.
+Presenting this as a concrete algorithm to pick a suitable provenance is then just a stylistic choice.)

 What is left is the question of how to handle pointer-integer transmutation, and this is where the roads are forking.
 PNVI-ae-udi explicitly says loading from a union field at integer type exposes the provenance of the pointer being loaded, if any.
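
To make the "expose" / "guess" rules from the bullet-list hunk above concrete, here is a minimal Rust sketch of such an imperfect round-trip. It assumes that plain `as` casts between pointers and `usize` carry the expose/guess semantics described above; the variable names are just for illustration.

```rust
use std::mem;

fn main() {
    let mut data = [0i32; 2];

    // Pointer-to-integer cast: under the model above, this marks the
    // provenance of `ptr` (which covers all of `data`) as "exposed".
    let ptr: *mut i32 = data.as_mut_ptr();
    let addr = ptr as usize;

    // An "imperfect" round-trip: the integer is offset *before* being cast
    // back to a pointer. The integer-to-pointer cast has to guess among all
    // exposed provenances; the guess that makes this program work is the
    // provenance of `ptr`, so that is the one the new pointer gets.
    let ptr2 = (addr + mem::size_of::<i32>()) as *mut i32;

    // The write is in bounds of `data`, so the guessed provenance permits it.
    unsafe { ptr2.write(1) };
    assert_eq!(data[1], 1);
}
```

Code that never performs such casts can ignore this mechanism entirely, which is exactly the locality argument made above.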