X-Git-Url: https://git.ralfj.de/web.git/blobdiff_plain/3b15ae85e0963b9a853d24bdd397998acaa8720d..d32c1e927cb459a72d8463644239bdd711515aed:/ralf/_posts/2018-07-24-pointers-and-bytes.md?ds=sidebyside diff --git a/ralf/_posts/2018-07-24-pointers-and-bytes.md b/ralf/_posts/2018-07-24-pointers-and-bytes.md index adf82bd..64ea94b 100644 --- a/ralf/_posts/2018-07-24-pointers-and-bytes.md +++ b/ralf/_posts/2018-07-24-pointers-and-bytes.md @@ -4,7 +4,7 @@ categories: internship rust forum: https://internals.rust-lang.org/t/pointers-are-complicated-or-whats-in-a-byte/8045 --- -This summer, I am again [working on Rust full-time]({{ site.baseurl }}{% post_url 2018-07-11-research-assistant %}), and again I will work (amongst other things) on a "memory model" for Rust/MIR. +This summer, I am again [working on Rust full-time]({% post_url 2018-07-11-research-assistant %}), and again I will work (amongst other things) on a "memory model" for Rust/MIR. However, before I can talk about the ideas I have for this year, I have to finally take the time and dispel the myth that "pointers are simple: they are just integers". Both parts of this statement are false, at least in languages with unsafe features like Rust or C: Pointers are neither simple nor (just) integers. @@ -38,7 +38,7 @@ However, given how low-level a language C++ is, we can actually break this assum Since `&x[i]` is the same as `x+i`, this means we are actually writing `23` to `&y[0]`. Of course, that does not stop C++ compilers from doing these optimizations. -To allow this, the standard declares our code to have [undefined behavior]({{ site.baseurl }}{% post_url 2017-07-14-undefined-behavior %}). +To allow this, the standard declares our code to have [undefined behavior]({% post_url 2017-07-14-undefined-behavior %}). First of all, it is not allowed to perform pointer arithmetic (like `&x[i]` does) that goes [beyond either end of the array it started in](https://timsong-cpp.github.io/cppwp/n4140/expr.add#5). Our program violates this rule: `x[i]` is outside of `x`, so this is undefined behavior. @@ -117,9 +117,9 @@ But the actual machine also does not do the kind of optimizations that modern C+ If we wrote the above programs in assembly, there would be no UB, and no optimizations. C++ and Rust employ a more "high-level" view of memory and pointers, restricting the programmer for the benefit of optimizations. When formally describing what the programmer may and may not do in these languages, as we have seen, the model of pointers as integers falls apart, so we have to look for something else. -This is another example of using a "virtual machine" that's different from the real machine for specification purposes, which is an idea [I have blogged about before]({{ site.baseurl }}{% post_url 2017-06-06-MIR-semantics %}). +This is another example of using a "virtual machine" that's different from the real machine for specification purposes, which is an idea [I have blogged about before]({% post_url 2017-06-06-MIR-semantics %}). -Here's a simple proposal (in fact, this is the model of pointers used in [CompCert](https://hal.inria.fr/hal-00703441/document) and my [RustBelt work]({{ site.baseurl }}{% post_url 2017-07-08-rustbelt %}), and it is also how [miri](https://github.com/solson/miri/) implements pointers): +Here's a simple proposal (in fact, this is the model of pointers used in [CompCert](https://hal.inria.fr/hal-00703441/document) and my [RustBelt work]({% post_url 2017-07-08-rustbelt %}), and it is also how [miri](https://github.com/solson/miri/) implements pointers): A pointer is a pair of some kind of ID uniquely identifying the *allocation*, and an *offset* into the allocation. Adding/subtracting an integer to/from a pointer just acts on the offset, and can thus never leave the allocation. Subtracting a pointer from another is only allowed when both point to the same allocation (matching [C++](https://timsong-cpp.github.io/cppwp/n4140/expr.add#6)).[^2] @@ -137,7 +137,10 @@ However, this simple model starts to fall apart once you consider pointer-intege In miri, casting a pointer to an integer does not actually do anything, we now just have an integer variable (i.e., its *type* says it is an integer) whose *value* is a pointer (i.e., an allocation-offset pair). However, multiplying that "integer" by 2 leads to an error, because it is entirely unclear what it means to multiply such an abstract pointer by 2. -This is the most lazy thing to do, and we do it because it is not clear what else to do -- in our abstract machine, there is no single coherent "address space" that all allocations live in, that we could use to map every pointer to a distinct integer. +I should clarify that this is *not* a good solution when defining language semantics. +It works fine for an interpreter though. +It is the most lazy thing to do, and we do it because it is not clear what else to do (other than not supporting these casts at all -- but this way, miri can run more programs): +In our abstract machine, there just is no single coherent "address space" that all allocations live in, that we could use to map every pointer to a distinct integer. Every allocation is just identified by an (unobservable) ID. We could now start to enrich this model with extra data like a base address for each allocation, and somehow use that when casting an integer back to a pointer... but that's where it gets really complicated, and anyway discussing such a model is not the point of this post. The point it to discuss the *need* for such a model. @@ -161,7 +164,8 @@ We have to say what the value of `v` is, so we have to find some way to answer t (And this is an entirely separate issue from the problem with multiplication that came up in the last section. We just assume some abstract type `Pointer`.) We cannot represent a byte of a pointer as an element of `0..256`. -Instead, we will remember both the pointer, and which byte of the pointer we got. +Essentially, if we use a naive model of memory, the extra "hidden" part of a pointer (the one that makes it more than just an integer) would be lost whne a pointer is stored to memory and loaded again. +We have to fix this, so we have to extend our notion of a "byte" to accomodate that extra state. So, a byte is now *either* an element of `0..256` ("raw bits"), *or* the n-th byte of some abstract pointer. If we were to implement our memory model in Rust, this might look as follows: {% highlight rust %}