X-Git-Url: https://git.ralfj.de/web.git/blobdiff_plain/4ff0d6f1866f9f6509c5a0abc70cfa1f92d76deb..ec692a3dcd495e6b87f7d3ea4c485df36141b5a7:/personal/_posts/2018-07-24-pointers-and-bytes.md?ds=inline

diff --git a/personal/_posts/2018-07-24-pointers-and-bytes.md b/personal/_posts/2018-07-24-pointers-and-bytes.md
index 29cc24e..23371fe 100644
--- a/personal/_posts/2018-07-24-pointers-and-bytes.md
+++ b/personal/_posts/2018-07-24-pointers-and-bytes.md
@@ -1,10 +1,10 @@
 ---
 title: "Pointers Are Complicated, or: What's in a Byte?"
-categories: internship rust
+categories: internship rust programming
 forum: https://internals.rust-lang.org/t/pointers-are-complicated-or-whats-in-a-byte/8045
 ---
 
-This summer, I am again [working on Rust full-time]({{ site.baseurl }}{% post_url 2018-07-11-research-assistant %}), and again I will work (amongst other things) on a "memory model" for Rust/MIR.
+This summer, I am again [working on Rust full-time]({% post_url 2018-07-11-research-assistant %}), and again I will work (amongst other things) on a "memory model" for Rust/MIR.
 However, before I can talk about the ideas I have for this year, I have to finally take the time and dispel the myth that "pointers are simple: they are just integers".
 Both parts of this statement are false, at least in languages with unsafe features like Rust or C: Pointers are neither simple nor (just) integers.
 
@@ -19,7 +19,7 @@ I hope that by the end of this post, you will agree with me on both of these sta
 ## Pointers Are Complicated
 
 What is the problem with "pointers are just integers"?  Let us consider the following example:<br>
-(I am using C++ code here mostly because writing unsafe code is easier in C++, and unsafe code is where these problems really show up.)
+(I am using C++ code here mostly because writing unsafe code is easier in C++ than in Rust, and unsafe code is where these problems really show up. C has all the same issues, as does unsafe Rust.)
 {% highlight c++ %}
 int test() {
     auto x = new int[8];
@@ -32,13 +32,18 @@ int test() {
 }
 {% endhighlight %}
 It would be beneficial to be able to optimize the final read of `y[0]` to just return `42`.
+C++ compilers regularly perform such optimizations as they are crucial for generating high-quality assembly.[^perf]
 The justification for this optimization is that writing to `x_ptr`, which points into `x`, cannot change `y`.
 
+[^perf]: To be fair, the are *claimed* to be crucial for generating high-quality assembly. The claim sounds plausible to me, but unfortunately, I do not know of a systematic study exploring the performance benefits of such optimizations.
+
 However, given how low-level a language C++ is, we can actually break this assumption by setting `i` to `y-x`.
 Since `&x[i]` is the same as `x+i`, this means we are actually writing `23` to `&y[0]`.
 
 Of course, that does not stop C++ compilers from doing these optimizations.
-To allow this, the standard declares our code to have [undefined behavior]({{ site.baseurl }}{% post_url 2017-07-14-undefined-behavior %}).
+To allow this, the standard declares our code to have [undefined behavior]({% post_url 2017-07-14-undefined-behavior %}).[^0]
+
+[^0]: An argument could be made that compilers should just not do such optimizations to make the programming model simpler. This is a discussion worth having, but the point of this post is not to explore this trade-off, it is to explore the consequences of the choices made in C++.
 
 First of all, it is not allowed to perform pointer arithmetic (like `&x[i]` does) that goes [beyond either end of the array it started in](https://timsong-cpp.github.io/cppwp/n4140/expr.add#5).
 Our program violates this rule: `x[i]` is outside of `x`, so this is undefined behavior.
@@ -56,7 +61,7 @@ int test() {
     auto x = new int[8];
     auto y = new int[8];
     y[0] = 42;
-    auto x_ptr = &x[8]; // one past the end
+    auto x_ptr = x+8; // one past the end
     if (x_ptr == &y[0])
       *x_ptr = 23;
     return y[0];
@@ -74,7 +79,7 @@ It does *not* point at an actual element of another object *even if they have th
 
 The key point here is that just because `x_ptr` and `&y[0]` point to the same *address*, that does not make them *the same pointer*, i.e., they cannot be used interchangeably:
 `&y[0]` points to the first element of `y`; `x_ptr` points past the end of `x`.
-If we replace `*x_ptr = 23` by `*&y[0] = 0`, we change the meaning of the program, even though the two pointers have been tested for equality.
+If we replace `*x_ptr = 23` by `*&y[0] = 23`, we change the meaning of the program, even though the two pointers have been tested for equality.
 
 This is worth repeating:
 
@@ -117,10 +122,17 @@ But the actual machine also does not do the kind of optimizations that modern C+
 If we wrote the above programs in assembly, there would be no UB, and no optimizations.
 C++ and Rust employ a more "high-level" view of memory and pointers, restricting the programmer for the benefit of optimizations.
 When formally describing what the programmer may and may not do in these languages, as we have seen, the model of pointers as integers falls apart, so we have to look for something else.
-This is another example of using a "virtual machine" that's different from the real machine for specification purposes, which is an idea [I have blogged about before]({{ site.baseurl }}{% post_url 2017-06-06-MIR-semantics %}).
+This is another example of using a "virtual machine" that's different from the real machine for specification purposes, which is an idea [I have blogged about before]({% post_url 2017-06-06-MIR-semantics %}).
 
-Here's a simple proposal (in fact, this is the model of pointers used in [CompCert](https://hal.inria.fr/hal-00703441/document) and my [RustBelt work]({{ site.baseurl }}{% post_url 2017-07-08-rustbelt %}), and it is also how [miri](https://github.com/solson/miri/) implements pointers):
+Here's a simple proposal (in fact, this is the model of pointers used in [CompCert](https://hal.inria.fr/hal-00703441/document) and my [RustBelt work]({% post_url 2017-07-08-rustbelt %}), and it is also how [miri](https://github.com/solson/miri/) implements [pointers](https://github.com/rust-lang/rust/blob/fefe81605d6111faa8dbb3635ab2c51d59de740a/src/librustc/mir/interpret/mod.rs#L121-L124)):
 A pointer is a pair of some kind of ID uniquely identifying the *allocation*, and an *offset* into the allocation.
+If we defined this in Rust, we might write
+{% highlight rust %}
+struct Pointer {
+    alloc_id: usize,
+    offset: isize,
+}
+{% endhighlight %}
 Adding/subtracting an integer to/from a pointer just acts on the offset, and can thus never leave the allocation.
 Subtracting a pointer from another is only allowed when both point to the same allocation (matching [C++](https://timsong-cpp.github.io/cppwp/n4140/expr.add#6)).[^2]
 
@@ -134,14 +146,15 @@ That's how miri can detect that our second example (with `&x[8]`) is UB.
 
 In this model, pointers are not integers, but they are at least simple.
 However, this simple model starts to fall apart once you consider pointer-integer casts.
-In miri, casting a pointer to an integer does not actually do anything, we now just have an integer variable (i.e., its *type* says it is an integer) whose *value* is a pointer (i.e., an allocation-offset pair).[^3]
+In miri, casting a pointer to an integer does not actually do anything, we now just have an integer variable (i.e., its *type* says it is an integer) whose *value* is a pointer (i.e., an allocation-offset pair).
 However, multiplying that "integer" by 2 leads to an error, because it is entirely unclear what it means to multiply such an abstract pointer by 2.
 
-[^3]: This disconnect between the type and the value may seem somewhat strange, but we are actually not very concerned with *types* at this point.  Types serve to classify values with the goal of establishing certain guarantees about a program, so we can only really start talking about types once we are done defining our set of values and program behaviors.  Still, this means there are safe programs that miri cannot execute, such as `(Box::new(0).into_raw() as usize) * 2`.  To avoid trouble with multiplication, I proposed to only allow "normal" integer values for integer types when doing [compile-time function evaluation]({{ site.baseurl }}{% post_url 2018-07-19-const %}).  However, this makes pointer-integer casts an unsafe operation, because they do not actually produce a "fully operational" integer.
-
-This is the most lazy thing to do, and we do it because it is not clear what else to do -- in our abstract machine, there is no single coherent "address space" that all allocations live in, that we could use to map every pointer to a distinct integer.
-Every allocation is just identified by a (unobservable) ID.
-We could now start to enrich this model with extra data like a base address for each allocation, and somehow use that when casting integer back to pointers... but that's where it gets really complicated, and anyway discussing such a model is not the point of this post.
+I should clarify that this is *not* a good solution when defining language semantics.
+It works fine for an interpreter though.
+It is the most lazy thing to do, and we do it because it is not clear what else to do (other than not supporting these casts at all -- but this way, miri can run more programs):
+In our abstract machine, there just is no single coherent "address space" that all allocations live in, that we could use to map every pointer to a distinct integer.
+Every allocation is just identified by an (unobservable) ID.
+We could now start to enrich this model with extra data like a base address for each allocation, and somehow use that when casting an integer back to a pointer... but that's where it gets really complicated, and anyway discussing such a model is not the point of this post.
 The point it to discuss the *need* for such a model.
 If you are interested, I suggest you read [this paper](http://www.cis.upenn.edu/%7Estevez/papers/KHM+15.pdf) that explores the above idea of adding a base address.
 
@@ -151,15 +164,23 @@ We mostly just ignore the problem in miri and opportunistically do as much as we
 A full definition of a language like C++ or Rust of course cannot take this shortcut, it has to explain what really happens here.
 To my knowledge, no satisfying solution exists, but academic research is [getting closer](http://sf.snu.ac.kr/publications/llvmtwin.pdf).
 
+**Update:** This was by no means meant to be an exhaustive list of academic research on C in general. I do not know of other work that focuses directly on the interplay of integer-pointer casts and optimizations, but other noteworthy work on formalizing C includes [KCC](https://github.com/kframework/c-semantics), [Robbert Krebber's PhD thesis](https://robbertkrebbers.nl/thesis.html) and [Cerberus](https://www.cl.cam.ac.uk/~pes20/cerberus/). **/Update**
+
 This is why pointers are not simple, either.
 
 ## From Pointers to Bytes
 
 I hope I made a convincing argument that integers are not the only data one has to consider when formally specifying low-level languages such as C++ or (the unsafe parts of) Rust.
 However, this means that a simple operation like loading a byte from memory cannot just return a `u8`.
+Imagine we [implement `memcpy`](https://github.com/alexcrichton/rlibc/blob/defb486e765846417a8e73329e8c5196f1dca49a/src/lib.rs#L39) by loading (in turn) every byte of the source into some local variable `v`, and then storing it to the target.
 What if that byte is part of a pointer?  When a pointer is a pair of allocation and offset, what is its first byte?
-We cannot represent this as a `u8`.
-Instead, we will remember both the pointer, and which byte of the pointer we got.
+We have to say what the value of `v` is, so we have to find some way to answer this question.
+(And this is an entirely separate issue from the problem with multiplication that came up in the last section. We just assume some abstract type `Pointer`.)
+
+We cannot represent a byte of a pointer as an element of `0..256`.
+Essentially, if we use a naive model of memory, the extra "hidden" part of a pointer (the one that makes it more than just an integer) would be lost when a pointer is stored to memory and loaded again.
+We have to fix this, so we have to extend our notion of a "byte" to accomodate that extra state.
+So, a byte is now *either* an element of `0..256` ("raw bits"), *or* the n-th byte of some abstract pointer.
 If we were to implement our memory model in Rust, this might look as follows:
 {% highlight rust %}
 enum ByteV1 {
@@ -168,12 +189,12 @@ enum ByteV1 {
 }
 {% endhighlight %}
 For example, a `PtrFragment(ptr, 0)` represents the first byte of `ptr`.
-This way, we can "take apart" a pointer into the individual bytes that represent this pointer in memory, and assemble it back together.
+This way, `memcpy` can "take apart" a pointer into the individual bytes that represent this pointer in memory, and copy them separately.
 On a 32bit architecture, the full value representing `ptr` consists of the following 4 bytes:
 ```
 [PtrFragment(ptr, 0), PtrFragment(ptr, 1), PtrFragment(ptr, 2), PtrFragment(ptr, 3)]
 ```
-Such a representation supports performing all byte-level "data moving" operations on pointers, like implementing `memcpy` by copying one byte at a time.
+Such a representation supports performing all byte-level "data moving" operations on pointers, which is sufficient for `memcpy`.
 Arithmetic or bit-level operations are not fully supported; as already mentioned above, that requires a more sophisticated pointer representation.
 
 ## Uninitialized Memory
@@ -213,17 +234,20 @@ int test() {
 }
 {% endhighlight %}
 With `Uninit`, we can easily argue that `x` is either `Uninit` or `1`, and since replacing `Uninit` by `1` is okay, the optimization is easily justified.
-Without `Uninit`, however, `x` is either "some arbitrary bit pattern" or `1`, and doing the same optimization becomes much harder to justify.[^4]
+Without `Uninit`, however, `x` is either "some arbitrary bit pattern" or `1`, and doing the same optimization becomes much harder to justify.[^3]
 
-[^4]: We could argue that we can reorder when the non-deterministic choice is made, but then we have to prove that the hard to analyze code does not observe `x`.  `Uninit` avoids that unnecessary extra proof burden.
+[^3]: We could argue that we can reorder when the non-deterministic choice is made, but then we have to prove that the hard to analyze code does not observe `x`.  `Uninit` avoids that unnecessary extra proof burden.
 
 Finally, `Uninit` is also a better choice for interpreters like miri.
 Such interpreters have a hard time dealing with operations of the form "just choose any of these values" (i.e., non-deterministic operations), because if they want to fully explore all possible program executions, that means they have to try every possible value.
 Using `Uninit` instead of an arbitrary bit pattern means miri can, in a single execution, reliably tell you if your programs uses uninitialized values incorrectly.
 
+**Update:** Since writing this section, I have written an entire [post dedicated to uninitialized memory and "real hardware"]({% post_url 2019-07-14-uninit %}) with more details, examples and references. **/Update**
+
 ## Conclusion
 
 We have seen that in languages like C++ and Rust (unlike on real hardware), pointers can be different even when they point to the same address, and that a byte is more than just a number in `0..256`.
+This is also why calling C "portable assembly" may have been appropriate in 1978, but is a dangerously misleading statement nowadays.
 With this, I think we are ready to look at a first draft of my "2018 memory model" (working title ;) -- in the next post. :)
 
 Thanks to @rkruppe and @nagisa for help in finding arguments for why `Uninit` is needed.