From 9127f6c25ca8f0fe2974a5f6df6f2945c0f0379f Mon Sep 17 00:00:00 2001 From: Ralf Jung Date: Tue, 24 Jul 2018 16:36:02 +0200 Subject: [PATCH] Pointers and Bytes --- .../_posts/2018-07-24-pointers-and-bytes.md | 206 ++++++++++++++++++ 1 file changed, 206 insertions(+) create mode 100644 personal/_posts/2018-07-24-pointers-and-bytes.md diff --git a/personal/_posts/2018-07-24-pointers-and-bytes.md b/personal/_posts/2018-07-24-pointers-and-bytes.md new file mode 100644 index 0000000..61ecc14 --- /dev/null +++ b/personal/_posts/2018-07-24-pointers-and-bytes.md @@ -0,0 +1,206 @@ +--- +title: "Pointers Are Complicated, or: What's in a Byte?" +categories: internship rust +--- + +This summer, I am again [working on Rust full-time]({{ site.baseurl }}{% post_url 2018-07-11-research-assistant %}), and again I will work (amongst other things) on a "memory model" for Rust/MIR. +However, before I can talk about the ideas I have for this year, I have to finally take the time and dispel the myth that "pointers are simple: they are just integers". +Both parts of this statement are false, at least in languages with unsafe features like Rust or C: Pointers are neither simple nor (just) integers. + +I also want to define a piece of the memory model that has to be fixed before we can even talk about some of the more complex parts: Just what *is* the data that is stored in memory? +It is organized in bytes, the minimal addressable unit and the smallest piece that can be accessed (at least on most platforms), but what are the possible values of a byte? +Again, it turns out "it's just an 8-bit integer" does not actually work as the answer. + +I hope that by the end of this post, you will agree with me on both of these statements. :) + + + +## Pointers Are Complicated + +What is the problem with "pointers are just integers"? Let us consider the following example:
+(I am using C++ code here mostly because writing unsafe code is easier in C++, and unsafe code is where these problems really show up.) +{% highlight c++ %} +int test() { + auto x = new int[8]; + auto y = new int[8]; + y[0] = 42; + int i = /* some side-effect-free computation */; + auto x_ptr = &x[i]; + *x_ptr = 23; + return y[0]; +} +{% endhighlight %} +It would be beneficial to be able to optimize the final read of `y[0]` to just return `42`. +The justification for this optimization is that writing to `x_ptr`, which points into `x`, cannot change `y`. + +However, given how low-level a language C++ is, we can actually break this assumption by setting `i` to `y-x`. +Since `&x[i]` is the same as `x+i`, this means we are actually writing `23` to `&y[0]`. + +Of course, that does not stop C++ compilers from doing these optimizations. +To allow this, the standard declares our code to have [undefined behavior]({{ site.baseurl }}{% post_url 2017-07-14-undefined-behavior %}). + +First of all, it is not allowed to perform pointer arithmetic (like `&x[i]` does) that goes [beyond either end of the array it started in](https://timsong-cpp.github.io/cppwp/n4140/expr.add#5). +Our program violates this rule: `x[i]` is outside of `x`, so this is undefined behavior. +To be clear: Just the *computation* of `x_ptr` is already UB, we don't even get to the part where we want to *use* this pointer![^1] + +[^1]: It turns out that `y-x` is also undefined behavior because [one may only subtract pointers into the same allocation](https://timsong-cpp.github.io/cppwp/n4140/expr.add#6). However, we could use `i = ((size_t)y - (size_t)x)/sizeof(int)` to work around that. + +But we are not done yet: This rule has a special exception that we can exploit to our advantage. +If the arithmetic ends up computing a pointer *just past* the end of an allocation, that computation is fine. +(This exception is necessary to permit computing `vec.end()` for the usual kind of C++98 iterator loop.) + +So let us change this example a bit: +{% highlight c++ %} +int test() { + auto x = new int[8]; + auto y = new int[8]; + y[0] = 42; + auto x_ptr = &x[8]; // one past the end + if (x_ptr == &y[0]) + *x_ptr = 23; + return y[0]; +} +{% endhighlight %} +Now, imagine that `x` and `y` have been allocated *right next to each other*, with `y` having the higher address. +Then `x_ptr` actually points *right at the beginning* of `y`! +The conditional is true and the write happens. +Still, there is no UB due to out-of-bounds pointer arithmetic. + +This seems to break the optimization. +However, the C++ standard has another trick up its sleeve to help compiler writers: It doesn't actually allow us to *use* our `x_ptr` above. +According to what the standard says about [addition on pointers](https://timsong-cpp.github.io/cppwp/n4140/expr.add#5), `x_ptr` points "one past the last element of the array object". +It does *not* point at an actual element of another object *even if they have the same address*. (At least, that is the common interpretation of the standard based on which [LLVM optimizes this code](https://godbolt.org/g/vxmtej).) + +The key point here is that just because `x_ptr` and `&y[0]` point to the same *address*, that does not make them *the same pointer*, i.e., they cannot be used interchangeably: +`&y[0]` points to the first element of `y`; `x_ptr` points past the end of `x`. +If we replace `*x_ptr = 23` by `*&y[0] = 0`, we change the meaning of the program, even though the two pointers have been tested for equality. + +This is worth repeating: + +> *Just because two pointers point to the same address, does not mean they are equal and can be used interchangeably.* + +If this sounds subtle, it is. +In fact, this still causes miscompilations in both [LLVM](https://bugs.llvm.org/show_bug.cgi?id=35229) and [GCC](https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65752). + +Also notice that this one-past-the-end rule is not the only part of C/C++ where this effect can be witnessed. +Another example is the [`restrict`](https://en.wikipedia.org/wiki/Restrict) keyword in C, which can be used to express that pointers do not alias: +{% highlight c %} +int foo(int *restrict x, int *restrict y) { + *x = 42; + if (x == y) { + *y = 23; + } + return *x; +} + +int test() { + int x; + return foo(&x, &x); +} +{% endhighlight %} +Calling `test()` triggers UB because the two accesses in `foo` must not alias. +Replacing `*y` by `*x` in `foo` changes the meaning of the program such that it no longer has UB. +So, again, even though `x` and `y` have the same address, they cannot be used interchangeably. + +Pointers are definitely not integers. + +## A Simple Pointer Model + +So, what *is* a pointer? +I don't know the full answer to this. +In fact, this is an open area of research. + +Here's a simple proposal (in fact, this is the model used in my [RustBelt work]({{ site.baseurl }}{% post_url 2017-07-08-rustbelt %}), and it is also how [miri](https://github.com/solson/miri/) implements pointers): +A pointer is a pair of some kind of ID uniquely identifying the *allocation*, and an *offset* into the allocation. +Adding/subtracting an integer to/from a pointer just acts on the offset, and can thus never leave the allocation. +Subtracting a pointer from another is only allowed when both point to the same allocation (matching [C++](https://timsong-cpp.github.io/cppwp/n4140/expr.add#6)).[^2] + +[^2]: As we have seen, the C++ standard actually applies these rules on the level of arrays, not allocations. However, LLVM applies the rule on a [per-allocation level](https://llvm.org/docs/LangRef.html#getelementptr-instruction). + +It turns out (and miri shows) that this model can get us very far. +We always remember which allocation a pointer points to, so we can differentiate a pointer "one past the end" of one allocation from a pointer to the beginning of another allocation. +That's how miri can detect that our second example (with `&x[8]`) is UB. + +In this model, pointers are not integers, but they are at least simple. +However, this simple model starts to fall apart once you consider pointer-integer casts. +In miri, casting a pointer to an integer does not actually do anything, we now just have an integer variable whose value is a pointer (i.e., an allocation-offset pair). +Multiplying that integer by 2 leads to an error, because it is entirely unclear what it means to multiply such a pair by 2. +A full definition of a language like C++ or Rust of course cannot take this shortcut, it has to explain what really happens here. +To my knowledge, no satisfying solution exists, but we are [getting](http://www.cis.upenn.edu/%7Estevez/papers/KHM+15.pdf) [closer](http://sf.snu.ac.kr/publications/llvmtwin.pdf). +This is why pointers are not simple, either. + +## From Pointers to Bytes + +I hope I made a convincing argument that integers are not the only data one has to consider when formally specifying low-level languages such as C++ or (the unsafe parts of) Rust. +However, this means that a simple operation like loading a byte from memory cannot just return a `u8`. +What if that byte is part of a pointer? When a pointer is a pair of allocation and offset, what is its first byte? +We cannot represent this as a `u8`. +Instead, we will remember both the pointer, and which byte of the pointer we got. +If we were to implement our memory model in Rust, this might look as follows: +{% highlight rust %} +enum ByteV1 { + Bits(u8), + PtrFragment(Pointer, u8), +} +{% endhighlight %} +For example, a `PtrFragment(ptr, 0)` represents the first byte of `ptr`. +This way, we can "take apart" a pointer into the individual bytes that represent this pointer in memory, and assemble it back together. +On a 32bit architecture, the full value representing `ptr` consists of the following 4 bytes: +``` +[PtrFragment(ptr, 0), PtrFragment(ptr, 1), PtrFragment(ptr, 2), PtrFragment(ptr, 3)] +``` +Such a representation supports performing all byte-level "data moving" operations on pointers, like implementing `memcpy` by copying one byte at a time. +Arithmetic or bit-level operations are not fully supported; as already mentioned above, that requires a more sophisticated pointer representation. + +## Uninitialized Memory + +However, we are not done yet with our definition of a "byte". +To fully describe program behavior, we need to take one more possibility into account: A byte in memory could be *uninitialized*. +The final definition for a byte (assuming we have a type `Pointer` for pointers) thus looks as follows: +{% highlight rust %} +enum Byte { + Bits(u8), + PtrFragment(Pointer, u8), + Uninit, +} +{% endhighlight %} + +`Uninit` is the value we use for all bytes that have been allocated, but not yet written to. +Reading uninitialized memory is fine, but actually *doing* anything with those bytes (like, using them in integer arithmetic) is undefined behavior. + +This is very similar to the rules LLVM has for its special value called `poison`. +Note that LLVM *also* has a value called `undef`, which it uses for uninitialized memory and which works somewhat differently -- however, compiling our `Uninit` to `undef` is actually correct (`undef` is in some sense "weaker"), and there are proposals to [remove `undef` from LLVM and use `poison` instead](http://www.cs.utah.edu/~regehr/papers/undef-pldi17.pdf). + +You may wonder why we have a special `Uninit` value at all. +Couldn't we just pick some arbitrary `b: u8` for each newly allocated byte, and then use `Bits(b)` as the initial value? +That would indeed also be an option. +However, first of all, pretty much all compilers have converged to having a sentinel value for uninitialized memory. +Not doing that would not only pose trouble when compiling through LLVM, it would also require reevaluating many optimizations to make sure they work correctly with this changed model. +The key point is that it is always safe, during compilation, to replace `Uninit` by any value: Any operation that actually observes this value is UB anyway. + +For example, the following C code becomes easier to optimize with `Uninit`: +{% highlight c %} +int test() { + int x; + if (condA()) x = 1; + // Lots of hard to analyze code that will definitely return when condA() + // does NOT hold, but will not change x. + use(x); // want to optimize x to 1. +} +{% endhighlight %} +With `Uninit`, we can easily argue that `x` is either `Uninit` or `1`, and since replacing `Uninit` by `1` is okay, the optimization is easily justified. +Without `Uninit`, however, `x` is either "some arbitrary bit pattern" or `1`, and doing the same optimization becomes much harder to justify.[^3] + +[^3]: We could argue that we can reorder when the non-deterministic choice is made, but then we have to prove that the hard to analyze code does not observe `x`. `Uninit` avoids that unnecessary extra proof burden. + +Finally, `Uninit` is also a better choice for interpreters like miri. +Such interpreters have a hard time dealing with operations of the form "just choose any of these values" (i.e., non-deterministic operations), because if they want to fully explore all possible program executions, that means they have to try every possible value. +Using `Uninit` instead of an arbitrary bit pattern means miri can, in a single execution, reliably tell you if your programs uses uninitialized values incorrectly. + +## Conclusion + +We have seen that pointers can be different even when they point to the same address, and that a byte is more than just a number in `0..256`. +With this, I think we are ready to look at a first draft of my "2018 memory model" (working title ;) -- in the next post. :) + + +#### Footnotes -- 2.30.2