X-Git-Url: https://git.ralfj.de/web.git/blobdiff_plain/70c890b37d1ce56cae160189c065aefb1de56704..refs/heads/master:/ralf/_posts/2018-07-24-pointers-and-bytes.md diff --git a/ralf/_posts/2018-07-24-pointers-and-bytes.md b/ralf/_posts/2018-07-24-pointers-and-bytes.md deleted file mode 100644 index 61ecc14..0000000 --- a/ralf/_posts/2018-07-24-pointers-and-bytes.md +++ /dev/null @@ -1,206 +0,0 @@ ---- -title: "Pointers Are Complicated, or: What's in a Byte?" -categories: internship rust ---- - -This summer, I am again [working on Rust full-time]({{ site.baseurl }}{% post_url 2018-07-11-research-assistant %}), and again I will work (amongst other things) on a "memory model" for Rust/MIR. -However, before I can talk about the ideas I have for this year, I have to finally take the time and dispel the myth that "pointers are simple: they are just integers". -Both parts of this statement are false, at least in languages with unsafe features like Rust or C: Pointers are neither simple nor (just) integers. - -I also want to define a piece of the memory model that has to be fixed before we can even talk about some of the more complex parts: Just what *is* the data that is stored in memory? -It is organized in bytes, the minimal addressable unit and the smallest piece that can be accessed (at least on most platforms), but what are the possible values of a byte? -Again, it turns out "it's just an 8-bit integer" does not actually work as the answer. - -I hope that by the end of this post, you will agree with me on both of these statements. :) - - - -## Pointers Are Complicated - -What is the problem with "pointers are just integers"? Let us consider the following example:
-(I am using C++ code here mostly because writing unsafe code is easier in C++, and unsafe code is where these problems really show up.) -{% highlight c++ %} -int test() { - auto x = new int[8]; - auto y = new int[8]; - y[0] = 42; - int i = /* some side-effect-free computation */; - auto x_ptr = &x[i]; - *x_ptr = 23; - return y[0]; -} -{% endhighlight %} -It would be beneficial to be able to optimize the final read of `y[0]` to just return `42`. -The justification for this optimization is that writing to `x_ptr`, which points into `x`, cannot change `y`. - -However, given how low-level a language C++ is, we can actually break this assumption by setting `i` to `y-x`. -Since `&x[i]` is the same as `x+i`, this means we are actually writing `23` to `&y[0]`. - -Of course, that does not stop C++ compilers from doing these optimizations. -To allow this, the standard declares our code to have [undefined behavior]({{ site.baseurl }}{% post_url 2017-07-14-undefined-behavior %}). - -First of all, it is not allowed to perform pointer arithmetic (like `&x[i]` does) that goes [beyond either end of the array it started in](https://timsong-cpp.github.io/cppwp/n4140/expr.add#5). -Our program violates this rule: `x[i]` is outside of `x`, so this is undefined behavior. -To be clear: Just the *computation* of `x_ptr` is already UB, we don't even get to the part where we want to *use* this pointer![^1] - -[^1]: It turns out that `y-x` is also undefined behavior because [one may only subtract pointers into the same allocation](https://timsong-cpp.github.io/cppwp/n4140/expr.add#6). However, we could use `i = ((size_t)y - (size_t)x)/sizeof(int)` to work around that. - -But we are not done yet: This rule has a special exception that we can exploit to our advantage. -If the arithmetic ends up computing a pointer *just past* the end of an allocation, that computation is fine. -(This exception is necessary to permit computing `vec.end()` for the usual kind of C++98 iterator loop.) - -So let us change this example a bit: -{% highlight c++ %} -int test() { - auto x = new int[8]; - auto y = new int[8]; - y[0] = 42; - auto x_ptr = &x[8]; // one past the end - if (x_ptr == &y[0]) - *x_ptr = 23; - return y[0]; -} -{% endhighlight %} -Now, imagine that `x` and `y` have been allocated *right next to each other*, with `y` having the higher address. -Then `x_ptr` actually points *right at the beginning* of `y`! -The conditional is true and the write happens. -Still, there is no UB due to out-of-bounds pointer arithmetic. - -This seems to break the optimization. -However, the C++ standard has another trick up its sleeve to help compiler writers: It doesn't actually allow us to *use* our `x_ptr` above. -According to what the standard says about [addition on pointers](https://timsong-cpp.github.io/cppwp/n4140/expr.add#5), `x_ptr` points "one past the last element of the array object". -It does *not* point at an actual element of another object *even if they have the same address*. (At least, that is the common interpretation of the standard based on which [LLVM optimizes this code](https://godbolt.org/g/vxmtej).) - -The key point here is that just because `x_ptr` and `&y[0]` point to the same *address*, that does not make them *the same pointer*, i.e., they cannot be used interchangeably: -`&y[0]` points to the first element of `y`; `x_ptr` points past the end of `x`. -If we replace `*x_ptr = 23` by `*&y[0] = 0`, we change the meaning of the program, even though the two pointers have been tested for equality. - -This is worth repeating: - -> *Just because two pointers point to the same address, does not mean they are equal and can be used interchangeably.* - -If this sounds subtle, it is. -In fact, this still causes miscompilations in both [LLVM](https://bugs.llvm.org/show_bug.cgi?id=35229) and [GCC](https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65752). - -Also notice that this one-past-the-end rule is not the only part of C/C++ where this effect can be witnessed. -Another example is the [`restrict`](https://en.wikipedia.org/wiki/Restrict) keyword in C, which can be used to express that pointers do not alias: -{% highlight c %} -int foo(int *restrict x, int *restrict y) { - *x = 42; - if (x == y) { - *y = 23; - } - return *x; -} - -int test() { - int x; - return foo(&x, &x); -} -{% endhighlight %} -Calling `test()` triggers UB because the two accesses in `foo` must not alias. -Replacing `*y` by `*x` in `foo` changes the meaning of the program such that it no longer has UB. -So, again, even though `x` and `y` have the same address, they cannot be used interchangeably. - -Pointers are definitely not integers. - -## A Simple Pointer Model - -So, what *is* a pointer? -I don't know the full answer to this. -In fact, this is an open area of research. - -Here's a simple proposal (in fact, this is the model used in my [RustBelt work]({{ site.baseurl }}{% post_url 2017-07-08-rustbelt %}), and it is also how [miri](https://github.com/solson/miri/) implements pointers): -A pointer is a pair of some kind of ID uniquely identifying the *allocation*, and an *offset* into the allocation. -Adding/subtracting an integer to/from a pointer just acts on the offset, and can thus never leave the allocation. -Subtracting a pointer from another is only allowed when both point to the same allocation (matching [C++](https://timsong-cpp.github.io/cppwp/n4140/expr.add#6)).[^2] - -[^2]: As we have seen, the C++ standard actually applies these rules on the level of arrays, not allocations. However, LLVM applies the rule on a [per-allocation level](https://llvm.org/docs/LangRef.html#getelementptr-instruction). - -It turns out (and miri shows) that this model can get us very far. -We always remember which allocation a pointer points to, so we can differentiate a pointer "one past the end" of one allocation from a pointer to the beginning of another allocation. -That's how miri can detect that our second example (with `&x[8]`) is UB. - -In this model, pointers are not integers, but they are at least simple. -However, this simple model starts to fall apart once you consider pointer-integer casts. -In miri, casting a pointer to an integer does not actually do anything, we now just have an integer variable whose value is a pointer (i.e., an allocation-offset pair). -Multiplying that integer by 2 leads to an error, because it is entirely unclear what it means to multiply such a pair by 2. -A full definition of a language like C++ or Rust of course cannot take this shortcut, it has to explain what really happens here. -To my knowledge, no satisfying solution exists, but we are [getting](http://www.cis.upenn.edu/%7Estevez/papers/KHM+15.pdf) [closer](http://sf.snu.ac.kr/publications/llvmtwin.pdf). -This is why pointers are not simple, either. - -## From Pointers to Bytes - -I hope I made a convincing argument that integers are not the only data one has to consider when formally specifying low-level languages such as C++ or (the unsafe parts of) Rust. -However, this means that a simple operation like loading a byte from memory cannot just return a `u8`. -What if that byte is part of a pointer? When a pointer is a pair of allocation and offset, what is its first byte? -We cannot represent this as a `u8`. -Instead, we will remember both the pointer, and which byte of the pointer we got. -If we were to implement our memory model in Rust, this might look as follows: -{% highlight rust %} -enum ByteV1 { - Bits(u8), - PtrFragment(Pointer, u8), -} -{% endhighlight %} -For example, a `PtrFragment(ptr, 0)` represents the first byte of `ptr`. -This way, we can "take apart" a pointer into the individual bytes that represent this pointer in memory, and assemble it back together. -On a 32bit architecture, the full value representing `ptr` consists of the following 4 bytes: -``` -[PtrFragment(ptr, 0), PtrFragment(ptr, 1), PtrFragment(ptr, 2), PtrFragment(ptr, 3)] -``` -Such a representation supports performing all byte-level "data moving" operations on pointers, like implementing `memcpy` by copying one byte at a time. -Arithmetic or bit-level operations are not fully supported; as already mentioned above, that requires a more sophisticated pointer representation. - -## Uninitialized Memory - -However, we are not done yet with our definition of a "byte". -To fully describe program behavior, we need to take one more possibility into account: A byte in memory could be *uninitialized*. -The final definition for a byte (assuming we have a type `Pointer` for pointers) thus looks as follows: -{% highlight rust %} -enum Byte { - Bits(u8), - PtrFragment(Pointer, u8), - Uninit, -} -{% endhighlight %} - -`Uninit` is the value we use for all bytes that have been allocated, but not yet written to. -Reading uninitialized memory is fine, but actually *doing* anything with those bytes (like, using them in integer arithmetic) is undefined behavior. - -This is very similar to the rules LLVM has for its special value called `poison`. -Note that LLVM *also* has a value called `undef`, which it uses for uninitialized memory and which works somewhat differently -- however, compiling our `Uninit` to `undef` is actually correct (`undef` is in some sense "weaker"), and there are proposals to [remove `undef` from LLVM and use `poison` instead](http://www.cs.utah.edu/~regehr/papers/undef-pldi17.pdf). - -You may wonder why we have a special `Uninit` value at all. -Couldn't we just pick some arbitrary `b: u8` for each newly allocated byte, and then use `Bits(b)` as the initial value? -That would indeed also be an option. -However, first of all, pretty much all compilers have converged to having a sentinel value for uninitialized memory. -Not doing that would not only pose trouble when compiling through LLVM, it would also require reevaluating many optimizations to make sure they work correctly with this changed model. -The key point is that it is always safe, during compilation, to replace `Uninit` by any value: Any operation that actually observes this value is UB anyway. - -For example, the following C code becomes easier to optimize with `Uninit`: -{% highlight c %} -int test() { - int x; - if (condA()) x = 1; - // Lots of hard to analyze code that will definitely return when condA() - // does NOT hold, but will not change x. - use(x); // want to optimize x to 1. -} -{% endhighlight %} -With `Uninit`, we can easily argue that `x` is either `Uninit` or `1`, and since replacing `Uninit` by `1` is okay, the optimization is easily justified. -Without `Uninit`, however, `x` is either "some arbitrary bit pattern" or `1`, and doing the same optimization becomes much harder to justify.[^3] - -[^3]: We could argue that we can reorder when the non-deterministic choice is made, but then we have to prove that the hard to analyze code does not observe `x`. `Uninit` avoids that unnecessary extra proof burden. - -Finally, `Uninit` is also a better choice for interpreters like miri. -Such interpreters have a hard time dealing with operations of the form "just choose any of these values" (i.e., non-deterministic operations), because if they want to fully explore all possible program executions, that means they have to try every possible value. -Using `Uninit` instead of an arbitrary bit pattern means miri can, in a single execution, reliably tell you if your programs uses uninitialized values incorrectly. - -## Conclusion - -We have seen that pointers can be different even when they point to the same address, and that a byte is more than just a number in `0..256`. -With this, I think we are ready to look at a first draft of my "2018 memory model" (working title ;) -- in the next post. :) - - -#### Footnotes