ralf/_posts/2020-12-14-provenance.md

   1 ---
   2 title: "Pointers Are Complicated II, or: We need better language specs"
   3 categories: rust research
   4 forum: https://internals.rust-lang.org/t/pointers-are-complicated-ii-or-we-need-better-language-specs/13562
   5 license: CC BY-SA 4.0
   6 license-url: https://creativecommons.org/licenses/by-sa/4.0/
   7 ---
   8
   9 Some time ago, I wrote a blog post about how [there's more to a pointer than meets the eye]({% post_url 2018-07-24-pointers-and-bytes %}).
  10 One key point I was trying to make is that
  11
  12 > *just because two pointers point to the same address, does not mean they are equal in the sense that they can be used interchangeably.*
  13
  14 This "extra information" that distinguishes different pointers to the same address is typically called [*provenance*](https://rust-lang.github.io/unsafe-code-guidelines/glossary.html#pointer-provenance).
  15 This post is another attempt to convince you that provenance is "real", by telling a cautionary tale of what can go wrong when provenance is not considered sufficiently carefully in an optimizing compiler.
  16 The post is self-contained; I am not assuming that you have read the first one.
  17 There is also a larger message here about how we could prevent such issues from coming up in the future by spending more effort on the specification of compiler IRs.
  18
  19 <!-- MORE -->
  20
  21 Below, I will show a series of three compiler transformations that each seem "intuitively justified", but when taken together they lead to a clearly incorrect result.
  22 I will use LLVM for these examples, but the goal is not to pick on LLVM---other compilers suffer from similar issues.
  23 The goal is to convince you that to build a correct compiler for languages permitting unsafe pointer manipulation such as C, C++, or Rust,
  24 we need to take IR semantics (and specifically provenance) more seriously.
  25 I use LLVM for the examples because it is particularly easy to study with its single, extensively-documented IR that a lot of infrastructure evolved around.
  26 Let's get started!
  27
  28 ## Warm-up: Why IRs need a precise semantics
  29
  30 As a warm-up, I will give a simple example showing that compiler IRs such as LLVM IR need a precise (and precisely documented) semantics.
  31 If you are already familiar with the idea of treating compiler IRs as proper programming languages in their own right, or if you are just here for the pointers and their provenance, you can skip to the next section.
  32
  33 Consider the following simple (and contrived, for the sake of this example) piece of C code computing `n * (i+j)`:
  34 {% highlight c %}
  35 int sum_up(int i, int j, unsigned int n) {
  36   int result = 0;
  37   while (n > 0) {
  38     result += i + j;
  39     n -= 1;
  40   }
  41   return result;
  42 }
  43 {% endhighlight %}
  44 One transformation the compiler might want to do is to move the addition `i+j` out of the loop, to avoid computing the sum each time around the loop (this is called "loop-invariant code motion"[^loop]):
  45 {% highlight c %}
  46 int sum_up(int i, int j, unsigned int n) { // optimized version
  47   int result = 0;
  48   int s = i + j;
  49   while (n > 0) {
  50     result += s;
  51     n -= 1;
  52   }
  53   return result;
  54 }
  55 {% endhighlight %}
  56 However, that transformation is actually incorrect.
  57 If we imagine a caller using this function as `sum_up(INT_MAX, 1, 0)`, then this is a perfectly correct way to call `sum_up`: the loop is never entered, so the overflowing addition `INT_MAX+1` is never performed.
  58 However, after the desired optimization, the program now causes a signed integer overflow, which is UB (Undefined Behavior) and thus May Never Happen!
  59
  60 [^loop]: If you are a regular reader of my blog, you will recognize this as the same optimization that already played a crucial role in [a previous post of mine]({% post_url 2020-07-15-unused-data %}). Loop-invariant code motion is a great optimization to look at when considering corner cases of IR semantics.
  61
  62 One might be tempted to ignore this problem because the UB on integer overflow is a compiler-only concept; every target supported by the compiler will do the obvious thing and just produce an overflowing result.
  63 However, there might be other compiler passes running after the optimization we are considering.
  64 One such pass might inline `sum_up`, and another pass might notice the `INT_MAX+1` and replace it by `unreachable` since UB code is "by definition" unreachable, and another pass might then just remove all our code since it is unreachable.
  65 Each of these passes has a good reason to exist (it can help real code become a lot faster or help prune dead code), but if we combine them all with loop-invariant code motion, the result is a disaster.
  66
  67 One way (and the only systematic way I know) to avoid such problems is to make sure that we can justify the correctness of each optimization *in isolation*.
  68 Each optimization must be *correct* for any possible program, where *correct* means that the optimized program must only "do things" that the original program could have done as well.
  69 (This is basically the "as-if" rule in the C standard, and is typically called "refinement" in the academic literature.)
  70 In particular, no optimization must ever introduce UB into a UB-free program.
  71
  72 It may seem now that under this premise, it is impossible to perform the loop-invariant code motion we are considering.
  73 But that is not the case!
  74 So far, what we have seen is that the optimization is not *correct* when being performed on a C program.
  75 But when LLVM performs these optimizations, it does not consider the program to be written in C---it considers the program to be written in LLVM IR, which has a different semantics than C.
  76 Specifically, the [LLVM LangRef](https://llvm.org/docs/LangRef.html) says that signed integer overflow in LLVM IR yields a `poison` value.
  77 It is not UB to produce `poison`, it is just UB to use `poison` in certain ways (the details of this do not matter here).
  78 In a call to the optimized `sum_up(INT_MAX, 1, 0)`, the `s` variable introduced by loop-invariant code motion is unused, so the fact that its value is `poison` does not matter!
  79
  80 Due to this behavior of signed integer overflow, this case of loop-invariant code motion is *correct* if we consider it as an optimization on programs that are written in LLVM IR.[^cheat]
  81 The "price" we pay for this is that replacing `INT_MAX+1` by `unreachable` is not *correct* in LLVM IR, since it is not UB.
  82
  83 The great thing about *correct* optimizations is that we can combine any number of them in any order (such as inlining, replacing definite UB by `unreachable`, and removing unreachable code), and we can be sure that the result obtained after all these optimizations is a correct compilation of our original program.
  84 (In the academic lingo, we would say that "refinement is transitive".)
  85
  86 However, to make the argument that an optimization is *correct*, the exact semantics of LLVM IR (what the behavior of all possible programs is and when they have UB) needs to be documented.
  87 All involved optimizations need to exactly agree on what is and is not UB, to ensure that whatever code they produce will not be considered UB by a later optimization.
  88 This is exactly what we also expect from the specification of a programming language such as C, which is why I think we should consider compiler IRs as proper programming languages in their own right, and specify them with the same diligence as we would specify "normal" languages.[^ub-difference]
  89 Sure, no human is going to write many programs in LLVM IR, so their syntax barely matters, but clang and rustc produce LLVM IR programs all the time, and as we have seen understanding the exact rules governing the behavior of programs is crucial to ensuring that the optimizations LLVM performs do not change program behavior.
  90
  91 [^cheat]: If now you feel like we somehow cheated, since we can always translate the program from C to LLVM IR, optimize there, and translate back, consider this: translating from LLVM IR to C is really hard! In particular, signed integer addition in LLVM IR can *not* be translated into signed integer addition in C, since the former is well-defined with `poison` result in case of overflow, but the latter says overflow is UB. C has strictly more UB than LLVM IR (for integer arithmetic), which makes translation in one direction easy, while the other direction is hard.
  92
  93 [^ub-difference]: In particular, two different variants of the IR with different rules for UB are really *two different programming languages*. A program that is well-defined in one language may have UB in another, so great care needs to be taken when the program is moved from being governed by one set of rules to another.
  94
  95 *Take-away:* If we want to be able to justify the correctness of a compiler in a modular way, considering only one optimization at a time, we need to perform these optimizations in an IR that has a precise specification of all aspects of program behavior, including UB.
  96 Then we can, for each optimization separately, consider the question: does the optimization ever change program behavior, and does it ever introduce UB into UB-free programs?
  97 For a *correct* optimization, the answer to these questions is "no".
  98
  99 ## How 3 (seemingly) correct optimizations can be incorrect when used together
 100
 101 With the warm-up done, we are now ready to consider some more tricky optimizations, which we will use to explore the question of how precise a language specification needs to be.
 102 We will look at three different optimizations LLVM can perform, and I will show that they *cannot all be correct* since the first and last program we are considering actually have *different behavior*.
 103 (More precisely: the last program has a possible behavior that was not possible for the first program.)
 104 This is only possible if at least one optimization changed program behavior in an incorrect way, but it is actually not entirely clear which optimization is the culprit.
 105
 106 The sequence of examples is taken from [this talk](https://sf.snu.ac.kr/llvmtwin/files/presentation.pdf#page=32) by Chung-Kil Hur; it was discovered while working on a mathematically rigorous specification of LLVM.
 107
 108 Here is the source program:
 109 {% highlight c %}
 110 char p[1], q[1] = {0};
 111 uintptr_t ip = (uintptr_t)(p+1);
 112 uintptr_t iq = (uintptr_t)q;
 113 if (iq == ip) {
 114   *(char*)iq = 10;
 115   print(q[0]);
 116 }
 117 {% endhighlight %}
 118 I am using C syntax here just as a convenient way to write programs in LLVM IR.
 119
 120 This program has two possible behaviors: either `ip` (the address one-past-the-end of `p`) and `iq` (the address of `q`) are different, and nothing is printed.
 121 Or the two are equal, in which case the program will print "10" (`iq` is the result of casting `q` to an integer, so casting it back will yield the original pointer, or at least a pointer pointing to the same object / location in memory).
 122
 123 The first "optimization" we will perform is to exploit that if we enter the `if` body, we have `iq == ip`, so we can replace all `iq` by `ip`.
 124 Subsequently the definition of `ip` is inlined:
 125 {% highlight c %}
 126 char p[1], q[1] = {0};
 127 uintptr_t ip = (uintptr_t)(p+1);
 128 uintptr_t iq = (uintptr_t)q;
 129 if (iq == ip) {
 130   *(char*)(uintptr_t)(p+1) = 10; // <- This line changed
 131   print(q[0]);
 132 }
 133 {% endhighlight %}
 134
 135 The second optimization notices that we are taking a pointer `p+1`, casting it to an integer, and casting it back, so we can remove the cast roundtrip:
 136 {% highlight c %}
 137 char p[1], q[1] = {0};
 138 uintptr_t ip = (uintptr_t)(p+1);
 139 uintptr_t iq = (uintptr_t)q;
 140 if (iq == ip) {
 141   *(p+1) = 10; // <- This line changed
 142   print(q[0]);
 143 }
 144 {% endhighlight %}
 145
 146 The final optimization notices that `q` is never written to, so we can replace `q[0]` by its initial value `0`:
 147 {% highlight c %}
 148 char p[1], q[1] = {0};
 149 uintptr_t ip = (uintptr_t)(p+1);
 150 uintptr_t iq = (uintptr_t)q;
 151 if (iq == ip) {
 152   *(p+1) = 10;
 153   print(0); // <- This line changed
 154 }
 155 {% endhighlight %}
 156
 157 However, this final program is different from the first one!
 158 Specifically, the final program will either print nothing or print "0", while the original program *could never print "0"*.
 159 This shows that the sequence of three optimizations we performed, as a whole, is *not correct*.
 160
 161 #### What went wrong?
 162
 163 Clearly, one of the three optimizations is incorrect in the sense that it introduced a change in program behavior.
 164 But which one is it?
 165
 166 In an ideal world, we would have a sufficiently precise semantics for LLVM IR that we would just have to read the docs (or, even better, run some Miri-like tool) to figure out the answer.
 167 However, describing language semantics at this level of precision is *hard*, and full of trade-offs.
 168 The LLVM LangRef will not give us a clear answer here, and indeed obtaining a clear answer requires some decisions that have not been explicitly made yet.
 169
 170 To proceed, we will use the three optimizations that we considered above as cues: assuming that the optimization is correct for LLVM IR, what does that tell us about the semantics?
 171
 172 Let us start with the last optimization, where the `print` argument is changed from `q[0]` to `0`.
 173 This optimization is based on alias analysis:
 174 `q[0]` gets initialized to `0` at the beginning of the program, and the only write between that initialization and the `print` is to the pointer `p+1`.
 175 Since `q` and `p` point to different local variables, a pointer derived from `p` cannot alias `q[0]`, and hence we know that this write cannot affect the value stored at `q[0]`.
 176
 177 Looking more closely, however, reveals that things are not quite so simple!
 178 `p+1` is a one-past-the-end pointer, so it actually *can* have the same address as `q[0]`
 179 (and, in fact, inside the conditional we know this to be the case).
 180 However, LLVM IR (just like C) does not permit memory accesses through one-past-the-end pointers.
 181 It makes a difference whether we use `p+1` or `q` inside the `if`, even though we know (in that branch) that both pointers point to the same memory location.
 182 This demonstrates that in LLVM IR, there is more to a pointer than just the address it points to---it also matters how this address was computed.
 183 This extra information is typically called *provenance*.
 184 It is impossible to argue for the *correct*ness of the third optimization without acknowledging that provenance is a real part of the semantics of an LLVM IR program.
 185 In a flat memory model where pointers are just integers (such as most assembly languages), this optimization is simply wrong.
 186
 187 Now that we know that provenance exists in pointers, we have to also consider what happens to provenance when a pointer gets cast to an integer and back.
 188 The second optimization gives us a clue into this aspect of LLVM IR semantics: casting a pointer to an integer and back is optimized away, which means that *integers have provenance*.
 189 To see why, consider the two expressions `(char*)(uintptr_t)(p+1)` and `(char*)(uintptr_t)q`:
 190 if the optimization of removing pointer-integer-pointer roundtrips is correct, the first operation will output `p+1` and the second will output `q`, which we just established are two different pointers (they differ in their provenance).
 191 The only way to explain this is to say that the input to the `(char*)` cast is different, since the program state is otherwise identical in both cases.
 192 But we know that the integer values computed by `(uintptr_t)(p+1)` and `(uintptr_t)q` (i.e., the bit pattern as stored in some CPU register) are the same, and hence a difference can only arise if these integers consist of more than just this bit pattern---just like pointers, integers have provenance.
 193
 194 Finally, let us consider the first optimization.
 195 Here, a successful equality test `iq == ip` prompts the optimizer to replace one value by the other.
 196 This optimization demonstrates that *integers do not have provenance*:
 197 the optimization is only correct if a successful run-time equality test implies that the two values are equivalent in the "Abstract Machine" that is used to describe the language semantics.
 198 But this means that the Abstract Machine version of this value cannot have any "funny" extra parts that are not represented at run-time.
 199 Of course, provenance is exactly such a "funny" extra part.
 200 A different way to phrase the same argument is to say that this optimization is correct only if `iq == ip` evaluating to `true` implies that both values have the same "Abstract Machine" representation, so if that representation involves provenance, both values must have the same provenance.
 201 This would be a possible definition of `==` in LLVM IR, but only in principle---in practice this means the LLVM backends have to compile `==` in a way that pointer provenance is taken into account, which of course is impossible.
 202
 203 *Take-away:*
 204 By considering each of these three optimizations in terms of what they tell us about the semantics of LLVM IR, we learned that pointers have provenance, that integers remember the provenance of the pointer they come from in case of a pointer-to-integer cast, and that integers do not have provenance.
 205 This is a contradiction, and this contradiction explains why we saw incorrect compilation results when applying all three optimizations to the same program.
 206
 207 #### How can we fix this?
 208
 209 To fix the problem, we will have to declare one of the three optimizations incorrect and stop performing it.
 210 Speaking in terms of the LLVM IR semantics, this corresponds to deciding whether pointers and/or integers have provenance:
 211 * We could say both pointers and integers have provenance, which invalidates the first optimization.
 212 * We could say pointers have provenance but integers do not, which invalidates the second optimization.
 213 * We could say nothing has provenance, which invalidates the third optimization.
 214
 215 In my opinion, the first and last options are not tenable.
 216 Removing provenance altogether kills all but the most simple alias analyses.[^alias]
 217 On the other hand, declaring that integers have provenance does not just disable the first optimization in the chain shown above, it also disables common arithmetic optimizations such as `x - x` being equivalent to `0`.
 218 Even achieving commutativity and associativity of `+` becomes non-trivial once integers have provenance.
 219
 220 [^alias]: Sadly, I do not know of a systematic study of the performance impact of going with the third option: which part of the alias analysis is still correct, and how much slower is a compiler that only performs such a restricted form of alias analysis? It is my understanding that many compiler developers "obviously" consider this way too costly of a fix to the problem, but it would still be great to underpin this with some proper data. In any case, the third option would entail a complete re-design of LLVM's `noalias`, which would be bad news both for Rust (which uses `noalias` to encode alias information embedded in Rust's reference types) and clang (which uses `noalias` to encode C's `restrict`).
 221
 222 So, I think that the issue should be resolved by saying that pointers have provenance but integers do not, which means that it is the second optimization that is wrong.
 223 This also corresponds to [what has been recently proposed to the C standard committee](http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2577.pdf).
 224 That's why [LLVM bug #34548](https://bugs.llvm.org/show_bug.cgi?id=34548) says that optimizing away pointer-integer-pointer roundtrips is incorrect, and LLVM should stop doing this in the general case.
 225 There might still be special cases where this can be done, but figuring out the limits of this requires a more precise description of LLVM IR semantics such as what we proposed [in this paper](https://people.mpi-sws.org/~jung/twinsem/twinsem.pdf).
 226
 227 But ultimately, it will be up to the LLVM community to make this decision.
 228 All I can say for sure is that they have to make a choice, because the status quo of doing all three of these optimizations leads to incorrect compilation results.
 229
 230 ## Conclusion
 231
 232 What did we learn?
 233 First of all, pointers are complicated.
 234 Precisely describing their semantics in a way that is consistent with common alias analyses requires adding a notion of "provenance".
 235 In a language such as Java or ML where pointers are opaque types whose representation cannot be observed, this is actually fairly easy to do.
 236 But in a language such as Rust, C, or C++ that supports pointer-integer casts, the introduction of provenance poses some really tricky questions, and at least one of the commonly performed optimizations in this space has to give.
 237
 238 We also learned that LLVM has a bug, but that was *not* the point of this blog post.
 239 The GCC developers [made exactly the same mistake](https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82282), and I got word that MSVC and ICC have the same issue (though I do not know how to verify this).
 240 And I cannot blame them; the way compiler development typically works, I think bugs like this are inevitable: when exactly UB arises in an IR is often only loosely specified, in many cases "by omission" (where cases not covered in the spec are implicitly UB), so evaluating whether some optimization is *correct* in the sense defined above can be very tricky or even impossible.
 241 Pointer provenance is just a particularly good (and subtle) example.
 242 For another example, see [§2.3 of this paper](https://plv.mpi-sws.org/validc/paper.pdf) (Figure 3 contains the code) which shows how a sequence of two optimizations can lead to a miscompilation, where the first optimization is *correct* under the LLVM concurrency model, and the second optimization is *correct* under the C++11 concurrency model---but there is no concurrency model under which *both* optimizations are correct, so each compiler (or rather, each compiler IR) needs to pick one or the other.
 243 Finally, this [paper on `undef` and `poison`](https://www.cs.utah.edu/~regehr/papers/undef-pldi17.pdf) gives examples for optimizations that are broken by the presence of `undef` in LLVM, and describes some of the trade-offs that arise when defining the semantics of `poison`.
 244 Again miscompilations arise because the consequence of a statement in one place of the specification (`undef` picks a fresh value at each use) are not considered elsewhere (testing an integer for equality with zero dos not imply it is zero; it could also be `undef`).
 245
 246 Which brings me to my main conclusion for this post: to avoid the problem of incompatible optimizations, I think we need to take compiler IRs more serious as programming languages in their own right, and give them a precise specification---including all the UB.
 247 Now, you may object by saying that LLVM has an [extensive LangRef](https://llvm.org/docs/LangRef.html), and still, by reading the LLVM specification one could convince oneself that each of the three optimizations above is correct, which as we have seen is contradictory.
 248 What is missing? Do we need a formal mathematical definition to avoid such ambiguities?
 249 I do not think so; I think there is something simpler that will already help a lot.
 250 The source of the contradiction is that some parts of the specification *implicitly* assume that pointers have provenance, which is easy to forget when considering other operations.
 251 That's why I think it is very important to make such assumptions more explicit: the specification should *explicitly* describe the information that "makes up" a value, which will include things like provenance and [whether the value is (wholly or partially) initialized]({% post_url 2018-07-24-pointers-and-bytes %}).[^C]
 252 This information needs to be extensive enough such that a hypothetical interpreter can use it to detect all the UB.
 253 Doing so makes it obvious that a pointer has provenance, since otherwise it is impossible to correctly check for out-of-bounds pointer arithmetic while also permitting one-past-the-end pointers.
 254
 255 [^C]: The astute reader will notice that this information is also absent in the C and C++ specifications. That is indeed one of my main criticisms of these specifications. It leads to many open questions not only when discussing provenance (including `restrict`, which I did not even go into here) but also when discussing indeterminate and unspecified values or the rules around effective types.
 256
 257 This is my bar for what I consider a sufficiently precise language specification: it needs to contain all the information needed such that writing a UB-checking interpreter is just a matter of "putting in the work", but does not require introducing anything new that is not described in the specification.
 258
 259 The good news is that the process of developing these more precise specifications is already underway!
 260
 261 On the Rust side, this is mainly apparent in the [Miri interpreter](https://github.com/rust-lang/miri/), which is a concrete realization of the "hypothetical" interpreter that I mentioned above.
 262 In fact, this is why I originally started working on Miri.
 263 Nowadays, its main purpose is to help unsafe code authors avoid UB, but for me personally, I find it equally important that it helps us think about the semantics of Rust and MIR in a different way.
 264 It also feeds back into the design of the UB rules by discovering patterns that people want or need to use but that are not currently accepted by Miri.
 265
 266 On the LLVM side, the main development in this area is [Alive](https://blog.regehr.org/archives/1722), a tool that can automatically validate[^validate] optimizations performed by LLVM.
 267 Alive has found [many bugs in LLVM optimizations](https://github.com/AliveToolkit/alive2/blob/master/BugList.md), and indeed much of the recent dialog with the LLVM community aimed at a more precise IR semantics is pushed by the people building Alive, led by Nuno P. Lopes and John Regehr.
 268
 269 [^validate]: A note on terminology: "validating" an optimization means that given the program text before and after the optimization, the tool will try to prove that this particular transformation is correct. This is in contrast to "verifying" an optimization where a once-and-forall proof is carried out showing that the optimization will always perform a correct transformation. Verification gives a much stronger result, but is also extremely difficult to carry out, so validation is a great middle-ground that is still able to find plenty of bugs.
 270
 271 Progress on these specification efforts is slow, though, in particular when it turns out that [LLVM IR semantics should be changed](https://www.cs.utah.edu/~regehr/papers/undef-pldi17.pdf).
 272 I hope this post can raise awareness for the subtle problems optimizing compilers are facing, and convince some people that figuring out the specification of compiler IRs is an important and interesting problem to work on. :)
 273
 274 That's all I have for today, thanks for sticking with me!
 275 As usual, this post can be [discussed in the Rust forums](https://internals.rust-lang.org/t/pointers-are-complicated-ii-or-we-need-better-language-specs/13562) and [on Reddit](https://www.reddit.com/r/rust/comments/kd157i/pointers_are_complicated_ii_or_we_need_better/).
 276 I am curious what your thoughts are on how we can build compilers that do not suffer from the issues I have discussed here.
 277
 278 #### Footnotes