From: Ralf Jung <post@ralfj.de>
Date: Sun, 6 Dec 2020 15:04:23 +0000 (+0100)
Subject: provenance-matters: explain what went wrong
X-Git-Url: https://git.ralfj.de/web.git/commitdiff_plain/5b1e565ff26b44f0eec320825378f95df67dde0b?hp=ee3e8070bc0c3937861914f49b6139e019afdd44

provenance-matters: explain what went wrong
---

diff --git a/personal/_drafts/provenance-matters.md b/personal/_drafts/provenance-matters.md
index 946044b..d97d7b8 100644
--- a/personal/_drafts/provenance-matters.md
+++ b/personal/_drafts/provenance-matters.md
@@ -14,7 +14,7 @@ This post is a cautionary tale of what can happen when provenance is not conside
 <!-- MORE -->
 
 As an example, I will show a series of three compiler transformation that each seem "intuitively justified" based on wording in the C standard and some common understanding of how pointers work.
-We will use LLVM for these examples, but the goal is not to pick on LLVM -- other compilers suffer from similar issues.
+We will use LLVM for these examples, but the goal is not to pick on LLVM---other compilers suffer from similar issues.
 The goal is to convince you that we need to take provenance seriously if we want to build a correct compiler for languages permitting unsafe pointer manipulation such as C, C++, or Rust.
 
 ## Warm-up: Why IRs need a precise semantics
@@ -80,7 +80,7 @@ Sure, no human is going to write many programs in LLVM IR, but clang and rustc p
 *Take-away:* If we want to be able to justify the correctness of a compiler in a modular way, considering only one optimization at a time, we need to perform these optimizations in an IR that has a precise specification of all aspects of program behavior, including UB.
 Then we can, for each optimization separately, consider the question: does the optimization ever change program behavior, and does it ever introduce UB into UB-free programs?
 
-## How to break this code in 3 easy steps
+## How 3 (seemingly) correct optimizations can be incorrect when used together
 
 With the warm-up done, we are now ready to consider some more tricky optimizations.
 We will look at three different optimizations LLVM can perform, and I will show that they *cannot all be correct* since the first and last program we are considering actually have *different behavior*.
@@ -88,6 +88,9 @@ We will look at three different optimizations LLVM can perform, and I will show
 This is only possible if at least one optimization changed program behavior in an incorrect way.
 To determine which optimization is at fault, we will need a more precise specification of LLVM IR than what currently exists.
 
+The sequence of examples you are going is taken from [this talk](https://sf.snu.ac.kr/llvmtwin/files/presentation.pdf#page=32) by Chung-Kil Hur.
+For simplicity, we assume that `int` has the right size to hold a pointer value; just imagine we used `uintptr_t` if you want to be more general.
+
 Here is the source program:
 {% highlight c %}
 char p[1], q[1] = {0};
@@ -108,7 +111,7 @@ char p[1], q[1] = {0};
 int ip = (int)(p+1);
 int iq = (int)q;
 if (iq == ip) {
-  *(char*)(int)(p+1) = 10;
+  *(char*)(int)(p+1) = 10; // <- This line changed
   print(q[0]);
 }
 {% endhighlight %}
@@ -119,7 +122,7 @@ char p[1], q[1] = {0};
 int ip = (int)(p+1);
 int iq = (int)q;
 if (iq == ip) {
-  *(p+1) = 10;
+  *(p+1) = 10; // <- This line changed
   print(q[0]);
 }
 {% endhighlight %}
@@ -131,7 +134,7 @@ int ip = (int)(p+1);
 int iq = (int)q;
 if (iq == ip) {
   *(p+1) = 10;
-  print(0);
+  print(0); // <- This line changed
 }
 {% endhighlight %}
 
@@ -139,6 +142,56 @@ However, this final program is different from the first one!
 Specifically, the final program will either print nothing or print "0", while the original program *could never print "0"*.
 This shows that the sequence of three optimizations we performed, as a whole, is *not correct*.
 
-### Playing the blame game
+#### What went wrong?
+
+Clearly, one of the three optimizations is incorrect in the sense that it introduced a change in program behavior.
+But which one is it?
+
+In an ideal world, we would have a sufficiently precise semantics for LLVM IR that we would just have to read the docs (or, even better, run some Miri-like tool) to figure out the answer.
+However, describing language semantics at this level of precision is *hard*, and full of trade-offs.
+That's why the LLVM LangRef will not give us a clear answer here, and indeed obtaining a clear answer requires some decisions that have not been explicitly made yet.
+As a formal researcher, all I can do is to structure the design space and uncover the trade-offs; it will be up to the LLVM community to decide which option to pick.
+But the key point is that *they will have to make a choice*, because the status quo of doing all three of these optimizations leads to incorrect compilation results.
+
+To proceed, we will use the three optimizations that we considered above as cues: assuming that the optimization is correct for LLVM IR, what does that tell us about the semantics?
+
+We start with the last optimization, where the `print` argument is changed from `q[0]` to `0`.
+This optimization is based on alias analysis:
+`q[0]` gets initialized to `0` at the beginning of the program, and the only write between that initialization and the `print` is to the pointer `p+1`.
+Since `q` and `p` point to different local variables, a pointer derived from `p` cannot alias `q[0]`, and hence we know that this write cannot affect the value stored at `q[0]`.
+
+Looking more closely, however, reveals that things are not quite so simple!
+`p+1` is a one-past-the-end pointer, so it actually *can* have the same address as `q[0]`
+(and, in fact, inside the conditional we know this to be the case).
+However, LLVM IR (just like C) does not permit memory accesses through one-past-the-end pointers.
+It makes a difference whether we use `p+1` or `q` inside the `if`, even though we know (in that branch) that both pointers point to the same memory location.
+This demonstrates that in LLVM IR, there is more to a pointer than just the address it points to---it also matters how this address was computed.
+This something extra is what we typically call *provenance*.
+It is impossible to argue for the correctness of the third optimization without acknowledging that provenance is a real part of the semantics of an LLVM IR program.
+This is what I mean when I say *provenance matters*.
+
+Now that we know that provenance exists in pointers, we have to also consider what happens to provenance when a pointer gets cast to an integer and back.
+The second optimization gives us a clue into this aspect of LLVM IR semantics: casting a pointer to an integer and back is optimized away, which means that *integers have provenance*.
+To see why, consider the two expressions `(char*)(int)(p+1)` and `(char*)(int)q` (this optimization consists of replacing one by the other):
+if the optimization of removing pointer-integer-pointer roundtrips is correct, the first operation will output `p+1` and the second will output `q`, which we just established are two different pointers (they differ in their provenance).
+The only way to explain this is to say that the input to the `(char*)` cast is different, since the Abstract Machine state is identical in both cases.
+But we know that the integer value (i.e., the bit pattern of length 32) that serves as input to the `(char*)` cast is the same, and hence a difference can only arise if integers consist of more than just this bit pattern---just like pointers, integers have provenance.
+
+Finally, let us consider the first optimization.
+Here, a successful equality test `iq == ip` prompts the optimizer to replace one value by the other.
+This optimization demonstrates that *integers do not have provenance*:
+the optimization is only correct if a successful run-time equality test implies that the two values are equivalent in the Abstract Machine.
+But this means that the Abstract Machine version of this value cannot have any "funny" extra parts that are not represented at run-time.
+Of course, provenance is exactly such a "funny" extra part.
+A different way to phrase the same argument is to say that this optimization is correct only if `iq == ip` implies that both values have the same provenance.
+This would be a possible definition of `==` in LLVM IR only in principle---in practice this means the LLVM backends have to compile `==` in a way that pointer provenance is taken into account, which of course is impossible.
+
+*Take-away:*
+By considering each of these three optimizations in terms of what they tell us about the semantics of LLVM IR, we learned that pointers have provenance, that integers remember the provenance of the pointer they come from in case of a pointer-to-integer cast, and that integers do not have provenance.
+This is a contradiction, and this contradiction explains why we saw incorrect compilation results when applying all three optimizations to the same program.
+
+#### How can we fix this?
+
+## Conclusion
 
 #### Footnotes