minor tweaks

[web.git] / personal / _drafts / provenance-matters.md
diff --git a/personal/_drafts/provenance-matters.md b/personal/_drafts/provenance-matters.md

index ceb1be513efd06f529258fc2fd1b6b86deb590b8..7ad9f1a4d53d545f55be49e94adedc3705510753 100644 (file)
--- a/personal/_drafts/provenance-matters.md
+++ b/personal/_drafts/provenance-matters.md
@@ -15,10 +15,10 @@ There is also a larger message here about how we could prevent such issues from
  <!-- MORE -->
  
  I will show a series of three compiler transformations that each seem "intuitively justified", but when taken together they lead to a clearly incorrect result.
-We will use LLVM for these examples, but the goal is not to pick on LLVM---other compilers suffer from similar issues.
+I will use LLVM for these examples, but the goal is not to pick on LLVM---other compilers suffer from similar issues.
  The goal is to convince you that to build a correct compiler for languages permitting unsafe pointer manipulation such as C, C++, or Rust,
  we need to take IR semantics (and specifically provenance) more seriously.
-We use LLVM for the examples because it is particularly easy to study with its single, (comparatively) well-documented IR that a lot of infrastructure evolved around.
+I use LLVM for the examples because it is particularly easy to study with its single, (comparatively) well-documented IR that a lot of infrastructure evolved around.
  Let's get started!
  
  ## Warm-up: Why IRs need a precise semantics
@@ -112,7 +112,7 @@ if (iq == ip) {
    print(q[0]);
  }
  {% endhighlight %}
-We are using C syntax here just as a convenient way to write programs in LLVM IR.
+I am using C syntax here just as a convenient way to write programs in LLVM IR.
  For simplicity, we assume that `int` has the right size to hold a pointer value; just imagine we used `uintptr_t` if you want to be more general.
  
  This program has two possible behaviors: either `ip` (the address one-past-the-end of `p`) and `iq` (the address of `q`) are different, and nothing is printed.
@@ -167,7 +167,7 @@ That's why the LLVM LangRef will not give us a clear answer here, and indeed obt
  
  To proceed, we will use the three optimizations that we considered above as cues: assuming that the optimization is correct for LLVM IR, what does that tell us about the semantics?
  
-We start with the last optimization, where the `print` argument is changed from `q[0]` to `0`.
+Let us start with the last optimization, where the `print` argument is changed from `q[0]` to `0`.
  This optimization is based on alias analysis:
  `q[0]` gets initialized to `0` at the beginning of the program, and the only write between that initialization and the `print` is to the pointer `p+1`.
  Since `q` and `p` point to different local variables, a pointer derived from `p` cannot alias `q[0]`, and hence we know that this write cannot affect the value stored at `q[0]`.
@@ -236,9 +236,10 @@ But in a language such as Rust, C, or C++ that supports pointer-integer casts, t
  We also learned that LLVM has a bug, but that was *not* the point of this blog post.
  The GCC developers [made exactly the same mistake](https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82282), and I got word that MSVC and ICC have the same issue (though I do not know how to verify this).
  And I cannot blame them; the way compiler development typically works, I think bugs like this are inevitable: when exactly UB arises in an IR is often only loosely specified, so evaluating whether some optimization is *correct* in the sense defined above is basically impossible.
-Pointer provenance is just a particularly good example of where the current approach failed.
+Pointer provenance is just a particularly good (and subtle) example.
+The warm-up above is another trivial case of this (albeit one where the existing specification is sufficient): loop-invariant code motion of arithmetic operations and UB on arithmetic overflow can both be *correct*, but not in the same IR.
  For another example, see [§2.3 of this paper](https://plv.mpi-sws.org/validc/paper.pdf) (Figure 3 contains the code) which shows how a sequence of two optimizations can lead to a miscompilation, where the first optimization is *correct* under the LLVM concurrency model, and the second optimization is *correct* under the C++11 concurrency model---but there is no concurrency model under which *both* optimizations are correct, so each compiler (or rather, each compiler IR) needs to pick one or the other.
-The warm-up example is another trivial case of this: loop-invariant code motion of arithmetic operations and UB on arithmetic overflow can both be *correct*, but not in the same IR.
+Finally, this [paper on `undef` and `poison`](https://www.cs.utah.edu/~regehr/papers/undef-pldi17.pdf) gives examples for optimizations that are broken by the presence of `undef` in LLVM, and describes some of the trade-offs that arise when defining the semantics of `poison`.
  
  Which brings me to my main conclusion for this post: to avoid the problem of incompatible optimizations, I think we need to take compiler IRs more serious as programming languages in their own right, and give them a precise specification---including all the UB.
  Now, you may object by saying that LLVM has an [extensive LangRef](https://llvm.org/docs/LangRef.html), and still, by reading the LLVM specification one could convince oneself that each of the three optimizations above is correct, which as we have seen is contradictory.