+#### What went wrong?
+
+Clearly, one of the three optimizations is incorrect in the sense that it introduced a change in program behavior.
+But which one is it?
+
+In an ideal world, we would have a sufficiently precise semantics for LLVM IR that we would just have to read the docs (or, even better, run some Miri-like tool) to figure out the answer.
+However, describing language semantics at this level of precision is *hard*, and full of trade-offs.
+That's why the LLVM LangRef will not give us a clear answer here, and indeed obtaining a clear answer requires some decisions that have not been explicitly made yet.
+As a researcher, all I can do is to structure the design space and uncover the trade-offs; it will be up to the LLVM community to decide which option to pick.
+But the key point is that *they will have to make a choice*, because the status quo of doing all three of these optimizations leads to incorrect compilation results.
+
+To proceed, we will use the three optimizations that we considered above as cues: assuming that the optimization is correct for LLVM IR, what does that tell us about the semantics?
+
+We start with the last optimization, where the `print` argument is changed from `q[0]` to `0`.
+This optimization is based on alias analysis:
+`q[0]` gets initialized to `0` at the beginning of the program, and the only write between that initialization and the `print` is to the pointer `p+1`.
+Since `q` and `p` point to different local variables, a pointer derived from `p` cannot alias `q[0]`, and hence we know that this write cannot affect the value stored at `q[0]`.
+
+Looking more closely, however, reveals that things are not quite so simple!
+`p+1` is a one-past-the-end pointer, so it actually *can* have the same address as `q[0]`
+(and, in fact, inside the conditional we know this to be the case).
+However, LLVM IR (just like C) does not permit memory accesses through one-past-the-end pointers.
+It makes a difference whether we use `p+1` or `q` inside the `if`, even though we know (in that branch) that both pointers point to the same memory location.
+This demonstrates that in LLVM IR, there is more to a pointer than just the address it points to---it also matters how this address was computed.
+This something extra is what we typically call *provenance*.
+It is impossible to argue for the *correct*ness of the third optimization without acknowledging that provenance is a real part of the semantics of an LLVM IR program.
+In a flat memory model where pointers are just integers (such as most assembly languages), this optimization is simply wrong.
+
+Now that we know that provenance exists in pointers, we have to also consider what happens to provenance when a pointer gets cast to an integer and back.
+The second optimization gives us a clue into this aspect of LLVM IR semantics: casting a pointer to an integer and back is optimized away, which means that *integers have provenance*.
+To see why, consider the two expressions `(char*)(int)(p+1)` and `(char*)(int)q`:
+if the optimization of removing pointer-integer-pointer roundtrips is correct, the first operation will output `p+1` and the second will output `q`, which we just established are two different pointers (they differ in their provenance).
+The only way to explain this is to say that the input to the `(char*)` cast is different, since the Abstract Machine state is identical in both cases.
+But we know that the integer value (i.e., the bit pattern of length 32) that serves as input to the `(char*)` cast is the same, and hence a difference can only arise if integers consist of more than just this bit pattern---just like pointers, integers have provenance.
+
+Finally, let us consider the first optimization.
+Here, a successful equality test `iq == ip` prompts the optimizer to replace one value by the other.
+This optimization demonstrates that *integers do not have provenance*:
+the optimization is only correct if a successful run-time equality test implies that the two values are equivalent in the Abstract Machine.
+But this means that the Abstract Machine version of this value cannot have any "funny" extra parts that are not represented at run-time.
+Of course, provenance is exactly such a "funny" extra part.
+A different way to phrase the same argument is to say that this optimization is correct only if `iq == ip` implies that both values have the same provenance.
+This would be a possible definition of `==` in LLVM IR, but only in principle---in practice this means the LLVM backends have to compile `==` in a way that pointer provenance is taken into account, which of course is impossible.
+
+*Take-away:*
+By considering each of these three optimizations in terms of what they tell us about the semantics of LLVM IR, we learned that pointers have provenance, that integers remember the provenance of the pointer they come from in case of a pointer-to-integer cast, and that integers do not have provenance.
+This is a contradiction, and this contradiction explains why we saw incorrect compilation results when applying all three optimizations to the same program.
+
+#### How can we fix this?
+
+To fix the problem, we will have to declare one of the three optimizations incorrect and stop performing it.
+Speaking in terms of the LLVM IR semantics, this corresponds to deciding whether pointers and/or integers have provenance:
+* We could say both pointers and integers have provenance, which invalidates the first optimization.
+* We could say pointers have provenance but integers do not, which invalidates the second optimization.
+* We could say nothing has provenance, which invalidates the third optimization.
+
+In my opinion, the first and last options are not tenable.
+Removing provenance altogether kills all but the most simple alias analyses.[^alias]
+On the other hand, declaring that integers have provenance does not just disable the first optimization in the chain shown above, it also disables common arithmetic optimizations such as `x - x` being equivalent to `0`.
+Even achieving commutativity and associativity of `+` becomes non-trivial once integers have provenance.
+
+[^alias]: Sadly, I do not know of a systematic study of the performance impact of this decision. It is my understanding that many compiler developers "obviously" consider this way too costly of a fix to the problem, but it would still be great to underpin this with some proper data.
+
+So, I think that the issue should be resolved by saying that pointers have provenance but integers do not, which means that it is the second optimization that is wrong.
+This also corresponds to [what has been recently proposed to the C standard committee](http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2577.pdf).
+That's why [LLVM bug #34548](https://bugs.llvm.org/show_bug.cgi?id=34548) says that optimizing away pointer-integer-pointer roundtrips is incorrect, and LLVM should stop doing this in the general case.
+There might still be special cases where this can be done, but figuring out the limits of this really needs a more precise description of LLVM IR semantics such as what we proposed [in this paper](https://people.mpi-sws.org/~jung/twinsem/twinsem.pdf).
+
+## Conclusion
+
+What did we learn?
+First of all, pointers are complicated.
+Precisely describing their semantics in a way that is consistent with common alias analyses requires adding a notion of "provenance".
+In a language such as Java or ML where pointers are opaque types whose representation cannot be observed, this is actually fairly easy to do.
+But in a language such as Rust, C or C++ that supports pointer-integer casts, the introduction of provenance poses some really tricky questions, and at least one of the commonly performed optimizations in this space has to give.
+
+We also learned that LLVM has a bug, but that was *not* the point of this blog post.
+The GCC developers [made exactly the same mistake](https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82282), and I got word that MSVC and ICC have the same issue (though I do not know how to verify this).
+And I cannot blame them; the way compiler development typically works nowadays, I think bugs like this are inevitable.
+Pointer provenance is just a particularly good example of where the current approach failed, but it is not the only case.
+For example, [ยง2.3 of this paper](https://plv.mpi-sws.org/validc/paper.pdf) (see Figure 3 for the code) shows how a sequence of two optimizations can lead to a miscompilation, where the first optimization is correct under the LLVM concurrency model, and the second optimization is correct under the C++11 concurrency model---but there is no concurrency model under which *both* optimizations are correct, so each compiler (or rather, each compiler IR) needs to pick one or the other.
+
+Which brings me to my main conclusion for this post: I think the way optimizing compilers for these low-level languages are built is fundamentally flawed.
+The current approach, which one might call "optimizations-first", is to largely think of the semantics of a compiler IR in terms of the set of optimizations that it enables.[^weak-mem]
+However, each additional optimization puts another constraint on the IR semantics---how can we be sure that there even is a way to satisfy all constraints in a single language?
+The only way I know to ensure that the constraint set remains satisfiable, i.e., to ensure that performing all these optimizations in any order does not introduce bugs, is to pick a consistent semantics for the IR and then show each optimization to be *correct* under those semantics, in the sense defined above.
+
+[^weak-mem]: We can also clearly see this approach in most discussions around weak memory concurrency effects, which typically are all about which reorderings the compiler is allowed to perform and how barriers can be used to prevent the reorderings. This is the optimizations-first approach in action.
+
+To avoid the problem of incompatible optimizations, I think we need to take compiler IRs more serious as programming languages in their own right, and properly define the Abstract Machine that describes their semantics---including all the UB!
+I call this "language-first".
+This language definition does not need to be a formal artifact in a proof assistant, but the description needs to be precise enough such that there are no ambiguities, and such that for each desired optimization we can evaluate whether it is *correct* for this IR or not.
+One great way to achieve this is to implement a *reference interpreter* for the IR, an interpreter that not only evaluates IR programs but also says if there was any UB triggered by this execution.
+Writing such an interpreter is a great exercise because it requires explicitly writing out what one could call the "state space" of the Abstract Machine:
+just what *is* a value, which pieces of data are needed to fully describe it, what kind of information is stored in memory, and so on.
+Doing so makes it *immediately obvious* that a pointer has provenance, since otherwise it is impossible to correctly check for out-of-bounds accesses.
+
+This is really my main motivation for working on the [Miri interpreter](https://github.com/rust-lang/miri/).
+Of course, practically speaking, its main purpose is to help unsafe code authors avoid UB, but for me personally, I find it equally important that it helps us think about the semantics of Rust and MIR in an *operational* way.
+This helps shift the discussion around MIR from "optimizations-first" more towards "language-first".
+Of course, optimizations are still the key motivating factor for defining the language semantics this way or that way, but ultimately it is the language semantics that serves as "ground truth" and keeps it all together, ensuring everything is mutually compatible.
+
+I hope this was educational, and thanks for reading. :)
+As usual, this post can be discussed in the Rust forums.
+I am curious what your thoughts are on how we can build compilers that do not suffer from these issues.