X-Git-Url: https://git.ralfj.de/web.git/blobdiff_plain/456685f7e728ce397d7bfddc061656c781eb0834..a3f9d0ed51ec4d4f74dc6f9b34a19d745b8ff6d6:/personal/_posts/2020-12-14-provenance.md?ds=inline diff --git a/personal/_posts/2020-12-14-provenance.md b/personal/_posts/2020-12-14-provenance.md index bb2faed..ee3f5b6 100644 --- a/personal/_posts/2020-12-14-provenance.md +++ b/personal/_posts/2020-12-14-provenance.md @@ -1,7 +1,9 @@ --- title: "Pointers Are Complicated II, or: We need better language specs" -categories: rust +categories: rust research forum: https://internals.rust-lang.org/t/pointers-are-complicated-ii-or-we-need-better-language-specs/13562 +license: CC BY-SA 4.0 +license-url: https://creativecommons.org/licenses/by-sa/4.0/ --- Some time ago, I wrote a blog post about how [there's more to a pointer than meets the eye]({% post_url 2018-07-24-pointers-and-bytes %}). @@ -86,7 +88,7 @@ All involved optimizations need to exactly agree on what is and is not UB, to en This is exactly what we also expect from the specification of a programming language such as C, which is why I think we should consider compiler IRs as proper programming languages in their own right, and specify them with the same diligence as we would specify "normal" languages.[^ub-difference] Sure, no human is going to write many programs in LLVM IR, so their syntax barely matters, but clang and rustc produce LLVM IR programs all the time, and as we have seen understanding the exact rules governing the behavior of programs is crucial to ensuring that the optimizations LLVM performs do not change program behavior. -[^cheat]: If now you feel like we somehow cheated, since we can always translate the program from C to LLVM IR, optimize there, and translate back, consider this: translating from LLVM IR to C is really hard! In particular, singed integer addition in LLVM IR can *not* be translated into signed integer addition in C, since the former is well-defined with `poison` result in case of overflow, but the latter says overflow is UB. C has strictly more UB than LLVM IR (for integer arithmetic), which makes translation in one direction easy, while the other direction is hard. +[^cheat]: If now you feel like we somehow cheated, since we can always translate the program from C to LLVM IR, optimize there, and translate back, consider this: translating from LLVM IR to C is really hard! In particular, signed integer addition in LLVM IR can *not* be translated into signed integer addition in C, since the former is well-defined with `poison` result in case of overflow, but the latter says overflow is UB. C has strictly more UB than LLVM IR (for integer arithmetic), which makes translation in one direction easy, while the other direction is hard. [^ub-difference]: In particular, two different variants of the IR with different rules for UB are really *two different programming languages*. A program that is well-defined in one language may have UB in another, so great care needs to be taken when the program is moved from being governed by one set of rules to another. @@ -262,7 +264,7 @@ Nowadays, its main purpose is to help unsafe code authors avoid UB, but for me p It also feeds back into the design of the UB rules by discovering patterns that people want or need to use but that are not currently accepted by Miri. On the LLVM side, the main development in this area is [Alive](https://blog.regehr.org/archives/1722), a tool that can automatically validate[^validate] optimizations performed by LLVM. -Alive has found [many bugs in LLVM optimizations](https://github.com/AliveToolkit/alive2/blob/master/BugList.md), and indeed much of the recent dialog with the LLVM community aimed at a more precise IR semantics is pushed by the people building Alive, lead by Nuno P. Lopes and John Regehr. +Alive has found [many bugs in LLVM optimizations](https://github.com/AliveToolkit/alive2/blob/master/BugList.md), and indeed much of the recent dialog with the LLVM community aimed at a more precise IR semantics is pushed by the people building Alive, led by Nuno P. Lopes and John Regehr. [^validate]: A note on terminology: "validating" an optimization means that given the program text before and after the optimization, the tool will try to prove that this particular transformation is correct. This is in contrast to "verifying" an optimization where a once-and-forall proof is carried out showing that the optimization will always perform a correct transformation. Verification gives a much stronger result, but is also extremely difficult to carry out, so validation is a great middle-ground that is still able to find plenty of bugs. @@ -270,7 +272,7 @@ Progress on these specification efforts is slow, though, in particular when it t I hope this post can raise awareness for the subtle problems optimizing compilers are facing, and convince some people that figuring out the specification of compiler IRs is an important and interesting problem to work on. :) That's all I have for today, thanks for sticking with me! -As usual, this post can be [discussed in the Rust forums](https://internals.rust-lang.org/t/pointers-are-complicated-ii-or-we-need-better-language-specs/13562). +As usual, this post can be [discussed in the Rust forums](https://internals.rust-lang.org/t/pointers-are-complicated-ii-or-we-need-better-language-specs/13562) and [on Reddit](https://www.reddit.com/r/rust/comments/kd157i/pointers_are_complicated_ii_or_we_need_better/). I am curious what your thoughts are on how we can build compilers that do not suffer from the issues I have discussed here. #### Footnotes