X-Git-Url: https://git.ralfj.de/web.git/blobdiff_plain/b91b3d141fdbdf9c913e620d0701ae9a95dca7e8..refs/heads/master:/ralf/_posts/2020-07-15-unused-data.md diff --git a/ralf/_posts/2020-07-15-unused-data.md b/ralf/_posts/2020-07-15-unused-data.md deleted file mode 100644 index ccde204..0000000 --- a/ralf/_posts/2020-07-15-unused-data.md +++ /dev/null @@ -1,91 +0,0 @@ ---- -title: "Why even unused data needs to be valid" -categories: rust -forum: https://internals.rust-lang.org/t/why-even-unused-data-needs-to-be-valid/12734 ---- - -The Rust compiler has a few assumptions that it makes about the behavior of all code. -Violations of those assumptions are referred to as [Undefined Behavior][ub]. -Since Rust is a safe-by-default language, programmers usually do not have to worry about those rules (the compiler and libraries ensure that safe code always satisfies the assumptions), -but authors of `unsafe` code are themselves responsible for upholding these requirements. - -Those assumptions are [listed in the Rust reference](https://doc.rust-lang.org/reference/behavior-considered-undefined.html). -The one that seems to be most surprising to many people is the clause which says that Rust code may not *produce* "[...] an invalid value, even in private fields and locals". -The reference goes on to explain that "*producing* a value happens any time a value is assigned to or read from a place, passed to a function/primitive operation or returned from a function/primitive operation". -In other words, even just *constructing*, for example, an invalid `bool`, is Undefined Behavior---no matter whether that `bool` is ever actually "used" by the program. -The purpose of this post is to explain why that rule is so strict. - -[ub]: https://rust-lang.github.io/unsafe-code-guidelines/glossary.html#undefined-behavior - - - -First of all, let me clarify what is meant by "used" here, as that term is used to mean very different things. -The following code "uses" `b`: - -{% highlight rust %} -fn example(b: bool) -> i32 { - if b { 42 } else { 23 } -} -{% endhighlight %} - -I hope it is not very surprising that calling `example` on, e.g., `3` transmuted to `bool` is Undefined Behavior (UB). -When compiling `if`, the compiler assumes that `0` and `1` are the only possible values; there is no saying what could go wrong when that assumption is violated. -(This is a compiler-understood *validity invariant* that is fixed in the language specification, which is very different from a user-defined *safety invariant*. -See [this earlier post]({% post_url 2018-08-22-two-kinds-of-invariants.md %}) for more details on that distinction.) - -What is less obvious is why calling `example` on `3` is UB even when there is no such `if` being executed. -To understand why that is important, let us consider the following example: - -{% highlight rust %} -fn example(b: bool, num: u32) -> i32 { - let mut acc = 0; - for _i in 0..num { - acc += if b { 42 } else { 23 }; - } - acc -} -{% endhighlight %} - -Now assume we were working in a slightly different version of Rust, where transmuting `3` to a `bool` is fine as long as you do not "use" the `bool`. -That would mean that calling `example(transmute(3u8), 0)` is actually allowed, because in that case the loop never gets executed, so we never "use" `b`. - -However, this is a problem for a very important transformation called [loop-invariant code motion](https://en.wikipedia.org/wiki/Loop-invariant_code_motion). -That transformation can be used to turn our `example` function into the following: - -{% highlight rust %} -fn example(b: bool, num: u32) -> i32 { - let mut acc = 0; - let incr = if b { 42 } else { 23 } - for _i in 0..num { - acc += incr; - } - acc -} -{% endhighlight %} - -The increment `if b { 42 } else { 23 }`, now called `incr`, is "invariant" during the execution of the loop, and thus computing the increment can be moved out. -Why is this a good transformation? -Instead of determining the increment each time around the loop, we do that just once, thus saving a lot of conditional jumps that the CPU is unhappy about. -This also enables further transformations down the road, e.g. the compiler could notice that this is just `num*incr`. - -However, in our hypothetical Rust where "unused" values may be invalid, this important optimization is actually incorrect! -To see why, consider again calling `example(transmute(3u8), 0)`. -Before the optimization, that call was fine. -After the optimization, that call is UB because we are doing `if b` where `b` is `3`. - -This is because loop-invariant code motion makes dead code live when the loop is not actually executed. -To fix this, we could require the compiler to prove that the loop runs at least once, but in general that will be hard (and in `example` it is impossible, since `num` can indeed be `0`). -So the alternative that Rust uses is to require that even "unused" data satisfies some basic validity. -That makes `example(transmute(3u8), 0)` UB in *both* versions of `example`, and as such the optimization is correct for this input (and indeed for all possible inputs). - -Now, one might argue that `b` in the example here was not truly "unused", it was just "used in dead code". -So instead of saying that `b` must be *always* valid, we could somehow try to make the use of `b` in dead code affect whether or not the program is UB. -However, that is a huge can of worms. -Right now, we have the fundamental principle that *dead code cannot affect program behavior*. -This principle is crucial for tools like [Miri](https://github.com/rust-lang/miri/): since Miri is an interpreter, it never even sees dead code. -I would argue that being able to have tools like Miri is hugely important, and it is worth having a semantics that enables such tools to exist. -But this means our hands are pretty much tied: since we cannot take into account of `b` is "used in dead code", we simply have to require `b` to always be valid, no matter whether and where it is "used" or not. -To support inlining and outlining, we also do not want the function boundary to be relevant, which ultimately leads us to the rule that Rust requires today: whenever data of a given type is *produced* anywhere, the data needs to be valid for that type. - -I hope this post was helpful in explaining why Undefined Behavior in Rust is defined the way it is. -As usual, if you have any comments or questions, let me know in the [forums](https://internals.rust-lang.org/t/why-even-unused-data-needs-to-be-valid/12734).