From: Ralf Jung Date: Wed, 15 Jul 2020 11:06:19 +0000 (+0200) Subject: finish and publish unused-data post X-Git-Url: https://git.ralfj.de/web.git/commitdiff_plain/1afcf299355caf931b823f9473da59e1835d9f8e?ds=inline finish and publish unused-data post --- diff --git a/personal/_drafts/unused-data.md b/personal/_drafts/unused-data.md deleted file mode 100644 index 2c06b68..0000000 --- a/personal/_drafts/unused-data.md +++ /dev/null @@ -1,61 +0,0 @@ ---- -title: "Why even unused data needs to be valid" -categories: rust ---- - -`unsafe` code in Rust has a few ways that it can trigger [Undefined Behavior][ub], i.e., there are a few assumptions that the compiler makes about all code, and that for `unsafe` code the programmer is responsible for upholding. -Those assumptions are [listed in the Rust reference](https://doc.rust-lang.org/reference/behavior-considered-undefined.html). -The one that seems to be most surprising to many people is the clause which says that unsafe code may not produce "[...] an invalid value, even in private fields and locals". -The reference goes on to explain that "*producing* a value happens any time a value is assigned to or read from a place, passed to a function/primitive operation or returned from a function/primitive operation". -In other words, even just *constructing*, for example, an invalid `bool`, is Undefined Behavior---no matter whether that `bool` is ever actually "used" by the program. -The purpose of this post is to explain why. - -[ub]: https://rust-lang.github.io/unsafe-code-guidelines/glossary.html#undefined-behavior - - - -First of all, let me clarify what is meant by "used" here, as that term is used to mean very different things. -The following code "uses" `b`: - -{% highlight rust %} -fn example(b: bool) -> i32 { - if b { 42 } else { 23 } -} -{% endhighlight %} - -I hope it is not very surprising that calling `example` on, e.g., `3` transmuted to `bool` is Undefined Behavior (UB). -When compiling `if`, the compiler assumes that `0` and `1` are the only possible values; there is no saying what could go wrong when that assumption is violated. - -What is less obvious is why calling `example` on `3` is UB even when there is no such `if` being executed. -To understand why that is important, let us consider the following example: - -{% highlight rust %} -fn example(b: bool, num: u32) -> i32 { - let mut acc = 0; - for _i in 0..num { - acc += if b { 42 } else { 23 }; - } - acc -} -{% endhighlight %} - -Now assume we were working in a slightly different version of Rust, where transmuting `3` to a `bool` is fine as long as you do not "use" the `bool`. -That would mean that calling `example(transmute(3u8), 0)` is actually allowed, because in that case the loop never gets executed, so we never "use" `b`. - -However, this is a problem for a very important transformation called [loop-invariant code motion](https://en.wikipedia.org/wiki/Loop-invariant_code_motion). -That transformation can be used to turn our `example` function into the following: - -{% highlight rust %} -fn example(b: bool, num: u32) -> i32 { - let mut acc = 0; - let incr = if b { 42 } else { 23 } - for _i in 0..num { - acc += incr; - } - acc -} -{% endhighlight %} - -The increment `if b { 42 } else { 23 }` is "invariant" during the execution of the loop, and thus computing the increment can be moved out. -Why is this a good transformation? -Instead of determining the increment each time around the loop, we do that just once, thus saving a lot of conditional jumps that the CPU is unhappy about. diff --git a/personal/_posts/2020-07-15-unused-data.md b/personal/_posts/2020-07-15-unused-data.md new file mode 100644 index 0000000..16068e8 --- /dev/null +++ b/personal/_posts/2020-07-15-unused-data.md @@ -0,0 +1,88 @@ +--- +title: "Why even unused data needs to be valid" +categories: rust +--- + +The Rust compiler has a few assumptions that it makes about the behavior of all code. +Violations of those assumptions are referred to as [Undefined Behavior][ub]. +Since Rust is a safe-by-default language, programmers usually do not have to worry about those rules (the compiler and libraries ensure that safe code always satisfies the assumptions), +but authors of `unsafe` code are themselves responsible for upholding these requirements. + +Those assumptions are [listed in the Rust reference](https://doc.rust-lang.org/reference/behavior-considered-undefined.html). +The one that seems to be most surprising to many people is the clause which says that unsafe code may not *produce* "[...] an invalid value, even in private fields and locals". +The reference goes on to explain that "*producing* a value happens any time a value is assigned to or read from a place, passed to a function/primitive operation or returned from a function/primitive operation". +In other words, even just *constructing*, for example, an invalid `bool`, is Undefined Behavior---no matter whether that `bool` is ever actually "used" by the program. +The purpose of this post is to explain why that rule is so strict. + +[ub]: https://rust-lang.github.io/unsafe-code-guidelines/glossary.html#undefined-behavior + + + +First of all, let me clarify what is meant by "used" here, as that term is used to mean very different things. +The following code "uses" `b`: + +{% highlight rust %} +fn example(b: bool) -> i32 { + if b { 42 } else { 23 } +} +{% endhighlight %} + +I hope it is not very surprising that calling `example` on, e.g., `3` transmuted to `bool` is Undefined Behavior (UB). +When compiling `if`, the compiler assumes that `0` and `1` are the only possible values; there is no saying what could go wrong when that assumption is violated. + +What is less obvious is why calling `example` on `3` is UB even when there is no such `if` being executed. +To understand why that is important, let us consider the following example: + +{% highlight rust %} +fn example(b: bool, num: u32) -> i32 { + let mut acc = 0; + for _i in 0..num { + acc += if b { 42 } else { 23 }; + } + acc +} +{% endhighlight %} + +Now assume we were working in a slightly different version of Rust, where transmuting `3` to a `bool` is fine as long as you do not "use" the `bool`. +That would mean that calling `example(transmute(3u8), 0)` is actually allowed, because in that case the loop never gets executed, so we never "use" `b`. + +However, this is a problem for a very important transformation called [loop-invariant code motion](https://en.wikipedia.org/wiki/Loop-invariant_code_motion). +That transformation can be used to turn our `example` function into the following: + +{% highlight rust %} +fn example(b: bool, num: u32) -> i32 { + let mut acc = 0; + let incr = if b { 42 } else { 23 } + for _i in 0..num { + acc += incr; + } + acc +} +{% endhighlight %} + +The increment `if b { 42 } else { 23 }`, now called `incr`, is "invariant" during the execution of the loop, and thus computing the increment can be moved out. +Why is this a good transformation? +Instead of determining the increment each time around the loop, we do that just once, thus saving a lot of conditional jumps that the CPU is unhappy about. +This also enables further transformations down the road, e.g. the compiler could notice that this is just `num*incr`. + +However, in our hypothetical Rust where "unused" values may be invalid, this important optimization is actually incorrect! +To see why, consider again calling `example(transmute(3u8), 0)`. +Before the optimization, that call was fine. +After the optimization, that call is UB because we are doing `if b` where `b` is `3`. + +This is because loop-invariant code motion makes dead code live when the loop is not actually executed. +To fix this, we could require the compiler to prove that the loop runs at least once, but in general that will be hard (and in `example` it is impossible, since `num` can indeed be `0`). +So the alternative that Rust uses is to require that even "unused" data satisfies some basic validity. +That makes `example(transmute(3u8), 0)` UB in *both* versions of `example`, and as such the optimization is correct for this input (and indeed for all possible inputs). + +Now, one might argue that `b` in the example here was not truly "unused", it was just "used in dead code". +So instead of saying that `b` must be *always* valid, we could somehow try to make the use of `b` in dead code affect whether or not the program is UB. +However, that is a huge can of worms. +Right now, we have the fundamental principle that *dead code cannot affect program behavior*. +This principle is crucial for tools like [Miri](https://github.com/rust-lang/miri/): since Miri is an interpreter, it never even sees dead code. +I would argue that being able to have tools like Miri is hugely important, and it is worth having a semantics that enables such tools to exist. +But this means our hands are pretty much tied: since we cannot take into account of `b` is "used in dead code", we simply have to require `b` to always be valid, no matter whether and where it is "used" or not. +To support inlining and outlining, we also do not want the function boundary to be relevant, which ultimately leads us to the rule that Rust requires today: whenever data of a given type is *produced* anywhere, the data needs to be valid for that type. + +I hope this post was helpful in explaining why Undefined Behavior in Rust is defined the way it is. +As usual, if you have any comments or questions, let me know in the forums.