ralf/_posts/2017-06-09-mutexguard-sync.md

   1 ---
   2 title: How MutexGuard was Sync When It Should Not Have Been
   3 categories: rust research
   4 reddit: /rust/comments/6gavfe/how_mutexguard_was_sync_when_it_should_not_have/
   5 ---
   6
   7 A couple of weeks ago, our ongoing effort to [formalize Rust's type system]({% post_url 2015-10-12-formalizing-rust %}) lead to us actually discovering a bug in the Rust standard library:
   8 `MutexGuard` implemented `Sync` in cases where it should not. This could lead to data races in safe programs. Ouch.
   9 <!-- MORE -->
  10
  11 I have to admit that I was somewhat hoping that this would be the case, after all, our paper will "sell" much better if we can point to an actual bug that was found through our effort ;) .
  12 I was, however, not giving this a high chance; after all, many eyes have looked at the standard library.
  13 Well, it turned out the trouble was not some code that someone had written and others had overlooked, the trouble was some code that had *not* been written... but let me start in the beginning.
  14
  15 ## The Bug
  16
  17 Here's a piece of code demonstrating the [problem](https://github.com/rust-lang/rust/issues/41622):
  18 {% highlight rust %}
  19 use std::sync::Mutex;
  20 use std::cell::Cell;
  21
  22 extern crate rayon;
  23
  24 fn main()
  25 {
  26     let m = Mutex::new(Cell::new(0));
  27     let g : MutexGuard<Cell<i32>> = m.lock().unwrap();
  28     {
  29         rayon::join(
  30             || { g.set(g.get() + 1); println!("Thread 1: {:?}", g.get()) },
  31             || { g.set(g.get() + 1); println!("Thread 2: {:?}", g.get()) });
  32     }
  33 }
  34 {% endhighlight %}
  35 Let us carefully step through this code to see what is going on.
  36 First, we create a `Mutex<Cell<i32>>`.
  37 Then we acquire the lock on this mutex, and store the `MutexGuard<Cell<i32>>`.
  38 This is guaranteed to succeed immediately, after all, there is no other thread that could even try to acquire this lock.
  39 Next, we run two closures in parallel (that's what `rayon::join` does), both accessing the same `MutexGuard` `g`.
  40 This is somewhat unusual:  Typically, each thread acquires the lock for itself, performs some operation, and then unlocks the mutex.
  41 Here, the main thread acquires the lock, and then it hands a shared reference to the guard to two other threads.
  42 This may sound odd, but it can in fact be a perfectly safe and pretty powerful pattern *if both threads only perform read operations*.
  43
  44 The trouble is, in the above example, both threads use `Cell::set` to change the content of the cell.
  45 `Cell` uses unsychronized accesses, so this is a classical example of a data race:  Two threads accessing the same location, and at least one of them writing to it.
  46 This is exactly the kind of concurrency bug that Rust set out to prevent.
  47 How does it come that the compiler accepted this code?
  48
  49 As already mentioned, the two closures both capture a shared reference to `g`.
  50 In other words, they both capture something of type `&MutexGuard<Cell<i32>>`.
  51 Following the [type of `rayon::join`](https://docs.rs/rayon/0.7.1/rayon/fn.join.html), the compiler checks whether that type is `Send`.
  52 This is exactly the check that is supposed to prevent data races.
  53 Because `g` is a shared reference, it proceeds by checking whether `MutexGuard<Cell<i32>>` is `Sync`, and this is where things go wrong.
  54 If we look at the [source code of `Mutex` before the fix](https://github.com/rust-lang/rust/blob/3d60bf45f42fe3ab6f6a64a983ed256d86ffdbbe/src/libstd/sync/mutex.rs), we can see that there is *no* implementation of `Sync` for `MutexGuard<T>`.
  55 Now, `Sync` is an "auto trait" (also often called OIBIT, but that's really not a great name).
  56 This means that the compiler considers a type like `MutexGuard<T>` to be `Sync` if all its fields are `Sync`.
  57 It turns out that if you follow this through, the compiler considers `MutexGuard<T>` to be `Sync` if `T` is `Send`.
  58 It then checks whether `Cell<i32>` is `Send` -- which it is -- and accepts our program.
  59
  60 ## How the Bug Was Found
  61
  62 I found this bug in the process of proving safety of `Mutex`.
  63 In particular, this involves proving that the `unsafe impl {Send, Sync}` are actually justified.
  64 Now, `MutexGuard<T>` *did not have* an `unsafe impl Sync` -- but that does not mean that it doesn't implement `Sync`!
  65 Because of the way auto traits work, it could still implement `Sync` automatically, based on the types of its fields.
  66 The trouble is, just because the compiler implements `Sync` automatically for us (so we don't have to write an `unsafe impl`) doesn't mean that this automatic `Sync` is correct!
  67 The compiler does not understand what `MutexGuad<T>` actually means, which invariants it maintains, nor does it understand the precise guarantees provided by `Send` and `Sync`.
  68 It cannot possibly check how these two interact.
  69
  70 So what I did was write some little test programs to check when the compiler considers `MutexGuard<T>` to be `Sync`, and determined that it is sufficient for `T` to be `Send`.
  71 (It turns out that it is much easier to figure this out by experiments than by staring at the source, and it is absolutely impossible to determine this based on the documentation. I would argue that is a [problem on its own](https://github.com/rust-lang/rust/issues/41537).)
  72 Then I looked at what I had to prove, and quickly determined that this can't possibly work -- I needed `T` to be `Sync`, not `Send`!
  73 Given this observation, it wasn't hard to "weaponize" this bug and write the racy program that you can see above.
  74
  75 ## The Fix
  76
  77 Clearly, to fix this problem, we have to make it so that `MutexGuard<Cell<i32>>` is *not* `Sync`.
  78 The question is:  When, in general, can we make `MutexGuard<T>` `Sync`?
  79 To answer this question, we have to consider which operations can be performed on `&MutexGuard<T>`.
  80 We can consider the type `Sync` if all these operations are thread-safe.
  81 In this case, the only interesting operation is that we can use `Deref::deref` to obtain an `&T`.
  82 In other words, for `&MutexGuard<T>` to be thread-safe, we need `&T` to be thread-safe.
  83 This is exactly what it means for `T` to be `Sync`!
  84
  85 To sum this all up, we want the implementation of `Sync` for `MutexGuard<T>` to look something like this:
  86 {% highlight rust %}
  87 unsafe impl<T: Sync> Sync for MutexGuard<T> { }
  88 {% endhighlight %}
  89 Notice the `unsafe` here:  The compiler cannot actually check whether we made any mistake in the analysis that we performed above, so we have to tell it to trust us in our assessment.
  90 The [actual fix](https://github.com/rust-lang/rust/pull/41624) is somewhat more verbose because `MutexGuard` supports unsized types and because there is a lifetime involved, but that doesn't really matter for our discussion here.
  91
  92 ## So What Now?
  93
  94 This fix landed in Rust master a while ago and will be part of Rust 1.19.
  95 If you read this post in the future and try the code above, you are likely using a sufficiently recent compiler for the code to be rejected -- which means you are safe from this particular bug!
  96 So far, I've seen no reports of code that stopped compiling because of this fix, and the pattern involved in abusing the bug is sufficiently obscure that it seems unlikely for any code to actually be broken because of this.
  97
  98 So all's well that ends well?
  99 Not really.
 100 Notice how the bug is not caused by an incorrect piece of code somewhere in the implementation of `Mutex`.
 101 One could go over that entire file, and prove every single line of it correct, and all proofs would go through.
 102 The bug is in a line of code that was *not* written.
 103 I don't think it is very surprising that this was overlooked.
 104
 105 Ideally, types like `MutexGuard<T>` would *not* get these automatic implementations of `Send` and `Sync`.
 106 That would make sure that, if some type incorrectly implements these traits, at least it comes from some piece of incorrect `unsafe` somewhere.
 107 This is exactly the point of `unsafe`: Marking clearly the points in the code that have to be audited extra-carefully.
 108 The reality is [somewhat more subtle than this]({% post_url 2016-01-09-the-scope-of-unsafe %}), but nevertheless, `unsafe` does its job pretty well -- except for when it comes to auto traits.
 109
 110 To make things worse, imagine what happens when auto traits are stabilized, so that crates can introduce their own auto traits.
 111 Some of them may want to express additional guarantees that the compiler cannot check, so some of these auto traits may be unsafe.
 112 It is impossible to predict how the guarantees of these unsafe auto traits interact with the invariants of types like `MutexGuard<T>` -- but still, the compiler will consider the trait to be implemented for `MutexGuard<T>` if it is implemented for all fields.
 113 The types of these fields are an implementation detail and not even visible to other crates.
 114 Nothing outside of the standard library should care about such details.
 115 However, as we have seen, with user-defined auto traits, we actually care a lot!
 116
 117 Again, I think the solution here is that types like `MutexGuard<T>` should not automatically implement unsafe auto traits.
 118 What do I mean by "types like `MutexGuard<T>`"?
 119 The point that makes `MutexGuard<T>` special here is that there is an invariant attached with this type that all instances of `MutexGuard<T>` satisfy.
 120 (I describe this concept of types with additional invariants in more detail [in one of my previous blog posts]({% post_url 2016-01-09-the-scope-of-unsafe %})).
 121 My proposal is to *tell the compiler* that this is the case.
 122 If the compiler understands that a type is "more than what it seems to be", we could know during compilation that we have to be more cautious when handling this type -- and in particular, we could have such types *not* automatically implement unsafe auto traits.
 123 In fact, I [proposed such an annotation](https://internals.rust-lang.org/t/pre-rfc-unsafe-types/3073) previously for different reasons, [and so did others](https://github.com/rust-lang/rfcs/issues/381).[^1]
 124
 125 [^1]: I think having a notion of unsafe types can also help solve some of the [unsafe code guidelines](https://internals.rust-lang.org/t/next-steps-for-unsafe-code-guidelines/3864) questions.  But that is content for another blog post.
 126
 127 This of course significantly weakens the power of user-defined auto traits.
 128 However, I hope I convinced you that if we don't act, errors like the one described here are bound to happen.
 129 That said, such decisions are of course going to go through the usual RFC process.
 130 It's certainly possible that someone comes up with a compromise that preserves some of the usefulness of auto traits, without putting safety at risk.
 131