personal/_posts/2018-07-13-arc-synchronization.md

   1 ---
   2 title: "The Tale of a Bug in Arc: Synchronization and Data Races"
   3 categories: rust research
   4 reddit: /rust/comments/8ykuhv/the_tale_of_a_bug_in_arc_synchronization_and_data/
   5 ---
   6
   7 While I was busy doing Rust-unrelated research, [RustBelt](https://plv.mpi-sws.org/rustbelt/) continues to move and recently found another bug (after [a missing `impl !Sync` that we found previously]({{ site.baseurl }}{% post_url 2017-06-09-mutexguard-sync %})): It turns out that `Arc::get_mut` did [not perform sufficient synchronization](https://github.com/rust-lang/rust/issues/51780), leading to a data race.
   8
   9 Notice that I am just the messenger here, the bug was actually found by [Hai](https://people.mpi-sws.org/~haidang/) and [Jacques-Henri](https://jhjourdan.mketjh.fr/).
  10 Still, I'd like to use this opportunity to talk a bit about weak memory, synchronization and data races.
  11 This is just a primer, there are tons of resources on the web that go into more detail (for example [here](http://preshing.com/20120913/acquire-and-release-semantics/)).
  12
  13 <!-- MORE -->
  14
  15 ## Synchronization and Data Races
  16
  17 Consider the following example (full of unsafe code because I am demonstrating behavior that Rust does its best to rule out):
  18 {% highlight rust %}
  19 extern crate rayon;
  20
  21 static mut DATA : Option<String> = None;
  22 static mut FLAG : usize = 0;
  23
  24 fn main() { unsafe {
  25   rayon::join(
  26     || {
  27       // Initialize data
  28       DATA = Some("Hello World!".to_string());
  29       // Signal that we are done
  30       FLAG = 1;
  31     },
  32     || {
  33       // Wait until the other thread is done
  34       while FLAG == 0 {}
  35       // Print data
  36       println!("{}", DATA.as_ref().unwrap());
  37     }
  38   );
  39 } }
  40 {% endhighlight %}
  41 The question is: What will this program print?
  42 If you are thinking "Hello World!", you are not wrong, but you are not fully right either.
  43 "Hello World!" is one possibility, but this program can also panic at the `unwrap` and in fact it can do anything it wants, because it has undefined behavior.
  44
  45 ### Weak Memory
  46
  47 But let us first talk about the possibility of a panic.
  48 How can that happen?
  49 The reason is that [accessing memory is slow](https://gist.github.com/jboner/2841832), at least compared to other operations like arithmetic or accessing a register.
  50 Both compilers and CPUs do everything they can to make sure the program does not wait for memory.
  51 For example, your compiler might decide that it is advantageous to reorder the two instructions in the first thread, putting the `FLAG = 1` *before* the `DATA = Some(...)`.
  52 These are writes to two distinct memory locations, so reordering is fine as the overall result is the same -- right?
  53
  54 Well, that was true when all programs were single-threaded, but with a concurrent program, another thread can actually tell that the writes are now happening in a different order!
  55 That's how the second thread might see `FLAG = 1` but still read `DATA = NONE`.
  56 Moreover, even if the compiler didn't do anything like that, we could *still* see a result of "0" if you are running this program on an ARM chip.
  57 These chips are very aggressive in how they cache memory accesses, with the result that even if you write a program like the above in assembly, you may still be up for a surprise.
  58
  59 We call memory that can exhibit such strange reordering effects "weak memory" because it is weaker in the sense that it provides fewer guarantees than one might naively expect.
  60
  61 ### Data Races
  62
  63 But why does the program have undefined behavior?
  64 The reason is that Rust has tons of primitives whose behavior is essentially impossible to specify once concurrency gets involves.
  65 Consider the assignment `DATA = Some(...)` above, where an entire `Option<String>` is being copied.
  66 What would happen if some other thread reads `DATA` while the assignment is still in progress?
  67 The other thread would see some kind of mixture between the old and the new `DATA`!
  68 That's just complete garbage.
  69 So systems programming languages essentially give up at this point and declare *data races* to be undefined behavior.
  70
  71 A data race is defined as follows:
  72
  73 > Two accesses to the same location form a data race if:
  74 > 1. at least one of them is a write, and
  75 > 2. at least one of them is not a special *atomic* access, and
  76 > 3. the accesses are not *ordered*.
  77
  78 We will talk about the "order" mentioned in condition (3) later.
  79 For now, let us look at "atomic accesses".
  80
  81 ### Atomic Memory Accesses
  82
  83 In our program above, the reads and writes to `FLAG` satisfy all three conditions.
  84 To fix that, we will negate condition (2): We will use *atomic accesses*.
  85
  86 All usual accesses (like all the ones in our example code above) are so called *non-atomic* accesses.
  87 They can be compiled efficiently and are heavily optimized, but as we have seen they quickly cause trouble when writing low-level concurrent code.
  88 To make it possible to write such code, but still keep most of the existing optimizations, the C++11 standard introduced special functions that can be used to perform *atomic* read and write accesses to a location.
  89 We also have these atomic accesses in Rust, and we can use them in our program as follows:
  90 {% highlight rust %}
  91 extern crate rayon;
  92
  93 use std::sync::atomic::{AtomicUsize, Ordering};
  94
  95 static mut DATA : Option<String> = None;
  96 static FLAG : AtomicUsize = AtomicUsize::new(0);
  97
  98 fn main() {
  99   rayon::join(
 100     || {
 101       // Initialize data
 102       unsafe { DATA = Some("Hello World!".to_string()); }
 103       // Signal that we are done
 104       FLAG.store(1, Ordering::Relaxed); },
 105     || {
 106       // Wait until the other thread is done
 107       while FLAG.load(Ordering::Relaxed) == 0 {}
 108       // Print data
 109       println!("{}", unsafe { DATA.as_ref().unwrap() });
 110     }
 111   );
 112 }
 113 {% endhighlight %}
 114 Notice how we are using `AtomicUsize` as type for `FLAG`, and `AtomicUsize::load` and `AtomicUsize::store` instead of non-atomic memory accesses.
 115 These are atomic accesses, and hence we no longer have a data race on `FLAG`.
 116 (I will come back to this `Ordering::Relaxed` parameter soon.)
 117 Notice also that we significantly reduced the amount of unsafe operations, because `AtomucUsize` is actually completely safe to use from multiple threads.
 118
 119 ### Synchronization
 120
 121 However, we are not done yet: We still have a race on `DATA`!
 122 We also cannot use atomic accesses for `DATA` because of its type, `Option<String>`, which is just too big to be read/written atomically.
 123 This brings us to the third condition in our definition of data races: We have to somehow "order" the two accesses to `DATA`.
 124
 125 So, what is that "order" about?
 126 The idea is that while in general, operations on a concurrent program can happen in parallel, with strange effects like our first program panicking, there are still *some* operations that we can rely on to be properly "ordered".
 127 For example, all operations within a single thread are ordered the way they are written, so e.g. `DATA = Some(...)` is ordered before `FLAG.store`.
 128 However, what is missing in the program above is some way to obtain an order between operations on *different threads*.
 129 This is where the `Ordering::Relaxed` parameter comes in: All atomic accesses come with an order mode that indicates under which circumstances this access will induce an order between operations.
 130 Our accesses above are `Relaxed`, which means no order is induced.
 131
 132 So, what we will have to do is to strengthen the mode of our `FLAG` accesses to induce an order between the write of `1` in the first thread, and the operation that reads `1` in the second thread.
 133 To this end, we use the `Release` and `Acquire` pair of modes:
 134 {% highlight rust %}
 135 extern crate rayon;
 136
 137 use std::sync::atomic::{AtomicUsize, Ordering};
 138
 139 static mut DATA : Option<String> = None;
 140 static FLAG : AtomicUsize = AtomicUsize::new(0);
 141
 142 fn main() {
 143   rayon::join(
 144     || {
 145       // Initialize data
 146       unsafe { DATA = Some("Hello World!".to_string()); }
 147       // Signal that we are done
 148       FLAG.store(1, Ordering::Release); },
 149     || {
 150       // Wait until the other thread is done
 151       while FLAG.load(Ordering::Acquire) == 0 {}
 152       // Print data
 153       println!("{}", unsafe { DATA.as_ref().unwrap() });
 154     }
 155   );
 156 }
 157 {% endhighlight %}
 158 This program, finally, is free of data races and hence has no undefined behavior and is guaranteed to print "Hello World!".
 159
 160 They key point is that *when an `Acquire` read reads from a `Release` write, an order is induced between these two accesses*.
 161 We also say that the two accesses *synchronize*.
 162 Together with the per-thread order, this means we have an order between the `DATA = Some(...)` in the first thread and the load of `DATA` in the second thread, through the accesses to `FLAG`.
 163 Intuitively, the `Release` write in the first thread "releases" everything that has been changed by this thread so far, and the `Acquire` read in the second thread then "acquires" all that data and makes it accessible for access later in this thread.
 164
 165 Now, most of the time a `Mutex` or `RwLock` is good enough to protect your data against concurrent accesses, so you do not have to worry about such details.
 166 (And, thanks to Rust thread safety guarantees, you never have to worry about such details in safe code!)
 167 But based on what you learned so far, it should make perfect sense that when a lock is released by thread A, that will happen through a `Release` access, and when a lock is acquired by thread B, that happens through an `Acquire` access.
 168 This way, the lock makes sure that there is an order between the accesses thread A performed when it held the lock (before the `Release`), and the accesses thread B will perform while it has the lock (after the `Acquire`).
 169
 170 ## Coming Back to `Arc`
 171
 172 I said that `Mutex`/`RwLock` are good enough *most of the time*.
 173 However, `Arc` is one of those cases where the overhead induced by an exclusive lock is just way too big, so it is worth using a more optimized, unsafe implementation.
 174 As such, you are going to find plenty of atomic accesses in [the source code of `Arc`](https://github.com/rust-lang/rust/blob/c0955a34bcb17f0b31d7b86522a520ebe7fa93ac/src/liballoc/sync.rs#L201).
 175
 176 And it turns out, as Hai and Jacques-Henri noticed when attempting to prove correctness of [`Arc::get_mut`](https://doc.rust-lang.org/beta/std/sync/struct.Arc.html#method.get_mut), that there is one place where `Relaxed` as used as an ordering, [but it really should have been `Acquire`](https://github.com/rust-lang/rust/pull/52031).
 177 Discussing the exact details of the bug would probably fill another blog post (`Arc` is *really* subtle), but the high-level story is exactly like in our example above: Thanks to `Acquire`, an ordering is induced between the code that follows the `get_mut` and the code in another thread that dropped the last other `Arc`, decrementing the reference count to 1.
 178 The PR that fixed the problem contains [some more details in the comments](https://github.com/rust-lang/rust/pull/52031/files).
 179 With `Relaxed`, no such ordering is induced, so we have a data race.
 180 To be fair, it is very unlikely that this race could lead to real misbehavior, but I am still happy to know that we now have a proof that `Arc` is mostly[^1] correctly synchronized.
 181
 182 [^1]: "Mostly", you may wonder?  Unfortunately it turns out that there is [one `Relaxed` access in `make_mut`](https://github.com/rust-lang/rust/blob/c0955a34bcb17f0b31d7b86522a520ebe7fa93ac/src/liballoc/sync.rs#L793) that Hai and Jacques-Henri have not yet been able to prove correct.  However, this is likely fixable by making the entire proof more complicated.  The version where that `Relaxed` is also replaced by `Acquire` *has* been proven correct, albeit still against a model of relaxed memory accesses that is not quite as weak as C11.
 183
 184 One last thing:
 185 I have previously claimed that our [first RustBelt paper]({{ site.baseurl }}{% post_url 2017-07-08-rustbelt %}) already verified correctness of `Arc`, how can there still be bugs?
 186 In that paper, we did not (yet) have the tools to reason realistically about these ordering issues we have been discussing here, so instead we worked with a "sequentially consistent" logic which essentially assumes the strongest possible mode, `SeqCst`, for all atomic accesses.
 187 (We did not discuss `SeqCst` above, and in fact there are not many cases where it is really needed. [`Release` and `Acquire` are enough most of the time](https://internals.rust-lang.org/t/sync-once-per-instance/7918/8?u=ralfjung).)
 188 This is one of the simplifications we made compared to real Rust to make the verification feasible.
 189 We were realistic enough to find [another bug]({{ site.baseurl }}{% post_url 2017-06-09-mutexguard-sync %}), but not realistic enough for this one.
 190 Hai and Jacques-Henri are currently working on remedying this particular simplification by extending the first RustBelt paper to also cover weak memory, and that's when they ran into this problem.