From: Ralf Jung <post@ralfj.de>
Date: Fri, 13 Jul 2018 15:19:49 +0000 (+0200)
Subject: new post on Arc bug
X-Git-Url: https://git.ralfj.de/web.git/commitdiff_plain/44d57d645bd780a2cf354fa7c848bbbf9bcfbc20

new post on Arc bug
---

diff --git a/ralf/_posts/2017-06-09-mutexguard-sync.md b/ralf/_posts/2017-06-09-mutexguard-sync.md
index 62ac7de..0d12ee8 100644
--- a/ralf/_posts/2017-06-09-mutexguard-sync.md
+++ b/ralf/_posts/2017-06-09-mutexguard-sync.md
@@ -1,6 +1,6 @@
 ---
 title: How MutexGuard was Sync When It Should Not Have Been
-categories: rust
+categories: rust research
 reddit: /rust/comments/6gavfe/how_mutexguard_was_sync_when_it_should_not_have/
 ---
 
diff --git a/ralf/_posts/2018-07-13-arc-synchronization.md b/ralf/_posts/2018-07-13-arc-synchronization.md
new file mode 100644
index 0000000..dd1e490
--- /dev/null
+++ b/ralf/_posts/2018-07-13-arc-synchronization.md
@@ -0,0 +1,188 @@
+---
+title: "The Tale of a Bug in Arc: Sychronization and Data Races"
+categories: rust research
+---
+
+While I was busy doing Rust-unelated research, [RustBelt](https://plv.mpi-sws.org/rustbelt/) continues to move and recently found another bug (after [a missing `impl !Sync` that we found previously]({{ site.baseurl }}{% post_url 2017-06-09-mutexguard-sync %})): It turns out that `Arc::get_mut` did [not perform sufficient synchronization](https://github.com/rust-lang/rust/issues/51780), leading to a data race.
+
+Notice that I am just the messenger here, the bug was actually found by [Hai](https://people.mpi-sws.org/~haidang/) and [Jacques-Henri](https://jhjourdan.mketjh.fr/).
+Still, I'd like to use this opportunity to talk a bit about weak memory, synchronization and data races.
+This is just a primer, there are tons of resources on the web that go into more detail (for example [here](http://preshing.com/20120913/acquire-and-release-semantics/)).
+
+<!-- MORE -->
+
+## Synchronization and Data Races
+
+Consider the following example (full of unsafe code because I am demonstrating behavior that Rust does its best to rule out):
+{% highlight rust %}
+extern crate rayon;
+
+static mut DATA : Option<String> = None;
+static mut FLAG : usize = 0;
+
+fn main() { unsafe {
+  rayon::join(
+    || {
+      // Initialize data
+      DATA = Some("Hello World!".to_string());
+      // Signal that we are done
+      FLAG = 1;
+    },
+    || {
+      // Wait until the other thread is done
+      while FLAG == 0 {}
+      // Print data
+      println!("{}", DATA.as_ref().unwrap());
+    }
+  );
+} }
+{% endhighlight %}
+The question is: What will this program print?
+If you are thinking "Hello World!", you are not wrong, but you are not fully right either.
+"Hello World!" is one possibility, but this program can also panic at the `unwrap` and in fact it can do anything it wants, because it has undefined behavior.
+
+### Weak Memory
+
+But let us first talk about the possibility of a panic.
+How can that happen?
+The reason is that [accessing memory is slow](https://gist.github.com/jboner/2841832), at least compared to other operations like arithmetic or accessing a register.
+Both compilers and CPUs do everything they can to make sure the program does not wait for memory.
+For example, your compiler might decide that it is advantageous to reorder the two instructions in the first thread, putting the `FLAG = 1` *before* the `DATA = Some(...)`.
+These are writes to two distinct memory locations, so reordering is fine as the overall result is the same -- right?
+
+Well, that was true when all programs were single-threaded, but with a concurrent program, another thread can actually tell that the writes are now happening in a different order!
+That's how the second thread might see `FLAG = 1` but still read `DATA = NONE`.
+Moreover, even if the compiler didn't do anything like that, we could *still* see a result of "0" if you are running this program on an ARM chip.
+These chips are very aggressive in how they cache memory accesses, with the result that even if you write a program like the above in assembly, you may still be up for a surprise.
+
+We call memory that can exhibit such strange reordering effects "weak memory" because it is weaker in the sense that it provides fewer guarantees than one might naively expect.
+
+### Data Races
+
+But why does the program have undefined behavior?
+The reason is that Rust has tons of primitives whose behavior is essentially impossible to specify once concurrency gets involves.
+Consider the assignment `DATA = Some(...)` above, where an entire `Option<String>` is being copied.
+What would happen if some other thread reads `DATA` while the assignment is still in progress?
+The other thread would see some kind of mixture between the old and the new `DATA`!
+That's just complete garbage.
+So systems programming languages essentially give up at this point and declare *data races* to be undefined behavior.
+
+A data race is defined as follows:
+
+> Two accesses to the same location form a data race if:
+> 1. at least one of them is a write, and
+> 2. at least one of them is not a special *atomic* access, and
+> 3. the accesses are not *ordered*.
+
+We will talk about the "order" mentioned in condition (3) later.
+For now, let us look at "atomic accesses".
+
+### Atomic Memory Accesses
+
+In our program above, the reads and writes to `FLAG` satisfy all three conditions.
+To fix that, we will negate condition (2): We will use *atomic accesses*.
+
+All usual accesses (like all the ones in our example code above) are so called *non-atomic* accesses.
+They can be compiled efficiently and are heavily optimized, but as we have seen they quickly cause trouble when writing low-level concurrent code.
+To make it possible to write such code, but still keep most of the existing optimizations, the C++11 standard introduced special functions that can be used to perform *atomic* read and write accesses to a location.
+We also have these atomic accesses in Rust, and we can use them in our program as follows:
+{% highlight rust %}
+extern crate rayon;
+
+use std::sync::atomic::{AtomicUsize, Ordering};
+
+static mut DATA : Option<String> = None;
+static FLAG : AtomicUsize = AtomicUsize::new(0);
+
+fn main() {
+  rayon::join(
+    || {
+      // Initialize data
+      unsafe { DATA = Some("Hello World!".to_string()); }
+      // Signal that we are done
+      FLAG.store(1, Ordering::Relaxed); },
+    || {
+      // Wait until the other thread is done
+      while FLAG.load(Ordering::Relaxed) == 0 {}
+      // Print data
+      println!("{}", unsafe { DATA.as_ref().unwrap() });
+    }
+  );
+}
+{% endhighlight %}
+Notice how we are using `AtomicUsize` as type for `FLAG`, and `AtomicUsize::load` and `AtomicUsize::store` instead of non-atomic memory accesses.
+These are atomic accesses, and hence we no longer have a data race on `FLAG`.
+(I will come back to this `Ordering::Relaxed` parameter soon.)
+Notice also that we significantly reduced the amount of unsafe operations, because `AtomucUsize` is actually completely safe to use from multiple threads.
+
+### Synchronization
+
+However, we are not done yet: We still have a race on `DATA`!
+We also cannot use atomic accesses for `DATA` because of its type, `Option<String>`, which is just too big to be read/written atomically.
+This brings us to the third condition in our definition of data races: We have to somehow "order" the two accesses to `DATA`.
+
+So, what is that "order" about?
+The idea is that while in general, operations on a concurrent program can happen in parallel, with strange effects like our first program panicking, there are still *some* operations that we can rely on to be properly "ordered".
+For example, all operations within a single thread are ordered the way they are written, so e.g. `DATA = Some(...)` is ordered before `FLAG.store`.
+However, what is missing in the program above is some way to obtain an order between operations on *different threads*.
+This is where the `Ordering::Relaxed` parameter comes in: All atomic accesses come with an order mode that indicates under which circumstances this access will induce an order between operations.
+Our accesses above are `Relaxed`, which means no order is induced.
+
+So, what we will have to do is to strengthen the mode of our `FLAG` accesses to induce an order between the write of `1` in the first thread, and the operation that reads `1` in the second thread.
+To this end, we use the `Release` and `Acquire` pair of modes:
+{% highlight rust %}
+extern crate rayon;
+
+use std::sync::atomic::{AtomicUsize, Ordering};
+
+static mut DATA : Option<String> = None;
+static FLAG : AtomicUsize = AtomicUsize::new(0);
+
+fn main() {
+  rayon::join(
+    || {
+      // Initialize data
+      unsafe { DATA = Some("Hello World!".to_string()); }
+      // Signal that we are done
+      FLAG.store(1, Ordering::Release); },
+    || {
+      // Wait until the other thread is done
+      while FLAG.load(Ordering::Acquire) == 0 {}
+      // Print data
+      println!("{}", unsafe { DATA.as_ref().unwrap() });
+    }
+  );
+}
+{% endhighlight %}
+This program, finally, is free of data races and hence has no undefined behavior and is guaranteed to print "Hello World!".
+
+They key point is that *when an `Acquire` read reads from a `Release` write, an order is induced between these two accesses*.
+We also say that the two accesses *synchronize*.
+Together with the per-thread order, this means we have an order between the `DATA = Some(...)` in the first thread and the load of `DATA` in the second thread, through the accesses to `FLAG`.
+Intuitively, the `Release` write in the first thread "releases" everything that has been changed by this thread so far, and the `Acquire` read in the second thread then "acquires" all that data and makes it accessible for access later in this thread.
+
+Now, most of the time a `Mutex` or `RwLock` is good enough to protect your data against concurrent accesses, so you do not have to worry about such details.
+(And, thanks to Rust thread safety guarantees, you never have to worry about such details in safe code!)
+But based on what you learned so far, it should make perfect sense that when a lock is released by thread A, that will happen through a `Release` access, and when a lock is acquired by thread B, that happens through an `Acquire` access.
+This way, the lock makes sure that there is an order between the accesses thread A performed when it held the lock (before the `Release`), and the accesses thread B will perform while it has the lock (after the `Acquire`).
+
+## Coming Back to `Arc`
+
+I said that `Mutex`/`RwLock` are good enough *most of the time*.
+However, `Arc` is one of those cases where the overhead induced by an exclusive lock is just way too big, so it is worth using a more optimized, unsafe implementation.
+As such, you are going to find plenty of atomic accesses in [the source code of `Arc`](https://github.com/rust-lang/rust/blob/c0955a34bcb17f0b31d7b86522a520ebe7fa93ac/src/liballoc/sync.rs#L201).
+
+And it turns out, as Hai and Jacques-Henri noticed when attempting to prove correctness of [`Arc::get_mut`](https://doc.rust-lang.org/beta/std/sync/struct.Arc.html#method.get_mut), that there is one place where `Relaxed` as used as an ordering, [but it really should have been `Acquire`](https://github.com/rust-lang/rust/pull/52031).
+Discussing the exact details of the bug would probably fill another blog post (`Arc` is *really* subtle), but the high-level story is exactly like in our example above: Thanks to `Acquire`, an ordering is induced between the code that follows the `get_mut` and the code in another thread that dropped the last [`Weak`](https://doc.rust-lang.org/beta/std/sync/struct.Weak.html) reference.
+With `Relaxed`, no such ordering is induced, so we have a data race.
+To be fair, it is very unlikely that this race could lead to real misbehavior, but I am still happy to know that we now have a proof that `Arc` is mostly[^1] correctly synchronized.
+
+[^1]: "Mostly", you may wonder?  Unfortunately it turns out that there is [one `Relaxed` access in `make_mut`](https://github.com/rust-lang/rust/blob/c0955a34bcb17f0b31d7b86522a520ebe7fa93ac/src/liballoc/sync.rs#L793) that Hai and Jacques-Henri have not yet been able to prove correct.  However, this is likely fixable by making the entire proof more complicated.  The version where that `Relaxed` is also replaced by `Acquire` *has* been proven correct.
+
+One last thing:
+I have previously claimed that our [first RustBelt paper]({{ site.baseurl }}{% post_url 2017-07-08-rustbelt %}) already verified correctness of `Arc`, how can there still be bugs?
+In that paper, we did not (yet) have the tools to reason realistically about these ordering issues we have been discussing here, so instead we worked with a "sequentially consistent" logic which essentially assumes the strongest possible mode, `SeqCst`, for all atomic accesses.
+(We did not discuss `SeqCst` above, and in fact there are not many cases where it is really needed. [`Release` and `Acquire` are enough most of the time](https://internals.rust-lang.org/t/sync-once-per-instance/7918/8?u=ralfjung).)
+This is one of the simplifications we made compared to real Rust to make the verification feasible.
+We were realistic enough to find [another bug]({{ site.baseurl }}{% post_url 2017-06-09-mutexguard-sync %}), but not realistic enough for this one.
+Hai and Jacques-Henri are currently working on remedying this particular simplification by extending the first RustBelt paper to also cover weak memory, and that's when they ran into this problem.