X-Git-Url: https://git.ralfj.de/web.git/blobdiff_plain/093d51990bbbf55e926d9a721b9b2219d1e03922..a21e2f98feda80b6cdc8ba3c5852ec3e85eb6b12:/personal/_posts/2018-07-13-arc-synchronization.md?ds=inline diff --git a/personal/_posts/2018-07-13-arc-synchronization.md b/personal/_posts/2018-07-13-arc-synchronization.md index dd1e490..033d56e 100644 --- a/personal/_posts/2018-07-13-arc-synchronization.md +++ b/personal/_posts/2018-07-13-arc-synchronization.md @@ -1,13 +1,14 @@ --- -title: "The Tale of a Bug in Arc: Sychronization and Data Races" +title: "The Tale of a Bug in Arc: Synchronization and Data Races" categories: rust research +reddit: /rust/comments/8ykuhv/the_tale_of_a_bug_in_arc_synchronization_and_data/ --- -While I was busy doing Rust-unelated research, [RustBelt](https://plv.mpi-sws.org/rustbelt/) continues to move and recently found another bug (after [a missing `impl !Sync` that we found previously]({{ site.baseurl }}{% post_url 2017-06-09-mutexguard-sync %})): It turns out that `Arc::get_mut` did [not perform sufficient synchronization](https://github.com/rust-lang/rust/issues/51780), leading to a data race. +While I was busy doing Rust-unrelated research, [RustBelt](https://plv.mpi-sws.org/rustbelt/) continues to move and recently found another bug (after [a missing `impl !Sync` that we found previously]({% post_url 2017-06-09-mutexguard-sync %})): It turns out that `Arc::get_mut` did [not perform sufficient synchronization](https://github.com/rust-lang/rust/issues/51780), leading to a data race. Notice that I am just the messenger here, the bug was actually found by [Hai](https://people.mpi-sws.org/~haidang/) and [Jacques-Henri](https://jhjourdan.mketjh.fr/). Still, I'd like to use this opportunity to talk a bit about weak memory, synchronization and data races. -This is just a primer, there are tons of resources on the web that go into more detail (for example [here](http://preshing.com/20120913/acquire-and-release-semantics/)). +This is just a primer, there are tons of resources on the web that go into more detail (for example [here](https://preshing.com/20120913/acquire-and-release-semantics/)). @@ -156,10 +157,10 @@ fn main() { {% endhighlight %} This program, finally, is free of data races and hence has no undefined behavior and is guaranteed to print "Hello World!". -They key point is that *when an `Acquire` read reads from a `Release` write, an order is induced between these two accesses*. +They key point is that *when a `load(Acquire)` reads from a `store(_, Release)`, an order is induced between these two accesses*. We also say that the two accesses *synchronize*. Together with the per-thread order, this means we have an order between the `DATA = Some(...)` in the first thread and the load of `DATA` in the second thread, through the accesses to `FLAG`. -Intuitively, the `Release` write in the first thread "releases" everything that has been changed by this thread so far, and the `Acquire` read in the second thread then "acquires" all that data and makes it accessible for access later in this thread. +Intuitively, the `store(_, Release)` in the first thread "releases" everything that has been changed by this thread so far, and the `load(Acquire)` in the second thread then "acquires" all that data and makes it accessible for access later in this thread. Now, most of the time a `Mutex` or `RwLock` is good enough to protect your data against concurrent accesses, so you do not have to worry about such details. (And, thanks to Rust thread safety guarantees, you never have to worry about such details in safe code!) @@ -172,17 +173,20 @@ I said that `Mutex`/`RwLock` are good enough *most of the time*. However, `Arc` is one of those cases where the overhead induced by an exclusive lock is just way too big, so it is worth using a more optimized, unsafe implementation. As such, you are going to find plenty of atomic accesses in [the source code of `Arc`](https://github.com/rust-lang/rust/blob/c0955a34bcb17f0b31d7b86522a520ebe7fa93ac/src/liballoc/sync.rs#L201). -And it turns out, as Hai and Jacques-Henri noticed when attempting to prove correctness of [`Arc::get_mut`](https://doc.rust-lang.org/beta/std/sync/struct.Arc.html#method.get_mut), that there is one place where `Relaxed` as used as an ordering, [but it really should have been `Acquire`](https://github.com/rust-lang/rust/pull/52031). -Discussing the exact details of the bug would probably fill another blog post (`Arc` is *really* subtle), but the high-level story is exactly like in our example above: Thanks to `Acquire`, an ordering is induced between the code that follows the `get_mut` and the code in another thread that dropped the last [`Weak`](https://doc.rust-lang.org/beta/std/sync/struct.Weak.html) reference. +And it turns out, as Hai and Jacques-Henri noticed when attempting to prove correctness of [`Arc::get_mut`](https://doc.rust-lang.org/stable/std/sync/struct.Arc.html#method.get_mut), that there is one place where `Relaxed` as used as an ordering, [but it really should have been `Acquire`](https://github.com/rust-lang/rust/pull/52031). +Discussing the exact details of the bug would probably fill another blog post (`Arc` is *really* subtle), but the high-level story is exactly like in our example above: Thanks to `Acquire`, an ordering is induced between the code that follows the `get_mut` and the code in another thread that dropped the last other `Arc`, decrementing the reference count to 1. +The PR that fixed the problem contains [some more details in the comments](https://github.com/rust-lang/rust/pull/52031/files). With `Relaxed`, no such ordering is induced, so we have a data race. To be fair, it is very unlikely that this race could lead to real misbehavior, but I am still happy to know that we now have a proof that `Arc` is mostly[^1] correctly synchronized. -[^1]: "Mostly", you may wonder? Unfortunately it turns out that there is [one `Relaxed` access in `make_mut`](https://github.com/rust-lang/rust/blob/c0955a34bcb17f0b31d7b86522a520ebe7fa93ac/src/liballoc/sync.rs#L793) that Hai and Jacques-Henri have not yet been able to prove correct. However, this is likely fixable by making the entire proof more complicated. The version where that `Relaxed` is also replaced by `Acquire` *has* been proven correct. +[^1]: "Mostly", you may wonder? Unfortunately it turns out that there is [one `Relaxed` access in `make_mut`](https://github.com/rust-lang/rust/blob/c0955a34bcb17f0b31d7b86522a520ebe7fa93ac/src/liballoc/sync.rs#L793) that Hai and Jacques-Henri have not yet been able to prove correct. However, this is likely fixable by making the entire proof more complicated. The version where that `Relaxed` is also replaced by `Acquire` *has* been proven correct, albeit still against a model of relaxed memory accesses that is not quite as weak as C11. One last thing: -I have previously claimed that our [first RustBelt paper]({{ site.baseurl }}{% post_url 2017-07-08-rustbelt %}) already verified correctness of `Arc`, how can there still be bugs? +I have previously claimed that our [first RustBelt paper]({% post_url 2017-07-08-rustbelt %}) already verified correctness of `Arc`, how can there still be bugs? In that paper, we did not (yet) have the tools to reason realistically about these ordering issues we have been discussing here, so instead we worked with a "sequentially consistent" logic which essentially assumes the strongest possible mode, `SeqCst`, for all atomic accesses. (We did not discuss `SeqCst` above, and in fact there are not many cases where it is really needed. [`Release` and `Acquire` are enough most of the time](https://internals.rust-lang.org/t/sync-once-per-instance/7918/8?u=ralfjung).) This is one of the simplifications we made compared to real Rust to make the verification feasible. -We were realistic enough to find [another bug]({{ site.baseurl }}{% post_url 2017-06-09-mutexguard-sync %}), but not realistic enough for this one. +We were realistic enough to find [another bug]({% post_url 2017-06-09-mutexguard-sync %}), but not realistic enough for this one. Hai and Jacques-Henri are currently working on remedying this particular simplification by extending the first RustBelt paper to also cover weak memory, and that's when they ran into this problem. + +**Update:** Turns out Servo has a [copy of `Arc`](https://doc.servo.org/servo_arc/index.html) that [has the same problem](https://github.com/servo/servo/issues/21186). So we got two bugs for the price of one. :) **/Update**