From a2e15867eeeec48d4fe52e1fe89f62f063ecf148 Mon Sep 17 00:00:00 2001 From: Ralf Jung Date: Sun, 10 Jun 2018 19:30:21 +0200 Subject: [PATCH 01/16] describe my mailman CAPTCHA --- .../2018-06-02-mailman-subscription-spam.md | 7 +- ...-10-mailman-subscription-spam-continued.md | 180 ++++++++++++++++++ 2 files changed, 184 insertions(+), 3 deletions(-) create mode 100644 personal/_posts/2018-06-10-mailman-subscription-spam-continued.md diff --git a/personal/_posts/2018-06-02-mailman-subscription-spam.md b/personal/_posts/2018-06-02-mailman-subscription-spam.md index 3c29d1f..e8d557c 100644 --- a/personal/_posts/2018-06-02-mailman-subscription-spam.md +++ b/personal/_posts/2018-06-02-mailman-subscription-spam.md @@ -45,9 +45,10 @@ have found my servers so far are much less patient than that, just setting spam. So, if you are reading this and running a Mailman installation: **Please set -`SUBSCRIBE_FORM_SECRET` and protect your setup against abuse!** Just run `pwgen -16` to get some random string, and then add `SUBSCRIBE_FORM_SECRET = ""` to `/etc/mailman/mm_cfg.py`. It's really that simple! Just a +`SUBSCRIBE_FORM_SECRET` and protect your setup against abuse!** Just run +`openssl rand -base64 18` to get some random string, and then add +`SUBSCRIBE_FORM_SECRET = ""` to `/etc/mailman/mm_cfg.py`. +It's really that simple! Just a [four-line patch in my Ansible playbook](https://git.ralfj.de/ansible.git/commitdiff/937b170594be82e500ae726dc47de8ca9ef3dfcf) to get this rolled out to all servers. Note that you need to be at least on Mailman 2.1.16 for this to work; all currently supported versions of Debian come diff --git a/personal/_posts/2018-06-10-mailman-subscription-spam-continued.md b/personal/_posts/2018-06-10-mailman-subscription-spam-continued.md new file mode 100644 index 0000000..4b92caf --- /dev/null +++ b/personal/_posts/2018-06-10-mailman-subscription-spam-continued.md @@ -0,0 +1,180 @@ +--- +title: "Fighting Mailman Subscription Spam: Leveling Up" +categories: sysadmin +--- + +Last week, I blogged about [my efforts to fight mailman subscription spam]({{ +site.baseurl }}{% post_url 2018-06-02-mailman-subscription-spam %}). Enabling +`SUBSCRIBE_FORM_SECRET` as described there indeed helped to drastically reduce +the amount of subscription spam from more than 1000 to less than 10 mails sent +per day, but some attackers still got through. My guess is that those machines +were just so slow that they managed to wait the required five seconds before +submitting the form. + +So, clearly I had to level up my game. I decided to pull through on my plan to +write a simple CAPTCHA for mailman (that doesn't expose your users to Google). +This post describes how to configure and install that CAPTCHA. + + + +### CAPTCHA configuration + +This simple CAPTCHA is based on a list of questions and matching answers that +you, the site admin, define. The idea is to use questions that anyone who is +interested in your site can easily answer. Since most sites are small enough +that they are not to be targeted specifically, the bots (hopefully) will not be +able to answer these questions. At least for my sites, that has worked so far +(I am running with this patch for a week now). + +The CAPTCHA requires `SUBSCRIBE_FORM_SECRET` to be enabled. Configuration can +look something like this: +``` +SUBSCRIBE_FORM_SECRET = "" +CAPTCHAS = [ + # This is a list of questions, each paired with a list of answers. 
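+  # Answers are compared case-insensitively and with surrounding whitespace
+  # stripped (see Captcha.verify below), so e.g. " Twelve " is also accepted.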
+ ('What is two times six?', ['12', 'twelve']), + ('What is the name of this site's blog', ['Ralf's Ramblings']), +] +``` + +### CAPTCHA patch + +Right now, the `CAPTCHAS` part of the configuration will not yet do anything +because you still have to install my patch. The patch is losely based on +[this blog post](https://warmcat.com/2017/08/12/mailman-captcha.html) and was +written against Mailman 2.1.23 on Debian 9 "Stretch". If you are using a +different version you may have to adapt it accordingly. + +First of all, create a new file `/usr/lib/mailman/Mailman/Captcha.py` with the +following content: +``` +import random +from Mailman import Utils + +def display(mlist, captchas): + """Returns a CAPTCHA question, the HTML for the answer box, and + the data to be put into the CSRF token""" + idx = random.randrange(len(captchas)) + question = captchas[idx][0] + box_html = mlist.FormatBox('captcha_answer', size=30) + return (Utils.websafe(question), box_html, str(idx)) + +def verify(idx, given_answer, captchas): + try: + idx = int(idx) + except ValueError: + return False + if not idx in range(len(captchas)): + return False + # Chec the given answer + correct_answers = captchas[idx][1] + given_answer = given_answer.strip().lower() + return given_answer in map(lambda x: x.strip().lower(), correct_answers) +``` +This contains the actual CAPTCHA logic. Now it needs to be wired up with the +listinfo page (where the subscription form is shown to the user) and the +subscription page (where the subscription form is submitted to). + +Here is the patch for `/usr/lib/mailman/Mailman/Cgi/listinfo.py`: +``` +--- listinfo.py.orig 2018-06-03 19:18:30.089902948 +0200 ++++ listinfo.py 2018-06-10 19:12:59.381910750 +0200 +@@ -26,6 +26,7 @@ + + from Mailman import mm_cfg + from Mailman import Utils ++from Mailman import Captcha + from Mailman import MailList + from Mailman import Errors + from Mailman import i18n +@@ -216,10 +220,16 @@ + # drop one : resulting in an invalid format, but it's only + # for our hash so it doesn't matter. + remote = remote.rsplit(':', 1)[0] ++ # get CAPTCHA data ++ (captcha_question, captcha_box, captcha_idx) = Captcha.display(mlist, mm_cfg.CAPTCHAS) ++ replacements[''] = captcha_question ++ replacements[''] = captcha_box ++ # fill form + replacements[''] += ( +- '\n' +- % (now, Utils.sha_new(mm_cfg.SUBSCRIBE_FORM_SECRET + ++ '\n' ++ % (now, captcha_idx, Utils.sha_new(mm_cfg.SUBSCRIBE_FORM_SECRET + + now + ++ captcha_idx + + mlist.internal_name() + + remote + ).hexdigest() +``` +And here the patch for `/usr/lib/mailman/Mailman/Cgi/subscribe.py`: +``` +--- subscribe.py.orig 2018-06-03 19:18:35.761813517 +0200 ++++ subscribe.py 2018-06-03 20:35:00.056454989 +0200 +@@ -25,6 +25,7 @@ + + from Mailman import mm_cfg + from Mailman import Utils ++from Mailman import Captcha + from Mailman import MailList + from Mailman import Errors + from Mailman import i18n +@@ -144,13 +147,14 @@ + # for our hash so it doesn't matter. 
+ remote1 = remote.rsplit(':', 1)[0] + try: +- ftime, fhash = cgidata.getvalue('sub_form_token', '').split(':') ++ ftime, fcaptcha_idx, fhash = cgidata.getvalue('sub_form_token', '').split(':') + then = int(ftime) + except ValueError: +- ftime = fhash = '' ++ ftime = fcaptcha_idx = fhash = '' + then = 0 + token = Utils.sha_new(mm_cfg.SUBSCRIBE_FORM_SECRET + + ftime + ++ fcaptcha_idx + + mlist.internal_name() + + remote1).hexdigest() + if ftime and now - then > mm_cfg.FORM_LIFETIME: +@@ -165,6 +169,10 @@ + results.append( + _('There was no hidden token in your submission or it was corrupted.')) + results.append(_('You must GET the form before submitting it.')) ++ # Check captcha ++ captcha_answer = cgidata.getvalue('captcha_answer', '') ++ if not Captcha.verify(fcaptcha_idx, captcha_answer, mm_cfg.CAPTCHAS): ++ results.append(_('This was not the right answer to the CAPTCHA question.')) + # Was an attempt made to subscribe the list to itself? + if email == mlist.GetListEmail(): + syslog('mischief', 'Attempt to self subscribe %s: %s', email, remote) +``` + +Finally, the HTML template for the listinfo page needs to be updated to show the CAPTCHA question and answer box. +On Debian, the templates for enabled languages are located in `/etc/mailman/`. +The patch for the English template looks as follows: +``` +--- /usr/share/mailman/en/listinfo.html 2018-02-08 07:54:28.000000000 +0100 ++++ listinfo.html 2018-06-03 20:35:10.680275026 +0200 +@@ -116,6 +116,12 @@ + + + ++ Please answer the following question to prove that you are not a bot: ++ ++ ++ ++ ++ + +
+ +``` + +If you have other languages enabled, you have to translate this patch +accordingly. + +That's it! Now bots have to be adapted to your specific questions to be able to +successfully subscribe someone. It is still a good idea to monitor the logs +(`/var/log/mailman/subscribe` on Debian) to see if any illegitimate requests +still make it through, but unless you site is really big I'd be rather surprised +to see bots being able to answer site-specific questions. -- 2.30.2 From 9546fb75c0c042ab939f9dcca41de0b8c2be3ff7 Mon Sep 17 00:00:00 2001 From: Ralf Jung Date: Sat, 16 Jun 2018 22:13:21 +0200 Subject: [PATCH 02/16] prettify iptables rules --- personal/_posts/2018-05-28-cloudless-contact-sync.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/personal/_posts/2018-05-28-cloudless-contact-sync.md b/personal/_posts/2018-05-28-cloudless-contact-sync.md index 9b3e190..baced47 100644 --- a/personal/_posts/2018-05-28-cloudless-contact-sync.md +++ b/personal/_posts/2018-05-28-cloudless-contact-sync.md @@ -50,10 +50,10 @@ table nat { ``` The plain iptables equivalent is ``` --A PREROUTING -d $IP/32 -p tcp -m tcp --dport 80 -j DNAT --to-destination $IP:8053 --A PREROUTING -d $IP/32 -p tcp -m tcp --dport 443 -j DNAT --to-destination $IP:44353 --A OUTPUT -d $IP/32 -p tcp -m tcp --dport 80 -j DNAT --to-destination $IP:8053 --A OUTPUT -d $IP/32 -p tcp -m tcp --dport 443 -j DNAT --to-destination $IP:44353 +-A PREROUTING -d $IP -p tcp -m tcp --dport 80 -j DNAT --to-destination $IP:8053 +-A PREROUTING -d $IP -p tcp -m tcp --dport 443 -j DNAT --to-destination $IP:44353 +-A OUTPUT -d $IP -p tcp -m tcp --dport 80 -j DNAT --to-destination $IP:8053 +-A OUTPUT -d $IP -p tcp -m tcp --dport 443 -j DNAT --to-destination $IP:44353 ``` Next, we have to configure the SSH daemon to permit reverse port forwarding to be configured by the client. -- 2.30.2 From b1c4213e3195e115c64e675254c35f8a13f90b5a Mon Sep 17 00:00:00 2001 From: Ralf Jung Date: Sun, 8 Jul 2018 15:23:04 +0200 Subject: [PATCH 03/16] add MoSeL paper --- research/publications.html | 7 +++++++ 1 file changed, 7 insertions(+) diff --git a/research/publications.html b/research/publications.html index b54e8de..5b30140 100644 --- a/research/publications.html +++ b/research/publications.html @@ -13,6 +13,13 @@ slug: Publications

 2018
 
+
+  • MoSeL: A General, Extensible Modal Framework for Interactive Proofs in Separation Logic
+    Robbert Krebbers, Jacques-Henri Jourdan, Ralf Jung, Joseph Tassarotti, Jan-Oliver Kaiser, Amin Timany, Arthur Charguéraud, Derek Dreyer
+    In ICFP 2018: 23rd ACM SIGPLAN International Conference on Functional Programming
+    [paper] [paper website (incl. appendix and Coq formalization)]
+
  • RustBelt: Securing the Foundations of the Rust Programming Language
    Ralf Jung, Jacques-Henri Jourdan, Robbert Krebbers, Derek Dreyer
    -- 2.30.2 From 6cd38e47c53388a006b090c203f8c9791bea117d Mon Sep 17 00:00:00 2001 From: Ralf Jung Date: Wed, 11 Jul 2018 12:42:06 +0200 Subject: [PATCH 04/16] blog post: back at mozilla --- .../_posts/2018-07-11-research-assistant.md | 18 ++++++++++++++++++ 1 file changed, 18 insertions(+) create mode 100644 personal/_posts/2018-07-11-research-assistant.md diff --git a/personal/_posts/2018-07-11-research-assistant.md b/personal/_posts/2018-07-11-research-assistant.md new file mode 100644 index 0000000..7ff0d91 --- /dev/null +++ b/personal/_posts/2018-07-11-research-assistant.md @@ -0,0 +1,18 @@ +--- +title: "Back at Mozilla" +categories: internship rust +--- + +After my [internship last year has ended]({{ site.baseurl }}{% post_url 2017-08-12-internship-ending %}), I naturally became somewhat less active in the Rust community as I could not work on Rust full-time any more. +Well, for the following months I am going to be back full-time. :) +Thanks to @aturon, I am able to work as a research assistant for Mozilla during this summer (until the end of November). +I don't really know what a "research assistant" is, but I am going to continue the work on Rust memory models, and hopefully also have some time to make progress on `union` semantics. + + + +After exploring [a "validity"-based model]({{ site.baseurl }}{% post_url 2017-08-11-types-as-contracts-evaluation %}) last year, I am going to be looking at an "alias"-based model this year. +That's the kind of model @arielb1, @ubsan and others have been proposing, and I am going to build on top of their work and hopefully come up with something we can actually implement a checker for +(staying true to the [vision I laid out previously]({{ site.baseurl }}{% post_url 2017-06-06-MIR-semantics %}) that we should have an executable operational semantics for MIR, including its [undefined behavior]({{ site.baseurl }}{% post_url 2017-07-14-undefined-behavior %})). +Expect a blog post soon for what I mean by "validity"-based vs. "alias"-based, and for a first draft of such an "alias"-based model. + +Until then, stay tuned! 
-- 2.30.2 From 35f2a13f010d857fc7273d5e295bcdde03d53db9 Mon Sep 17 00:00:00 2001 From: Ralf Jung Date: Fri, 13 Jul 2018 17:19:49 +0200 Subject: [PATCH 05/16] new post on Arc bug --- personal/_posts/2017-06-09-mutexguard-sync.md | 2 +- .../_posts/2018-07-13-arc-synchronization.md | 188 ++++++++++++++++++ 2 files changed, 189 insertions(+), 1 deletion(-) create mode 100644 personal/_posts/2018-07-13-arc-synchronization.md diff --git a/personal/_posts/2017-06-09-mutexguard-sync.md b/personal/_posts/2017-06-09-mutexguard-sync.md index 62ac7de..0d12ee8 100644 --- a/personal/_posts/2017-06-09-mutexguard-sync.md +++ b/personal/_posts/2017-06-09-mutexguard-sync.md @@ -1,6 +1,6 @@ --- title: How MutexGuard was Sync When It Should Not Have Been -categories: rust +categories: rust research reddit: /rust/comments/6gavfe/how_mutexguard_was_sync_when_it_should_not_have/ --- diff --git a/personal/_posts/2018-07-13-arc-synchronization.md b/personal/_posts/2018-07-13-arc-synchronization.md new file mode 100644 index 0000000..dd1e490 --- /dev/null +++ b/personal/_posts/2018-07-13-arc-synchronization.md @@ -0,0 +1,188 @@ +--- +title: "The Tale of a Bug in Arc: Sychronization and Data Races" +categories: rust research +--- + +While I was busy doing Rust-unelated research, [RustBelt](https://plv.mpi-sws.org/rustbelt/) continues to move and recently found another bug (after [a missing `impl !Sync` that we found previously]({{ site.baseurl }}{% post_url 2017-06-09-mutexguard-sync %})): It turns out that `Arc::get_mut` did [not perform sufficient synchronization](https://github.com/rust-lang/rust/issues/51780), leading to a data race. + +Notice that I am just the messenger here, the bug was actually found by [Hai](https://people.mpi-sws.org/~haidang/) and [Jacques-Henri](https://jhjourdan.mketjh.fr/). +Still, I'd like to use this opportunity to talk a bit about weak memory, synchronization and data races. +This is just a primer, there are tons of resources on the web that go into more detail (for example [here](http://preshing.com/20120913/acquire-and-release-semantics/)). + + + +## Synchronization and Data Races + +Consider the following example (full of unsafe code because I am demonstrating behavior that Rust does its best to rule out): +{% highlight rust %} +extern crate rayon; + +static mut DATA : Option = None; +static mut FLAG : usize = 0; + +fn main() { unsafe { + rayon::join( + || { + // Initialize data + DATA = Some("Hello World!".to_string()); + // Signal that we are done + FLAG = 1; + }, + || { + // Wait until the other thread is done + while FLAG == 0 {} + // Print data + println!("{}", DATA.as_ref().unwrap()); + } + ); +} } +{% endhighlight %} +The question is: What will this program print? +If you are thinking "Hello World!", you are not wrong, but you are not fully right either. +"Hello World!" is one possibility, but this program can also panic at the `unwrap` and in fact it can do anything it wants, because it has undefined behavior. + +### Weak Memory + +But let us first talk about the possibility of a panic. +How can that happen? +The reason is that [accessing memory is slow](https://gist.github.com/jboner/2841832), at least compared to other operations like arithmetic or accessing a register. +Both compilers and CPUs do everything they can to make sure the program does not wait for memory. +For example, your compiler might decide that it is advantageous to reorder the two instructions in the first thread, putting the `FLAG = 1` *before* the `DATA = Some(...)`. 
+These are writes to two distinct memory locations, so reordering is fine as the overall result is the same -- right? + +Well, that was true when all programs were single-threaded, but with a concurrent program, another thread can actually tell that the writes are now happening in a different order! +That's how the second thread might see `FLAG = 1` but still read `DATA = NONE`. +Moreover, even if the compiler didn't do anything like that, we could *still* see a result of "0" if you are running this program on an ARM chip. +These chips are very aggressive in how they cache memory accesses, with the result that even if you write a program like the above in assembly, you may still be up for a surprise. + +We call memory that can exhibit such strange reordering effects "weak memory" because it is weaker in the sense that it provides fewer guarantees than one might naively expect. + +### Data Races + +But why does the program have undefined behavior? +The reason is that Rust has tons of primitives whose behavior is essentially impossible to specify once concurrency gets involves. +Consider the assignment `DATA = Some(...)` above, where an entire `Option` is being copied. +What would happen if some other thread reads `DATA` while the assignment is still in progress? +The other thread would see some kind of mixture between the old and the new `DATA`! +That's just complete garbage. +So systems programming languages essentially give up at this point and declare *data races* to be undefined behavior. + +A data race is defined as follows: + +> Two accesses to the same location form a data race if: +> 1. at least one of them is a write, and +> 2. at least one of them is not a special *atomic* access, and +> 3. the accesses are not *ordered*. + +We will talk about the "order" mentioned in condition (3) later. +For now, let us look at "atomic accesses". + +### Atomic Memory Accesses + +In our program above, the reads and writes to `FLAG` satisfy all three conditions. +To fix that, we will negate condition (2): We will use *atomic accesses*. + +All usual accesses (like all the ones in our example code above) are so called *non-atomic* accesses. +They can be compiled efficiently and are heavily optimized, but as we have seen they quickly cause trouble when writing low-level concurrent code. +To make it possible to write such code, but still keep most of the existing optimizations, the C++11 standard introduced special functions that can be used to perform *atomic* read and write accesses to a location. +We also have these atomic accesses in Rust, and we can use them in our program as follows: +{% highlight rust %} +extern crate rayon; + +use std::sync::atomic::{AtomicUsize, Ordering}; + +static mut DATA : Option = None; +static FLAG : AtomicUsize = AtomicUsize::new(0); + +fn main() { + rayon::join( + || { + // Initialize data + unsafe { DATA = Some("Hello World!".to_string()); } + // Signal that we are done + FLAG.store(1, Ordering::Relaxed); }, + || { + // Wait until the other thread is done + while FLAG.load(Ordering::Relaxed) == 0 {} + // Print data + println!("{}", unsafe { DATA.as_ref().unwrap() }); + } + ); +} +{% endhighlight %} +Notice how we are using `AtomicUsize` as type for `FLAG`, and `AtomicUsize::load` and `AtomicUsize::store` instead of non-atomic memory accesses. +These are atomic accesses, and hence we no longer have a data race on `FLAG`. +(I will come back to this `Ordering::Relaxed` parameter soon.) 
+Notice also that we significantly reduced the amount of unsafe operations, because `AtomucUsize` is actually completely safe to use from multiple threads. + +### Synchronization + +However, we are not done yet: We still have a race on `DATA`! +We also cannot use atomic accesses for `DATA` because of its type, `Option`, which is just too big to be read/written atomically. +This brings us to the third condition in our definition of data races: We have to somehow "order" the two accesses to `DATA`. + +So, what is that "order" about? +The idea is that while in general, operations on a concurrent program can happen in parallel, with strange effects like our first program panicking, there are still *some* operations that we can rely on to be properly "ordered". +For example, all operations within a single thread are ordered the way they are written, so e.g. `DATA = Some(...)` is ordered before `FLAG.store`. +However, what is missing in the program above is some way to obtain an order between operations on *different threads*. +This is where the `Ordering::Relaxed` parameter comes in: All atomic accesses come with an order mode that indicates under which circumstances this access will induce an order between operations. +Our accesses above are `Relaxed`, which means no order is induced. + +So, what we will have to do is to strengthen the mode of our `FLAG` accesses to induce an order between the write of `1` in the first thread, and the operation that reads `1` in the second thread. +To this end, we use the `Release` and `Acquire` pair of modes: +{% highlight rust %} +extern crate rayon; + +use std::sync::atomic::{AtomicUsize, Ordering}; + +static mut DATA : Option = None; +static FLAG : AtomicUsize = AtomicUsize::new(0); + +fn main() { + rayon::join( + || { + // Initialize data + unsafe { DATA = Some("Hello World!".to_string()); } + // Signal that we are done + FLAG.store(1, Ordering::Release); }, + || { + // Wait until the other thread is done + while FLAG.load(Ordering::Acquire) == 0 {} + // Print data + println!("{}", unsafe { DATA.as_ref().unwrap() }); + } + ); +} +{% endhighlight %} +This program, finally, is free of data races and hence has no undefined behavior and is guaranteed to print "Hello World!". + +They key point is that *when an `Acquire` read reads from a `Release` write, an order is induced between these two accesses*. +We also say that the two accesses *synchronize*. +Together with the per-thread order, this means we have an order between the `DATA = Some(...)` in the first thread and the load of `DATA` in the second thread, through the accesses to `FLAG`. +Intuitively, the `Release` write in the first thread "releases" everything that has been changed by this thread so far, and the `Acquire` read in the second thread then "acquires" all that data and makes it accessible for access later in this thread. + +Now, most of the time a `Mutex` or `RwLock` is good enough to protect your data against concurrent accesses, so you do not have to worry about such details. +(And, thanks to Rust thread safety guarantees, you never have to worry about such details in safe code!) +But based on what you learned so far, it should make perfect sense that when a lock is released by thread A, that will happen through a `Release` access, and when a lock is acquired by thread B, that happens through an `Acquire` access. 
+This way, the lock makes sure that there is an order between the accesses thread A performed when it held the lock (before the `Release`), and the accesses thread B will perform while it has the lock (after the `Acquire`). + +## Coming Back to `Arc` + +I said that `Mutex`/`RwLock` are good enough *most of the time*. +However, `Arc` is one of those cases where the overhead induced by an exclusive lock is just way too big, so it is worth using a more optimized, unsafe implementation. +As such, you are going to find plenty of atomic accesses in [the source code of `Arc`](https://github.com/rust-lang/rust/blob/c0955a34bcb17f0b31d7b86522a520ebe7fa93ac/src/liballoc/sync.rs#L201). + +And it turns out, as Hai and Jacques-Henri noticed when attempting to prove correctness of [`Arc::get_mut`](https://doc.rust-lang.org/beta/std/sync/struct.Arc.html#method.get_mut), that there is one place where `Relaxed` as used as an ordering, [but it really should have been `Acquire`](https://github.com/rust-lang/rust/pull/52031). +Discussing the exact details of the bug would probably fill another blog post (`Arc` is *really* subtle), but the high-level story is exactly like in our example above: Thanks to `Acquire`, an ordering is induced between the code that follows the `get_mut` and the code in another thread that dropped the last [`Weak`](https://doc.rust-lang.org/beta/std/sync/struct.Weak.html) reference. +With `Relaxed`, no such ordering is induced, so we have a data race. +To be fair, it is very unlikely that this race could lead to real misbehavior, but I am still happy to know that we now have a proof that `Arc` is mostly[^1] correctly synchronized. + +[^1]: "Mostly", you may wonder? Unfortunately it turns out that there is [one `Relaxed` access in `make_mut`](https://github.com/rust-lang/rust/blob/c0955a34bcb17f0b31d7b86522a520ebe7fa93ac/src/liballoc/sync.rs#L793) that Hai and Jacques-Henri have not yet been able to prove correct. However, this is likely fixable by making the entire proof more complicated. The version where that `Relaxed` is also replaced by `Acquire` *has* been proven correct. + +One last thing: +I have previously claimed that our [first RustBelt paper]({{ site.baseurl }}{% post_url 2017-07-08-rustbelt %}) already verified correctness of `Arc`, how can there still be bugs? +In that paper, we did not (yet) have the tools to reason realistically about these ordering issues we have been discussing here, so instead we worked with a "sequentially consistent" logic which essentially assumes the strongest possible mode, `SeqCst`, for all atomic accesses. +(We did not discuss `SeqCst` above, and in fact there are not many cases where it is really needed. [`Release` and `Acquire` are enough most of the time](https://internals.rust-lang.org/t/sync-once-per-instance/7918/8?u=ralfjung).) +This is one of the simplifications we made compared to real Rust to make the verification feasible. +We were realistic enough to find [another bug]({{ site.baseurl }}{% post_url 2017-06-09-mutexguard-sync %}), but not realistic enough for this one. +Hai and Jacques-Henri are currently working on remedying this particular simplification by extending the first RustBelt paper to also cover weak memory, and that's when they ran into this problem. 
-- 2.30.2 From 75f0adade98617f9d0059184960f85b6cb8b3bb0 Mon Sep 17 00:00:00 2001 From: Ralf Jung Date: Fri, 13 Jul 2018 17:21:38 +0200 Subject: [PATCH 06/16] add reddit URL, fix typo --- personal/_posts/2018-07-13-arc-synchronization.md | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/personal/_posts/2018-07-13-arc-synchronization.md b/personal/_posts/2018-07-13-arc-synchronization.md index dd1e490..7c78cd0 100644 --- a/personal/_posts/2018-07-13-arc-synchronization.md +++ b/personal/_posts/2018-07-13-arc-synchronization.md @@ -1,6 +1,7 @@ --- -title: "The Tale of a Bug in Arc: Sychronization and Data Races" +title: "The Tale of a Bug in Arc: Synchronization and Data Races" categories: rust research +reddit: https://www.reddit.com/r/rust/comments/8ykuhv/the_tale_of_a_bug_in_arc_synchronization_and_data/ --- While I was busy doing Rust-unelated research, [RustBelt](https://plv.mpi-sws.org/rustbelt/) continues to move and recently found another bug (after [a missing `impl !Sync` that we found previously]({{ site.baseurl }}{% post_url 2017-06-09-mutexguard-sync %})): It turns out that `Arc::get_mut` did [not perform sufficient synchronization](https://github.com/rust-lang/rust/issues/51780), leading to a data race. -- 2.30.2 From 6b193cf5f19077de20c195bc0be6cf0b4e7719e0 Mon Sep 17 00:00:00 2001 From: Ralf Jung Date: Fri, 13 Jul 2018 18:25:42 +0200 Subject: [PATCH 07/16] extend footnote --- personal/_posts/2018-07-13-arc-synchronization.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/personal/_posts/2018-07-13-arc-synchronization.md b/personal/_posts/2018-07-13-arc-synchronization.md index 7c78cd0..37f3a31 100644 --- a/personal/_posts/2018-07-13-arc-synchronization.md +++ b/personal/_posts/2018-07-13-arc-synchronization.md @@ -178,7 +178,7 @@ Discussing the exact details of the bug would probably fill another blog post (` With `Relaxed`, no such ordering is induced, so we have a data race. To be fair, it is very unlikely that this race could lead to real misbehavior, but I am still happy to know that we now have a proof that `Arc` is mostly[^1] correctly synchronized. -[^1]: "Mostly", you may wonder? Unfortunately it turns out that there is [one `Relaxed` access in `make_mut`](https://github.com/rust-lang/rust/blob/c0955a34bcb17f0b31d7b86522a520ebe7fa93ac/src/liballoc/sync.rs#L793) that Hai and Jacques-Henri have not yet been able to prove correct. However, this is likely fixable by making the entire proof more complicated. The version where that `Relaxed` is also replaced by `Acquire` *has* been proven correct. +[^1]: "Mostly", you may wonder? Unfortunately it turns out that there is [one `Relaxed` access in `make_mut`](https://github.com/rust-lang/rust/blob/c0955a34bcb17f0b31d7b86522a520ebe7fa93ac/src/liballoc/sync.rs#L793) that Hai and Jacques-Henri have not yet been able to prove correct. However, this is likely fixable by making the entire proof more complicated. The version where that `Relaxed` is also replaced by `Acquire` *has* been proven correct, albeit still against a model of relaxed memory accesses that is not quite as weak as C11. One last thing: I have previously claimed that our [first RustBelt paper]({{ site.baseurl }}{% post_url 2017-07-08-rustbelt %}) already verified correctness of `Arc`, how can there still be bugs? 
-- 2.30.2 From b01f07d6188bf279dd0d281f952d15deb1217940 Mon Sep 17 00:00:00 2001 From: Ralf Jung Date: Fri, 13 Jul 2018 19:49:34 +0200 Subject: [PATCH 08/16] fix typos --- personal/_posts/2018-07-13-arc-synchronization.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/personal/_posts/2018-07-13-arc-synchronization.md b/personal/_posts/2018-07-13-arc-synchronization.md index 37f3a31..0994c3f 100644 --- a/personal/_posts/2018-07-13-arc-synchronization.md +++ b/personal/_posts/2018-07-13-arc-synchronization.md @@ -1,10 +1,10 @@ --- title: "The Tale of a Bug in Arc: Synchronization and Data Races" categories: rust research -reddit: https://www.reddit.com/r/rust/comments/8ykuhv/the_tale_of_a_bug_in_arc_synchronization_and_data/ +reddit: /rust/comments/8ykuhv/the_tale_of_a_bug_in_arc_synchronization_and_data/ --- -While I was busy doing Rust-unelated research, [RustBelt](https://plv.mpi-sws.org/rustbelt/) continues to move and recently found another bug (after [a missing `impl !Sync` that we found previously]({{ site.baseurl }}{% post_url 2017-06-09-mutexguard-sync %})): It turns out that `Arc::get_mut` did [not perform sufficient synchronization](https://github.com/rust-lang/rust/issues/51780), leading to a data race. +While I was busy doing Rust-unrelated research, [RustBelt](https://plv.mpi-sws.org/rustbelt/) continues to move and recently found another bug (after [a missing `impl !Sync` that we found previously]({{ site.baseurl }}{% post_url 2017-06-09-mutexguard-sync %})): It turns out that `Arc::get_mut` did [not perform sufficient synchronization](https://github.com/rust-lang/rust/issues/51780), leading to a data race. Notice that I am just the messenger here, the bug was actually found by [Hai](https://people.mpi-sws.org/~haidang/) and [Jacques-Henri](https://jhjourdan.mketjh.fr/). Still, I'd like to use this opportunity to talk a bit about weak memory, synchronization and data races. -- 2.30.2 From bd8c5679ff17bf58d097009c8508859c3bb32eb2 Mon Sep 17 00:00:00 2001 From: Ralf Jung Date: Fri, 13 Jul 2018 23:47:40 +0200 Subject: [PATCH 09/16] fix Arc issue description --- personal/_posts/2018-07-13-arc-synchronization.md | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/personal/_posts/2018-07-13-arc-synchronization.md b/personal/_posts/2018-07-13-arc-synchronization.md index 0994c3f..50a518c 100644 --- a/personal/_posts/2018-07-13-arc-synchronization.md +++ b/personal/_posts/2018-07-13-arc-synchronization.md @@ -174,7 +174,8 @@ However, `Arc` is one of those cases where the overhead induced by an exclusive As such, you are going to find plenty of atomic accesses in [the source code of `Arc`](https://github.com/rust-lang/rust/blob/c0955a34bcb17f0b31d7b86522a520ebe7fa93ac/src/liballoc/sync.rs#L201). And it turns out, as Hai and Jacques-Henri noticed when attempting to prove correctness of [`Arc::get_mut`](https://doc.rust-lang.org/beta/std/sync/struct.Arc.html#method.get_mut), that there is one place where `Relaxed` as used as an ordering, [but it really should have been `Acquire`](https://github.com/rust-lang/rust/pull/52031). -Discussing the exact details of the bug would probably fill another blog post (`Arc` is *really* subtle), but the high-level story is exactly like in our example above: Thanks to `Acquire`, an ordering is induced between the code that follows the `get_mut` and the code in another thread that dropped the last [`Weak`](https://doc.rust-lang.org/beta/std/sync/struct.Weak.html) reference. 
+Discussing the exact details of the bug would probably fill another blog post (`Arc` is *really* subtle), but the high-level story is exactly like in our example above: Thanks to `Acquire`, an ordering is induced between the code that follows the `get_mut` and the code in another thread that dropped the last other `Arc`, decrementing the reference count to 1. +The PR that fixed the problem contains [some more details in the comments](https://github.com/rust-lang/rust/pull/52031/files). With `Relaxed`, no such ordering is induced, so we have a data race. To be fair, it is very unlikely that this race could lead to real misbehavior, but I am still happy to know that we now have a proof that `Arc` is mostly[^1] correctly synchronized. -- 2.30.2 From 0e7d706dad6077812bd049c39459d35be6783f90 Mon Sep 17 00:00:00 2001 From: Ralf Jung Date: Mon, 16 Jul 2018 10:47:27 +0200 Subject: [PATCH 10/16] avoid 'read reads' --- personal/_posts/2018-07-13-arc-synchronization.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/personal/_posts/2018-07-13-arc-synchronization.md b/personal/_posts/2018-07-13-arc-synchronization.md index 50a518c..1ad8a46 100644 --- a/personal/_posts/2018-07-13-arc-synchronization.md +++ b/personal/_posts/2018-07-13-arc-synchronization.md @@ -157,10 +157,10 @@ fn main() { {% endhighlight %} This program, finally, is free of data races and hence has no undefined behavior and is guaranteed to print "Hello World!". -They key point is that *when an `Acquire` read reads from a `Release` write, an order is induced between these two accesses*. +They key point is that *when a `load(Acquire)` reads from a `store(_, Release)`, an order is induced between these two accesses*. We also say that the two accesses *synchronize*. Together with the per-thread order, this means we have an order between the `DATA = Some(...)` in the first thread and the load of `DATA` in the second thread, through the accesses to `FLAG`. -Intuitively, the `Release` write in the first thread "releases" everything that has been changed by this thread so far, and the `Acquire` read in the second thread then "acquires" all that data and makes it accessible for access later in this thread. +Intuitively, the `store(_, Release)` in the first thread "releases" everything that has been changed by this thread so far, and the `load(Acquire)` in the second thread then "acquires" all that data and makes it accessible for access later in this thread. Now, most of the time a `Mutex` or `RwLock` is good enough to protect your data against concurrent accesses, so you do not have to worry about such details. (And, thanks to Rust thread safety guarantees, you never have to worry about such details in safe code!) -- 2.30.2 From 167326ade86d8663f6440cc930b65ea5f32e66b9 Mon Sep 17 00:00:00 2001 From: Ralf Jung Date: Mon, 16 Jul 2018 15:34:54 +0200 Subject: [PATCH 11/16] link to Servo's copy of the same code --- personal/_posts/2018-07-13-arc-synchronization.md | 2 ++ 1 file changed, 2 insertions(+) diff --git a/personal/_posts/2018-07-13-arc-synchronization.md b/personal/_posts/2018-07-13-arc-synchronization.md index 1ad8a46..2e8f7ea 100644 --- a/personal/_posts/2018-07-13-arc-synchronization.md +++ b/personal/_posts/2018-07-13-arc-synchronization.md @@ -188,3 +188,5 @@ In that paper, we did not (yet) have the tools to reason realistically about the This is one of the simplifications we made compared to real Rust to make the verification feasible. 
We were realistic enough to find [another bug]({{ site.baseurl }}{% post_url 2017-06-09-mutexguard-sync %}), but not realistic enough for this one. Hai and Jacques-Henri are currently working on remedying this particular simplification by extending the first RustBelt paper to also cover weak memory, and that's when they ran into this problem. + +**Update:** Turns out Servo has a [copy of `Arc`](https://doc.servo.org/servo_arc/index.html) that [has the same problem](https://github.com/servo/servo/issues/21186). So we got two bugs for the price of one. :) **/Update** -- 2.30.2 From 0c9f308229040f1416676ae704db9dc6ab332e39 Mon Sep 17 00:00:00 2001 From: Ralf Jung Date: Tue, 17 Jul 2018 20:36:16 +0200 Subject: [PATCH 12/16] improve old blog post title --- personal/_posts/2017-05-23-internship-starting.md | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/personal/_posts/2017-05-23-internship-starting.md b/personal/_posts/2017-05-23-internship-starting.md index c828316..149b4f3 100644 --- a/personal/_posts/2017-05-23-internship-starting.md +++ b/personal/_posts/2017-05-23-internship-starting.md @@ -1,5 +1,5 @@ --- -title: "Day 1 of My Mozilla Internship, or: How to Specify Program Behavior?" +title: "How to Specify Program (Undefined) Behavior?" categories: internship rust --- @@ -7,6 +7,8 @@ This summer, I am given the awesome opportunity of spending three months in the I am extremely grateful that Mozilla is providing this opportunity to me; hopefully I can give something back by making some hands-on contributions to the Rust ecosystem. +**Update:** This post was originally titled "Day 1 of My Mozilla Internship, or: How to Specify Program Behavior?" but I feel that did not really reflect its content very well. **/Update** + Today is day 1 of my internship -- at least if you start counting at 0, which of course we all do. I meant to write this yesterday, but the morning was filled with bureaucratics and in the afternoon I got so jetlagged, I essentially was sleeping with my eyes open. (But don't tell Aaron, my manager, or Niko, my mentor -- I was of course supposed to be working. ;) -- 2.30.2 From 225745df642a95419064ace9004bffe5d1ba25de Mon Sep 17 00:00:00 2001 From: Ralf Jung Date: Thu, 19 Jul 2018 18:11:02 +0200 Subject: [PATCH 13/16] post on CTFE --- personal/_posts/2018-07-19-const.md | 234 ++++++++++++++++++++++++++++ 1 file changed, 234 insertions(+) create mode 100644 personal/_posts/2018-07-19-const.md diff --git a/personal/_posts/2018-07-19-const.md b/personal/_posts/2018-07-19-const.md new file mode 100644 index 0000000..ee1eab3 --- /dev/null +++ b/personal/_posts/2018-07-19-const.md @@ -0,0 +1,234 @@ +--- +title: "Thoughts on Compile-Time Function Evaluation and Type Systems" +categories: internship rust +--- + +For some time now (since the 1.26 release, to be precise), Rust has a [very powerful machinery for CTFE](https://github.com/rust-lang/rust/pull/46882), or compile-time function evaluation. +Since then, there have been various discussions about which operations should be allowed during CTFE, which checks the compiler should do, and which kinds of guarantees we should be able to expect around CTFE. +This post is my take on those topics, and it should not be surprising that I am going to take a very type-system centric view. :) +Expect something like a structured brain dump, so there are some unanswered questions towards the end as well. 
+ + + +## Some Background + +CTFE is the mechanism used by the compiler, primarily, to evaluate items like `const x : T = ...;`. +The `...` here is going to be Rust code that must be "run" at compile-time, because it can be used as a constant in the code -- for example, it can be used for array lengths. + +Notice that CTFE is *not* the same as constant propagation: Constant propagation is an optimization pass done by LLVM that will opportunistically change code like `3 + 4` into `7` to avoid run-time work. +Being an optimization, constant propagation must, by definition, not change program behavior and will not be observable at all (other than performance). +CTFE, on the other hand, is about code that *must* be executed at compile-time because the compiler needs to know its result to proceed -- for example, it needs to know the size of an array to compute how to lay out data in memory. +You can statically see, just from the syntax of the code, whether CTFE applies to some piece of code or not: +CTFE is only used in places like the value of a `const` or the length of an array. +{% highlight rust %} +fn demo() { + const X : u32 = 3 + 4; // CTFE + let x : u32 = 4 + 3; // no CTFE (but maybe constant propagation) +} +{% endhighlight %} +We say that the `3 + 4` above is in *const context* and hence subject to CTFE, but the `4 + 3` is not. + +## Const Safety + +Not all operations can be used in const context. +For example, it makes no sense to compute your array length as "please go read that file from disk and compute something" -- we can't know what will be on the disk when the program actually runs. +We could use the disk of the machine compiling the program, but that does not sound very appearling either. +In fact, it would also be grossly unsafe: +When computing the length of an array twice, it is important that we obtain the same result. + +> *CTFE must be deterministic.* + +If not, the compiler could end up thinking that two arrays have the same length, but then later compute different layouts. +That would be a disaster. +So, any kind of external input and any kind of non-determinism is a complete no-go for CTFE. +This does not just concern I/O, even converting a reference to a `usize` is not deterministic. + +The compiler will throw a CTFE error if such an operation is ever attempted to be executed. +Those programs that *are* executable in const context are called *const safe*: + +> *A program is const safe if it can be executed by CTFE without hitting an error (panics are allowed).* + +This is very much in analogy with the idea that a *safe* (or *run-time safe*, to distinguish it from const safe) program is a program that does not cause any memory errors or data races. +In fact, we will see that this analogy between "programs that are well-behaved under CTFE" (const safety) and "programs that do not cause UB" (run-time safety) can carry us very far. + +One very interesting question now is whether some given function `foo` should be allowed to be called in const context. +We could just always say "yes", and rely on the fact that CTFE will throw an error when `foo` does anything fishy. +The problem with this approach is that, if `foo` is in a library, updating the library might change `foo` in a way that makes it no longer const-safe. +In other words, making *any* function not const-safe any more would be a semver violation because it could break downstream crates. 
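+As a concrete illustration of that hazard (the crate and function names below are made up), suppose downstream code were allowed to write:
+{% highlight rust %}
+// Downstream crate -- only legal if arbitrary calls were allowed in const context:
+const BUF_LEN: usize = upstream::block_size();
+static BUFFER: [u8; BUF_LEN] = [0; BUF_LEN];
+{% endhighlight %}
+If `upstream::block_size` started out as `fn block_size() -> usize { 4096 }`, it would happen to be const-safe and this would compile.
+But if a later, otherwise semver-compatible release changed `block_size` to read a configuration file, the downstream crate would suddenly fail to compile, even though no signature changed.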
+ +The typical mechanism to solve that problem is to have an annotation that explicitly marks a function as "usable in const context". +In Rust, the proposed mechanism for this purpose is [`const fn`](https://github.com/rust-lang/rust/issues/24111); in C++ it is called `constexpr`. +The compiler can now reject calling non-`const` functions in const context, so library authors can add non-const-safe operations without breaking semver. + +## Const Type System and Const Soundness + +This leads us to the interesting situation that the compiler will reject code in const context that it would accept just fine outside of const context. +In particular, the body of a `const fn` is *also* considered to be in const context -- otherwise, if we allowed calling arbitrary functions, we would have the same problem again. +One useful way to think about this is that we have a second type system, a "const type system", that is used to type-check code in const context. +This type system does not allow calls to non-`const` functions. + +It should probably also not allow casting a reference to an integer, because (as discussed above) that is a non-deterministic operation which cannot be performed during CTFE. +What else? + +Before we go on and add random additional checks, let us step back and think about what our goals are here. +Typically, the purpose of a type system is to establish some sort of guarantee for a well-typed program. +For Rust's "main" ("run-time") type system, that guarantee is "no undefined behavior", which means no memory errors and no data races. +What is the guarantee for our new const type system? +We have already talked about it above: It's const safety! +This leads us to the definition of const soundness: + +> *Our const type system is sound if well-typed programs are const-safe.* + +Again, notice how this is very similar to the correctness statement for the run-time type system, which guarantees run-time safety. + +However, we have to be a bit careful here. +Consider the following piece of code: +{% highlight rust %} +const fn is_eight_mod_256(x: usize) -> bool { x % 256 == 8 } +{% endhighlight %} +We will definitely want to allow this code. +Why should `==` or `%` not be const-safe? +Well, we could call our function as follows: +{% highlight rust %} +is_eight_mod_256(Box::new(0).into_raw() as usize); +{% endhighlight %} +That statement is certainly *not* const-safe as the result depends on where exactly the allocator puts our `Box`. +However, we want to blame the `as usize` for this issue, not the `is_eight_mod_256`. + +The solution is for the const type system to not just have separate rules about which operations are allowed, we also must change our notion of which values are "valid" for a given type. +An integer obtained from a pointer is valid for `usize` at run-time, but it is *not* valid for `usize` in const mode! +After all, there are basic arithmetic operations that we expect all `usize` to support, that CTFE cannot support for pointers. + +> *A function is const-safe if, when executed with const-valid arguments, it does not trigger a CTFE error and returns a const-valid result (if it returns at all).* + +Under this definition, `is_eight_mod_256` is const-safe because whenever `x` is an actual integer, it will evaluate without any error. +At the same time, this shows that converting a reference into `usize` is *not* const-safe, because the input of this operation is const-valid, but the output is not! 
+This provides a solid justification for rejecting such casts in const context. + +## CTFE correctness + + +In Rust, CTFE is performed by miri, a MIR interpreter that used to be a [separate project](https://github.com/solson/miri/) but whose core engine has been integrated into rustc. +miri will execute the code in const context step-by-step and just complain and fail with an error when an operation cannot be performed. +This does not just concern non-determinism; miri does not support everything it could support because @oli-obk is [super careful](https://github.com/rust-lang/rust/blob/5ba21844f6c85a0cd55c8ea0250d5cd758134f84/src/librustc_mir/interpret/const_eval.rs#L199) about not accidentally stabilizing behavior that should undergo an RFC. + +In fact, right now miri will reject all operations on raw pointers. +They all raise a CTFE error and hence must all be rejected by the const type system. +The plan is to change miri so that it can support more operations, but we have to be careful in doing so. +I have already mentioned that miri must be deterministic, but there is another point to consider that you might have expected to play a much more prominent role: +CTFE, at least if it succeeds, should match run-time behavior! + +> *CTFE is correct if, when it loops forever, completes with a result, or panics, that behavior matches the run-time behavior of the same code.* + +We clearly do not want code to behave differently when it lives in const context and is run by CTFE, and when it is compiled to machine-code and executed "for real". + +Or, do we? +Don't get me wrong, I am not advocating for deliberately breaking that property, but it sure is worth considering what would go wrong if miri was *not* CTFE-correct. +Maybe surprisingly, it turns out that this would not be a soundness issue! +All we care about for the purpose of soundness is for CTFE to be deterministic, as already discussed. +We don't re-run the same code at run-time and rely on it still doing the same, so nothing actually breaks if CTFE behavior diverges from run-time behavior. + +That said, not being CTFE correct is surely very surprising and we should avoid it best we can. +However, I am told that actually predicting the result of floating-point operations deterministically [is extremely hard](https://gafferongames.com/post/floating_point_determinism/) and [LLVM isn't exactly helping](https://github.com/rust-lang/rust/issues/24111#issuecomment-386765720). +So, we will likely have to live with either considering floating point operations to be const-unsafe (raising a CTFE error), or not having CTFE correctness when floating point operations are involved. +I think it is possible to achieve CTFE correctness for all other operations, and I think we should strive to do so. + +Before we go on, notice that CTFE correctness as defined above does not say anything about the case where CTFE fails with an error, e.g. because of an unsupported operation. +That is a deliberate choice because it lets us gradually improve the operations supported by CTFE, but it is a choice that not everyone might agree with. + +## Unsafe Blocks in Const Context + +Let's say we want to extend miri to support more operations on raw pointers. +We know we have to be careful about keeping miri deterministic, and about maintaining CTFE correctness. +Which operations will we be able to support? + +Notice that at this point, const soundness and the related const safety are *not* a concern yet. 
+Those ideas come into play when we are changing the const type system to allow more operations. +CTFE determinism and correctness, however, are properties of the CTFE engine (miri) itself. + +Let us look at the following example: +{% highlight rust %} +const fn make_a_bool() -> bool { + let x = Box::new(0); + let x_ptr = &*x as *const i32; + drop(x); + let y = Box::new(0); + let y_ptr = &*y as *const i32; + x_ptr == y_ptr +} +{% endhighlight %} +At run-time, whether this function returns `true` or `false` depends on whether or not the allocator re-used the space for `x` when allocating `y`. +However, due to CTFE being deterministic, we have to pick *one* concrete answer at compile-time, and that may not be the right answer. +Hence we cannot allow this program to execute under CTFE if we want to maintain CTFE correctness. +Supporting memory allocation in a deterministic way is perfectly feasible (in fact, miri already has that implemented), and casting a reference to a raw pointer changes nothing but the type. +The only actually problematic operation here is testing two raw pointers for equality: +Because one of the pointers is dangling, we can not deterministically predict the result of this comparison! + +In other words, if/when miri learns how to compare pointers, we must make it raise a CTFE error if one of the pointers is dangling (points to unallocated memory), or else we would violate CTFE correctness. + +Now, let us go one level up and look at the const type system. +We have seen that comparing raw pointers can raise a CTFE error, so this is actually not a const-safe operation. +Similar to casting pointers to integers, we have to make the const type system reject code that compares raw pointers. +However, it seems like a shame to not even allow comparing two *references* for equality, after converting them to raw pointers! +After all, references never dangle, so this is a perfectly const-safe operation. + +Lucky enough, Rust *already* has an answer to the need to side-step the type system in controlled ways: `unsafe` blocks. +Comparing raw pointers is not allowed by the const type system because it is not const-safe, but just like we allow run-time-unsafe operations to be performed in unsafe blocks, we can allow const-unsafe operations as well. +So, we should be able to write the following: +{% highlight rust %} +const fn ptr_eq(x: &T, y: &T) -> bool { + unsafe { x as *const _ == y as *const _ } +} +{% endhighlight %} +As usual when writing `unsafe` code, we have to be careful not to violate the safety guarantees that are usually upheld by the type system. +We have to manually ensure that, *if* our inputs are const-valid, then we will not trigger a CTFE error and return a const-valid result. +For this example, the reason no CTFE error can arise is that references cannot dangle. +We can thus provide `ptr_eq` as an abstraction that is entirely safe to use in const context, even though it contains a potentially const-unsafe operation. +This is, again, in perfect analogy with types like `Vec` being entirely safe to use from safe Rust even though `Vec` internally uses plenty of potentially unsafe operations. + +Whenever I said above that some operation must be rejected by the const type system, what that really means is that the operation should be unsafe in const context. +Even pointer-to-integer casts can be used internally in const-safe code, for example to pack additional bits into the aligned part of a pointer in a perfectly deterministic way. 
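+To make that last point concrete, here is a sketch (the function names are made up, and today this is ordinary run-time code; the point of the proposal is that it could also be written as a `const fn` with the casts inside an `unsafe` block):
+{% highlight rust %}
+/// Pack a 2-bit tag into the alignment bits of a 4-byte-aligned pointer.
+/// The intermediate integer is never exposed on its own; only bits below the
+/// alignment are touched, so CTFE could evaluate this without ever knowing
+/// the actual address.
+fn tag_ptr(ptr: *const u32, tag: usize) -> *const u32 {
+    debug_assert!(tag < 4);
+    ((ptr as usize) | tag) as *const u32
+}
+
+/// Recover the original pointer and the tag.
+fn untag_ptr(tagged: *const u32) -> (*const u32, usize) {
+    let bits = tagged as usize;
+    ((bits & !0b11) as *const u32, bits & 0b11)
+}
+
+fn main() {
+    let x = 42u32;
+    let (ptr, tag) = untag_ptr(tag_ptr(&x, 0b10));
+    assert_eq!((unsafe { *ptr }, tag), (42, 0b10));
+}
+{% endhighlight %}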
+ +## Static Promotion + +There is one more aspect to CTFE in Rust that I have not yet touched on: Promotion of static values. +This is the mechanism that makes the following code well-typed: +{% highlight rust %} +fn make_a_3() -> &'static i32 { &3 } +{% endhighlight %} +This may look like it should be rejected because we are returning a reference to a locally created value with lifetime `'static`, but instead, Magic (TM) is happening. +The compiler determines that `3` is a static value that can be computed at compile-time and put into static memory (like a `static` variable), and hence `&3` can have lifetime `'static`. +This also works with e.g. `&(3+4)`. +Static variables, like `const`, are computed at compile time and hence CTFE comes into play. + +The fundamentally new aspect to this is that *the user did not ask for the value to be promoted*. +That means we have to be really careful when deciding which values to promote: If we promote something that miri cannot evaluate, there will be a CTFE error and we have just broken compilation for no good reason. +We better make sure that we only promote values that we can expect miri to actually be able to compute -- i.e., we should only promote the results of const-safe code. +You probably already guessed it, but what I am proposing is to use the const type system for that purpose. +Const soundness already says that this is a way to ensure const safety. + +I propose to only ever promote values that are *safely* const-well-typed. +(So, we will not promote values involving const-unsafe operations even when we are in an unsafe block.) +When there are function calls, the function must be a safe `const fn` and all arguments, again, const-well-typed. +For example, `&is_eight_mod_256(13)` would be promoted but `&is_eight_mod_256(Box::new(0).into_raw() as usize)` would not. +As usual for type systems, this is an entirely local analysis that does not look into other functions' bodies. +Assuming our const type system is sound, the only way we could possibly have a CTFE error from promotion is when there is a safe `const fn` with an unsound `unsafe` block. + +Notably, we are relying on library authors properly writing `unsafe const fn` even for private functions whenever there is a chance that this function might cause a CTFE error for const-valid inputs. +If there is a `const fn` that is actually unsafe, arguing "but it is private and hence that's fine" will not help because the compiler might decide to promote the result of that function. +However, this can only break code within the same crate and can be fixed locally, so it seems like a reasonable compromise to me. + +Another interesting point to consider is that we probably care much more about CTFE correctness when thinking about promotion. +After all, the user asked for the run-time behavior; if they are getting a completely different behavior from CTFE, that would lead to problems. +If miri is CTFE-correct except for obscure floating point issues, that means "only" people relying on specific behavior of floating point operations could be affected, and likely LLVM will already violate whatever assumptions those people are making. +(miri's floating point implementation is perfectly sane and should be standards compliant, LLVM and the particularities of x87 rounding are the sources of uncertainty here.) +I am not sure which effect that should or will have for promotion. 
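+To recap the proposed rule in code form (this is a sketch of the proposal, not a description of what the compiler accepts today; it reuses `is_eight_mod_256` from above):
+{% highlight rust %}
+// Would be promoted: `is_eight_mod_256` is a safe `const fn` and `13` is
+// const-valid, so the result can be placed in static memory.
+fn fine() -> &'static bool { &is_eight_mod_256(13) }
+
+// Would *not* be promoted: the argument involves a const-unsafe
+// pointer-to-integer cast, so no 'static lifetime would be granted here.
+// fn not_fine() -> &'static bool { &is_eight_mod_256(Box::into_raw(Box::new(0)) as usize) }
+{% endhighlight %}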
+ +## Conclusion + +I have discussed the notions of CTFE determinism and CTFE correctness (which are properties of a CTFE engine like miri), as well as const safety (property of a piece of code) and const soundness (property of a type system). +There are still plenty of open questions, in particular around the interaction of [`const fn` and traits](https://github.com/rust-lang/rust/issues/24111#issuecomment-311029471), but I hope this terminology is useful when having those discussions. +Let the type systems guide us :) + +Thanks to @oli-obk for feedback on a draft of this post. + -- 2.30.2 From 7a0084a49697baf91a1469acabadd5acbcfbd7fa Mon Sep 17 00:00:00 2001 From: Ralf Jung Date: Thu, 19 Jul 2018 18:17:14 +0200 Subject: [PATCH 14/16] link to forum --- personal/_posts/2018-07-19-const.md | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/personal/_posts/2018-07-19-const.md b/personal/_posts/2018-07-19-const.md index ee1eab3..d523935 100644 --- a/personal/_posts/2018-07-19-const.md +++ b/personal/_posts/2018-07-19-const.md @@ -1,11 +1,12 @@ --- title: "Thoughts on Compile-Time Function Evaluation and Type Systems" categories: internship rust +forum: https://internals.rust-lang.org/t/thoughts-on-compile-time-function-evaluation-and-type-systems/8004 --- For some time now (since the 1.26 release, to be precise), Rust has a [very powerful machinery for CTFE](https://github.com/rust-lang/rust/pull/46882), or compile-time function evaluation. -Since then, there have been various discussions about which operations should be allowed during CTFE, which checks the compiler should do, and which kinds of guarantees we should be able to expect around CTFE. -This post is my take on those topics, and it should not be surprising that I am going to take a very type-system centric view. :) +Since then, there have been various discussions about which operations should be allowed during CTFE, which checks the compiler should do, how this all relates to promotion and which kinds of guarantees we should be able to expect around CTFE. +This post is my take on those topics, and it should not be surprising that I am going to take a very type-system centric view. Expect something like a structured brain dump, so there are some unanswered questions towards the end as well. @@ -231,4 +232,4 @@ There are still plenty of open questions, in particular around the interaction o Let the type systems guide us :) Thanks to @oli-obk for feedback on a draft of this post. - +If you have feedback or questions, [let's discuss in the internals forum](https://internals.rust-lang.org/t/thoughts-on-compile-time-function-evaluation-and-type-systems/8004)! -- 2.30.2 From 5b59a8a70a9ef91f094c88c94becb0bfa00e6857 Mon Sep 17 00:00:00 2001 From: Ralf Jung Date: Thu, 19 Jul 2018 19:09:50 +0200 Subject: [PATCH 15/16] notation --- personal/_posts/2018-07-19-const.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/personal/_posts/2018-07-19-const.md b/personal/_posts/2018-07-19-const.md index d523935..d93a6cd 100644 --- a/personal/_posts/2018-07-19-const.md +++ b/personal/_posts/2018-07-19-const.md @@ -13,7 +13,7 @@ Expect something like a structured brain dump, so there are some unanswered ques ## Some Background -CTFE is the mechanism used by the compiler, primarily, to evaluate items like `const x : T = ...;`. +CTFE is the mechanism used by the compiler, primarily, to evaluate items like `const x: T = ...;`. 
The `...` here is going to be Rust code that must be "run" at compile-time, because it can be used as a constant in the code -- for example, it can be used for array lengths. Notice that CTFE is *not* the same as constant propagation: Constant propagation is an optimization pass done by LLVM that will opportunistically change code like `3 + 4` into `7` to avoid run-time work. @@ -23,8 +23,8 @@ You can statically see, just from the syntax of the code, whether CTFE applies t CTFE is only used in places like the value of a `const` or the length of an array. {% highlight rust %} fn demo() { - const X : u32 = 3 + 4; // CTFE - let x : u32 = 4 + 3; // no CTFE (but maybe constant propagation) + const X: u32 = 3 + 4; // CTFE + let x: u32 = 4 + 3; // no CTFE (but maybe constant propagation) } {% endhighlight %} We say that the `3 + 4` above is in *const context* and hence subject to CTFE, but the `4 + 3` is not. -- 2.30.2 From 61d4abe5dbb1bc4d3c156995d0f546354029fe9e Mon Sep 17 00:00:00 2001 From: Ralf Jung Date: Thu, 19 Jul 2018 20:41:06 +0200 Subject: [PATCH 16/16] clarify conflusion --- personal/_posts/2018-07-19-const.md | 2 ++ 1 file changed, 2 insertions(+) diff --git a/personal/_posts/2018-07-19-const.md b/personal/_posts/2018-07-19-const.md index d93a6cd..f511b9f 100644 --- a/personal/_posts/2018-07-19-const.md +++ b/personal/_posts/2018-07-19-const.md @@ -228,6 +228,8 @@ I am not sure which effect that should or will have for promotion. ## Conclusion I have discussed the notions of CTFE determinism and CTFE correctness (which are properties of a CTFE engine like miri), as well as const safety (property of a piece of code) and const soundness (property of a type system). +In particular, I propose that *when type-checking safe code in const context, we guarantee that this code is const-safe*, i.e., that it will not hit a CTFE error (though panics are allowed, just like they are in "run-time" Rust code). + There are still plenty of open questions, in particular around the interaction of [`const fn` and traits](https://github.com/rust-lang/rust/issues/24111#issuecomment-311029471), but I hope this terminology is useful when having those discussions. Let the type systems guide us :) -- 2.30.2