personal/_posts/2026-03-13-inline-asm.md

---
title: "How to use storytelling to fit inline assembly into Rust"
categories: rust
reddit: /rust/comments/1rshm93/how_to_use_storytelling_to_fit_inline_assembly/
---

The Rust Abstract Machine is full of [wonderful oddities]({% post_url 2020-12-14-provenance %}) that do not exist on the [actual hardware]({% post_url 2019-07-14-uninit %}).
Inevitably, every time this is discussed, someone asks: "But, what if I use inline assembly? What happens with provenance and uninitialized memory and Tree Borrows and all these other fun things you made up that don't actually exist?"
This is a great question, but answering it properly requires some effort.
In this post, I will lay down my current thinking on how inline assembly fits into the Rust Abstract Machine by giving a *general principle* that explains how anything we decide about the semantics of pure Rust impacts what inline assembly may or may not do.

<!-- MORE -->

Note that everything I discuss here applies to FFI calls just as much as it applies to inline assembly.
Those mechanisms are fundamentally very similar: they allow Rust code to invoke code not written in Rust.[^xlang]
I will not keep repeating "inline assembly or FFI" throughout the post, but every time I refer to inline assembly this is meant to also include FFI.

To get started, let me explain why there are things that even inline assembly is fundamentally not allowed to do.

[^xlang]: FFI has one extra complication that does not arise with inline assembly, and that is cross-language LTO. That is its own separate can of worms and outside the scope of this post.

## Why can't inline assembly do whatever it wants?

People like to think of inline assembly as freeing them from all the complicated requirements of the Abstract Machine.
Unfortunately, that's a pipe dream.
Here is an example to demonstrate this:

```rust
use std::arch::asm;

#[inline(never)]
fn innocent(x: &i32) { unsafe {
    // Store 0 at the address given by x.
    asm!(
        "mov dword ptr [{x}], 0",
        x = in(reg) x,
    );
} }

fn main() {
    let x = 1;
    innocent(&x);
    assert!(x == 1);
}
```

When the compiler analyzes `main`, it realizes that only a shared reference is being passed to `innocent`.
This means that whatever `innocent` does, it cannot change the value stored at `*x`.
Therefore, the assertion can be optimized away.

However, `innocent` actually does write to `*x`!
Therefore, the optimization changed the behavior of the program.
And indeed, this is [exactly what happens](https://rust.godbolt.org/z/GG7YPsEcs) with current versions of rustc:
without optimizations, the assertion fails, but with optimizations, it passes.
Therefore, either the optimization was wrong, or the program had Undefined Behavior.
And since this is an optimization that we really want to be able to perform, we can only pick the second option.[^nondet]

[^nondet]: The secret third option is that the program might be non-deterministic and allows both behaviors, but that definitely does not apply here.

However, where does the UB come from?
If the entire program was written in Rust, the answer would be "the aliasing model".
Both Stacked Borrows and Tree Borrows, and any other aliasing model worth considering for Rust, will make it UB to write through pointers derived from a shared reference.
However, this time, parts of the program are not written in Rust, so things are not that simple.
How can we say that the inline asm block violated Tree Borrows, when it is written in a language that does not have anything even remotely comparable to Tree Borrows?
That's what the rest of this post is about.

I hope the example clearly demonstrates that we *cannot* get away with having inline assembly just ignore Abstract Machine concepts such as Tree Borrows.
The inline asm block causes UB, we just have to figure out how and why---and more importantly, we have to figure out how people can ensure that their inline asm blocks do *not* cause UB.

## When is inline assembly compatible with optimizations?

It may seem like we now have to define a version of Tree Borrows that works with assembly code.
That would be an impossible task (Tree Borrows relies on pointer provenance, which does not exist in assembly).[^cheri]
Luckily, this is also not necessary.

[^cheri]: I can already sense that some people want to bring up CHERI as an apparent counterexample. CHERI has capabilities, which look and feel a bit like pointer provenance, but they are nowhere near fine-grained enough for Tree Borrows, so capabilities and provenance are still distinct concepts that should not be confused with each other.

Instead, we can piggy-back on the already existing definition of Tree Borrows and the rest of the Abstract Machine.
We do this by requiring the programmer to *tell a story* about what the inline assembly block does in Rust terms.[^alice]
(If this sounds strange, please bear with me. I will explain why this makes sense.)
Specifically, for every inline assembly block, there has to be a corresponding piece of Rust code that *does the same thing* ***as far as the state observable by pure Rust code is concerned***.
When reasoning about the behavior of the overall program, the inline assembly block then gets replaced by that "story" code.
You don't have to actually write this code; what's important is that the code exists and tells a coherent story with what the surrounding Rust code does.

[^alice]: Thank you to Alice Ryhl for suggesting the term "telling a story" for this model.

For our example above, this immediately explains what went wrong:
the story code for the inline assembly block would have to be something like `(x as *const i32 as *mut i32).write(0)`, and if we insert that code in place of the inline assembly block, we can immediately see (and Miri could confirm) that the program has UB.
An inline assembly block can have many possible stories, and it is enough to find *one* story that makes everything work, but in this case, that is not possible.
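
For contrast, here is a minimal sketch of the same store instruction with a story that *does* work. The only change is that the function takes `&mut i32`, so the story `*x = 0` is a write the aliasing model permits. The names `zero` and `zero_story` are made up for this illustration, and the asm variant assumes x86-64; on other targets the sketch simply runs the story code.

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::asm;

/// Hypothetical story code: what the asm block does in Abstract Machine terms.
fn zero_story(x: &mut i32) {
    *x = 0;
}

/// Same store as in `innocent`, but through a mutable reference.
#[cfg(target_arch = "x86_64")]
fn zero(x: &mut i32) {
    unsafe {
        // Story: `*x = 0`. The write is fine because `x` is `&mut`.
        asm!(
            "mov dword ptr [{x}], 0",
            x = in(reg) x as *mut i32,
            options(nostack, preserves_flags),
        );
    }
}

/// Fallback for other targets (an assumption of this sketch): run the story.
#[cfg(not(target_arch = "x86_64"))]
fn zero(x: &mut i32) {
    zero_story(x);
}
```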

So, in slightly more detail, here is what I consider to be the rules for inline assembly:
1. For every inline assembly block, pick a "story": a piece of Rust code that serves as stand-in for what this inline assembly block does *to the Abstract Machine state*.
   This story code only has access to data that is made available to the inline assembly block (explicit operands and global variables).
   When reasoning about soundness and correctness of the program on the Abstract Machine level, we pretend that the story code gets executed instead of the assembly code.
2. This piece of code has to satisfy all the requirements that are imposed on the asm block by options such as `readonly` or `nomem` and honor operand constraints such as not mutating `in` operands.
3. The actual assembly code has to *refine* the story code, i.e., whatever the assembly code does to state which the Abstract Machine can observe (in particular, operands and global variables) has to be something that the story code could also have done.

I should add the disclaimer that I do not have a formal theory that proves correctness of this approach.
However, I am reasonably confident, because this approach fits in very well with how we prove the correctness for optimizations such as the one in our example above:
At the heart of the correctness argument is a proof that *all* Rust code satisfies some universal properties.
For instance, we can formalize and prove the claim that any Rust function which takes a shared reference without interior mutability as argument cannot write to that argument.
This isn't the only such property; in fact the set of such properties isn't fully known: we might discover a new property upheld by all Rust code tomorrow.
What's crucial is that any property of the form "for all Rust programs, ..." must also hold for the story code, since that is just regular Rust code!
Finally, because the actual assembly code refines the story code, we know that for the purpose of reasoning about the program, we can pretend that actually the story code gets executed and then, at the end of compilation, replace the story code by the desired assembly code without changing program behavior.

So, that is why story code works.
But, doesn't this make inline assembly entirely useless?
After all, the entire point of inline assembly is to do things I couldn't already do in pure Rust!

## Inline assembly stories by example

To convince you that the storytelling approach is feasible,
let us consider a few representative examples for inline assembly usage and what the corresponding story might look like.

#### Pure instructions

The easiest case is code that wants to access a new hardware operation that is not exposed in the language.
For instance, the inline assembly block might consist of a single instruction that returns the number of bits set to 1 in a register.
Here, storytelling is trivial:
we can just write some Rust code that does a bit of bit manipulation by hand to count the number of bits set to 1.
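
As a rough sketch of what this could look like, the following pairs an x86-64 `popcnt` block with hand-written story code. The function names are invented for this example, the asm path assumes x86-64 with the `popcnt` feature (checked at runtime), and everything else falls back to the story. Rust already exposes this operation as `count_ones`, so the example is purely illustrative.

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::asm;

/// Story code: count the set bits by hand, clearing the lowest one each round.
fn popcount_story(mut x: u64) -> u32 {
    let mut count = 0;
    while x != 0 {
        x &= x - 1; // clears the lowest set bit
        count += 1;
    }
    count
}

#[cfg(target_arch = "x86_64")]
fn popcount(x: u64) -> u32 {
    // `popcnt` is not part of the x86-64 baseline, so check for it first.
    if !std::is_x86_feature_detected!("popcnt") {
        return popcount_story(x);
    }
    let out: u64;
    unsafe {
        // `pure` + `nomem`: the result depends only on the input register.
        asm!(
            "popcnt {out}, {x}",
            out = lateout(reg) out,
            x = in(reg) x,
            options(pure, nomem, nostack),
        );
    }
    out as u32
}

/// Fallback for other targets (an assumption of this sketch): run the story.
#[cfg(not(target_arch = "x86_64"))]
fn popcount(x: u64) -> u32 {
    popcount_story(x)
}
```

The asm block refines the story because, for every input, it produces exactly the value the story code would have produced and touches no other state the Abstract Machine can observe.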

#### Page table manipulation

That was easy, so let us crank up the difficulty and consider an OS kernel that manipulates page tables.
Rust has no notion of page tables.
What could the "story" possibly look like here?

The answer is that Rust has something that is very similar to putting a new page into the page table---it is called `alloc`.
It also has something very similar to removing a page (`dealloc`), and to moving a page to a different location in the address space (`realloc`).
So, the story that an OS kernel would tell the compiler is that manipulating page tables is really just a funny kind of allocator.

Slightly more concretely, "allocating" a page in a way that is compatible with the storytelling approach could look like this:
- First, some Rust code performs the actual page table manipulation using volatile loads and stores.[^volatile]
- Then an asm block executes whatever barrier is needed on the current system to ensure the updated page table has taken effect.
- Next, the address of the page is cast to a pointer (using [`with_exposed_provenance`](https://doc.rust-lang.org/std/ptr/fn.with_exposed_provenance.html)).
- Finally, Rust code may use that pointer to access the new page.

[^volatile]: Why am I insisting on volatile accesses here? Because if you had the page tables inside a regular Rust allocation, writes to that page table could have "interesting" effects, and that doesn't really correspond to anything that can happen when you write to a normal Rust allocation. In other words, I haven't (yet) come up with a proper story that would allow for those writes to be non-volatile.

The story of this asm block is that it performs memory allocation at the given address, which we know to be unallocated.[^alloc-control]
This creates a fresh provenance that represents the new allocation.
This allocation is then immediately [exposed](https://doc.rust-lang.org/std/primitive.pointer.html#method.expose_provenance) by the story code.

[^alloc-control]: This assumes that we refine our specification of how memory allocation works in Rust so that there are regions of memory that "native" Rust allocations such as stack and static variables do not use, and that are instead entirely controlled by the program. If the only allocation operation a language has is "non-deterministic allocation anywhere in the address space", this story does not work.

Even for architectures where no barrier is needed after a page table change, the asm block is still crucial:
it prevents the compiler from reordering accesses to the new pages up across the page table manipulation!
Using the usual rules for Rust programs, there is no way the compiler could figure out that there is any sort of dependency here.
The asm block therefore serves as a compiler fence: as far as the compiler is concerned, this block might actually invoke the story code we made up,
and therefore the new pointer and operations based on it cannot be moved to before the asm block.

This is why people sometimes think of asm blocks as compiler fences: an asm block stands in for some arbitrary "story code" the compiler doesn't know,
so the compiler has to treat this code *as if* some arbitrary code was executing here,
which prevents most reorderings.
But the emphasis here is on *most*: if the compiler has extra aliasing information, such as from an `&mut` type, that lets the compiler reason about and reorder memory accesses even across unknown function calls and, therefore, inline asm blocks.
It is therefore incorrect to say that an asm block is a barrier preventing all reordering.
Thinking in terms of compiler barriers can provide useful intuition, but a rigorous correctness argument needs to go into more detail.

There is another caveat in this story: with page table manipulation, one cannot just create new allocations, one can also grow and shrink existing allocations.
In fact, the same is possible from userspace with `mmap`.
It turns out that *growing* allocations is harmless, so this has been [officially blessed](https://github.com/llvm/llvm-project/pull/141338) in LLVM and we should find a way to also expose this on the Rust side.
However, *shrinking* allocations is problematic---there are [simple optimizations](https://github.com/llvm/llvm-project/pull/141338) that LLVM might reasonably perform that would break code which shrinks allocations!
So, further work is required to ensure that Rust code (as well as C and C++ code) can use `munmap` without risking miscompilations.
This is why it is so important to take a principled approach to language semantics and correctness: it would otherwise be way too easy to miss potential problems like this.

#### Page table manipulation II: duplicating pages

Next, let us consider another case of page table shenanigans: mapping a single page of physical memory into multiple locations in virtual memory.
That means the page is "mirrored" in multiple places, and mutating any one mirror changes all of them.
First of all, note that in general, this is plain unsound.
LLVM will freely assume that `ptr` and `ptr.wrapping_offset(4096)` do not alias, so mapping the same memory into multiple places and freely accessing all of them can lead to subtle miscompilations.
However, there is a restricted form of this where we can use inline assembly to come up with a "story" that fits into the Abstract Machine, and is therefore sound.

The key limitation is that the program only gets to use one of the "mirrored" versions of this memory at a time.
Changing which mirror is "active" requires an explicit barrier and returns a new pointer that has to be used for future accesses.
This barrier can be an empty inline assembly block that just returns the pointer unchanged, but the story we attach to it is anything but empty:
we will say that this behaves like a `realloc`, logically moving the allocation from one mirror to another.
In other words, as far as the Rust Abstract Machine is concerned, only one of the mirrored versions of memory actually "exists", and switching to a different one amounts to freeing the old allocation and creating a new one.
Crucially, as usual with `realloc`, after each such switch all the old pointers to that memory become invalid and the new pointer returned by the switch is the only way to access that memory.[^intptr]
These inline asm blocks will also prevent LLVM from reordering accesses to different "mirrors" around each other, thus avoiding the aforementioned miscompilations.
In other words, changing our code in a way that lets us tell a proper story also introduced enough structure to prevent the optimizer from doing things it shouldn't do.

[^intptr]: Long-lived pointers into duplicated memory don't really work because they might point to the wrong duplicate. But if that can be avoided, then you can just store them as integers and [cast them to a pointer](https://doc.rust-lang.org/std/ptr/fn.with_exposed_provenance.html) for every access; this avoids any long-lived provenance that might let the compiler apply normal allocation-based reasoning to this memory.

This may sound a bit contrived, but such a "purely logical" `realloc` actually comes up in more than one situation;
there even is [an open RFC](https://github.com/rust-lang/rfcs/pull/3700) proposing to add it to the language proper.

#### Non-temporal stores

The previous example already showed that some hardware features are too intrusive to be freely available inside a high-level language such as Rust.
Non-temporal stores are another example of this.
Specifically, I am referring to the "streaming" store operations on x86 (`_mm_stream_ps` and friends).
The main point of these operations is to avoid cluttering the cache with data that is unlikely to be read again soon,
but they also have the unfortunate side effect of breaking the usual "total store order" memory model of x86.
This is bad news because the compilation of the rest of the program relies on that memory model.

To explain the problem, let us consider what the "story" for a non-temporal store might be.
The obvious choice is to make it just a regular write access---caching is not modeled in the Abstract Machine, after all.
Unfortunately, this does not work.
Consider the case where the streaming store is followed by an atomic release write.
Due to the total store order model of x86, this compiles to a regular write instruction without any extra fences.
However, streaming stores actually *do* require a fence (`_mm_sfence`) for proper synchronization.
Therefore, one can write a Rust program that seems to be data-race-free (according to the story) but actually has a data race.
In other words, rule 3 (the inline asm block must refine the story code) is violated.

The principled fix for this is to extend the C++ memory model (which is shared by Rust) with a notion of non-temporal stores so that one can reason about how they interact with everything else that can go on in a concurrent program.
I am not aware of anyone having actually done this---extending or even just slightly adjusting the C++ memory model is an [enormous undertaking](https://plv.mpi-sws.org/scfix/paper.pdf).
However, there is a simpler alternative: we can try coming up with a more complicated story such that rule 3 is not violated.
This is exactly what a bunch of folks did when the issue around non-temporal stores was discovered.
The story says that doing a non-temporal store corresponds to *spawning a thread* that will asynchronously perform the actual store,
and `_mm_sfence` corresponds to waiting for all those threads to finish.
This explains why release-acquire synchronization fails: synchronization picks up all writes performed by the releasing thread, but the streaming store conceptually happened on a different thread!
The new story code was the basis for the updated [documentation](https://doc.rust-lang.org/nightly/core/arch/x86/fn._mm_sfence.html) for streaming stores on x86,
and the code itself can even be found in a [comment in the code](https://github.com/rust-lang/rust/blob/cb3046e5f2f0736366c0fea4977a8df579d96311/library/stdarch/crates/core_arch/src/x86/sse.rs#L1456-L1481).
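
To make the shape of that story concrete, here is a simplified sketch written for this post's argument; it is not the code from the linked comment. `PendingStores`, `SendPtr`, `stream_store`, and `sfence` are hypothetical names, and the store writes a single `f32` rather than the four lanes a real `_mm_stream_ps` writes.

```rust
use std::thread::{self, JoinHandle};

/// Raw pointer wrapper so the pointer can be moved into the helper thread.
struct SendPtr(*mut f32);
// SAFETY (sketch): callers of `stream_store` promise the pointer stays valid
// and is not accessed by anyone else until `sfence` returns.
unsafe impl Send for SendPtr {}

impl SendPtr {
    fn into_inner(self) -> *mut f32 {
        self.0
    }
}

/// Hypothetical model of one thread's set of pending streaming stores.
#[derive(Default)]
struct PendingStores {
    threads: Vec<JoinHandle<()>>,
}

impl PendingStores {
    /// Story for a streaming store: spawn a thread that does the write later.
    ///
    /// # Safety
    /// `ptr` must be valid for writes, and nothing else may access that
    /// memory until `sfence` has returned.
    unsafe fn stream_store(&mut self, ptr: *mut f32, value: f32) {
        let ptr = SendPtr(ptr);
        self.threads.push(thread::spawn(move || {
            // Calling a by-value method captures the whole `SendPtr`.
            let raw = ptr.into_inner();
            unsafe { raw.write(value) }
        }));
    }

    /// Story for `_mm_sfence`: wait for all pending stores to finish.
    fn sfence(&mut self) {
        for t in self.threads.drain(..) {
            t.join().expect("store thread panicked");
        }
    }
}
```

In this model, a release write issued between `stream_store` and `sfence` publishes nothing about the pending store, because that store belongs to a different thread. Joining in `sfence` is what establishes the happens-before edge.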

There is one caveat:
The story we picked implies that it is UB for the thread that performed the streaming store to do a load from that memory before `_mm_sfence`, even though this operation would be well-defined on the underlying hardware.
This is the price we pay in exchange for having a principled argument for why code using streaming stores will not be miscompiled.
It is not a high price: streaming stores are used for data that is unlikely to be read again soon, that is their entire point.
None of the examples of streaming stores we found in the wild had a problem with this limitation.[^forgotten-fence]

[^forgotten-fence]: All of the examples we found forgot to insert the `_mm_sfence`, which was clearly unsound. Thanks to the story, we now have a clear idea *why* it is unsound, i.e., which rule of the Rust language was violated.

#### Stack painting

Another possible use for inline assembly is measuring the stack consumption of a program using stack painting.
This was brought up as a question in the t-opsem Zulip, and I am including it here because it is a nice demonstration of how much freedom the storytelling approach provides, and which limitations it has.

Roughly speaking, stack painting means that before the program starts, the memory region that will later become the stack is filled with a fixed bit pattern.
Later, we can then measure the maximum stack usage of the program by checking where the bit pattern is still intact and where it has been overwritten.
This can be done with inline assembly code that simply reads directly from the stack.

The first reflex might be to say that this is obviously UB:
that stack memory might be subject to noalias constraints (due to a mutable reference pointing to the stack); you can't just read from memory that you don't have permission to read.
However, that presupposes that the story for this asm block involves reading memory.
An alternative story is to say that the asm block just returns some arbitrary, non-deterministically chosen value.
The upside of this story is that, as long as the read doesn't trap, the story is always correct according to our rules:
whatever the assembly code actually does, it surely refines returning an arbitrary value.
However, the downside of this story is that when reasoning about our code, we cannot make *any* assumptions about the value we read!
Correctness of our program is defined under the storytelling semantics, i.e., the program has to be correct no matter which values are returned by the inline asm.
That may sound like a problem, but for this use case, it is actually entirely fine:
stack painting only provides an estimate of the real stack usage anyway.
The compiler makes no *guarantees* that the measurement produced this way is remotely accurate, but experiments show that this works well in practice.
Incorrect measurements do not lead to soundness or correctness issues, so providing accurate answers is "just" a quality-of-life concern.
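
A small sketch of the Rust side of such a measurement follows. The slice stands in for the values the asm blocks returned, lowest address first. The `PAINT` constant, the function name, and the downward-growing stack are assumptions of this example. Because the story only promises arbitrary values, the function has to behave sensibly for *any* slice contents, which it does: it never panics and always returns a number between 0 and the slice length.

```rust
/// Fill pattern written to the stack region before the program starts
/// (an arbitrary value chosen for this sketch).
const PAINT: u32 = 0xDEAD_BEEF;

/// Estimate stack usage in words. `words` holds the values read back from
/// the stack region, lowest address first; the stack is assumed to grow
/// downward, so untouched paint sits at the start of the slice.
fn estimate_stack_usage(words: &[u32]) -> usize {
    let untouched = words.iter().take_while(|&&w| w == PAINT).count();
    words.len() - untouched
}
```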

#### Floating-point status and control register

The final example I want to consider is floating-point status and control registers.
This is an example where the storytelling approach mostly serves to explain why using these registers is not possible or not useful.

Programmers sometimes want to read the status register to check if a floating-point exception has been raised, and to write the control register to adjust the rounding mode or other aspects of floating-point computations.
However, actually supporting such control register mutations is a disaster for optimizations:
the control register is global (well, thread-local) state, meaning it affects all subsequent operations until the register is changed again.
This means that to optimize any floating-point operation that might need rounding, the compiler has to statically predict what the value of the control register will be.
That's usually not going to be possible, so what compilers do instead is they just assume that the control register always remains in its default state.
(Sometimes they provide ways to opt out of that, but this is hard to do well and Rust currently has no facilities for this.)
The status register is less obviously problematic, but note that if we say that a floating-point operation can mutate the status register, then it is no longer a pure operation, and therefore it cannot be freely reordered.
To allow compilers to do basic optimizations like common subexpression elimination on floating-point operations, languages therefore generally also say that they consider the status register to be not observable.

What does this mean for inline assembly code that reads/writes these registers?
For reading the status register, it means that the story code has no way of saying that this has anything to do with actual floating-point operations.
The Abstract Machine has no floating-point status bits that the story code could read, so the best possible story is to return a non-deterministic value.
This directly reflects the fact that the compiler makes no guarantees for the value that the program will observe in the status register, and since floating-point operations can be arbitrarily reordered, this should be taken quite literally.

For writing the control register, there simply is no possible story: no Rust operation exists that would change the rounding mode of subsequent floating-point operations.
Any inline asm block that changes the rounding mode therefore has undefined behavior.

While this may sound bleak, it is entirely possible to write an inline asm block that changes the rounding mode, performs some floating-point operations, and then changes it back!
The story code for this block can use a soft-float library to perform exactly the same floating-point operations with a non-default rounding mode.
Crucially, since the asm block overall leaves the control register unchanged, the story code does not even have to worry about that register.
In other words, having a single big asm block that performs floating-point operations in a non-default rounding mode is fine.
This also makes sense from an optimization perspective: there is no risk of the compiler moving a floating-point operation into the region of code where the rounding mode is different.
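
As an illustrative sketch (assuming x86-64 with SSE, where the rounding control lives in bits 13-14 of `MXCSR`), the block below adds two `f32` values rounded toward positive infinity and restores the control register before it ends. In place of a soft-float library, the story code uses the error-free TwoSum transformation to decide whether the nearest-rounded sum has to be bumped up. All function names are hypothetical, and the story ignores NaN, infinities, and overflow.

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::asm;

/// Smallest float above `x` (finite inputs only).
fn next_up(x: f32) -> f32 {
    let bits = x.to_bits();
    if x == 0.0 {
        f32::from_bits(1) // smallest positive subnormal
    } else if x > 0.0 {
        f32::from_bits(bits + 1)
    } else {
        f32::from_bits(bits - 1)
    }
}

/// Hypothetical story code: `a + b` rounded toward +infinity, computed
/// without touching the rounding mode. Assumes finite inputs, no overflow.
fn add_round_up_story(a: f32, b: f32) -> f32 {
    let sum = a + b; // rounded to nearest
    // TwoSum: `err` is the exact difference between the true sum and `sum`.
    let b_virtual = sum - a;
    let err = (a - (sum - b_virtual)) + (b - b_virtual);
    if err > 0.0 { next_up(sum) } else { sum }
}

#[cfg(target_arch = "x86_64")]
fn add_round_up(a: f32, b: f32) -> f32 {
    let mut csr = [0u32; 2]; // [saved MXCSR, modified MXCSR]
    let mut sum = a;
    unsafe {
        asm!(
            "stmxcsr dword ptr [{p}]",        // save the control register
            "mov {t:e}, dword ptr [{p}]",
            "and {t:e}, 0xFFFF9FFF",          // clear rounding control (bits 13-14)
            "or {t:e}, 0x4000",               // select round toward +infinity
            "mov dword ptr [{p} + 4], {t:e}",
            "ldmxcsr dword ptr [{p} + 4]",    // switch the rounding mode
            "addss {sum}, {b}",
            "ldmxcsr dword ptr [{p}]",        // restore before leaving the block
            p = in(reg) csr.as_mut_ptr(),
            t = out(reg) _,
            sum = inout(xmm_reg) sum,
            b = in(xmm_reg) b,
            options(nostack),
        );
    }
    sum
}

/// Fallback for other targets (an assumption of this sketch): run the story.
#[cfg(not(target_arch = "x86_64"))]
fn add_round_up(a: f32, b: f32) -> f32 {
    add_round_up_story(a, b)
}
```

Splitting this into one block that sets the mode and another that restores it would leave the first block without any story, which is exactly the case ruled out above.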

## Conclusion

I hope these examples were useful to demonstrate both the flexibility and limitations of the storytelling approach.
In many cases, the inability to come up with a story directly correlates with potential miscompilations.
This is great!
Those are the kinds of inline asm blocks that we *have* to rule out as incorrect.[^noopt]
In some cases, however, there are no obvious miscompilations.
And indeed, if we knew exactly which universal properties of Rust programs the compiler relies on, we could allow inline asm code that satisfies all those universal properties, even if it has no story which can be expressed in Rust source code.
Unfortunately, this approach would require us to commit to the full set of universal properties the compiler may ever use.
If we discover a new universal property tomorrow, we cannot use it since there might be an inline asm block for which the universal property does not hold.

This is why I am proposing to take the conservative approach:
only allow inline asm blocks that are obviously compatible with all universal properties of actual Rust code, because their story can be expressed as actual Rust code.
If there is an operation we want to allow that currently has no valid story, we should just [add](https://github.com/rust-lang/rfcs/pull/3700) a [new language operation](https://github.com/rust-lang/rfcs/pull/3605), which corresponds to officially blessing that operation as one the compiler will keep respecting.

[^noopt]: This assumes that we do not want to sacrifice the optimizations in question. Since inline assembly could hide inside any function call, this typically becomes a language-wide tradeoff: either we forbid such inline asm blocks, or we cannot do the optimization even in pure Rust code.

Right now, we have no official documentation or guidelines for how inline asm blocks and FFI interact with Rust-level UB,
but as the `innocent` example at the top of the post shows, we cannot leave inline asm blocks unconstrained like that.
The storytelling approach is my proposal for filling that gap.
I plan to eventually suggest it as the official rules for inline assembly.
But before I do that, I'd like to be more confident that this approach really can handle most real-world scenarios.
If you have examples of assembly blocks that cannot be explained with storytelling, but that you are convinced are correct and hence should be supported, please let us know, either in the immediate [discussion](https://www.reddit.com/r/rust/comments/1rshm93/how_to_use_storytelling_to_fit_inline_assembly/) for this blog post or (if you are reading this later) in the [t-opsem Zulip channel](https://rust-lang.zulipchat.com/#narrow/channel/136281-t-opsem).

#### Footnotes