Here is the source program:
{% highlight c %}
char p[1], q[1] = {0};
-int ip = (int)(p+1);
-int iq = (int)q;
+uintptr_t ip = (uintptr_t)(p+1);
+uintptr_t iq = (uintptr_t)q;
if (iq == ip) {
*(char*)iq = 10;
print(q[0]);
}
{% endhighlight %}
I am using C syntax here just as a convenient way to write programs in LLVM IR.
-For simplicity, we assume that `int` has the right size to hold a pointer value; just imagine we used `uintptr_t` if you want to be more general.
This program has two possible behaviors: either `ip` (the address one-past-the-end of `p`) and `iq` (the address of `q`) are different, and nothing is printed.
Or the two are equal, in which case the program will print "10" (`iq` is the result of casting `q` to an integer, so casting it back will yield the original pointer, or at least a pointer pointing to the same object / location in memory).
Subsequently the definition of `ip` is inlined:
{% highlight c %}
char p[1], q[1] = {0};
-int ip = (int)(p+1);
-int iq = (int)q;
+uintptr_t ip = (uintptr_t)(p+1);
+uintptr_t iq = (uintptr_t)q;
if (iq == ip) {
- *(char*)(int)(p+1) = 10; // <- This line changed
+ *(char*)(uintptr_t)(p+1) = 10; // <- This line changed
print(q[0]);
}
{% endhighlight %}
The second optimization notices that we are taking a pointer `p+1`, casting it to an integer, and casting it back, so we can remove the cast roundtrip:
{% highlight c %}
char p[1], q[1] = {0};
-int ip = (int)(p+1);
-int iq = (int)q;
+uintptr_t ip = (uintptr_t)(p+1);
+uintptr_t iq = (uintptr_t)q;
if (iq == ip) {
*(p+1) = 10; // <- This line changed
print(q[0]);
The final optimization notices that `q` is never written to, so we can replace `q[0]` by its initial value `0`:
{% highlight c %}
char p[1], q[1] = {0};
-int ip = (int)(p+1);
-int iq = (int)q;
+uintptr_t ip = (uintptr_t)(p+1);
+uintptr_t iq = (uintptr_t)q;
if (iq == ip) {
*(p+1) = 10;
print(0); // <- This line changed
Now that we know that provenance exists in pointers, we have to also consider what happens to provenance when a pointer gets cast to an integer and back.
The second optimization gives us a clue into this aspect of LLVM IR semantics: casting a pointer to an integer and back is optimized away, which means that *integers have provenance*.
-To see why, consider the two expressions `(char*)(int)(p+1)` and `(char*)(int)q`:
+To see why, consider the two expressions `(char*)(uintptr_t)(p+1)` and `(char*)(uintptr_t)q`:
if the optimization of removing pointer-integer-pointer roundtrips is correct, the first operation will output `p+1` and the second will output `q`, which we just established are two different pointers (they differ in their provenance).
The only way to explain this is to say that the input to the `(char*)` cast is different, since the program state is otherwise identical in both cases.
-But we know that the integer values computed by `(int)(p+1)` and `(int)q` (i.e., the bit pattern of length 32 that serve as input to the `(char*)` casts) are the same, and hence a difference can only arise if these integers consist of more than just this bit pattern---just like pointers, integers have provenance.
+But we know that the integer values computed by `(uintptr_t)(p+1)` and `(uintptr_t)q` (i.e., the bit pattern as stored in some CPU register) are the same, and hence a difference can only arise if these integers consist of more than just this bit pattern---just like pointers, integers have provenance.
Finally, let us consider the first optimization.
Here, a successful equality test `iq == ip` prompts the optimizer to replace one value by the other.