Yet another reason to not use printf (or write C code in general)

Author: Chloé Lourseyre

Recently, Joe Groff @jckarter tweeted a very interesting behavior inherited from C:

C++ pro tip: hardcoding constants in your code can create maintenance burdens. Instead of writing 2*x, try the handy double(x) function pic.twitter.com/vZigQW4O0h
— Joe Groff ⎷⃣ (@jckarter) August 27, 2021

Obviously, it’s a joke, but we’re gonna talk more about what’s happening in the code itself.

So, what’s happening?

Just to be 100% clear, double(2101253) does not actually double the value of 2101253. It’s a cast from int to double.

If we write this differently, we can obtain this:

#include <cstdio>

int main() {
    printf("%d\n", 666);
    printf("%d\n", double(42));
}

On the x86_64 gcc 11.2 compiler, the output is as follows:

666
4202506

So we can see that the value 4202506 has nothing to do with either 666 or 42.

In fact, if we compile and run the same code with x86_64 clang 12.0.1, things are a little bit different:

666
4202514

You can see the live results here: https://godbolt.org/z/c6Me7a5ee

You may have guessed it already, but this comes from line 5, where we print a double as an int. But this is not some kind of conversion error. Of course your computer knows how to convert a double to an int, and it would do so just fine if that were what was happening. The issue comes from somewhere else.

The truth

If we want to understand why it works that way, we’ll have to take a look at the assembly code (https://godbolt.org/z/5YKEdj73r):

.LC0:
        .string "%d\n"
main:
        push    rbp
        mov     rbp, rsp
        mov     esi, 666
        mov     edi, OFFSET FLAT:.LC0
        mov     eax, 0
        call    printf
        mov     rax, QWORD PTR .LC1[rip]
        movq    xmm0, rax
        mov     edi, OFFSET FLAT:.LC0
        mov     eax, 1
        call    printf
        mov     eax, 0
        pop     rbp
        ret
.LC1:
        .long   0
        .long   1078263808

(use this Godbolt link to have a clearer matching between the C++ code and the assembly instructions: https://godbolt.org/z/5YKEdj73r)

In the yellow zone of the assembly code (lines 6 to 9, the equivalent of printf("%d\n", 666);) we can see that everything’s fine: the 666 value is put in the esi register and then the printf function is called. So it’s an educated guess to say that when printf reads a %d in the string it is given, it’ll look in the esi register for what to print.

However, we can see in the blue part of the code (lines 10 to 14, the equivalent of printf("%d\n", double(42));) that the value is put in another register: the xmm0 register. Since printf is given the same string as before, it is a fair guess that it will look into the esi register again and print whatever happens to be in there.

We can prove that statement pretty easily. Take the following code:

#include <cstdio>

int main() {
    printf("%d\n", 666);
    printf("%d %d\n", double(42), 24);
}

It’s the same code, with an additional integer that is printed in the second printf call.

If we look at the assembly (https://godbolt.org/z/jjeca8qd7):

.LC0:
        .string "%d %d\n"
main:
        push    rbp
        mov     rbp, rsp
        mov     esi, 666
        mov     edi, OFFSET FLAT:.LC0
        mov     eax, 0
        call    printf
        mov     rax, QWORD PTR .LC1[rip]
        mov     esi, 24
        movq    xmm0, rax
        mov     edi, OFFSET FLAT:.LC0
        mov     eax, 1
        call    printf
        mov     eax, 0
        pop     rbp
        ret
.LC1:
        .long   0
        .long   1078263808

The double(42) value still goes into the xmm0 register, and the integer 24, logically, ends up in the esi register. Thus, this happens in the output:

666
24 0

Why? Well, since we asked for two integers, the printf call will look into the first integer register available after the format string (esi) and print its content (24, as we stated above), then look in the following integer register (edx) and print whatever is in it (incidentally 0).

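In the end, the behavior we see occurs because of how arguments are passed on x86_64: under the calling convention used here, integer arguments travel in general-purpose registers (edi, esi, edx, and so on) while floating-point arguments travel in the xmm registers.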

What does the doc say?

The truth is that according to the reference (printf, fprintf, sprintf, snprintf, printf_s, fprintf_s, sprintf_s, snprintf_s – cppreference.com):

If a conversion specification is invalid, the behavior is undefined.

And this same reference is unambiguous about the %d conversion specifier:

converts a signed integer into decimal representation [-]dddd.
Precision specifies the minimum number of digits to appear. The default precision is 1.
If both the converted value and the precision are 0 the conversion results in no characters.

So, giving a double to a printf argument where you are supposed to give a signed integer is UB. So it was our mistake to write this in the first place.

This actually generates a warning with clang. But with gcc, you’ll have to activate -Wall to see any warning about that.

Wrapping up

The C language is a very, very old language. It’s older than C++ (obviously), which is itself very old. As a reminder, the first edition of the K&R was printed in 1978. This was thirteen years before my own birth. And unlike us humans, programming languages don’t age well.

I could have summarized this article with a classic “don’t perform UB”, but I think it’s a bit off-purpose this time. So I’ll go and say it: don’t use printf at all.

The problem is not with printf itself, it’s with using a feature from another language¹ that was originally published forty-three years ago. In short: don’t write C code.

Thanks for reading and see you next week!

1. Yeah, like it or not, C and C++ are different languages. Different purpose, different intentions, different meta. That is exactly why I always decline job offers that have the tag “C/C++”: they obviously can’t pick a side.