Dilemma – Belay the C++

Duff’s device in 2021

Author: Chloé Lourseyre
Editor: Peter Fordham

This year at the CppCon, Walter E. Brown made a Lightning Talk about Duff’s device (I’ll put a Youtube link here as soon as it’s available).

Duff’s device is a pretty old contraption and I was wondering “How useful can it be in 2021, with C++20 and all?”.

So here we go.

What is Duff’s device?

What we call Duff’s device (named after its creator Tom Duff) is a way to implement manual loop unrolling in the C language.

Loop unrolling is an execution time optimization technique in which we reduce the number loop control by manually “unrolling” the loop content. It trades execution time for binary size (because your code is generally larger when you use this technique).

The principle of Duff’s device is to perform several computations (usually four to eight) in the same loop, so the loop control is evaluated one out of a few computations instead of each computation.

So, instead of doing this:

void execute_loop(int & data, const size_t loop_size)
{
    for (int i = 0 ; i < loop_size ; ++i)
    {
        computation(data);
    }
}

We do something that looks like this:

void execute_loop(int & data, const size_t loop_size)
{
    for (int i = 0 ; i < loop_size/4 ; ++i)
    {
        computation(data);
        computation(data);
        computation(data);
        computation(data);
    }
}

However, as you might have noticed, if loop_size if not a multiple of 4, the function performs the wrong number of calls to computation(). To rectify this, Duff’s device uses the C switch fall-through functionality, just like this:

void execute_loop(int & data, const size_t loop_size)
{
    size_t i = 0;
    switch(loop_size%4)
    {
        do{
            case 0: computation(data);
            case 3: computation(data);
            case 2: computation(data);
            case 1: computation(data);
            ++i;
        } while (i < (loop_size+3)/4);
    }
}

This is a bit more alien written that way, so I’ll explain it here :

At the beginning of the function we enter the switch statement and immediately check for modulo of loop_size. Depending on the result, we end up in one of the four cases. Then, because of the switch fallthrough, we end up doing a different number of computation depending on this modulo. This allows us to rectifies the problem of doing the wrong number of computations when loop_size is not a multiple of 4.

But then what happens? After falling through, the program encounters the while keyword. Thus, since it’s technically inside a do while() loop, the program goes back to the do statement and continue the loop as normal.

After the first occurrence, the case N labels are ignored, so it is as if it was falling through again.

You can check the numbers if you like: every time, we end up doing the correct number of computations.

Is Duff’s device worth it?

Duff’s device is from another time, another era (heck, it’s even from another language), so my first reaction about it in 2021 would be “This kind of device is probably counter-productive, I’d rather let the compiler do the optimization for me.”

But I want tangible proof of that. So what about a few benchmarks?

Benchmarks

To do the benchmarks, I used this code: Quick C++ Benchmarks – Duff’s device (quick-bench.com).

Here are the results¹:

Compiler	Optimization option	Basic loop (cpu_time)	Duff’s device (cpu_time)	Duff’s device / Basic loop
Clang 12.0	-Og	7.5657e+4	7.2965e+4	– 3.6%
Clang 12.0	-O1	7.0786e+4	7.3221e+4	+ 3.4%
Clang 12.0	-O2	1.2452e-1	1.2423e-1	– 0.23%
Clang 12.0	-O3	1.2620e-1	1.2296e-1	– 2.6%
GCC 10.3	-Og	4.7117e+4	4.7933e+4	+ 1.7%
GCC 10.3	-O1	7.0789e+4	7.2404e+4	+ 2.3%
GCC 10.3	-O2	4.1516e-6	4.1224e-6	– 0.70%
GCC 10.3	-O3	4.0523e-6	4.0654e-6	+ 0.32%

In this case, the difference is insignificant (3,5% on a benchmark really is not much, in live code this would be diluted into the rest of the codebase). Plus, whether it is the basic loop or Duff’s device which is the fastest depends on the optimization level and compiler.

After that, I used a more simple version of computation() (one the compiler will optimize easier), this one: Quick C++ Benchmarks – Duff’s device (quick-bench.com).

This give these results:

Compiler	Optimization option	Basic loop (cpu_time)	Duff’s device (cpu_time)	Duff’s device / Basic loop
Clang 12.0	-Og	5.9463e+4	5.9547e+4	+ 0.14%
Clang 12.0	-O1	5.9182e+4	5.9235e+4	+ 0.09%
Clang 12.0	-O2	4.0450e-6	1.2233e-1	+ 3 000 000%
Clang 12.0	-O3	4.0398e-6	1.2502e-1	+ 3 000 000%
GCC 10.3	-Og	4.2780e+4	4.0090e+4	– 6.3%
GCC 10.3	-O1	1.1299e+4	5.9238e+4	+ 420%
GCC 10.3	-O2	3.8900e-6	3.8850e-6	– 0.13%
GCC 10.3	-O3	5.3264e-6	4.1162e-6	– 23%

This is interesting, because we can see that Clang can, on its own, greatly optimize the basic loop without managing to optimize Duff’s device (with -O2 and -O3 the basic loop is 30 000 times faster than Duff’s device; this is because the compiler optimize the basic loop into a single operation, but consider Duff’s device too complicated to be optimized).

On the other hand, GCC does not manage to optimize the basic much more than Duff’s device in the end, so if at -O1 the basic loop is more than 5 times faster, with -O3 Duff’s device is almost 23% faster (which is significant)².

Readability and semantics

At first glance, Duff’s device is a very odd contraption. However, it is relatively well known among C and C++ developers (especially the oldest ones). Plus, we already have a name for it and a pretty good Wikipedia page to explains how it works.

As long as you identify it as such in the comments, I think it s pretty safe to use Duff’s device, even if you know your coworkers don’t know about it (you can even put the Wikipedia link in the comments if you like!).

Trying to seek a very specific case

Principle

The loop unrolling specifically aims to reduce the number of loop controls that are evaluated. So I set up a specific case where the loop control (and index increment) are both heavier to evaluate.

So instead of using an integer as the loop index, I used this class:

struct MyIndex
{
  int index;
  
  MyIndex(int base_index): index(base_index) {}
  
  MyIndex& operator++() 
  {  
    if (index%2 == 0)
      index+=3;
    else
      index-=1;
    return *this;
  }

  bool operator<(const MyIndex& rhs)
  {
    if (index%3 == 0)
      return index < rhs.index;
    else if (index%3 == 1)
      return index < rhs.index+2;
    else
      return index < rhs.index+6;
  }
};

Each time we increment or compare the MyIndex, we perform at least one modulo operation (a pretty heavy arithmetic operation).

And I ran the benchmarks on it.

Benchmarks

So I use the following code: Quick C++ Benchmarks – Duff’s device with strange index (quick-bench.com)

This give these results:

Compiler	Optimization option	Basic loop (cpu_time)	Duff’s device (cpu_time)	Duff’s device / Basic loop
Clang 12.0	-Og	2.0694e+5	5.9710e+4	– 71%
Clang 12.0	-O1	1.8356e+5	5.8805e+4	– 68%
Clang 12.0	-O2	1.2318e-1	1.2582e-1	+ 2.1%
Clang 12.0	-O3	1.2955e-1	1.2553e-4	– 3.1%
GCC 10.3	-Og	6.2676e+4	4.0014e+4	– 36%
GCC 10.3	-O1	7.0324e+4	6.0959e+4	– 13%
GCC 10.3	-O2	6.5143e+4	4.0898e-6	– 100%
GCC 10.3	-O3	4.1155e-6	4.0917e-6	– 0.58%

Here, we can see that Duff’s device is always better in the low optimization levels, but never in a significant advantage at -O3. This means that the compiler manages to optimize the basic loop as much as Duff’s device in the higher grades of optimization. This is significantly different from the previous results.

Why are the results so inconsistent?

The benchmark show very inconsistent results: for instance, how come that in the context of the simple computation(), with GCC and -O1, the basic loop is more than five times faster than Duff’s device, whereas with -O3, it’s Duff’s device that is 23% faster? How come that for the same code, Clang show totally different results than GCC and show that the basic loop is thirty thousand times faster with -O2 and -O3?

This is because each compiler has its own ways to optimize these kinds of loops at different level of optimization.

I you want to look into it, you can compare the assembly code generated by each compiler, just like in this example: Compiler Explorer (godbolt.org) where the GCC and the Clang version of the -O3 level of optimization are put side-by-side.

I would have loved detailing that here, but unfortunately it would take more than one whole article to analyze them all. If you, reader of this article, want to take things into your own hands and perform the analysis yourself, I’ll be more than glad to publish your results on this blog (you can contact me here).

Wrapping up

I will summarize the results in the following chart, which indicate which device is best in the different implementations we saw:

Compiler	Optimization option	Complex `computation()`	Trivial `computation()`	Heavy loop control
Clang 12.0	-Og	None	None	Duff’s device
Clang 12.0	-O1	None	None	Duff’s device
Clang 12.0	-O2	None	Basic loop	None
Clang 12.0	-O3	None	Basic loop	None
GCC 10.3	-Og	None	None	Duff’s device
GCC 10.3	-O1	None	Basic loop	Duff’s device
GCC 10.3	-O2	None	None	Duff’s device
GCC 10.3	-O3	None	Duff’s device	None

How to interpret these results?

First, when we have a complex computation and a trivial loop control, there is no significant difference between the both.

Second, when to computation is trivial, it’s often the basic loop which is better, but not always.

Third, as expected, it is Duff’s device which is preferred with a heavy loop control, but it is not always necessary.

And finally, the results will almost always depend on your implementation. While doing my research for this article, I found myself trying several implementations of the code I used to illustrate Duff’s device, and I often ended up with pretty different benchmarks each time I made a tiny edit on the code.

My point here is that sometimes Duff’s device is better than a basic basic loop, and sometimes it’s the other way around (even if, most of the time, there is no major difference).

In conclusion, Duff’s device if still worth considering³, but you’ll have to do your own benchmarks to be sure where it is indeed useful. However, Duff’s device does add more verbosity to the code. Even if it’s easy to document (as stated before), you may not have the time (or not want to spend the time) to do benchmarks and consider Duff’s device. It’s up to you.

Thanks for reading and see you next week!

Author: Chloé Lourseyre
Editor: Peter Fordham

Addendum

Notes

The “cpu_time” mentioned in the chart is an abstract unit of measure, prompted by quick-bench. It has no use on its own, it s only used to compare each benchmark. That’s why the order of magnitude may vary from one line to another. You want to look at the last column.
The results presented here also depends on the implementation of each compute_*(). For instance, if you evaluate (loop_size+3)/4 each loop instead of using a const size_t to store it, GCC results are very different and Duff’s device is no longer significantly the best with -O3.
I’ll just add this note here to remind you one trivial fact: execution time optimization is only worth considering when your code is time-sensitive. If you work on a non-time-sensitive code, you shouldn’t even consider Duff’s device in the first place. When possible, keep it simple, and keep in mind the 80:20 rule.

Pragma: once or twice?

Author: Chloé Lourseyre
Editor: Peter Fordham

Context

Header guards

Every C++ developer has been taught header guards. Header guards are a way to prevent a header being included multiple times which would be problematic because it would mean that the variables, function and classes in that header would be defined several times, leading to a compilation error.

Example of a header guard:

#ifndef HEADER_FOOBAR
#define HEADER_FOOBAR

class FooBar
{
    // ...
};

#endif // HEADER_FOOBAR

For those who are not familiar with it, here is how it works: the first time the file is included, the macro HEADER_FOOBAR is not defined. Thus, we enter into the #ifndef control directive. In there, we defined HEADER_FOOBAR and the class FooBar. Later, if we include the file again, since HEADER_FOOBAR is defined, we don’t enter into the #ifndef again, so the class FooBar is not defined a second time.

#pragma once

#pragma is a preprocessor directive providing additional information to the compiler, beyond what is conveyed in the language itself.

Any compiler is free to interpret pragma directive as it wishes. However, over the years, some pragma directives have acquired significant popularity and are now almost-standard (such as #pragma once, which is the topic of this article, or #pragma pack).

#pragma once is a directive that indicates to the compiler to include the file only once. The compiler manages itself how it remembers which files are already included or not.

So, instinctively, we can think that the #pragma once directive does the job of a header guard, but with only one line and without having to think of a macro name.

Today?

In the past, the #pragma once directive was not implemented for every compiler, so it was less portable than header guards.

But today, in C++, I did not find a single instance of compiler that does not implement this directive.

So why bother using header guards anymore? Answer: because of the issue I’m about to describe.

A strange issues with #pragma once

There is actually one kind of issue that can occur with #pragma once that cannot occur with header guards.

Say, for instance, that your header file is duplicated somewhere. This is a not-so-uncommon issue that may have multiple causes:

You messed up a merge and your version control system duplicated some files.
The version control system messed up the move of some files and they ended up duplicated.
Your filesystem has two separate mount points that gives a path to the same files. They all appear as two different sets of files (since they are present on both disks).
Someone duplicates one of your file in another part of the project for its personal use, without renaming anything (this is bad manners, but it happens).

(please note that I encountered each of these four issues at some point in my career).

When this happens, when you have the same file duplicated, header guards and #pragma once do not behave the same way:

Since the macros that guard each file have the same name, the header guards will work perfectly fine and only include one file.
Since, from the FS point of view, files are different, #pragma once will behave as if they are different file, and thus include each file separately. This leads to a compilation error.

Issues with header guards?

Header guards can have issues too. You can have typos in the macro names, rendering your guards useless. That can’t happen with #pragma once, also it’s possible for macro name to clash if they are badly chosen, also can’t happen with #pragma once.

However, these issues can be easily avoided (typos are easy to detect and name clashes are prevented if you have a good naming convention).

A huge benefit though!

There is also a usage of header guards that is very useful for testing and that is not possible with #pragma once.

Say you want to test the class Foo (in file Foo.h) that uses the class Bar (in file Bar.h). But, for testing purpose, you want to stub class Bar.

One option header guards allows you is to create your won mock of class Bar in file BarMock.h. If the mock uses the same headers guards than the original Bar, then in you test, when you include BarMock.h then Foo.h, the header Bar.h will not be included (because the mock is already included and has the same guards).

So, should I use #pragma once or header guards?

This question is a bit difficult to answer. Let’s take a look at the cons of each method:

#pragma once are non-standard and are a major issue when you end up in a degraded environment.
Header guards may have issues if handled improperly.

In my opinion, #pragma directives are to be avoid when possible. If, in practice, they work, they are not formally standard.

Dear C++20, what about Modules?

Modules, one of the “big four” features of C++20, changes our vision of the “classical build process”. Instead of having source and header files, we can now have modules. They overcome the restrictions of header files and promise a lot: faster build-times, fewer violations of the One-Definition-Rule, less usage of the preprocessor.

Thanks to modules, we can say that #pragma once and header guards issues are no more.

To learn more about modules, check these out:

Wrapping up

This article, talking about pragmas and header guards, targets project that are prior to C++20. If you are in this case and hesitate between #pragma once and header guards, maybe it’s time to upgrade to C++20?

If you can’t upgrade to C++20 (few industrial project can), then choose wisely between #pragma once and header guards,

Thanks for reading and see you next time!

Author: Chloé Lourseyre
Editor: Peter Fordham

Category: Dilemma

Duff’s device in 2021

What is Duff’s device?

Is Duff’s device worth it?

Benchmarks

Readability and semantics

Trying to seek a very specific case

Principle

Benchmarks

Why are the results so inconsistent?

Wrapping up

Addendum

Notes

Pragma: once or twice?

Context

Header guards

#pragma once

Today?

A strange issues with #pragma once

Issues with header guards?

A huge benefit though!

So, should I use #pragma once or header guards?

Dear C++20, what about Modules?

Wrapping up

Addendum

Source