Author: Chloé Lourseyre
Editor: Peter Fordham
This year at the CppCon, Walter E. Brown made a Lightning Talk about Duff’s device (I’ll put a Youtube link here as soon as it’s available).
Duff’s device is a pretty old contraption and I was wondering “How useful can it be in 2021, with C++20 and all?”.
So here we go.
What is Duff’s device?
What we call Duff’s device (named after its creator Tom Duff) is a way to implement manual loop unrolling in the C language.
Loop unrolling is an execution time optimization technique in which we reduce the number loop control by manually “unrolling” the loop content. It trades execution time for binary size (because your code is generally larger when you use this technique).
The principle of Duff’s device is to perform several computations (usually four to eight) in the same loop, so the loop control is evaluated one out of a few computations instead of each computation.
So, instead of doing this:
void execute_loop(int & data, const size_t loop_size)
{
for (int i = 0 ; i < loop_size ; ++i)
{
computation(data);
}
}
We do something that looks like this:
void execute_loop(int & data, const size_t loop_size)
{
for (int i = 0 ; i < loop_size/4 ; ++i)
{
computation(data);
computation(data);
computation(data);
computation(data);
}
}
However, as you might have noticed, if loop_size
if not a multiple of 4, the function performs the wrong number of calls to computation()
. To rectify this, Duff’s device uses the C switch
fall-through functionality, just like this:
void execute_loop(int & data, const size_t loop_size)
{
size_t i = 0;
switch(loop_size%4)
{
do{
case 0: computation(data);
case 3: computation(data);
case 2: computation(data);
case 1: computation(data);
++i;
} while (i < (loop_size+3)/4);
}
}
This is a bit more alien written that way, so I’ll explain it here :
At the beginning of the function we enter the switch
statement and immediately check for modulo of loop_size
. Depending on the result, we end up in one of the four case
s. Then, because of the switch fallthrough, we end up doing a different number of computation depending on this modulo. This allows us to rectifies the problem of doing the wrong number of computations when loop_size
is not a multiple of 4.
But then what happens? After falling through, the program encounters the while
keyword. Thus, since it’s technically inside a do while()
loop, the program goes back to the do
statement and continue the loop as normal.
After the first occurrence, the case N
labels are ignored, so it is as if it was falling through again.
You can check the numbers if you like: every time, we end up doing the correct number of computations.
Is Duff’s device worth it?
Duff’s device is from another time, another era (heck, it’s even from another language), so my first reaction about it in 2021 would be “This kind of device is probably counter-productive, I’d rather let the compiler do the optimization for me.”
But I want tangible proof of that. So what about a few benchmarks?
Benchmarks
To do the benchmarks, I used this code: Quick C++ Benchmarks – Duff’s device (quick-bench.com).
Here are the results1:
Compiler | Optimization option | Basic loop (cpu_time) | Duff’s device (cpu_time) | Duff’s device / Basic loop |
Clang 12.0 | -Og | 7.5657e+4 | 7.2965e+4 | – 3.6% |
Clang 12.0 | -O1 | 7.0786e+4 | 7.3221e+4 | + 3.4% |
Clang 12.0 | -O2 | 1.2452e-1 | 1.2423e-1 | – 0.23% |
Clang 12.0 | -O3 | 1.2620e-1 | 1.2296e-1 | – 2.6% |
GCC 10.3 | -Og | 4.7117e+4 | 4.7933e+4 | + 1.7% |
GCC 10.3 | -O1 | 7.0789e+4 | 7.2404e+4 | + 2.3% |
GCC 10.3 | -O2 | 4.1516e-6 | 4.1224e-6 | – 0.70% |
GCC 10.3 | -O3 | 4.0523e-6 | 4.0654e-6 | + 0.32% |
In this case, the difference is insignificant (3,5% on a benchmark really is not much, in live code this would be diluted into the rest of the codebase). Plus, whether it is the basic loop or Duff’s device which is the fastest depends on the optimization level and compiler.
After that, I used a more simple version of computation()
(one the compiler will optimize easier), this one: Quick C++ Benchmarks – Duff’s device (quick-bench.com).
This give these results:
Compiler | Optimization option | Basic loop (cpu_time) | Duff’s device (cpu_time) | Duff’s device / Basic loop |
Clang 12.0 | -Og | 5.9463e+4 | 5.9547e+4 | + 0.14% |
Clang 12.0 | -O1 | 5.9182e+4 | 5.9235e+4 | + 0.09% |
Clang 12.0 | -O2 | 4.0450e-6 | 1.2233e-1 | + 3 000 000% |
Clang 12.0 | -O3 | 4.0398e-6 | 1.2502e-1 | + 3 000 000% |
GCC 10.3 | -Og | 4.2780e+4 | 4.0090e+4 | – 6.3% |
GCC 10.3 | -O1 | 1.1299e+4 | 5.9238e+4 | + 420% |
GCC 10.3 | -O2 | 3.8900e-6 | 3.8850e-6 | – 0.13% |
GCC 10.3 | -O3 | 5.3264e-6 | 4.1162e-6 | – 23% |
This is interesting, because we can see that Clang can, on its own, greatly optimize the basic loop without managing to optimize Duff’s device (with -O2
and -O3
the basic loop is 30 000 times faster than Duff’s device; this is because the compiler optimize the basic loop into a single operation, but consider Duff’s device too complicated to be optimized).
On the other hand, GCC does not manage to optimize the basic much more than Duff’s device in the end, so if at -O1
the basic loop is more than 5 times faster, with -O3
Duff’s device is almost 23% faster (which is significant)2.
Readability and semantics
At first glance, Duff’s device is a very odd contraption. However, it is relatively well known among C and C++ developers (especially the oldest ones). Plus, we already have a name for it and a pretty good Wikipedia page to explains how it works.
As long as you identify it as such in the comments, I think it s pretty safe to use Duff’s device, even if you know your coworkers don’t know about it (you can even put the Wikipedia link in the comments if you like!).
Trying to seek a very specific case
Principle
The loop unrolling specifically aims to reduce the number of loop controls that are evaluated. So I set up a specific case where the loop control (and index increment) are both heavier to evaluate.
So instead of using an integer as the loop index, I used this class:
struct MyIndex
{
int index;
MyIndex(int base_index): index(base_index) {}
MyIndex& operator++()
{
if (index%2 == 0)
index+=3;
else
index-=1;
return *this;
}
bool operator<(const MyIndex& rhs)
{
if (index%3 == 0)
return index < rhs.index;
else if (index%3 == 1)
return index < rhs.index+2;
else
return index < rhs.index+6;
}
};
Each time we increment or compare the MyIndex
, we perform at least one modulo operation (a pretty heavy arithmetic operation).
And I ran the benchmarks on it.
Benchmarks
So I use the following code: Quick C++ Benchmarks – Duff’s device with strange index (quick-bench.com)
This give these results:
Compiler | Optimization option | Basic loop (cpu_time) | Duff’s device (cpu_time) | Duff’s device / Basic loop |
Clang 12.0 | -Og | 2.0694e+5 | 5.9710e+4 | – 71% |
Clang 12.0 | -O1 | 1.8356e+5 | 5.8805e+4 | – 68% |
Clang 12.0 | -O2 | 1.2318e-1 | 1.2582e-1 | + 2.1% |
Clang 12.0 | -O3 | 1.2955e-1 | 1.2553e-4 | – 3.1% |
GCC 10.3 | -Og | 6.2676e+4 | 4.0014e+4 | – 36% |
GCC 10.3 | -O1 | 7.0324e+4 | 6.0959e+4 | – 13% |
GCC 10.3 | -O2 | 6.5143e+4 | 4.0898e-6 | – 100% |
GCC 10.3 | -O3 | 4.1155e-6 | 4.0917e-6 | – 0.58% |
Here, we can see that Duff’s device is always better in the low optimization levels, but never in a significant advantage at -O3
. This means that the compiler manages to optimize the basic loop as much as Duff’s device in the higher grades of optimization. This is significantly different from the previous results.
Why are the results so inconsistent?
The benchmark show very inconsistent results: for instance, how come that in the context of the simple computation()
, with GCC and -O1
, the basic loop is more than five times faster than Duff’s device, whereas with -O3
, it’s Duff’s device that is 23% faster? How come that for the same code, Clang show totally different results than GCC and show that the basic loop is thirty thousand times faster with -O2
and -O3
?
This is because each compiler has its own ways to optimize these kinds of loops at different level of optimization.
I you want to look into it, you can compare the assembly code generated by each compiler, just like in this example: Compiler Explorer (godbolt.org) where the GCC and the Clang version of the -O3
level of optimization are put side-by-side.
I would have loved detailing that here, but unfortunately it would take more than one whole article to analyze them all. If you, reader of this article, want to take things into your own hands and perform the analysis yourself, I’ll be more than glad to publish your results on this blog (you can contact me here).
Wrapping up
I will summarize the results in the following chart, which indicate which device is best in the different implementations we saw:
Compiler | Optimization option | Complex computation() | Trivial computation() | Heavy loop control |
Clang 12.0 | -Og | None | None | Duff’s device |
Clang 12.0 | -O1 | None | None | Duff’s device |
Clang 12.0 | -O2 | None | Basic loop | None |
Clang 12.0 | -O3 | None | Basic loop | None |
GCC 10.3 | -Og | None | None | Duff’s device |
GCC 10.3 | -O1 | None | Basic loop | Duff’s device |
GCC 10.3 | -O2 | None | None | Duff’s device |
GCC 10.3 | -O3 | None | Duff’s device | None |
How to interpret these results?
First, when we have a complex computation and a trivial loop control, there is no significant difference between the both.
Second, when to computation is trivial, it’s often the basic loop which is better, but not always.
Third, as expected, it is Duff’s device which is preferred with a heavy loop control, but it is not always necessary.
And finally, the results will almost always depend on your implementation. While doing my research for this article, I found myself trying several implementations of the code I used to illustrate Duff’s device, and I often ended up with pretty different benchmarks each time I made a tiny edit on the code.
My point here is that sometimes Duff’s device is better than a basic basic loop, and sometimes it’s the other way around (even if, most of the time, there is no major difference).
In conclusion, Duff’s device if still worth considering3, but you’ll have to do your own benchmarks to be sure where it is indeed useful. However, Duff’s device does add more verbosity to the code. Even if it’s easy to document (as stated before), you may not have the time (or not want to spend the time) to do benchmarks and consider Duff’s device. It’s up to you.
Thanks for reading and see you next week!
Author: Chloé Lourseyre
Editor: Peter Fordham
Addendum
Notes
- The “cpu_time” mentioned in the chart is an abstract unit of measure, prompted by quick-bench. It has no use on its own, it s only used to compare each benchmark. That’s why the order of magnitude may vary from one line to another. You want to look at the last column.
- The results presented here also depends on the implementation of each
compute_*()
. For instance, if you evaluate(loop_size+3)/4
each loop instead of using aconst size_t
to store it, GCC results are very different and Duff’s device is no longer significantly the best with-O3
. - I’ll just add this note here to remind you one trivial fact: execution time optimization is only worth considering when your code is time-sensitive. If you work on a non-time-sensitive code, you shouldn’t even consider Duff’s device in the first place. When possible, keep it simple, and keep in mind the 80:20 rule.