Why does GCC generate 15-20% faster code if I optimize for size instead of speed?

I first noticed in 2009 that GCC (at least on my projects and on my machines) has a tendency to generate noticeably faster code if I optimize for size (-Os) instead of speed (-O2 or -O3), and I have been wondering ever since why.

I have managed to create (rather silly) code that shows this surprising behavior and is sufficiently small to be posted here.

const int LOOP_BOUND = 200000000;

__attribute__((noinline))
static int add(const int& x, const int& y) {
    return x + y;
}

__attribute__((noinline))
static int work(int xval, int yval) {
    int sum(0);
    for (int i=0; i<LOOP_BOUND; ++i) {
        int x(xval+sum);
        int y(yval+sum);
        int z = add(x, y);
        sum += z;
    }
    return sum;
}

int main(int , char* argv[]) {
    int result = work(*argv[1], *argv[2]);
    return result;
}

If I compile it with -Os, it takes 0.38 s to execute this program, and 0.44 s if it is compiled with -O2 or -O3. These times are obtained consistently and with practically no noise (gcc 4.7.2, x86_64 GNU/Linux, Intel Core i5-3320M).
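(For reference, here is a sketch of how the same measurement could be done in-process with std::chrono instead of an external timer; the 0.38 s / 0.44 s figures above were not obtained this way. It assumes the add()/work() definitions from the listing above and needs -std=c++11 added to the flags.)

// Hypothetical self-timing replacement for main(); only a sketch.
#include <chrono>
#include <cstdio>

int main(int, char* argv[]) {
    using clock = std::chrono::steady_clock;
    const auto start = clock::now();
    const int result = work(*argv[1], *argv[2]);   // same call as in the original main()
    const auto stop = clock::now();
    const double seconds =
        std::chrono::duration_cast<std::chrono::duration<double>>(stop - start).count();
    std::printf("result=%d  time=%.3f s\n", result, seconds);
    return 0;
}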

(Update: I have moved all the assembly code to GitHub: it made the post bloated and apparently adds very little value to the question, as the -fno-align-* flags have the same effect.)

Here is the generated assembly with -Os and -O2.

Unfortunately, my understanding of assembly is very limited, so I have no idea whether what I did next was correct: I grabbed the assembly for -O2 and merged all its differences into the assembly for -Os except the .p2align lines, result here. This code still runs in 0.38s and the only difference is the .p2align stuff.

If I guess correctly, these paddings are there to align the code. According to Why does GCC pad functions with NOPs? the padding is done in the hope that the code will run faster, but apparently this optimization backfired in my case.

Is it the padding that is the culprit in this case? Why and how?

The noise these alignments introduce pretty much makes timing micro-optimizations impossible.

How can I make sure that such accidental lucky / unlucky alignments are not interfering when I do micro-optimizations (unrelated to stack alignment) on C or C++ source code?
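(One approach I have been considering, shown below as a sketch only, not a verified fix: GCC's aligned attribute can also be applied to functions and pins the first instruction of a function to a given boundary, so the hot functions at least do not silently move to a different address between builds. The 64 below is an assumption on my part, matching a common cache-line size, not a measured optimum.)

// Sketch: pin the hot functions to a fixed 64-byte boundary so their placement
// cannot change "accidentally" between otherwise unrelated edits.
// (64 is an assumed value, not a measured optimum.)
__attribute__((noinline, aligned(64)))
static int add(const int& x, const int& y) {
    return x + y;
}

__attribute__((noinline, aligned(64)))
static int work(int xval, int yval) {
    int sum(0);
    for (int i = 0; i < LOOP_BOUND; ++i) {
        int x(xval + sum);
        int y(yval + sum);
        int z = add(x, y);
        sum += z;
    }
    return sum;
}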


UPDATE:

Following Pascal Cuoq's answer, I tinkered a little bit with the alignments. By passing -O2 -fno-align-functions -fno-align-loops to gcc, all .p2align directives are gone from the assembly and the resulting executable runs in 0.38 s. According to the gcc documentation:

-Os enables all -O2 optimizations [but] -Os disables the following optimization flags:

  -falign-functions  -falign-jumps  -falign-loops
  -falign-labels  -freorder-blocks  -freorder-blocks-and-partition
  -fprefetch-loop-arrays

So, it pretty much seems like a (mis)alignment issue.

I am still skeptical about -march=native as suggested in Marat Dukhan's answer. I am not convinced that it isn't just interfering with this (mis)alignment issue; it has absolutely no effect on my machine. (Nevertheless, I upvoted his answer.)


UPDATE 2:

We can take -Os out of the picture. The following times are obtained by compiling with

  • -O2 -fno-omit-frame-pointer: 0.37 s

  • -O2 -fno-align-functions -fno-align-loops: 0.37 s

  • -S -O2, then manually moving the assembly of add() after work(): 0.37 s

  • -O2: 0.44 s

It looks to me like the distance of add() from the call site matters a lot. I have tried perf, but the output of perf stat and perf report makes very little sense to me. However, I could only get one consistent result out of it:

-O2:

 602,312,864 stalled-cycles-frontend   #    0.00% frontend cycles idle
       3,318 cache-misses
 0.432703993 seconds time elapsed
 [...]
 81.23%  a.out  a.out              [.] work(int, int)
 18.50%  a.out  a.out              [.] add(int const&, int const&) [clone .isra.0]
 [...]
       ¦   __attribute__((noinline))
       ¦   static int add(const int& x, const int& y) {
       ¦       return x + y;
100.00 ¦     lea    (%rdi,%rsi,1),%eax
       ¦   }
       ¦   ? retq
[...]
       ¦            int z = add(x, y);
  1.93 ¦    ? callq  add(int const&, int const&) [clone .isra.0]
       ¦            sum += z;
 79.79 ¦      add    %eax,%ebx

For -fno-align-*:

 604,072,552 stalled-cycles-frontend   #    0.00% frontend cycles idle
       9,508 cache-misses
 0.375681928 seconds time elapsed
 [...]
 82.58%  a.out  a.out              [.] work(int, int)
 16.83%  a.out  a.out              [.] add(int const&, int const&) [clone .isra.0]
 [...]
       ¦   __attribute__((noinline))
       ¦   static int add(const int& x, const int& y) {
       ¦       return x + y;
 51.59 ¦     lea    (%rdi,%rsi,1),%eax
       ¦   }
[...]
       ¦    __attribute__((noinline))
       ¦    static int work(int xval, int yval) {
       ¦        int sum(0);
       ¦        for (int i=0; i<LOOP_BOUND; ++i) {
       ¦            int x(xval+sum);
  8.20 ¦      lea    0x0(%r13,%rbx,1),%edi
       ¦            int y(yval+sum);
       ¦            int z = add(x, y);
 35.34 ¦    ? callq  add(int const&, int const&) [clone .isra.0]
       ¦            sum += z;
 39.48 ¦      add    %eax,%ebx
       ¦    }

For -fno-omit-frame-pointer:

 404,625,639 stalled-cycles-frontend   #    0.00% frontend cycles idle
      10,514 cache-misses
 0.375445137 seconds time elapsed
 [...]
 75.35%  a.out  a.out              [.] add(int const&, int const&) [clone .isra.0]                                                                                     ¦
 24.46%  a.out  a.out              [.] work(int, int)
 [...]
       ¦   __attribute__((noinline))
       ¦   static int add(const int& x, const int& y) {
 18.67 ¦     push   %rbp
       ¦       return x + y;
 18.49 ¦     lea    (%rdi,%rsi,1),%eax
       ¦   const int LOOP_BOUND = 200000000;
       ¦
       ¦   __attribute__((noinline))
       ¦   static int add(const int& x, const int& y) {
       ¦     mov    %rsp,%rbp
       ¦       return x + y;
       ¦   }
 12.71 ¦     pop    %rbp
       ¦   ? retq
 [...]
       ¦            int z = add(x, y);
       ¦    ? callq  add(int const&, int const&) [clone .isra.0]
       ¦            sum += z;
 29.83 ¦      add    %eax,%ebx

It looks like we are stalling on the call to add() in the slow case.

I have examined everything that perf -e can spit out on my machine, not just the stats given above.

For the same executable, the stalled-cycles-frontend shows linear correlation with the execution time; I did not notice anything else that would correlate so clearly. (Comparing stalled-cycles-frontend for different executables doesn't make sense to me.)
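(For completeness, here is a sketch of how the same stalled-cycles-frontend counter could be read in-process around a single call to work(), via the Linux perf_event_open interface; the numbers above all come from perf stat on the whole program, not from this code. It assumes the add()/work() definitions from the listing above.)

// Sketch: count stalled-cycles-frontend across one call to work() only,
// using perf_event_open. Error handling is minimal on purpose.
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

static int open_stall_counter() {
    perf_event_attr attr;
    std::memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_HARDWARE;
    attr.config = PERF_COUNT_HW_STALLED_CYCLES_FRONTEND;
    attr.disabled = 1;
    attr.exclude_kernel = 1;
    attr.exclude_hv = 1;
    // pid = 0 (this thread), cpu = -1 (any CPU), no group leader, no flags
    return static_cast<int>(syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0));
}

int main(int, char* argv[]) {
    const int fd = open_stall_counter();
    if (fd < 0) { std::perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    const int result = work(*argv[1], *argv[2]);   // the loop under test
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    std::uint64_t stalled = 0;
    if (read(fd, &stalled, sizeof(stalled)) != (ssize_t)sizeof(stalled)) {
        std::fprintf(stderr, "short read from counter\n");
    }
    std::printf("result=%d  stalled-cycles-frontend=%llu\n",
                result, static_cast<unsigned long long>(stalled));
    close(fd);
    return 0;
}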

I included the cache misses because they came up in the first comment. I examined all the cache misses that perf can measure on my machine, not just the ones given above. The cache misses are very noisy and show little to no correlation with the execution times.


Blind guess: can this be a cache miss?


H2CO3 That was my first thought as well, but I wasn't confident enough to post it without reading and understanding the OP's question in depth.


g-makulik That's why I warned that it's a "blind guess" ;-) "TL;DR" is reserved for bad questions. :P


Just an interesting data point: I find that -O3 or -Ofast is about 1.5x as fast as -Os when I compile this with clang on OS X. (I haven't tried reproducing with gcc.)


It is the same code. Take a closer look at the address of .L3; misaligned branch targets are expensive.


Just to make it clear: did you actually go and measure the performance of OP's code on 12 different platforms? (+1 for the mere thought that you would do that)


anatolyg Yes, I did! (and will add a few more soon)


Indeed. Another +1 for not only theorizing about different CPUs but actually proving it. Not something (alas) you see in every answer concerning speed. Are these tests run on the same OS? (That might skew the results...)


Ali On AMD-FX 6300 -O2 -fno-align-functions -fno-align-loops drops time to 0.340s, so it could be explained by alignment. However, optimal alignment is processor-dependent: some processors prefer aligned loops and functions.


Jongware I don't see how the OS would significantly influence the results; the loop never makes system calls.


Wow... This definitely goes beyond what I usually do to get around benchmarking anomalies.


Ali I guess that makes sense, since how can the compiler inline something it doesn't see? That's probably why we use inline plus the function definition in the header. Not sure how mature LTO is in gcc. My experience with it, at least in MinGW, has been hit or miss.


greatwolf Yes, after the fact, it is clear. :) I believe both the Intel C++ Compiler and the Visual Studio compiler can do link time optimization. In the latter it is called link time code generation, although I haven't used Windows for ages.


I think it was Communications of the ACM that had an article a few years ago about running fairly big applications (perl, Spice, etc.) while shifting the whole binary image one byte at a time by using Linux environments of different size. I recall typical variance of 15% or so. Their summary was that many benchmark results are useless because this external variable of alignment is not taken into account.


Upvoted, particularly for -flto; it's pretty revolutionary if you've never used it before, speaking from experience :)


The funny thing is: -O2 -fno-omit-frame-pointer is just as good as -Os. Please check the updated question.


How much you wanna bet those align= parameters relate to the size of cache lines?


I don't really care anymore: It's not visible on my machine. And by passing the -falign-*=16 flags, everything is back to normal, everything behaves consistently. As far as I am concerned, this question is resolved.


Thanks. I played with it: I only get a speed-up by swapping add() and work() if -O2 is passed. In all other cases the code gets significantly slower by swapping. Over the weekend, I also analyzed branch prediction / misprediction statistics with perf, and I did not notice anything that could explain this weird behavior. The only consistent result is that in the slow case perf reports 100.0 in add() and a large value on the line right after the call to add() in the loop. It looks like we are stalling for some reason on add() in the slow case but not in the fast runs.


I am thinking about installing Intel's VTune on one of my machines and doing the profiling myself. perf supports only a limited number of things; perhaps Intel's tools are a bit more handy on their own processors.
