SO MANY COMMENTS! You can view them in chat and even leave your own there if you want, but please don't add any more here!

2018年09月26日36分20秒

Also see GCC Issue 62011, False Data Dependency in popcnt instruction. Someone else provided it, but it seems to have been lost during cleanups.

2018年09月26日36分20秒

Hi folks! Lots of past comments here; before leaving a new one, please review the archive.

2018年09月26日36分20秒

This still reproduces in clang at head. I've filed a bug: bugs.llvm.org/show_bug.cgi?id=34936.

2018年09月26日36分20秒

Interesting, can you add compiler version and compiler flags? The best thing is that on your machine, the results are turned around, i.e., using u64 is faster. Until now, I have never thought about which type my loop variable has, but it seems I have to think twice next time :).

2018年09月26日36分20秒

gexicide: I wouldn't call a jump from 16.8201 to 16.8126 making it "faster".

2018年09月26日36分20秒

Mehrdad: The jump I mean is the one between 12.9 and 16.8, so unsigned is faster here. In my benchmark, the opposite was the case, i.e. 26 for unsigned and 15 for uint64_t.

2018年09月26日36分20秒

gexicide Have you noticed the difference in addressing buffer[i]?

2018年09月26日36分20秒

Calvin: No, what do you mean?

2018年09月27日36分20秒

Unfortunately, ever since (Core 2?) there are virtually no performance differences between 32-bit and 64-bit integer operations except for multiply/divide - which aren't present in this code.

2018年09月26日36分20秒

Gene: Note that all versions store the size in a register and never read it from stack in the loop. Thus, address calculation cannot be in the mix, at least not inside the loop.

2018年09月26日36分20秒

Gene: Interesting explanation indeed! But it does not explain the main WTF points: That 64bit is slower than 32bit due to pipeline stalls is one thing. But if this is the case, shouldn't the 64bit version be reliably slower than the 32bit one? Instead, three different compilers emit slow code even for the 32bit version when using compile-time-constant buffer size; changing the buffer size to static again changes things completely. There was even a case on my colleague's machine (and in Calvin's answer) where the 64bit version is considerably faster! It seems to be absolutely unpredictable.

2018年09月26日36分20秒

Mysticial That's my point. There is no peak performance difference when there's zero contention for IU, bus time, etc. The reference clearly shows that. Contention makes everything different. Here's an example from the Intel Core literature: "One new technology included in the design is Macro-Ops Fusion, which combines two x86 instructions into a single micro-operation. For example, a common code sequence like a compare followed by a conditional jump would become a single micro-op. Unfortunately, this technology does not work in 64-bit mode." So we have a 2:1 ratio in execution speed.

2018年09月27日36分20秒

gexicide I see what you're saying, but you're inferring more than I meant. I'm saying the code that's running the fastest is keeping the pipeline and dispatch queues full. This condition is fragile. Minor changes like adding 32 bits to the total data flow and instruction reordering are enough to break it. In short, the OP assertion that fiddling and testing is the only way forward is correct.

2018年09月26日36分20秒

But still, your results are totally strange (first unsigned faster, then uint64_t faster) as unrolling does not fix the main problem of the false dependency.

2018年09月26日36分20秒

That was the first thing I did after I read the question: break the dependency chain. As it turned out, the performance difference does not change (on my computer at least: Intel Haswell with GCC 4.7.3).

2018年09月27日36分20秒
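
For readers following the "break the dependency chain" remark above, here is a minimal sketch of what that usually means at the source level: accumulate into several independent counters and combine them after the loop. The function name `popcount_sum` and the choice of four accumulators are assumptions made for illustration, not code from the thread, and `__builtin_popcountll` assumes GCC or Clang. Note that this only removes the dependency on a single accumulator; whether the generated `popcnt` instructions still share a destination register (the false dependency discussed here) remains up to the compiler.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: sums the set bits of n 64-bit words.
// Four independent accumulators let consecutive popcounts proceed
// without waiting on one running total.
std::uint64_t popcount_sum(const std::uint64_t* buffer, std::size_t n) {
    assert(buffer != nullptr || n == 0);
    std::uint64_t c0 = 0, c1 = 0, c2 = 0, c3 = 0;
    std::size_t i = 0;
    // Main loop: four words per iteration, one accumulator each.
    for (; i + 4 <= n; i += 4) {
        c0 += __builtin_popcountll(buffer[i]);
        c1 += __builtin_popcountll(buffer[i + 1]);
        c2 += __builtin_popcountll(buffer[i + 2]);
        c3 += __builtin_popcountll(buffer[i + 3]);
    }
    // Tail: leftover words when n is not a multiple of four.
    for (; i < n; ++i) {
        c0 += __builtin_popcountll(buffer[i]);
    }
    // The reduction happens once, outside the hot loop.
    return c0 + c1 + c2 + c3;
}
```

As the comment above reports, this rewrite alone did not change the timings on at least one Haswell machine, so it should be treated as something to measure rather than a guaranteed fix.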

BenVoigt: It is conformant to strict aliasing. void* and char* are the two types which may be aliased, as they are essentially considered "pointers into some chunk of memory"! Your idea concerning the data dependency removal is nice for optimization, but it does not answer the question. And, as NilsPipenbrinck says, it does not seem to change anything.

2018年09月27日36分20秒

gexicide: The strict aliasing rule is not symmetric. You can use char* to access a T[]. You cannot safely use a T* to access a char[], and your code appears to do the latter.

2018年09月26日36分20秒

BenVoigt: Then you could never safely malloc an array of anything, as malloc returns void* and you interpret it as T[]. And I am pretty sure that void* and char* have the same semantics concerning strict aliasing. However, I guess this is quite off-topic here :)

2018年09月27日36分20秒

Personally I think the right way is uint64_t* buffer = new uint64_t[size/8]; /* type is clearly uint64_t[] */ char* charbuffer=reinterpret_cast<char*>(buffer); /* aliasing a uint64_t[] with char* is safe */

2018年09月26日36分20秒
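
The allocation order suggested in the comment above can be sketched as follows: allocate the storage as `uint64_t[]` first, then view it through a character pointer for byte-wise writes. The helper names `make_word_buffer` and `fill_bytes` are hypothetical, and `unsigned char*` is used here in place of the `char*` from the comment; both are character types that may alias any object.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>

// Hypothetical helper: allocates size_bytes of storage whose actual
// type is uint64_t[], zero-initialized.
std::unique_ptr<std::uint64_t[]> make_word_buffer(std::size_t size_bytes) {
    assert(size_bytes % sizeof(std::uint64_t) == 0);
    return std::unique_ptr<std::uint64_t[]>(
        new std::uint64_t[size_bytes / sizeof(std::uint64_t)]());
}

// Hypothetical helper: writes bytes through a character pointer.
// Accessing a uint64_t[] via unsigned char* is the permitted direction
// of the strict aliasing rule.
void fill_bytes(std::uint64_t* words, std::size_t size_bytes,
                unsigned char value) {
    unsigned char* bytes = reinterpret_cast<unsigned char*>(words);
    for (std::size_t i = 0; i < size_bytes; ++i) {
        bytes[i] = value;
    }
}
```

The words can then be read back as `uint64_t` without any cast, which is the direction the popcount loop needs.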

It's just good luck that -funroll-loops happens to make code that doesn't bottleneck on a loop-carried dependency chain created by popcnt's false dep. Using an old compiler version that doesn't know about the false dependency is a risk. Without -funroll-loops, gcc 4.8.5's loop will bottleneck on popcnt latency instead of throughput, because it counts into rdx. The same code, compiled by gcc 4.9.3 adds an xor edx,edx to break the dependency chain.

2018年09月26日36分20秒

With old compilers, your code would still be vulnerable to exactly the same performance variation the OP experienced: seemingly-trivial changes could make gcc emit something slow because it had no idea it would cause a problem. Finding something that happens to work in one case on an old compiler is not the question.

2018年09月26日36分20秒

For the record, x86intrin.h's _mm_popcnt_* functions on GCC are forcibly inlined wrappers around the __builtin_popcount*; the inlining should make one exactly equivalent to the other. I highly doubt you'd see any difference that could be caused by switching between them.

2018年09月26日36分20秒
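
A small sketch of the equivalence described in the last comment, assuming GCC or Clang: one function calls `__builtin_popcountll` directly, the other goes through `_mm_popcnt_u64` when the build targets x86-64 with POPCNT enabled (for example via `-mpopcnt`) and falls back to the builtin otherwise. The function names and the preprocessor guard are illustrative choices, not taken from the thread.

```cpp
#include <cassert>
#include <cstdint>
#if defined(__POPCNT__) && defined(__x86_64__)
#include <x86intrin.h>  // provides _mm_popcnt_u64 when POPCNT is enabled
#endif

// Counts set bits via the compiler builtin.
int popcount_builtin(std::uint64_t x) {
    return __builtin_popcountll(x);
}

// Counts set bits via the intrinsic where available; otherwise this
// is just the builtin again, so the comparison below is trivially true.
int popcount_intrinsic(std::uint64_t x) {
#if defined(__POPCNT__) && defined(__x86_64__)
    return static_cast<int>(_mm_popcnt_u64(x));
#else
    return __builtin_popcountll(x);
#endif
}

// Hypothetical check: both paths must agree for the given value.
void check_same(std::uint64_t x) {
    assert(popcount_builtin(x) == popcount_intrinsic(x));
}
```

Comparing the assembly for the two functions under the same flags is a quick way to confirm on a given compiler that they compile to the same instruction.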