Why is processing a sorted array faster than processing an unsorted array?

Here is a piece of C++ code that shows some very peculiar behavior. For some strange reason, sorting the data miraculously makes the code almost six times faster:

#include <algorithm>
#include <cstdlib>
#include <ctime>
#include <iostream>

int main()
{
    // Generate data
    const unsigned arraySize = 32768;
    int data[arraySize];

    for (unsigned c = 0; c < arraySize; ++c)
        data[c] = std::rand() % 256;

    // !!! With this, the next loop runs faster.
    std::sort(data, data + arraySize);

    // Test
    clock_t start = clock();
    long long sum = 0;

    for (unsigned i = 0; i < 100000; ++i)
    {
        // Primary loop
        for (unsigned c = 0; c < arraySize; ++c)
        {
            if (data[c] >= 128)
                sum += data[c];
        }
    }

    double elapsedTime = static_cast<double>(clock() - start) / CLOCKS_PER_SEC;

    std::cout << elapsedTime << std::endl;
    std::cout << "sum = " << sum << std::endl;
}
  • Without std::sort(data, data + arraySize);, the code runs in 11.54 seconds.
  • With the sorted data, the code runs in 1.93 seconds.

Initially, I thought this might be just a language or compiler anomaly, so I tried Java:

import java.util.Arrays;
import java.util.Random;

public class Main
{
    public static void main(String[] args)
    {
        // Generate data
        int arraySize = 32768;
        int data[] = new int[arraySize];

        Random rnd = new Random(0);
        for (int c = 0; c < arraySize; ++c)
            data[c] = rnd.nextInt() % 256;

        // !!! With this, the next loop runs faster
        Arrays.sort(data);

        // Test
        long start = System.nanoTime();
        long sum = 0;

        for (int i = 0; i < 100000; ++i)
        {
            // Primary loop
            for (int c = 0; c < arraySize; ++c)
            {
                if (data[c] >= 128)
                    sum += data[c];
            }
        }

        System.out.println((System.nanoTime() - start) / 1000000000.0);
        System.out.println("sum = " + sum);
    }
}

The result was similar, but less extreme.


My first thought was that sorting brings the data into the cache, but then I thought how silly that was because the array was just generated.

  • What is going on?
  • Why is processing a sorted array faster than processing an unsorted array?

The code is summing up some independent terms, so the order should not matter.


Just for the record: on Windows / VS2017 / i7-6700K 4 GHz there is NO difference between the two versions. It takes 0.6 s in both cases. If the number of iterations in the outer loop is increased 10 times, the execution time also increases 10 times, to 6 s in both cases.

user194715: any compiler that uses a cmov or other branchless implementation (like auto-vectorization with pcmpgtd) will have performance that's not data-dependent on any CPU. But if it's branchy, it will be sort-dependent on any CPU with out-of-order speculative execution. (Even high-performance in-order CPUs use branch prediction to avoid fetch/decode bubbles on taken branches; the miss penalty is smaller.)
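
To make that concrete, here is a minimal sketch (my own, not from the thread) of a source-level branchless formulation; it assumes two's-complement int and turns the comparison result (0 or 1) into an all-zeros/all-ones mask, so no data-dependent branch remains for the compiler to emit:

// Hypothetical helper, not part of the original benchmark: sum the elements
// >= 128 without a data-dependent branch. (data[c] >= 128) is 0 or 1; negating
// it gives 0 or -1 (all ones), which is then ANDed with the value.
long long sumBranchless(const int* data, unsigned n)
{
    long long sum = 0;
    for (unsigned c = 0; c < n; ++c)
        sum += data[c] & -(data[c] >= 128);
    return sum;
}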

Woops... re: Meltdown and Spectre

KyleMit, does it have something to do with both? I haven't read much on either.

mohitmun, both of those security flaws fit into a broad category of vulnerabilities classified as “branch target injection” attacks

phonetagger Take a look at this follow-up question: stackoverflow.com/questions/11276291/… The Intel Compiler came pretty close to completely getting rid of the outer loop.

Novelocrat Only half of that is correct. Shifting a 1 into the sign bit when it is zero is indeed UB, because that's signed integer overflow. But shifting a 1 out of the sign bit is IB. Right-shifting a negative signed integer is IB. You can make the argument that C/C++ doesn't require the top bit to be the sign indicator, but implementation details are IB.

Unheilig Using bitwise operations for anything other than legitimate bit manipulation or multiplying/dividing by a variable power of two is not something I usually recommend, since it's often obfuscating. Nevertheless, here's a good reference for bit-twiddling hacks: graphics.stanford.edu/~seander/bithacks.html

Mysticial Thanks so much for the link. It looks promising. I will go through it. One last request. Sorry, but please don't mind: could you tell me how you could do this int t = (data[c] - 128) >> 31; sum += ~t & data[c]; to replace the original if-condition above?
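
For reference, a sketch of how that trick slots into the hot loop (the comments are mine; as the IB discussion above notes, it relies on arithmetic right shift of negative signed integers, which mainstream compilers provide):

for (unsigned c = 0; c < arraySize; ++c)
{
    // data[c] - 128 is negative exactly when data[c] < 128, so shifting by 31
    // smears the sign bit across t: t == -1 (all ones) below 128, else 0.
    int t = (data[c] - 128) >> 31;
    // ~t is all ones when data[c] >= 128 and zero otherwise, so the AND
    // selects either data[c] or 0 without any branch.
    sum += ~t & data[c];
}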

The grammarian in me thinks this should read "... victim of branch prediction failure" rather than just "... victim of branch prediction fail."

Does branch prediction work better on sorted arrays than on arrays with other patterns? For example, for the array { 10, 5, 20, 10, 40, 20, ... } the next element suggested by the pattern is 80. Would this kind of array be sped up by branch prediction, where the next element here is 80 if the pattern is followed? Or does it usually only help with sorted arrays?

So basically everything I conventionally learned about big-O is out of the window? Better to incur a sorting cost than a branching cost?

AgrimPathak That depends. For not-too-large input, an algorithm with higher complexity can be faster than an algorithm with lower complexity when its constants are smaller. Where the break-even point lies can be hard to predict. Also, compare this: locality is important. Big-O is important, but it is not the sole criterion for performance.

When does branch prediction take place? When will the language know that the array is sorted? I'm thinking of an array that looks like [1, 2, 3, 4, 5, ..., 998, 999, 1000, 3, 10001, 10002]. Will this obscure 3 increase the running time? Will it be as long as for an unsorted array?

FilipBartuzi Branch prediction takes place in the processor, below the language level (but the language may offer ways to tell the compiler what's likely, so the compiler can emit code suited to that). In your example, the out-of-order 3 will lead to a branch misprediction (for appropriate conditions, where 3 gives a different result than 1000), and thus processing that array will likely take a couple dozen or a hundred nanoseconds longer than a sorted array would - hardly ever noticeable. What costs time is a high rate of mispredictions; one misprediction per 1000 isn't much.

BlueRaja-DannyPflughoeft This is the un-optimized version. The compiler did NOT optimize the ternary operator; it just TRANSLATED it. GCC can optimize the if-then if given a sufficient optimization level; nevertheless, this one shows the power of the conditional move, and manual optimization makes a difference.

WiSaGaN The code demonstrates nothing, because your two pieces of code compile to the same machine code. It's critically important that people don't get the idea that the if statement in your example is somehow different from the ternary in your example. It's true that you own up to the similarity in your last paragraph, but that doesn't erase the fact that the rest of the example is harmful.

WiSaGaN My downvote would definitely turn into an upvote if you modified your answer to remove the misleading -O0 example and to show the difference in optimized asm on your two testcases.

UpAndAdam At the moment of the test, VS2010 can't optimize the original branch into a conditional move even when specifying a high optimization level, while gcc can.

This ternary-operator trick works beautifully for Java. After reading Mysticial's answer, I was wondering what could be done in Java to avoid false branch prediction, since Java doesn't have anything equivalent to -O3. Ternary operator: 2.1943 s; original: 6.0303 s.

If you want to cheat, you might as well take the multiplication outside the loop and do sum*=100000 after the loop.

Michael - I believe that this is actually an example of loop-invariant hoisting (LIH) optimization, and NOT loop swap. In this case, the entire inner loop is independent of the outer loop and can therefore be hoisted out of it, whereupon the result is simply multiplied by the number of outer iterations, 1e5. It makes no difference to the end result, but I just wanted to set the record straight since this is such a frequented page.

Although not in the simple spirit of swapping loops, the inner if at this point could be converted to: sum += (data[j] >= 128) ? data[j] * 100000 : 0; which the compiler may be able to reduce to cmovge or equivalent.
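
Spelled out, that suggestion collapses the nested loops into something like the following sketch (mine, assuming the 100000 outer iterations are simply folded into a per-element factor):

long long sum = 0;
for (unsigned j = 0; j < arraySize; ++j)
    // Each qualifying element contributes 100000 times in the original nested
    // loops, so the factor can be applied directly per element.
    sum += (data[j] >= 128) ? data[j] * 100000LL : 0;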

The outer loop is there to make the time taken by the inner loop large enough to profile. So why would you loop-swap? In the end, that loop will be removed anyway.

saurabheights: Wrong question: why would the compiler NOT loop swap? Microbenchmarking is hard ;)

This is scary: in the unsorted list there should be a 50% chance of hitting the add. Somehow the branch prediction only has a 25% miss rate - how can it do better than a 50% miss rate?

tall.b.lo: The 25% is of all branches - there are two branches in the loop, one for data[c] >= 128 (which has a 50% miss rate as you suggest) and one for the loop condition c < arraySize, which has a ~0% miss rate.

You want to bypass the branch predictor, why? It's an optimization.

Because no branch is better than a branch :-) In a lot of situations this is simply a lot faster... if you're optimizing, it's definitely worth a try. They also use it quite a bit in, for example, graphics.stanford.edu/~seander/bithacks.html

In general lookup tables can be fast, but have you run the tests for this particular condition? You'll still have a branch condition in your code, only now it's moved to the lookup-table generation part. You still wouldn't get your perf boost.

Zain, if you really want to know... Yes: 15 seconds with the branch and 10 with my version. Regardless, it's a useful technique to know either way.

Why not sum += lookup[data[j]] where lookup is an array with 256 entries, the first ones being zero and the last ones being equal to the index?
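
A minimal sketch of that lookup-table variant (the table-building loop is my own addition; the name lookup is the commenter's):

int lookup[256];
for (int v = 0; v < 256; ++v)
    lookup[v] = (v >= 128) ? v : 0;   // the only comparison lives here, outside the hot loop

long long sum = 0;
for (unsigned j = 0; j < arraySize; ++j)
    sum += lookup[data[j]];           // no data-dependent branch per element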

MooingDuck 'Cause it won't make a difference - that value can be anything, but it will still be within these thresholds. So why show a random value when you already know the limits? Although I agree that you could show one for the sake of completeness, and 'just for the heck of it'.

cst1992: Right now his slowest timing is TTFFTTFFTTFF, which seems, to my human eye, quite predictable. Random is inherently unpredictable, so it's entirely possible it would be slower still, and thus outside the limits shown here. OTOH, it could be that TTFFTTFF perfectly hits the pathological case. Can't tell, since he didn't show the timings for random.

MooingDuck To a human eye, "TTFFTTFFTTFF" is a predictable sequence, but what we are talking about here is the behavior of the branch predictor built into a CPU. The branch predictor is not AI-level pattern recognition; it's very simple. When you just alternate branches, it doesn't predict well. In most code, branches go the same way almost all the time; consider a loop that executes a thousand times. The branch at the end of the loop goes back to the start of the loop 999 times, and then the thousandth time does something different. A very simple branch predictor works well, usually.

steveha: I think you're making assumptions about how the CPU branch predictor works, and I disagree with that methodology. I don't know how advanced that branch predictor is, but I seem to think it's far more advanced than you do. You're probably right, but measurements would definitely be good.

steveha: The two-level adaptive predictor could lock onto the TTFFTTFF pattern with no issue whatsoever. "Variants of this prediction method are used in most modern microprocessors." Local branch prediction and global branch prediction are based on a two-level adaptive predictor, so they can as well. "Global branch prediction is used in AMD processors, and in Intel Pentium M, Core, Core 2, and Silvermont-based Atom processors." Also add the agree predictor, hybrid predictor, and prediction of indirect jumps to that list. The loop predictor won't lock on, but hits 75%. That leaves only two that can't lock on.

Right, you can also just use the bit directly and multiply (data[c]>>7 - which is discussed somewhere here as well); I intentionally left this solution out, but of course you are correct. Just a small note: the rule of thumb for lookup tables is that if it fits in 4 KB (because of caching), it'll work - preferably make the table as small as possible. For managed languages I'd push that to 64 KB; for low-level languages like C++ and C, I'd probably reconsider (that's just my experience). Since typeof(int) = 4, I'd try to stick to at most 10 bits.

I think indexing with the 0/1 value will probably be faster than an integer multiply, but I guess if performance is really critical you should profile it. I agree that small lookup tables are essential to avoid cache pressure, but clearly if you have a bigger cache you can get away with a bigger lookup table, so 4KB is more a rule of thumb than a hard rule. I think you meant sizeof(int) == 4? That would be true for 32-bit. My two-year-old cell phone has a 32KB L1 cache, so even a 4K lookup table might work, especially if the lookup values were a byte instead of an int.

Possibly I'm missing something, but in your j-equals-0-or-1 method, why don't you just multiply your value by j before adding it, rather than using the array indexing (possibly it should be multiplied by 1-j rather than j)?

steveha Multiplication should be faster; I tried looking it up in the Intel books, but couldn't find it... either way, benchmarking also gives me that result here.

steveha P.S.: another possible answer would be int c = data[j]; sum += c & -(c >> 7); which requires no multiplications at all.
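
As a sketch (mine), with the caveat that it only works because the values are known to lie in 0..255, so bit 7 alone decides the comparison with 128:

for (unsigned j = 0; j < arraySize; ++j)
{
    int c = data[j];
    // c >> 7 is 1 iff c >= 128 (for c in 0..255); negating gives an
    // all-ones mask in that case and zero otherwise.
    sum += c & -(c >> 7);
}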

sum = 3137536 - clever. That's kinda obviously not the point of the question. The question is clearly about explaining surprising performance characteristics. I'm inclined to say that the addition of doing std::partition instead of std::sort is valuable. Though the actual question extends to more than just the synthetic benchmark given.

DeadMG: this is indeed not the standard dichotomic search for a given key, but a search for the partitioning index; it requires a single compare per iteration. But don't rely on this code; I have not checked it. If you are interested in a guaranteed correct implementation, let me know.

How are two instructions executed together? Is this done with separate CPU cores, or is the instruction pipeline integrated into a single CPU core?

M.kazemAkhgary It's all inside one logical core. If you're interested, this is nicely described, for example, in the Intel Software Developer Manual.

That is a very interesting article (in fact, I have just read all of it), but how does it answer the question?

PeterMortensen I am a bit flummoxed by your question. For example, here is one relevant line from that piece: "When the input is unsorted, all the rest of the loop takes substantial time. But with sorted input, the processor is somehow able to spend not just less time in the body of the loop, meaning the buckets at offsets 0x18 and 0x1C, but vanishingly little time on the mechanism of looping." The author is trying to discuss profiling in the context of the code posted here, and in the process trying to explain why the sorted case is so much faster.

Almost as good as the Intel marketing animations, and they were obsessed not just with branch prediction but with out-of-order execution, both strategies being "speculative". Reading ahead in memory and storage (sequential pre-fetch to buffer) is also speculative. It all adds up.

mckenzm: out-of-order speculative execution makes branch prediction even more valuable; as well as hiding fetch/decode bubbles, branch prediction + speculative execution removes control dependencies from critical-path latency. Code inside or after an if() block can execute before the branch condition is known. Or for a search loop like strlen or memchr, iterations can overlap. If you had to wait for the match-or-not result to be known before running any of the next iteration, you'd bottleneck on cache load + ALU latency instead of throughput.

"While flushing pipelines is super fast" - not really. It's fast compared to a cache miss all the way to DRAM, but on a modern high-performance x86 (like Intel Sandybridge-family) it's about a dozen cycles. Although fast recovery does allow it to avoid waiting for all older independent instructions to reach retirement before starting recovery, you still lose a lot of front-end cycles on a mispredict. (See: What exactly happens when a Skylake CPU mispredicts a branch?) And each cycle can be about 4 instructions of work. Bad for high-throughput code.

Are you saying that every instruction can be conditional? So, multiple instructions with the GE suffix could be performed sequentially, without changing the value of R3 in between?

Yes, correct, every instruction can be conditional on ARM, at least in the 32 and 64 bit instruction sets. There&#39;s a devoted 4-bit condition field. You can have several instructions in a row with the same condition, but at some point, if the chance of the condition being false is non-negligible, then it is more efficient to add a branch.

The other innovation in ARM is the addition of the S instruction suffix, also optional on (almost) all instructions, which, if absent, prevents instructions from changing the status bits (with the exception of the CMP instruction, whose job is to set status bits, so it doesn't need the S suffix). This allows you to avoid CMP instructions in many cases, as long as the comparison is with zero or similar (e.g. SUBS R0, R0, #1 will set the Z (zero) bit when R0 reaches zero). Conditionals and the S suffix incur zero overhead. It's quite a beautiful ISA.

Not adding the S suffix allows you to have several conditional instructions in a row without worrying that one of them might change the status bits, which might otherwise have the side effect of skipping the rest of the conditional instructions.

(ARM predicated execution truly NOPs the instruction, so you can even use it on loads or stores that would fault, unlike x86 cmov with a memory source operand. Most ISAs, including AArch64, only have ALU select operations. So ARM predication can be powerful, and usable more efficiently than branchless code on most ISAs.)

Right, but the setup cost of sorting the array is O(N log N), so breaking early doesn&#39;t help you if the only reason you are sorting the array is to be able to break early. If, however, you have other reasons to pre-sort the array, then yes, this is valuable.

Depends how many times you sort the data compared to how many times you loop over it. The sort in this example is just an example; it doesn't have to be just before the loop.

Yes, that's exactly the point I made in my first comment :-) You say "The branch prediction will miss only once." But you are not counting the O(N log N) branch prediction misses inside the sort algorithm, which is actually greater than the O(N) branch prediction misses in the unsorted case. So you would need to use the entirety of the sorted data O(log N) times to break even (probably actually closer to O(10 log N), depending on the sort algorithm, e.g. for quicksort, due to cache misses - mergesort is more cache-coherent, so you would need closer to O(2 log N) usages to break even).

One significant optimization, though, would be to do only "half a quicksort", sorting only items less than the target pivot value of 127 (assuming everything less than or equal to the pivot is sorted after the pivot). Once you reach the pivot, sum the elements before the pivot. This would run in O(N) startup time rather than O(N log N), although there will still be a lot of branch prediction misses, probably on the order of O(5 N) based on the numbers I gave before, since it's half a quicksort.
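
A related sketch (my own, using std::partition rather than the "half a quicksort" described above, and assuming C++11 for the lambda): a single O(N) partitioning pass around 128 leaves all qualifying elements in one contiguous range, so the hot loop needs no per-element condition at all.

#include <algorithm>

// Partition instead of fully sorting: elements >= 128 are moved to the front,
// and mid points one past the last of them.
int* mid = std::partition(data, data + arraySize,
                          [](int x) { return x >= 128; });

long long sum = 0;
for (unsigned i = 0; i < 100000; ++i)
    for (int* p = data; p != mid; ++p)   // no branch on the element's value
        sum += *p;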

Just for the sake of completeness, this is probably not how you'd implement this in Matlab. I bet it'd be much faster if done after vectorizing the problem.

Matlab does automatic parallelization/vectorization in many situations, but the issue here is to check the effect of branch prediction. Matlab is not immune in any way!

Does Matlab use native numbers or a Matlab-specific implementation (infinite number of digits or so)?

I don't really see how this proves anything? The only thing you have shown is that "not doing all the work of sorting the whole array takes less time than sorting the whole array". Your claim that this "also runs fastest" is very architecture-dependent. See my answer about how this works on ARM. PS: you could make your code faster on non-ARM architectures by putting the summation inside the 200-element block loop, sorting in reverse, and then using Yochai Timmer's suggestion of breaking once you get an out-of-range value. That way each 200-element block summation can be terminated early.

If you just want to implement the algorithm efficiently over unsorted data, you would do that operation branchlessly (and with SIMD, e.g. with x86 pcmpgtb to find elements with their high bit set, then AND to zero smaller elements). Spending any time actually sorting chunks would be slower. A branchless version would have data-independent performance, also proving that the cost came from branch misprediction. Or just use performance counters to observe that directly, like Skylake int_misc.clear_resteer_cycles or int_misc.recovery_cycles to count front-end idle cycles from mispredicts.
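
A sketch of that branchless SIMD approach (mine, not from the thread; it uses SSE2's pcmpgtd via _mm_cmpgt_epi32 since the data here are ints rather than bytes, assumes n is a multiple of 4, and relies on a single pass over 32768 byte-sized values staying well within 32-bit partial sums):

#include <emmintrin.h>   // SSE2 intrinsics

long long simdSum(const int* data, unsigned n)
{
    __m128i acc        = _mm_setzero_si128();
    const __m128i t127 = _mm_set1_epi32(127);
    for (unsigned c = 0; c < n; c += 4)
    {
        __m128i v    = _mm_loadu_si128(reinterpret_cast<const __m128i*>(data + c));
        __m128i mask = _mm_cmpgt_epi32(v, t127);        // all ones where v >= 128
        acc          = _mm_add_epi32(acc, _mm_and_si128(v, mask));
    }
    int lane[4];
    _mm_storeu_si128(reinterpret_cast<__m128i*>(lane), acc);
    return static_cast<long long>(lane[0]) + lane[1] + lane[2] + lane[3];
}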

The instructions stay hot in the CPU's L1 instruction cache regardless of mispredicts. The problem is fetching them into the pipeline in the right order, before the immediately preceding instructions have decoded and finished executing.

What C++ compiler / hardware did you test this with, and with what compiler options? I'm surprised the original version didn't auto-vectorize into nice branchless SIMD code. Did you enable full optimization?

A 4096-entry lookup table sounds insane. If you shift out any bits, you can't just use the LUT result if you want to add the original number. These all sound like silly tricks to work around your compiler not easily using branchless techniques. More straightforward would be mask = tmp < 128 ? 0 : -1UL; / total += tmp & mask;