Friday, November 1, 2019

x86_64 - Assembly - loop conditions and out of order




I am not asking for a benchmark.



(If that was the case, I would have done it myself.)






My question:



I tend to avoid the indirect/index addressing modes for convenience.




As a replacement, I often use immediate, absolute or register addressing.



The code:



# %esi holds the array address. Say we iterate over a doubleword (4-byte) array.
# %ecx holds the array element count.
myloop:                 # (at address 0x98767)
    ...                 # do whatever with %esi
    add $4, %esi
    dec %ecx
    jnz myloop


Here, we have a serialized combo (dec and jnz) which prevents proper out-of-order execution because of the dependency between them.



Is there a way to avoid that / break the dep? (I am not an assembly expert).


Answer



When optimizing for Intel CPUs, always put the flag-setting instruction right before the conditional jump instruction (if it's one of the simple ones listed in the table below), so they can macro-fuse into one uop in the decoders.
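As a minimal sketch of that layout, the loop from the question can be written with dec placed directly before jnz. The C wrapper, the function name sum_dec_jnz, and the summing loop body are assumptions added only so the fragment can be compiled and run with GCC or Clang on x86-64; the compiler picks the registers instead of the question's %esi and %ecx, and other targets fall back to plain C.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: sums n doublewords starting at p. */
static int32_t sum_dec_jnz(const int32_t *p, size_t n)
{
    int32_t sum = 0;
    if (n == 0)
        return 0;                   /* the asm loop assumes at least one element */
#if defined(__x86_64__) && defined(__GNUC__)
    __asm__ volatile(
        "1:\n\t"
        "addl (%[p]), %[sum]\n\t"   /* loop body: use the current element */
        "add  $4, %[p]\n\t"         /* advance the pointer by one doubleword */
        "dec  %[n]\n\t"             /* flag-setting instruction ... */
        "jnz  1b\n\t"               /* ... immediately followed by the branch */
        : [sum] "+r"(sum), [p] "+r"(p), [n] "+r"(n)
        :
        : "cc", "memory");
#else
    while (n--)                     /* portable fallback for other targets */
        sum += *p++;
#endif
    return sum;
}
```

Nothing is scheduled between dec and jnz, so on CPUs that fuse this pair the decoders can treat it as a single uop. The branch is predicted, so the loop body of later iterations can still start before the counter update of the current one has retired.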



Doing this is not significantly worse for older CPUs that don't do macro-fusion. Putting the flag-setting instruction earlier might shorten the branch mispredict penalty by one for such CPUs, by letting a mispredict be detected sooner. I don't have benchmarks, but I don't think the small downside on increasingly rare CPUs justifies missing out on the front-end throughput benefit (decode and issue) for CPUs that do fusion. Total uop throughput can often be a bottleneck.




AMD Bulldozer/Piledriver/Steamroller can fuse test/cmp with any jcc, but only test/cmp, not any other ALU instructions. So definitely put compares with branches.
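If fusion on those AMD CPUs matters too, one option is to end the loop with a compare against an end pointer instead of decrementing a counter. The following is only a sketch under assumptions: the function name sum_cmp_jne, the C wrapper, and the summing loop body are hypothetical and exist so the fragment builds with GCC or Clang on x86-64 (other targets use the plain C fallback).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: sums n doublewords, looping until p reaches end. */
static int32_t sum_cmp_jne(const int32_t *p, size_t n)
{
    int32_t sum = 0;
    const int32_t *end = p + n;     /* one past the last element */
    if (n == 0)
        return 0;                   /* the asm loop assumes at least one element */
#if defined(__x86_64__) && defined(__GNUC__)
    __asm__ volatile(
        "1:\n\t"
        "addl (%[p]), %[sum]\n\t"   /* loop body: use the current element */
        "add  $4, %[p]\n\t"         /* advance the pointer by one doubleword */
        "cmp  %[end], %[p]\n\t"     /* AT&T order: sets flags from p - end */
        "jne  1b\n\t"               /* compare kept right before the branch */
        : [sum] "+r"(sum), [p] "+r"(p)
        : [end] "r"(end)
        : "cc", "memory");
#else
    while (p != end)                /* portable fallback for other targets */
        sum += *p++;
#endif
    return sum;
}
```

This costs one extra register for the end pointer but frees the counter register, and cmp followed by jne is a pair that both the Intel table and the AMD rule above allow to fuse.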



From Agner Fog's microarch guide, Table 9.2 (for Sandybridge / Ivybridge):



First instruction  | Can pair with these (and the inverse) | Cannot pair with
-------------------+---------------------------------------+-------------------------
cmp                | jz, jc, jb, ja, jl, jg                | js, jp, jo
add, sub           | jz, jc, jb, ja, jl, jg                | js, jp, jo
adc, sbb           | none                                  |
inc, dec           | jz, jl, jg                            | jc, jb, ja, js, jp, jo
test               | all                                   |
and                | all                                   |
or, xor, not, neg  | none                                  |
shift, rotate      | none                                  |

Table 9.2. Instruction fusion



So basically, inc/dec can macro-fuse with a jcc as long as the condition only depends on bits that are modified by inc/dec.



(Otherwise, they don't macro-fuse, and you get an extra uop inserted to merge the flags (like when you read eax after writing al). Or on earlier CPUs, a partial-flags stall.)



Core2 / Nehalem were more limited in macro-fusion capability (only CMP/TEST, with a smaller set of JCC combinations), and Core2 couldn't macro-fuse in 64-bit mode at all.



Read Agner Fog's optimizing asm and C guides, too, if you haven't already. They're full of essential knowledge.

