Friday, August 30, 2019

c# - Huge performance difference (26x faster) when compiling for 32 and 64 bits



I was trying to measure the difference of using a for and a foreach when accessing lists of value types and reference types.



I used the following class to do the profiling.




public static class Benchmarker
{
public static void Profile(string description, int iterations, Action func)
{
Console.Write(description);

// Warm up
func();

Stopwatch watch = new Stopwatch();


// Clean up
GC.Collect();
GC.WaitForPendingFinalizers();
GC.Collect();

watch.Start();
for (int i = 0; i < iterations; i++)
{
func();

}
watch.Stop();

Console.WriteLine(" average time: {0} ms", watch.Elapsed.TotalMilliseconds / iterations);
}
}


I used double for my value type.
And I created this 'fake class' to test reference types:




class DoubleWrapper
{
public double Value { get; set; }

public DoubleWrapper(double value)
{
Value = value;
}
}



Finally I ran this code and compared the time differences.



static void Main(string[] args)
{
int size = 1000000;
int iterationCount = 100;

var valueList = new List(size);

for (int i = 0; i < size; i++)
valueList.Add(i);

var refList = new List(size);
for (int i = 0; i < size; i++)
refList.Add(new DoubleWrapper(i));

double dummy;

Benchmarker.Profile("valueList for: ", iterationCount, () =>

{
double result = 0;
for (int i = 0; i < valueList.Count; i++)
{
unchecked
{
var temp = valueList[i];
result *= temp;
result += temp;
result /= temp;

result -= temp;
}
}
dummy = result;
});

Benchmarker.Profile("valueList foreach: ", iterationCount, () =>
{
double result = 0;
foreach (var v in valueList)

{
var temp = v;
result *= temp;
result += temp;
result /= temp;
result -= temp;
}
dummy = result;
});


Benchmarker.Profile("refList for: ", iterationCount, () =>
{
double result = 0;
for (int i = 0; i < refList.Count; i++)
{
unchecked
{
var temp = refList[i].Value;
result *= temp;
result += temp;

result /= temp;
result -= temp;
}
}
dummy = result;
});

Benchmarker.Profile("refList foreach: ", iterationCount, () =>
{
double result = 0;

foreach (var v in refList)
{
unchecked
{
var temp = v.Value;
result *= temp;
result += temp;
result /= temp;
result -= temp;
}

}

dummy = result;
});

SafeExit();
}


I selected Release and Any CPU options, ran the program and got the following times:




valueList for:  average time: 483,967938 ms
valueList foreach: average time: 477,873079 ms
refList for: average time: 490,524197 ms
refList foreach: average time: 485,659557 ms
Done!


Then I selected Release and x64 options, ran the program and got the following times:




valueList for:  average time: 16,720209 ms
valueList foreach: average time: 15,953483 ms
refList for: average time: 19,381077 ms
refList foreach: average time: 18,636781 ms
Done!


Why is x64 bit version so much faster? I expected some difference, but not something this big.



I do not have access to other computers. Could you please run this on your machines and tell me the results? I'm using Visual Studio 2015 and I have an Intel Core i7 930.




Here's the SafeExit() method, so you can compile/run by yourself:



private static void SafeExit()
{
Console.WriteLine("Done!");
Console.ReadLine();
System.Environment.Exit(1);
}



As requested, using double? instead of my DoubleWrapper:



Any CPU



valueList for:  average time: 482,98116 ms
valueList foreach: average time: 478,837701 ms
refList for: average time: 491,075915 ms
refList foreach: average time: 483,206072 ms
Done!



x64



valueList for:  average time: 16,393947 ms
valueList foreach: average time: 15,87007 ms
refList for: average time: 18,267736 ms
refList foreach: average time: 16,496038 ms
Done!



Last but not least: creating a x86 profile gives me almost the same results of using Any CPU.


Answer



I can reproduce this on 4.5.2. No RyuJIT here. Both x86 and x64 disassemblies look reasonable. Range checks and so on are the same. The same basic structure. No loop unrolling.



x86 uses a different set of float instructions. The performance of these instructions seems to be comparable with the x64 instructions except for the division:




  1. The 32 bit x87 float instructions use 10 byte precision internally.

  2. Extended precision division is super slow.




The division operation makes the 32 bit version extremely slow. Uncommenting the division equalizes performance to a large degree (32 bit down from 430ms to 3.25ms).



Peter Cordes points out that the instruction latencies of the two floating point units are not that dissimilar. Maybe some of the intermediate results are denormalized numbers or NaN. These might trigger a slow path in one of the units. Or, maybe the values diverge between the two implementations because of 10 byte vs. 8 byte float precision.



Peter Cordes also points out that all intermediate results are NaN... Removing this problem (valueList.Add(i + 1) so that no divisor is zero) mostly equalizes the results. Apparently, the 32 bit code does not like NaN operands at all. Let's print some intermediate values: if (i % 1000 == 0) Console.WriteLine(result);. This confirms that the data is now sane.



When benchmarking you need to benchmark a realistic workload. But who would have thought that an innocent division can mess up your benchmark?!




Try simply summing the numbers to get a better benchmark.



Division and modulo are always very slow. If you modify the BCL Dictionary code to simply not use the modulo operator to compute the bucket index performance measurable improves. This is how slow division is.



Here's the 32 bit code:



enter image description here



64 bit code (same structure, fast division):




enter image description here



This is not vectorized despite SSE instructions being used.


No comments:

Post a Comment

hard drive - Leaving bad sectors in unformatted partition?

Laptop was acting really weird, and copy and seek times were really slow, so I decided to scan the hard drive surface. I have a couple hundr...