
decimal performance regression on x86 for netcore 3 #12286

Open
Daniel-Svensson opened this issue Mar 18, 2019 · 12 comments
Labels: area-System.Numerics, backlog-cleanup-candidate, no-recent-activity, tenet-performance
Milestone: Future
Comments

@Daniel-Svensson
Contributor

It seems performance for decimal arithmetic has decreased significantly (~50%–80% slower in simple cases and more in others) since the move to the managed implementation (I believe the relevant PR is dotnet/coreclr#18948).
x64 performance, on the other hand, seems to have improved almost as much as with the improved code proposed some time back.

The benchmark is available at https://github.com/Daniel-Svensson/ClrExperiments/tree/master/ClrDecimal/Benchmarks/Scenarios/InterestBenchmark.cs

The first approach, with too-large numbers, is available at https://github.com/Daniel-Svensson/ClrExperiments/blob/978343a887c305de931a2caef442af89297a7362/ClrDecimal/Benchmarks/Scenarios/InterestBenchmark.cs
with results here.

It is a simplified version of part of real-world code from a financial application.
In essence it just increases the interest on an account based on the current interest rate, similar to the code below, for 100,000 items.

var dayFactor = 1m / array[i].DaysInYear;

array[i].NewInterest = array[i].CurrentInterest + array[i].Amount * (array[i].CurrentInterestRate * dayFactor );
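For reference, here is a minimal, self-contained sketch of the hot loop; the `Account` struct and its field layout are assumptions for illustration, the real benchmark is in the repository linked above.

```csharp
// Sketch only: the actual benchmark (linked above) uses BenchmarkDotNet and its own types.
struct Account
{
    public decimal Amount;
    public decimal CurrentInterest;
    public decimal CurrentInterestRate;
    public decimal NewInterest;
    public int DaysInYear;
}

static void AccrueInterest(Account[] array)
{
    for (int i = 0; i < array.Length; i++)
    {
        // 1m / 365 (or 366) is a full-precision decimal, so the following
        // multiply and add operate on operands that use all 96 bits.
        var dayFactor = 1m / array[i].DaysInYear;
        array[i].NewInterest = array[i].CurrentInterest
            + array[i].Amount * (array[i].CurrentInterestRate * dayFactor);
    }
}
```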

The actual workload performs valuation of different financial instruments (loans, etc.) and needs to use
the forecasted interest rate (which in turn is interpolated between different values, so the forecasted/expected interest rate will not be rounded).
The actual workload performs many more operations on the result of the "interest" calculation, all of which
execute at "full precision" in the large majority of cases.

Measurements

Results for different runtimes can be found under https://github.com/Daniel-Svensson/ClrExperiments/tree/master/ClrDecimal/Benchmarks/results

Performance in netcore 3

Updated 2019-06-13

BenchmarkDotNet=v0.11.5, OS=Windows 10.0.17763.379 (1809/October2018Update/Redstone5)
Intel Core i5-2500K CPU 3.30GHz (Sandy Bridge), 1 CPU, 4 logical and 4 physical cores
Frequency=14318180 Hz, Resolution=69.8413 ns, Timer=HPET
.NET Core SDK=3.0.100-preview3-010431
  [Host] : .NET Core ? (CoreCLR 4.6.27422.72, CoreFX 4.7.19.12807), 32bit RyuJIT

Job=ShortRun  Toolchain=InProcessToolchain  IterationCount=3  
LaunchCount=1  WarmupCount=3  
| Method | RoundedAmounts | SmallDivisor | Count | Mean | Error | StdDev |
|----------------------|-------|-------|--------|----------|------------|-----------|
| System.Decimal | False | False | 100000 | 50.42 ms | 11.8249 ms | 0.6482 ms |
| 'P/Invoke New C++' | False | False | 100000 | 33.27 ms | 8.6233 ms | 0.4727 ms |
| 'P/Invoke oleauto32' | False | False | 100000 | 30.25 ms | 0.9500 ms | 0.0521 ms |
| System.Decimal | False | True | 100000 | 46.80 ms | 13.5972 ms | 0.7453 ms |
| 'P/Invoke New C++' | False | True | 100000 | 30.32 ms | 0.1360 ms | 0.0075 ms |
| 'P/Invoke oleauto32' | False | True | 100000 | 28.78 ms | 1.9610 ms | 0.1075 ms |
| System.Decimal | True | False | 100000 | 37.30 ms | 10.4275 ms | 0.5716 ms |
| 'P/Invoke New C++' | True | False | 100000 | 29.47 ms | 0.1789 ms | 0.0098 ms |
| 'P/Invoke oleauto32' | True | False | 100000 | 26.36 ms | 6.1196 ms | 0.3354 ms |
| System.Decimal | True | True | 100000 | 32.52 ms | 5.5799 ms | 0.3059 ms |
| 'P/Invoke New C++' | True | True | 100000 | 26.54 ms | 0.1219 ms | 0.0067 ms |
| 'P/Invoke oleauto32' | True | True | 100000 | 24.07 ms | 0.0689 ms | 0.0038 ms |
  • PInvokeDummy – just a p/invoke call to a method that does nothing
  • Oleauto32 – p/invoke to oleauto32.dll (the code used previously via an internal fast call)
  • "New" – the code proposed in https://github.com/dotnet/coreclr/issues/10642
    The code doing the p/invoke calls is not "optimized"; for example, throwing code has not been extracted to a separate method, so the methods are not currently suitable for inlining.

Net core 2.2

Important: these results are from the first version of the benchmark, so the scale is different.

Performance in netcore 2.2 is very similar to .NET Framework.

BenchmarkDotNet=v0.11.4, OS=Windows 10.0.17763.379 (1809/October2018Update/Redstone5)
Intel Core i5-2500K CPU 3.30GHz (Sandy Bridge), 1 CPU, 4 logical and 4 physical cores
Frequency=14318180 Hz, Resolution=69.8413 ns, Timer=HPET
.NET Core SDK=3.0.100-preview3-010431
  [Host] : .NET Core ? (CoreCLR 4.6.27414.05, CoreFX 4.6.27414.05), 32bit RyuJIT

Job=MediumRun  Toolchain=InProcessToolchain  IterationCount=15  
LaunchCount=2  WarmupCount=10  
| Method | RoundedAmounts | SmallDivisor | Count | Mean | Error | StdDev | Median |
|----------------|-------|-------|--------|-----------|-----------|-----------|-----------|
| System.Decimal | False | False | 100000 | 38.041 ms | 0.2255 ms | 0.3305 ms | 38.072 ms |
| Oleauto32 | False | False | 100000 | 30.787 ms | 0.1140 ms | 0.1635 ms | 30.746 ms |
| System.Decimal | False | True | 100000 | 35.373 ms | 0.2204 ms | 0.3090 ms | 35.327 ms |
| Oleauto32 | False | True | 100000 | 29.285 ms | 0.2590 ms | 0.3631 ms | 29.084 ms |
| System.Decimal | True | False | 100000 | 34.526 ms | 0.1504 ms | 0.2205 ms | 34.426 ms |
| System.Decimal | True | True | 100000 | 31.612 ms | 0.2084 ms | 0.3119 ms | 31.511 ms |
| New | True | True | 100000 | 29.316 ms | 0.1713 ms | 0.2286 ms | 29.147 ms |
@Daniel-Svensson
Contributor Author

@jkotas, @pentp – you were part of the discussion in #7778.

As I tried to communicate there, I would be happy to port my performance improvements to the C++ code (with up to over 300% improvement on x64 and over 100% improvement on x86 worst-case performance) over to C#, but I would need some intrinsic support for the following (see the sketch after this list):

  • div & mul (https://github.com/dotnet/corefx/issues/32075)
  • bitscan (I think this may already be in place)
  • add/sub with carry (I have not seen any issue for it, but the non-intrinsic C++ code in my branch is faster than the old if/else code that was in coreclr)
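For concreteness, here is a rough sketch of the kind of helpers referred to above, using APIs that exist in later runtimes (`Math.BigMul(ulong, ulong, out ulong)` is .NET 5+, `BitOperations.LeadingZeroCount` is .NET Core 3.0+); the add-with-carry helper is hand-rolled here since no dedicated intrinsic is exposed:

```csharp
using System;
using System.Numerics;

static class DecimalMathHelpers
{
    // 64x64 -> 128-bit multiply: returns the high 64 bits, low part via out.
    // (Math.BigMul with this signature shipped in .NET 5.)
    public static ulong Mul64To128(ulong a, ulong b, out ulong low)
        => Math.BigMul(a, b, out low);

    // "Bitscan": leading-zero count, a hardware intrinsic since .NET Core 3.0.
    public static int LeadingZeros(ulong value)
        => BitOperations.LeadingZeroCount(value);

    // Add-with-carry sketch; written by hand because no intrinsic is exposed.
    public static ulong AddWithCarry(ulong a, ulong b, ref uint carry)
    {
        ulong sum = a + b + carry;
        carry = (sum < a || (carry == 1 && sum == a)) ? 1u : 0u;
        return sum;
    }
}
```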

@pentp
Contributor

pentp commented Mar 18, 2019

Is the NetFramework result the one for .NET Core?

The test data looks a bit contrived (interest calculations on 100 trillion, maybe in Venezuela only?). I focused the optimizations on more realistic numbers (weighted towards smaller numbers with fewer decimal places).

The setup method Random does not return a value between min/max (it should return rnd * range + min).
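For illustration, a minimal sketch of the suggested fix; the helper name and parameters are assumptions, the benchmark's actual setup code is in the repository linked in the issue description:

```csharp
using System;

static class BenchmarkSetup
{
    // Hypothetical helper: returns a pseudo-random decimal in [min, max).
    // The point of the fix is the formula rnd * range + min.
    public static decimal NextDecimal(Random rnd, decimal min, decimal max)
    {
        decimal range = max - min;
        return (decimal)rnd.NextDouble() * range + min;
    }
}
```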

@Daniel-Svensson
Contributor Author

> Is the NetFramework result the one for .NET Core?

Yes, I will try to change it to System.Decimal when I'm at a computer.

> The test data looks a bit contrived (interest calculation for 100 trillion, maybe in Venezuela only?). I focused optimizations on more realistic numbers (weighted towards smaller numbers with less decimal places).

The input can probably be divided by 100 for a slightly more realistic case; millions and sometimes billions are not uncommon for a large corporation (non-USD/EUR currencies).
But if I remember correctly it will not make any large difference, since the day factor produces "full precision" numbers and the example is really just the starting point for many other calculations.

> The setup method Random does not return a value between min/max (it should return rnd * range + min).

Thanks for the input, I will look into that at a later point.

@danmoseley
Member

@tannergooding can you take a look at this problem and share your thoughts?

@pentp
Contributor

pentp commented May 16, 2019

I think the benchmark needs some adjustments:

  • The setup method Random does not return a value between min/max (as mentioned above)
  • The non-rounded cases (raw data from double) do not make sense for financial calculations
  • The input data is unrealistic (amounts in trillions, interest in billions)
  • If the performance of this algorithm actually mattered, then dayFactor should not be calculated at all: NewInterest = CurrentInterest + Amount * (CurrentInterestRate / DaysInYear);

Currently this is testing the most expensive paths for decimal – all 96 bits / 28 decimal places used, plus very high scaling (20+).

@danmoseley
Member

@tannergooding thoughts? Is this needed for 3.0?

@tannergooding
Member

I would agree with @pentp. The benchmark doesn't appear to be representative of likely real-world scenarios (the range of the inputs/interest rates is too large), and I would like to see it updated and new numbers posted.

I don't think there is anything particularly pressing here for 3.0, although it would be good to validate that we haven't egregiously regressed the perf for x86.

@danmoseley
Member

@tannergooding wrt regressions, the key thing is to make sure there are appropriate microbenchmarks in the dotnet/performance repo. If so, we will automatically catch any regressions in our next pass, even if the benchmark didn't exist in the last release timeframe. (And perhaps this issue can be closed.)

@pentp
Contributor

pentp commented Jun 12, 2019

The change to managed decimal was done with careful performance measurements and some regressions for 32-bit were deemed acceptable (the unmanaged implementation was optimized for 32-bit). For 64-bit there are only a couple of odd corner cases that are slower, but real world use cases should actually see significant improvement.

@Daniel-Svensson
Contributor Author

I did some updates to the benchmarks some time ago, but it seems my update post did not make it here. I have updated the description with the results from the benchmarks with much smaller numbers.

  • There is still an "off by one" error for random, so it will never use 366 days, but apart from that I think I did make most of the suggested updates.
  • I also did some tests with much smaller numbers, so the amounts were on the order of 200, but there were no major changes in the relative timings.

The important thing about this simplified example is that it is far from the worst-case scenario,
but it still triggers "bad" performance.

As soon as you start dividing numbers or take input from doubles you will very quickly get
"full precision" numbers and performance regressions.
There will certainly be many types of workloads where full-precision numbers are the majority and not the exception:

  • Just perform operations where one of the operands is 1/3, 1/365 or similar and you will fall into
    the category of bad (not always worst-case) performance, as the sketch below shows.
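A quick sketch of the effect (the values in the comments are approximate and only illustrate that a single division immediately produces operands that use all 28 fractional digits):

```csharp
using System;

decimal dayFactor = 1m / 365m;   // ~0.0027397260273972602739726027 (scale 28)
decimal third     = 1m / 3m;     // 0.3333333333333333333333333333
decimal interest  = 1_000_000m * 0.025m * dayFactor; // full-precision intermediates from here on

Console.WriteLine(dayFactor);
Console.WriteLine(third);
Console.WriteLine(interest);
```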

Since I believe that worst/bad-case performance also matters, I really think you should measure those cases as well and not assume that almost all input is limited to 32-bit precision.

@msftgits msftgits transferred this issue from dotnet/coreclr Jan 31, 2020
@msftgits msftgits added this to the Future milestone Jan 31, 2020
@maryamariyan maryamariyan added the untriaged New issue has not been triaged by the area owner label Feb 26, 2020
@tannergooding tannergooding added area-System.Numerics and removed area-System.Runtime untriaged New issue has not been triaged by the area owner labels Jun 23, 2020
@ghost

ghost commented Jun 23, 2020

Tagging subscribers to this area: @tannergooding
Notify danmosemsft if you want to be subscribed.

@dotnet-policy-service

Due to lack of recent activity, this issue has been marked as a candidate for backlog cleanup. It will be closed if no further activity occurs within 14 more days. Any new comment (by anyone, not necessarily the author) will undo this process.

This process is part of our issue cleanup automation.

@dotnet-policy-service dotnet-policy-service bot added backlog-cleanup-candidate An inactive issue that has been marked for automated closure. no-recent-activity labels Dec 24, 2024