
decimal performance regression on x86 for netcore 3 #12286

Open
Daniel-Svensson opened this issue Mar 18, 2019 · 12 comments
Labels: area-System.Numerics, backlog-cleanup-candidate, no-recent-activity, tenet-performance
Milestone: Future
Comments

@Daniel-Svensson
Contributor

It seems performance for decimal arithmetic has decreased significantly (~50%–80% slower in simple cases and more in others) since the move to the managed implementation (I believe the relevant PR is dotnet/coreclr#18948).
x64 performance, on the other hand, seems to have improved almost as much as with the improved code proposed some time back.

The benchmark is available at https://github.com/Daniel-Svensson/ClrExperiments/tree/master/ClrDecimal/Benchmarks/Scenarios/InterestBenchmark.cs

The first approach, with too-large numbers, is available at https://github.com/Daniel-Svensson/ClrExperiments/blob/978343a887c305de931a2caef442af89297a7362/ClrDecimal/Benchmarks/Scenarios/InterestBenchmark.cs
with results here.

It is a simplified version of part of real-world code from a financial application.
In essence it just increases the interest on an account based on the current interest rate, similar to the code below, for 100,000 items.

var dayFactor = 1m / array[i].DaysInYear;

array[i].NewInterest = array[i].CurrentInterest + array[i].Amount * (array[i].CurrentInterestRate * dayFactor );
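For reference, here is a minimal, self-contained sketch of the hot loop; the `Account` struct and its field layout are assumptions for illustration, the real benchmark is in the repository linked above.

```csharp
// Sketch only: the actual benchmark (linked above) uses BenchmarkDotNet and its own types.
struct Account
{
    public decimal Amount;
    public decimal CurrentInterest;
    public decimal CurrentInterestRate;
    public decimal NewInterest;
    public int DaysInYear;
}

static void AccrueInterest(Account[] array)
{
    for (int i = 0; i < array.Length; i++)
    {
        // 1m / 365 (or 366) is a full-precision decimal, so the following
        // multiply and add operate on operands that use all 96 bits.
        var dayFactor = 1m / array[i].DaysInYear;
        array[i].NewInterest = array[i].CurrentInterest
            + array[i].Amount * (array[i].CurrentInterestRate * dayFactor);
    }
}
```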

The actual workload performs valuation of different financial instruments (loans, etc.) and needs to use
the forecasted interest rate (which in turn is interpolated between different values, so the forecasted/expected interest rate will not be rounded).
The actual workload performs many more operations on the result of the "interest" calculation, all of which
execute at "full precision" in the large majority of cases.

Measurements

Results for different runtimes can be found under https://github.com/Daniel-Svensson/ClrExperiments/tree/master/ClrDecimal/Benchmarks/results

Performance in netcore 3

Updated 2019-06-13

BenchmarkDotNet=v0.11.5, OS=Windows 10.0.17763.379 (1809/October2018Update/Redstone5)
Intel Core i5-2500K CPU 3.30GHz (Sandy Bridge), 1 CPU, 4 logical and 4 physical cores
Frequency=14318180 Hz, Resolution=69.8413 ns, Timer=HPET
.NET Core SDK=3.0.100-preview3-010431
  [Host] : .NET Core ? (CoreCLR 4.6.27422.72, CoreFX 4.7.19.12807), 32bit RyuJIT

Job=ShortRun  Toolchain=InProcessToolchain  IterationCount=3  
LaunchCount=1  WarmupCount=3  
| Method | RoundedAmounts | SmallDivisor | Count | Mean | Error | StdDev |
|----------------------|-------|-------|--------|----------|------------|-----------|
| System.Decimal | False | False | 100000 | 50.42 ms | 11.8249 ms | 0.6482 ms |
| 'P/Invoke New C++' | False | False | 100000 | 33.27 ms | 8.6233 ms | 0.4727 ms |
| 'P/Invoke oleauto32' | False | False | 100000 | 30.25 ms | 0.9500 ms | 0.0521 ms |
| System.Decimal | False | True | 100000 | 46.80 ms | 13.5972 ms | 0.7453 ms |
| 'P/Invoke New C++' | False | True | 100000 | 30.32 ms | 0.1360 ms | 0.0075 ms |
| 'P/Invoke oleauto32' | False | True | 100000 | 28.78 ms | 1.9610 ms | 0.1075 ms |
| System.Decimal | True | False | 100000 | 37.30 ms | 10.4275 ms | 0.5716 ms |
| 'P/Invoke New C++' | True | False | 100000 | 29.47 ms | 0.1789 ms | 0.0098 ms |
| 'P/Invoke oleauto32' | True | False | 100000 | 26.36 ms | 6.1196 ms | 0.3354 ms |
| System.Decimal | True | True | 100000 | 32.52 ms | 5.5799 ms | 0.3059 ms |
| 'P/Invoke New C++' | True | True | 100000 | 26.54 ms | 0.1219 ms | 0.0067 ms |
| 'P/Invoke oleauto32' | True | True | 100000 | 24.07 ms | 0.0689 ms | 0.0038 ms |
  • PInvokeDummy – just a p/invoke call to a method that does nothing
  • Oleauto32 – p/invoke to oleauto32.dll (the code used previously via an internal fast call)
  • "New" – the code proposed in https://github.com/dotnet/coreclr/issues/10642
    The code doing the p/invoke calls is not "optimized"; for example, throwing code has not been extracted to a separate method, so the methods are not currently suitable for inlining.

Net core 2.2

Important: these results are from the first version of the benchmark, so the scale is different.

Performance in netcore 2.2 is very similar to .NET Framework.

BenchmarkDotNet=v0.11.4, OS=Windows 10.0.17763.379 (1809/October2018Update/Redstone5)
Intel Core i5-2500K CPU 3.30GHz (Sandy Bridge), 1 CPU, 4 logical and 4 physical cores
Frequency=14318180 Hz, Resolution=69.8413 ns, Timer=HPET
.NET Core SDK=3.0.100-preview3-010431
  [Host] : .NET Core ? (CoreCLR 4.6.27414.05, CoreFX 4.6.27414.05), 32bit RyuJIT

Job=MediumRun  Toolchain=InProcessToolchain  IterationCount=15  
LaunchCount=2  WarmupCount=10  
| Method | RoundedAmounts | SmallDivisor | Count | Mean | Error | StdDev | Median |
|----------------|-------|-------|--------|-----------|-----------|-----------|-----------|
| System.Decimal | False | False | 100000 | 38.041 ms | 0.2255 ms | 0.3305 ms | 38.072 ms |
| Oleauto32 | False | False | 100000 | 30.787 ms | 0.1140 ms | 0.1635 ms | 30.746 ms |
| System.Decimal | False | True | 100000 | 35.373 ms | 0.2204 ms | 0.3090 ms | 35.327 ms |
| Oleauto32 | False | True | 100000 | 29.285 ms | 0.2590 ms | 0.3631 ms | 29.084 ms |
| System.Decimal | True | False | 100000 | 34.526 ms | 0.1504 ms | 0.2205 ms | 34.426 ms |
| System.Decimal | True | True | 100000 | 31.612 ms | 0.2084 ms | 0.3119 ms | 31.511 ms |
| New | True | True | 100000 | 29.316 ms | 0.1713 ms | 0.2286 ms | 29.147 ms |
@Daniel-Svensson
Contributor Author

@jkotas, @pentp – you were part of the discussion in #7778.

As I tried to communicate there, I would be happy to port my performance improvements to the C++ code (with up to over 300% improvement on x64 and over 100% improvement on x86 worst-case performance) over to C#, but I would need some intrinsic support for the following (see the sketch after this list):

  • div & mul (https://github.com/dotnet/corefx/issues/32075)
  • bitscan (I think this may already be in place)
  • add/sub with carry (I have not seen any issue for it, but the non-intrinsic C++ code in my branch is faster than the old if/else code that was in coreclr)
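For concreteness, here is a rough sketch of the kind of helpers referred to above, using APIs that exist in later runtimes (`Math.BigMul(ulong, ulong, out ulong)` is .NET 5+, `BitOperations.LeadingZeroCount` is .NET Core 3.0+); the add-with-carry helper is hand-rolled here since no dedicated intrinsic is exposed:

```csharp
using System;
using System.Numerics;

static class DecimalMathHelpers
{
    // 64x64 -> 128-bit multiply: returns the high 64 bits, low part via out.
    // (Math.BigMul with this signature shipped in .NET 5.)
    public static ulong Mul64To128(ulong a, ulong b, out ulong low)
        => Math.BigMul(a, b, out low);

    // "Bitscan": leading-zero count, a hardware intrinsic since .NET Core 3.0.
    public static int LeadingZeros(ulong value)
        => BitOperations.LeadingZeroCount(value);

    // Add-with-carry sketch; written by hand because no intrinsic is exposed.
    public static ulong AddWithCarry(ulong a, ulong b, ref uint carry)
    {
        ulong sum = a + b + carry;
        carry = (sum < a || (carry == 1 && sum == a)) ? 1u : 0u;
        return sum;
    }
}
```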

@pentp
Contributor

pentp commented Mar 18, 2019

Is the NetFramework result the one for .NET Core?

The test data looks a bit contrived (interest calculations on 100 trillion, maybe in Venezuela only?). I focused the optimizations on more realistic numbers (weighted towards smaller numbers with fewer decimal places).

The setup method Random does not return a value between min/max (it should return rnd * range + min).
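For illustration, a minimal sketch of the suggested fix; the helper name and parameters are assumptions, the benchmark's actual setup code is in the repository linked in the issue description:

```csharp
using System;

static class BenchmarkSetup
{
    // Hypothetical helper: returns a pseudo-random decimal in [min, max).
    // The point of the fix is the formula rnd * range + min.
    public static decimal NextDecimal(Random rnd, decimal min, decimal max)
    {
        decimal range = max - min;
        return (decimal)rnd.NextDouble() * range + min;
    }
}
```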

@Daniel-Svensson
Contributor Author

> Is the NetFramework result the one for .NET Core?

Yes, I will try to change it to System.Decimal when I'm at a computer.

> The test data looks a bit contrived (interest calculation for 100 trillion, maybe in Venezuela only?). I focused optimizations on more realistic numbers (weighted towards smaller numbers with less decimal places).

The input can probably be divided by 100 for a slightly more realistic case; millions and sometimes billions are not uncommon for a large corporation (non-USD/EUR currencies).
But if I remember correctly it will not make any large difference, since the day factor produces "full precision" numbers and the example is really just the starting point for many other calculations.

> The setup method Random does not return a value between min/max (it should return rnd * range + min).

Thanks for the input, I will look into that at a later point.

@danmoseley
Member

@tannergooding can you take a look at this problem and share your thoughts?

@pentp
Contributor

pentp commented May 16, 2019

I think the benchmark needs some adjustments:

  • The setup method Random does not return a value between min/max (as mentioned above)
  • The non-rounded cases (raw data from double) do not make sense for financial calculations
  • The input data is unrealistic (amounts in trillions, interest in billions)
  • If the performance of this algorithm actually mattered, then dayFactor should not be calculated at all: NewInterest = CurrentInterest + Amount * (CurrentInterestRate / DaysInYear);

Currently this is testing the most expensive paths for decimal – all 96 bits / 28 decimal places used, plus very high scaling (20+).

@danmoseley
Member

@tannergooding thoughts? Is this needed for 3.0?

@tannergooding
Member

I would agree with @pentp. The benchmark doesn't appear to be representative of likely real-world scenarios (the range of the inputs/interest rates is too large), and I would like to see it updated and new numbers posted.

I don't think there is anything particularly pressing here for 3.0, although it would be good to validate that we haven't egregiously regressed the perf for x86.

@danmoseley
Member

@tannergooding wrt regressions, the key thing is to make sure there are appropriate microbenchmarks in the dotnet/performance repo. If so, we will automatically catch any regressions in our next pass, even if the benchmark didn't exist in the last release timeframe. (And perhaps this issue can be closed.)

@pentp
Contributor

pentp commented Jun 12, 2019

The change to managed decimal was done with careful performance measurements and some regressions for 32-bit were deemed acceptable (the unmanaged implementation was optimized for 32-bit). For 64-bit there are only a couple of odd corner cases that are slower, but real world use cases should actually see significant improvement.

@Daniel-Svensson
Contributor Author

I did some updates to the benchmarks some time ago, but it seems my update post did not make it here. I have updated the description with the results from the benchmarks with much smaller numbers.

  • There is still an "off by one" error for random, so it will never use 366 days, but apart from that I think I did make most of the suggested updates.
  • I also did some tests with much smaller numbers, so the amounts were on the order of 200, but there were no major changes in the relative timings.

The important thing about this simplified example is that it is far from the worst-case scenario,
but it still triggers "bad" performance.

As soon as you start dividing numbers or take input from doubles you will very quickly get
"full precision" numbers and performance regressions.
There will certainly be many types of workloads where full-precision numbers are the majority and not the exception:

  • Just perform operations where one of the operands is 1/3, 1/365 or similar and you will fall into
    the category of bad (not always worst-case) performance, as the sketch below shows.
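A quick sketch of the effect (the values in the comments are approximate and only illustrate that a single division immediately produces operands that use all 28 fractional digits):

```csharp
using System;

decimal dayFactor = 1m / 365m;   // ~0.0027397260273972602739726027 (scale 28)
decimal third     = 1m / 3m;     // 0.3333333333333333333333333333
decimal interest  = 1_000_000m * 0.025m * dayFactor; // full-precision intermediates from here on

Console.WriteLine(dayFactor);
Console.WriteLine(third);
Console.WriteLine(interest);
```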

Since I believe that worst/bad-case performance also matters, I really think you should measure those cases as well and not assume that almost all input is limited to 32-bit precision.

@msftgits msftgits transferred this issue from dotnet/coreclr Jan 31, 2020
@msftgits msftgits added this to the Future milestone Jan 31, 2020
@maryamariyan maryamariyan added the untriaged New issue has not been triaged by the area owner label Feb 26, 2020
@tannergooding tannergooding added area-System.Numerics and removed area-System.Runtime untriaged New issue has not been triaged by the area owner labels Jun 23, 2020
@ghost

ghost commented Jun 23, 2020

Tagging subscribers to this area: @tannergooding
Notify danmosemsft if you want to be subscribed.

@dotnet-policy-service

Due to lack of recent activity, this issue has been marked as a candidate for backlog cleanup. It will be closed if no further activity occurs within 14 more days. Any new comment (by anyone, not necessarily the author) will undo this process.

This process is part of our issue cleanup automation.

@dotnet-policy-service dotnet-policy-service bot added backlog-cleanup-candidate An inactive issue that has been marked for automated closure. no-recent-activity labels Dec 24, 2024