Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CRT Pow function has bad performance on Windows #10798

Open
fiigii opened this issue Jul 30, 2018 · 19 comments
Open

CRT Pow function has bad performance on Windows #10798

fiigii opened this issue Jul 30, 2018 · 19 comments
Labels
area-System.Numerics backlog-cleanup-candidate An inactive issue that has been marked for automated closure. enhancement Product code improvement that does NOT require public API changes/additions no-recent-activity tenet-performance Performance related issue tracking-external-issue The issue is caused by external problem (e.g. OS) - nothing we can do to fix it directly
Milestone

Comments

@fiigii
Copy link
Contributor

fiigii commented Jul 30, 2018

During benchmarking AoS/SoA ray-tracer dotnet/coreclr#18839, we found that the Vector3 benchmark (RayTracer) is much slower on Windows than Linux.

Execution time Windows Linux
Baseline (RayTracer ) 6.00s 4.13s
PacketTracer 1.20s 1.35s
Performance Gains 5.00x 3.06x

According to VTune analysis, this gap is caused by the CRT math library, which RayTracer uses Math.Pow at https://github.com/dotnet/coreclr/blob/master/tests/src/JIT/Performance/CodeQuality/SIMD/RayTracer/Raytracer.cs#L153

Windows

image

Linux

image

On the left side (AoS means RayTracer), we can see ucrtbase.dll on Windows has much more time consuming and instruction retired than libm-2.23.so on Linux.

The data is collected on Core i9 + VS2017, but Core i7+ VS2015 has the same performance gap.

@tannergooding
Copy link
Member

@fiigii, do you have any metrics for what time is spent in ucrtbase.dll (as well as RayTracer.dll and libm-2.23.so)? I wonder if the metrics are partially "skewed" due to different inlining characteristics of the math libraries/etc.

@fiigii
Copy link
Contributor Author

fiigii commented Jul 30, 2018

@tannergooding Here is the VTune data of CRT
image

@fiigii
Copy link
Contributor Author

fiigii commented Jul 30, 2018

The CPI of pow on Windows looks healthy, the issue may be from its algorithm/implementation.

@fiigii
Copy link
Contributor Author

fiigii commented Jul 30, 2018

cc @AndyAyersMS @CarolEidt @jkotas

@tannergooding
Copy link
Member

the issue may be from its algorithm/implementation.

@fiigii, that would be a bit surprising. I'm looking at the implementation and it is some fairly heavily optimized FMA3 code (there is also an SSE2 code path, but you shouldn't hit that).

Unfortunately, that implementation is closed source, so I can't share it here.

@tannergooding
Copy link
Member

I'm trying to collect a trace locally as well, to see if I get the same.

@fiigii
Copy link
Contributor Author

fiigii commented Jul 30, 2018

is some fairly heavily optimized FMA3 code

Right, I saw it also from disasm.

@tannergooding
Copy link
Member

@fiigii, how was CoreCLR compiled for you?

I'm testing on only a i7-7700 @ 3.6 GHz and the run completes in just 1.231s (as compared to the nearly 6s the above shows) -- This is for Baseline (AoS))

@fiigii
Copy link
Contributor Author

fiigii commented Jul 30, 2018

I'm testing on only a i7-7700 @ 3.6 GHz and the run completes in just 1.231s (as compared to the nearly 6s the above shows) -- This is for Baseline (AoS))

I changed the image size from 250x250 to 2480x2480 and just rendered one image (not using RenderLoop) to avoid the collection pool imapct for the profling.
https://github.com/dotnet/coreclr/blob/master/tests/src/JIT/Performance/CodeQuality/SIMD/RayTracer/RayTracerBench.cs#L33-L34
Sorry for the unclarity.

@tannergooding
Copy link
Member

just rendered one image (not using RenderLoop) to avoid the collection pool imapct for the profling.

So, to clarify, you changed just the following?:

-private const int Width = 250;
-private const int Height = 250;
-private const int Iterations = 7;
+private const int Width = 2480;
+private const int Height = 2480;
+private const int Iterations = 1;

@fiigii
Copy link
Contributor Author

fiigii commented Jul 30, 2018

@tannergooding
Copy link
Member

Looks like glibc recently (07 AUG 2017) made a few changes: https://sourceware.org/git/?p=glibc.git;a=commit;h=57a72fa3502673754d14707da02c7c44e83b8d20

Namely, they still use the IBM Accurate Mathematical Library as their root source code, however, they now have some new logic which additionally compiles that code with the -mfma and -mavx2 flags, which provides some automatic transformations/optimizations (it looks like they do a cached CPUID check at runtime and jump to the appropriate code).

Additionally, it looks like, since the calling conventions map up, they generally end up calling libm-2.27.so~__pow directly, rather than having an intermediate call through COMDouble::Pow.

CC. @CarolEidt, @AndyAyersMS, @jkotas

@roterdam
Copy link

Does CoreCLR not JIT methods with the platform calling convention?

@tannergooding
Copy link
Member

@roterdam, it does. However, the backing implementations for most System.Math functions aren't managed code, they are "FCALLs" to the underlying C Runtime implementation (through the COMDouble and COMSingle classes).

@fiigii fiigii changed the title CRT math library has bad performance on Windows CRT Pow function has bad performance on Windows Jul 31, 2018
@AaronRobinsonMSFT
Copy link
Member

@tannergooding Were you able to reproduce this issue? Is this something we can do or do we need to loop in the VC++ team?

@tannergooding
Copy link
Member

@AaronRobinsonMSFT. Yes, I was able to reproduce this.

I believe this is already tracked by one of the C++ bugs I logged internally, but I will double-check and log a new one if not.

This is also part of a bigger picture with System.Math/MathF that is being tracked internally. I can share more details offline if necessary.

@AaronRobinsonMSFT
Copy link
Member

@tannergooding Not necessary. My main goal was simply to set milestones for issues tagged with VM. If this is something that is post-3.0, then feel free to tag as needed. If 3.0, do we have a plan to deliver it?

@msftgits msftgits transferred this issue from dotnet/coreclr Jan 31, 2020
@msftgits msftgits added this to the Future milestone Jan 31, 2020
@maryamariyan maryamariyan added the untriaged New issue has not been triaged by the area owner label Feb 26, 2020
@tannergooding tannergooding added area-System.Numerics and removed area-System.Runtime untriaged New issue has not been triaged by the area owner labels Jun 23, 2020
@ghost
Copy link

ghost commented Jun 23, 2020

Tagging subscribers to this area: @tannergooding
Notify danmosemsft if you want to be subscribed.

Copy link
Contributor

Due to lack of recent activity, this issue has been marked as a candidate for backlog cleanup. It will be closed if no further activity occurs within 14 more days. Any new comment (by anyone, not necessarily the author) will undo this process.

This process is part of our issue cleanup automation.

@dotnet-policy-service dotnet-policy-service bot added backlog-cleanup-candidate An inactive issue that has been marked for automated closure. no-recent-activity labels Dec 24, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area-System.Numerics backlog-cleanup-candidate An inactive issue that has been marked for automated closure. enhancement Product code improvement that does NOT require public API changes/additions no-recent-activity tenet-performance Performance related issue tracking-external-issue The issue is caused by external problem (e.g. OS) - nothing we can do to fix it directly
Projects
None yet
Development

No branches or pull requests

6 participants