[Flang] Fix performance issue in 549.fotonik3d_r #58303

Open · kiranchandramohan opened this issue Oct 11, 2022 · 11 comments
Labels: flang, performance

Comments

@kiranchandramohan (Contributor)

The fotonik benchmark performs poorly with Flang: it is almost three times slower than with other compilers. Investigate the performance issue.

kiranchandramohan added the flang label and removed the new issue label on Oct 11, 2022
@Leporacanthicus (Contributor)

So the single-sentence answer to this is "we're lacking alias analysis".

Here are a few more words. If anyone wants further details, please feel free to ask, and I will try to answer any questions.

With flang built from llvm/main, compared against two versions of gfortran (9.4.0 and 13.0, top of tree as of a few days ago), the total runtime of 549.fotonik3d_r is around 2.6-3 times longer for Flang than for gfortran, on the same hardware platform and using -O3 for both (-Ofast lost a bit of time with gfortran 13.0; I didn't try -Ofast with gfortran 9.4).

llvm-flang vs gfortran 9.4:

| Function | %flang | %gfortran | Relative |
| --- | --- | --- | --- |
| material::mat_update | 28.08% | 26.16% | 2.85 |
| UPML::upml_updatee_simple | 24.43% | 20.64% | 3.07 |
| UPML::upml_updateh | 22.36% | 16.21% | 3.67 |
| update::update_h | 12.14% | 15.28% | 2.11 |
| power::power_dft | 11.68% | 19.1% | 1.62 |

The relative number is calculated from the percentage of total runtime, as llvm-flang time divided by gfortran time. For example, if gfortran ran in 200s, then 26.16% of that would be 52.3s, and the corresponding llvm-flang time would be about 149s (2.85 x 52.3s). These are NOT the actual numbers, just an illustration of the calculation. Percentage figures are as reported by perf record/perf report.

There are also differences in the use of FMA-type operations and vector operations (the vector operations may get resolved by alias analysis; the FMA difference would probably go away once an -Ofast flag is available).

Hacking the compiler to use FMA aggressively gives about a 3% overall improvement, so from 2.65x down to 2.6x against 9.4 - I didn't make this precise comparison with 13.0.

Hacking ScopedNoAliasAA to always answer "no alias", combined with the forced use of FMA, brings the comparison down to 1.23x slower against 9.4, and 1.37x slower against 13.0.

| Function | %flang | %gfortran | Relative |
| --- | --- | --- | --- |
| material::mat_update | 34.14% | 26.16% | 1.6 |
| UPML::upml_updatee_simple | 17.48% | 20.64% | 1.04 |
| UPML::upml_updateh | 15.91% | 16.21% | 1.21 |
| update::update_h | 15.46% | 15.28% | 1.25 |
| power::power_dft | 14.20% | 19.1% | 0.91 |

Same table for 13.0 comparison:

| Function | %flang | %gfortran | Relative |
| --- | --- | --- | --- |
| material::mat_update | 34.14% | 23.51% | 2.0 |
| UPML::upml_updatee_simple | 17.48% | 22.14% | 1.06 |
| UPML::upml_updateh | 15.91% | 16.85% | 1.30 |
| update::update_h | 15.46% | 15.63% | 1.36 |
| power::power_dft | 14.20% | 19.1% | 1.03 |

(Note that in 9.4.0, UPML::upml_updatee inlines UPML::upml_updatee_simple, so if you run this experiment yourself you may not see the same names in gfortran as listed above - the inlining strategy in the latest version seems to have changed, so it no longer inlines that rather large function.)

@vzakhari (Contributor)

I approached this in a different way on an x86 CPU, but the conclusion is the same.

I took the innermost loop of the loop nest at line 2228 in UPML::upml_updatee_simple. The initial Flang time is 54.36s for the whole loop nest. I manually added alias.scope/noalias metadata, fast flags, and nsw/nuw flags for the offset and loop induction-variable computations (see the LLVM IR diff in the attached files; a reduced sketch follows the list below). Execution time of the loop nest decreased with each modification:

  • Alias metadata - enables innermost loop vectorization: 27.825s
  • nsw/nuw - gets rid of redundant integer computations inside the innermost loop: 26.06s
  • fast - FMAs: 24.1s
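
For illustration, here is a minimal hand-written sketch of what these three modifications look like on a simple innermost loop. It is not taken from the attached diff; the function, loop, and metadata names are invented:

```llvm
; Hypothetical reduced example:
;  * !alias.scope / !noalias  - tell LLVM the src and dst accesses cannot
;                               overlap, which unblocks vectorization
;  * nuw nsw                  - no-wrap flags on the induction/offset math,
;                               so redundant integer computations fold away
;  * fast                     - fast-math flags on the FP ops, allowing
;                               fmul+fadd to be fused into an FMA
define void @axpy(ptr %dst, ptr %src, double %a, i64 %n) {
entry:
  br label %loop

loop:                                             ; assumes %n >= 1
  %i = phi i64 [ 0, %entry ], [ %i.next, %loop ]
  %src.addr = getelementptr inbounds double, ptr %src, i64 %i
  %s = load double, ptr %src.addr, align 8, !alias.scope !2
  %dst.addr = getelementptr inbounds double, ptr %dst, i64 %i
  %d = load double, ptr %dst.addr, align 8, !noalias !2
  %m = fmul fast double %a, %s
  %r = fadd fast double %m, %d
  store double %r, ptr %dst.addr, align 8, !noalias !2
  %i.next = add nuw nsw i64 %i, 1
  %done = icmp eq i64 %i.next, %n
  br i1 %done, label %exit, label %loop

exit:
  ret void
}

!0 = distinct !{!0, !"axpy scope domain"}   ; alias scope domain
!1 = distinct !{!1, !0, !"src accesses"}    ; scope covering the src load
!2 = !{!1}                                  ; scope list used above
```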

There are multiple loop nests like this one in UPML::upml_updatee_simple, so they should all benefit from the same LLVM optimizations once the proper metadata/attributes are provided. UPML::upml_updateh consists of loop nests with the same characteristics.

LLVMIRdiff.zip

@kiranchandramohan (Contributor, Author)

We need either full TBAA (as in classic flang) or the full restrict patches to get the benefits in fotonik.

kiranchandramohan changed the title from "[Flang] Investigate performance issue in 549.fotonik3d_r" to "[Flang] Fix performance issue in 549.fotonik3d_r" on Jun 1, 2023
@tblah (Contributor) commented Sep 29, 2023

I'm working on classic-flang-like TBAA here: https://discourse.llvm.org/t/rfc-propagate-fir-alias-analysis-information-using-tbaa/73755
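
Roughly speaking, the idea is to emit per-function TBAA trees in which distinct variables get distinct leaf nodes, so LLVM's TBAA can disambiguate their accesses. A hypothetical, hand-written sketch (the tree layout and node names below are simplified illustrations, not the exact tags the patches emit):

```llvm
; Hypothetical sketch: the load from one module variable and the store to
; another carry sibling TBAA leaf nodes, so TBAA reports NoAlias and the
; surrounding loop can be optimized without runtime overlap checks.
define double @scale(ptr %e_field, ptr %h_field) {
  %h = load double, ptr %h_field, align 8, !tbaa !4
  %v = fmul double %h, 2.500000e-01
  store double %v, ptr %e_field, align 8, !tbaa !6
  ret double %v
}

!0 = !{!"Flang function root scale"}        ; per-function TBAA root
!1 = !{!"any access", !0, i64 0}
!2 = !{!"any data access", !1, i64 0}
!3 = !{!"global data", !2, i64 0}
!4 = !{!5, !5, i64 0}                       ; access tag for h_field
!5 = !{!"global data/h_field", !3, i64 0}
!6 = !{!7, !7, i64 0}                       ; access tag for e_field
!7 = !{!"global data/e_field", !3, i64 0}
```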

@tblah (Contributor) commented Nov 28, 2023

TBAA tags are now enabled by default, but fotonik is still a bit slower than classic flang, so I will leave this ticket open for now.

#73111

@vzakhari (Contributor)

Hi Tom,

I see performance degradations on 549.fotonik3d (-33%), 437.leslie3d (-168%) and 459.GemsFDTD (-27%) on x86, and 437.leslie3d (-217%) and 459.GemsFDTD (-74%) on aarch64.

I tried 437 with -mllvm -disable-fir-alias-tags on Icelake, and the performance was restored. I am using flang-new -Ofast -march=native.

Are these regressions expected?

@tblah (Contributor) commented Nov 29, 2023

Hi Slava,

Thanks for letting me know. No, these were not expected. I hadn't tried on x86, but the aarch64 regressions are a big surprise. I will look into this immediately.

Would you like me to revert enabling alias tags by default in the meantime?

@tblah (Contributor) commented Nov 29, 2023

Just to confirm, I have reproduced the aarch64 issue and am looking into it. After that, Mats or I will look at x86.

The aarch64 regression is not present when LTO is used. GemsFDTD is 28% faster with the TBAA tags when LTO is enabled.

@vzakhari (Contributor)

Thank you for looking into this, Tom! If the investigation/resolution takes more than a couple of days, I would prefer reverting it.

tblah added a commit to tblah/llvm-project that referenced this issue Nov 29, 2023
This reverts commit caba031.

Serious performance regressions were reported by @vzakhari
llvm#58303 (comment)

Fixing this doesn't look quick so I will revert for now.
tblah added a commit that referenced this issue Nov 29, 2023 (the same revert of caba031, with the same message).
@tblah (Contributor) commented Nov 29, 2023

It is reverted in #73821.

@vzakhari (Contributor)

Thank you, Tom! It turns out the 437.leslie3d regression is the only real one on my side (on both aarch64 and x86). GemsFDTD and fotonik3d actually got nice improvements from this change (on both aarch64 and x86). capacita_11 slowed down by 8% on x86, but I would not worry about it at this point.

Status: In Progress · 5 participants