Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Segmentation fault caused by likely LLVM bug that results in incorrect exception handling on aarch64 #3874

Open
SeanTAllen opened this issue Oct 2, 2021 · 30 comments
Assignees
Labels
bug Something isn't working discuss during sync Should be discussed during an upcoming sync help wanted Extra attention is needed
Milestone

Comments

@SeanTAllen
Copy link
Member

SeanTAllen commented Oct 2, 2021

While getting pony working on 64-bit Arm, we found a bug where the test-stdlib-debug tests would segfault. Everything worked fine with release builds, but debug builds would crash.

After investigation, we've found that in codegen_machine in host.cc if the opt_level is get to either CodeGenOpt::Default or CodeGenOpt::Aggressive instead of CodeGenOpt::None the problem goes away.

It seems likely that this is an LLVM bug, but that hasn't been established. We've committed a fix in host.cc that when the target is arm (which is 64-bit arm) that we use CodeGenOpt::Default. This is far from ideal, but gets us working compilations until we can investigate further.

The end result of this should either be a patch that we upstream to LLVM or possibly, the discovery that we are doing something wrong with our Arm code generation that optimization manages to fix.

@SeanTAllen SeanTAllen added bug Something isn't working good first issue Good for newcomers help wanted Extra attention is needed needs investigation This needs to be looked into before its "ready for work" labels Oct 2, 2021
@SeanTAllen
Copy link
Member Author

The host.cc change makes no difference when we are building using musl rather than glibc. We still get segfaults in the debug build on Arm-64 when using musl.

@SeanTAllen
Copy link
Member Author

SeanTAllen commented Mar 25, 2022

The tests that fail seem to be repeatable from run to run. I've been working through the builtin tests commenting out the ones that crash and then another crashes but it is always one that wasn't run previously. I'm building LLVM on a 64 bit raspberry pi 4 to see if it is reproducible there or if we are in for a tricky time getting more info.

The changes are on the branch 3874.

@ponylang-main ponylang-main added the discuss during sync Should be discussed during an upcoming sync label Mar 25, 2022
@SeanTAllen SeanTAllen added this to the Arm support milestone Mar 25, 2022
@SeanTAllen
Copy link
Member Author

Ok this is reproducible with the same tests failing on Raspberry PI 4 64 bit.

@SeanTAllen
Copy link
Member Author

Well this is fascinating...

(gdb) bt
#0  0x00000055559e98b0 in pony_test_TestHelper_val__format_loc_oo (this=0x7fe6d016a0, loc=0x0)
    at /home/pi/code/ponylang/ponyc/packages/pony_test/test_helper.pony:362
#1  0x00000055559e4fc4 in pony_test_TestHelper_val_assert_error_ooob (this=0x7fddce1c40, test=0x555640a108, msg=0x5556410380, loc=0x55564077c0)
    at /home/pi/code/ponylang/ponyc/packages/pony_test/test_helper.pony:95
#2  0x0000005555b49434 in builtin_test__TestArrayInsert_ref_apply_oo (this=0x7fe65001e0, h=0x7fddce1c40)
    at /home/pi/code/ponylang/ponyc/packages/builtin_test/_test.pony:1512
#3  0x00000055557bef9c in pony_test__TestRunner_tag_run_o (this=0x7fe649dc00)
    at /home/pi/code/ponylang/ponyc/packages/pony_test/_test_runner.pony:63
#4  0x0000005555764728 in pony_test__TestRunner_Dispatch ()
#5  0x0000005555c6993c in ponyint_actor_run ()
#6  0x0000005555c73d28 in run_thread ()
#7  0x0000007ff7fa67e4 in start_thread (arg=0x7ffffff16f) at pthread_create.c:486
#8  0x0000007ff7dd6adc in thread_start () at ../sysdeps/unix/sysv/linux/aarch64/clone.S:78

@SeanTAllen
Copy link
Member Author

@jemc this is one for you and i to talk about.

@ergl
Copy link
Member

ergl commented Mar 25, 2022

The backtrace looks identical to what we were getting on the M1. I can't remember the root cause, but maybe @jemc can remember.

Edit: some of these threads might help.

Although on second thought, it seems like the problem with the backtrace above is that the location object is null. I'd look at the IR to check if we're getting any undef references for that.

@SeanTAllen SeanTAllen changed the title "LLVM bug" impacting debug builds on 64-bit Arm Segmentation fault with debug standard library build with release runtime on Aarch64 Mar 25, 2022
@SeanTAllen
Copy link
Member Author

So it looks like:

#0  0x00000055559e98b0 in pony_test_TestHelper_val__format_loc_oo (this=0x7fe6d016a0, loc=0x0)
    at /home/pi/code/ponylang/ponyc/packages/pony_test/test_helper.pony:362
#1  0x00000055559e4fc4 in pony_test_TestHelper_val_assert_error_ooob (this=0x7fddce1c40, test=0x555640a108, msg=0x5556410380, loc=0x55564077c0)
    at /home/pi/code/ponylang/ponyc/packages/pony_test/test_helper.pony:95

We have the location in frame 1. But its null in frame 2.

  fun assert_error(test: ITest box, msg: String = "", loc: SourceLoc = __loc)
    : Bool
  =>
    """
    Assert that the given test function throws an error when run.
    """
    try
      test()?
      fail(_format_loc(loc) + "Assert error failed. " + msg)
      false
    else
      log(_format_loc(loc) + "Assert error passed. " + msg, true)
      true
    end

Its a straight passthrough. I suspect that the reason that having some optimization causes the crash to go away is that we inline the very small _format_loc method and so whatever our weirdness is for loc being null goes away. But I have no actual evidence of that yet, its purely a guess. Really thought the important question is "how the hell does loc end up null".

Here's _format_loc for edification:

  fun _format_loc(loc: SourceLoc): String =>
    loc.file() + ":" + loc.line().string() + ": "

@SeanTAllen
Copy link
Member Author

SeanTAllen commented Mar 25, 2022

I have determined that the failure happens with a more minimal example as well:

use "pony_test"
use "collections"

actor \nodoc\ Main is TestList
  new create(env: Env) => PonyTest(env, this)
  new make() => None

  fun tag tests(test: PonyTest) =>
    test(_TestArrayFind)

class \nodoc\ iso _TestArrayFind is UnitTest
  fun name(): String => "builtin/Array.find"

  fun apply(h: TestHelper) =>
    let a = [as ISize: 0; 1; 2; 3; 4; 1]
    h.assert_error({() ? => a.find(6)? })

And changing so that a.find won't error doesn't result in a crash:

use "pony_test"
use "collections"

actor \nodoc\ Main is TestList
  new create(env: Env) => PonyTest(env, this)
  new make() => None

  fun tag tests(test: PonyTest) =>
    test(_TestArrayFind)

class \nodoc\ iso _TestArrayFind is UnitTest
  fun name(): String => "builtin/Array.find"

  fun apply(h: TestHelper) =>
    let a = [as ISize: 0; 1; 2; 3; 4; 1]
    h.assert_error({() ? => a.find(1)? })

@SeanTAllen
Copy link
Member Author

Linting the llvm with --lint-llvm results in:

Pessimization: Static alloca outside of entry block
  %22 = alloca i8*, align 8
Pessimization: Static alloca outside of entry block
  %22 = alloca i8*, align 8
Pessimization: Static alloca outside of entry block
  %22 = alloca i8*, align 8
Pessimization: Static alloca outside of entry block
  %22 = alloca i8*, align 8
Pessimization: Static alloca outside of entry block
  %22 = alloca i8*, align 8
Pessimization: Static alloca outside of entry block
  %22 = alloca i8*, align 8
Pessimization: Static alloca outside of entry block
  %22 = alloca i8*, align 8
Pessimization: Static alloca outside of entry block
  %22 = alloca i8*, align 8
Pessimization: Static alloca outside of entry block
  %22 = alloca i8*, align 8

but this happens on x86 where there is no issue as well.

--verify reports no issues.

@SeanTAllen
Copy link
Member Author

Here's a more minimal example:

actor Main
  new create(env: Env) =>
    foo("hello")

  fun foo(s: String) =>
    try
      one(s)
      error
    else
      one(s)
    end

  fun one(s: String) =>
    s.clone()

@SeanTAllen
Copy link
Member Author

@ergl could you see if this fails on an M1? The key seems to be to use a release runtime but do -d when compiling the program.

@ergl
Copy link
Member

ergl commented Mar 25, 2022

@SeanTAllen No errors for me. No combination of release/debug runtime and setting -d or leaving it off had the issue for the two examples you posted.

@SeanTAllen
Copy link
Member Author

Interesting so this might be aarch64 + linux issue.

@jemc
Copy link
Member

jemc commented Mar 25, 2022

Note that if it's like the M1 issue, then the problem goes away if you link the program against the runtime bitcode (which I still think should be the default build flow where available).

@SeanTAllen
Copy link
Member Author

@ergl can you try on m1 with the host.cc change removed? so that debug is doing no optimization?

@jemc
Copy link
Member

jemc commented Mar 25, 2022

@SeanTAllen and I think this is related to us needing to write some ARM64-specific code in posix_except.c. We currently have ARM32-specific code, and then catch-all code for everything else.

Here's a PDF that includes some info on the ARM64 Exception Handling ABI: https://documentation-service.arm.com/static/5fa190a7b1a7c5445f2904d6?token=

@jemc
Copy link
Member

jemc commented Mar 25, 2022

That PDF links to this doc, which seems to have the more relevant information: https://itanium-cxx-abi.github.io/cxx-abi/abi-eh.html

@SeanTAllen
Copy link
Member Author

SeanTAllen commented Mar 25, 2022

@sylvanc says he is believes that yes, the current usage of catch all code is probably wrong for arm exceptions.

but when i told him that aarch64 docs say that it is the same as itanium. he said itanium is the same as x86 except perhaps some low level register setting.

@jemc
Copy link
Member

jemc commented Mar 25, 2022

@SeanTAllen - try this diff:

diff --git a/src/libponyc/codegen/codegen.c b/src/libponyc/codegen/codegen.c
index 6a972f88..147435b5 100644
--- a/src/libponyc/codegen/codegen.c
+++ b/src/libponyc/codegen/codegen.c
@@ -265,6 +265,7 @@ static void init_runtime(compile_t* c)
 
   unsigned int align_value = target_is_ilp32(c->opt->triple) ? 4 : 8;
 
+  LLVM_DECLARE_ATTRIBUTEREF(uwtable_attr, uwtable, 0);
   LLVM_DECLARE_ATTRIBUTEREF(nounwind_attr, nounwind, 0);
   LLVM_DECLARE_ATTRIBUTEREF(readnone_attr, readnone, 0);
   LLVM_DECLARE_ATTRIBUTEREF(readonly_attr, readonly, 0);
@@ -606,7 +607,9 @@ static void init_runtime(compile_t* c)
 
   // i32 ponyint_personality_v0(...)
   type = LLVMFunctionType(c->i32, NULL, 0, true);
-  c->personality = LLVMAddFunction(c->module, "ponyint_personality_v0", type);
+  value = LLVMAddFunction(c->module, "ponyint_personality_v0", type);
+  c->personality = value;
+  LLVMAddAttributeAtIndex(value, LLVMAttributeFunctionIndex, uwtable_attr);
 
   // i32 memcmp(i8*, i8*, intptr)
   params[0] = c->void_ptr;

@SeanTAllen
Copy link
Member Author

@jemc same boom

@SeanTAllen
Copy link
Member Author

SeanTAllen commented Mar 26, 2022

Interestingly, this also segfaults because of the message send that originates in the else of the try:

actor Main
  let _out: OutStream

  new create(env: Env) =>
    _out = env.out
    foo("hello")

  fun foo(s: String) =>
    try
      one("h")
      error
    else
      one("e")
    end

  fun one(s: String) =>
    _out.print(s)

but this does not:

actor Main
  let _out: OutStream

  new create(env: Env) =>
    _out = env.out
    foo("hello")

  fun foo(s: String) =>
    try
      one("h")
      error
    else
      one("e")
    end

  fun one(s: String) =>
    s.clone()

@SeanTAllen SeanTAllen removed the good first issue Good for newcomers label Mar 26, 2022
@SeanTAllen
Copy link
Member Author

SeanTAllen commented Mar 27, 2022

I discussed this with @sylvanc. He took a look at the pony_try llvm ir that we use and said it looked correct and the personality function looked correct as well. He suggested looking at a C++ aarch64 personality function to see if it was different in a way we weren't aware of.

I'll be taking a look at the one from the LLVM project:

https://github.com/llvm/llvm-project/blob/main/libcxxabi/src/cxa_personality.cpp

@SeanTAllen
Copy link
Member Author

SeanTAllen commented Apr 2, 2022

I've found a difference in our personality function from the c++ abi.

When we set registers, we do not do what the C++ does:

  _Unwind_SetGR(context, __builtin_eh_return_data_regno(1),
                static_cast<uintptr_t>(results.ttypeIndex));

It is setting context to the type index found from the lsda scan.

Whereas we do:

  _Unwind_SetGR(context, __builtin_eh_return_data_regno(1), 0);

Additionally there are differences in ordering in the personality functions which might be because we need the scan results for use elsewhere or simply different way of writing the code. The pony code is clearer, but maybe not correct in addition to the register issue.

@SeanTAllen
Copy link
Member Author

So we get the ttypeIndex from:

/// Read a sleb128 encoded value and advance pointer
/// See Variable Length Data Appendix C in:
/// @link http://dwarfstd.org/Dwarf4.pdf @unlink
/// @param data reference variable holding memory pointer to decode from
/// @returns decoded value
static
intptr_t
readSLEB128(const uint8_t** data)

plus there's additional logic for deciding it to actually set that in results that I haven't fully worked through.

@SeanTAllen SeanTAllen added triggers release Major issue that when fixed, results in an "emergency" release and removed help wanted Extra attention is needed needs investigation This needs to be looked into before its "ready for work" labels Apr 2, 2022
@SeanTAllen SeanTAllen self-assigned this Apr 2, 2022
@SeanTAllen
Copy link
Member Author

I changed this to "triggers release" as this is actually incorrect exception handling on aarch64.

@SeanTAllen SeanTAllen changed the title Segmentation fault with debug standard library build with release runtime on Aarch64 Segmentation fault caused by incorrect exception handling on aarch64 Apr 2, 2022
SeanTAllen added a commit that referenced this issue Apr 2, 2022
@jemc jemc removed the discuss during sync Should be discussed during an upcoming sync label Apr 5, 2022
@jemc
Copy link
Member

jemc commented Apr 5, 2022

@ponylang-main ponylang-main added the discuss during sync Should be discussed during an upcoming sync label Apr 5, 2022
@SeanTAllen
Copy link
Member Author

I talked through this with Sylvan and we came to the conclusion that this is a probably a bug in LLVM like I initially thought. In particular a bug in the aarch64 code that happens to get fixed by some optimization. Because the LLVM ir is the same for x86 debug and aarch64 debug besides the platform things you'd expect to be different and we aren't doing anything special for aarch64 vs x86 so it isn't a bug in what we are doing AND as joe has seen, if you create bitcode and link with the llvm tools, the bug also goes away so... bug somewhere in llvm for aarch64.

@SeanTAllen SeanTAllen removed the triggers release Major issue that when fixed, results in an "emergency" release label Apr 13, 2022
@SeanTAllen SeanTAllen changed the title Segmentation fault caused by incorrect exception handling on aarch64 Segmentation fault caused by likely LLVM bug that results in incorrect exception handling on aarch64 Apr 26, 2022
@SeanTAllen SeanTAllen added help wanted Extra attention is needed and removed discuss during sync Should be discussed during an upcoming sync labels Apr 26, 2022
@SeanTAllen
Copy link
Member Author

I've been thinking about this issue and something new occurs to me.

Why does this only happen with a release/optimized version of the runtime? Why is it fine with a debug runtime + debug pony binary? That seems very odd. I have a feeling that perhaps there's something there that will help us find the issue.

@ponylang-main ponylang-main added the discuss during sync Should be discussed during an upcoming sync label Oct 27, 2022
@SeanTAllen
Copy link
Member Author

I retested this and it looks like it effects both debug and release versions of the runtime, which was either overlooked when i first opened this issue or is new. Either way, it makes me feel a little better that it isn't only happening with release versions of the runtime.

@SeanTAllen SeanTAllen removed the discuss during sync Should be discussed during an upcoming sync label Nov 1, 2022
SeanTAllen added a commit that referenced this issue Sep 12, 2023
After extensive investigation of the weirdness that is #4369, Joe
and I believe the issue is most likely an LLVM bug. However, we
don't have time to pursue further.

During the investigation, I came to believe that the change to use
CodeGenOpt::Default instead of CodeGenOpt::None for target optimizations
would be prudent.

It prevents the known instances of bugs like #4369 which is nice but
not key to this change. Changes things because they make symptoms
go away isn't a good strategy if you don't understand why. We do not
"fully" understand the impact of this change on bugs like #4369. However,
it appears that CodeGenOpt::None is a code path that sees little use
or testing and is more prone to bugs/regression.

Given that we have seen more than 1 bug that "smells like" #4369, we felt
it was wise to switch to Default from None at least until such time as we
understand #4369 and #3874. LLVM is a complicated beast and using code paths
that are little used can be a dangerous proposition.

This commit doesn't include release notes as we addresses to user facing
instance of #4369 already.
SeanTAllen added a commit that referenced this issue Sep 12, 2023
After extensive investigation of the weirdness that is #4369, Joe
and I believe the issue is most likely an LLVM bug. However, we
don't have time to pursue further.

During the investigation, I came to believe that the change to use
CodeGenOpt::Default instead of CodeGenOpt::None for target optimizations
would be prudent.

It prevents the known instances of bugs like #4369 which is nice but
not key to this change. Changes things because they make symptoms
go away isn't a good strategy if you don't understand why. We do not
"fully" understand the impact of this change on bugs like #4369. However,
it appears that CodeGenOpt::None is a code path that sees little use
or testing and is more prone to bugs/regression.

Given that we have seen more than 1 bug that "smells like" #4369, we felt
it was wise to switch to Default from None at least until such time as we
understand #4369 and #3874. LLVM is a complicated beast and using code paths
that are little used can be a dangerous proposition.

This commit doesn't include release notes as we addresses to user facing
instance of #4369 already.
@IgorDeepakM
Copy link

IgorDeepakM commented Oct 15, 2024

This problem is also seen with ARM32 using --debug option, however for me the unwinding gets stuck looping. When not using --debug, then exceptions work.

Another thing that is worth noting that, without adding -funwind-tables when compiling Pony runtime, the exceptions for ARM32 doesn't work at all and it unwinds to the stack end. Clang doesn't add unwind tables by default to C code which is a possible difference to GCC.

@ponylang-main ponylang-main added the discuss during sync Should be discussed during an upcoming sync label Oct 15, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working discuss during sync Should be discussed during an upcoming sync help wanted Extra attention is needed
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants