Segmentation fault caused by likely LLVM bug that results in incorrect exception handling on aarch64 #3874

SeanTAllen · 2021-10-02T20:32:02Z

While getting pony working on 64-bit Arm, we found a bug where the test-stdlib-debug tests would segfault. Everything worked fine with release builds, but debug builds would crash.

After investigation, we've found that in codegen_machine in host.cc if the opt_level is get to either CodeGenOpt::Default or CodeGenOpt::Aggressive instead of CodeGenOpt::None the problem goes away.

It seems likely that this is an LLVM bug, but that hasn't been established. We've committed a fix in host.cc that when the target is arm (which is 64-bit arm) that we use CodeGenOpt::Default. This is far from ideal, but gets us working compilations until we can investigate further.

The end result of this should either be a patch that we upstream to LLVM or possibly, the discovery that we are doing something wrong with our Arm code generation that optimization manages to fix.

The text was updated successfully, but these errors were encountered:

SeanTAllen · 2021-10-05T15:12:35Z

The host.cc change makes no difference when we are building using musl rather than glibc. We still get segfaults in the debug build on Arm-64 when using musl.

SeanTAllen · 2022-03-25T03:01:31Z

The tests that fail seem to be repeatable from run to run. I've been working through the builtin tests commenting out the ones that crash and then another crashes but it is always one that wasn't run previously. I'm building LLVM on a 64 bit raspberry pi 4 to see if it is reproducible there or if we are in for a tricky time getting more info.

The changes are on the branch 3874.

SeanTAllen · 2022-03-25T03:34:07Z

Ok this is reproducible with the same tests failing on Raspberry PI 4 64 bit.

SeanTAllen · 2022-03-25T03:45:23Z

Well this is fascinating...

(gdb) bt
#0  0x00000055559e98b0 in pony_test_TestHelper_val__format_loc_oo (this=0x7fe6d016a0, loc=0x0)
    at /home/pi/code/ponylang/ponyc/packages/pony_test/test_helper.pony:362
#1  0x00000055559e4fc4 in pony_test_TestHelper_val_assert_error_ooob (this=0x7fddce1c40, test=0x555640a108, msg=0x5556410380, loc=0x55564077c0)
    at /home/pi/code/ponylang/ponyc/packages/pony_test/test_helper.pony:95
#2  0x0000005555b49434 in builtin_test__TestArrayInsert_ref_apply_oo (this=0x7fe65001e0, h=0x7fddce1c40)
    at /home/pi/code/ponylang/ponyc/packages/builtin_test/_test.pony:1512
#3  0x00000055557bef9c in pony_test__TestRunner_tag_run_o (this=0x7fe649dc00)
    at /home/pi/code/ponylang/ponyc/packages/pony_test/_test_runner.pony:63
#4  0x0000005555764728 in pony_test__TestRunner_Dispatch ()
#5  0x0000005555c6993c in ponyint_actor_run ()
#6  0x0000005555c73d28 in run_thread ()
#7  0x0000007ff7fa67e4 in start_thread (arg=0x7ffffff16f) at pthread_create.c:486
#8  0x0000007ff7dd6adc in thread_start () at ../sysdeps/unix/sysv/linux/aarch64/clone.S:78

SeanTAllen · 2022-03-25T03:47:46Z

@jemc this is one for you and i to talk about.

ergl · 2022-03-25T06:53:13Z

The backtrace looks identical to what we were getting on the M1. I can't remember the root cause, but maybe @jemc can remember.

Edit: some of these threads might help.

Although on second thought, it seems like the problem with the backtrace above is that the location object is null. I'd look at the IR to check if we're getting any undef references for that.

SeanTAllen · 2022-03-25T12:36:25Z

So it looks like:

#0  0x00000055559e98b0 in pony_test_TestHelper_val__format_loc_oo (this=0x7fe6d016a0, loc=0x0)
    at /home/pi/code/ponylang/ponyc/packages/pony_test/test_helper.pony:362
#1  0x00000055559e4fc4 in pony_test_TestHelper_val_assert_error_ooob (this=0x7fddce1c40, test=0x555640a108, msg=0x5556410380, loc=0x55564077c0)
    at /home/pi/code/ponylang/ponyc/packages/pony_test/test_helper.pony:95

We have the location in frame 1. But its null in frame 2.

  fun assert_error(test: ITest box, msg: String = "", loc: SourceLoc = __loc)
    : Bool
  =>
    """
    Assert that the given test function throws an error when run.
    """
    try
      test()?
      fail(_format_loc(loc) + "Assert error failed. " + msg)
      false
    else
      log(_format_loc(loc) + "Assert error passed. " + msg, true)
      true
    end

Its a straight passthrough. I suspect that the reason that having some optimization causes the crash to go away is that we inline the very small _format_loc method and so whatever our weirdness is for loc being null goes away. But I have no actual evidence of that yet, its purely a guess. Really thought the important question is "how the hell does loc end up null".

Here's _format_loc for edification:

  fun _format_loc(loc: SourceLoc): String =>
    loc.file() + ":" + loc.line().string() + ": "

SeanTAllen · 2022-03-25T12:52:52Z

I have determined that the failure happens with a more minimal example as well:

use "pony_test"
use "collections"

actor \nodoc\ Main is TestList
  new create(env: Env) => PonyTest(env, this)
  new make() => None

  fun tag tests(test: PonyTest) =>
    test(_TestArrayFind)

class \nodoc\ iso _TestArrayFind is UnitTest
  fun name(): String => "builtin/Array.find"

  fun apply(h: TestHelper) =>
    let a = [as ISize: 0; 1; 2; 3; 4; 1]
    h.assert_error({() ? => a.find(6)? })

And changing so that a.find won't error doesn't result in a crash:

use "pony_test"
use "collections"

actor \nodoc\ Main is TestList
  new create(env: Env) => PonyTest(env, this)
  new make() => None

  fun tag tests(test: PonyTest) =>
    test(_TestArrayFind)

class \nodoc\ iso _TestArrayFind is UnitTest
  fun name(): String => "builtin/Array.find"

  fun apply(h: TestHelper) =>
    let a = [as ISize: 0; 1; 2; 3; 4; 1]
    h.assert_error({() ? => a.find(1)? })

SeanTAllen · 2022-03-25T12:58:44Z

Linting the llvm with --lint-llvm results in:

Pessimization: Static alloca outside of entry block
  %22 = alloca i8*, align 8
Pessimization: Static alloca outside of entry block
  %22 = alloca i8*, align 8
Pessimization: Static alloca outside of entry block
  %22 = alloca i8*, align 8
Pessimization: Static alloca outside of entry block
  %22 = alloca i8*, align 8
Pessimization: Static alloca outside of entry block
  %22 = alloca i8*, align 8
Pessimization: Static alloca outside of entry block
  %22 = alloca i8*, align 8
Pessimization: Static alloca outside of entry block
  %22 = alloca i8*, align 8
Pessimization: Static alloca outside of entry block
  %22 = alloca i8*, align 8
Pessimization: Static alloca outside of entry block
  %22 = alloca i8*, align 8

but this happens on x86 where there is no issue as well.

--verify reports no issues.

SeanTAllen · 2022-03-25T13:06:05Z

Here's a more minimal example:

actor Main
  new create(env: Env) =>
    foo("hello")

  fun foo(s: String) =>
    try
      one(s)
      error
    else
      one(s)
    end

  fun one(s: String) =>
    s.clone()

SeanTAllen · 2022-03-25T13:06:50Z

@ergl could you see if this fails on an M1? The key seems to be to use a release runtime but do -d when compiling the program.

ergl · 2022-03-25T13:16:45Z

@SeanTAllen No errors for me. No combination of release/debug runtime and setting -d or leaving it off had the issue for the two examples you posted.

SeanTAllen · 2022-03-25T13:18:05Z

Interesting so this might be aarch64 + linux issue.

jemc · 2022-03-25T15:48:21Z

Note that if it's like the M1 issue, then the problem goes away if you link the program against the runtime bitcode (which I still think should be the default build flow where available).

SeanTAllen · 2022-03-25T17:04:16Z

@ergl can you try on m1 with the host.cc change removed? so that debug is doing no optimization?

jemc · 2022-03-25T17:25:15Z

@SeanTAllen and I think this is related to us needing to write some ARM64-specific code in posix_except.c. We currently have ARM32-specific code, and then catch-all code for everything else.

Here's a PDF that includes some info on the ARM64 Exception Handling ABI: https://documentation-service.arm.com/static/5fa190a7b1a7c5445f2904d6?token=

jemc · 2022-03-25T17:28:29Z

That PDF links to this doc, which seems to have the more relevant information: https://itanium-cxx-abi.github.io/cxx-abi/abi-eh.html

SeanTAllen · 2022-03-25T17:30:47Z

@sylvanc says he is believes that yes, the current usage of catch all code is probably wrong for arm exceptions.

but when i told him that aarch64 docs say that it is the same as itanium. he said itanium is the same as x86 except perhaps some low level register setting.

jemc · 2022-03-25T17:54:15Z

@SeanTAllen - try this diff:

diff --git a/src/libponyc/codegen/codegen.c b/src/libponyc/codegen/codegen.c
index 6a972f88..147435b5 100644
--- a/src/libponyc/codegen/codegen.c
+++ b/src/libponyc/codegen/codegen.c
@@ -265,6 +265,7 @@ static void init_runtime(compile_t* c)
 
   unsigned int align_value = target_is_ilp32(c->opt->triple) ? 4 : 8;
 
+  LLVM_DECLARE_ATTRIBUTEREF(uwtable_attr, uwtable, 0);
   LLVM_DECLARE_ATTRIBUTEREF(nounwind_attr, nounwind, 0);
   LLVM_DECLARE_ATTRIBUTEREF(readnone_attr, readnone, 0);
   LLVM_DECLARE_ATTRIBUTEREF(readonly_attr, readonly, 0);
@@ -606,7 +607,9 @@ static void init_runtime(compile_t* c)
 
   // i32 ponyint_personality_v0(...)
   type = LLVMFunctionType(c->i32, NULL, 0, true);
-  c->personality = LLVMAddFunction(c->module, "ponyint_personality_v0", type);
+  value = LLVMAddFunction(c->module, "ponyint_personality_v0", type);
+  c->personality = value;
+  LLVMAddAttributeAtIndex(value, LLVMAttributeFunctionIndex, uwtable_attr);
 
   // i32 memcmp(i8*, i8*, intptr)
   params[0] = c->void_ptr;

SeanTAllen · 2022-03-25T18:00:44Z

@jemc same boom

SeanTAllen · 2022-03-26T03:39:33Z

Interestingly, this also segfaults because of the message send that originates in the else of the try:

actor Main
  let _out: OutStream

  new create(env: Env) =>
    _out = env.out
    foo("hello")

  fun foo(s: String) =>
    try
      one("h")
      error
    else
      one("e")
    end

  fun one(s: String) =>
    _out.print(s)

but this does not:

actor Main
  let _out: OutStream

  new create(env: Env) =>
    _out = env.out
    foo("hello")

  fun foo(s: String) =>
    try
      one("h")
      error
    else
      one("e")
    end

  fun one(s: String) =>
    s.clone()

SeanTAllen · 2022-03-27T17:22:03Z

I discussed this with @sylvanc. He took a look at the pony_try llvm ir that we use and said it looked correct and the personality function looked correct as well. He suggested looking at a C++ aarch64 personality function to see if it was different in a way we weren't aware of.

I'll be taking a look at the one from the LLVM project:

https://github.com/llvm/llvm-project/blob/main/libcxxabi/src/cxa_personality.cpp

SeanTAllen · 2022-04-02T13:18:00Z

I've found a difference in our personality function from the c++ abi.

When we set registers, we do not do what the C++ does:

  _Unwind_SetGR(context, __builtin_eh_return_data_regno(1),
                static_cast<uintptr_t>(results.ttypeIndex));

It is setting context to the type index found from the lsda scan.

Whereas we do:

  _Unwind_SetGR(context, __builtin_eh_return_data_regno(1), 0);

Additionally there are differences in ordering in the personality functions which might be because we need the scan results for use elsewhere or simply different way of writing the code. The pony code is clearer, but maybe not correct in addition to the register issue.

SeanTAllen · 2022-04-02T16:39:29Z

So we get the ttypeIndex from:

/// Read a sleb128 encoded value and advance pointer
/// See Variable Length Data Appendix C in:
/// @link http://dwarfstd.org/Dwarf4.pdf @unlink
/// @param data reference variable holding memory pointer to decode from
/// @returns decoded value
static
intptr_t
readSLEB128(const uint8_t** data)

plus there's additional logic for deciding it to actually set that in results that I haven't fully worked through.

SeanTAllen · 2022-04-02T19:04:52Z

I changed this to "triggers release" as this is actually incorrect exception handling on aarch64.

Fixes #3874

jemc · 2022-04-05T18:52:30Z

In case anything here is helpful: https://llvm.org/docs/ExceptionHandling.html#exception-handling-support-on-the-target

SeanTAllen · 2022-04-07T18:12:32Z

I talked through this with Sylvan and we came to the conclusion that this is a probably a bug in LLVM like I initially thought. In particular a bug in the aarch64 code that happens to get fixed by some optimization. Because the LLVM ir is the same for x86 debug and aarch64 debug besides the platform things you'd expect to be different and we aren't doing anything special for aarch64 vs x86 so it isn't a bug in what we are doing AND as joe has seen, if you create bitcode and link with the llvm tools, the bug also goes away so... bug somewhere in llvm for aarch64.

SeanTAllen · 2022-10-27T01:19:29Z

I've been thinking about this issue and something new occurs to me.

Why does this only happen with a release/optimized version of the runtime? Why is it fine with a debug runtime + debug pony binary? That seems very odd. I have a feeling that perhaps there's something there that will help us find the issue.

SeanTAllen · 2022-10-27T01:31:29Z

I retested this and it looks like it effects both debug and release versions of the runtime, which was either overlooked when i first opened this issue or is new. Either way, it makes me feel a little better that it isn't only happening with release versions of the runtime.

After extensive investigation of the weirdness that is #4369, Joe and I believe the issue is most likely an LLVM bug. However, we don't have time to pursue further. During the investigation, I came to believe that the change to use CodeGenOpt::Default instead of CodeGenOpt::None for target optimizations would be prudent. It prevents the known instances of bugs like #4369 which is nice but not key to this change. Changes things because they make symptoms go away isn't a good strategy if you don't understand why. We do not "fully" understand the impact of this change on bugs like #4369. However, it appears that CodeGenOpt::None is a code path that sees little use or testing and is more prone to bugs/regression. Given that we have seen more than 1 bug that "smells like" #4369, we felt it was wise to switch to Default from None at least until such time as we understand #4369 and #3874. LLVM is a complicated beast and using code paths that are little used can be a dangerous proposition. This commit doesn't include release notes as we addresses to user facing instance of #4369 already.

IgorDeepakM · 2024-10-15T23:21:43Z

This problem is also seen with ARM32 using --debug option, however for me the unwinding gets stuck looping. When not using --debug, then exceptions work.

Another thing that is worth noting that, without adding -funwind-tables when compiling Pony runtime, the exceptions for ARM32 doesn't work at all and it unwinds to the stack end. Clang doesn't add unwind tables by default to C code which is a possible difference to GCC.

SeanTAllen added bug Something isn't working good first issue Good for newcomers help wanted Extra attention is needed needs investigation This needs to be looked into before its "ready for work" labels Oct 2, 2021

ponylang-main added the discuss during sync Should be discussed during an upcoming sync label Mar 25, 2022

SeanTAllen added this to the Arm support milestone Mar 25, 2022

SeanTAllen changed the title ~~"LLVM bug" impacting debug builds on 64-bit Arm~~ Segmentation fault with debug standard library build with release runtime on Aarch64 Mar 25, 2022

SeanTAllen removed the good first issue Good for newcomers label Mar 26, 2022

SeanTAllen added triggers release Major issue that when fixed, results in an "emergency" release and removed help wanted Extra attention is needed needs investigation This needs to be looked into before its "ready for work" labels Apr 2, 2022

SeanTAllen self-assigned this Apr 2, 2022

SeanTAllen changed the title ~~Segmentation fault with debug standard library build with release runtime on Aarch64~~ Segmentation fault caused by incorrect exception handling on aarch64 Apr 2, 2022

SeanTAllen added a commit that referenced this issue Apr 2, 2022

Fix incorrect exception handling on aarch64

a36cc0e

Fixes #3874

SeanTAllen mentioned this issue Apr 2, 2022

Fix incorrect exception handling on aarch64 #4084

Closed

jemc removed the discuss during sync Should be discussed during an upcoming sync label Apr 5, 2022

ponylang-main added the discuss during sync Should be discussed during an upcoming sync label Apr 5, 2022

SeanTAllen removed the triggers release Major issue that when fixed, results in an "emergency" release label Apr 13, 2022

SeanTAllen changed the title ~~Segmentation fault caused by incorrect exception handling on aarch64~~ Segmentation fault caused by likely LLVM bug that results in incorrect exception handling on aarch64 Apr 26, 2022

SeanTAllen added help wanted Extra attention is needed and removed discuss during sync Should be discussed during an upcoming sync labels Apr 26, 2022

ponylang-main added the discuss during sync Should be discussed during an upcoming sync label Oct 27, 2022

SeanTAllen removed the discuss during sync Should be discussed during an upcoming sync label Nov 1, 2022

SeanTAllen mentioned this issue Aug 4, 2023

Segmentation fault when assigning Timer #4369

Open

SeanTAllen mentioned this issue Sep 12, 2023

Address issue 4369 "as best we can" #4441

Merged

ponylang-main added the discuss during sync Should be discussed during an upcoming sync label Oct 15, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Segmentation fault caused by likely LLVM bug that results in incorrect exception handling on aarch64 #3874

Segmentation fault caused by likely LLVM bug that results in incorrect exception handling on aarch64 #3874

SeanTAllen commented Oct 2, 2021 •

edited

Loading

SeanTAllen commented Oct 5, 2021

SeanTAllen commented Mar 25, 2022 •

edited

Loading

SeanTAllen commented Mar 25, 2022

SeanTAllen commented Mar 25, 2022

SeanTAllen commented Mar 25, 2022

ergl commented Mar 25, 2022 •

edited

Loading

SeanTAllen commented Mar 25, 2022

SeanTAllen commented Mar 25, 2022 •

edited

Loading

SeanTAllen commented Mar 25, 2022

SeanTAllen commented Mar 25, 2022

SeanTAllen commented Mar 25, 2022

ergl commented Mar 25, 2022

SeanTAllen commented Mar 25, 2022

jemc commented Mar 25, 2022

SeanTAllen commented Mar 25, 2022

jemc commented Mar 25, 2022

jemc commented Mar 25, 2022

SeanTAllen commented Mar 25, 2022 •

edited

Loading

jemc commented Mar 25, 2022

SeanTAllen commented Mar 25, 2022

SeanTAllen commented Mar 26, 2022 •

edited

Loading

SeanTAllen commented Mar 27, 2022 •

edited

Loading

SeanTAllen commented Apr 2, 2022 •

edited

Loading

SeanTAllen commented Apr 2, 2022

SeanTAllen commented Apr 2, 2022

jemc commented Apr 5, 2022

SeanTAllen commented Apr 7, 2022

SeanTAllen commented Oct 27, 2022

SeanTAllen commented Oct 27, 2022

IgorDeepakM commented Oct 15, 2024 •

edited

Loading

Segmentation fault caused by likely LLVM bug that results in incorrect exception handling on aarch64 #3874

Segmentation fault caused by likely LLVM bug that results in incorrect exception handling on aarch64 #3874

Comments

SeanTAllen commented Oct 2, 2021 • edited Loading

SeanTAllen commented Oct 5, 2021

SeanTAllen commented Mar 25, 2022 • edited Loading

SeanTAllen commented Mar 25, 2022

SeanTAllen commented Mar 25, 2022

SeanTAllen commented Mar 25, 2022

ergl commented Mar 25, 2022 • edited Loading

SeanTAllen commented Mar 25, 2022

SeanTAllen commented Mar 25, 2022 • edited Loading

SeanTAllen commented Mar 25, 2022

SeanTAllen commented Mar 25, 2022

SeanTAllen commented Mar 25, 2022

ergl commented Mar 25, 2022

SeanTAllen commented Mar 25, 2022

jemc commented Mar 25, 2022

SeanTAllen commented Mar 25, 2022

jemc commented Mar 25, 2022

jemc commented Mar 25, 2022

SeanTAllen commented Mar 25, 2022 • edited Loading

jemc commented Mar 25, 2022

SeanTAllen commented Mar 25, 2022

SeanTAllen commented Mar 26, 2022 • edited Loading

SeanTAllen commented Mar 27, 2022 • edited Loading

SeanTAllen commented Apr 2, 2022 • edited Loading

SeanTAllen commented Apr 2, 2022

SeanTAllen commented Apr 2, 2022

jemc commented Apr 5, 2022

SeanTAllen commented Apr 7, 2022

SeanTAllen commented Oct 27, 2022

SeanTAllen commented Oct 27, 2022

IgorDeepakM commented Oct 15, 2024 • edited Loading

SeanTAllen commented Oct 2, 2021 •

edited

Loading

SeanTAllen commented Mar 25, 2022 •

edited

Loading

ergl commented Mar 25, 2022 •

edited

Loading

SeanTAllen commented Mar 25, 2022 •

edited

Loading

SeanTAllen commented Mar 25, 2022 •

edited

Loading

SeanTAllen commented Mar 26, 2022 •

edited

Loading

SeanTAllen commented Mar 27, 2022 •

edited

Loading

SeanTAllen commented Apr 2, 2022 •

edited

Loading

IgorDeepakM commented Oct 15, 2024 •

edited

Loading