-
-
Notifications
You must be signed in to change notification settings - Fork 412
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Segmentation fault caused by likely LLVM bug that results in incorrect exception handling on aarch64 #3874
Comments
The host.cc change makes no difference when we are building using musl rather than glibc. We still get segfaults in the debug build on Arm-64 when using musl. |
The tests that fail seem to be repeatable from run to run. I've been working through the builtin tests commenting out the ones that crash and then another crashes but it is always one that wasn't run previously. I'm building LLVM on a 64 bit raspberry pi 4 to see if it is reproducible there or if we are in for a tricky time getting more info. The changes are on the branch |
Ok this is reproducible with the same tests failing on Raspberry PI 4 64 bit. |
Well this is fascinating...
|
@jemc this is one for you and i to talk about. |
The backtrace looks identical to what we were getting on the M1. I can't remember the root cause, but maybe @jemc can remember. Edit: some of these threads might help.
Although on second thought, it seems like the problem with the backtrace above is that the location object is null. I'd look at the IR to check if we're getting any |
So it looks like:
We have the location in frame 1. But its null in frame 2. fun assert_error(test: ITest box, msg: String = "", loc: SourceLoc = __loc)
: Bool
=>
"""
Assert that the given test function throws an error when run.
"""
try
test()?
fail(_format_loc(loc) + "Assert error failed. " + msg)
false
else
log(_format_loc(loc) + "Assert error passed. " + msg, true)
true
end Its a straight passthrough. I suspect that the reason that having some optimization causes the crash to go away is that we inline the very small Here's fun _format_loc(loc: SourceLoc): String =>
loc.file() + ":" + loc.line().string() + ": " |
I have determined that the failure happens with a more minimal example as well: use "pony_test"
use "collections"
actor \nodoc\ Main is TestList
new create(env: Env) => PonyTest(env, this)
new make() => None
fun tag tests(test: PonyTest) =>
test(_TestArrayFind)
class \nodoc\ iso _TestArrayFind is UnitTest
fun name(): String => "builtin/Array.find"
fun apply(h: TestHelper) =>
let a = [as ISize: 0; 1; 2; 3; 4; 1]
h.assert_error({() ? => a.find(6)? }) And changing so that use "pony_test"
use "collections"
actor \nodoc\ Main is TestList
new create(env: Env) => PonyTest(env, this)
new make() => None
fun tag tests(test: PonyTest) =>
test(_TestArrayFind)
class \nodoc\ iso _TestArrayFind is UnitTest
fun name(): String => "builtin/Array.find"
fun apply(h: TestHelper) =>
let a = [as ISize: 0; 1; 2; 3; 4; 1]
h.assert_error({() ? => a.find(1)? }) |
Linting the llvm with
but this happens on x86 where there is no issue as well.
|
Here's a more minimal example: actor Main
new create(env: Env) =>
foo("hello")
fun foo(s: String) =>
try
one(s)
error
else
one(s)
end
fun one(s: String) =>
s.clone() |
@ergl could you see if this fails on an M1? The key seems to be to use a release runtime but do |
@SeanTAllen No errors for me. No combination of release/debug runtime and setting |
Interesting so this might be aarch64 + linux issue. |
Note that if it's like the M1 issue, then the problem goes away if you link the program against the runtime bitcode (which I still think should be the default build flow where available). |
@ergl can you try on m1 with the host.cc change removed? so that debug is doing no optimization? |
@SeanTAllen and I think this is related to us needing to write some ARM64-specific code in Here's a PDF that includes some info on the ARM64 Exception Handling ABI: https://documentation-service.arm.com/static/5fa190a7b1a7c5445f2904d6?token= |
That PDF links to this doc, which seems to have the more relevant information: https://itanium-cxx-abi.github.io/cxx-abi/abi-eh.html |
@sylvanc says he is believes that yes, the current usage of catch all code is probably wrong for arm exceptions. but when i told him that aarch64 docs say that it is the same as itanium. he said itanium is the same as x86 except perhaps some low level register setting. |
@SeanTAllen - try this diff: diff --git a/src/libponyc/codegen/codegen.c b/src/libponyc/codegen/codegen.c
index 6a972f88..147435b5 100644
--- a/src/libponyc/codegen/codegen.c
+++ b/src/libponyc/codegen/codegen.c
@@ -265,6 +265,7 @@ static void init_runtime(compile_t* c)
unsigned int align_value = target_is_ilp32(c->opt->triple) ? 4 : 8;
+ LLVM_DECLARE_ATTRIBUTEREF(uwtable_attr, uwtable, 0);
LLVM_DECLARE_ATTRIBUTEREF(nounwind_attr, nounwind, 0);
LLVM_DECLARE_ATTRIBUTEREF(readnone_attr, readnone, 0);
LLVM_DECLARE_ATTRIBUTEREF(readonly_attr, readonly, 0);
@@ -606,7 +607,9 @@ static void init_runtime(compile_t* c)
// i32 ponyint_personality_v0(...)
type = LLVMFunctionType(c->i32, NULL, 0, true);
- c->personality = LLVMAddFunction(c->module, "ponyint_personality_v0", type);
+ value = LLVMAddFunction(c->module, "ponyint_personality_v0", type);
+ c->personality = value;
+ LLVMAddAttributeAtIndex(value, LLVMAttributeFunctionIndex, uwtable_attr);
// i32 memcmp(i8*, i8*, intptr)
params[0] = c->void_ptr; |
@jemc same boom |
Interestingly, this also segfaults because of the message send that originates in the actor Main
let _out: OutStream
new create(env: Env) =>
_out = env.out
foo("hello")
fun foo(s: String) =>
try
one("h")
error
else
one("e")
end
fun one(s: String) =>
_out.print(s) but this does not: actor Main
let _out: OutStream
new create(env: Env) =>
_out = env.out
foo("hello")
fun foo(s: String) =>
try
one("h")
error
else
one("e")
end
fun one(s: String) =>
s.clone() |
I discussed this with @sylvanc. He took a look at the pony_try llvm ir that we use and said it looked correct and the personality function looked correct as well. He suggested looking at a C++ aarch64 personality function to see if it was different in a way we weren't aware of. I'll be taking a look at the one from the LLVM project: https://github.com/llvm/llvm-project/blob/main/libcxxabi/src/cxa_personality.cpp |
I've found a difference in our personality function from the c++ abi. When we set registers, we do not do what the C++ does: _Unwind_SetGR(context, __builtin_eh_return_data_regno(1),
static_cast<uintptr_t>(results.ttypeIndex)); It is setting context to the type index found from the lsda scan. Whereas we do: _Unwind_SetGR(context, __builtin_eh_return_data_regno(1), 0); Additionally there are differences in ordering in the personality functions which might be because we need the scan results for use elsewhere or simply different way of writing the code. The pony code is clearer, but maybe not correct in addition to the register issue. |
So we get the ttypeIndex from:
plus there's additional logic for deciding it to actually set that in results that I haven't fully worked through. |
I changed this to "triggers release" as this is actually incorrect exception handling on aarch64. |
In case anything here is helpful: https://llvm.org/docs/ExceptionHandling.html#exception-handling-support-on-the-target |
I talked through this with Sylvan and we came to the conclusion that this is a probably a bug in LLVM like I initially thought. In particular a bug in the aarch64 code that happens to get fixed by some optimization. Because the LLVM ir is the same for x86 debug and aarch64 debug besides the platform things you'd expect to be different and we aren't doing anything special for aarch64 vs x86 so it isn't a bug in what we are doing AND as joe has seen, if you create bitcode and link with the llvm tools, the bug also goes away so... bug somewhere in llvm for aarch64. |
I've been thinking about this issue and something new occurs to me. Why does this only happen with a release/optimized version of the runtime? Why is it fine with a debug runtime + debug pony binary? That seems very odd. I have a feeling that perhaps there's something there that will help us find the issue. |
I retested this and it looks like it effects both debug and release versions of the runtime, which was either overlooked when i first opened this issue or is new. Either way, it makes me feel a little better that it isn't only happening with release versions of the runtime. |
After extensive investigation of the weirdness that is #4369, Joe and I believe the issue is most likely an LLVM bug. However, we don't have time to pursue further. During the investigation, I came to believe that the change to use CodeGenOpt::Default instead of CodeGenOpt::None for target optimizations would be prudent. It prevents the known instances of bugs like #4369 which is nice but not key to this change. Changes things because they make symptoms go away isn't a good strategy if you don't understand why. We do not "fully" understand the impact of this change on bugs like #4369. However, it appears that CodeGenOpt::None is a code path that sees little use or testing and is more prone to bugs/regression. Given that we have seen more than 1 bug that "smells like" #4369, we felt it was wise to switch to Default from None at least until such time as we understand #4369 and #3874. LLVM is a complicated beast and using code paths that are little used can be a dangerous proposition. This commit doesn't include release notes as we addresses to user facing instance of #4369 already.
After extensive investigation of the weirdness that is #4369, Joe and I believe the issue is most likely an LLVM bug. However, we don't have time to pursue further. During the investigation, I came to believe that the change to use CodeGenOpt::Default instead of CodeGenOpt::None for target optimizations would be prudent. It prevents the known instances of bugs like #4369 which is nice but not key to this change. Changes things because they make symptoms go away isn't a good strategy if you don't understand why. We do not "fully" understand the impact of this change on bugs like #4369. However, it appears that CodeGenOpt::None is a code path that sees little use or testing and is more prone to bugs/regression. Given that we have seen more than 1 bug that "smells like" #4369, we felt it was wise to switch to Default from None at least until such time as we understand #4369 and #3874. LLVM is a complicated beast and using code paths that are little used can be a dangerous proposition. This commit doesn't include release notes as we addresses to user facing instance of #4369 already.
This problem is also seen with ARM32 using --debug option, however for me the unwinding gets stuck looping. When not using --debug, then exceptions work. Another thing that is worth noting that, without adding -funwind-tables when compiling Pony runtime, the exceptions for ARM32 doesn't work at all and it unwinds to the stack end. Clang doesn't add unwind tables by default to C code which is a possible difference to GCC. |
While getting pony working on 64-bit Arm, we found a bug where the
test-stdlib-debug
tests would segfault. Everything worked fine withrelease
builds, butdebug
builds would crash.After investigation, we've found that in
codegen_machine
inhost.cc
if theopt_level
is get to eitherCodeGenOpt::Default
orCodeGenOpt::Aggressive
instead ofCodeGenOpt::None
the problem goes away.It seems likely that this is an LLVM bug, but that hasn't been established. We've committed a fix in
host.cc
that when the target isarm
(which is 64-bit arm) that we useCodeGenOpt::Default
. This is far from ideal, but gets us working compilations until we can investigate further.The end result of this should either be a patch that we upstream to LLVM or possibly, the discovery that we are doing something wrong with our Arm code generation that optimization manages to fix.
The text was updated successfully, but these errors were encountered: