Targeting process aborts on Talos II systems #173
0505 is a task crash; the error log should contain the printk output if you can't get it directly from memory (not sure what debug environment you have available). Just to confirm, you are using HBB/HBI/HBD from the same build, right? There are issues if you try to mix any of these.
What exactly does this comment mean?
It means that the fault originated in late 2018 and has persisted into the current master. After bisecting, 63a9aa5 is the culprit. Working on a fix.
@dcrowell77 4288e39 was a temporary breakage that interfered with bisect; the actual culprit was 63a9aa5. As far as I can tell there is still no open errl utility (#174), so the stack trace in HBEL is relatively useless to us.
The stack trace is manually parseable from hbicore.list.bz2. Since you are crashing, the backtrace is in the printk (as ASCII text in the error log), so you should be able to get all the numbers. Even with an esel parser, you won't get more than that for this type of error. I'm very confused by how 63a9aa5 could be causing the issues you are seeing unless you are trying to apply different levels of HB code on top of an older HBD partition. That change definitely creates a dependency between the contents of HBD and the HBB/HBI code. Be sure to do a machine-xml-rebuild while you are narrowing things down.
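A minimal sketch of the manual lookup described above, assuming the listing file uses objdump-style `<address> <symbol>:` section headers (the exact hbicore.list layout may differ; the file and symbol names here are illustrative):

```python
#!/usr/bin/env python3
"""Map raw backtrace addresses to symbol+offset using a listing file.

Assumes objdump-style section headers of the form
"0000000000001234 <symbol_name>:"; the real hbicore.list layout
may differ slightly.
"""
import bz2
import re
import sys

SYM_RE = re.compile(r'^([0-9a-fA-F]{8,16}) <([^>]+)>:')

def load_symbols(listing_path):
    """Return a sorted list of (address, symbol) pairs from the listing."""
    opener = bz2.open if listing_path.endswith('.bz2') else open
    syms = []
    with opener(listing_path, 'rt', errors='replace') as f:
        for line in f:
            m = SYM_RE.match(line)
            if m:
                syms.append((int(m.group(1), 16), m.group(2)))
    syms.sort()
    return syms

def resolve(addr, syms):
    """Find the symbol containing addr (nearest symbol at or below it)."""
    best = None
    for base, name in syms:
        if base <= addr:
            best = (base, name)
        else:
            break
    if best is None:
        return '<unknown>'
    return '%s+0x%x' % (best[1], addr - best[0])

if __name__ == '__main__' and len(sys.argv) > 2:
    symbols = load_symbols(sys.argv[1])
    for arg in sys.argv[2:]:
        print('0x%x -> %s' % (int(arg, 16), resolve(int(arg, 16), symbols)))
```

Invoked as e.g. `./resolve.py hbicore.list.bz2 0x4002abcd ...`, it prints each raw backtrace address as `symbol+offset`, which can then be matched back to source lines in the listing by hand.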
@dcrowell77 Yes, I'm doing a full build inside op-build each time. I've also spent a bit of time earlier today piecing together a hack-around for not having errl, which I've started some initial documentation on at https://wiki.raptorcs.com/wiki/Hostboot_Debug_Howto. The failure in master is in:
I'm trying to get more detailed information, but it does seem to be the same crash that 63a9aa5 introduces. It could very well be a misbuilt table, e.g. if something is messing up in the Python xmltohb code, but I'm not sure yet.
While the patch in #175 allows the system to not immediately crash during early IPL, we still run into a crash later on that looks to be attribute related again:
Has hostboot been fully built and tested on a Romulus in the past ~5 months or so?
Yes, we build and run on a Romulus more or less every day.
Having that trace right before the crash makes me suspect we may have a null-target situation. Re: your debug wiki - a couple comments:
@dcrowell77 Thanks for the info on the BMC and the hbicore.list.bz2 file! Before we duplicate effort, is there an existing tool that matches addresses to line numbers without errl, or should we go ahead and write one? With regard to this particular crash, here's the final analysis (we now have a Talos II booting on the latest hostboot):
We'll need to test upgrading a variety of machines from the older SBE version on this side, as there is significant concern that machines in the wild may not be upgradeable without some way of forcibly writing a new SBE to the processor modules.
https://github.com/open-power/hostboot/blob/master/src/build/debug/Hostboot/Ps.pm is what we use to decode the stack traces for all of our tasks if we can get at all of memory. That does lookups in the listing file and decodes it to source lines. We've never had time to adapt the logic to run against the error log contents from printk. https://github.com/open-power/hostboot/blob/master/src/usr/errl/backtrace.C has similar logic that is used when decoding a backtrace from an error log. All the pieces should be there for you between those.
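Absent a printk-aware esel parser, a strings-style pass can at least recover the embedded ASCII backtrace text from a raw error log dump. A rough sketch, under the assumption that the dump is an opaque binary blob with the printk text embedded as plain ASCII (the file name is illustrative):

```python
#!/usr/bin/env python3
"""Recover printable ASCII runs (e.g. embedded printk text) from a
binary error log dump, similar in spirit to the strings(1) utility."""
import sys

# Bytes treated as printable: space through '~', plus tab. Newlines
# terminate a run so each recovered line of text is yielded separately.
PRINTABLE = set(range(0x20, 0x7f)) | {0x09}

def ascii_runs(blob, min_len=8):
    """Yield runs of printable ASCII at least min_len characters long."""
    run = bytearray()
    for b in blob:
        if b in PRINTABLE:
            run.append(b)
        else:
            if len(run) >= min_len:
                yield run.decode('ascii')
            run.clear()
    if len(run) >= min_len:
        yield run.decode('ascii')

if __name__ == '__main__' and len(sys.argv) > 1:
    with open(sys.argv[1], 'rb') as f:
        for s in ascii_runs(f.read()):
            print(s)
```

Run as e.g. `./esel_strings.py errorlog.bin`; the recovered backtrace addresses can then be fed through a listing-file lookup by hand.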
This is something we try to avoid, but we've definitely had issues before if something slips through. We do realize that if you can't get into Hostboot to do the update, you are quite stuck... I'm not quite seeing how the PEER_TARGET piece fits in yet, though. Do you know what level the SBE was in the fail case?
@dcrowell77 Previous SBE would have been open-power/sbe@3c633a6. I honestly don't know why the hostboot changes would have caused the SBE / HBBL to halt, but as long as we don't see it when the new production firmware files are ready for release (and can therefore update the SBE to a more modern version in situ) it's probably not worth tracking down. The old SBE in question has very limited debugging capability compared to the new one.
@rajadas2 - Please help take a look at the SBE differences.
@dcrowell77 I was just informed today that at least some of the CPUs we have in inventory have an SBE version preloaded from the factory that is failing to IPL with our latest firmware. This is a serious problem; we have no way to feasibly mass-update the SBE version on the CPUs in inventory. Was there any progress on finding the fault?
EDIT: If it helps I can give date codes for the working & non-working CPUs; I assume there is some way to associate that with the exact factory SBE revision?
EDIT2: We just found out that the date code on the box has no relation to the SBE version inside the box. Two boxes with the same datecode had wildly varying SBE versions inside, so it looks like this would have to be traced via processor serial number.
Can you post the console output from a failing boot? That should give us the SBE and Hostboot code levels that aren't working well together.
@dcrowell77 The only output is the hostboot initialization with version, then "Error created : PLID=90000002". The PLID increments by two on each reboot, but my understanding is that this just reflects the error log count being incremented. The SBE versions that cause the problem are not configured from the factory with LPC debug enabled; I suspect they predate SBE console support entirely.
I'd like to see all the trace output up to the fail. There might be something hiding in there that triggers a thought. Can you save it off to a file and attach it?
On Talos II systems (similar to Romulus), the targeting process aborts during IPL:
This regression appears to date from late 2018.