What does "Maintenance UE" mean #214
Comments
These are address uncorrectable errors (AUEs) and uncorrectable errors (UEs) detected on a maintenance read of memory. I believe both will call out the rank of memory on which the error was found, which would probably end up as a DIMM replacement. The AUE also calls out the port at a lower priority. The logs should specify which hardware was called out. |
If this is on the initial bring-up, you could check whether the DIMMs are seated properly in their slots. Otherwise, replacement may be necessary. |
When I reduce the memory frequency to 2666 MHz, the error is no longer reported. I think this may be a signal integrity problem.
Excuse me?
|
@esteban012 might be able to help. There are many, many knobs to turn on both the P10 MC and the Explorer logic, but I personally don't know what they are. |
All current workarounds are included in the latest firmware available on the OpenPOWER GitHub. If you think the error is caused by a signal integrity issue, I would strongly recommend running shmoo tests to verify that the margins are sufficient. |
Thanks for your reply. Our shmoo test on Rainier is currently being trialed. There are two more questions to confirm:
|
|
We found that the problem comes from the value of the following register, which PRDF points to: EXPLR_RDF_FIR = 0x8011C00 (its value is 0x000000809e0000000). I understand that this is the RDF FIR register of the Explorer. Is there any more information available? Could you explain what "RDF" means?
|
I'm seeing that the value sent above is 68 bits of data: 0x000000809e0000000. Which of the following is correct:
|
This is the correct value: 0x00000809e0000000 |
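The 68-bit observation above comes from counting hex digits: each digit carries 4 bits, so 17 digits cannot fit a 64-bit register. A minimal Python sketch of that check (the helper name `hex_bit_width` is made up for illustration and is not part of hostboot):

```python
def hex_bit_width(literal: str) -> int:
    """Return the number of bits spelled out by a hex literal, leading zeros included."""
    digits = literal.lower()
    if digits.startswith("0x"):
        digits = digits[2:]
    # Validate that every remaining character is a hex digit.
    int(digits, 16)
    return 4 * len(digits)


reported = "0x000000809e0000000"   # value as first posted (17 hex digits)
corrected = "0x00000809e0000000"   # value confirmed afterwards (16 hex digits)

print(hex_bit_width(reported), hex_bit_width(corrected))
```

Running the same check on any register dump before decoding it helps catch a dropped or duplicated zero early.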
The RDF_FIR register is reporting errors found during the maintenance commands: a mark place error on rank 0, and a special uncorrectable error on a maintenance read. My recommendation is to run MCBIST and shmoo to see if a simpler test case fails to do memory writes and reads. |
This question comes from @Grubby0624: |
@liuxiwei1013 is that shmoo result from the same Explorer that is reporting the CEs? How fast are you running the interface?
|
You are using industry-standard DIMMs, correct? Does the MCE get reported when it crashes, or is it already there during the long run you describe? I don't see how the MCE could cause HTX to stop. I am still looking to get the margins expected from the shmoo. The eye seems small, but I want to make sure. What did you do to fix the address bus signal issue? Could that still be borderline? Thank you @Grubby0624 |
|
I tried adjusting the values of the following parameters: atxImpedance/atxSlewRate/cktxImpedance, and found that the best ATX-0 eye width was 254 ps. Afterwards, we made many more modifications to the SI parameters, but the results did not improve. May I ask:
Thank you! |
hostboot/src/usr/diag/prdf/common/plat/mem/prdfMemExtraSig.H
Line 43 in 9c7a138
During bring-up on the Rainier platform, the following errors were reported in istep14.1:
"Maintenance AUE" and "Maintenance UE"
Excuse me?
What does each of the two errors mean? Is there a suggested debug direction?