What does "Maintenance UE" mean #214

Grubby0624 · 2022-12-21T11:29:29Z

hostboot/src/usr/diag/prdf/common/plat/mem/prdfMemExtraSig.H

Line 43 in 9c7a138

PRDR_ERROR_SIGNATURE(MaintUE, 0xffff0010, "", "Maintenance UE");

During bring-up on the Rainier platform, the following errors were reported in istep 14.1:
"Maintenance AUE" and "Maintenance UE"
What do these two errors mean? Is there a suggested debug direction?

mabaiocchi · 2022-12-21T16:44:23Z

I'm going to yield to @zane131 or @cnpalmer to better answer this question.

cnpalmer · 2022-12-21T17:16:40Z

These are address uncorrectable errors (AUEs) and uncorrectable errors (UEs) detected on a maintenance read of memory. I believe both will call out the rank of memory on which the error was found, which would probably end up as a DIMM replacement. The AUE also calls out the port at lower priority. The logs should specify what hardware was called out.

zane131 · 2022-12-21T20:45:45Z

If this is on the initial bring up, you could try checking if the memory DIMMs are seated properly in their slots. Otherwise, replacement may be necessary.

Grubby0624 · 2022-12-28T00:32:45Z

When I reduce the memory frequency to 2666MHz, the error is no longer reported. I think this may be a signal integrity problem.

Grubby0624 · 2022-12-28T11:48:56Z

Are there any suggestions for testing and improving the signal integrity of this DIMM?
I recall that on P9, signal integrity could be improved by directly modifying registers when debugging LRDIMMs. Is there a corresponding method on P10? The P9 code I am referring to:

hostboot/src/import/chips/p9/procedures/hwp/memory/lib/phy/mss_lrdimm_training_helper.C

Line 802 in 859507a

fapi2::ReturnCode timing_workaround_helper( const fapi2::Target<fapi2::TARGET_TYPE_MCA>& i_target,

dcrowell77 · 2022-12-30T06:02:32Z

@esteban012 might be able to help. There are many many knobs to turn on both the P10 MC and also the Explorer logic but I definitely don't know what they are personally.

sglancy6 · 2023-01-03T14:01:27Z

All current workarounds are included in the latest firmware available on the OpenPOWER GitHub. If you think that the error is caused by a signal integrity issue, then I would strongly recommend running shmoo tests to verify that the margins are sufficient.

Grubby0624 · 2023-01-05T09:55:18Z

Thanks for your reply. We are currently trying out the shmoo test on Rainier. There are two more questions I would like to confirm:

The error is reported at istep 14.1 at 3200MHz. DDR training ran before that, and every lane in the Explorer output we print shows PASSING. Is it still possible that the problem is signal integrity?
What exactly does MCBIST do? Is it a BIST of the MC or a BIST of the DRAM? Is it the same thing as the mbist in DDR?

sglancy6 · 2023-01-05T14:09:59Z

It's possible. Training runs on a limited subset of the addresses.
MCBIST is run on the memory controller and runs traffic over to the DRAM. It is not the same as mbist on the DDR.

Grubby0624 · 2023-01-19T06:45:26Z

We found that the problem is related to the value of the following register, which PRDF points to:

EXPLR_RDF_FIR=0x8011C00 (its value is 0x000000809e0000000). I understand that this is the RDF FIR register of the Explorer. Is there any more information available? Could you explain what "RDF" is?
In addition, I have a limited number of DDIMMs, so I would like to try to repair this DDIMM through firmware configuration. With that in mind, could you advise on the questions below? Thank you!

Is there any possibility of a firmware repair for this error?
If so, could you give me some suggestions for debugging?

sglancy6 · 2023-01-19T15:24:47Z

I'm seeing that the value sent above is 68 bits of data: 0x000000809e0000000

Which of the following is correct:

0x00000809e0000000
0x000000809e000000
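The 68-bit observation follows from counting hex digits: each digit carries 4 bits, and the pasted value has 17 digits. A minimal Python sketch of that check is below. The helper name `hex_bit_width` is made up for illustration. It also shows that the two candidates are the pasted value with one zero dropped from the front or from the back.

```python
def hex_bit_width(literal: str) -> int:
    """Bits spanned by a hex literal, counting its leading zeros (4 bits per digit)."""
    digits = literal[2:] if literal.lower().startswith("0x") else literal
    return 4 * len(digits)


pasted = "0x000000809e0000000"          # value as pasted in the thread
digits = pasted[2:]

drop_leading = "0x" + digits[1:]        # one leading zero removed
drop_trailing = "0x" + digits[:-1]      # one trailing zero removed

print(hex_bit_width(pasted))            # 68
print(drop_leading, hex_bit_width(drop_leading))
print(drop_trailing, hex_bit_width(drop_trailing))
```

The two candidates differ by a factor of 16, so every error bit would shift by four positions depending on which one is right.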

Grubby0624 · 2023-01-28T00:31:21Z

This is the correct value: 0x00000809e0000000

sglancy6 · 2023-01-30T14:24:31Z

The RDF_FIR register is reporting errors found during the maintenance commands:

mark place error on rank 0
new correctable error detected on a maintenance read
chip mark corrected error on a maintenance read

special uncorrectable error on a maintenance read
address uncorrectable error on a maintenance read
uncorrectable error on a maintenance read

My recommendation is to run MCBIST and shmoo to see whether a simpler test case of memory writes and reads also fails.
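As a cross-check, the confirmed value 0x00000809e0000000 has exactly six bits set, the same number as the conditions listed above. The sketch below lists the set bit positions. It assumes the register is 64 bits wide and uses IBM-style numbering, where bit 0 is the most significant bit. The thread does not give the bit-to-name mapping, so take that from the Explorer register definitions rather than from this snippet; `set_bits_msb0` is an illustrative helper, not a hostboot function.

```python
from typing import List


def set_bits_msb0(value: int, width: int = 64) -> List[int]:
    """Return the set bit positions, numbering bit 0 as the most significant bit."""
    return [pos for pos in range(width) if (value >> (width - 1 - pos)) & 1]


rdf_fir = 0x00000809E0000000            # value confirmed in the thread
bits = set_bits_msb0(rdf_fir)
print(bits)                             # [20, 28, 31, 32, 33, 34]
print(len(bits))                        # 6
```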

liuxiwei1013 · 2024-01-03T07:55:53Z

This question comes from @Grubby0624:
I think I have resolved the "Maintenance UE" error, which was caused by the address signal quality not meeting the standard. After adjusting the value of "ATxDly-A/B", the system can boot, but after running HTX for a long time it still reports an MCE error. I ran the shmoo test on ATX0 and found that its margins are smaller than those of the other DDIMMs:
2024-01-03T09:32:46.913845697, [TRC]: Minimum margins: explorer: k0: n0: s0: p02 Minimum_Setup_PS: 107, Minimum_Hold_PS: 49, Minimum_Eye_PS: 156
I would like to ask for more advice:
1) What is the pass criterion for the ATX0 test? What are the minimum setup/hold times?
2) I think the eye of this memory is not up to standard. If I want to increase this value, which parameter should I adjust?
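In the log line above, the reported eye equals setup plus hold (107 + 49 = 156 ps), and hold is the tighter side. Below is a small Python sketch for pulling those three numbers out of such a line when comparing several DDIMMs. The regular expression is written against this one line only, and it tolerates optional spaces after underscores in case the line was pasted with stray whitespace. Whether the tool always defines the eye as setup plus hold is an assumption.

```python
import re

LINE = (
    "2024-01-03T09:32:46.913845697, [TRC]: Minimum margins: "
    "explorer: k0: n0: s0: p02 Minimum_Setup_PS: 107, "
    "Minimum_Hold_PS: 49, Minimum_Eye_PS: 156"
)

# Matches e.g. "Minimum_Setup_PS: 107" and also "Minimum_ Setup_ PS: 107".
FIELD = re.compile(r"Minimum_\s*(Setup|Hold|Eye)_\s*PS:\s*(\d+)")


def parse_margins(line):
    """Return a dict such as {'setup': 107, 'hold': 49, 'eye': 156} (values in ps)."""
    return {name.lower(): int(ps) for name, ps in FIELD.findall(line)}


margins = parse_margins(LINE)
print(margins)
print(margins["setup"] + margins["hold"] == margins["eye"])
```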

ecorderoibm · 2024-01-09T13:05:01Z

@liuxiwei1013 is that shmoo result from the same Explorer that is reporting the CEs? How fast are you running the interface?

Grubby0624 · 2024-01-10T00:34:07Z

The system can boot to the OS with only this one DDIMM installed, but after HTX has run for a long time an MCE is reported and HTX terminates. The system itself does not crash.
"How fast are you running the interface?" Are you referring to the DDR rate? It is 3200MHz.

ecorderoibm · 2024-01-10T03:26:34Z

You are using industry-standard DIMMs, correct? Does the MCE get reported when it crashes, or is it already there during the long run you describe? I don't see how the MCE could cause HTX to stop. I am still looking to get the margins expected from the shmoo. The eye seems small, but I want to make sure. What did you do to fix the address bus signal issue? Could that still be borderline? Thank you @Grubby0624

Grubby0624 · 2024-01-11T09:33:14Z

The MCE is reported when it crashes. As I understand it, if an MCE occurs while running HTX mdt.mem, HTX should stop.
I modified the parameter values sent to the OCMB in the PHY_INIT command because shmoo testing showed that the eye width of Address/Cmd is relatively small. I modified the following parameters:
AtxImpedance, atxSlewRate, ATxDly_A/B [0-7]
I'm not sure whether the setting is still borderline or whether something else is now causing the MCE errors.

Grubby0624 · 2024-02-01T05:59:32Z

I tried adjusting the following parameters: atxImpedance/atxSlewRate/cktxImpedance, and the best ATX-0 eye width I found was 254ps. After that, we made many more modifications to the SI parameters, but the results did not improve. May I ask:

What exactly does this ATX test do? Which signal is it based on, and what does the sampled result represent?
What could cause the small ATX eye, and what methods are there to improve it?

Thank you!
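To put the 156 ps and 254 ps eye widths quoted in this thread on a common scale, the sketch below expresses them as a fraction of one clock period. It rests on assumptions the thread does not confirm: that "3200MHz" means a DDR data rate of 3200 MT/s (two transfers per clock, so a 1600 MHz clock), and that address/command signals are sampled once per clock. It says nothing about what margin counts as passing.

```python
def clock_period_ps(data_rate_mtps: float) -> float:
    """Clock period in picoseconds, assuming two data transfers per clock."""
    clock_mhz = data_rate_mtps / 2.0
    return 1e6 / clock_mhz              # 1 / MHz = microseconds; 1e6 converts to ps


def eye_fraction(eye_ps: float, data_rate_mtps: float) -> float:
    """Eye width as a fraction of one clock period."""
    return eye_ps / clock_period_ps(data_rate_mtps)


for eye_ps in (156, 254):               # eye widths reported in the thread
    print(eye_ps, round(eye_fraction(eye_ps, 3200), 4))
```

Under those assumptions the clock period at 3200 MT/s is 625 ps, so the two eyes cover roughly a quarter and two fifths of a period respectively.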
