X18 and X24 disks frequently reset with SAS3008 HBAs under heavy write load #162
Comments
Hi @putnam, Sorry you are having issues in your system.
Is this correct? From the standards, disabling EPC should hold across resets and power cycles. As for firmware updates, sometimes those can help (both HBA side and drive side). The Seagate support site has a firmware update finder where you can enter a serial number to check for new firmware. You don't need the other Windows-only tool (it basically scans and opens that webpage for you with the SN already loaded). I am asking around to see if any of the customer support engineers have run into this as well, but I have not heard anything yet.
Thanks so much for the response. I edited my original ticket a lot, so I think you're responding to the initial version. I realized, looking at bash history and the state of the disks, that:
I'm sure this is now outside the scope of this repo, but you guys have been so useful in the past when reporting possible firmware bugs. Maybe it's useful to have shared it here anyway. I'm not an enterprise customer, just an end user, so it's hard to get a line to someone with inside engineering connections.

I can repro more consistently now by just copying a lot of data to the disks. I have found very little info on these particular 20TB models since I understand they're technically binned/refurbed X24 HAMR disks. It may well be an issue with the LSI/Broadcom firmware or even mpt3sas, but again it doesn't repro on my 60+ HGST/WD disks or the X16's on their own.

Since we're almost certainly outside the scope of openSeaChest here, feel free to close, but if it's something you guys are open to pursuing with more debug data and info, I could share it here or privately over email. Regarding firmware, the end user portal shows no update available for these yet.
I did pass this issue along to some people internally to see if they've seen similar problems before with these drives and hardware, but I have not heard anything yet. If you dump the SATA phy event counters, are you seeing those increase at all? If these are increasing (not just the reset counter, but others), it can point towards a cabling issue. I'll see if there is anything else I can think of trying that might also help debug this.
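To make the "are they increasing" check concrete, here is a minimal sketch that compares two snapshots of the phy event counters and reports only the ones that grew. It assumes each counter row has the form `ID  Size  Value  Description`, as in smartctl's phy counter listing; the row layout, the function names, and the sample snapshots are illustrative assumptions, not output from this thread.

```python
import re

# Matches rows such as "0x0001  2            0  Command failed due to ICRC error".
# This row layout is an assumption; adjust the pattern to your tool's output.
ROW = re.compile(r"^\s*(0x[0-9a-fA-F]{4})\s+\d+\s+(\d+)\s+(.+?)\s*$")


def parse_counters(text):
    """Return {counter_id: (value, description)} for every matching row."""
    counters = {}
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            counters[m.group(1).lower()] = (int(m.group(2)), m.group(3))
    return counters


def increased(before, after):
    """Return {counter_id: delta} for counters that grew between snapshots."""
    old = parse_counters(before)
    new = parse_counters(after)
    return {
        cid: value - old.get(cid, (0, ""))[0]
        for cid, (value, _desc) in new.items()
        if value > old.get(cid, (0, ""))[0]
    }


# Illustrative snapshots only, not data from this thread.
BEFORE = """\
0x0001  2            0  Command failed due to ICRC error
0x000a  2            3  Device-to-host register FISes sent due to a COMRESET
"""
AFTER = """\
0x0001  2            4  Command failed due to ICRC error
0x000a  2            5  Device-to-host register FISes sent due to a COMRESET
"""

if __name__ == "__main__":
    print(increased(BEFORE, AFTER))
```

Taking one snapshot before a heavy write and one after, then looking at which counters other than the reset counter moved, is the comparison described above.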
Thanks for the reply! OK, so here are the PHY counters from

Anyway, the resets I see now are specifically when ZFS is copying a large amount of data to the pool and is lighting up the vdevs made up of Seagate devices for a sustained amount of time. Eventually, you see the same message about the HBA resetting with the same fault code in mpt3sas. I did some digging in the mpt3sas driver hoping to find some bitflags or something to identify the fault code, but it looks to be internal/proprietary to Broadcom/LSI.

20TB X24 Disks (Newer)
16TB X18 Disks (Older, pre-existing without resets)
For this page it continues counting until you reset the counters on the page. I don't remember if we put that in as an option in openSeaChest yet. I will have to review the code.

The reason I mentioned the CRC errors is due to some of my own past experience trying to troubleshoot issues other customers have seen. I have also had some long conversations with one of the Seagate engineers who works on the phy level, with the goal of figuring out a way to write a test for detecting a bad cable. It's not an easy task 😆 but we did come up with some ideas, including using these logs. I have not had time to implement it yet, but it will be an expanded version of the

One thing I learned from him was that the faster the interface is running (6Gb/s vs 3Gb/s), the sooner you notice signaling issues. The most common sign is the CRC counters increasing. This is often due to a cabling problem. Not always, but in your case I suspect it is, since it's happening on multiple different drives, even drives that were not previously having an issue. It's possible that these new drives have a slightly different phy behavior that managed to bring this out.

Another thing that can happen (and I have experienced it myself) is that similar problems appear as the backplane connectors wear out from plugging and unplugging drives. Eventually all connectors will fail, but as you approach the insertion count limit you can start to see these kinds of issues too.

I don't know if any of these will solve the issue, but you can try these things:
openSeaChest_Configure also has an option to set the phy speed lower, which you can try, but it may limit your maximum sequential read/write on more modern drives.

One last thing I want to mention: if you can, check for updates to the HBA firmware, as that may also help. I have seen firmware fixes resolve odd phy issues on past Broadcom HBAs, but I don't know if that applies to this specific case.

Let me know if this helps. I'll see if I can talk to that signal engineer I mentioned about this to see if he has any other ideas.
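As a rough illustration of the throughput trade-off when lowering the phy speed, the sketch below computes the payload ceiling of a SATA link from its line rate. SATA uses 8b/10b encoding, so every data byte costs 10 line bits. The 285 MB/s drive rate is an assumed figure for a modern high-capacity drive, not a number from this thread, and the function names are hypothetical.

```python
def sata_payload_mb_per_s(line_rate_gbit):
    """Usable bandwidth in MB/s for a SATA link using 8b/10b encoding."""
    # 8b/10b sends 10 line bits per data byte, so bytes/s = line rate / 10.
    return line_rate_gbit * 1000 / 10


def link_is_bottleneck(line_rate_gbit, drive_mb_per_s):
    """True if the drive's sustained rate exceeds the link's payload ceiling."""
    return drive_mb_per_s > sata_payload_mb_per_s(line_rate_gbit)


# Assumed sustained sequential rate for illustration only.
ASSUMED_DRIVE_MB_PER_S = 285

if __name__ == "__main__":
    for rate in (6.0, 3.0, 1.5):
        ceiling = sata_payload_mb_per_s(rate)
        limited = link_is_bottleneck(rate, ASSUMED_DRIVE_MB_PER_S)
        print(f"{rate} Gb/s link: {ceiling:.0f} MB/s ceiling, bottleneck={limited}")
```

Under that assumption a 3Gb/s link leaves very little headroom before protocol overhead is counted, which is why a lower phy speed can cap sequential transfers on newer drives.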
Thanks. Will go over and try. Regarding the HBA, it's a pretty common SAS3008 HBA and on latest firmware (16.00.14.00). The backplane hasn't had a ton of insertion cycles, but reseating can't hurt. I will swap to a new-in-bag Amphenol cable set + reseat disks and see if I can repro again and report back.
I have a bunch (11 each) of ST24000NM000C and ST16000NM001G drives that cause major issues with my SAS3008-based HBA (the onboard HBA on the Supermicro H12SSL-CT, but also just on a regular 9300-8i). Specifically the HBA hits some failure mode under heavy write loads to these new X24's and the driver triggers a whole HBA reset. Heavy reads seem to not be affected.
The X18 default EPC settings vary vs. the X24's. They seem to have Idle_A set to 1 and Idle_B set to 1200; the X24 firmware only has Idle_A set to 1. The first time I saw this occur, I disabled EPC on the new X24's with --EPCfeature disable, and I thought it was resolved, but the next time I had a pretty sustained write load it happened again.
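For anyone applying the same EPC change to a whole set of drives, a minimal dry-run sketch is below. It only builds and prints the command lines rather than running them. The `--EPCfeature disable` option is the one quoted above; the binary name `openSeaChest_PowerControl`, the `-d` device option, and the `/dev/sg*` paths are assumptions about a particular install and should be checked against the tool's help output.

```python
import shlex

# Assumed binary name; --EPCfeature disable is the option quoted in the post.
TOOL = "openSeaChest_PowerControl"


def epc_disable_commands(devices):
    """Build one argument list per device; nothing is executed here."""
    return [[TOOL, "-d", dev, "--EPCfeature", "disable"] for dev in devices]


if __name__ == "__main__":
    # Hypothetical device handles for illustration only.
    for cmd in epc_disable_commands(["/dev/sg2", "/dev/sg3"]):
        print(shlex.join(cmd))
```

Printing the commands first makes it easy to confirm that only the intended X24 handles are listed before anything is changed.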
I didn't have this issue when it was purely the X18 disks on this adapter. It was only once the X24s were added to the mix that I saw this occur. It also does not occur with HGST/WD disks.
All X18 disks are on SN02, except one RMA refurbed ST16000NM000J on SN04.
All X24 disks are on SN02.
The SAS3008 HBA is on 16.00.14.00. It is actively cooled and temp is monitored and not overheating.
Disks are all attached on a Supermicro 846 SAS3 backplane/LSI expander on 66.16.11.00.
Kernel is 6.10.11-amd64, current Debian testing/trixie.
Here's dmesg during a heavy write load triggering the problem:
I contacted Seagate support and uh, they told me to install some Windows-only software to monitor for firmware updates and didn't know how to respond to anything technical at all. So I hope maybe through you guys this info is useful.