Improving the Storage Hardware Tests


Hi all,
I mentioned a while ago that I thought the smartctl-based hardware tests generated some false positives, and I’ve finally got around to going through all my failed hardware tests and working out what the source of each failure is. While I don’t have Backblaze quantities of hard drives, I do have a bad habit of treating my personal hardware somewhat more harshly than my employer’s hardware, so I have more “test data” than I otherwise might. The smartctl exit codes are bit-significant, so I’ve listed these by bit number.
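
For anyone who wants to decode an exit status by hand, here is a minimal Python sketch. The bit descriptions are shortened paraphrases of the smartctl man page; the names `SMARTCTL_EXIT_BITS` and `decode_exit_status` are made up for illustration and are not part of smartctl or MAAS.

```python
# Shortened paraphrases of the exit status bits documented in the
# smartctl man page. The dictionary and function names are hypothetical.
SMARTCTL_EXIT_BITS = {
    0: "Command line did not parse",
    1: "Device open failed",
    2: "Some SMART or other ATA command to the disk failed",
    3: 'SMART status check returned "DISK FAILING"',
    4: "Prefail attributes <= threshold",
    5: "Attributes have been <= threshold at some time in the past",
    6: "The device error log contains records of errors",
    7: "The device self-test log contains records of errors",
}


def decode_exit_status(status):
    """Return the sorted list of bit numbers set in a smartctl exit status."""
    return [bit for bit in sorted(SMARTCTL_EXIT_BITS) if status & (1 << bit)]


if __name__ == "__main__":
    # Example: bits 2 and 6 set gives 4 + 64 = 68.
    for bit in decode_exit_status(68):
        print(f"Bit {bit}: {SMARTCTL_EXIT_BITS[bit]}")
```

An exit status of 0 decodes to an empty list, meaning smartctl reported nothing.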

Bit 2: Some SMART or other ATA command to the disk failed

Some of my hard disks don’t support SCT and so they trigger an error when smartctl is run with --xall.

Bit 5: Attributes have been <= threshold at some time in the past

In five out of six cases, this was because the drive’s temperature had exceeded the maximum, and in four of those five it appears to have been by only one degree.
The remaining case was an “End-to-End_Error” indicating excessive checksum/parity errors.

Bit 6: The device error log contains records of errors

These errors were all unique.

  1. I appear to have aborted some drive self tests by rebooting the PC.
  2. Uncorrectable read errors. This drive was broken.
  3. A single uncorrectable drive error at LBA 0. This was the same drive that gave me the “End-to-End_Error” threshold alert. This doesn’t look healthy.

It would be nice if we could “acquit” a device of any errors recorded, and it would also be nice if bit 2 were simply ignored by MAAS.

I can imagine a UI that tells the user for bit 5 that drive limits have been exceeded in the past but are currently OK, and that if the user is happy with the state of the device after reviewing the logs, they should feel free to “acquit” the hardware.

I can also imagine a UI that tells the user for bit 6 that after reviewing the logs they can “acquit” the hardware, but that they should think carefully about this. I suspect that bit 3 (DISK FAILING) and bit 4 (prefail attributes <= threshold) should also be treated this way.
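
The handling proposed above could be sketched roughly as follows in Python. This is only an illustration of the idea, not how MAAS works today: the function name, the set names, and the result labels are all hypothetical, and treating bits 0, 1 and 7 as a plain failure is an assumption, since they are not covered by this proposal.

```python
# Hypothetical policy sketch; none of these names exist in MAAS.
IGNORED_BITS = {2}               # failed SMART/ATA command, e.g. no SCT support
ACQUIT_AFTER_REVIEW = {5}        # limits exceeded in the past, currently OK
ACQUIT_WITH_CAUTION = {3, 4, 6}  # user may acquit, but should think carefully


def classify_exit_status(status):
    """Map a smartctl exit status to a suggested (hypothetical) UI outcome."""
    bits = {bit for bit in range(8) if status & (1 << bit)}
    handled = IGNORED_BITS | ACQUIT_AFTER_REVIEW | ACQUIT_WITH_CAUTION
    # Assumption: bits 0, 1 and 7 are not discussed here, so treat them as failures.
    if bits - handled:
        return "fail"
    if bits & ACQUIT_WITH_CAUTION:
        return "acquit-with-caution"
    if bits & ACQUIT_AFTER_REVIEW:
        return "acquit-after-review"
    return "pass"


if __name__ == "__main__":
    for status in (0, 4, 32, 64, 2):
        print(status, classify_exit_status(status))
```

The most serious category present wins, so a drive with both bit 5 and bit 6 set would land in the "think carefully" case.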

My list of errors is here: https://docs.google.com/spreadsheets/d/13Io4LbgWt9PJDET_zGMjs-DfamLnO5Zh471xJafJPWs/edit?usp=sharing