Currently there is no clean method for removing failed tests or their results. For instance, if a long-running badblocks-destructive test times out, that node will be forever marked as having a failed test. The same applies to tests that are skipped, such as the smartctl-validate test bailing out on a hardware RAID array.
While re-running a test is sometimes an option, other times (such as with smartctl-validate) it will always fail. The node is then marked forever as potentially problematic.
It would be great to have a menu option that simply removes a given test's history and results, so that the node shows as fully working.
Thanks for the feedback. You may already know this, but (as a workaround) you can browse to the node in the Web UI and select Override failed testing from the menu to allow the node to be used.
I agree it might be nice to delete known-to-fail test results in some cases. I've seen this issue in heavily-firewalled environments with the internet connectivity tests.
Thanks! Unfortunately "override failed testing" seems to be available only under specific conditions. For example, see how it doesn't appear in any context menus here:
Thanks; that looks like a bug. If you could file a bug on that, that would be appreciated.
For the record, (on the MAAS I'm testing on now) I was able to find the option from the machine listing (but not on a specific machine - but I don't have any in Failed Testing state at the moment).
Edit: My mistake; I didn't notice the machine was still in "Ready" state. I still think your idea is good, but I agree with Andres that it is not a bug.
This is actually not a bug. The "Override failed testing" action is only available when the machine is in "Failed testing" state, and not when the machine is in "Ready" state.
Please note that this action will not make failed tests (or icons) disappear. The action is intended to allow the user to use a machine that has had failed tests. That said, since the machine failed testing anyway, MAAS will continue to show the error icons because the machine never actually passed the tests.
The only way to make those icons disappear is if the tests are re-run and they complete successfully.
Thank you for the clarification. Perhaps the wording behind the "Override Failed …" menu option led me to believe it would clear individual tests.
This brings the discussion full circle to the original point: creating the ability to clear individual tests without having to re-run them. Something like badblocks-destructive on a larger storage subsystem can take a significant amount of time, and often is not necessary to repeat.
Please note that all actions in the action menu affect machines as a whole, and not sections inside a machine. e.g.:
Actions - you have actions to deploy, release, power on, power off, etc.
But if you want to add interfaces, change settings, etc., that's done inline.
That said, you can use the API to delete test runs, but again, we discourage it, because if a test fails, it is a strong indication that something is wrong with the machine and you are using it at a risk. Either way, you can delete the results for a machine with something like:
maas <user> node-script-result delete <machine system_id> <id of the result entry>
You can list all results for a machine with:
maas <user> node-script-results read <machine system_id>
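For anyone who wants to script this, here is a minimal Python sketch that only assembles the two command lines quoted above as argument lists. The function names and the example values ("admin", "abc123", 42) are made up for illustration; the subcommand names are taken from the post above and nothing here has been checked against a particular MAAS version.

```python
import shlex


def list_results_command(profile, system_id):
    """Build the argv for listing all script results of one machine.

    Mirrors: maas <user> node-script-results read <machine system_id>
    """
    return ["maas", profile, "node-script-results", "read", system_id]


def delete_result_command(profile, system_id, result_id):
    """Build the argv for deleting one result entry of one machine.

    Mirrors: maas <user> node-script-result delete <machine system_id> <id>
    """
    return [
        "maas",
        profile,
        "node-script-result",
        "delete",
        system_id,
        str(result_id),
    ]


if __name__ == "__main__":
    # Hypothetical profile name, system_id and result id, for illustration only.
    for argv in (
        list_results_command("admin", "abc123"),
        delete_result_command("admin", "abc123", 42),
    ):
        # Print the command instead of running it, so nothing is deleted by accident.
        print(shlex.join(argv))
```

The lists can be handed to `subprocess.run(argv, check=True)` once you are sure you want to delete the entry; printing first makes it easy to review what would be removed.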
It really would be good to be able to make the icons go away sometimes.
For instance, SMART tests report all errors that the disk drive has seen, which is not the same as all errors that have occurred on the disk drive. I have some drives with communications errors in their SMART logs because of faulty SATA cables. I've replaced the cables, but now I'm stuck with a permanent alert in MAAS. I've also seen problems with Dell hot-plug SAS backplanes.
On the other hand, I'm very happy to have SMART media errors stick around in the MAAS display. I've got a few of them too and having them drowned by false alarms isn't good. I might have to think more about what can be done with the information from SMART.
After you changed the cable, does smartctl now say that the error is fixed, but since there were previous errors it still returns a non-zero return code?
I've reviewed the smartctl logs and I did misread the PHY error logs. I assumed they were errors because I've had problems with SATA cables, but the PHY error logs are counters and all the counts are zero.
The actual problem I had in the past was drives overheating (not by much). The exit code from smartctl has bit 5 set to indicate that the drive is fine now, but something bad happened in the past. smartctl can also set bit 2 when it issues a command that doesn't work.
I think that divining exactly what triggers all the bits in the smartctl exit status could be a bit of a black art. On the other hand, if you want a variety of dodgy hard drives to run sample code against, I can help.
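To make the bit talk concrete, here is a rough Python sketch that decodes a smartctl exit status into the bits that are set. The bit descriptions are paraphrased from the "EXIT STATUS" section of the smartctl man page; the helper names and the idea of treating bit 5 on its own as "historical only" are my own assumptions, not something MAAS does.

```python
# Paraphrased from the smartctl man page ("EXIT STATUS"); bit 0 is the lowest bit.
SMARTCTL_EXIT_BITS = {
    0: "command line did not parse",
    1: "device open failed or device did not identify itself",
    2: "a SMART or other ATA command failed, or a checksum error was found",
    3: "SMART status check returned DISK FAILING",
    4: "prefail attributes found at or below threshold",
    5: "SMART status OK, but attributes were at or below threshold in the past",
    6: "device error log contains records of errors",
    7: "device self-test log contains records of errors",
}

# Assumption for this sketch: bit 5 alone means "fine now, bad in the past".
HISTORICAL_BITS = {5}


def decode_exit_status(status):
    """Return a list of (bit, description) for every bit set in status."""
    return [
        (bit, text)
        for bit, text in sorted(SMARTCTL_EXIT_BITS.items())
        if status & (1 << bit)
    ]


def is_historical_only(status):
    """True if the status is non-zero and only historical bits are set."""
    set_bits = {bit for bit, _ in decode_exit_status(status)}
    return bool(set_bits) and set_bits <= HISTORICAL_BITS


if __name__ == "__main__":
    for bit, text in decode_exit_status(32):
        print(f"bit {bit}: {text}")
    print("historical only:", is_historical_only(32))
```

With an exit status of 32 only bit 5 is set, so a test wrapper could, in principle, report such a drive as a warning rather than a hard failure. Whether that is the right policy is a separate question.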
Here is the output from one of my disks, complete with genuine spelling mistake on line 1.
smartctl with exit status 32
INFO: Veriying SMART support for the following drive: /dev/sde
INFO: Running command: sudo -n smartctl --all /dev/sde
INFO: SMART support is available; continuing...
INFO: Verifying and/or validating SMART tests...
INFO: Running command: sudo -n smartctl --xall /dev/sde
FAILURE: SMART tests have FAILED for: /dev/sde
The test exited with return code 32! See the smarctl manpage for information on the return code meaning. For more information on the test failures, review the test output provided below.
----------------------------------------------------
smartctl 6.5 2016-01-24 r4214 [x86_64-linux-4.4.0-128-generic] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Family: Seagate NAS HDD
Device Model: ST4000VN000-1H4168
Serial Number: Z300QPXZ
LU WWN Device Id: 5 000c50 0645cd965
Firmware Version: SC43
User Capacity: 4,000,787,030,016 bytes [4.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 5900 rpm
Form Factor: 3.5 inches
Device is: In smartctl database [for details use: -P show]
ATA Version is: ACS-2, ACS-3 T13/2161-D revision 3b
SATA Version is: SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Tue Jun 19 08:28:15 2018 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
AAM feature is: Unavailable
APM level is: 254 (maximum performance)
Rd look-ahead is: Enabled
Write cache is: Enabled
ATA Security is: Disabled, frozen [SEC2]
Wt Cache Reorder: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
See vendor-specific Attribute list for marginal Attributes.
General SMART Values:
Offline data collection status: (0x82) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 107) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 1) minutes.
Extended self-test routine
recommended polling time: ( 510) minutes.
Conveyance self-test routine
recommended polling time: ( 2) minutes.
SCT capabilities: (0x10bd) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAGS VALUE WORST THRESH FAIL RAW_VALUE
1 Raw_Read_Error_Rate POSR-- 118 099 006 - 187129512
3 Spin_Up_Time PO---- 092 091 000 - 0
4 Start_Stop_Count -O--CK 085 085 020 - 15853
5 Reallocated_Sector_Ct PO--CK 100 100 010 - 0
7 Seek_Error_Rate POSR-- 078 060 030 - 68242999
9 Power_On_Hours -O--CK 080 080 000 - 18296
10 Spin_Retry_Count PO--C- 100 100 097 - 0
12 Power_Cycle_Count -O--CK 099 099 020 - 1068
184 End-to-End_Error -O--CK 100 100 099 - 0
187 Reported_Uncorrect -O--CK 100 100 000 - 0
188 Command_Timeout -O--CK 100 099 000 - 8590065666
189 High_Fly_Writes -O-RCK 001 001 000 - 703
190 Airflow_Temperature_Cel -O---K 075 043 045 Past 25 (0 6 25 24 0)
191 G-Sense_Error_Rate -O--CK 100 100 000 - 0
192 Power-Off_Retract_Count -O--CK 100 100 000 - 893
193 Load_Cycle_Count -O--CK 092 092 000 - 16586
194 Temperature_Celsius -O---K 025 057 000 - 25 (0 9 0 0 0)
197 Current_Pending_Sector -O--C- 100 100 000 - 0
198 Offline_Uncorrectable ----C- 100 100 000 - 0
199 UDMA_CRC_Error_Count -OSRCK 200 200 000 - 0
||||||_ K auto-keep
|||||__ C event count
||||___ R error rate
|||____ S speed/performance
||_____ O updated online
|______ P prefailure warning
General Purpose Log Directory Version 1
SMART Log Directory Version 1 [multi-sector log support]
Address Access R/W Size Description
0x00 GPL,SL R/O 1 Log Directory
0x01 SL R/O 1 Summary SMART error log
0x02 SL R/O 5 Comprehensive SMART error log
0x03 GPL R/O 5 Ext. Comprehensive SMART error log
0x06 SL R/O 1 SMART self-test log
0x07 GPL R/O 1 Extended self-test log
0x09 SL R/W 1 Selective self-test log
0x10 GPL R/O 1 SATA NCQ Queued Error log
0x11 GPL R/O 1 SATA Phy Event Counters log
0x21 GPL R/O 1 Write stream error log
0x22 GPL R/O 1 Read stream error log
0x24 GPL R/O 1223 Current Device Internal Status Data log
0x25 GPL R/O 1223 Saved Device Internal Status Data log
0x30 GPL,SL R/O 9 IDENTIFY DEVICE data log
0x80-0x9f GPL,SL R/W 16 Host vendor specific log
0xa1 GPL,SL VS 20 Device vendor specific log
0xa2 GPL VS 4496 Device vendor specific log
0xa8 GPL,SL VS 129 Device vendor specific log
0xa9 GPL,SL VS 1 Device vendor specific log
0xab GPL VS 1 Device vendor specific log
0xb0 GPL VS 5176 Device vendor specific log
0xbe-0xbf GPL VS 65535 Device vendor specific log
0xc1 GPL,SL VS 10 Device vendor specific log
0xc3 GPL,SL VS 8 Device vendor specific log
0xc4 GPL,SL VS 5 Device vendor specific log
0xe0 GPL,SL R/W 1 SCT Command/Status
0xe1 GPL,SL R/W 1 SCT Data Transfer
SMART Extended Comprehensive Error Log Version: 1 (5 sectors)
No Errors Logged
SMART Extended Self-test Log Version: 1 (1 sectors)
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Aborted by host 90% 9842 -
# 2 Short offline Aborted by host 90% 9841 -
# 3 Short offline Aborted by host 90% 9841 -
# 4 Short offline Aborted by host 90% 9841 -
# 5 Short offline Aborted by host 90% 9841 -
# 6 Short offline Aborted by host 90% 9841 -
# 7 Short offline Aborted by host 90% 9841 -
# 8 Short offline Aborted by host 90% 9841 -
# 9 Short offline Aborted by host 90% 9841 -
#10 Short offline Aborted by host 90% 7716 -
#11 Short offline Aborted by host 90% 7716 -
#12 Short offline Aborted by host 90% 7716 -
#13 Short offline Aborted by host 90% 7716 -
#14 Short offline Aborted by host 90% 7716 -
#15 Short offline Aborted by host 90% 7716 -
#16 Short offline Aborted by host 90% 6903 -
#17 Short offline Aborted by host 90% 6903 -
#18 Short offline Aborted by host 90% 6903 -
#19 Short offline Aborted by host 90% 6903 -
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
SCT Status Version: 3
SCT Version (vendor specific): 522 (0x020a)
SCT Support Level: 1
Device State: Active (0)
Current Temperature: 25 Celsius
Power Cycle Min/Max Temperature: 24/25 Celsius
Lifetime Min/Max Temperature: 9/56 Celsius
Under/Over Temperature Limit Count: 0/0
SCT Temperature History Version: 2
Temperature Sampling Period: 1 minute
Temperature Logging Interval: 94 minutes
Min/Max recommended Temperature: 1/61 Celsius
Min/Max Temperature Limit: 2/60 Celsius
Temperature History Size (Index): 128 (74)
Index Estimated Time Temperature Celsius
75 2018-06-11 00:28 26 *******
76 2018-06-11 02:02 25 ******
... ..( 2 skipped). .. ******
79 2018-06-11 06:44 25 ******
80 2018-06-11 08:18 26 *******
81 2018-06-11 09:52 25 ******
82 2018-06-11 11:26 26 *******
83 2018-06-11 13:00 26 *******
84 2018-06-11 14:34 26 *******
85 2018-06-11 16:08 25 ******
... ..( 9 skipped). .. ******
95 2018-06-12 07:48 25 ******
96 2018-06-12 09:22 26 *******
... ..( 44 skipped). .. *******
13 2018-06-15 07:52 26 *******
14 2018-06-15 09:26 27 ********
... ..( 8 skipped). .. ********
23 2018-06-15 23:32 27 ********
24 2018-06-16 01:06 26 *******
... ..( 6 skipped). .. *******
31 2018-06-16 12:04 26 *******
32 2018-06-16 13:38 25 ******
... ..( 2 skipped). .. ******
35 2018-06-16 18:20 25 ******
36 2018-06-16 19:54 26 *******
... ..( 5 skipped). .. *******
42 2018-06-17 05:18 26 *******
43 2018-06-17 06:52 27 ********
... ..( 2 skipped). .. ********
46 2018-06-17 11:34 27 ********
47 2018-06-17 13:08 26 *******
... ..( 17 skipped). .. *******
65 2018-06-18 17:20 26 *******
66 2018-06-18 18:54 25 ******
... ..( 5 skipped). .. ******
72 2018-06-19 04:18 25 ******
73 2018-06-19 05:52 ? -
74 2018-06-19 07:26 24 *****
SCT Error Recovery Control:
Read: 70 (7.0 seconds)
Write: 70 (7.0 seconds)
Device Statistics (GP/SMART Log 0x04) not supported
SATA Phy Event Counters (GP Log 0x11)
ID Size Value Description
0x000a 2 6 Device-to-host register FISes sent due to a COMRESET
0x0001 2 0 Command failed due to ICRC error
0x0003 2 0 R_ERR response for device-to-host data FIS
0x0004 2 0 R_ERR response for host-to-device data FIS
0x0006 2 0 R_ERR response for device-to-host non-data FIS
0x0007 2 0 R_ERR response for host-to-device non-data FIS