Not able to get past smartctl validation

Hello Everyone,

I’m on MAAS 2.7 and it looks like I’m pretty far along: I have Ubuntu 20.04 and 18.04 as bootable images, and my client manages to start booting (I think it’s set to 20.04 right now). However, smartctl throws an error that causes the client to immediately power off, and MAAS reports an error.

For testing purposes, my boot client is an older Dell desktop that supports PXE booting but nothing fancy. When I turn the client on, MAAS sees it and the client starts doing a ton of stuff; it appears to be commissioning. Commissioning then fails on something related to smartctl, but it goes by too quickly and I haven’t found it in the logs yet. MAAS does collect a good deal of information about the client system, including visibility of the 256 GB SSD.

I did some Google sleuthing and saw that back in 2018 (MAAS 2.1.x, I think) there was a bug where someone saw very similar behavior, and the solution was to update MAAS; but I think I already have the latest version?

Wondering if it was just random hardware misfortune, we tried the same experiment entirely in VirtualBox and hit the same issue.

My assumption is that something strange is going on with the script that runs the smartctl test. Oddly, when commissioning fails I can tell MAAS to ignore the “error” and try to deploy anyway, but then it still fails in what I think is the same way.

Outside of /var/log and the obvious “Logs” tab in the MAAS web UI, I’m not sure where to go digging for clues.

Any guidance would be greatly appreciated.

I’m on the same MAAS version and trying 20.04 LTS. The node with the SSD fails the same way; smartctl-validate is not even run.

Following the earlier troubleshooting instructions, when I SSH into the node after the testing failure I can run smartctl-validate /dev/sda and everything works; smartctl --scan and smartctl /dev/sda all work too.

However, if I run the script from /tmp/ with --no-download, it fails with the same error:

The actual error log in the MAAS UI is:

```
Unable to run 'smartctl-validate': Storage device 'SAMSUNG MZ7TE256' with serial 'S1K7NSAG410824' not found!
This indicates the storage device has been removed or the OS is unable to find it due to a hardware failure. Please re-commission this node to re-discover the storage devices, or delete this device manually.
Given parameters:
{'storage': {'argument_format': '{path}', 'type': 'storage', 'value': {'id_path': '/dev/disk/by-id/wwn-0x5002538844584d30', 'model': 'SAMSUNG MZ7TE256', 'name': 'sda', 'physical_blockdevice_id': 2, 'serial': 'S1K7NSAG410824'}}}
Discovered storage devices:
[{'NAME': 'sda', 'MODEL': 'SAMSUNG_MZ7TE256', 'SERIAL': 'S1K7NSAG410824'}]
Discovered interfaces:
{'00:23:24:90:22:0d': 'eno1'}
```
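Putting the two logged structures side by side makes the failure visible: the model string in the given parameters contains a space, while the one in the discovered devices contains an underscore. Here’s a quick Python sketch using the exact values from the log; the matching logic is my guess at what the script effectively does, not MAAS’s actual code:

```python
# Values copied from the MAAS error log above.
given = {"model": "SAMSUNG MZ7TE256", "serial": "S1K7NSAG410824"}
discovered = [
    {"NAME": "sda", "MODEL": "SAMSUNG_MZ7TE256", "SERIAL": "S1K7NSAG410824"},
]


def find_device(given, discovered):
    """Assumed sketch of the matching step: look for a discovered device
    whose model and serial both match the given parameters exactly."""
    for dev in discovered:
        if dev["MODEL"] == given["model"] and dev["SERIAL"] == given["serial"]:
            return dev
    return None


# The serials agree, but 'SAMSUNG MZ7TE256' != 'SAMSUNG_MZ7TE256',
# so no device is found and the test errors out.
print(find_device(given, discovered))  # -> None
```

So the disk is perfectly healthy; the exact string comparison on the model name is what fails.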

I started looking at the code for maas-run-remote-scripts, but it looks like I’ll have to step through it to figure things out. I’ll report back if I find something useful.

On another node, I’ll retry with 18.04 and see if that fares any better.

This is a known issue, reported as LP:1869116. What is happening is that commissioning detects the storage model as ‘SAMSUNG MZ7TE256’ via LXD, while testing detects it as ‘SAMSUNG_MZ7TE256’ via lsblk, so the exact-match comparison between the two fails.
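The underscore comes from lsblk, which replaces spaces in its MODEL column. A hedged sketch of a normalization that would reconcile the two sources (illustrative only; this is not the actual fix that landed for LP:1869116):

```python
def normalize(model: str) -> str:
    """Treat lsblk's underscore-escaped model names as equivalent to the
    space-separated names reported during commissioning."""
    return model.replace("_", " ").strip()


# The two representations of the same SSD agree once normalized.
assert normalize("SAMSUNG_MZ7TE256") == normalize("SAMSUNG MZ7TE256")
print("models match after normalization")
```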

You can try skipping storage testing, but you may run into the same problem when deploying.