HPE machines shut down instead of completing commissioning

For the first dozen HPE DL360/BL460 machines I installed with MAAS (2.9, a few months back), I just needed to configure the machine to PXE boot and enable IPMI in the ILO; MAAS would then automatically add a user account to the ILO, and commissioning would work. At some point this seems to have stopped working, but it’s hard to tell when. In the meantime I made several changes to the subnet configuration in MAAS; at times I tried to commission a host without having enabled IPMI in ILO; and at some point I was removing hosts and re-discovering them after memory and disk changes, since this seemed the easiest way to get the new parameters detected by MAAS. Throughout all of this, MAAS kept working for deploying/releasing machines.
But now I have a bunch of new machines. After booting them once manually, they show up in “Commissioning” state with a generated name, but once the Ubuntu login prompt (maas-enlisting-node) appears, the system shuts down again, while MAAS still shows status “Commissioning”. There is no maas user in ILO, and no BMC parameters are configured in MAAS for the new machines in “Commissioning” state. There is only one event for the host (Node changed status - From ‘New’ to ‘Commissioning’), and all items in the commissioning tab are in “pending” state. The snap logs command shows only the DHCP leases given to the hosts. (PS: I upgraded to MAAS 3.0 from snap and tried Ubuntu 18.04 instead of 20.04, but the result didn’t change.)

A second (also manually initiated) boot does something very similar: the host boots up, this time showing the MAAS-assigned hostname, but shortly afterwards it shuts down again. The state in MAAS moves from just “Commissioning” to “Commissioning - loading ephemeral”, but it never ends up in the Ready state; the machine boots and shuts down within minutes. The logs scrolling by seem to indicate downloading packages, executing cloud-init, applying the network config, etc.
I have a video of the output, if that could help pin down the problem: https://youtu.be/5igJpJhi5B0

The only way to get the hosts ready is to enter an IPMI config in MAAS manually (and, of course, create a matching user in ILO). Then I can use “Abort” in MAAS, hit “Commission” again, and everything works: all the details of the host are discovered, and I can continue deploying an operating system.
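For anyone who has to repeat this workaround on many machines, it could probably be scripted against the MAAS CLI instead of the web UI. The following is only a minimal sketch: it builds the `maas` command lines for the three steps (abort, set IPMI power parameters, commission) without running them. The function name `workaround_commands` is made up for illustration, and the profile name, system ID, BMC address and credentials are placeholders. It assumes you are already logged in to the CLI under that profile, and the matching ILO user still has to be created by hand. I have not verified it against MAAS 3.0 specifically.

```python
import shlex


def workaround_commands(profile, system_id, bmc_address, bmc_user, bmc_password):
    """Return the maas CLI invocations for the manual workaround, in order.

    Nothing is executed here; each entry is an argv list that can be
    reviewed first and then passed to subprocess.run() if it looks right.
    """
    base = ["maas", profile, "machine"]
    return [
        # 1. Abort the commissioning run that is stuck.
        base + ["abort", system_id],
        # 2. Enter the IPMI settings by hand (what the UI power form does).
        base + [
            "update",
            system_id,
            "power_type=ipmi",
            f"power_parameters_power_address={bmc_address}",
            f"power_parameters_power_user={bmc_user}",
            f"power_parameters_power_pass={bmc_password}",
        ],
        # 3. Start commissioning again, now with a working BMC config.
        base + ["commission", system_id],
    ]


if __name__ == "__main__":
    # Placeholder values only; substitute your own profile, system ID and BMC details.
    for argv in workaround_commands("admin", "abc123", "10.0.0.50", "maas", "changeme"):
        print(shlex.join(argv))
```

Printing the commands rather than executing them is deliberate: it lets you check each one against your own setup before touching a real machine.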

But how do I fix the automatic commissioning?

@tom-mercelis, did you ever figure this one out?

Hi,

No, we still use the workaround. Luckily we re-install machines much more often than we onboard new ones, so it’s not a show stopper for using MaaS.
I think at some point I found a log somewhere that showed that it was the configuration of the BMC that failed during the first commissioning attempt, which would match the observation. But that doesn’t explain why the status in MaaS remains “Commissioning” instead of “Failed”.
Now that you bring it up again, I’m thinking that maybe a firmware update on the ILO modules is what changed between the moment it was working and the moment it stopped working. If I could run the BMC configuration script from, e.g., the rescue environment, I could do some tests. Do you have any hints on how to do that (mostly: where the scripts are located, and what’s needed to invoke them as if they were used during commissioning)? Also, we’re still using MaaS 3.0, so maybe we should upgrade first.

first, glad you have a workaround. i’ll see what i can figure out on running the BMC script from rescue mode. i know i run the BMC script in weird configurations, but (1) it involves mucking around with the database, and (2) i’m on the MAAS engineering team, so i have less to lose if i totally hose up a MAAS host, which is very possible when screwing around with the DB. there’s gotta be another way. lemme think about it and get back.