MAAS stuck on grub prompt during second reboot on deploy

can you also check the maas-agent logs?

nothing significant in the maas-agent logs, only entries for power activity.

can you add debug: true in /var/snap/maas/current/rackd.conf, restart MAAS and reproduce it again?

Also, can you always reproduce it? Does it work with other machines?

debug: true is already enabled in /etc/maas/rackd.conf. yes, i can always reproduce it with this set of machines.

what machine is it? Do you have other machines that are working fine?

it’s an HPE ProLiant DL380 Gen11. i have another set of working servers that are Gen10 Plus.

Are they working fine? Have you tried updating all the firmware?

yes the Gen10s are working fine.

are you using the latest firmware versions?

neither the Gen11 nor the Gen10 servers are on the latest firmware versions

please upgrade them first

thanks for the quick response, let me try upgrading them first.

it’s the same after upgrading to the latest firmware.

could this be due to cloud-init failing to report back to MAAS, leaving the machine stuck at rebooting?

i was comparing the packets captured during the successful deployment (gen10) with those from the one that failed (gen11). after block 4476, the successful deployment made POST calls to the metadata service and ran the cloud-init init-network stage.

Read Request, File: grubx64.efi
Acknowledgement, Block: 0
Acknowledgement, Block 4476
POST /MAAS/metadata/status/id
{"name": "init-network /check-cache", "description": "no cache found", "event_type": "finish", "origin": "cloud-init", "result": "SUCCESS"}
POST /MAAS/metadata/status/id
{"name": "init-network /search-MAAS", "description": "searching for network data from DataSourceMAAS", "event_type": "start", "origin": "cloud-init"}
POST /MAAS/metadata/status/id
{"name": "init-network /search-MAAS", "description": "found network data from DataSourceMAAS", "event_type": "finish", "origin": "cloud-init", "result": "SUCCESS"}

based on what I have observed so far, curtin installs the target OS successfully and reboots the machine. however, after the reboot, the machine does not boot into the installed OS. it re-initiates PXE boot and gets stuck at the GRUB prompt. i have to manually reboot the machine, after which the deployment completes.

i have a few questions:

  1. what exactly happens after Curtin finishes installation and reboots the machine?
  2. how does MAAS determine if the OS booted successfully?
  3. how or where can I troubleshoot machines that stall after curtin reboots them?

any insights or suggestions for deeper debugging would be appreciated, thanks!

what exactly happens after Curtin finishes and reboots the machine?

Curtin installs the image on the disk and the machine is rebooted. The machine PXE boots again, but this time MAAS instructs it to boot from the local disk.

how does MAAS determine if the OS booted successfully?

After the reboot, the installed OS makes some requests to MAAS (for example, cloud-init posting status events to the metadata service). If this happens, MAAS considers the machine successfully deployed.

how or where can I troubleshoot machines that stall after curtin reboots them?

  • It might be an issue with Secure Boot. Try disabling it.
  • MAAS might think the machine boots in UEFI mode when it actually uses legacy BIOS, or vice versa.
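
Since the machine does come up after a manual reboot, both points can be checked from the deployed OS. The snippet below is a minimal sketch, assuming a Linux host with efivarfs mounted under /sys/firmware/efi/efivars. `firmware_boot_state` is a hypothetical helper name, and the `sys_root` parameter exists only so the function can be tested against a fake directory tree.

```python
import os

# GUID of the EFI global variable namespace, which holds the SecureBoot variable.
EFI_GLOBAL_GUID = "8be4df61-93ca-11d2-aa0d-00e098032b8c"


def firmware_boot_state(sys_root="/sys"):
    """Report how the running system was booted.

    Returns a dict with "boot_mode" ("uefi" or "bios") and "secure_boot"
    (True, False, or None when it cannot be determined).
    """
    efi_dir = os.path.join(sys_root, "firmware", "efi")
    # The kernel only exposes this directory when it was booted via UEFI.
    if not os.path.isdir(efi_dir):
        return {"boot_mode": "bios", "secure_boot": None}

    var_path = os.path.join(efi_dir, "efivars", "SecureBoot-" + EFI_GLOBAL_GUID)
    secure_boot = None
    try:
        with open(var_path, "rb") as f:
            data = f.read()
        # efivarfs prefixes the value with 4 attribute bytes; the value is 1 byte.
        if len(data) >= 5:
            secure_boot = data[4] == 1
    except OSError:
        pass  # variable missing or unreadable, leave as unknown
    return {"boot_mode": "uefi", "secure_boot": secure_boot}
```

The result can then be compared with the boot mode MAAS has recorded for the machine.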