MAAS changing boot order during deployment

While we are currently able to deploy to some nodes just fine (PowerEdge M630 with 1Gbps NIC), when we try to deploy the same image to a different machine (PowerEdge R450 with 25Gbps NIC), MAAS seems to change the boot order of the node during deployment to put PXE first. Because of this, when the deployment is done and the node reboots, it goes right back to PXE and deploys all over again. This continues indefinitely until we manually fix the boot order. But sometimes, when we do so, MAAS doesn’t recognize that the deployment has finished and marks the node as “failed deployment.” Is there some way to prevent MAAS from rearranging the boot order during deployment, or to at least put it back before rebooting at the end of the deployment?

Hi @dcaunt42

It is normal behaviour for MAAS to change the boot order so that PXE/Netboot is the first option. Why would you like to revert the boot order once the machine is deployed?

Once the machine is deployed and powered on, it will try to PXE boot, but since MAAS knows that the machine is deployed, it will return a grub.cfg telling the machine to boot from disk.
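
To illustrate the idea, here is a toy model in Python. It is not MAAS code: the function name `pxe_response` and the returned strings are hypothetical, chosen only to show why leaving PXE first in the boot order is normally harmless.

```python
def pxe_response(node_status: str) -> str:
    """Toy model of what a PXE request gets back, keyed on node status.

    Hypothetical: the return values are illustrative only and do not
    correspond to anything MAAS actually sends.
    """
    status = node_status.strip().lower()
    if status == "deployed":
        # The node is considered installed: hand control to the local disk.
        return "boot-from-disk"
    if status in ("commissioning", "deploying"):
        # The node is still being worked on: boot over the network.
        return "boot-from-network"
    # Assumption: any other state does nothing in this toy model.
    return "no-action"


for s in ("Deploying", "Deployed"):
    print(s, "->", pxe_response(s))
```

In this model, a redeploy loop would suggest the node never reached the deployed state from MAAS's point of view, which is worth checking in the UI or logs.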

Your continuous PXE booting is probably a symptom of a different issue. Is there anything interesting in the logs?

It is also worth checking the logs on the aforementioned machine at /var/log/cloud-init.log
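
A quick way to pull the interesting lines out of that file is a short script like the one below. This is only a sketch: the keyword list and the `find_suspect_lines` helper are assumptions made for illustration, not part of cloud-init or MAAS.

```python
import sys
from pathlib import Path

# Assumed keywords (matched case-sensitively); adjust to taste.
KEYWORDS = ("ERROR", "WARNING", "Traceback", "failed")


def find_suspect_lines(text: str, keywords=KEYWORDS):
    """Return (line_number, line) pairs whose text contains any keyword."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        if any(word in line for word in keywords):
            hits.append((number, line.rstrip()))
    return hits


if __name__ == "__main__":
    # Default to the path mentioned above; a different file can be passed as an argument.
    path = Path(sys.argv[1] if len(sys.argv) > 1 else "/var/log/cloud-init.log")
    if path.is_file():
        for number, line in find_suspect_lines(path.read_text(errors="replace")):
            print(f"{number}: {line}")
    else:
        print(f"{path} not found")
```

Run it on the deployed machine (reading the log may require root), then paste any hits here.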

Thank you for this @troyanov. I didn’t realize that this was expected behavior since the M630s that we tested first did not have their boot order changed. Using this knowledge, I will look into this again and see what I can find. Thank you.

@dcaunt42, did you ever resolve this one?

I had to give up on troubleshooting the node I was working on at the time, but I am having the same problem again with a new node. I will post details and log files shortly.

We are currently attempting to deploy to a Lenovo ThinkSystem SD650. The deployment attempt looks something like:

  • Node commissions successfully
  • We initiate a deployment to it
  • It looks as though it deploys successfully, but afterwards it reboots and seems to start deploying again from the beginning.
  • This time, the deployment fails (see screenshot ending with “reboot: Restarting system”)
  • The node then reboots one more time, attempts to boot under MAAS direction, gives the errors “error: invalid magic number” and “error: you need to load the kernel first” and then stops at the GRUB screen (see second screenshot)

Linked is the messages file from /var/snap/maas/common/log/rsyslog on the region controller. This log file should contain everything from the commissioning, initial deployment, second deployment attempt, etc.

We ran another test today to make sure that the problem was not related to our custom Rocky8 image that we’re attempting to deploy. Even though this image has deployed fine on Dell machines before, we ran another deployment on the Lenovo host with the Ubuntu 22.04 image in MAAS. The results were slightly different, so we’re hoping this output may help shed some light on the issue, or at least look more familiar to those who have more experience deploying Ubuntu.

The deployment attempt looks like:

  • Node commissions successfully
  • We initiate a deployment of Ubuntu 22.04 to it
  • It looks as though it deploys successfully, but afterwards it reboots, attempts to boot from disk, and shows the output below (see screenshot)
  • After the screen below, we get “no boot device available” and it reboots again (and tries to boot from disk again, and loops like this forever)

Linked is the messages file from /var/snap/maas/common/log/rsyslog on the region controller.

Any advice would be greatly appreciated. Thank you!

@billwear any suggestions on how we can get some help resolving these issues?

we have thousands of physical lenovo nodes we need to provision and are kind of stuck.

@dcaunt42 is your custom Rocky image based on our packer template? If not, it seems the image may be missing a boot loader / isn’t properly installing one.

@rockpapergoat, see the recent reply from one of our engineers

yes, we based ours off the ones published here. they’re basically identical except for installing a couple of packages. we do all the rest with cloud-init, driven by terraform.

for what it’s worth, ubuntu 22.04 stock images also don’t install on any of our lenovo hardware, even with updated bios firmware. @dcaunt42 has the most context, as described above.

rocky and ubuntu install fine on older dell hardware in our test cluster. something’s missing here, and it’s hard to determine if it’s hardware, config, networking, or maas at this point.

This is sounding like it may be machine specific, at which point we’d recommend you look into our support. But to confirm, this is an issue with a specific type of machine and these images do boot on other hardware?

Yes, this particular image does boot on other machines, specifically Dell brand servers.
We’ve been trying to work out a support contract with your support specialists since November. It has been slow-going, so we’ve been using these support forums while we wait for that.

from what we heard, a professional services engagement for a specific project scope seemed like what you offer, but it’s not really what we need. we have patterns in place on this side already but are getting stuck with hardware and config issues.

personally, i’d like to see more discussion in here around patterns and usage, but it seems like mostly break/fix talk. we’re most likely going to release our terraform module at some point and would like to collaborate with folks on making this more useful. even though MAAS has been out for a while, it’s been difficult to find people who are using it in production to talk shop and trade patterns.

that said, we appreciate your help here and recognize this is a best effort, community driven forum.