MAAS is changing my boot order!

My MAAS deployment is essentially dead in the water.

I’ve been troubleshooting this for the last week or so and I’ve finally uncovered the problem: MAAS is somehow changing my BIOS boot order, and machines are failing deployment because of it.

According to @ltrager,

However, searching through Launchpad, I might have found that I’m not the only one.

Here is my experience:

  1. I double check my boot order

  2. I commission the machine, it gathers the correct storage info

  3. Check BIOS, remains unchanged…

  4. I deploy the machine, and get “Failed deployment”, although the machine finishes deployment.

     BootCurrent: 0006
     Timeout: 0 seconds
     BootOrder: 0002
     Boot0000* EFI Network 1
     Boot0001* EFI Network 2
     Boot0002* EFI Fixed Disk Boot Device 1
     Boot0003* EFI Fixed Disk Boot Device 1
     Boot0004* EFI Fixed Disk Boot Device 1
     Boot0005* EFI Fixed Disk Boot Device 1
     Running command ['udevadm', 'settle'] with allowed return codes [0] (capture=False)
     TIMED subp(['udevadm', 'settle']): 0.008
     Running command ['umount', '/tmp/tmpj7vthvop/target/sys/firmware/efi/efivars'] with allowed return codes [0] (capture=False)
     Running command ['umount', '/tmp/tmpj7vthvop/target/sys'] with allowed return codes [0] (capture=False)
     Running command ['umount', '/tmp/tmpj7vthvop/target/run'] with allowed return codes [0] (capture=False)
     Running command ['umount', '/tmp/tmpj7vthvop/target/proc'] with allowed return codes [0] (capture=False)
     Running command ['umount', '/tmp/tmpj7vthvop/target/dev'] with allowed return codes [0] (capture=False)
     Running command ['mount', '--bind', '/dev', '/tmp/tmpj7vthvop/target/dev'] with allowed return codes [0] (capture=False)
     Running command ['mount', '--bind', '/proc', '/tmp/tmpj7vthvop/target/proc'] with allowed return codes [0] (capture=False)
     Running command ['mount', '--bind', '/run', '/tmp/tmpj7vthvop/target/run'] with allowed return codes [0] (capture=False)
     Running command ['mount', '--bind', '/sys', '/tmp/tmpj7vthvop/target/sys'] with allowed return codes [0] (capture=False)
     Running command ['mount', '--bind', '/sys/firmware/efi/efivars', '/tmp/tmpj7vthvop/target/sys/firmware/efi/efivars'] with allowed return codes [0] (capture=False)
     Running command ['unshare', '--fork', '--pid', '--', 'chroot', '/tmp/tmpj7vthvop/target', 'efibootmgr', '-v'] with allowed return codes [0] (capture=True)
     Running command ['udevadm', 'settle'] with allowed return codes [0] (capture=False)
     TIMED subp(['udevadm', 'settle']): 0.010
     Running command ['umount', '/tmp/tmpj7vthvop/target/sys/firmware/efi/efivars'] with allowed return codes [0] (capture=False)
     Running command ['umount', '/tmp/tmpj7vthvop/target/sys'] with allowed return codes [0] (capture=False)
     Running command ['umount', '/tmp/tmpj7vthvop/target/run'] with allowed return codes [0] (capture=False)
     Running command ['umount', '/tmp/tmpj7vthvop/target/proc'] with allowed return codes [0] (capture=False)
     Running command ['umount', '/tmp/tmpj7vthvop/target/dev'] with allowed return codes [0] (capture=False)
     Setting currently booted 0006 as the first UEFI loader.
     New UEFI boot order: 0006,0002
     Running command ['mount', '--bind', '/dev', '/tmp/tmpj7vthvop/target/dev'] with allowed return codes [0] (capture=False)
     Running command ['mount', '--bind', '/proc', '/tmp/tmpj7vthvop/target/proc'] with allowed return codes [0] (capture=False)
     Running command ['mount', '--bind', '/run', '/tmp/tmpj7vthvop/target/run'] with allowed return codes [0] (capture=False)
     Running command ['mount', '--bind', '/sys', '/tmp/tmpj7vthvop/target/sys'] with allowed return codes [0] (capture=False)
     Running command ['mount', '--bind', '/sys/firmware/efi/efivars', '/tmp/tmpj7vthvop/target/sys/firmware/efi/efivars'] with allowed return codes [0] (capture=False)
     Running command ['unshare', '--fork', '--pid', '--', 'chroot', '/tmp/tmpj7vthvop/target', 'efibootmgr', '-o', '0006,0002'] with allowed return codes [0] (capture=False)
     Invalid BootOrder order entry value0006
                                          ^
     efibootmgr: entry 0006 does not exist
     Running command ['udevadm', 'settle'] with allowed return codes [0] (capture=False)
     TIMED subp(['udevadm', 'settle']): 0.010
     Running command ['umount', '/tmp/tmpj7vthvop/target/sys/firmware/efi/efivars'] with allowed return codes [0] (capture=False)
     Running command ['umount', '/tmp/tmpj7vthvop/target/sys'] with allowed return codes [0] (capture=False)
     Running command ['umount', '/tmp/tmpj7vthvop/target/run'] with allowed return codes [0] (capture=False)
     Running command ['umount', '/tmp/tmpj7vthvop/target/proc'] with allowed return codes [0] (capture=False)
     Running command ['umount', '/tmp/tmpj7vthvop/target/dev'] with allowed return codes [0] (capture=False)
     finish: cmd-install/stage-curthooks/builtin/cmd-curthooks/install-grub: FAIL: installing grub to target devices
     finish: cmd-install/stage-curthooks/builtin/cmd-curthooks/configuring-bootloader: FAIL: configuring target system bootloader
     finish: cmd-install/stage-curthooks/builtin/cmd-curthooks: FAIL: curtin command curthooks
     Traceback (most recent call last):
       File "/curtin/curtin/commands/main.py", line 202, in main
         ret = args.func(args)
       File "/curtin/curtin/commands/curthooks.py", line 1770, in curthooks
         builtin_curthooks(cfg, target, state)
       File "/curtin/curtin/commands/curthooks.py", line 1736, in builtin_curthooks
         setup_grub(cfg, target, osfamily=osfamily)
       File "/curtin/curtin/commands/curthooks.py", line 701, in setup_grub
         uefi_reorder_loaders(grubcfg, target)
       File "/curtin/curtin/commands/curthooks.py", line 462, in uefi_reorder_loaders
         in_chroot.subp(['efibootmgr', '-o', new_boot_order])
       File "/curtin/curtin/util.py", line 708, in subp
         return subp(*args, **kwargs)
       File "/curtin/curtin/util.py", line 275, in subp
         return _subp(*args, **kwargs)
       File "/curtin/curtin/util.py", line 141, in _subp
         cmd=args)
     curtin.util.ProcessExecutionError: Unexpected error while running command.
     Command: ['unshare', '--fork', '--pid', '--', 'chroot', '/tmp/tmpj7vthvop/target', 'efibootmgr', '-o', '0006,0002']
     Exit code: 8
     Reason: -
     Stdout: ''
     Stderr: ''
     Unexpected error while running command.
     Command: ['unshare', '--fork', '--pid', '--', 'chroot', '/tmp/tmpj7vthvop/target', 'efibootmgr', '-o', '0006,0002']
     Exit code: 8
     Reason: -
     Stdout: ''
     Stderr: ''
    
  5. Check BIOS again, boot order is different

  6. Finally, since the machine actually deploys, even though it reports “Failed deployment”, I can check:

    root@4-R420:~# efibootmgr 
    BootCurrent: 0008
    Timeout: 0 seconds
    BootOrder: 0008,0002,0006,0007
    Boot0000* EFI Network 1
    Boot0001* EFI Network 2
    Boot0002* EFI Fixed Disk Boot Device 1
    Boot0003* EFI Fixed Disk Boot Device 1
    Boot0004* EFI Fixed Disk Boot Device 1
    Boot0005* EFI Fixed Disk Boot Device 1
    Boot0006* EFI Network 1
    Boot0007* EFI Network 2
    Boot0008* ubuntu
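
The failure in step 4 comes down to one mismatch in the log: `BootCurrent` is 0006, but the listing only contains entries 0000 through 0005, so `efibootmgr -o 0006,0002` is rejected with "entry 0006 does not exist". Here is a minimal Python sketch that parses plain `efibootmgr` output and flags that condition. It is not curtin's actual code, and the names `parse_efibootmgr` and `missing_boot_entries` are invented for illustration:

```python
import re


def parse_efibootmgr(output):
    """Parse plain `efibootmgr` output into current, order and entries."""
    current = None
    order = []
    entries = {}
    for raw in output.splitlines():
        line = raw.strip()
        if line.startswith("BootCurrent:"):
            current = line.split(":", 1)[1].strip()
        elif line.startswith("BootOrder:"):
            value = line.split(":", 1)[1].strip()
            order = [n for n in value.split(",") if n]
        else:
            # Entry lines look like "Boot0002* EFI Fixed Disk Boot Device 1".
            match = re.match(r"^Boot([0-9A-Fa-f]{4})(\*?)\s+(.*)$", line)
            if match:
                entries[match.group(1)] = match.group(3)
    return {"current": current, "order": order, "entries": entries}


def missing_boot_entries(state, proposed_order):
    """Return the numbers in proposed_order that have no BootNNNN entry."""
    return [n for n in proposed_order if n not in state["entries"]]


# Listing copied from the failed deployment log in step 4.
FAILED_DEPLOY = """\
BootCurrent: 0006
Timeout: 0 seconds
BootOrder: 0002
Boot0000* EFI Network 1
Boot0001* EFI Network 2
Boot0002* EFI Fixed Disk Boot Device 1
Boot0003* EFI Fixed Disk Boot Device 1
Boot0004* EFI Fixed Disk Boot Device 1
Boot0005* EFI Fixed Disk Boot Device 1
"""

state = parse_efibootmgr(FAILED_DEPLOY)
print(missing_boot_entries(state, ["0006", "0002"]))  # prints ['0006']
```

Running the same check against the listing in step 6 would report nothing missing, since 0006, 0007 and 0008 all exist by then.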
    

One of the bug reports says that a fix has been released for curtin:

I’m running 2.8

root@controller:~# snap list
Name      Version                 Rev   Tracking       Publisher   Notes
core18    20200724                1885  latest/stable  canonical✓  base
maas      2.8.2-8577-g.a3e674063  8980  2.8/stable     canonical✓  -
maas-cli  0.6.5                   13    latest/stable  canonical✓  -
snapd     2.46.1                  9279  latest/stable  canonical✓  snapd

With all that said, my final question is: when will this fix be integrated into the next MAAS Snap?


Based on my tests, MAAS always changes the UEFI boot order. It puts the current boot device as the first entry and the installed OS as the second.
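
That matches the "Setting currently booted 0006 as the first UEFI loader" line in the deployment log earlier in the thread. As a rough sketch of the reordering step only (this is not curtin's code, and `reorder_loaders` is a name invented here), the new order is the currently booted entry followed by whatever was already in BootOrder:

```python
def reorder_loaders(boot_current, boot_order):
    """Put the currently booted entry first, keep the rest in order."""
    rest = [n for n in boot_order if n != boot_current]
    return [boot_current] + rest


# Values from the failed deployment log: BootCurrent 0006, BootOrder 0002.
new_order = reorder_loaders("0006", ["0002"])
print(",".join(new_order))  # prints 0006,0002
```

The sketch does not check that the current entry actually exists in the firmware's entry list, which is exactly the case that breaks in the log.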

I’m having a similar issue due to the older version of curtin: MAAS 2.8.2 doesn’t add the OS entry.

The current 2.8.2 and the beta (8578) have the same issue, but the older 2.8.1 beta doesn’t.

Right, in desperation I tested with 2.9.0~beta5-9001-g.f7b0390fa and the problem still exists.

Curtin 20.2 has already been released with these fixes, but the MAAS snap uses 20.1-2-g42a9667f.

/var/lib/snapd/snap/maas/current/usr/lib/python3/dist-packages/curtin/version.py shows that the MAAS snap uses the Ubuntu 18.04 (Bionic) curtin package.
And the install log shows: curtin: Installation started. (20.1-2-g42a9667f-0ubuntu1~18.04.1)
The current stable version for Bionic is 20.1-2.
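
To check quickly whether the version in an install log predates the fix, the version string can be compared against 20.2. This is a small sketch, assuming the string always starts with `major.minor` as in the versions quoted here; `curtin_release` and `curtin_has_fix` are hypothetical helpers, not part of curtin or MAAS:

```python
import re

FIXED_IN = (20, 2)  # curtin release reported in this thread to carry the fix


def curtin_release(version):
    """Extract (major, minor) from a curtin version string."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError("unrecognised curtin version: %r" % version)
    return int(match.group(1)), int(match.group(2))


def curtin_has_fix(version):
    """True if the version is at least the release carrying the fix."""
    return curtin_release(version) >= FIXED_IN


print(curtin_has_fix("20.1-2-g42a9667f-0ubuntu1~18.04.1"))  # prints False
print(curtin_has_fix("20.2"))  # prints True
```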

Looking at the release history, it seems that it will not be updated until February 2021:
https://launchpad.net/ubuntu/bionic/+source/curtin/+builds


Feb 2021 :exploding_head:

This is a total show stopper for me. I have 6 servers that I can’t deploy now.

I’ve been looking for a workaround, and the only thing I can think of is to cobble together my own Snap with the latest version of python3-curtin.

https://github.com/maas/maas/blob/master/snap/snapcraft.yaml

But then I’ll be outside the official MAAS Snap channel…

@sparkiegeek is there any chance you could weigh in on this issue?


I see in the snapcraft.yaml that they moved from base: core18 to base: core20; maybe Focal will get an update this November, as happened in 2019.

Hello @nateybobo, @adolfo94

Sorry you’re hitting this issue. As you have correctly identified, it’s a problem in curtin, which has subsequently been fixed, but the fix has not yet made it into the MAAS snap.

I have re-built the snap with curtin pulled from the stable ppa, and pushed it to a snap branch.

snap refresh maas --channel=latest/edge/curtin-stable

This revision is built with MAAS from master as of yesterday, so is considered pre-release, but it seems you’re comfortable with using different versions.

Please test it out, and share any feedback you have.


@sparkiegeek

Looks like that did the trick!

snap refresh maas --channel=latest/edge/curtin-stable

root@controller:~# snap list
Name      Version                       Rev   Tracking       Publisher   Notes
core18    20200724                      1885  latest/stable  canonical✓  base
core20    20                            634   latest/stable  canonical✓  base
maas      2.9.0~beta5-9002-g.2a342196f  9821  latest/edge/…  canonical✓  -
maas-cli  0.6.5                         13    latest/stable  canonical✓  -
snapd     2.46.1                        9279  latest/stable  canonical✓  snapd

I tested with this workflow:

  1. commission - good
  2. deploy - good
  3. power off - verify boot order - good
  4. release - good
  5. deploy - good
  6. power off - verify boot order - good

Log: https://paste.ubuntu.com/p/CgQyM24Hzt/


@sparkiegeek @nateybobo - And here we were, thinking this was some in-built behavior of Ubuntu Server itself, as part of the update from 20.04 to 20.04.1 (kernel update 5.4.0-47).

Prior to any proposal for wider adoption within the org, we’re pre-flighting on consumer-grade hardware - which means no BMC (and thus no IPMI). And due to the “homelab” nature of this cluster, there’s physically no space in the enclosures for a GPU. So imagine our surprise when “resetting” the cluster, only to find that none of the nodes would surface via PXE - instead booting to Ubuntu directly (discovered only after much troubleshooting and tinkering, finally including opening up each node’s enclosure just to temporarily pop in a low-profile GPU simply to check UEFI-BIOS settings).

If this is genuinely the root cause, I implore @ltrager (et al) to consider this as part of either the full release of 2.9 or a soon-to-follow point release, as the timing for many orgs couldn’t be more critical - what with budget being available at the start of the new year, and thus more machines in need of provisioning coming online.


Very solid points. I spent quite a while trying to figure out what the real problem was, because no one ever expects “BIOS” type settings to be altered by the OS! I didn’t even know that was possible lol

I’m away from home for the entirety of the next year, and if I can’t provision anything in my stack I’m pretty screwed. Thank goodness for iDrac, but still.

FYI - if you plan on bouncing between different Snap versions of MAAS, use the snapshot feature for Snaps to back up your MAAS and PostgreSQL. I found that going to 2.9 made breaking changes to my PostgreSQL database, and MAAS couldn’t use it once I tried going back to 2.8.
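
For reference, snapd's snapshot feature is driven by `snap save`, `snap saved` and `snap restore`. The following is a minimal sketch that wraps `snap save` from Python. The database snap name is a placeholder (it is not named in this thread), both function names are invented for illustration, and the function only builds the command unless `run=True`:

```python
import subprocess


def snapshot_command(snaps):
    """Build the `snap save` command for the given snap names."""
    if not snaps:
        raise ValueError("at least one snap name is required")
    return ["snap", "save"] + list(snaps)


def take_snapshot(snaps, run=False):
    """Return the command; execute it only when run=True (needs root)."""
    cmd = snapshot_command(snaps)
    if run:
        subprocess.run(cmd, check=True)
    return cmd


# "my-postgresql-snap" is a placeholder for whichever snap holds the database.
print(" ".join(take_snapshot(["maas", "my-postgresql-snap"])))
```

After a real run, `snap saved` lists the snapshot set and its ID, and `snap restore <id>` brings that set back.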


@nateybobo, can you be more specific about the “breaking changes”? That’s at least worth documenting in the release notes!

@billwear I don’t have an exhaustive list, but when I moved from 2.8 → 2.9 and then tried to revert to my 2.8 snapshot, MAAS wouldn’t start because the PostgreSQL schema had been changed. I’m a dummy and forgot to snapshot my PostgreSQL snap :frowning:

I had to manually re-create a column called skip_bmc_config in the maasserver_node table to get MAAS to start up correctly. I’m going to migrate to a fresh MAAS controller once all my boot order woes get addressed.


@nateybobo, yeah, that’s a risk of upgrading. the doc should probably warn to make backups. glad you got the mad skills to fix it, tho! that’s really cool.


@billwear @sparkiegeek

FYI - I refreshed to Snap 2.9.0~beta6-9039-g.5a2ba747e and it looks like my boot order woes are fixed now.

Hopefully there is no regression once 2.9 goes stable!


excellent. let me know if you still have problems when 2.9 goes stable.


After testing in our CI confirmed no regressions, I’ve included the 20.2 Curtin release in the next beta release of MAAS 2.9.

Thanks everyone for your feedback and confirmation of fixes.


Note that there’s another issue that could cause these symptoms:
LP: #1899993. That requires a kernel update that should be available in early December.

See Comment #9 for a way to figure out if this impacts you (the Input/Output Error)
