MAAS stuck on grub prompt during second reboot on deploy

alex-arm · 12 May 2025 16:41

Hi,

I’ve been trying to get my arm64 servers deployed through MAAS and am running into an issue where a machine keeps getting stuck at a grub prompt and requires manual intervention (a reboot command via KVM). Once rebooted, it then seems to proceed as it should.

In the UI logs I notice that when it fails, it only requests the 3 files below:
TFTP Request - bootaa64.efi
TFTP Request - bootaa64.efi
TFTP Request - grubaa64.efi ← fails here and ends up in grub prompt

On reboot/successful run it proceeds to additionally send the below files:
TFTP Request - /grub/grub.cfg-xx:xx:xx:xx:xx:xx
TFTP Request - /grub/grub.cfg
TFTP Request - /grub/arm64-efi/terminal.lst
/grub/arm64-efi/crypto.lst
/grub/arm64-efi/fs.lst
/grub/arm64-efi/command.lst

Any ideas why it may be failing/getting stuck here?

Stuck: [screenshot chrome_Cs47IMsnsW]

Success: [screenshot chrome_3uP1ZigkjH]

MAAS UI Log (Note timestamp): [screenshot]

zmance · 12 May 2025 17:26

Are your machines set to UEFI? Is your MAAS IPMI boot type set to UEFI? These are my initial sanity checks when I run into this issue.

alex-arm · 13 May 2025 09:26

I believe it is already set to UEFI. MAAS also correctly set the power boot type as EFI when it commissions the machine.

zmance · 13 May 2025 23:17

Are there any BMC f/w updates available? What NIC are you using?

alex-arm · 15 May 2025 13:00

It’s an Intel I350 onboard NIC. I was going to test on a different connected NIC but, strangely enough, changing the connection to the 2nd eth port seems to have “fixed” this issue. I’m not wholly certain why this is the case.

zmance · 15 May 2025 20:56

Could’ve been a bad onboard port. That’s the story I’m going to run with =)
Glad you got it working!

yings17 · 28 May 2025 12:44

I’m facing the same issue, and the machines are already configured to use UEFI. Is there a way to enable additional logging to help troubleshoot why it fails after the reboot?

r00ta · 28 May 2025 13:00

Does it fail with the same message on the screen?

yings17 · 29 May 2025 02:30

It’s stuck at the grub prompt with a slightly different message: it shows that it’s trying to PXE boot, and the last line is stuck at “Fetching Netboot Image”.

trsoumi88 · 29 May 2025 02:36

Could you describe the network port configuration of the machine currently being deployed? Specifically, is it using a single network interface card (NIC), or is it configured with dual NICs and port channeling?

I suspect the issue might be related to the machine’s network ports—perhaps a loose connection or a faulty cable. Running a tcpdump on the MAAS instance might also help us diagnose the traffic flow.

yings17 · 29 May 2025 02:49

It’s using dual NICs configured with balance-alb bonding mode.

trsoumi88 · 29 May 2025 03:15

It would be worth disabling one NIC port at a time, and retrying a commission/deploy, to rule out any involvement of a faulty NIC/Cable.

Also, is this happening only on one machine, or all the machines?

troyanov · 29 May 2025 05:51

I’d recommend capturing a tcpdump on the rackd side and checking what is happening on the wire between rackd and the target machine.

yings17 · 11 June 2025 08:31

I tried disabling one NIC port at a time, to no avail. So I captured packets with Wireshark and noticed a difference between a successful boot and a problematic one.

Read Request, File: bootx64.efi 
Acknowledgement, Block: 0
Acknowledgement, Block: 4000+ 
Read Request, File: /grub/x86_64-efi/command.lst
Read Request, File: /grub/x86_64-efi/fs.lst
Read Request, File: /grub/x86_64-efi/crypto.lst
Read Request, File: /grub/x86_64-efi/terminal.lst
Read Request, File: /grub/grub.cfg
Read Request, File: /grub/grub.cfg-<mac_address>
Read Request, File: bootx64.efi 
Acknowledgement, Block: 0
Acknowledgement, Block: 600+ 
Read Request, File: grubx64.efi
Acknowledgement, Block: 0
Acknowledgement, Block: 4476
########## problematic machine stops here #########
Read Request, File: /grub/x86_64-efi/command.lst
Read Request, File: /grub/x86_64-efi/fs.lst
Read Request, File: /grub/x86_64-efi/crypto.lst
Read Request, File: /grub/x86_64-efi/terminal.lst
Read Request, File: /grub/grub.cfg
Read Request, File: /grub/grub.cfg-<mac_address>

Any idea why it stopped requesting the grub files after block 4476?
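
For anyone comparing captures the same way, here is a minimal Python sketch that finds the first read request a failing boot never sends. It assumes the packet summaries were exported as text in the "Read Request, File: ..." form shown above; the function names and the shortened sample data are illustrative only, not part of MAAS or Wireshark.

```python
def read_requests(lines):
    """Return the file names from 'Read Request, File: ...' summary lines."""
    prefix = "Read Request, File: "
    return [ln.strip()[len(prefix):] for ln in lines
            if ln.strip().startswith(prefix)]


def first_divergence(good, bad):
    """Index of the first request in `good` that `bad` lacks or differs on.

    Returns None when every request in `good` also appears, in order, in `bad`.
    """
    for i, name in enumerate(good):
        if i >= len(bad) or bad[i] != name:
            return i
    return None


# Shortened, hypothetical sample modelled on the capture above.
good_capture = """\
Read Request, File: bootx64.efi
Read Request, File: grubx64.efi
Acknowledgement, Block: 4476
Read Request, File: /grub/x86_64-efi/command.lst
Read Request, File: /grub/grub.cfg
""".splitlines()

# The problematic machine stops after the final acknowledgement.
bad_capture = good_capture[:3]

good = read_requests(good_capture)
bad = read_requests(bad_capture)
idx = first_divergence(good, bad)
print("first missing request:", good[idx] if idx is not None else "none")
```

With the sample data, the first missing request is the one for `/grub/x86_64-efi/command.lst`, which matches the point where the problematic machine stops in the capture above.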

r00ta · 11 June 2025 10:04

Can you retry without bonding?

yings17 · 11 June 2025 10:41

I tried without bonding; it doesn’t work either.

r00ta · 11 June 2025 10:44

Can you share the tcpdump? Also, any interesting log in the rackd?

yings17 · 12 June 2025 06:39

Unfortunately, I am unable to share the tcpdump… I saw something in the rackd logs; not sure if this is normal:

Jun 12 10:32:00 maas rackd[2223964]: provisioningserver.rackdservices.tftp: [info] bootx64.efi requested by x.x.x.x
Jun 12 10:32:00 maas rackd[2223964]: provisioningserver.rackdservices.http: [info] /images/bootx64.efi requested by 127.0.0.1
Jun 12 10:32:00 maas rackd[2223964]: tftp.bootstrap: [debug] Got error: <tftp.datagram.ERRORDatagram object at 0x76143b1d78c0>
Jun 12 10:32:00 maas rackd[2223964]: tftp.protocol: [debug] Datagram received from ('x.x.x.x', 1712): <RRQDatagram(filename=b'bootx64.efi', mode=b'octet', options=OrderedDict({b'blksize': b'1468', b'windows': b'4'}))>
Jun 12 10:32:00 maas rackd[2223964]: provisioningserver.rackdservices.tftp: [info] bootx64.efi requested by x.x.x.x
Jun 12 10:32:00 maas rackd[2223964]: provisioningserver.rackdservices.http: [info] /images/bootx64.efi requested by 127.0.0.1
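
When the full capture cannot be shared, a summary of the rackd log can still be posted. The following is a rough Python sketch that counts TFTP requests per client and file, and counts TFTP error datagrams. The regular expressions assume the exact line layout shown in the excerpt above, which may differ between MAAS versions; `summarize` and the sample list are hypothetical names used for illustration.

```python
import re
from collections import Counter

# Patterns based only on the log excerpt above (an assumption, not a MAAS spec).
TFTP_REQ = re.compile(
    r"provisioningserver\.rackdservices\.tftp: \[info\] (\S+) requested by (\S+)"
)
TFTP_ERR = re.compile(r"tftp\.bootstrap: \[debug\] Got error")


def summarize(lines):
    """Count TFTP requests per (client, file) and the number of error datagrams."""
    requests = Counter()
    errors = 0
    for line in lines:
        match = TFTP_REQ.search(line)
        if match:
            filename, client = match.group(1), match.group(2)
            requests[(client, filename)] += 1
        elif TFTP_ERR.search(line):
            errors += 1
    return requests, errors


sample = [
    "Jun 12 10:32:00 maas rackd[2223964]: provisioningserver.rackdservices.tftp: [info] bootx64.efi requested by x.x.x.x",
    "Jun 12 10:32:00 maas rackd[2223964]: tftp.bootstrap: [debug] Got error: <tftp.datagram.ERRORDatagram object at 0x76143b1d78c0>",
    "Jun 12 10:32:00 maas rackd[2223964]: provisioningserver.rackdservices.tftp: [info] bootx64.efi requested by x.x.x.x",
]

requests, errors = summarize(sample)
for (client, filename), count in requests.items():
    print(f"{client} requested {filename} {count} time(s)")
print("error datagrams:", errors)
```

Running it over the whole log for one boot attempt would show whether any file after `grubx64.efi` was ever requested by the failing machine.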

r00ta · 12 June 2025 07:30

What’s the network topology? Do you use VXLAN or other protocols that are doing encapsulation? What’s the MTU of the interfaces and the switches?

yings17 · 12 June 2025 08:32

No VXLAN or encapsulation is in use. The MTU of the interfaces and switches is 1500.
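
As background to the MTU question, the `blksize` of 1468 negotiated in the rackd log above leaves no headroom on a 1500-byte MTU. A small Python sketch of the arithmetic, assuming IPv4 without IP options; the 50-byte overhead in the last line is a hypothetical encapsulation cost used only to illustrate the effect, not something measured on this network.

```python
IPV4_HEADER = 20       # bytes, assuming no IP options
UDP_HEADER = 8         # bytes
TFTP_DATA_HEADER = 4   # 2-byte opcode + 2-byte block number


def tftp_data_packet_size(blksize, extra_overhead=0):
    """Size in bytes of the IP packet carrying one full TFTP DATA block."""
    return blksize + TFTP_DATA_HEADER + UDP_HEADER + IPV4_HEADER + extra_overhead


def fits(blksize, mtu, extra_overhead=0):
    """True if a full DATA block fits in one packet without fragmentation."""
    return tftp_data_packet_size(blksize, extra_overhead) <= mtu


print(tftp_data_packet_size(1468))            # 1500
print(fits(1468, 1500))                       # True
print(fits(1468, 1500, extra_overhead=50))    # False (hypothetical overhead)
```

So with a plain 1500-byte path the full blocks fit exactly, while any extra per-packet overhead on the path would push them over the MTU.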