MAAS stuck on grub prompt during second reboot on deploy

Hi,

I’ve been trying to deploy my arm64 servers through MAAS and keep running into an issue where the machine gets stuck at a grub prompt and needs manual intervention (a reboot command via KVM). Once rebooted, it proceeds as it should.

In the UI logs I notice that when it fails, only the 3 requests below are logged:
TFTP Request - bootaa64.efi
TFTP Request - bootaa64.efi
TFTP Request - grubaa64.efi ← fails here and ends up in grub prompt

On reboot (the successful run) it goes on to request these additional files:
TFTP Request - /grub/grub.cfg-xx:xx:xx:xx:xx:xx
TFTP Request - /grub/grub.cfg
TFTP Request - /grub/arm64-efi/terminal.lst
/grub/arm64-efi/crypto.lst
/grub/arm64-efi/fs.lst
/grub/arm64-efi/command.lst
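
For anyone comparing the two runs, here is a minimal sketch of how the request lists could be diffed to see where the failing boot stops. The file names are copied from the lists above; the function name `missing_requests` is made up for illustration and is not part of MAAS.

```python
def missing_requests(success, failure):
    """Return the requests seen in the successful run but not in the failing one.

    Order follows the successful run, so the first element is the first
    file the failing boot never asked for.
    """
    seen = set(failure)
    return [name for name in success if name not in seen]


# Requests from the failing run, as listed in the UI log.
failing_run = ["bootaa64.efi", "bootaa64.efi", "grubaa64.efi"]

# Requests from the successful run: the same three plus the grub files.
successful_run = failing_run + [
    "/grub/grub.cfg-xx:xx:xx:xx:xx:xx",
    "/grub/grub.cfg",
    "/grub/arm64-efi/terminal.lst",
    "/grub/arm64-efi/crypto.lst",
    "/grub/arm64-efi/fs.lst",
    "/grub/arm64-efi/command.lst",
]

if __name__ == "__main__":
    for name in missing_requests(successful_run, failing_run):
        print(name)
```

The first missing entry is the per-MAC grub.cfg, which suggests grub was loaded but never got as far as fetching its configuration over the network.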

Any ideas why it may be failing/getting stuck here?

Stuck: [screenshot chrome_Cs47IMsnsW]

Success: [screenshot chrome_3uP1ZigkjH]

MAAS UI log (note the timestamp): [screenshot]

Are your machines set to UEFI? Is your MAAS ipmi boot type set to UEFI? These are my initial sanity checks when I run into this issue.

I believe it is already set to UEFI. MAAS also correctly set the power boot type as EFI when it commissions the machine.

Are there any BMC f/w updates available? What NIC are you using?

It’s an Intel I350 onboard NIC. I was going to test on a different connected NIC, but strangely enough, moving the connection to the 2nd eth port seems to have “fixed” the issue. I’m not entirely sure why.

Could’ve been a bad onboard port. That’s the story I’m going to run with =)
Glad you got it working!

I’m facing the same issue, and the machines are already configured to use UEFI. Is there a way to enable additional logging to help troubleshoot why it fails after the reboot?

Does it fail with the same message on the screen?

It’s stuck at the grub prompt with a slightly different message. It shows that it’s trying to PXE boot, and the last line is stuck at “Fetching Netboot Image”.

Could you describe the network port configuration of the machine currently being deployed? Specifically, is it using a single network interface card (NIC), or is it configured with dual NICs and port channeling?

I suspect the issue might be related to the machine’s network ports—perhaps a loose connection or a faulty cable. Running a tcpdump on the MAAS instance might also help us diagnose the traffic flow.

It’s using dual NICs configured with balance-alb bonding mode.

It would be worth disabling one NIC port at a time, and retrying a commission/deploy, to rule out any involvement of a faulty NIC/Cable.

Also, is this happening only on one machine, or all the machines?

I’d recommend capturing a tcpdump on the rackd side and checking what is happening on the wire between rackd and the target machine.

I tried disabling one NIC port at a time, to no avail. So I captured packets with Wireshark and noticed a difference between a successful run and a problematic one.

Read Request, File: bootx64.efi 
Acknowledgement, Block: 0
Acknowledgement, Block: 4000+ 
Read Request, File: /grub/x86_64-efi/command.lst
Read Request, File: /grub/x86_64-efi/fs.lst
Read Request, File: /grub/x86_64-efi/crypto.lst
Read Request, File: /grub/x86_64-efi/terminal.lst
Read Request, File: /grub/grub.cfg
Read Request, File: /grub/grub.cfg-<mac_address>
Read Request, File: bootx64.efi 
Acknowledgement, Block: 0
Acknowledgement, Block: 600+ 
Read Request, File: grubx64.efi
Acknowledgement, Block: 0
Acknowledgement, Block: 4476
########## problematic machine stops here #########
Read Request, File: /grub/x86_64-efi/command.lst
Read Request, File: /grub/x86_64-efi/fs.lst
Read Request, File: /grub/x86_64-efi/crypto.lst
Read Request, File: /grub/x86_64-efi/terminal.lst
Read Request, File: /grub/grub.cfg
Read Request, File: /grub/grub.cfg-<mac_address>

Any idea why it stopped requesting the grub files after block 4476?
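
One way to tell whether the grubx64.efi transfer actually finished is to compare the last acknowledged block with the file's size on the rack controller. Below is a rough sketch of that arithmetic. It assumes the last ACK in the capture is the final one for the transfer, that the block counter did not wrap (it is 16 bits wide), and that the block size is the 1468 bytes shown in the rackd log further down the thread. The capture summary does not confirm the block size for this particular file, so treat that as an assumption. The function names are hypothetical.

```python
def tftp_size_bounds(last_block: int, blksize: int) -> tuple[int, int]:
    """Return the (min, max) file size implied by the final ACKed block.

    In TFTP the last data block is shorter than blksize (possibly empty),
    so a transfer ending at block N carries between (N - 1) * blksize and
    N * blksize - 1 bytes.
    """
    if last_block < 1 or blksize < 1:
        raise ValueError("last_block and blksize must be positive")
    return (last_block - 1) * blksize, last_block * blksize - 1


def transfer_looks_complete(file_size: int, last_block: int, blksize: int) -> bool:
    """True if file_size is consistent with a transfer ending at last_block."""
    low, high = tftp_size_bounds(last_block, blksize)
    return low <= file_size <= high


if __name__ == "__main__":
    # Block number from the capture above; blksize is assumed, see the note.
    low, high = tftp_size_bounds(4476, 1468)
    print(f"file size should be between {low} and {high} bytes")
```

If the real size of grubx64.efi falls inside that range, the file was delivered in full and the stall happens inside grub after it has loaded, not during the TFTP download.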

Can you retry without bonding?

I tried without bonding; it doesn’t work either.

Can you share the tcpdump? Also, is there anything interesting in the rackd logs?

Unfortunately, I am unable to share the tcpdump. I did see something in the rackd logs, though I’m not sure if this is normal:

Jun 12 10:32:00 maas rackd[2223964]: provisioningserver.rackdservices.tftp: [info] bootx64.efi requested by x.x.x.x
Jun 12 10:32:00 maas rackd[2223964]: provisioningserver.rackdservices.http: [info] /images/bootx64.efi requested by 127.0.0.1
Jun 12 10:32:00 maas rackd[2223964]: tftp.bootstrap: [debug] Got error: <tftp.datagram.ERRORDatagram object at 0x76143b1d78c0>
Jun 12 10:32:00 maas rackd[2223964]: tftp.protocol: [debug] Datagram received from ('x.x.x.x', 1712): <RRQDatagram(filename=b'bootx64.efi', mode=b'octet', options=OrderedDict({b'blksize': b'1468', b'windows': b'4'}))>
Jun 12 10:32:00 maas rackd[2223964]: provisioningserver.rackdservices.tftp: [info] bootx64.efi requested by x.x.x.x
Jun 12 10:32:00 maas rackd[2223964]: provisioningserver.rackdservices.http: [info] /images/bootx64.efi requested by 127.0.0.1
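
For reading a capture alongside these logs, it may help to decode the read requests directly. The sketch below parses a TFTP RRQ packet following the RFC 1350 layout with RFC 2347 option pairs. It is not the parser rackd uses, and `parse_rrq` is a made-up name. The sample packet is built by hand from the values in the log line above, with the option names copied as they were pasted.

```python
import struct

RRQ_OPCODE = 1  # read request, per RFC 1350


def parse_rrq(packet: bytes):
    """Parse a TFTP read request into (filename, mode, options).

    Layout: 2-byte opcode, then NUL-terminated strings: filename, mode,
    and zero or more option/value pairs (RFC 2347).
    """
    if len(packet) < 4:
        raise ValueError("packet too short")
    (opcode,) = struct.unpack("!H", packet[:2])
    if opcode != RRQ_OPCODE:
        raise ValueError(f"not a read request (opcode {opcode})")
    fields = packet[2:].split(b"\x00")
    # A well-formed packet ends with NUL, so the final split piece is empty.
    if fields[-1] != b"":
        raise ValueError("packet is not NUL-terminated")
    fields = fields[:-1]
    if len(fields) < 2 or len(fields) % 2 != 0:
        raise ValueError("malformed field list")
    filename, mode = fields[0], fields[1].lower()
    # Option names are case-insensitive; values are ASCII strings.
    options = {fields[i].lower(): fields[i + 1] for i in range(2, len(fields), 2)}
    return filename, mode, options


if __name__ == "__main__":
    # Hand-built sample mirroring the logged request.
    sample = b"\x00\x01bootx64.efi\x00octet\x00blksize\x001468\x00windows\x004\x00"
    print(parse_rrq(sample))
```

Comparing the options on the request that succeeds with those on the one that stalls would show whether the client negotiates a different block size or window for grubx64.efi.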

What’s the network topology? Do you use VXLAN or other protocols that are doing encapsulation? What’s the MTU of the interfaces and the switches?

No VXLAN or other encapsulation is in use. The MTU of the interfaces and switches is 1500.
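
As a quick sanity check on why the MTU question matters: the blksize of 1468 in the rackd log is exactly what fills a 1500-byte MTU once the headers are added. A small sketch of the arithmetic follows, assuming IPv4 with no IP options. The function names are made up for illustration.

```python
IPV4_HEADER = 20       # bytes, assuming no IP options
UDP_HEADER = 8         # bytes
TFTP_DATA_HEADER = 4   # 2-byte opcode + 2-byte block number


def tftp_ip_packet_size(blksize: int) -> int:
    """Size of the IPv4 packet carrying one full TFTP data block."""
    return blksize + TFTP_DATA_HEADER + UDP_HEADER + IPV4_HEADER


def fits_without_fragmentation(blksize: int, mtu: int) -> bool:
    """True if a full data block fits in a single packet at this MTU."""
    return tftp_ip_packet_size(blksize) <= mtu


if __name__ == "__main__":
    size = tftp_ip_packet_size(1468)
    print(size, fits_without_fragmentation(1468, 1500))
```

With a 1500-byte MTU end to end there is no headroom left, so any hop with a smaller effective MTU would force fragmentation of every full data block. That would be worth ruling out in the capture.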