Marking node failed - Installation failed (refer to the installation log for more information)

chinsu50g · 31 July 2024 19:56

I am trying to deploy newer machines (OptiPlex Small Form Factor Plus 7010). All of them fail deployment after the “Loading ephemeral” stage with: Marking node failed - Installation failed (refer to the installation log for more information).

OS: Ubuntu 20.04 LTS focal
I tried all of the available kernel versions
Tried a few BIOS versions

Here is one of the logs in regiond

regiond: [info] 127.0.0.1 POST /MAAS/metadata/status/[systemID] HTTP/1.1 --> 204 NO_CONTENT (referrer: -; agent: python-requests/2.22.0)

There is no other information I could find

Any input is appreciated.

Update: rackd-related logs

maasserver.ipc: [info] Worker pid:x lost burst connection to ('10.7.x.x', 5252).

RegionServer,x,::ffff:10.6.x.2: [info] RegionServer connection lost (HOST:IPv6Address(type='TCP', host='::ffff:10.7.x.x', port=5252, flowInfo=0, scopeID=0) PEER:IPv6Address(type='TCP', host='::ffff:10.6.x.2', port=38132, flowInfo=0, scopeID=0))

maasserver.ipc: [info] Worker pid:195808x lost burst connection to ('10.7.x.x', 5252).

r00ta · 31 July 2024 20:03

I’d suggest monitoring what happens on the target machine by looking at its serial console.

chinsu50g · 31 July 2024 21:48

I’m not sure how to look at its serial console. In the meantime, I have added more logs from rackd to see if it helps narrow down anything.

r00ta · 1 August 2024 05:40

nope, these messages are harmless

chinsu50g · 2 August 2024 16:09

Update: I have now encountered these errors when trying to erase the disk.

My MAAS version is 3.4.3 and I am using snap.

HTTP Request - /images/ubuntu/amd64/no-such-kernel/focal/no-such-image/boot-kernel
 Wed, 31 Jul. 2024 20:12:18 Marking node failed - Missing boot image ubuntu/amd64/no-such-kernel/focal.
 Wed, 31 Jul. 2024 20:12:17 Performing PXE boot
 Wed, 31 Jul. 2024 20:12:17 PXE Request - commissioning
 Wed, 31 Jul. 2024 20:12:17 TFTP Request - /grub/grub.cfg
 Wed, 31 Jul. 2024 20:12:17 TFTP Request - /grub/x86_64-efi/terminal.lst
 Wed, 31 Jul. 2024 20:12:17 TFTP Request - /grub/x86_64-efi/fs.lst
 Wed, 31 Jul. 2024 20:12:17 TFTP Request - /grub/grub.cfg-cc:96:xx:32:xx:xx
 Wed, 31 Jul. 2024 20:12:17 TFTP Request - /grub/x86_64-efi/crypto.lst
 Wed, 31 Jul. 2024 20:12:17 TFTP Request - /grub/x86_64-efi/command.lst
 Wed, 31 Jul. 2024 20:12:12 TFTP Request - grubx64.efi
 Wed, 31 Jul. 2024 20:12:12 TFTP Request - bootx64.efi
 Wed, 31 Jul. 2024 20:12:12 TFTP Request - bootx64.efi
 Wed, 31 Jul. 2024 20:11:53 Node powered on
 Wed, 31 Jul. 2024 20:11:46 Power cycling
 Wed, 31 Jul. 2024 20:11:46 Node - Started releasing [device]
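
The event log above is listed newest-first, which makes the sequence hard to follow. Below is a minimal sketch, assuming the log has been copied into a plain-text string in the same format, that puts the events in chronological order and picks out the "Marking node failed" entries. The function names (`parse_events`, `find_failures`, `placeholder_fields`) are hypothetical helpers, not part of MAAS. Treating `no-such-kernel` and `no-such-image` as placeholder values that show up when a kernel or image could not be resolved is an assumption based on how the path reads.

```python
import re

# Matches lines such as:
#   " Wed, 31 Jul. 2024 20:12:18 Marking node failed - ..."
# Lines without a leading timestamp are skipped.
EVENT_RE = re.compile(
    r"^\s*\w{3}, (?P<ts>\d{1,2} \w{3}\. \d{4} \d{2}:\d{2}:\d{2}) (?P<msg>.+)$"
)


def parse_events(text):
    """Return (timestamp, message) tuples, oldest first.

    Assumes the input is newest-first, as in the pasted log, so the
    parsed list is simply reversed rather than sorted by date.
    """
    events = []
    for line in text.splitlines():
        match = EVENT_RE.match(line)
        if match:
            events.append((match.group("ts"), match.group("msg").strip()))
    events.reverse()
    return events


def find_failures(events):
    """Keep only the events where the node was marked as failed."""
    return [(ts, msg) for ts, msg in events if msg.startswith("Marking node failed")]


def placeholder_fields(message):
    """List any 'no-such-*' tokens found in a message."""
    return re.findall(r"no-such-[a-z]+", message)


SAMPLE = """\
 Wed, 31 Jul. 2024 20:12:18 Marking node failed - Missing boot image ubuntu/amd64/no-such-kernel/focal.
 Wed, 31 Jul. 2024 20:12:17 Performing PXE boot
 Wed, 31 Jul. 2024 20:11:53 Node powered on
"""

if __name__ == "__main__":
    for ts, msg in find_failures(parse_events(SAMPLE)):
        print(ts, msg, placeholder_fields(msg))
```

Run against the full log, this shows that the failure is recorded one second after the PXE request for commissioning, and that the kernel component of the requested boot image is `no-such-kernel`.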

I believe this may be similar to Bug #2013529, “Nodes stuck in Failed Disk Erasing due to wrong ip...”, as the workstations with the issue are new vPro workstations.

The deployment issue is also similar to https://bugs.launchpad.net/maas/+bug/1908452.

Am I missing something?