I have some machines connected to MaaS that have been working fine for over a year. I wanted to wipe and reconfigure them, but the erase failed. I tried deleting the three nodes from MaaS and manually erasing their disks, but PXE boot now fails: each node gets a DHCP address at first, contacts MaaS, and gets the initial bootloader, but then it fails to configure the network. It does not get an IP address, falls back to a default (which is incorrect, on the wrong subnet), and then cannot download the ephemeral image to continue with commissioning.
Absolutely nothing has changed in my network here, so I’m assuming it is something to do with the image it’s using to initialize, but I cannot figure out how to force it to try one of the older ones.
I’m using MaaS for DHCP too, not anything external. I just updated to 3.5.1 in case that makes a difference, but it didn’t; the behavior is the same. All three nodes do the same thing.
The only place I can see to set parameters is the global kernel parameters box. If I enter something like:
ip=192.168.2.16::192.168.2.1:255.255.255.0
then it works… but I have never had to pass such arguments before, and of course that doesn’t scale because I would need to set a different argument for each node. I need to know how to fix it properly.
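To illustrate why this workaround doesn't scale, here is a minimal Python sketch that builds one such `ip=` argument per node. It assumes the field order of the kernel's `ip=` parameter (`<client-ip>:<server-ip>:<gw-ip>:<netmask>`, with the server field left empty as in the line above). The function name and the addresses for the second and third node are hypothetical, made up for illustration only.

```python
from ipaddress import IPv4Address, IPv4Network


def static_ip_param(client: str, gateway: str, network: str) -> str:
    """Build an ip= kernel argument in the same shape as the one above.

    Field order assumed: <client-ip>:<server-ip>:<gw-ip>:<netmask>,
    with <server-ip> left empty.
    """
    net = IPv4Network(network)
    # Refuse addresses that are not on the given subnet.
    for addr in (client, gateway):
        if IPv4Address(addr) not in net:
            raise ValueError(f"{addr} is not inside {net}")
    return f"ip={client}::{gateway}:{net.netmask}"


# The first address is the one from the post; the other two are
# hypothetical placeholders for the remaining nodes.
NODE_ADDRESSES = ["192.168.2.16", "192.168.2.17", "192.168.2.18"]

for address in NODE_ADDRESSES:
    print(static_ip_param(address, "192.168.2.1", "192.168.2.0/24"))
```

Each node ends up with a different string, which is exactly what a single global kernel parameters box cannot express.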
No, these are amd64 machines… Supermicro motherboards, if that matters. I could get model numbers for the NICs too if it matters, but they were working without issue until recently, so I assume they are supported.
Edit: Looks like the NIC it is using is an Intel I219-LM, and all three of the machines are identical.
Ubuntu 22.04 is the eventual target OS, but this is just commissioning so far. I’ve been using 22.04 for the installed OS on the hard drive too, but I tried 20.04, 22.04 and 24.04 for the ephemeral image when commissioning wasn’t working, and none of them work. I don’t see any way to tell it to use an older release of one of those. For example, there is a 20240524 release under 22.04, but I don’t see where I could force that one to be used instead of the latest. And really it may be the bootloader that is the issue rather than the ephemeral image, since it says “Booting under MaaS Direction” and never gets to the part where it loads the ephemeral image. Either way, I don’t see a way to change which version is chosen from the repo; it just takes the latest.
Whatever the case, it seems like something has changed in one or more recent releases. It looks like the last time I installed from scratch and erased the machines was back in February (edit: no, it was April or May actually; February is just when MaaS was last reinstalled from scratch on our admin infra).
Oct 16 23:18:28 titus dhcpd[2160]: DHCPDISCOVER from <mac addr redacted> (maas-enlist) via br0
Oct 16 23:18:29 titus maas-http[1952]: 127.0.0.1 - - [16/Oct/2024:23:18:29 +0000] "GET /MAAS/rpc/ HTTP/1.1" 200 188 "-" "provisioningserver.rpc.clusterservice.ClusterClientService"
Oct 16 23:18:29 titus dhcpd[2160]: DHCPOFFER on 192.168.2.16 to <mac addr redacted> via br0
Oct 16 23:18:32 titus dhcpd[2160]: execute_statement argv[0] = /snap/maas/36889/usr/sbin/maas-dhcp-helper
Oct 16 23:18:32 titus dhcpd[2160]: execute_statement argv[1] = notify
Oct 16 23:18:32 titus dhcpd[2160]: execute_statement argv[2] = --action
Oct 16 23:18:32 titus dhcpd[2160]: execute_statement argv[3] = commit
Oct 16 23:18:32 titus dhcpd[2160]: execute_statement argv[4] = --mac
Oct 16 23:18:32 titus dhcpd[2160]: execute_statement argv[5] = <mac addr redacted>
Oct 16 23:18:32 titus dhcpd[2160]: execute_statement argv[6] = --ip-family
Oct 16 23:18:32 titus dhcpd[2160]: execute_statement argv[7] = ipv4
Oct 16 23:18:32 titus dhcpd[2160]: execute_statement argv[8] = --ip
Oct 16 23:18:32 titus dhcpd[2160]: execute_statement argv[9] = 192.168.2.16
Oct 16 23:18:32 titus dhcpd[2160]: execute_statement argv[10] = --lease-time
Oct 16 23:18:32 titus dhcpd[2160]: execute_statement argv[11] = 30
Oct 16 23:18:32 titus dhcpd[2160]: execute_statement argv[12] = --hostname
Oct 16 23:18:32 titus dhcpd[2160]: execute_statement argv[13] = (none)
Oct 16 23:18:32 titus dhcpd[2160]: execute_statement argv[14] = --socket
Oct 16 23:18:32 titus dhcpd[2160]: execute_statement argv[15] = /var/snap/maas/common/maas/dhcpd.sock
Oct 16 23:18:32 titus dhcpd[2160]: DHCPREQUEST for 192.168.2.16 (192.168.2.2) from <mac addr redacted> via br0
Oct 16 23:18:32 titus dhcpd[2160]: DHCPACK on 192.168.2.16 to <mac addr redacted> via br0
That is the initial hit… then some TFTP traffic to get the bootloader, etc.
And then it gets the UEFI stuff and grub.conf, and then the bootloader wants to do whatever its process is with interfaces, so it releases the IP it already had and then this happens:
Oct 16 23:19:02 titus dhcpd[2160]: DHCPDISCOVER from <mac addr redacted> via br0
Oct 16 23:19:02 titus dhcpd[2160]: DHCPOFFER on 192.168.2.16 to <mac addr redacted>(maas-enlist) via br0
Oct 16 23:19:02 titus dhcpd[2160]: DHCPREQUEST for 192.168.1.100 (192.168.1.1) from <mac addr redacted> via br0: ignored (not authoritative).
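One way to spot this kind of line in a longer log is sketched below in Python. It rests on an assumption about the ISC dhcpd log format: that the address in parentheses after the requested IP is the identifier of the DHCP server the client picked. If that holds, a request for an address outside the managed subnet names the server that made the competing offer. The function name and regular expression are illustrative, not part of MaaS or dhcpd.

```python
import re
from ipaddress import ip_address, ip_network

# Matches e.g. "DHCPREQUEST for 192.168.1.100 (192.168.1.1) from ..."
# The parenthesised server address is treated as optional.
REQUEST_RE = re.compile(
    r"DHCPREQUEST for (?P<requested>\d+\.\d+\.\d+\.\d+)"
    r"(?: \((?P<server>\d+\.\d+\.\d+\.\d+)\))?"
)


def foreign_requests(lines, managed_subnet):
    """Return (requested_ip, server_ip) pairs for requests outside managed_subnet."""
    net = ip_network(managed_subnet)
    hits = []
    for line in lines:
        match = REQUEST_RE.search(line)
        if match is None:
            continue
        if ip_address(match["requested"]) not in net:
            hits.append((match["requested"], match["server"]))
    return hits


# The two DHCPREQUEST lines quoted in this thread.
SAMPLE = [
    "Oct 16 23:18:32 titus dhcpd[2160]: DHCPREQUEST for 192.168.2.16 "
    "(192.168.2.2) from <mac addr redacted> via br0",
    "Oct 16 23:19:02 titus dhcpd[2160]: DHCPREQUEST for 192.168.1.100 "
    "(192.168.1.1) from <mac addr redacted> via br0: ignored (not authoritative).",
]

print(foreign_requests(SAMPLE, "192.168.2.0/24"))
# prints [('192.168.1.100', '192.168.1.1')]
```

Under that assumption, the second request was answering an offer from 192.168.1.1, which is not the MaaS DHCP server (192.168.2.2 in the earlier DHCPREQUEST line).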
There is one region/rack… this is a dev environment where I test stuff before doing anything to prod (thank goodness I am testing lol).
There is an admin machine that is just a single instance of MaaS running, and it is connected to a switch where there are three nodes that I build into a Kubernetes cluster… MaaS deploys the OS, and then Juju takes them from there (but still uses MaaS) to make them k8s nodes.
That switch has a Netgate router above it too, which I use to connect to the whole thing remotely. And yes, I’ve verified nothing else is running DHCP on there. MaaS is the only DHCP server.
No, I updated to 3.5.1 after it was already not working; I was hoping that maybe there was a bug that would be fixed by the upgrade. It was running 3.4.4 before that and the behavior is identical. 3.4.4 is also the version it was running when the nodes were last rebuilt, and it worked fine then, which is why I suspect image issues, since those were updated a few times in the interim.
Could you better describe what’s actually happening on the machine after you see the ‘booting under maas direction’ message? Have you checked in the MAAS logs for anything interesting?
Yes, I posted log snippets above. It is difficult to see the console on the target machine, though, because I’m remote. The BMC has a video capture tool but it doesn’t work, so I recorded a video with my phone and managed to see output that aligns with the log messages, but it flashes by so fast it’s really hard to catch.
On the console it says:
Begin: Running /scripts/init/init-premount
Begin: Running Mounting root filesystem .. Begin: Running /scripts/local-top ... Begin: waiting up to 180 secs for eno1 to become available ... done.
IP-Config: eno1 hardware address <mac addr redacted> mtu 1500 DHCP RARP
IP-Config: no response after 2 secs - giving up
IP-Config: eno1 hardware address <mac addr redacted> mtu 1500 DHCP RARP
[ 9.062249] e1000e 0000:001f.6 eno1: NIC Link is up 1000 Mbps Full Duplex, Flow Control: None
[ 9.068143] IPv6: ADDRCONF (NETDEV_CHANGE): eno1: link becomes ready
hostname apollo IP-Config: no response after 3 secs - giving up
IP-Config: eno1 hardware address <mac addr redacted> mtu 1500 DHCP RARP
hostname aollo hostname apollo hostname apollo IP-Config: no response after 4 secs - giving up
IP-Config: eno1 hardware address <mac addr redacted> mtu 1500 DHCP RARP
hostname aollo hostname apollo hostname apollo IP-Config: eno1 guessed broadcast address 192.168.1.255
IP-Config: eno1
I dunno what happens after that, because that is all the characters I could get from the cell phone video lol… and then it drops into initramfs.
The images were updated in August and no issues have been reported so far. If you want to use older images, you can create a mirror of images.maas.io containing only specific images and point your MAAS at it.
Yeah, I eventually did that and was finally able to force a custom image in there, using code I found in a public repo that looks like the same one used to publish to the official MaaS ephemeral repo. With a bit of debugging, it turns out there was in fact another device somewhere on the 192.168.1.0/24 subnet. When the node doesn’t get a response from its proper subnet fast enough for its liking, it falls back to guessing, and it gets a DHCP offer from that device!! I located the offending device: it was connected to one of our other dev machines via a USB port, and I was able to get in and disable it remotely, so now the node no longer gets any response from the rogue device. When I was checking for extra DHCP servers before, I hadn’t considered that there could even be other subnets.
So the official image works now too… the random guessing of other subnets is probably not something that should be happening, though, and I dunno if that is new behavior or not.