On first reboot after install, node fails to bring up its network interface

I have MAAS 3.2 set up for a small cluster of 7 physical servers. MAAS has two rack controllers and one region controller. The servers are SuperMicro X9 motherboards with IPMI power control and 10 Gbit NICs.

For the most part it seems to work fine. When deploying new servers, the installation completes successfully and the server reboots; however, about half the time the server then fails to obtain its cloud-init config from MAAS and I am left with an unusable server I can’t log in to. The rest of the time the installation/reboot works fine (same servers). In other words, if I repeat the deployment enough times I get a working server, but that is annoying enough that it would be nice to find a fix. Rebooting the server afterwards doesn’t seem to fix it either (maybe because MAAS is no longer setting the appropriate boot parameters via IPMI before the reboot?).

As a potential culprit, I see an error about renaming the NIC in the boot log, and the server hangs for quite a while on “Wait for Network to be Configured”:

FWIW, I can see from the switch that the server brings its network link up and down a couple of times during this process. Following that error, the server doesn’t get an IP address and therefore can’t reach MAAS:
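As an aside, the NIC rename error is typically a race between the kernel’s provisional interface name and udev’s predictable-name rename. One way to take the rename out of the equation is a systemd `.link` file that pins the name by MAC address. This is only a sketch: the MAC below is the one that appears in the iPXE requests later in this thread, and the target name `eno1` is an assumption for your hardware:

```
# /etc/systemd/network/10-pin-nic-name.link
# (sketch: MAC and interface name are placeholders for your hardware)
[Match]
MACAddress=d2:74:2c:f3:98:61

[Link]
Name=eno1
```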

I can’t reproduce these issues in rescue mode (the NIC stays up 100% of the time) and the entire installation process succeeds every time without a hitch. The issue only occurs during the first post-installation reboot.

The machine also passes the rack connectivity test every time; however, I do see some connection errors on the console while this test is running (with the retries it succeeds despite the errors… or else it’s trying multiple NICs, I’m not sure):

Any thoughts on what might be causing this or how I could debug it further?

This is still happening, and with greater frequency than (I thought) it was before. Some machines I can’t boot at all any more, and it’s not limited to the final post-deploy reboot. One machine will not complete the initial pre-commissioning (inventory?) PXE boot at all.

I’ve tried various versions of Ubuntu, and 18.04 gives a perhaps more useful error message than the others (I can’t catch it all due to scrolling in the console, but here are the two frames I was able to capture by taking a screen recording of the IPMI console):

Does anyone know what might be causing this? Shortly before this error, it retrieved the root image just fine, so I am not sure why it could not reach the cloud-init data source at this point…

As a further point of confusion, it looks like the server has an IP address and responds to pings immediately upon configuring its network interface to download the root image, and keeps that IP address the entire time it’s “Waiting for Network…”

The network config appears valid as far as I can see. What could Ubuntu be waiting for here that it does not have already?
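For what it’s worth, on 20.04 the “Wait for Network to be Configured” job is `systemd-networkd-wait-online.service`, which blocks boot until networkd reports an interface as configured (by default it waits on any managed link, with a timeout of roughly two minutes). If a rename race is leaving the NIC in a state networkd never considers configured, one mitigation is a drop-in that restricts the wait to a single interface and shortens the timeout. A sketch only: the interface name `eno1` and the 30-second timeout are assumptions:

```
# /etc/systemd/system/systemd-networkd-wait-online.service.d/override.conf
# (sketch: interface name and timeout are assumptions)
[Service]
ExecStart=
ExecStart=/lib/systemd/systemd-networkd-wait-online --interface=eno1 --timeout=30
```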

I actually do see the same error in this screen recording (20.04) as I saw in 18.04 (perhaps I didn’t wait long enough the first time).

Another clue: the “wait for network to be configured” delay seems like it might be a problem, but not the main one. On another run, I can see the server trying to reach MAAS at the server’s own IP! At least, 10.30.30.234 is not an IP address of any of my controllers.

I’ve tried looking through the code to see where this bogus URL might be coming from, but I’m running out of ideas. It looks strikingly similar to this Fedora / cloud-init bug from 2018: https://bugzilla.redhat.com/show_bug.cgi?id=1558641


Using tcpdump to track what HTTP GET requests the server was making:

tcpdump -i tap102i0 -s 0 -A 'tcp[((tcp[12:1] & 0xf0) >> 2):4] = 0x47455420'
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on tap102i0, link-type EN10MB (Ethernet), snapshot length 262144 bytes

20:03:02.954744 IP 10.30.30.246.36801 > 10.30.30.10.5248: Flags [P.], seq 1907014505:1907014613, ack 2911629693, win 512, options [nop,nop,TS val 185696 ecr 625900672], length 108
E.......@.!T
...
..
....q..i...}....l......
...`%N|.GET /ipxe.cfg HTTP/1.1
Connection: keep-alive
User-Agent: iPXE/1.20.1+ (g4bd0)
Host: 10.30.30.10:5248


20:03:02.966785 IP 10.30.30.246.36801 > 10.30.30.10.5248: Flags [P.], seq 108:244, ack 396, win 512, options [nop,nop,TS val 185696 ecr 625900680], length 136
E...	...@..6
...
..
....q...........g......
...`%N|.GET /ipxe.cfg-d2%3A74%3A2c%3Af3%3A98%3A61 HTTP/1.1
Connection: keep-alive
User-Agent: iPXE/1.20.1+ (g4bd0)
Host: 10.30.30.10:5248


20:03:03.065582 IP 10.30.30.246.36801 > 10.30.30.10.5248: Flags [P.], seq 244:366, ack 572, win 512, options [nop,nop,TS val 185808 ecr 625900736], length 122
E.......@..A
...
..
....q..]...............
....%N|.GET /ipxe.cfg-default-amd64 HTTP/1.1
Connection: keep-alive
User-Agent: iPXE/1.20.1+ (g4bd0)
Host: 10.30.30.10:5248


20:03:03.200964 IP 10.30.30.246.36801 > 10.30.30.10.5248: Flags [P.], seq 366:519, ack 1332, win 512, options [nop,nop,TS val 185920 ecr 625900874], length 153
E.......@..0
...
..
....q...........i......
...@%N}JGET /images/ubuntu/amd64/ga-20.04/focal/stable/boot-kernel HTTP/1.1
Connection: keep-alive
User-Agent: iPXE/1.20.1+ (g4bd0)
Host: 10.30.30.10:5248


20:03:03.315579 IP 10.30.30.246.36801 > 10.30.30.10.5248: Flags [P.], seq 519:672, ack 13662018, win 512, options [nop,nop,TS val 186032 ecr 625900987], length 153
E...i...@.."
...
..
....q..p.\h............
....%N}.GET /images/ubuntu/amd64/ga-20.04/focal/stable/boot-initrd HTTP/1.1
Connection: keep-alive
User-Agent: iPXE/1.20.1+ (g4bd0)
Host: 10.30.30.10:5248




20:03:10.364923 IP 10.30.30.246.56428 > 10.30.30.10.5248: Flags [P.], seq 3973909383:3973909512, ack 1459785900, win 502, options [nop,nop,TS val 1006103140 ecr 625908082], length 129
E...z.@.@.n5
...
..
.l......W.......Q......
;..d%N.rGET /images/ubuntu/amd64/ga-20.04/focal/stable/squashfs HTTP/1.1
Host: 10.30.30.10:5248
User-Agent: Wget
Connection: close


20:03:20.543425 IP 10.30.30.246.57710 > 10.30.30.10.5248: Flags [P.], seq 1726640279:1726640494, ack 556370768, win 502, options [nop,nop,TS val 1006113318 ecr 625918261], length 215
E.....@.@.(.
...
..
.n..f.p.!).P....R9.....
;..&%N.5GET /MAAS/metadata/latest/enlist-preseed/?op=get_enlist_preseed HTTP/1.1
Host: 10.30.30.10:5248
User-Agent: Cloud-Init/22.2-0ubuntu1~20.04.3
Accept-Encoding: gzip, deflate
Accept: */*
Connection: keep-alive


20:03:23.065662 IP 10.30.30.246.57712 > 10.30.30.10.5248: Flags [P.], seq 212885926:212886141, ack 2302260665, win 502, options [nop,nop,TS val 1006115840 ecr 625920783], length 215
E.....@.@...
...
..
.p....a..9......R9.....
;...%N..GET /MAAS/metadata/latest/enlist-preseed/?op=get_enlist_preseed HTTP/1.1
Host: 10.30.30.10:5248
User-Agent: Cloud-Init/22.2-0ubuntu1~20.04.3
Accept-Encoding: gzip, deflate
Accept: */*
Connection: keep-alive

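In case the capture filter above looks opaque: `(tcp[12:1] & 0xf0) >> 2` computes the TCP header length (the upper nibble of byte 12 is the data offset in 32-bit words, and the shift multiplies it by four), so the expression matches segments whose payload begins with the four bytes `0x47455420`, which is just ASCII “GET ”:

```shell
# 0x47455420 is the bytes 0x47 0x45 0x54 0x20, i.e. ASCII "GET "
# (octal escapes used here for printf portability: \107 \105 \124 \040)
printf '\107\105\124\040'
```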
I can see that it does download an “enlist-preseed” document that includes the client IP of whoever made the request. E.g., when I download the same URL from my laptop (connected via VPN), I can see the response includes a metadata_url containing my VPN client IP. Is this correct? Shouldn’t it point to the rack controller instead? Or is there some other magic that’s supposed to happen and is not working here?

curl 'http://10.30.30.10:5248/MAAS/metadata/latest/enlist-preseed/?op=get_enlist_preseed'
#cloud-config
apt:
  preserve_sources_list: false
  primary:
  - arches:
    - amd64
    - i386
    uri: http://archive.ubuntu.com/ubuntu
  - arches:
    - default
    uri: http://ports.ubuntu.com/ubuntu-ports
  proxy: http://10.30.30.10:8000/
  security:
  - arches:
    - amd64
    - i386
    uri: http://archive.ubuntu.com/ubuntu
  - arches:
    - default
    uri: http://ports.ubuntu.com/ubuntu-ports
  sources_list: 'deb $PRIMARY $RELEASE multiverse restricted main universe

    # deb-src $PRIMARY $RELEASE multiverse restricted main universe

    deb $PRIMARY $RELEASE-updates multiverse restricted main universe

    # deb-src $PRIMARY $RELEASE-updates multiverse restricted main universe

    deb $PRIMARY $RELEASE-backports multiverse restricted main universe

    # deb-src $PRIMARY $RELEASE-backports multiverse restricted main universe

    deb $SECURITY $RELEASE-security multiverse restricted main universe

    # deb-src $SECURITY $RELEASE-security multiverse restricted main universe

    '
datasource:
  MAAS:
    metadata_url: http://<snip-my-vpn-client-ip>:5248/MAAS/metadata/
manage_etc_hosts: true
packages:
- python3-yaml
- python3-oauthlib
power_state:
  condition: test ! -e /tmp/block-poweroff
  delay: now
  mode: poweroff
  timeout: 1800
rsyslog:
  remotes:
    maas: <snip>

I believe it’s this bug in MAAS 3.2.5: https://bugs.launchpad.net/maas/+bug/1989970

Until it’s fixed, the workaround appears to be to clear any custom DNS servers set on the subnet that is not working.
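If you’d rather script that than click through the UI, something along these lines should work with the MAAS CLI. A sketch only: `admin` is an assumed profile name and the subnet ID `2` is a placeholder — read the subnet list first to find the real one:

```shell
# List subnets to find the ID of the affected one
# ("admin" profile name is an assumption)
maas admin subnets read

# Clear the custom DNS servers on that subnet (ID 2 is a placeholder)
maas admin subnet update 2 dns_servers=""
```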
