Latest Ubuntu 20.04 image causing netplan error

Hi

MAAS version: 3.1.0

We have a large working cluster. However, today we deployed for the first time in a month and the network config failed to come up properly. Comparing the logs with those from a machine with the same config that booted one month earlier:

Original working:

Cloud-init v. 21.4-0ubuntu1~20.04.1

2022-03-20 12:19:13,313 - activators.py[DEBUG]: Attempting command ['netplan', 'apply'] for device all
2022-03-20 12:19:13,313 - subp.py[DEBUG]: Running command ['netplan', 'apply'] with allowed return codes [0] (shell=False, capture=True)
2022-03-20 12:19:13,923 - util.py[DEBUG]: Writing to /etc/netplan/50-maas.yaml - wb: [644] 1257 bytes
2022-03-20 12:19:13,923 - util.py[DEBUG]: Changing the ownership of /etc/netplan/50-maas.yaml to 0:0

And today:

Cloud-init v. 22.1-14-g2e17a0d6-0ubuntu1~20.04.3

2022-04-20 11:04:44,026 - activators.py[DEBUG]: Attempting command ['netplan', 'apply'] for device all
2022-04-20 11:04:44,026 - subp.py[DEBUG]: Running command ['netplan', 'apply'] with allowed return codes [0] (shell=False, capture=True)
2022-04-20 11:04:44,450 - activators.py[WARNING]: Running ['netplan', 'apply'] resulted in stderr output: Failed to connect system bus: No such file or directory
Falling back to a hard restart of systemd-networkd.service
2022-04-20 11:04:44,746 - util.py[DEBUG]: Writing to /etc/netplan/50-maas.yaml - wb: [644] 1257 bytes
2022-04-20 11:04:44,746 - util.py[DEBUG]: Changing the ownership of /etc/netplan/50-maas.yaml to 0:0

This is causing issues because neither the IPs on the interfaces nor the routes are brought up. It looks like the latest image from maas.io is the problem.

I’m trying to drill down into what changed on the image to cause this.

Hi,

this is not something I’ve seen before. Is it happening on every deploy, or is it intermittent?

It would be good if you could share the network configuration for the machine and the cloud-init logs.

Hi

Spent a few days doing some digging. Here’s what I’ve found so far.

Background:

  • We operate ~10 separate MAAS instances in different locations, each with ~40 servers.
  • On all servers we create a bond between two interfaces, add a VLAN on top of that, then add a secondary IP to both the bond and the VLAN.
  • None of the servers we use with MAAS have disks; they are all netbooted and run from RAM.
  • We have a complex cloud-init shell script; however, I removed it while debugging this and it made no difference.

Example network config, all servers use a configuration like this:
https://gist.github.com/Wrhector/86acdd91b175f704b46f4227f27b9043 (some redaction of ips)
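For readers who cannot open the gist, the layout described in the background list (bond over two interfaces, a VLAN on the bond, extra addresses on both) has roughly the following shape in netplan terms. This is a hypothetical sketch, not the actual config: the interface names, VLAN ID, bond mode, and addresses (taken from documentation ranges) are all assumptions, and `build_netplan` is an illustrative helper.

```python
import json


def build_netplan(nics, bond="bond0", vlan_id=100,
                  bond_addrs=(), vlan_addrs=()):
    """Build a netplan-style dict: bond over `nics`, VLAN on the bond.

    Secondary addresses are simply additional entries in each
    `addresses` list.
    """
    vlan = f"{bond}.{vlan_id}"
    return {
        "network": {
            "version": 2,
            "ethernets": {nic: {} for nic in nics},
            "bonds": {
                bond: {
                    "interfaces": list(nics),
                    # Bond mode is an assumption; the thread does not state it.
                    "parameters": {"mode": "802.3ad"},
                    "addresses": list(bond_addrs),
                }
            },
            "vlans": {
                vlan: {
                    "id": vlan_id,
                    "link": bond,
                    "addresses": list(vlan_addrs),
                }
            },
        }
    }


# Hypothetical values for illustration only.
example = build_netplan(
    ["eno1", "eno2"],
    bond_addrs=["192.0.2.10/24", "2001:db8:0:1::10/64"],
    vlan_addrs=["198.51.100.10/24", "2001:db8:0:2::10/64"],
)
print(json.dumps(example, indent=2))
```

The printed JSON is only for inspecting the structure; the real file MAAS writes is `/etc/netplan/50-maas.yaml`.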

With 20.04.1 and before, there was no issue with this setup on any of our systems. However, since 20.04.3, none of the servers in two of our locations set up the network properly: the bonds get created, but no IPs are added, or only one of the IPv4 addresses is added.

The only change we can see in the cloud-init log is the aforementioned netplan error. Note, however, that this error appears in all locations; it only breaks network setup in these two.

Looking at what makes these locations different, the only thing I can see is that they are the only two where we use 10GBASE-T instead of SFPs and, as a result, different NICs and switches from the other locations.

On these two locations, I’ve done some experimenting, and this issue does not happen if we remove the aliases. It seems to be a problem with the IPv6 address aliases.

On to debugging the error:

Looking at a working location, and before the image update:
https://gist.github.com/Wrhector/ccc520d42d6159f2a4aa6679c8239311

And now an identical node in the same location with the latest image, also works:
https://gist.github.com/Wrhector/a53b9f32ec077b7b7e2e94844b676f75
/etc/netplan/50-maas.yaml:
https://gist.github.com/Wrhector/b1194bf6c2ac4775e265faac940ff6e2
And for later comparison, the systemd-networkd output:
https://gist.github.com/Wrhector/2c3afa145bebb75576ee7a6a779b331c

Now on a non-working node that previously worked: when booting, the bond comes up but no IPs are set. (Since there is no network access, I’m looking at the console.)

If I manually run an extra netplan apply after the init process is complete, the IPs then come up.
The cloud-init log for this is:
https://gist.github.com/Wrhector/b929052d25f55aa7a5deb03e8a0e6dc1
/etc/netplan/50-maas.yaml:
https://gist.github.com/Wrhector/5a9940780e87627fe5da49fe9b047d4b

There is also a new error in the systemd-networkd output (is this what netplan now falls back to after the previous error?):
https://gist.github.com/Wrhector/e89d567853ed76f1ee96a20c20c5d616

(The 11:39:16 time is when I manually ran netplan apply from the console to bring up the IPs.)

The fact that this only happens in these two locations is interesting. I’m wondering if the switches there are somehow slower to bring up the bond, which is causing the issue. Having said that, this wasn’t an issue on the previous image version, when the netplan apply that cloud-init ran didn’t fail.
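One way to test the slow-bond theory would be to record when the bond and its member links report as up, relative to the time cloud-init runs netplan apply. The sketch below parses the text the Linux bonding driver exposes under `/proc/net/bonding/<bond>`; `parse_bond_status` is an illustrative name, and any interface names passed to it are assumptions.

```python
def parse_bond_status(text):
    """Parse /proc/net/bonding/<bond> text into MII status per device.

    Returns {"bond": <status>, "slaves": {<ifname>: <status>}}.
    """
    status = {"bond": None, "slaves": {}}
    current = None  # None while still in the bond-level section
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank line or line without a "key: value" pair
        key, value = key.strip(), value.strip()
        if key == "Slave Interface":
            # Start of a per-member section.
            current = value
            status["slaves"][current] = None
        elif key == "MII Status":
            if current is None:
                status["bond"] = value
            else:
                status["slaves"][current] = value
    return status
```

Polling this once a second during boot and logging the result with a timestamp would show whether the members at the 10GBASE-T locations come up later than the point at which the addresses should have been applied.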

Also, 22.04 works fine; this is only an issue with 20.04, and it cropped up between 20.04.1 and 20.04.3. Unfortunately, since 22.04 is so new, we’re unable to switch to it yet.

@williammmllc, did you ever get any closure on this one?

Nope, I’m just using a dodgy hack that keeps running netplan apply until the interface IPs come up properly. Interestingly, once all the interfaces are eventually up properly, running netplan apply again breaks them, and running it a second time gets them working again.
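For anyone needing the same stopgap, a retry loop of this kind could be sketched as follows. This is not the poster's actual script: the function names, retry counts, and the expected-address mapping are assumptions. It reads addresses via `ip -j addr show` and checks before each apply, so it never runs netplan apply once everything is already up.

```python
import json
import subprocess
import time


def parse_addresses(ip_json):
    """Extract the set of addresses from `ip -j addr show` JSON output."""
    return {
        info["local"]
        for link in json.loads(ip_json)
        for info in link.get("addr_info", [])
        if "local" in info
    }


def current_addresses(ifname):
    """Return the addresses currently configured on `ifname`."""
    out = subprocess.run(
        ["ip", "-j", "addr", "show", "dev", ifname],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_addresses(out)


def netplan_apply():
    """Run `netplan apply`, ignoring a non-zero exit status."""
    subprocess.run(["netplan", "apply"], check=False)


def apply_until_up(expected, get_addrs=current_addresses,
                   apply=netplan_apply, attempts=10, delay=5.0,
                   sleep=time.sleep):
    """Call `apply()` until every interface has its expected addresses.

    expected: dict mapping interface name -> set of addresses.
    Returns the number of apply calls made; raises TimeoutError if the
    addresses are still missing after `attempts` applies.
    """
    for attempt in range(attempts + 1):
        missing = {
            ifname: addrs - get_addrs(ifname)
            for ifname, addrs in expected.items()
        }
        if not any(missing.values()):
            return attempt
        if attempt == attempts:
            break
        apply()
        sleep(delay)
    raise TimeoutError(f"addresses still missing: {missing}")
```

A call such as `apply_until_up({"bond0": {"192.0.2.10"}})` (hypothetical interface and address) would need to run as root after cloud-init has finished.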

Okay, please go ahead and file a bug. It might not be a bug, but if you’ve stumped two MAAS engineers already (Bjorn and me), we should treat it as a bug until we find out otherwise.