Hi
Spent a few days doing some digging. Here’s what I’ve found so far.
Background:
- We operate ~10 separate MAAS instances in different locations, each with ~40 servers.
- On every server we create a bond between two interfaces, add a VLAN on top of the bond, then add a secondary IP to both the bond and the VLAN.
- None of the servers we use MAAS with have disks; they are all netbooted and run from RAM.
- We have a complex cloud-init shell script, but I removed it while debugging this and nothing changed.
Example network config (all servers use a configuration like this):
https://gist.github.com/Wrhector/86acdd91b175f704b46f4227f27b9043 (some IPs redacted)
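For anyone who doesn’t want to open the gist, the layout described in the list above has roughly the following shape. This is only an illustrative sketch: the interface names, VLAN id and addresses are placeholders (documentation ranges), not the real values, and it is printed as JSON purely to show the nesting of the netplan keys.

```python
import json

# Hypothetical names, VLAN id and addresses; the real (redacted) values are in the gist.
config = {
    "network": {
        "version": 2,
        # The two physical interfaces that get bonded.
        "ethernets": {"eno1": {}, "eno2": {}},
        "bonds": {
            "bond0": {
                "interfaces": ["eno1", "eno2"],
                # Primary address plus an extra (alias) address on the bond.
                "addresses": ["192.0.2.10/24", "2001:db8:0:1::10/64"],
            }
        },
        "vlans": {
            "bond0.100": {
                "id": 100,
                "link": "bond0",
                # Primary address plus an extra (alias) address on the VLAN.
                "addresses": ["198.51.100.10/24", "2001:db8:0:2::10/64"],
            }
        },
    }
}

print(json.dumps(config, indent=2))
```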
With 20.04.1 and earlier, this setup worked without issue on all of our systems. Since 20.04.3, however, none of the servers in two of our locations set up the network properly: the bonds get created, but either no IPs are added or only one of the IPv4 addresses is added.
The only change we can see in the cloud-init log is the aforementioned netplan error. Note, however, that this error shows up in all locations; it only breaks network creation in these two.
Looking at what makes these locations different, the only thing I can see is that they are the only two where we use 10GBASE-T instead of SFPs, and as a result they use different NICs/switches from the other locations.
In these two locations I’ve done some experimenting, and the issue does not happen if we remove the aliases. It seems to be a problem with the IPv6 address aliases.
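To make the "no IPs, or just one of the IPv4" symptom easier to spot per node, a check along these lines could compare the addresses expected on each interface with what `ip -j addr show` reports. It is a sketch under assumptions: the interface names and addresses are placeholders, not the real config, and the sample JSON is trimmed to the only fields the function reads (`ifname`, `addr_info`, `local`, `prefixlen`).

```python
import ipaddress
import json


def _norm(addr: str) -> str:
    """Normalise 'address/prefixlen' so equivalent spellings compare equal."""
    return str(ipaddress.ip_interface(addr))


def missing_addresses(ip_json: str, expected: dict) -> dict:
    """Return, per interface, the expected addresses that are not configured.

    ip_json is the text produced by `ip -j addr show`; expected maps an
    interface name to a list of 'address/prefixlen' strings.
    """
    configured = {}
    for link in json.loads(ip_json):
        addrs = configured.setdefault(link["ifname"], set())
        for info in link.get("addr_info", []):
            addrs.add(_norm(f'{info["local"]}/{info["prefixlen"]}'))
    return {
        ifname: [a for a in wanted if _norm(a) not in configured.get(ifname, set())]
        for ifname, wanted in expected.items()
    }


# Trimmed, made-up sample of `ip -j addr show` output for a half-configured node.
SAMPLE = json.dumps([
    {"ifname": "bond0", "addr_info": [
        {"family": "inet", "local": "192.0.2.10", "prefixlen": 24},
        {"family": "inet6", "local": "fe80::1", "prefixlen": 64},
    ]},
    {"ifname": "bond0.100", "addr_info": []},
])

# Hypothetical expectations (documentation-range addresses).
EXPECTED = {
    "bond0": ["192.0.2.10/24", "2001:db8:0:1::10/64"],
    "bond0.100": ["198.51.100.10/24"],
}

for ifname, missing in missing_addresses(SAMPLE, EXPECTED).items():
    print(ifname, "missing:", missing)
```

On a real node the JSON string would come from running `ip -j addr show` rather than from the hard-coded sample.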
On to debugging the error:
Looking at a working location, and before the image update:
https://gist.github.com/Wrhector/ccc520d42d6159f2a4aa6679c8239311
And now an identical node in the same location with the latest image, which also works:
https://gist.github.com/Wrhector/a53b9f32ec077b7b7e2e94844b676f75
/etc/netplan/50-maas.yaml:
https://gist.github.com/Wrhector/b1194bf6c2ac4775e265faac940ff6e2
And for later comparison, the systemd-networkd output:
https://gist.github.com/Wrhector/2c3afa145bebb75576ee7a6a779b331c
Now on a non-working node that previously worked: when booting, the bond comes up but no IPs are set (since there is no network access, I’m looking at the console).
If I manually run an extra netplan apply after the init process is complete, the IPs then come up.
The cloud-init log for this is:
https://gist.github.com/Wrhector/b929052d25f55aa7a5deb03e8a0e6dc1
/etc/netplan/50-maas.yaml:
https://gist.github.com/Wrhector/5a9940780e87627fe5da49fe9b047d4b
There is now also a new error in the systemd-networkd output (which netplan is now falling back to after the previous error?):
https://gist.github.com/Wrhector/e89d567853ed76f1ee96a20c20c5d616
(The 11:39:16 timestamp is when I manually ran netplan apply from the console to bring up the IPs.)
The fact that this only happens in these two locations is interesting. I’m wondering if the switches in these locations are somehow slower to bring up the bond, and whether that is causing the issue. Having said that, this wasn’t an issue on the previous image version, when the netplan apply that cloud-init ran didn’t fail.
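One way to test the slow-bond theory would be to hold off the extra netplan apply until the kernel reports carrier on the bond, and see whether the IPs then come up without manual intervention. The following is a minimal sketch, not a fix: the bond name `bond0`, the timeout and the helper names are assumptions, and it relies only on the `/sys/class/net/<iface>/carrier` file and the `netplan apply` command.

```python
import subprocess
import time
from pathlib import Path


def carrier_is_up(iface: str, sysfs_root: str = "/sys/class/net") -> bool:
    """True if the kernel reports link carrier on iface."""
    try:
        return (Path(sysfs_root) / iface / "carrier").read_text().strip() == "1"
    except OSError:
        # Interface missing, or administratively down (the read fails then).
        return False


def wait_for_carrier(iface: str, timeout: float = 60.0, interval: float = 1.0,
                     sysfs_root: str = "/sys/class/net") -> bool:
    """Poll until iface has carrier; return False if timeout expires first."""
    deadline = time.monotonic() + timeout
    while True:
        if carrier_is_up(iface, sysfs_root):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


def reapply_when_ready(iface: str = "bond0") -> bool:
    """Run `netplan apply` once iface has carrier (iface name is hypothetical)."""
    if not wait_for_carrier(iface):
        return False
    subprocess.run(["netplan", "apply"], check=True)
    return True
```

Calling `reapply_when_ready("bond0")` late in boot and logging how long the wait took would show whether the bond in these two locations really does take longer to come up than elsewhere.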
