Latest Ubuntu 20.04 image causing netplan error

Hi

MAAS version: 3.1.0

We have a large working cluster. However, we deployed today for the first time in a month and the network config failed to come up properly. Comparing the logs against a machine with the same config that booted one month earlier:

Original working:

Cloud-init v. 21.4-0ubuntu1~20.04.1

2022-03-20 12:19:13,313 - activators.py[DEBUG]: Attempting command ['netplan', 'apply'] for device all
2022-03-20 12:19:13,313 - subp.py[DEBUG]: Running command ['netplan', 'apply'] with allowed return codes [0] (shell=False, capture=True)
2022-03-20 12:19:13,923 - util.py[DEBUG]: Writing to /etc/netplan/50-maas.yaml - wb: [644] 1257 bytes
2022-03-20 12:19:13,923 - util.py[DEBUG]: Changing the ownership of /etc/netplan/50-maas.yaml to 0:0

And today:

Cloud-init v. 22.1-14-g2e17a0d6-0ubuntu1~20.04.3

2022-04-20 11:04:44,026 - activators.py[DEBUG]: Attempting command ['netplan', 'apply'] for device all
2022-04-20 11:04:44,026 - subp.py[DEBUG]: Running command ['netplan', 'apply'] with allowed return codes [0] (shell=False, capture=True)
2022-04-20 11:04:44,450 - activators.py[WARNING]: Running ['netplan', 'apply'] resulted in stderr output: Failed to connect system bus: No such file or directory
Falling back to a hard restart of systemd-networkd.service
2022-04-20 11:04:44,746 - util.py[DEBUG]: Writing to /etc/netplan/50-maas.yaml - wb: [644] 1257 bytes
2022-04-20 11:04:44,746 - util.py[DEBUG]: Changing the ownership of /etc/netplan/50-maas.yaml to 0:0

This is causing issues because neither the IPs on the interfaces nor the routes are brought up. It looks like the latest image from maas.io is the problem.

I’m trying to drill down into what changed on the image to cause this.
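To compare many machines quickly, the warning can be pulled out of a cloud-init log with a few lines of Python. This is only a sketch: the function name `find_netplan_warnings` is made up for illustration, and the pattern is based solely on the log lines quoted above.

```python
import re

# Pattern for the warning line as it appears in the logs quoted in this thread.
WARNING_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) - "
    r"activators\.py\[WARNING\]: Running \['netplan', 'apply'\] "
    r"resulted in stderr output: (?P<stderr>.*)$"
)


def find_netplan_warnings(log_text):
    """Return (timestamp, stderr) pairs for each netplan apply warning."""
    hits = []
    for line in log_text.splitlines():
        match = WARNING_RE.match(line)
        if match:
            hits.append((match.group("ts"), match.group("stderr")))
    return hits


# Two lines copied from the failing log above.
SAMPLE = (
    "2022-04-20 11:04:44,026 - activators.py[DEBUG]: Attempting command "
    "['netplan', 'apply'] for device all\n"
    "2022-04-20 11:04:44,450 - activators.py[WARNING]: Running "
    "['netplan', 'apply'] resulted in stderr output: "
    "Failed to connect system bus: No such file or directory\n"
)

for ts, stderr in find_netplan_warnings(SAMPLE):
    print(ts, stderr)
```

On a deployed machine the same function could be fed the contents of /var/log/cloud-init.log instead of the sample string.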

Hi,

this is not something I’ve seen before. Is it happening on every deploy, or is it intermittent?

It would be good if you could share the network configuration for the machine and the cloud-init logs.

Hi

Spent a few days doing some digging. Here’s what I’ve found so far.

Background:

  • We operate ~10 separate MAAS instances in different locations, each with ~40 servers.
  • On all servers we create a bond between two interfaces, add a VLAN to that, then add a secondary IP to both the bond and the VLAN.
  • None of the servers we use with MAAS have disks; they’re all netbooted and run in RAM.
  • We have a complex cloud-init shell script; however, I removed it while debugging this and there was no change.

Example network config, all servers use a configuration like this:
https://gist.github.com/Wrhector/86acdd91b175f704b46f4227f27b9043 (some redaction of ips)
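For readers who cannot open the gist, the general shape of such a setup (bond, VLAN on the bond, two addresses on each) can be sketched as below. This is not the real config: the interface names, VLAN ID, bond mode, and addresses (taken from the documentation ranges) are all placeholders, and `build_netplan` is a hypothetical helper. The output is JSON, which is also valid YAML.

```python
import json


def build_netplan(bond_members, vlan_id, bond_addrs, vlan_addrs):
    """Build a netplan v2 structure: bond over two NICs plus a VLAN on the bond."""
    bond_name = "bond0"  # placeholder name
    vlan_name = f"{bond_name}.{vlan_id}"
    return {
        "network": {
            "version": 2,
            # Physical members carry no addresses of their own.
            "ethernets": {name: {"dhcp4": False} for name in bond_members},
            "bonds": {
                bond_name: {
                    "interfaces": list(bond_members),
                    "parameters": {"mode": "802.3ad"},  # assumed mode
                    "addresses": list(bond_addrs),
                }
            },
            "vlans": {
                vlan_name: {
                    "id": vlan_id,
                    "link": bond_name,
                    "addresses": list(vlan_addrs),
                }
            },
        }
    }


config = build_netplan(
    bond_members=["eno1", "eno2"],                    # placeholder NIC names
    vlan_id=100,                                      # placeholder VLAN ID
    bond_addrs=["192.0.2.10/24", "2001:db8:1::10/64"],
    vlan_addrs=["198.51.100.10/24", "2001:db8:2::10/64"],
)
print(json.dumps(config, indent=2))
```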

With 20.04.1 and before, there was no issue with this setup on any of our systems. However, since 20.04.3, none of the servers in two of our locations create the network properly (the bonds get created, but either no IPs are added or only one of the IPv4 addresses is added).

The only change we can see in the cloud-init log is the aforementioned netplan error. Note, however, that this error shows up in all locations; it only breaks network creation in these two.

Looking at what makes these locations different, the only thing I can see is that these are the only two locations where we use 10GBASE-T instead of SFPs and, as a result, they use different NICs/switches than the other locations.

In these two locations I’ve done some experimenting, and the issue does not happen if we remove the aliases. It seems to be a problem with the IPv6 address aliases.

On to debugging the error:

Looking at a working location, and before the image update:
https://gist.github.com/Wrhector/ccc520d42d6159f2a4aa6679c8239311

And now an identical node in the same location with the latest image, also works:
https://gist.github.com/Wrhector/a53b9f32ec077b7b7e2e94844b676f75
/etc/netplan/50-maas.yaml:
https://gist.github.com/Wrhector/b1194bf6c2ac4775e265faac940ff6e2
And for later comparison, the systemd-networkd output:
https://gist.github.com/Wrhector/2c3afa145bebb75576ee7a6a779b331c

Now on a non-working node that previously worked: when booting, the bond comes up but no IPs are set (since there is no network access, I’m looking at the console):

If I manually run an extra netplan apply after the init process is complete, the IPs then come up.
The cloud-init log for this is:
https://gist.github.com/Wrhector/b929052d25f55aa7a5deb03e8a0e6dc1
/etc/netplan/50-maas.yaml:
https://gist.github.com/Wrhector/5a9940780e87627fe5da49fe9b047d4b

There is now also a new error in the systemd-networkd output (which netplan is now falling back to because of the previous error?):
https://gist.github.com/Wrhector/e89d567853ed76f1ee96a20c20c5d616

(The 11:39:16 time is when I manually ran netplan apply from the console to bring up the IPs.)

The fact that this only happens in these two locations is interesting. I’m wondering if the switches in these locations are somehow slower to bring up the bond, which is causing the issue. Having said that, this wasn’t an issue on the previous image version, when the netplan apply that cloud-init ran didn’t fail.
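One way to test the slow-bond theory would be to sample the kernel's bonding status file during boot and see when the bond and its members report link. A rough sketch, assuming the bond is called `bond0` (so the file would be /proc/net/bonding/bond0); `parse_bond_status` is a made-up helper, and the sample text is a trimmed illustration of that file's layout, not output from one of the affected machines.

```python
def parse_bond_status(text):
    """Return (bond_mii_status, {member_name: mii_status}) from bonding status text."""
    bond_status = None
    members = {}
    current = None  # member section currently being read, if any
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("Slave Interface:"):
            current = line.split(":", 1)[1].strip()
        elif line.startswith("MII Status:"):
            status = line.split(":", 1)[1].strip()
            if current is None:
                # MII Status before any member section belongs to the bond itself.
                if bond_status is None:
                    bond_status = status
            elif current not in members:
                members[current] = status
    return bond_status, members


# Illustrative sample only; interface names are placeholders.
SAMPLE = """\
Bonding Mode: IEEE 802.3ad Dynamic link aggregation
MII Status: up

Slave Interface: eno1
MII Status: up

Slave Interface: eno2
MII Status: down
"""

print(parse_bond_status(SAMPLE))
```

Logging this once a second from early boot in both a working and a non-working location would show whether the bond really is later to come up where the addresses go missing.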

Also, 22.04 works fine; the issue only affects 20.04 and cropped up between 20.04.1 and 20.04.3. Unfortunately, since 22.04 is so new, we’re unable to switch to it yet.

@williammmllc, did you ever get any closure on this one?

Nope, I’m just using a dodgy hack that keeps running netplan apply repeatedly until the interface IPs come up properly. Interestingly, once all the interfaces are eventually up properly, running netplan apply again breaks them, and running it a second time makes them work again.
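A workaround of this kind might look roughly like the following. It is a sketch, not the script actually in use: the function names, retry count, and delay are invented, and it assumes `ip -j addr show` (JSON output from iproute2) is available. It checks before each apply, so netplan apply is never run once the addresses are present, which matters given the break-then-fix behaviour described above.

```python
import ipaddress
import json
import subprocess
import time


def current_addresses():
    """Return the addresses currently configured, as reported by `ip -j addr show`."""
    out = subprocess.run(
        ["ip", "-j", "addr", "show"], check=True, capture_output=True, text=True
    ).stdout
    return {
        info["local"]
        for iface in json.loads(out)
        for info in iface.get("addr_info", [])
        if "local" in info
    }


def run_netplan_apply():
    """Run netplan apply; failures are tolerated because the loop retries."""
    subprocess.run(["netplan", "apply"], check=False)


def apply_until_configured(expected, get_addresses=current_addresses,
                           apply=run_netplan_apply, attempts=10, delay=5.0,
                           sleep=time.sleep):
    """Re-run apply until every expected address is present; return True on success."""
    # Parse addresses so that differently written IPv6 forms compare equal.
    wanted = {ipaddress.ip_address(a) for a in expected}
    for _ in range(attempts):
        have = {ipaddress.ip_address(a) for a in get_addresses()}
        if wanted <= have:
            return True
        apply()
        sleep(delay)
    have = {ipaddress.ip_address(a) for a in get_addresses()}
    return wanted <= have
```

It would be called with the addresses from the machine's netplan file, without prefix lengths, for example `apply_until_configured(["192.0.2.10", "2001:db8:1::10"])` (placeholder addresses).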

Okay, please go ahead and file a bug. It might not be a bug, but if you’ve stumped two MAAS engineers already (Bjorn and me), we should treat it as a bug until we find out otherwise.

Did you ever file a bug report for this? I’m seeing similar issues on 20.04.

Cloud-init v. 22.4.2-0ubuntu0~20.04.2 running 'init-local' at Fri, 10 Feb 2023 18:23:42 +0000. Up 5.12 seconds.
2023-02-10 18:23:42,289 - handlers.py[WARNING]: Failed posting event: {"name": "init-local/check-cache", "description": "attempting to read from cache [trust]", "event_type": "start", "origin": "cloudinit", "timestamp": 1676053422.2587974}. This was caused by: HTTPConnectionPool(host='192.168.98.2', port=5248): Max retries exceeded with url: /MAAS/metadata/status/xtm848 (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f39e7f97190>: Failed to establish a new connection: [Errno 101] Network is unreachable'))
2023-02-10 18:23:42,299 - handlers.py[WARNING]: Failed posting event: {"name": "init-local/check-cache", "description": "no cache found", "event_type": "finish", "origin": "cloudinit", "timestamp": 1676053422.25944, "result": "SUCCESS"}. This was caused by: HTTPConnectionPool(host='192.168.98.2', port=5248): Max retries exceeded with url: /MAAS/metadata/status/xtm848 (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f39e7fab040>: Failed to establish a new connection: [Errno 101] Network is unreachable'))
2023-02-10 18:23:42,483 - handlers.py[WARNING]: Failed posting event: {"name": "init-local", "description": "searching for local datasources", "event_type": "finish", "origin": "cloudinit", "timestamp": 1676053422.4809628, "result": "SUCCESS"}. This was caused by: HTTPConnectionPool(host='192.168.98.2', port=5248): Max retries exceeded with url: /MAAS/metadata/status/xtm848 (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f39e7fab220>: Failed to establish a new connection: [Errno 101] Network is unreachable'))
2023-02-10 18:23:42,483 - handlers.py[WARNING]: Multiple consecutive failures in WebHookHandler. Cancelling all queued events.
Cloud-init v. 22.4.2-0ubuntu0~20.04.2 running 'init' at Fri, 10 Feb 2023 18:23:44 +0000. Up 7.74 seconds.
ci-info: ++++++++++++++++++++++++++++++++++++++Net device info++++++++++++++++++++++++++++++++++++++
ci-info: +--------+------+----------------------------+---------------+--------+-------------------+
ci-info: | Device |  Up  |          Address           |      Mask     | Scope  |     Hw-Address    |
ci-info: +--------+------+----------------------------+---------------+--------+-------------------+
ci-info: |  ens4  | True |        10.11.29.205        | 255.255.255.0 | global | 52:54:00:29:eb:4f |
ci-info: |  ens4  | True | fe80::5054:ff:fe29:eb4f/64 |       .       |  link  | 52:54:00:29:eb:4f |
ci-info: |   lo   | True |         127.0.0.1          |   255.0.0.0   |  host  |         .         |
ci-info: |   lo   | True |          ::1/128           |       .       |  host  |         .         |
ci-info: +--------+------+----------------------------+---------------+--------+-------------------+
ci-info: ++++++++++++++++++++++++++++Route IPv4 info+++++++++++++++++++++++++++++
ci-info: +-------+-------------+------------+---------------+-----------+-------+
ci-info: | Route | Destination |  Gateway   |    Genmask    | Interface | Flags |
ci-info: +-------+-------------+------------+---------------+-----------+-------+
ci-info: |   0   |   0.0.0.0   | 10.11.29.1 |    0.0.0.0    |    ens4   |   UG  |
ci-info: |   1   |  10.11.29.0 |  0.0.0.0   | 255.255.255.0 |    ens4   |   U   |
ci-info: +-------+-------------+------------+---------------+-----------+-------+
ci-info: +++++++++++++++++++Route IPv6 info+++++++++++++++++++
ci-info: +-------+-------------+---------+-----------+-------+
ci-info: | Route | Destination | Gateway | Interface | Flags |
ci-info: +-------+-------------+---------+-----------+-------+
ci-info: |   1   |  fe80::/64  |    ::   |    ens4   |   U   |
ci-info: |   3   |    local    |    ::   |    ens4   |   U   |
ci-info: |   4   |  multicast  |    ::   |    ens4   |   U   |
ci-info: +-------+-------------+---------+-----------+-------+
2023-02-10 18:23:45,921 - activators.py[WARNING]: Running ['netplan', 'apply'] resulted in stderr output: ^[[0;1;31mFailed to connect system bus: No such file or directory^[[0m
Falling back to a hard restart of systemd-networkd.service

The system boots up just fine, and netplan apply works okay.

Perhaps I should be ignoring these errors at boot?

@nateybobo, it looks like the OP never filed a bug. Could you file one so we can reclassify this problem?