Latest Ubuntu 20.04 image causing netplan error

Hi

MAAS version: 3.1.0

We have a large working cluster. However, when we deployed today for the first time in a month, the network config failed to come up properly. Here are the logs, compared with those from a machine with the same config booted one month earlier:

Original working:

Cloud-init v. 21.4-0ubuntu1~20.04.1

2022-03-20 12:19:13,313 - activators.py[DEBUG]: Attempting command ['netplan', 'apply'] for device all
2022-03-20 12:19:13,313 - subp.py[DEBUG]: Running command ['netplan', 'apply'] with allowed return codes [0] (shell=False, capture=True)
2022-03-20 12:19:13,923 - util.py[DEBUG]: Writing to /etc/netplan/50-maas.yaml - wb: [644] 1257 bytes
2022-03-20 12:19:13,923 - util.py[DEBUG]: Changing the ownership of /etc/netplan/50-maas.yaml to 0:0

And today:

Cloud-init v. 22.1-14-g2e17a0d6-0ubuntu1~20.04.3

2022-04-20 11:04:44,026 - activators.py[DEBUG]: Attempting command ['netplan', 'apply'] for device all
2022-04-20 11:04:44,026 - subp.py[DEBUG]: Running command ['netplan', 'apply'] with allowed return codes [0] (shell=False, capture=True)
2022-04-20 11:04:44,450 - activators.py[WARNING]: Running ['netplan', 'apply'] resulted in stderr output: Failed to connect system bus: No such file or directory
Falling back to a hard restart of systemd-networkd.service
2022-04-20 11:04:44,746 - util.py[DEBUG]: Writing to /etc/netplan/50-maas.yaml - wb: [644] 1257 bytes
2022-04-20 11:04:44,746 - util.py[DEBUG]: Changing the ownership of /etc/netplan/50-maas.yaml to 0:0

This is causing issues, as it’s not bringing up the IPs on the interfaces or the routes. It looks like the latest image from maas.io is the problem.
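To compare many machines quickly, the warning shown in the second log can be searched for mechanically. The following is a minimal sketch, assuming the log line format quoted above; the function name `find_netplan_warnings` is made up for illustration and is not part of cloud-init.

```python
import re

# Matches the cloud-init warning line quoted in the log excerpt above.
WARNING_RE = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) - "
    r"activators\.py\[WARNING\]: Running \['netplan', 'apply'\] "
    r"resulted in stderr output: (?P<stderr>.*)$"
)


def find_netplan_warnings(log_text):
    """Return (timestamp, stderr) pairs for 'netplan apply' runs that wrote to stderr."""
    hits = []
    for line in log_text.splitlines():
        match = WARNING_RE.match(line)
        if match:
            hits.append((match.group("timestamp"), match.group("stderr")))
    return hits
```

On a deployed machine this could be fed the contents of /var/log/cloud-init.log. Note that it only reports the first line of the stderr output; the "Falling back to a hard restart" line follows on its own line in the log.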

I’m trying to drill down into what changed on the image to cause this.

Hi,

this is not something I’ve seen before. Is it happening on every deploy, or is it intermittent?

It would be good if you could share the network configuration for the machine and the cloud-init logs.

Hi

Spent a few days doing some digging. Here’s what I’ve found so far.

Background:

  • We operate ~10 separate MAAS instances in different locations, each with ~40 servers.
  • On all servers we create a bond between two interfaces, add a VLAN to that, then add a secondary IP to both the bond and the VLAN.
  • None of the servers we use with MAAS have disks; they’re all netbooted and run in RAM.
  • We have a complex cloud-init shell script; however, I removed it while debugging this and there was no change.

Example network config, all servers use a configuration like this:
https://gist.github.com/Wrhector/86acdd91b175f704b46f4227f27b9043 (some redaction of ips)
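For readers who can't open the gist, here is a rough sketch of the general shape such a configuration takes in netplan terms (a bond over two interfaces, a VLAN on the bond, and two addresses on each). It is not the contents of the gist: the interface names, VLAN ID and addresses are hypothetical (documentation ranges), the bond mode is left out because the thread doesn't state it, and the helper `build_netplan_config` is invented for illustration.

```python
import json


def build_netplan_config(members, vlan_id, bond_addresses, vlan_addresses,
                         bond_name="bond0"):
    """Build a netplan-style dict: a bond, a VLAN on the bond, addresses on both."""
    vlan_name = f"{bond_name}.{vlan_id}"
    return {
        "network": {
            "version": 2,
            # Physical interfaces that get enslaved to the bond.
            "ethernets": {name: {} for name in members},
            "bonds": {
                bond_name: {
                    "interfaces": list(members),
                    "addresses": list(bond_addresses),
                }
            },
            "vlans": {
                vlan_name: {
                    "id": vlan_id,
                    "link": bond_name,
                    "addresses": list(vlan_addresses),
                }
            },
        }
    }


# Hypothetical values throughout; none of these come from the thread.
config = build_netplan_config(
    members=["eno1", "eno2"],
    vlan_id=100,
    bond_addresses=["192.0.2.10/24", "2001:db8::10/64"],
    vlan_addresses=["198.51.100.10/24", "2001:db8:1::10/64"],
)
print(json.dumps(config, indent=2))
```

The printed JSON mirrors the nesting of a netplan YAML file (network, then ethernets/bonds/vlans), which makes it easier to see where the extra IPv6 addresses discussed below would sit.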

With 20.04.1 and before, there was no issue with this setup on any of our systems. However, since 20.04.3, none of the servers in two of our locations create the network properly: the bonds get created, but either no IPs are added or only one of the IPv4 addresses is added.

The only change we can see in the cloud-init log is the aforementioned netplan error. Note, however, that this error shows up in all locations; it only breaks network creation in these two.

Looking at what makes these locations different, the only thing I can see is that these are the only two locations where we use 10GBASE-T instead of SFPs and, as a result, different NICs and switches from the other locations.

On these two locations, I’ve done some experimenting, and this issue does not happen if we remove the aliases. It seems to be a problem with the IPv6 address aliases.

On to debugging the error:

Looking at a working location, and before the image update:
https://gist.github.com/Wrhector/ccc520d42d6159f2a4aa6679c8239311

And now an identical node in the same location with the latest image, also works:
https://gist.github.com/Wrhector/a53b9f32ec077b7b7e2e94844b676f75
/etc/netplan/50-maas.yaml:
https://gist.github.com/Wrhector/b1194bf6c2ac4775e265faac940ff6e2
And for later comparison, the systemd-networkd output:
https://gist.github.com/Wrhector/2c3afa145bebb75576ee7a6a779b331c

Now on a non-working node that previously worked: when booting, the bond comes up but no IPs are set (since there is no network access, I’m looking at the console):

If I manually run an extra netplan apply after the init process is complete, the IPs then come up.
The cloud-init log for this is:
https://gist.github.com/Wrhector/b929052d25f55aa7a5deb03e8a0e6dc1
/etc/netplan/50-maas.yaml:
https://gist.github.com/Wrhector/5a9940780e87627fe5da49fe9b047d4b

There is also a new error in the systemd-networkd output (which netplan is now falling back to because of the previous error?):
https://gist.github.com/Wrhector/e89d567853ed76f1ee96a20c20c5d616

(The 11:39:16 time is when I manually ran netplan apply from the console to bring up the IPs.)

The fact that this only happens in these two locations is interesting. I’m wondering if the switches in these locations are somehow slower to bring up the bond, which is causing the issue. Having said that, this wasn’t an issue on the previous image version, when the netplan apply that cloud-init ran didn’t fail.
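The slow-link theory is only a guess, but it could be checked by timing how long the bond takes to report an operational link after boot. Below is a minimal sketch that polls the kernel's `operstate` file under /sys/class/net; the interface name `bond0`, the timeout values and the function names are assumptions for illustration.

```python
import time
from pathlib import Path


def read_operstate(iface, sysfs_root="/sys/class/net"):
    """Return the kernel's operstate for iface, or 'missing' if it can't be read."""
    try:
        return (Path(sysfs_root) / iface / "operstate").read_text().strip()
    except OSError:
        return "missing"


def wait_for_link(iface, timeout=30.0, interval=0.5,
                  read_state=read_operstate, sleep=time.sleep,
                  clock=time.monotonic):
    """Poll until iface reports operstate 'up'; return False on timeout."""
    deadline = clock() + timeout
    while True:
        if read_state(iface) == "up":
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

Calling `wait_for_link("bond0")` early in boot at a working and a non-working location, and logging how long it takes to return, would show whether the two kinds of site really differ in link bring-up time.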

Also, 22.04 works fine; it’s only an issue with 20.04, and it cropped up between 20.04.1 and 20.04.3. Unfortunately, since 22.04 is so new, we’re unable to switch to it yet.

@williammmllc, did you ever get any closure on this one?

Nope, I’m just using a dodgy hack that keeps running netplan apply repeatedly until the interface IPs come up properly. Interestingly, once all the interfaces are eventually up properly, running netplan apply again breaks them, and running it a second time gets them working again.
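The actual script isn't shown in the thread, but a workaround of this kind could look roughly like the sketch below. It checks for the expected addresses before each `netplan apply`, so it never applies when things are already up (which matters given the behaviour just described). The expected addresses, retry counts and function names are assumptions; `ip -json address show` is used to read the current addresses.

```python
import ipaddress
import json
import subprocess
import time


def current_addresses():
    """Return the set of IP addresses currently assigned to any interface."""
    out = subprocess.run(
        ["ip", "-json", "address", "show"],
        check=True, capture_output=True, text=True,
    ).stdout
    return {
        ipaddress.ip_address(info["local"])
        for iface in json.loads(out)
        for info in iface.get("addr_info", [])
        if "local" in info
    }


def netplan_apply():
    """Run 'netplan apply' once; return True if it exited with status 0."""
    result = subprocess.run(["netplan", "apply"], capture_output=True, text=True)
    return result.returncode == 0


def apply_until_configured(expected, attempts=10, delay=5.0,
                           apply=netplan_apply,
                           get_addresses=current_addresses,
                           sleep=time.sleep):
    """Re-run netplan apply until every expected address is present.

    Returns True once all addresses are configured, False if they are
    still missing after the given number of attempts.
    """
    wanted = {ipaddress.ip_address(addr) for addr in expected}
    for _ in range(attempts):
        if wanted <= get_addresses():
            return True
        apply()
        sleep(delay)
    return wanted <= get_addresses()
```

It would need to run as root after cloud-init has finished, for example `apply_until_configured(["192.0.2.10", "2001:db8::10"])` with the machine's real addresses substituted for these placeholder ones.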

Okay, please go ahead and file a bug. It might not be a bug, but if you’ve stumped two MAAS engineers already (Bjorn and me), we should treat it as a bug until we find out otherwise.

Did you ever file a bug report for this? I’m seeing similar issues on 20.04.

Cloud-init v. 22.4.2-0ubuntu0~20.04.2 running 'init-local' at Fri, 10 Feb 2023 18:23:42 +0000. Up 5.12 seconds.
2023-02-10 18:23:42,289 - handlers.py[WARNING]: Failed posting event: {"name": "init-local/check-cache", "description": "attempting to read from cache [trust]", "event_type": "start", "origin": "cloudinit", "timestamp": 1676053422.2587974}. This was caused by: HTTPConnectionPool(host='192.168.98.2', port=5248): Max retries exceeded with url: /MAAS/metadata/status/xtm848 (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f39e7f97190>: Failed to establish a new connection: [Errno 101] Network is unreachable'))
2023-02-10 18:23:42,299 - handlers.py[WARNING]: Failed posting event: {"name": "init-local/check-cache", "description": "no cache found", "event_type": "finish", "origin": "cloudinit", "timestamp": 1676053422.25944, "result": "SUCCESS"}. This was caused by: HTTPConnectionPool(host='192.168.98.2', port=5248): Max retries exceeded with url: /MAAS/metadata/status/xtm848 (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f39e7fab040>: Failed to establish a new connection: [Errno 101] Network is unreachable'))
2023-02-10 18:23:42,483 - handlers.py[WARNING]: Failed posting event: {"name": "init-local", "description": "searching for local datasources", "event_type": "finish", "origin": "cloudinit", "timestamp": 1676053422.4809628, "result": "SUCCESS"}. This was caused by: HTTPConnectionPool(host='192.168.98.2', port=5248): Max retries exceeded with url: /MAAS/metadata/status/xtm848 (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f39e7fab220>: Failed to establish a new connection: [Errno 101] Network is unreachable'))
2023-02-10 18:23:42,483 - handlers.py[WARNING]: Multiple consecutive failures in WebHookHandler. Cancelling all queued events.
Cloud-init v. 22.4.2-0ubuntu0~20.04.2 running 'init' at Fri, 10 Feb 2023 18:23:44 +0000. Up 7.74 seconds.
ci-info: ++++++++++++++++++++++++++++++++++++++Net device info++++++++++++++++++++++++++++++++++++++
ci-info: +--------+------+----------------------------+---------------+--------+-------------------+
ci-info: | Device |  Up  |          Address           |      Mask     | Scope  |     Hw-Address    |
ci-info: +--------+------+----------------------------+---------------+--------+-------------------+
ci-info: |  ens4  | True |        10.11.29.205        | 255.255.255.0 | global | 52:54:00:29:eb:4f |
ci-info: |  ens4  | True | fe80::5054:ff:fe29:eb4f/64 |       .       |  link  | 52:54:00:29:eb:4f |
ci-info: |   lo   | True |         127.0.0.1          |   255.0.0.0   |  host  |         .         |
ci-info: |   lo   | True |          ::1/128           |       .       |  host  |         .         |
ci-info: +--------+------+----------------------------+---------------+--------+-------------------+
ci-info: ++++++++++++++++++++++++++++Route IPv4 info+++++++++++++++++++++++++++++
ci-info: +-------+-------------+------------+---------------+-----------+-------+
ci-info: | Route | Destination |  Gateway   |    Genmask    | Interface | Flags |
ci-info: +-------+-------------+------------+---------------+-----------+-------+
ci-info: |   0   |   0.0.0.0   | 10.11.29.1 |    0.0.0.0    |    ens4   |   UG  |
ci-info: |   1   |  10.11.29.0 |  0.0.0.0   | 255.255.255.0 |    ens4   |   U   |
ci-info: +-------+-------------+------------+---------------+-----------+-------+
ci-info: +++++++++++++++++++Route IPv6 info+++++++++++++++++++
ci-info: +-------+-------------+---------+-----------+-------+
ci-info: | Route | Destination | Gateway | Interface | Flags |
ci-info: +-------+-------------+---------+-----------+-------+
ci-info: |   1   |  fe80::/64  |    ::   |    ens4   |   U   |
ci-info: |   3   |    local    |    ::   |    ens4   |   U   |
ci-info: |   4   |  multicast  |    ::   |    ens4   |   U   |
ci-info: +-------+-------------+---------+-----------+-------+
2023-02-10 18:23:45,921 - activators.py[WARNING]: Running ['netplan', 'apply'] resulted in stderr output: ^[[0;1;31mFailed to connect system bus: No such file or directory^[[0m
Falling back to a hard restart of systemd-networkd.service

The system boots up just fine, and netplan apply works okay.

Perhaps I should be ignoring these errors at boot?

@nateybobo, it looks like the OP never filed a bug. Could you file one so we can reclassify this problem?

@williammmllc @nateybobo Did you ever manage to file a bug for this?

I’m encountering the exact same issue on Ubuntu 20.04.

This one’s been stewing awhile, so I’m going to summarize it and maybe encourage a call to action.

tl;dr: @ollienuk, can you file a bug report? See below for details.

Based on the conversation, it appears that the problem is related to changes in the Ubuntu 20.04.3 image that may be affecting the way the MAAS network configurations are being applied using netplan. The problem seems to be intermittent and might be related to the specific network hardware being used – can’t tell from what I see here. (Hey, I’m just the tech writer, so…).

From the discussion, it seems that the netplan apply command does not always work as expected on boot but works when run manually. Based on my time fighting with modems and fax machines, this seems like it could be a race condition or other timing-related issue.

The first step to address this issue is to ensure that the problem is well documented and reported. Since neither @williammmllc nor @nateybobo has reported this issue yet, @ollienuk, are you up to filing a bug report? I ask because we really need a few details:

  1. Your MAAS version.
  2. The Ubuntu version you’re deploying.
  3. The network configuration being used.
  4. Any relevant logs or error messages, like the ones you’ve already collected.
  5. The hardware details, particularly the network interfaces and switches being used.
  6. Any workarounds or temporary fixes you’ve found, like the manual netplan apply command.

In the meantime, you could try to use @williammmllc’s workaround. If I read this right, they managed to keep things running by having a script that repeatedly runs netplan apply until the interfaces come up properly. It’s incredibly wonky, but it might help keep things running until a proper solution is found.

It would also be beneficial to test with other Ubuntu versions or MAAS versions, if possible, to see if the problem exists there as well. As @williammmllc mentioned, Ubuntu 22.04 seems to work fine, so it might be worth considering an upgrade to that version if all else fails.

Hi @billwear,

In my case, it turned out that one of the interfaces in a bonded interface had become disconnected from the server, so cloud-init was waiting for it to be present before applying the configuration needed to communicate with the data source. It isn’t really a bug in this case, but it was unclear what was causing the issue because the machine would reboot almost immediately when it got to this point. I have to say cloud-init functions perfectly fine in Ubuntu 22.04 with one of the interfaces not present/up. I had to prevent the reboot by restricting the file permissions on the reboot command as soon as the system booted.
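A quick way to spot this situation is to compare the bond members named in the network configuration with what the kernel reports. The following is a minimal sketch, assuming the netplan configuration has already been loaded into a Python dict; the function name `bond_member_states` is made up for illustration.

```python
from pathlib import Path


def bond_member_states(config, sysfs_root="/sys/class/net"):
    """Map (bond, member) to the member's operstate, or 'missing' if absent.

    config is a netplan-style dict with bonds under config["network"]["bonds"].
    """
    root = Path(sysfs_root)
    report = {}
    bonds = config.get("network", {}).get("bonds", {})
    for bond, settings in bonds.items():
        for member in settings.get("interfaces", []):
            try:
                state = (root / member / "operstate").read_text().strip()
            except OSError:
                # No such interface (or unreadable): the device is not present.
                state = "missing"
            report[(bond, member)] = state
    return report
```

Any member reported as "missing" or "down" is a candidate for the kind of disconnected interface described above.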

That was hard to find. Hats off to you for your efforts in solving the problem. Great job! I need to add this to our troubleshooting section.

I’m still experiencing my issue, which I think is largely different from what @williammmllc posted about.

I created this thread to better discuss my issue: Confusing cloud-init log events