2, minutes... To miiidnight! But no, 2 minutes waiting for Network to get Configured * each MaaS commission stage * failed deployments

Guys,

Even on the simplest network scenario, like 1 PXE via Static IP or DHCP, each stage waits 2 minutes for the network to come online… And MaaS has many stages, and it fails a LOT, so those 2 minutes become hours! Just waiting for the network…

Anyone working to fix this? :stuck_out_tongue:

Cheers!

Check to see if this is a problem with your network switch.

I use Cisco, and the default behavior of most of my switches is to wait a minute after a port comes up before connecting it to the network. This interferes with protocols like DHCP, and is generally annoying. However, the purpose is to prevent network loops. The switch brings the port up and listens for neighbor announcements to see if the port has traffic on it that is already on the network. That way the network isn’t immediately lost to a broadcast storm because of a network loop. As soon as the switch sees the port is quiet for a period of time, it will connect the port to the network. But the client thinks the network came up as soon as it was plugged in, and by then it has already gone into either a timeout or a long wait before retrying DHCP. Some DHCP clients fail to ever recover altogether; others add additional wait time, extending the delay past the minute. Your issue seems to suggest similar behavior to me.

Turning this feature off is different for different switches. On some switches this is tunable with a “portfast” setting, and on others I must turn spanning tree off for the whole VLAN. For me, this was a known issue long before I started using MAAS, so I was fortunate to be aware of it and turn it off before I even started using the switches at the top of my MAAS stacks.

I’ve also been experiencing this for a while. It didn’t appear to be an issue with any of the switches the hosts were attached to; the link(s) came up quickly and correctly. However, I had left a few ‘unconfigured’ interfaces lying around - these were interfaces that had been discovered by MAAS on the node but were unused/unconnected. It seems systemd-networkd-wait-online waits for all interfaces that it manages to be fully configured - the status of which can be checked via networkctl.
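To spot which interfaces are holding things up, something like the following might help. It is only a rough sketch: it parses the text that `networkctl list` prints and reports links whose SETUP column is not yet settled. The function names and the set of "settled" states are my own assumptions (based on my reading that wait-online stops waiting for links that are configured, failed, or unmanaged), so adjust them to what you see on your nodes.

```python
import subprocess

# Setup states treated as "nothing left to wait for".
# Assumption: wait-online ignores unmanaged links and stops waiting
# once a managed link is either configured or failed.
SETTLED_STATES = {"configured", "unmanaged", "failed"}


def unsettled_links(networkctl_output):
    """Return (link, setup_state) pairs for links that are still being set up."""
    pending = []
    for line in networkctl_output.splitlines():
        fields = line.split()
        # Data rows look like: IDX LINK TYPE OPERATIONAL SETUP
        if len(fields) != 5 or not fields[0].isdigit():
            continue  # skip the header, blank lines and the "N links listed." footer
        _idx, link, _type, _operational, setup = fields
        if setup not in SETTLED_STATES:
            pending.append((link, setup))
    return pending


def read_networkctl():
    """Run `networkctl list` on the local node and return its text output."""
    result = subprocess.run(
        ["networkctl", "list", "--no-legend", "--no-pager"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

On a node you would call `unsettled_links(read_networkctl())`; anything it returns (for example an interface stuck in `configuring`) is a candidate for removal from the netplan configuration.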

After removing the unused interfaces from my netplan configuration and applying it, nodes no longer hung for 2 minutes at boot. Additionally, I made sure to instruct systemd-networkd that certain virtual interfaces (e.g. tunl0 for Calico) should be unmanaged. These interfaces shouldn’t be present at boot anyway, but there’s definitely no need for systemd-networkd to manage them, and other issues can be observed if it does.
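For the unmanaged part, a minimal sketch of what I mean, assuming a systemd version that supports `Unmanaged=` in the `[Link]` section of a `.network` file. The helper name and the file name in the comment are hypothetical; the helper only renders the file contents so you can review them before writing anything to disk.

```python
def unmanaged_network_unit(interface_name):
    """Render a systemd .network file that marks one interface as unmanaged."""
    return (
        "[Match]\n"
        f"Name={interface_name}\n"
        "\n"
        "[Link]\n"
        "Unmanaged=yes\n"
    )


# Example: contents for a hypothetical
# /etc/systemd/network/10-tunl0-unmanaged.network
tunl0_unit = unmanaged_network_unit("tunl0")
```

After placing the file, systemd-networkd needs to pick it up (a restart of the service should do it), and `networkctl` should then show the interface as `unmanaged` in the SETUP column.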