Here’s a problem that has been plaguing us.
Background (MAAS 2.9.2):
- We’re reimaging older boxes from 16.04 to 20.04. Each machine’s IP will stay the same.
- We have a small dedicated pool of IPs (10-15 addresses) in each subnet for MAAS to use for DHCP.
- Our core network has an arp timeout of 20 minutes (can’t be changed, policy/politics…)
- The MAAS rack controller has an interface on the same subnet as the machines being reimaged (BUT NOT ON THE SAME SWITCH)
- The region controllers are in a different VLAN
- Assume no IPs in the MAAS DHCP pool have been used in the past half hour
Problem:
- Commissioning:
Case a. No IPs in the commissioning pool have been used today: commissioning works fine. The machine reboots, PXE boots with an IP from the pool, does its thing and powers off.
Case b. MAAS hands out an IP for commissioning that a different machine used within the past 20 minutes: TFTP times out and fails. Retrying sometimes works, if it happens to be done more than 20 minutes after that IP was last used by the other machine (ARP entry aged out).
- Deployment:
a. Machines are deployed with a STATIC IP (the IP they had under 16.04).
If it has been less than 20 minutes since commissioning, the problem happens: the machine powers up, gets ITS FORMER STATIC IP via DHCP from MAAS (why isn’t it using the DHCP pool in this case???), TFTP times out and the deployment hangs. Retries keep failing until the ARP entry from commissioning expires, at which point deployment succeeds.
The problem in this case is ARP, or more specifically the lack of gratuitous ARPs when the IP changes. Ideally one would be sent as early in the process as possible. Can pxelinux.0 do it? Should it? Should something else do it? Either way, it needs to happen before TFTP is initiated, otherwise TFTP fails.
The fact that this doesn’t appear to happen causes these sorts of finicky, timing- and network-config-dependent issues whenever a host’s IP changes.
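For anyone unsure what I mean by a gratuitous ARP, here is a rough Python sketch of the frame: a broadcast ARP request in which the sender and target IP are both the host’s own address, so that neighbours refresh their cache entry for it. The function names, the example MAC/IP and the interface name are all made up for illustration. This is not something MAAS or pxelinux.0 does as far as I know, and since it needs a Linux host with raw socket privileges it could not run as-is in the PXE stage. It only shows what would have to go on the wire.

```python
import socket
import struct


def build_gratuitous_arp(mac: str, ip: str) -> bytes:
    """Build a 42-byte Ethernet frame carrying a gratuitous ARP request.

    mac and ip are the announcing host's own addresses (hypothetical inputs).
    """
    mac_bytes = bytes.fromhex(mac.replace(":", ""))
    if len(mac_bytes) != 6:
        raise ValueError("MAC address must be 6 bytes")
    ip_bytes = socket.inet_aton(ip)

    # Ethernet header: broadcast destination, our MAC as source, EtherType ARP.
    eth_header = b"\xff" * 6 + mac_bytes + struct.pack("!H", 0x0806)

    # ARP header: Ethernet (1), IPv4 (0x0800), hlen 6, plen 4, opcode 1 (request).
    arp_fixed = struct.pack("!HHBBH", 1, 0x0800, 6, 4, 1)

    # Gratuitous: sender IP == target IP; target MAC left as zeros.
    arp_body = mac_bytes + ip_bytes + b"\x00" * 6 + ip_bytes
    return eth_header + arp_fixed + arp_body


def send_gratuitous_arp(interface: str, mac: str, ip: str) -> None:
    """Send the frame on a Linux interface. Needs root (CAP_NET_RAW)."""
    with socket.socket(socket.AF_PACKET, socket.SOCK_RAW) as sock:
        sock.bind((interface, 0))
        sock.send(build_gratuitous_arp(mac, ip))
```

On a Linux box this would be called as something like `send_gratuitous_arp("eth0", "52:54:00:12:34:56", "10.0.0.5")`, with the interface and addresses replaced by real ones. Whether the core network actually honours gratuitous ARPs is a separate question I can’t answer from here.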
Another option would be to assign a specific static address for commissioning on a PER SERVER basis, so that across the whole timeline (16.04 -> commissioning -> deployment -> 20.04) its IP would never change and we’d never hit this aggravating ARP-related issue.
Right now the only ways we can make this work are to put PAUSES into the process so the ARP entries age out, or to call network engineering to clear ARP entries on a one-off (untenable) basis.
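To give an idea of the pause workaround, this is roughly the arithmetic involved, as a small Python sketch. The 20 minute timeout is the one from our core network; the function name and the idea of tracking when an IP was last used are just assumptions for illustration, not anything MAAS exposes.

```python
from datetime import datetime, timedelta

# ARP timeout on our core network, as described above.
ARP_TIMEOUT = timedelta(minutes=20)


def seconds_until_safe(last_used: datetime, now: datetime,
                       timeout: timedelta = ARP_TIMEOUT) -> float:
    """Return how long to pause before an IP can be reused by another host.

    last_used is when the IP was last held by a different machine
    (hypothetical bookkeeping we would have to do ourselves).
    """
    remaining = (last_used + timeout) - now
    return max(0.0, remaining.total_seconds())
```

So if an IP was last used at 12:00 and we want to reuse it at 12:05, the script would have to sleep for another 15 minutes, which is exactly the kind of delay we’d like to get rid of.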