SOLVED (pebkac): Changing IPs and lack of gratuitous arp and the pain it causes

dandruczyk · 19 July 2021 14:26

Here’s a problem that has been plaguing us.
Background (maas 2.9.2):

We’re reimaging older boxes (16.04) to (20.04). IP will stay the same.
We have a small dedicated pool of IPs for maas to use for DHCP in each of the subnets (10-15 addresses)
Our core network has an arp timeout of 20 minutes (can’t be changed, policy/politics…)
The maas rack controller has an interface on the same subnet as the machines being reimaged (BUT NOT THE SAME SWITCH)
The region controllers are in a different VLAN
Assume no IPs in the maas DHCP pool have been used in the past half hour

Problem:

Commissioning:
Case a. (no IPs in the commissioning pool have been used today) Commissioning works fine: the machine reboots, pxeboots with an IP from the pool, does its thing and powers off.
Case b. (maas hands out an IP for commissioning that a different machine used in the past 20 minutes) TFTP will time out and fail. Retrying sometimes works, but only if the retry happens more than 20 minutes after that IP was last used by the other machine (i.e. the arp entry has aged out).
Deployment:
a. Machines are deployed with a STATIC IP (the IP they had when running 16.04)
If it has been less than 20 minutes since commissioning, the problem happens: the machine powers up, gets ITS FORMER STATIC IP via DHCP from maas (why isn’t it using the DHCP pool in this case???), tftp times out and the deployment hangs. Multiple retries will fail until the arp entry from commissioning expires, at which point deployment will succeed.

The problem in this case is ARP, or more specifically the lack of gratuitous ARPs when the IP is changed. Ideally this should be done as early in the process as possible, i.e. can pxelinux.0 do it? Should it? Should something else do it? It needs to happen before tftp is initiated, otherwise it fails.

The fact that it doesn’t appear to do this causes these sorts of finicky, timing- and network-config-dependent issues whenever a host’s IP changes.
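For reference, a gratuitous ARP is just an ordinary ARP request, sent to the Ethernet broadcast address, in which the sender IP and the target IP are the same address, so that neighbours holding a stale entry can refresh it. Below is a minimal Python sketch of what such a frame looks like on the wire. The function names, the MAC and the IP are made up for illustration; actually sending it needs Linux and root, and nothing here is claimed to be what maas or pxelinux does.

```python
import socket
import struct


def build_gratuitous_arp(mac: str, ip: str) -> bytes:
    """Build a broadcast Ethernet frame carrying a gratuitous ARP request."""
    mac_bytes = bytes.fromhex(mac.replace(":", ""))
    ip_bytes = socket.inet_aton(ip)

    # Ethernet header: broadcast destination, our MAC as source, EtherType ARP.
    ethernet = b"\xff" * 6 + mac_bytes + struct.pack("!H", 0x0806)

    # ARP header: Ethernet (1), IPv4 (0x0800), 6-byte MACs, 4-byte IPs, request (1).
    arp = struct.pack("!HHBBH", 1, 0x0800, 6, 4, 1)
    arp += mac_bytes + ip_bytes    # sender MAC / sender IP
    arp += b"\x00" * 6 + ip_bytes  # target MAC zeroed / target IP == sender IP
    return ethernet + arp


def send_frame(interface: str, frame: bytes) -> None:
    """Send a raw frame on the given interface (Linux only, requires root)."""
    with socket.socket(socket.AF_PACKET, socket.SOCK_RAW) as sock:
        sock.bind((interface, 0))
        sock.send(frame)


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    frame = build_gratuitous_arp("52:54:00:12:34:56", "10.0.0.25")
    print(len(frame), frame.hex())
```

Only `build_gratuitous_arp` is exercised in the example; `send_frame` is there to show where the frame would go, and whether the upstream gear honours gratuitous ARPs at all depends on its configuration.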

Another option would be to specify a static address to be used for commissioning on a PER SERVER basis, such that along the whole timeline of 16.04 -> commissioning -> deployment -> 20.04 its IP would never change and we’d never hit this aggravating arp-related issue.

Right now the only way we can make this work is to put PAUSES into the process to allow the arp entries to age out, or call network engineering to clear arp entries on a one-off (untenable) basis.
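The pause itself is simple arithmetic. A rough Python sketch, assuming the arp entry ages out exactly 20 minutes after the IP was last used by the previous machine (in reality any traffic from the old holder would refresh it, and the helper name is made up):

```python
from datetime import datetime, timedelta

# The 20-minute arp timeout on the core network described above.
ARP_TIMEOUT = timedelta(minutes=20)


def seconds_until_safe(last_used: datetime, now: datetime,
                       timeout: timedelta = ARP_TIMEOUT) -> float:
    """Seconds to wait before reusing an IP, or 0.0 if the entry has aged out."""
    remaining = (last_used + timeout) - now
    return max(0.0, remaining.total_seconds())


if __name__ == "__main__":
    # Hypothetical timestamps: IP last used at 14:00, we want to deploy at 14:05.
    last_used = datetime(2021, 7, 19, 14, 0)
    now = datetime(2021, 7, 19, 14, 5)
    print(seconds_until_safe(last_used, now))
```

With those timestamps the helper says to wait another 15 minutes, which is exactly the kind of delay that makes this approach painful at scale.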

cgrabowski · 19 July 2021 20:15

Hey there @dandruczyk!

Regarding the hosts using the same static IP, would you mind sharing /var/lib/maas/dhcpd.conf and/or /var/lib/maas/dhcpd6.conf if you’re using the deb package (the same paths prefixed with /var/snap/maas/common if you’re using the snap)? It is likely the static IPs are still rendered within those configs as host reservations, as that would be the case with static or auto-assigned links, unless the configuration of these is changed prior to commissioning.
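If it helps, here is a rough Python sketch of how one might pull host reservations out of such a file. It assumes the reservations use the usual ISC dhcpd `host { hardware ethernet ...; fixed-address ...; }` form with no nested braces; the sample text and names are hypothetical, and the layout maas actually renders may differ.

```python
import re

# Simplified patterns for ISC dhcpd "host" declarations (no nested braces).
HOST_BLOCK = re.compile(r"\bhost\s+(\S+)\s*\{(.*?)\}", re.DOTALL)
MAC = re.compile(r"hardware\s+ethernet\s+([0-9A-Fa-f:]+)\s*;")
ADDR = re.compile(r"fixed-address\s+([0-9.]+)\s*;")

# Hypothetical sample, not taken from a real maas-generated config.
SAMPLE = """
subnet 10.0.0.0 netmask 255.255.255.0 {
    pool { range 10.0.0.200 10.0.0.214; }
}
host node-a {
    hardware ethernet 52:54:00:AA:BB:01;
    fixed-address 10.0.0.25;
}
"""


def host_reservations(conf_text: str) -> list[tuple[str, str, str]]:
    """Return (host name, MAC, fixed IPv4 address) for each host block found."""
    found = []
    for name, body in HOST_BLOCK.findall(conf_text):
        mac = MAC.search(body)
        addr = ADDR.search(body)
        if mac and addr:
            found.append((name, mac.group(1).lower(), addr.group(1)))
    return found


if __name__ == "__main__":
    # On a deb install one would read /var/lib/maas/dhcpd.conf instead of SAMPLE.
    for name, mac, ip in host_reservations(SAMPLE):
        print(name, mac, ip)
```

Any machine whose MAC shows up here with its old static address would be handed that address over DHCP rather than one from the dynamic pool.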

It does sound like configuring a new IP for these hosts prior to commissioning would address these issues. Is there any reason these hosts would still need their existing IP while using the 16.04 image?

dandruczyk · 22 July 2021 14:46

Changing the IP without a gratuitous arp leads to no connectivity until the arp record ages out. If the IP didn’t change, that wouldn’t happen; if the node sent a gratuitous arp when it got a new IP from dhcp, the problem wouldn’t happen either. The fact that the IP changes (without a gratuitous arp) causes the problem. This is why load balancers that have a floating IP send out a gratuitous arp when a failover occurs…

dandruczyk · 28 July 2021 21:42

I solved this issue. The problem was incorrect networking configuration on the rack controllers, specifically using policy-based routing for a dual-homed rack controller that was handling imaging nodes both internal and external to the network. It turned out the policy routing unnecessarily complicated things and made this act in a very strange way (it would work, then it wouldn’t, then it would, etc.). Simplifying that configuration and removing the policy routing solved the issue entirely. This is why I’m not a network engineer: I know enough to be dangerous, but not necessarily correct…
