Rack to Region latency and TFTP timeouts

MAAS 2.9.2 (2.9 stable)
We’re experiencing some pain with “long distance” rack-to-region communications and deployments. We have an HA 3-node region controller setup, as per the MAAS docs, in the US, with rack controllers in Melbourne, Australia (and many other places). Commissioning works just fine; deployments, however, keep giving us TFTP timeouts. They eventually work, but the failures cause a lot of pain for the engineers trying to build or rebuild hosts, as they need to retry multiple times before succeeding. Only about 1 in 2 or 1 in 3 attempts WORK, which is painful.

The fact that commissioning works reasonably reliably but deploys fail a LOT MORE, at the TFTP and metadata fetch steps, tells me this is a latency, packet loss, or routing issue. I’m assuming that deployment involves more rack<->region communication during the DHCP/TFTP/initial boot phase.

Round-trip time is about 268 ms from Australia to the US.

Hi dandruczyk,

MAAS is designed to work in low latency environments, but some users have reported that it can work in long-distance deployments. In order to help you, we need more information:

  1. What’s the peak latency and packet loss ratio between the sites? (Use iperf or a similar tool to get a more precise measurement.)
  2. Are the region controllers located in the same data center?
  3. Where is the HA PostgreSQL server?
  4. Are you using HAProxy?
  5. Can you share the error log?
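For question 1, if iperf is not at hand, a rough first number can be obtained by timing TCP connects from the rack controller's site to the region. The following is only a minimal sketch, not a MAAS tool: the function names `probe` and `summarize` are made up for illustration, and it measures TCP connect time and failed connects rather than true packet loss, so treat the result as a hint rather than a replacement for iperf.

```python
import socket
import time


def probe(host, port, count=20, timeout=2.0):
    """Time `count` TCP connects; a failed or timed-out connect is recorded as None."""
    samples = []
    for _ in range(count):
        start = time.monotonic()
        try:
            with socket.create_connection((host, port), timeout=timeout):
                samples.append(time.monotonic() - start)
        except OSError:  # covers timeouts, refused connections and DNS failures
            samples.append(None)
    return samples


def summarize(samples):
    """Return (average_ms, peak_ms, failure_ratio) for connect times given in seconds."""
    ok = [s for s in samples if s is not None]
    failed = 1 - len(ok) / len(samples) if samples else 0.0
    if not ok:
        return None, None, failed
    return 1000 * sum(ok) / len(ok), 1000 * max(ok), failed
```

Run from the remote site, something like `summarize(probe("region.example.com", 5240))` would give average, peak, and failure ratio. The host name is a placeholder, and port 5240 is an assumption; use the region VIP and whatever port is actually reachable in your setup.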
  1. Latency is 260 ms average, about 280 ms peak.
  2. All 3 region controllers are in the US in the same DC, with sub-1 ms inter-node latency. The region controllers are VMs on different OpenStack hosts.
  3. The PostgreSQL HA servers are in the US datacenter, with sub-1 ms latency to the region controllers. They are also in OpenStack, on different hypervisors and on local SSD storage.
  4. Yes, HAProxy is on the region controllers as per the MAAS HA region documentation. keepalived is used to float the VIP between nodes for easy maintenance and failover, and it has appeared to work flawlessly since I upgraded to 2.9.x in April.
  5. Error logs for what exactly? The errors are on the console of the VM and vary from:
    netconn_connect error -10 when TFTP is trying to fetch the kernel or initrd, to the VM failing to fetch metadata, which is pretty much a verbatim copy of the following, with different dates and different IPs.

2017-12-05 14:53:18,748 - util.py[WARNING]: Failed fetching metadata from url http:///MAAS/metadata/curtin
2017-12-05 14:53:18,755 - util.py[WARNING]: No instance datasource found! Likely bad things to come!

@dandruczyk, to answer question 5, are there any MAAS error logs, like the region and rack controller logs, machine syslogs, etc., that you can capture and paste?

Hello,

TFTP is served from the rack controllers only for the initial stage; after that, communication switches to HTTP, which works even better.
I would suggest you check your networking equipment at the remote site for packet loss or misconfiguration.
Unless the connections from rack to region drop and reconnect often (check /var/log/maas/rackd.log on the rack controller for that), your pain point is probably located at the remote location alone.
Look into https://www.dummies.com/programming/networking/cisco/spanning-tree-protocol-stp-and-portfast/
just the snippet.
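To make the rackd.log check above a bit more concrete, here is a small sketch that counts suspicious lines per day. The keywords in `PATTERN` are guesses, since the exact wording of the rack controller's connection messages is not quoted in this thread, and the code assumes each log line starts with a `YYYY-MM-DD` date. The function name `connection_events` is hypothetical.

```python
import re
from collections import Counter

# Assumed keywords: adjust them to whatever your rackd.log actually prints.
PATTERN = re.compile(r"disconnect|connection (was )?lost|reconnect", re.IGNORECASE)


def connection_events(lines, pattern=PATTERN):
    """Count matching log lines per day, keyed on the first 10 characters (YYYY-MM-DD)."""
    per_day = Counter()
    for line in lines:
        if pattern.search(line):
            per_day[line[:10]] += 1
    return per_day


# Example use on a rack controller (path taken from the post above):
#   with open("/var/log/maas/rackd.log") as log:
#       for day, hits in sorted(connection_events(log).items()):
#           print(day, hits)
```

A steady trickle of hits on the days when deployments failed would point at the rack-to-region link; no hits at all would support the idea that the problem is local to the remote site.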

I see no drops from rack to region on the rack controllers. The problem rarely, if ever, happens on commissioning, but it happens OFTEN on deploy… why? PortFast is already set properly on all ports.

I am sorry, but without specifics there is not much to go on to diagnose the problem.
The ephemeral image load procedure is identical for commissioning, testing, and deployment, so if it works for commissioning, it should work the same way for deployment.
The TFTP and HTTP traffic that loads the ephemeral Linux image is all served from the rack; the region is not part of this.
The region is only involved once cloud-init is triggered and deployment starts, when the machine communicates with MAAS to get the curtin data for deployment.

TFTP errors usually mean bad cabling or misconfiguration somewhere between the rack and the server being deployed.

This was caused by the networking configuration used on some dual multi rack controllers. Once that was resolved, the issue disappeared.