[2.5] [Reopened] Proxy suddenly stopped working and node can't get metadata or datasource


#1

In my setup, the rack controller and the region controller are on different networks.
Both have public IPs, but the nodes provisioned by the rack controller don’t.

It was working well before, and then it stopped working without any change on my side.
Nodes now fail to start because they fail to fetch metadata, and I also see “Can not apply stage final, no datasource found”.

There are some errors in the squid log; I don’t know if they are related:

ERROR: NAT/TPROXY lookup failed to locate original IPs on local=10.1.200.13:3128 remote=10.1.200.2:58719 FD 13 flags=33
2018/12/20 03:26:28 kid1| ERROR: NF getsockopt(ORIGINAL_DST) failed on local=10.1.200.13:3128 remote=10.1.200.2:52488 FD 13 flags=33: (92) Protocol not available

I verified that all services on the rack and region controllers are up and running. I tried restarting the rackd and proxy services, but it didn’t help. I suspect this is a hidden bug. Can someone help diagnose it?


Update:

I don’t know if it’s related, but I noticed that when this happens, MAAS stops generating random hostnames and always uses “ubuntu” as the hostname.


Update 2:

I found that DNS was down on the rack controller, which was the root cause of the issue.


Update 3:

This just happened again. This time I verified that there is no DNS issue or network connectivity issue.
It seems to happen after I delete a node that failed commissioning and try re-enlistment.
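
For anyone who wants to repeat the DNS and connectivity checks mentioned in the updates, here is a minimal sketch in Python. It only uses the standard library. The function names (`resolves`, `tcp_reachable`, `report`) are made up for illustration and are not part of MAAS; which host name and which address/port pairs to test is an assumption you have to fill in for your own setup (for example, the proxy address 10.1.200.13:3128 that appears in the squid log above).

```python
import socket


def resolves(hostname):
    """Return the sorted addresses hostname resolves to, or [] if lookup fails."""
    try:
        infos = socket.getaddrinfo(hostname, None)
    except socket.gaierror:
        return []
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the address.
    return sorted({info[4][0] for info in infos})


def tcp_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable networks.
        return False


def report(dns_name, endpoints):
    """Collect the DNS and TCP results into a plain dict (hypothetical helper)."""
    return {
        "dns": {dns_name: resolves(dns_name)},
        "tcp": {
            f"{host}:{port}": tcp_reachable(host, port)
            for host, port in endpoints
        },
    }
```

Run from a machine on the node network, something like `print(report("<name to resolve>", [("10.1.200.13", 3128)]))` shows whether name resolution returns any address and whether the proxy port accepts connections. An empty address list points at DNS; a `False` for the proxy endpoint points at connectivity or the proxy service itself. It does not say anything about whether squid then handles the request correctly.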