[2.5] [Reopened] Proxy suddenly stopped working and node can't get metadata or datasource

In my setup the rack controller and the region controller are on different networks.
They both have public IPs, but the nodes provisioned by the rack controller don't.

It was working well before, then it stopped working without any change on my side.
Nodes fail to start because they fail to fetch metadata, and I also see “Can not apply stage final, no datasource found”.

There are some errors in the squid log; I don't know if they're related:

ERROR: NAT/TPROXY lookup failed to locate original IPs on local=10.1.200.13:3128 remote=10.1.200.2:58719 FD 13 flags=33
2018/12/20 03:26:28 kid1| ERROR: NF getsockopt(ORIGINAL_DST) failed on local=10.1.200.13:3128 remote=10.1.200.2:52488 FD 13 flags=33: (92) Protocol not available
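
To see which clients are triggering these errors and how often, the log can be summarized with a short script. This is only a minimal sketch: the regular expression is an assumption based on the two lines quoted above, and `summarize` and `SAMPLE` are hypothetical names, not part of squid or MAAS.

```python
import re
from collections import Counter

# Pattern assumed from the two error lines quoted above; other squid
# versions or log formats may need a different expression.
ENDPOINTS = re.compile(
    r"local=(?P<local>[\d.]+:\d+)\s+remote=(?P<remote_ip>[\d.]+):\d+"
)

# The two lines from the post, used here as sample input.
SAMPLE = [
    "ERROR: NAT/TPROXY lookup failed to locate original IPs on "
    "local=10.1.200.13:3128 remote=10.1.200.2:58719 FD 13 flags=33",
    "2018/12/20 03:26:28 kid1| ERROR: NF getsockopt(ORIGINAL_DST) failed on "
    "local=10.1.200.13:3128 remote=10.1.200.2:52488 FD 13 flags=33: "
    "(92) Protocol not available",
]


def summarize(lines):
    """Count ERROR lines per (local endpoint, remote IP) pair."""
    counts = Counter()
    for line in lines:
        if "ERROR" not in line:
            continue  # skip non-error log lines
        match = ENDPOINTS.search(line)
        if match:
            counts[(match.group("local"), match.group("remote_ip"))] += 1
    return counts


if __name__ == "__main__":
    for (local, remote_ip), n in summarize(SAMPLE).items():
        print(f"{remote_ip} -> {local}: {n} error(s)")
```

To run it against a real log, pass an open file object to `summarize` instead of `SAMPLE`.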

I verified that all services on the rack and region controllers are up and running. I tried restarting the rackd and proxy services, but it didn't help. I suspect this is a hidden bug; can someone help diagnose it?


Update:

I don't know if it's related, but I noticed that when this happens, MAAS stops generating random hostnames and always uses “ubuntu” as the hostname.


Update 2:

I found that DNS was down on the rack controller, which was the root cause of the issue.
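
For anyone who wants to rule out the same cause, here is a rough sketch of the two checks, run from the rack controller or from a node: does a name resolve, and is the proxy port reachable? The helper names `resolves` and `tcp_open` are made up for illustration. The address 10.1.200.13:3128 is taken from the squid log above, and the hostname in the usage comment is a placeholder.

```python
import socket


def resolves(name):
    """Return the sorted list of addresses a name resolves to, or [] on failure."""
    try:
        infos = socket.getaddrinfo(name, None)
    except socket.gaierror:
        return []
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})


def tcp_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Example usage (placeholder hostname; proxy address taken from the log above):
#   print(resolves("region-controller.example"))
#   print(tcp_open("10.1.200.13", 3128))
```

An empty list from `resolves` points at DNS, while `False` from `tcp_open` points at connectivity or at the proxy service itself.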


Update 3:

This just happened again. This time I verified there is no DNS or network connectivity issue.
It seems to happen after I delete a node that failed commissioning and try to re-enlist it.

I've got exactly the same situation with my MAAS setup. One important difference: my rack controller was located in a different subnet. When this issue occurred, I was unable to find the rack controller's subnet in the list of available subnets in the MAAS web UI. Playing around with the subnet records solved the problem. The question is: why was this subnet removed from the list of available subnets?

Hi,

Not sure if we are facing the same issue. After digging around, I found the root cause of it; you can check it here:

https://bugs.launchpad.net/maas/+bug/1813894