Machine listing (25 Machines) hangs, with rackd timing out on region, how to debug

Howdy

This one is driving me nuts, a bit.
I have a small(?) installation (MAAS 3.3.4, DEB) with about 25 machines.
Yet listing the machines, in the GUI as well as on the CLI, takes an awful lot of time (sometimes more than 10 minutes).

While that happens, rackd seems to time out when contacting regiond (same server):

2023-08-09 15:01:53 provisioningserver.rpc.clusterservice: [info] Rack controller 'yadpqf' registered (via deploy:pid=1167) with MAAS version 3.3.4-13189-g.f88272d1e.
2023-08-09 15:02:03 ClusterClient,client: [info] ClusterClient connection lost (HOST:IPv6Address(type='TCP', host='::ffff:$DEPLOY_A_IP', port=46564, flowInfo=0, scopeID=0) PEER:IPv6Address(type='TCP', host='::ffff:$DEPLOY_A_IP', port=5251, flowInfo=0, scopeID=0))
2023-08-09 15:02:03 provisioningserver.rpc.clusterservice: [critical] Failed to contact region. (While requesting RPC info at http://$DEPLOY_B_IP:5240/MAAS).
        Traceback (most recent call last):
          File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 661, in callback
            self._startRunCallbacks(result)
          File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 763, in _startRunCallbacks
            self._runCallbacks()
          File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 857, in _runCallbacks
            current.result = callback(  # type: ignore[misc]
          File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 1750, in gotResult
            current_context.run(_inlineCallbacks, r, gen, status)
        --- <exception caught here> ---
          File "/usr/lib/python3/dist-packages/provisioningserver/rpc/clusterservice.py", line 1299, in _doUpdate
            eventloops, maas_url = yield self._get_rpc_info(urls)
          File "/usr/lib/python3/dist-packages/provisioningserver/rpc/clusterservice.py", line 1558, in _get_rpc_info
            raise config_exc
          File "/usr/lib/python3/dist-packages/provisioningserver/rpc/clusterservice.py", line 1529, in _get_rpc_info
            eventloops, maas_url = yield self._parallel_fetch_rpc_info(urls)
          File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 857, in _runCallbacks
            current.result = callback(  # type: ignore[misc]
          File "/usr/lib/python3/dist-packages/provisioningserver/rpc/clusterservice.py", line 1503, in handle_responses
            errors[0].raiseException()
          File "/usr/lib/python3/dist-packages/twisted/python/failure.py", line 475, in raiseException
            raise self.value.with_traceback(self.tb)
          File "/usr/lib/python3/dist-packages/provisioningserver/rpc/clusterservice.py", line 1464, in _serial_fetch_rpc_info
            raise last_exc
          File "/usr/lib/python3/dist-packages/provisioningserver/rpc/clusterservice.py", line 1456, in _serial_fetch_rpc_info
            response = yield self._fetch_rpc_info(url, orig_url)
          File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 1656, in _inlineCallbacks
            result = current_context.run(
          File "/usr/lib/python3/dist-packages/twisted/python/failure.py", line 489, in throwExceptionIntoGenerator
            return g.throw(self.type, self.value, self.tb)
          File "/usr/lib/python3/dist-packages/provisioningserver/rpc/clusterservice.py", line 1558, in _get_rpc_info
            raise config_exc
          File "/usr/lib/python3/dist-packages/provisioningserver/rpc/clusterservice.py", line 1529, in _get_rpc_info
            eventloops, maas_url = yield self._parallel_fetch_rpc_info(urls)
          File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 857, in _runCallbacks
            current.result = callback(  # type: ignore[misc]
          File "/usr/lib/python3/dist-packages/provisioningserver/rpc/clusterservice.py", line 1503, in handle_responses
            errors[0].raiseException()
          File "/usr/lib/python3/dist-packages/twisted/python/failure.py", line 475, in raiseException
            raise self.value.with_traceback(self.tb)
          File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 1656, in _inlineCallbacks
            result = current_context.run(
          File "/usr/lib/python3/dist-packages/twisted/python/failure.py", line 489, in throwExceptionIntoGenerator
            return g.throw(self.type, self.value, self.tb)
          File "/usr/lib/python3/dist-packages/provisioningserver/rpc/clusterservice.py", line 1464, in _serial_fetch_rpc_info
            raise last_exc
          File "/usr/lib/python3/dist-packages/provisioningserver/rpc/clusterservice.py", line 1456, in _serial_fetch_rpc_info
            response = yield self._fetch_rpc_info(url, orig_url)
        twisted.web._newclient.ResponseNeverReceived: [<twisted.python.failure.Failure twisted.internet.defer.CancelledError: >]
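
The last line of the traceback, `ResponseNeverReceived` wrapping a `CancelledError`, suggests the HTTP request for RPC info was cancelled before the region answered. A minimal sketch for timing such a request by hand is below. The helper name `probe` and the 10-second timeout are hypothetical, and the exact path to request (for example an `/rpc/` suffix on the MAAS URL from the log) is an assumption, not something the log confirms.

```python
import time
import urllib.error
import urllib.request


def probe(url, timeout=10.0):
    """Fetch url once and return (seconds_elapsed, outcome).

    outcome is the HTTP status code if a response arrived, or the
    exception class name if the request failed or timed out.
    """
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            response.read()  # include body transfer in the timing
            outcome = response.status
    except urllib.error.HTTPError as exc:
        outcome = exc.code  # the server answered, but with an error status
    except Exception as exc:  # URLError, timeouts, malformed URLs, ...
        outcome = type(exc).__name__
    return time.monotonic() - start, outcome


# Example call (placeholder host, assumed path):
# elapsed, outcome = probe("http://<region-ip>:5240/MAAS/rpc/")
# print(f"{outcome} after {elapsed:.2f}s")
```

If the probe regularly takes many seconds while a machine listing is running, that would point at the region being too busy to answer rather than at a network problem between rackd and regiond.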

The regiond log around that time seems benign, though:

2023-08-09 15:01:53 maasserver.rpc.regionservice: [info] Rack controller authenticated from '::ffff:$DEPLOY_A_IP:46624'.
2023-08-09 15:01:53 maasserver.ipc: [info] Worker pid:1167 registered RPC connection to ('yadpqf', '$DEPLOY_A_IP', 5251).
2023-08-09 15:01:54 maasserver.dhcp: [info] Successfully configured DHCPv4 on rack controller 'deploy (yadpqf)'.
2023-08-09 15:01:54 maasserver.dhcp: [info] Successfully configured DHCPv6 on rack controller 'deploy (yadpqf)'.
2023-08-09 15:02:03 RegionServer,3,::ffff:$DEPLOY_A_IP: [info] RegionServer connection lost (HOST:IPv6Address(type='TCP', host='::ffff:$DEPLOY_A_IP', port=5251, flowInfo=0, scopeID=0) PEER:IPv6Address(type='TCP', host='::ffff:$DEPLOY_A_IP', port=46564, flowInfo=0, scopeID=0))
2023-08-09 15:02:03 maasserver.ipc: [info] Worker pid:1167 lost RPC connection to ('yadpqf', '$DEPLOY_A_IP', 5251).
2023-08-09 15:02:03 maasserver.dhcp: [info] Successfully configured DHCPv4 on rack controller 'deploy (yadpqf)'.
2023-08-09 15:02:03 maasserver.dhcp: [info] Successfully configured DHCPv6 on rack controller 'deploy (yadpqf)'.
2023-08-09 15:02:23 twisted.internet.protocol.Factory: [info] RegionServer connection established (HOST:IPv6Address(type='TCP', host='::ffff:$DEPLOY_A_IP', port=5251, flowInfo=0, scopeID=0) PEER:IPv6Address(type='TCP', host='::ffff:$DEPLOY_A_IP', port=51808, flowInfo=0, scopeID=0))
2023-08-09 15:02:23 twisted.internet.protocol.Factory: [info] RegionServer connection established (HOST:IPv6Address(type='TCP', host='::ffff:$DEPLOY_A_IP', port=5251, flowInfo=0, scopeID=0) PEER:IPv6Address(type='TCP', host='::ffff:$DEPLOY_A_IP', port=51824, flowInfo=0, scopeID=0))
2023-08-09 15:02:23 twisted.internet.protocol.Factory: [info] RegionServer connection established (HOST:IPv6Address(type='TCP', host='::ffff:$DEPLOY_A_IP', port=5251, flowInfo=0, scopeID=0) PEER:IPv6Address(type='TCP', host='::ffff:$DEPLOY_A_IP', port=51832, flowInfo=0, scopeID=0))
2023-08-09 15:02:23 maasserver.rpc.regionservice: [info] Rack controller authenticated from '::ffff:$DEPLOY_A_IP:51808'.
2023-08-09 15:02:23 maasserver.rpc.regionservice: [info] Rack controller authenticated from '::ffff:$DEPLOY_A_IP:51824'.
2023-08-09 15:02:23 maasserver.rpc.regionservice: [info] Rack controller authenticated from '::ffff:$DEPLOY_A_IP:51832'.
2023-08-09 15:02:24 maasserver.ipc: [info] Worker pid:1167 registered RPC connection to ('yadpqf', '$DEPLOY_A_IP', 5251).
2023-08-09 15:02:24 maasserver.ipc: [info] Worker pid:1167 registered RPC connection to ('yadpqf', '$DEPLOY_A_IP', 5251).
2023-08-09 15:02:24 maasserver.ipc: [info] Worker pid:1167 registered RPC connection to ('yadpqf', '$DEPLOY_A_IP', 5251).
2023-08-09 15:02:25 maasserver.dhcp: [info] Successfully configured DHCPv4 on rack controller 'deploy (yadpqf)'.
2023-08-09 15:02:25 maasserver.dhcp: [info] Successfully configured DHCPv6 on rack controller 'deploy (yadpqf)'.
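
To see how often and for how long the RPC connection drops, the log can be scanned for "connection lost" and "connection established" pairs. This is a rough sketch that assumes every relevant line starts with a `YYYY-MM-DD HH:MM:SS` timestamp, as in the excerpt above; `reconnect_gaps` is a hypothetical helper name, and the log path in the usage comment is an assumption for a DEB install.

```python
import re
from datetime import datetime

# A timestamp at the start of the line, followed by the rest of the message.
LINE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) (.*)$")


def reconnect_gaps(lines):
    """Return the seconds between each 'connection lost' line and the
    next 'connection established' line."""
    gaps = []
    lost_at = None
    for line in lines:
        match = LINE.match(line)
        if not match:
            continue  # traceback lines or lines without a full timestamp
        stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
        text = match.group(2)
        if "connection lost" in text and lost_at is None:
            lost_at = stamp
        elif "connection established" in text and lost_at is not None:
            gaps.append((stamp - lost_at).total_seconds())
            lost_at = None
    return gaps


# Example (assumed log location):
# with open("/var/log/maas/regiond.log") as handle:
#     print(reconnect_gaps(handle))
```

Many gaps of similar length clustered around slow listings would suggest that the disconnects are a symptom of the region being overloaded rather than a separate fault.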

The machine is a VM with all of rackd, regiond, and PostgreSQL installed.
It has 16G RAM and 16 Cores.

How do I start to debug this?

I already tried setting num_workers: 8 in regiond.conf.

Best regards
-Tobias

Hello Tobias,

Unfortunately, there is an issue in 3.3.x which we’ve discovered recently. A new 3.3.5 with a fix should arrive soon, but I don’t have an ETA for it.

Meanwhile, maybe you can try building your own DEB package from source or try ppa:maas/3.3-next?

I think you can even patch your existing installation.
Here is a commit that should improve performance of the API (hence GUI and CLI machines listing)
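
To check whether a patch or upgrade actually helps, it may be worth timing the listing before and after. A minimal sketch, assuming the `maas` CLI is already logged in; the profile name `admin` in the usage comment and the helper name `time_command` are placeholders.

```python
import subprocess
import time


def time_command(argv):
    """Run argv, discard its output, and return (seconds_elapsed, exit_code)."""
    start = time.monotonic()
    completed = subprocess.run(
        argv,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return time.monotonic() - start, completed.returncode


# Example ("admin" is a placeholder profile name):
# elapsed, code = time_command(["maas", "admin", "machines", "read"])
# print(f"exit {code} after {elapsed:.1f}s")
```

Running it a few times gives a more reliable picture than a single measurement, since the first request after a restart can be slower.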

Much better!
Thanks a lot!

It's much faster now:
around 30 seconds for 25 machines :)

Thanks for the feedback!
Glad it helped.
