Rack controller not connected after upgrade to 3.2 (region endpoints not exposed)

Hi,

After upgrading to MAAS 3.2 (Postgres 12, Raspberry Pi 3B/aarch64, Ubuntu 20.04 LTS), the MAAS UI shows:

One rack controller is not yet connected to the region. Visit the rack controllers page for more information

In the meantime, rackd.log continuously logs the following:

2022-12-29 14:19:24 provisioningserver.rpc.clusterservice: [info] Region is not advertising RPC endpoints. (While requesting RPC info at http://192.168.10.4:5240/MAAS)
2022-12-29 14:19:25 provisioningserver.rpc.clusterservice: [info] Region is not advertising RPC endpoints. (While requesting RPC info at http://192.168.10.4:5240/MAAS)
2022-12-29 14:19:25 provisioningserver.rpc.clusterservice: [info] Region is not advertising RPC endpoints. (While requesting RPC info at http://192.168.10.4:5240/MAAS)
2022-12-29 14:19:26 provisioningserver.rpc.clusterservice: [info] Region is not advertising RPC endpoints. (While requesting RPC info at http://192.168.10.4:5240/MAAS)
2022-12-29 14:19:26 provisioningserver.rpc.clusterservice: [info] Region is not advertising RPC endpoints. (While requesting RPC info at http://192.168.10.4:5240/MAAS)
2022-12-29 14:19:27 provisioningserver.rpc.clusterservice: [info] Region is not advertising RPC endpoints. (While requesting RPC info at http://192.168.10.4:5240/MAAS)

regiond.log:

2022-12-29 14:20:25 regiond: [info] 127.0.0.1 GET /MAAS/rpc/ HTTP/1.1 --> 200 OK (referrer: -; agent: provisioningserver.rpc.clusterservice.ClusterClientService)
2022-12-29 14:20:26 regiond: [info] 127.0.0.1 GET /MAAS/rpc/ HTTP/1.1 --> 200 OK (referrer: -; agent: provisioningserver.rpc.clusterservice.ClusterClientService)
2022-12-29 14:20:36 regiond: [info] 127.0.0.1 GET /MAAS/rpc/ HTTP/1.1 --> 200 OK (referrer: -; agent: provisioningserver.rpc.clusterservice.ClusterClientService)
2022-12-29 14:20:37 regiond: [info] 127.0.0.1 GET /MAAS/rpc/ HTTP/1.1 --> 200 OK (referrer: -; agent: provisioningserver.rpc.clusterservice.ClusterClientService)

maas.log:

(...)
2022-12-29T14:18:43.006990+00:00 maas maas.service_monitor_service: [error] Can't update service statuses, no RPC connection to region.
2022-12-29T14:22:43.012537+00:00 maas maas.service_monitor_service: message repeated 4 times: [ [error] Can't update service statuses, no RPC connection to region.]
2022-12-29T14:23:27.913585+00:00 maas maas.boot_image_download_service: [error] Can't initiate image download, no RPC connection to region.
2022-12-29T14:23:27.914123+00:00 maas maas.dhcp.probe: [error] Can't initiate DHCP probe; no RPC connection to region.

regiond is running, one process per CPU core:

$ sudo maas status
bind9                            RUNNING   pid 1381, uptime 1 day, 6:19:57
dhcpd                            STOPPED   Not started
dhcpd6                           STOPPED   Not started
http                             RUNNING   pid 1522, uptime 1 day, 6:19:08
ntp                              RUNNING   pid 1484, uptime 1 day, 6:19:40
proxy                            STOPPED   Not started
rackd                            RUNNING   pid 1384, uptime 1 day, 6:19:57
regiond                          RUNNING   pid 1385, uptime 1 day, 6:19:57
syslog                           RUNNING   pid 1486, uptime 1 day, 6:19:40
$ ps -ef | grep regiond
root        1385     854  2 Dec28 ?        00:40:05 python3 /snap/maas/25212/bin/regiond
root        1462    1385 10 Dec28 ?        03:18:19 python3 /snap/maas/25212/bin/regiond
root        1464    1385  0 Dec28 ?        00:05:01 python3 /snap/maas/25212/bin/regiond
root        1465    1385  0 Dec28 ?        00:04:38 python3 /snap/maas/25212/bin/regiond
root        1467    1385  0 Dec28 ?        00:05:01 python3 /snap/maas/25212/bin/regiond

Strangely, regiond is not exposing any endpoints:

$ curl http://192.168.10.4:5240/MAAS/rpc/
{"eventloops": {}}
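
For anyone who wants to script this check, here is a minimal Python sketch. It assumes only the response shape seen above (a JSON object with an `eventloops` key); the function name `advertised_eventloops` is made up for illustration and is not part of MAAS.

```python
import json


def advertised_eventloops(body):
    """Return the 'eventloops' mapping from a /MAAS/rpc/ response body.

    Assumes the body is a JSON object with an 'eventloops' key, as in the
    curl output above. An empty dict means no RPC endpoints are advertised.
    """
    payload = json.loads(body)
    return payload.get("eventloops", {})


# Response body captured from the affected region controller.
body = '{"eventloops": {}}'
if not advertised_eventloops(body):
    print("region is not advertising any RPC endpoints")
```

An empty mapping here matches the "Region is not advertising RPC endpoints" message that rackd keeps logging.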

I enabled debug logging on regiond to investigate the issue and to capture the SQL queries used to populate the response to GET /MAAS/rpc/.

As I understand it, there is a series of 2 or 3 queries, which I captured and ran manually:

maasdb=> SELECT "maasserver_regioncontrollerprocess"."id", "maasserver_regioncontrollerprocess"."created", "maasserver_regioncontrollerprocess"."updated", "maasserver_regioncontrollerprocess"."region_id", "maasserver_regioncontrollerprocess"."pid" FROM "maasserver_regioncontrollerprocess" WHERE "maasserver_regioncontrollerprocess"."region_id" IN (1) ORDER BY "maasserver_regioncontrollerprocess"."pid" ASC;
 id  |            created            |            updated            | region_id | pid  
-----+-------------------------------+-------------------------------+-----------+------
 467 | 2022-12-28 08:04:03.901737+00 | 2022-12-29 14:31:19.93355+00  |         1 | 1462
 466 | 2022-12-28 08:04:03.437903+00 | 2022-12-29 14:31:19.736715+00 |         1 | 1464
 464 | 2022-12-28 08:04:02.413728+00 | 2022-12-29 14:31:19.23169+00  |         1 | 1465
 465 | 2022-12-28 08:04:02.989507+00 | 2022-12-29 14:31:19.418739+00 |         1 | 1467

maasdb=> SELECT "maasserver_regioncontrollerprocessendpoint"."id", "maasserver_regioncontrollerprocessendpoint"."created", "maasserver_regioncontrollerprocessendpoint"."updated", "maasserver_regioncontrollerprocessendpoint"."process_id", "maasserver_regioncontrollerprocessendpoint"."address", "maasserver_regioncontrollerprocessendpoint"."port" FROM "maasserver_regioncontrollerprocessendpoint";
 id | created | updated | process_id | address | port 
----+---------+---------+------------+---------+------
(0 rows)

It looks like the maasserver_regioncontrollerprocessendpoint table is not being populated with any data, which would explain why a call to GET /MAAS/rpc/ returns an empty eventloops dictionary.
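
The same reasoning can be sketched in Python by cross-referencing the two result sets. This assumes, based only on the column names, that `process_id` in the endpoint table refers to `id` in the process table; the helper name `processes_without_endpoints` is hypothetical.

```python
def processes_without_endpoints(process_ids, endpoint_process_ids):
    """Return the region process ids that have no endpoint row.

    process_ids: 'id' values from maasserver_regioncontrollerprocess.
    endpoint_process_ids: 'process_id' values from
    maasserver_regioncontrollerprocessendpoint (assumed to reference 'id').
    """
    registered = set(endpoint_process_ids)
    return [pid for pid in process_ids if pid not in registered]


# Values taken from the two query results above.
process_ids = [467, 466, 464, 465]
endpoint_process_ids = []  # the endpoint query returned 0 rows

print(processes_without_endpoints(process_ids, endpoint_process_ids))
# prints [467, 466, 464, 465]: every regiond worker lacks an endpoint
```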

Any idea how I can debug this further or fix the issue? It looks like the problem is on either the Postgres or the regiond side.
Since then, I have upgraded to MAAS 3.3, Ubuntu 22.04 LTS and Postgres 14, but the issue remains the same.

Thanks!

I managed to solve the issue.

The root cause is similar to the one in the topic "Error during init, init tries to use wlan0 when interface does not exist": regiond was periodically logging the following error, which prevented it from advertising its RPC endpoints:

2023-01-04 22:06:24 provisioningserver.utils.services: [critical] Failed to update and/or record network interface configuration: Command `/snap/maas/25212/usr/share/maas/machine-resources/arm64` returned non-zero exit status 1:
        ERROR: Failed to retrieve network information: Failed to add device information for "/sys/class/net/wlan0/device": Failed to add port info: Failed to ETHTOOL_GLINK: operation not supported; interfaces: None
        Traceback (most recent call last):
          File "/snap/maas/25212/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 857, in _runCallbacks
            current.result = callback(  # type: ignore[misc]
          File "/snap/maas/25212/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 1750, in gotResult
            current_context.run(_inlineCallbacks, r, gen, status)
          File "/snap/maas/25212/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 1656, in _inlineCallbacks
            result = current_context.run(
          File "/snap/maas/25212/usr/lib/python3/dist-packages/twisted/python/failure.py", line 489, in throwExceptionIntoGenerator
            return g.throw(self.type, self.value, self.tb)
        --- <exception caught here> ---
          File "/snap/maas/25212/lib/python3.10/site-packages/provisioningserver/utils/services.py", line 1090, in do_action
            interfaces = yield maybeDeferred(self.getInterfaces)
          File "/snap/maas/25212/usr/lib/python3/dist-packages/twisted/python/threadpool.py", line 244, in inContext
            result = inContext.theWork()  # type: ignore[attr-defined]
          File "/snap/maas/25212/usr/lib/python3/dist-packages/twisted/python/threadpool.py", line 260, in <lambda>
            inContext.theWork = lambda: context.call(  # type: ignore[attr-defined]
          File "/snap/maas/25212/usr/lib/python3/dist-packages/twisted/python/context.py", line 117, in callWithContext
            return self.currentContext().callWithContext(ctx, func, *args, **kw)
          File "/snap/maas/25212/usr/lib/python3/dist-packages/twisted/python/context.py", line 82, in callWithContext
            return func(*args, **kw)
          File "/snap/maas/25212/lib/python3.10/site-packages/provisioningserver/utils/twisted.py", line 857, in callInContext
            return func(*args, **kwargs)
          File "/snap/maas/25212/lib/python3.10/site-packages/provisioningserver/utils/twisted.py", line 203, in wrapper
            result = func(*args, **kwargs)
          File "/snap/maas/25212/lib/python3.10/site-packages/provisioningserver/utils/network.py", line 1145, in get_all_interfaces_definition
            for name, ipaddr in get_ip_addr().items()
          File "/snap/maas/25212/lib/python3.10/site-packages/provisioningserver/utils/ipaddr.py", line 28, in get_ip_addr
            output = call_and_check(command)
          File "/snap/maas/25212/lib/python3.10/site-packages/provisioningserver/utils/shell.py", line 107, in call_and_check
            raise ExternalProcessError(process.returncode, command, output=stderr)
        provisioningserver.utils.shell.ExternalProcessError: Command `/snap/maas/25212/usr/share/maas/machine-resources/arm64` returned non-zero exit status 1:
        ERROR: Failed to retrieve network information: Failed to add device information for "/sys/class/net/wlan0/device": Failed to add port info: Failed to ETHTOOL_GLINK: operation not supported

/snap/maas/25212/usr/share/maas/machine-resources/arm64 from MAAS 3.3 is exiting due to wlan0 not supporting ETHTOOL_GLINK
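
To spot which interface trips the binary without reading the whole traceback, a small Python sketch like the one below can pull the interface name out of the error text. The regular expression is an assumption based solely on the message format shown above, and `failing_interfaces` is a hypothetical helper, not something shipped with MAAS.

```python
import re

# Matches the quoted sysfs path in the machine-resources error message,
# e.g. "/sys/class/net/wlan0/device", and captures the interface name.
_SYSFS_NET = re.compile(r'"/sys/class/net/([^/"]+)/device"')


def failing_interfaces(stderr):
    """Return the interface names mentioned in the error, without duplicates."""
    seen = []
    for name in _SYSFS_NET.findall(stderr):
        if name not in seen:
            seen.append(name)
    return seen


# Error text copied from the regiond log above.
stderr = (
    'ERROR: Failed to retrieve network information: Failed to add device '
    'information for "/sys/class/net/wlan0/device": Failed to add port info: '
    'Failed to ETHTOOL_GLINK: operation not supported'
)
print(failing_interfaces(stderr))  # prints ['wlan0']
```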

This has been fixed in machine-resources’ lxd dependencies: https://github.com/lxc/lxd/pull/11192

After I rebuilt machine-resources with the updated dependencies, /snap/maas/25212/usr/share/maas/machine-resources/arm64 now works as expected:

ubuntu@maas:/$ /snap/maas/25212/usr/share/maas/machine-resources/arm64
{
    "api_extensions": [
        "resources",
        "resources_cpu_socket",
        "resources_gpu",
        "resources_numa",
        "resources_v2",
        "resources_disk_sata",
        "resources_network_firmware",
        "resources_disk_id",
        "resources_usb_pci",
        "resources_cpu_threads_numa",
        "resources_cpu_core_die",
        "api_os",
        "resources_system",
        "resources_pci_iommu",
        "resources_network_usb",
        "resources_disk_address"
    ],
    "api_version": "1.0",
    "environment": {
        "kernel": "Linux",
        "kernel_architecture": "aarch64",
        "kernel_version": "5.15.0-1021-raspi",
        "os_name": "ubuntu",
        "os_version": "22.04",
        "server": "maas-machine-resources",
        "server_name": "maas",
        "server_version": "5.8"
    },
    "resources": {
(...)

For the record, here is the go.mod dependency change I used to rebuild the Go binaries:

-require github.com/lxc/lxd v0.0.0-20220801070811-efce00b764d8
+require github.com/lxc/lxd v0.0.0-20221205165740-3214f21cda7a

Can a MAAS developer update the dependencies in the upstream code?

Thanks!

@sparkiegeek sorry to ping you directly, would you take a look at this ^^

Hello @jeanfabrice

The fix will be backported and will be available in the upcoming 3.2.7 release.

Since the fix is not out yet (and I couldn’t figure out how to get the rebuilt Go binary into the squashfs or rebuild the snap), my alternative workaround was to remove the wlan0 device entirely, since I am not using it: https://sleeplessbeastie.eu/2022/06/01/how-to-disable-onboard-wifi-and-bluetooth-on-raspberry-pi-4/
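
After applying a workaround like this, one way to confirm the device is really gone is to look for it under sysfs, which is the same location the error message refers to. A minimal Python sketch, assuming a Linux host; the helper name `interface_present` is made up for illustration.

```python
import os


def interface_present(name, sysfs_root="/sys/class/net"):
    """Return True if a network interface directory exists under sysfs_root."""
    return os.path.isdir(os.path.join(sysfs_root, name))


# Expected to print False once the onboard Wi-Fi has been disabled.
print(interface_present("wlan0"))
```

The `sysfs_root` parameter exists only so the check can be pointed at a test directory; on a real system the default should be left alone.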

Hope that helps someone (and thanks for this fix!)

Hello @vpaprots

Just for reference:

3.2.7 RC1 is ready for tests.
deb: ppa:maas/3.2-next (1:3.2.7~rc1-12036-g.7971dd4e5-0ubuntu1~20.04.1)
snap: 3.2/candidate (3.2.7~rc1-12036-g.7971dd4e5)

We are looking forward to getting the release out soon.

3.2.7 is released. @jeanfabrice, can you please try it and see if it resolves your issue?

Hi @billwear, thank you for the notification!
Since I was running MAAS 3.3 rc1 via snap, I upgraded to 3.3/stable a few days ago. I don’t think I would gain anything by downgrading to test 3.2.7.
My understanding is that the fix has not been merged into 3.3 yet, so I applied the above workaround from @vpaprots (thanks, it really helped, because rebuilding and remounting the squashfs was not very practical).

okay, let’s keep this open until 3.3 picks up the change, shall we?

I think we can close it, since this topic was originally about 3.2, which looks to be fixed in 3.2.7.
I will revert the workaround and test the next 3.3.x release that includes the fix.
If there is any problem, I will open a new topic on Discourse.

Thanks a lot!

np! closed as “solved”. thanks!
