3.4.4 rackd controllers not registering after upgrade revert

Our rackd controllers are not registering with regiond. We took snapshots and backups of a mostly-working 3.4.4 system and upgraded to 3.4.7. We had problems with that and decided to revert for now, using our usual method: restore the database backup, revert the rackd machines to their snapshots, and power on the cloned copy of regiond. This time it hasn’t worked.

As in Rack controller not connected after upgrade to 3.2 (region endpoints not exposed), we see no endpoints:

root@maas-rackd-01:~# curl -L https://maas.pawsey.org.au/MAAS/rpc/; echo
{"eventloops": {}}

Using that post as a reference, we found public.maasserver_regioncontrollerprocessendpoint to be empty. We re-inserted 4 records from the backup (modified to use the current process ids), and then things started to work.
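For the record, the restore looked roughly like this (a sketch only; <current_process_id> stands for an id from your live maasserver_regioncontrollerprocess table, not the backup’s — the column list matches the INSERTs in the debug logs further down):

maasdb=# -- find the current regiond process ids
maasdb=# SELECT id FROM maasserver_regioncontrollerprocess;
maasdb=# INSERT INTO maasserver_regioncontrollerprocessendpoint
maasdb-#   (created, updated, process_id, address, port)
maasdb-#   VALUES (now(), now(), <current_process_id>, '$regiondIP'::inet, 5250);
maasdb=# -- repeated for ports 5251-5253 against the matching process ids

After that: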

root@maas-nimbus-rackd:~# curl -L https://maas.pawsey.org.au/MAAS/rpc
{"eventloops": {"maas:pid=15870": [["$regiondIP", 5252]], "maas:pid=15871": [["$regiondIP", 5251]], "maas:pid=15872": [["$regiondIP, 5253]], "maas:pid=15873": [["$regiondIP", 5250]]}}

The controllers page in the web UI started to look better, for nearly a minute. Then the controllers dropped back to all dead, the rpc curl returned empty again, and the database entries were gone. We enabled debug, and sure enough regiond is deleting the entries. So we created them with current timestamps, on the understanding that they are refreshed every 60 seconds. No luck. The debug logs also show attempts to create the entries, but we see no evidence of that in the DB.
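For reference, this is roughly how we turned on the query logging (deb install; to our understanding regiond.conf accepts debug and debug_queries keys, so treat this as a sketch and check before copying):

$ echo "debug: true" | sudo tee -a /etc/maas/regiond.conf
$ echo "debug_queries: true" | sudo tee -a /etc/maas/regiond.conf
$ sudo systemctl restart maas-regiond

With that enabled, the insert attempts look like this: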

2025-05-06 15:22:14 django.db.backends: [debug] (0.000) INSERT INTO "maasserver_regioncontrollerprocessendpoint" ("created", "updated", "process_id", "address", "port") VALUES ('2025-05-06T15:22:13.999466'::timestamp, '2025-05-06T15:22:13.999466'::timestamp, 6819, '$regiondIP'::inet, 5253) RETURNING "maasserver_regioncontrollerprocessendpoint"."id"; args=(datetime.datetime(2025, 5, 6, 15, 22, 13, 999466), datetime.datetime(2025, 5, 6, 15, 22, 13, 999466), 6819, Inet('$regiondIP'), 5253)

2025-05-06 15:22:14 django.db.backends: [debug] (0.061) INSERT INTO "maasserver_regioncontrollerprocessendpoint" ("created", "updated", "process_id", "address", "port") VALUES ('2025-05-06T15:22:14.021752'::timestamp, '2025-05-06T15:22:14.021752'::timestamp, 6819, '$regiondIP'::inet, 5253) RETURNING "maasserver_regioncontrollerprocessendpoint"."id"; args=(datetime.datetime(2025, 5, 6, 15, 22, 14, 21752), datetime.datetime(2025, 5, 6, 15, 22, 14, 21752), 6819, Inet('$regiondIP'), 5253)

I can paste that command into a psql window and it creates, but then gets deleted again.

CPU is also pegged at 100% on regiond, and over 1000 events are sitting in the queue; they look like attempted machine status updates, presumably failing because of the controller issues.

$ maas $USER events query limit=1000 | jq '.events | length'
1000
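A rough breakdown of what’s queuing up, assuming each event object carries a type field (adjust the jq path to the actual schema):

$ maas $USER events query limit=1000 | jq -r '.events[].type' | sort | uniq -c | sort -rn | head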

Please help!

Your environment was probably full of orphans that are being cleaned up at restart. Be patient; once regiond is up and running, the racks will be able to register again.

It would be good to also report what was not working :wink:

It’s reassuring to think it might be doing some maintenance or self-healing, but I’m not sure about the orphans theory, though my ignorance of what to expect is a factor. Since I enabled debug_queries on regiond a few hours ago, most of what I can see in the logs looks like repetition. For example, this query appeared 76k times:

SELECT "maasserver_node"."id", "maasserver_node"."created", "maasserver_node"."updated", "maasserver_node"."system_id", "maasserver_node"."hardware_uuid", "maasserver_node"."hostname", "maasserver_node"."description", "maasserver_node"."pool_id", "maasserver_node"."domain_id", "maasserver_node"."address_ttl", "maasserver_node"."status", "maasserver_node"."previous_status", "maasserver_node"."status_expires", "maasserver_node"."owner_id", "maasserver_node"."bios_boot_method", "maasserver_node"."osystem", "maasserver_node"."distro_series", "maasserver_node"."architecture", "maasserver_node"."min_hwe_kernel", "maasserver_node"."hwe_kernel", "maasserver_node"."node_type", "maasserver_node"."parent_id", "maasserver_node"."agent_name", "maasserver_node"."error_description", "maasserver_node"."zone_id", "maasserver_node"."cpu_count", "maasserver_node"."cpu_speed", "maasserver_node"."memory", "maasserver_node"."swap_size", "maasserver_node"."bmc_id", "maasserver_node"."instance_power_parameters", "maasserver_node"."power_state", "maasserver_node"."power_state_queried", "maasserver_node"."power_state_updated", "maasserver_node"."last_image_sync", "maasserver_node"."error", "maasserver_node"."netboot", "maasserver_node"."ephemeral_deploy", "maasserver_node"."license_key", "maasserver_node"."dynamic", "maasserver_node"."boot_interface_id", "maasserver_node"."boot_cluster_ip", "maasserver_node"."boot_disk_id", "maasserver_node"."gateway_link_ipv4_id", "maasserver_node"."gateway_link_ipv6_id", "maasserver_node"."default_user", "maasserver_node"."install_rackd", "maasserver_node"."install_kvm", "maasserver_node"."register_vmhost", "maasserver_node"."enable_ssh", "maasserver_node"."skip_networking", "maasserver_node"."skip_storage", "maasserver_node"."url", "maasserver_node"."dns_process_id", "maasserver_node"."managing_process_id", "maasserver_node"."current_commissioning_script_set_id", "maasserver_node"."current_installation_script_set_id", "maasserver_node"."current_testing_script_set_id", "maasserver_node"."locked", "maasserver_node"."last_applied_storage_layout", "maasserver_node"."current_config_id", "maasserver_node"."enable_hw_sync", "maasserver_node"."sync_interval", "maasserver_node"."last_sync" FROM "maasserver_node" WHERE "maasserver_node"."node_type" IN

and a slightly shorter variant of it appeared 150k times:

SELECT "maasserver_node"."id", "maasserver_node"."created", "maasserver_node"."updated", "maasserver_node"."system_id", "maasserver_node"."hardware_uuid", "maasserver_node"."hostname", "maasserver_node"."description", "maasserver_node"."pool_id", "maasserver_node"."domain_id", "maasserver_node"."address_ttl", "maasserver_node"."status", "maasserver_node"."previous_status", "maasserver_node"."status_expires", "maasserver_node"."owner_id", "maasserver_node"."bios_boot_method", "maasserver_node"."osystem", "maasserver_node"."distro_series", "maasserver_node"."architecture", "maasserver_node"."min_hwe_kernel", "maasserver_node"."hwe_kernel", "maasserver_node"."node_type", "maasserver_node"."parent_id", "maasserver_node"."agent_name", "maasserver_node"."error_description", "maasserver_node"."zone_id", "maasserver_node"."cpu_count", "maasserver_node"."cpu_speed", "maasserver_node"."memory", "maasserver_node"."swap_size", "maasserver_node"."bmc_id", "maasserver_node"."instance_power_parameters", "maasserver_node"."power_state", "maasserver_node"."power_state_queried", "maasserver_node"."power_state_updated", "maasserver_node"."last_image_sync", "maasserver_node"."error", "maasserver_node"."netboot", "maasserver_node"."ephemeral_deploy", "maasserver_node"."license_key", "maasserver_node"."dynamic", "maasserver_node"."boot_interface_id", "maasserver_node"."boot_cluster_ip", "maasserver_node"."boot_disk_id", "maasserver_node"."gateway_link_ipv4_id", "maasserver_node"."gateway_link_ipv6_id", "maasserver_node"."default_user", "maasserver_node"."install_rackd", "maasserver_node"."install_kvm", "maasserver_node"."register_vmhost", "maasserver_node"."enable_ssh", "maasserver_node"."skip_networking", "maasserver_node"."skip_storage", "maasserver_node"."url", "maasserver_node"."dns_process_id", "maasserver_node"."managing_process_id", "maasserver_node"."current_commissioning_script_set_id", "maasserver_node"."current_installation_script_set_id", "maasserver_node"."current_testing_script_set_id", "maasserver_node"."locked", "maasserver_node"."last_applied_storage_layout", "maasserver_node"."current_config_id", "maasserver_node"."enable_hw_sync", "maasserver_node"."sync_interval", "maasserver_node"."last_sync" FROM "maasserver_node" WHERE

There are 50k UPDATE and 3k INSERT queries too, and only 300 DELETEs. Maybe this is all normal, as you say, but it’s been going on for some hours now, quite unlike another recent revert to snapshots and backups.
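For reference, the tallies came from counting log lines with something like this (assuming the default deb log at /var/log/maas/regiond.log, matching the django.db.backends prefix seen in the snippets above):

$ for op in SELECT UPDATE INSERT DELETE; do
>   printf '%7s: ' "$op"
>   grep 'django.db.backends' /var/log/maas/regiond.log | grep -c ") $op "
> done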

And yes, when we’re ready for another go at upgrading, be that to 3.4.7 or 3.5.x, we will certainly check in here first :)

Mind running

select subnet_id, count(*) from maasserver_staticipaddress group by subnet_id;

and

select domain_id, count(*) from maasserver_dnsresource group by domain_id;

?

With pleasure

maasdb=# select subnet_id, count(*) from maasserver_staticipaddress group by subnet_id;
 subnet_id | count 
-----------+-------
         1 |  1746
        12 |    44
        14 |   124
        15 |   164
        16 |   116
        22 |    15
        25 |    19
        26 |    19
        28 |     7
        29 |    17
        30 |    17
        31 |    14
        63 |   588
        72 |     4
        73 |     3
        74 |   163
        75 |   156
        76 |    20
        77 |    24
        78 |    55
        79 |    30
        80 |     4
        81 |     3
        82 |    71
        83 |    67
        84 |    16
        85 |    23
        86 |    54
        87 |    26
        88 |    17
        89 |    17
        90 |    29
        91 |    29
        92 |     6
        93 |     6
        94 |     6
        95 |     6
       106 |     2
       108 |    76
           |   925
(40 rows)
maasdb=# select domain_id, count(*) from maasserver_dnsresource group by domain_id;
 domain_id | count 
-----------+-------
         1 |     1
         0 |     9
(2 rows)

Alright, then it’s not the automatic cleanup that is burning all your resources. I’d suggest collecting the regiond and rackd logs and inspecting them. How many regions and racks do you have, btw?

  • 1 regiond, 2 rackd, comprising:
    • 1 original region+rack from 2017, with regiond disabled a few years ago
    • 1 new regiond created at that time, as the only active regiond
    • 1 new rackd created at that time (‘rackd 1’)

We’ve been looking at the logs but aren’t sure how to dig deeper; finding the database behaviour I opened with was already a bit of a dive. We’ll look at it with fresh eyes tomorrow, though.

https://drive.google.com/drive/folders/1Q2Tetm-8FEA9B_MOEn3U5iZVCAviNQ4U?usp=drive_link has rackd 1 logs. The regiond logs are big, so I’ll upload them from a faster connection tomorrow, or maybe trim them first.

It seems this amd64 binary is what’s thrashing the CPU at the moment:

mcollins1@maas:~$ ps -eo pid,ppid,user,%cpu,%mem,cmd --sort=-%cpu | head -n 15
    PID    PPID USER     %CPU %MEM CMD
  80782   80780 root     76.7  0.0 /usr/share/maas/machine-resources/amd64
  80330   80328 root     76.6  0.0 /usr/share/maas/machine-resources/amd64
  80528   80527 root     76.6  0.0 /usr/share/maas/machine-resources/amd64
  80685   80683 root     76.6  0.0 /usr/share/maas/machine-resources/amd64
  80646   80644 root     76.5  0.0 /usr/share/maas/machine-resources/amd64
  80325   80302 maas      3.7  7.8 /usr/bin/python3 /usr/sbin/regiond
  80322   80302 maas      3.5  7.3 /usr/bin/python3 /usr/sbin/regiond
  80321   80302 maas      3.3  8.2 /usr/bin/python3 /usr/sbin/regiond
  80323   80302 maas      3.1  7.7 /usr/bin/python3 /usr/sbin/regiond
  80302   80301 maas      1.1  0.8 /usr/bin/python3 /usr/sbin/regiond
    247       1 root      0.3  0.8 /lib/systemd/systemd-journald
    531       1 bind      0.3  0.2 /usr/sbin/named -u bind
  80356     525 nobody    0.2  0.0 nginx: worker process

When run on the regiond host, this binary seems to just loop over something, thrashing the CPU; it produces no output:

mcollins1@maas:~$ sudo /usr/share/maas/machine-resources/amd64
[sudo] password for mcollins1:

This was the smoking gun. We replaced that binary with the one from a rackd machine, and after restarting maas-regiond there was no CPU thrashing. The controllers now show as registered, like before.
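For anyone hitting the same thing, the swap was roughly this (our hostnames and the .IS-4355 backup suffix; adjust to taste):

$ sudo mv /usr/share/maas/machine-resources/amd64 /usr/share/maas/machine-resources/amd64.IS-4355
$ sudo scp root@maas-rackd-01:/usr/share/maas/machine-resources/amd64 /usr/share/maas/machine-resources/
$ sudo systemctl restart maas-regiond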


That’s very unexpected behaviour. I am curious: do you have LXD installed on that machine? If yes, can you please run lxc query /1.0/resources --debug (because that’s basically what the amd64 binary does)?

Do you still have the faulty binary? Can you check the sha256sum or shasum of the binaries?

Hi @troyanov

No, LXD is not installed on the machine. Here are the sha256sums you requested:

mcollins1@maas:~$ sudo sha256sum /usr/share/maas/machine-resources/amd64
ee16aab8178a9eba0f6ca412bf7e1b9e94c86c5edafbf18e3338fea0dc891ef8  /usr/share/maas/machine-resources/amd64
mcollins1@maas:~$ sudo sha256sum /usr/share/maas/machine-resources/amd64.IS-4355
4e6c160c9b2405df1086ac901ea272d102ad0cb30aa25158d1545d1568f9a942  /usr/share/maas/machine-resources/amd64.IS-4355

My suspicion is that it was corrupted when the machine was cloned on Proxmox; either that, or perhaps a flipped bit from a cosmic ray! Who knows. :person_shrugging:
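In hindsight, one way to check a suspect file against what the package shipped (deb install; dpkg keeps md5sums for the files it installs) would be:

$ dpkg -S /usr/share/maas/machine-resources/amd64
$ sudo dpkg --verify $(dpkg -S /usr/share/maas/machine-resources/amd64 | cut -d: -f1)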

@michael5collins, mind running sudo strace -p <the pid of the process> if you still have it?

That way we can understand where it is looping.

Sure, just had a go at this and the output is completely blank:

mcollins1@maas:~$ sudo strace -p 102944
strace: Process 102944 attached
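A silent strace fits a pure userspace spin, since it makes no syscalls; a sampling profiler would show where it’s looping, e.g. (assuming perf is installed for this kernel):

$ sudo perf top -p 102944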

We have more evidence that the storage on our Proxmox is playing up, so I don’t think this is a MAAS issue tbh.

@michael5collins thanks for the information.

One of the SHAs you’ve shared is not known to me, so your suspicion about Proxmox is absolutely valid. However, just to be more confident, may I ask you to upload the faulty binary somewhere so we can inspect it further?

Done. It’s at the Google Drive link above.
