MAAS services restarting a lot

Hello guys!

I will first try to explain what’s happening and then proceed with the question.

I’m noticing that some of my deploys fail because the deploy gets stuck at certain steps, for example:

HTTP Request - /images/ubuntu/amd64/ga-22.04/jammy/stable/boot-initrd

or

Node installation - 'cloudinit' searching for network data from DataSourceMAAS

And I saw that this happens when some of the MAAS services restart.

Example of one Rack Controller status:

It’s worth noting that sometimes only the proxy service restarts, but almost every time it’s proxy, http, and syslog restarting together.

Then I dug into the MAAS code and saw that there are some Postgres triggers, for example:

subnet_sys_proxy_subnet_insert
subnet_sys_proxy_subnet_update
subnet_sys_proxy_subnet_delete

These triggers are responsible for restarting the proxy on the regions.

I searched to see if there is something similar for the Rack Controllers, but I haven’t found anything yet.
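In case it helps anyone doing the same investigation, the triggers can be listed straight from the Postgres catalog. This is a generic catalog query, not anything MAAS-specific, and the LIKE pattern is just an example; adjust it to whatever you are hunting for:

```sql
-- List non-internal triggers whose names mention "proxy",
-- together with the table each trigger is attached to
SELECT tgname, tgrelid::regclass AS table_name
FROM pg_trigger
WHERE NOT tgisinternal
  AND tgname LIKE '%proxy%'
ORDER BY tgname;
```

Swapping the pattern for '%rack%' or '%rpc%' is how I looked for rack-controller equivalents.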

Another thing I noticed is that the maasserver_regioncontrollerprocessendpoint table updates a lot, with its rows changing constantly. My guess is that the services are restarting every time MAAS renews its connections between Rack Controllers and regions (and consequently the table rows). Does that make sense?

Example of the table:

maasv2=# select * from maasserver_regioncontrollerprocessendpoint;
   id   |            created            |            updated            |    address     | port | process_id 
--------+-------------------------------+-------------------------------+----------------+------+------------
 316390 | 2024-01-10 13:48:20.719268+00 | 2024-01-10 13:48:20.719268+00 | IP_HERE        | 5251 |       8440
 316398 | 2024-01-10 13:48:40.321683+00 | 2024-01-10 13:48:40.321683+00 | IP_HERE        | 5252 |       8483
 316367 | 2024-01-10 13:36:29.920189+00 | 2024-01-10 13:36:29.920189+00 | IP_HERE        | 5250 |       8421
 316368 | 2024-01-10 13:36:30.771181+00 | 2024-01-10 13:36:30.771181+00 | IP_HERE        | 5251 |       8425
 316369 | 2024-01-10 13:36:31.735024+00 | 2024-01-10 13:36:31.735024+00 | IP_HERE        | 5252 |       8428
 316370 | 2024-01-10 13:36:32.890751+00 | 2024-01-10 13:36:32.890751+00 | IP_HERE        | 5253 |       8430
 316384 | 2024-01-10 13:48:09.747454+00 | 2024-01-10 13:48:09.747454+00 | IP_HERE        | 5250 |       8431
 316387 | 2024-01-10 13:48:17.713554+00 | 2024-01-10 13:48:17.713554+00 | IP_HERE        | 5253 |       8438
 316391 | 2024-01-10 13:48:27.187671+00 | 2024-01-10 13:48:27.187671+00 | IP_HERE        | 5252 |       8435
 316400 | 2024-01-10 13:50:05.606275+00 | 2024-01-10 13:50:05.606275+00 | IP_HERE        | 5250 |       8488
 316401 | 2024-01-10 13:50:07.736542+00 | 2024-01-10 13:50:07.736542+00 | IP_HERE        | 5251 |       8491
 316402 | 2024-01-10 13:50:09.913182+00 | 2024-01-10 13:50:09.913182+00 | IP_HERE        | 5253 |       8493
(12 rows)
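To put a number on how fast this table churns, I’m counting new rows per minute with a query like this (plain SQL against the MAAS database, using the column names visible in the output above):

```sql
-- How many endpoint rows were created in each minute of the last hour;
-- a stable setup should show almost no new rows
SELECT date_trunc('minute', created) AS minute,
       count(*) AS new_rows
FROM maasserver_regioncontrollerprocessendpoint
WHERE created > now() - interval '1 hour'
GROUP BY 1
ORDER BY 1;
```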

Another question: what triggers MAAS to renew these connections between RCs <-> regions (and update the table all the time)? Why isn’t it stable? Is there something I can do to avoid that behavior?

My setup has 3 regions running with HA and 19 RCs.

Looking forward to your help!

Thanks!

Every region has multiple processes running: one of them is called the master, and the others are called workers.
Every worker exposes an RPC endpoint on every interface of the region. This table keeps track of that information.
In short, this table is NOT keeping track of the RPC connections between the racks and the regions. If you are looking for that information, you should look at maasserver_regionrackrpcconnection.
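To make the relationship concrete, the connection table references the endpoint table via endpoint_id, so a join like this shows which worker endpoint each rack is connected to (a sketch based on the column names visible in this thread; verify against your schema):

```sql
-- Map each rack controller to the region worker endpoints it is connected to
SELECT c.rack_controller_id,
       e.address,
       e.port,
       e.process_id,
       c.updated AS last_seen
FROM maasserver_regionrackrpcconnection c
JOIN maasserver_regioncontrollerprocessendpoint e
  ON e.id = c.endpoint_id
ORDER BY c.rack_controller_id, e.address, e.port;
```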

Thanks for answering!

I saw that MAAS has 3 related tables: maasserver_regionrackrpcconnection, maasserver_regioncontrollerprocessendpoint, and maasserver_regioncontrollerprocess.

The maasserver_regionrackrpcconnection table also churns a lot. What I want to understand is what causes a rack to lose its connection to a worker/region.

What I can see in maasserver_regionrackrpcconnection is that every RC is connected to each worker (I have 4 workers in each region) belonging to a region. So every RC always has 12 entries in the table (because I have 3 regions, so 3*4=12). And as far as I can see, these connections change constantly.

Example of entries for one of my RC:

    id    |            created            |            updated            | endpoint_id | rack_controller_id
----------+-------------------------------+-------------------------------+-------------+--------------------
 13662235 | 2024-01-10 17:58:20.850081+00 | 2024-01-10 17:58:35.685361+00 |      316982 |               4453
 13662259 | 2024-01-10 17:58:25.499041+00 | 2024-01-10 17:58:35.786276+00 |      316984 |               4453
 13662007 | 2024-01-10 17:56:26.381569+00 | 2024-01-10 18:00:45.562622+00 |      316969 |               4453
 13662036 | 2024-01-10 17:56:28.019374+00 | 2024-01-10 17:59:45.517972+00 |      316970 |               4453
 13662062 | 2024-01-10 17:56:29.969844+00 | 2024-01-10 18:01:45.913468+00 |      316973 |               4453
 13662262 | 2024-01-10 17:58:27.053132+00 | 2024-01-10 18:02:15.552838+00 |      316985 |               4453
 13662369 | 2024-01-10 17:59:32.199332+00 | 2024-01-10 17:59:32.199332+00 |      316987 |               4453
 13662360 | 2024-01-10 17:59:29.857787+00 | 2024-01-10 17:59:29.857787+00 |      316986 |               4453
 13662125 | 2024-01-10 17:56:45.6137+00   | 2024-01-10 17:56:46.151874+00 |      316939 |               4453
 13662040 | 2024-01-10 17:56:28.619434+00 | 2024-01-10 17:58:45.893859+00 |      316972 |               4453
 13661979 | 2024-01-10 17:56:15.363964+00 | 2024-01-10 17:56:15.419896+00 |      316938 |               4453
 13662242 | 2024-01-10 17:58:22.886796+00 | 2024-01-10 17:58:35.585631+00 |      316983 |               4453

Is there something I can do to discover what causes my RCs to disconnect from these workers and create new connections? Is there a way to make these connections more stable?

Do you have any clue whether this is what’s causing my RC services to restart so often?

Thanks!

Unfortunately this is a known issue: Bug #1998615 “Rack controller status flapping when “ClusterClien...” : Bugs : MAAS. Do you see any performance degradation because of it?

I am running 3.3.4 and see the same behavior that some people reported in the thread you sent.

The degradation I am having is related to the proxy, http, and syslog services restarting a lot (I guess proxy and http are the main ones causing my errors). Because the services restart so often, a deploy can fail.

There is also something really strange happening. When I look at the maasserver_service table, it seems that the services are stable (example below):

  id   |            created            |            updated            |    name     |  status  |             status_info              | node_id
-------+-------------------------------+-------------------------------+-------------+----------+--------------------------------------+---------
 41046 | 2024-01-05 03:07:30.346937+00 | 2024-01-10 07:19:25.310939+00 | dhcpd6      | off      |                                      |    5181
 41049 | 2024-01-05 03:07:30.350698+00 | 2024-01-05 19:02:55.410347+00 | tftp        | running  |                                      |    5181
 41044 | 2024-01-05 03:07:30.344166+00 | 2024-01-05 19:02:55.418061+00 | http        | running  |                                      |    5181
 41042 | 2024-01-05 03:07:30.341308+00 | 2024-01-05 19:02:55.43428+00  | ntp_rack    | running  |                                      |    5181
 41048 | 2024-01-05 03:07:30.349405+00 | 2024-01-05 19:02:55.442678+00 | dns_rack    | running  |                                      |    5181
 41050 | 2024-01-05 03:07:30.352288+00 | 2024-01-05 19:02:55.448266+00 | proxy_rack  | running  |                                      |    5181
 41045 | 2024-01-05 03:07:30.345514+00 | 2024-01-05 19:02:55.454553+00 | syslog_rack | running  |                                      |    5181
 41043 | 2024-01-05 03:07:30.342922+00 | 2024-01-10 18:38:55.393367+00 | rackd       | degraded | 92% connected to region controllers. |    5181
 41047 | 2024-01-05 03:07:30.348133+00 | 2024-01-10 07:19:25.309151+00 | dhcpd       | running  |                                      |    5181

But if I run maas status on my RC, we can see that the services were restarted (data below was taken just now).

bind9                            RUNNING   pid 1590218, uptime 5 days, 15:35:43
dhcpd                            RUNNING   pid 1595577, uptime 5 days, 15:05:36
dhcpd6                           STOPPED   Not started
http                             RUNNING   pid 3203141, uptime 0:28:18
ntp                              RUNNING   pid 1590204, uptime 5 days, 15:35:43
proxy                            RUNNING   pid 3204343, uptime 0:22:47
rackd                            RUNNING   pid 1589781, uptime 5 days, 15:36:19
syslog                           RUNNING   pid 3203166, uptime 0:28:18
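To correlate the restarts with timestamps, I’m now logging the supervised PIDs periodically. A quick sketch (assumes the maas status output shown above; restarts show up as PID changes and uptime resets in the log):

```shell
# Record proxy/http/syslog PIDs and uptimes every minute;
# a restart shows up as a new PID with a small uptime
while true; do
    date -u +'%Y-%m-%dT%H:%M:%SZ'
    sudo maas status | grep -E 'proxy|http|syslog'
    sleep 60
done >> /tmp/maas-service-watch.log
```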

I can’t think of anything specific off the top of my head with this limited amount of information.
Your environment seems to be a complex production setup with 3 regions in HA and 19 racks. The “problem” behind this behavior could be literally anything, and I suspect it’s unfortunately impossible to tackle at the community level.

I’d say you can open a new bug with as much information as you can gather. For example, you should quantify all the statements like

Because the service restarts a lot

How frequently? Does this happen on all the regions? And so on…

You can also upload an sos report, which could help triage this issue.

How frequently? Does this happen on all the regions? And so on…

There is no pattern. Sometimes an RC can keep the proxy, http, and syslog services up for 1 hour or more, but then the same RC restarts the services with no pattern at all (for example, it restarts, restarts again 2 minutes later, then runs for about 10 minutes and restarts again). This is happening on all Rack Controllers.

I will continue to investigate the code and the logs to see what I can gather, because this behavior is really strange. I will keep this thread updated when I find something new.

Thanks for your attention!