Regiond Memory Leak and Database Locks in MaaS 3.4.0

Hello MaaS community,

I’m experiencing some serious issues with my MaaS setup and could use some help. Here’s a breakdown of my environment and the problems I’m facing:

Environment:

  • MaaS version: 3.4.0 (snap installation)
  • 2 MaaS instances on dedicated hosts
  • Netbox instance triggering frequent MaaS API calls
  • Ruby APIs interacting directly with MaaS API
  • Dedicated PostgreSQL database host

Hardware Specs:

  • Database host: 126GB RAM, Intel Xeon Silver 4210 CPU @ 2.20GHz
  • Region instances: 32GB RAM, Intel Xeon E-2286G CPU @ 4.00GHz

Issues:

  1. Database Locks: Queries freeze and end up holding numerous locks.
  2. Regiond Memory Consumption: regiond child processes are spawned frequently and consume steadily increasing amounts of memory, which looks like a memory leak.
  3. Performance Impact: These issues are severely affecting system performance and stability.

Observations:

  • regiond.conf is set to use only 2 workers.
  • supervisord is spawning 2 workers as expected.
  • Each worker spawns many regiond child processes.
  • Child processes consume ~1.2GB virtual memory and ~500MB physical memory each.
  • Parent regiond processes reach over 20GB of both virtual and physical memory.
  • Memory consumption increases constantly without decreasing.
  • Restarting the MaaS snap temporarily resolves the issues.
  • Both MaaS regions seem to run queries simultaneously on the same database table, causing locks.

Troubleshooting Steps Taken:

  1. Checked regiond logs for unusual behavior.
  2. Disabled the /metrics export to rule out constant export bursts (as per “MAAS regiond allocates an increasing amount of memory”).
  3. Monitored query timestamps and database activity (a rough sketch of the query is below).
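A rough sketch of the kind of query used for step 3, run on the PostgreSQL host (the database name maasdb is an assumption; adjust it to your setup):

# Non-idle queries with their start time and current duration.
sudo -u postgres psql maasdb -c "
SELECT pid, state, query_start, now() - query_start AS duration, left(query, 120) AS query
FROM pg_stat_activity
WHERE state <> 'idle'
ORDER BY duration DESC;"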

Evidence:

  1. Memory behavior before and after restarts (screenshot).

  2. Database locks (count and duration) and regiond parent/child process spawns: https://imgur.com/a/v1KOM7s (I can’t upload more than one piece of media)

Questions:

  1. Has anyone encountered similar memory leak issues with regiond in MaaS 3.4.0?
  2. Are there known issues with database locking when multiple regions access the same table simultaneously?
  3. What additional debugging steps or configuration changes would you recommend?
  4. Are there any best practices for optimizing MaaS performance in a setup like mine?

Any insights, suggestions, or potential solutions would be greatly appreciated. I’m happy to provide any additional information that might be helpful in diagnosing this issue.

Thank you in advance for your help!

Hi @oitgg ,

We get reports about memory leaks from time to time, but so far I’ve never been able to reproduce them locally. I was investigating this issue again just three days ago, still with no luck.

  • regiond.conf is set to use only 2 workers.

The expected behaviour is that you should see 1 “master” process and 2 “child” processes. Hence you should have 3 regiond processes.

  • Both MaaS regions seem to run queries simultaneously on the same database table, causing locks.

This is expected. Many API calls are locking the DB for different reasons (for example, when you acquire a machine).

Is there any chance you could share your DB? If not, it would be useful to get a full sos report and all the MAAS logs. If you can’t share them because you can’t redact the confidential information inside, I’d suggest trying to correlate when a new process is spawned (use ps aux | grep regiond to extract when they are spawned) with what was actually happening on your system and in MAAS (check the logs).
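For example, a loop along these lines (interval and output file are arbitrary) would give you timestamped snapshots of the regiond processes to line up with the MAAS logs:

# Record start time, memory and command line of every regiond process once a minute.
while true; do
    date --iso-8601=seconds
    ps axo pid,ppid,lstart,vsz,rss,args | grep '[r]egiond'
    sleep 60
done >> /tmp/regiond-watch.log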

Hey @r00ta!

Thanks for the reply. Were you able to see the imgur link I provided? It has the screenshots of the queries’ behavior and the regiond processes. Can you identify any potential issue from what you see? Is it normal for regiond to spawn this many child processes?

I’m afraid I can’t share my database with you because of its size (200GB+) and the amount of confidential data it contains. I’ve never used sos before; could you guide me through it?

Also, how can I share the logs here? The forum only lets me upload images.

We also suspect that the child process spawn rate increases whenever a connection is lost between the regions and the racks (reported in this bug), but we’re not sure: looking through the logs, the timestamps of these disconnections sometimes coincide with the child process spawns.

Nothing is wrong with your processes: you have 3. You might be confused by the green lines, which are threads, not processes. You can double-check that you have only 1 regiond spawning 2 regiond child processes with pstree --hide-threads.
Of course, the main regiond process taking 20GB+ is not right, and we have to understand what’s going on there.
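For example (the -p flag and the pgrep filter are just additions to make the tree easier to read; pgrep -of picks the oldest process whose command line matches regiond, i.e. the master):

# Expect one master regiond with exactly two regiond children.
pstree --hide-threads -p $(pgrep -of regiond)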

To begin with, you can zip and share the content of /var/snap/maas/common/log/. Again, remember to redact any confidential info from there.
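For example (the archive name is arbitrary):

# Archive the region logs; redact hostnames, IPs and keys before uploading.
sudo tar czf maas-logs-$(date +%F).tar.gz -C /var/snap/maas/common/log .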

Hey @r00ta, sorry for the delay, busy day.

Thanks for the answer and the explanations. I’ve gathered the maas, regiond and supervisor-run logs: https://drive.google.com/file/d/169tK6n7RVRnS5jyh37X2FyChwfzj7-jh/view?usp=sharing (I didn’t see anything useful in the chrony, named, nginx, proxy and rsyslog logs)

We’ve also noticed that the following happens with the racks all the time; is it normal?

2024-10-10 17:46:00 maasserver.ipc: [info] Worker pid:1962318 lost RPC connection to ('kenep3', 'xxx.1.xxx.7', 5250).
2024-10-10 17:46:13 maasserver.ipc: [info] Worker pid:1962319 registered RPC connection to ('kenep3', 'xxx.1.xxx.7', 5251).
2024-10-10 17:46:17 maasserver.ipc: [info] Worker pid:1962319 registered RPC connection to ('kenep3', 'xxx.1.xxx.7', 5251).
2024-10-10 17:46:25 maasserver.ipc: [info] Worker pid:1962319 lost RPC connection to ('kenep3', 'xxx.1.xxx.7', 5251).
2024-10-10 17:46:36 maasserver.ipc: [info] Worker pid:1962318 registered RPC connection to ('kenep3', 'xxx.1.xxx.7', 5250).
2024-10-10 17:46:36 maasserver.ipc: [info] Worker pid:1962318 registered RPC connection to ('kenep3', 'xxx.1.xxx.7', 5250).
2024-10-10 17:49:15 maasserver.ipc: [info] Worker pid:1962319 lost RPC connection to ('kenep3', 'xxx.1.xxx.7', 5251).
2024-10-10 17:49:15 maasserver.ipc: [info] Worker pid:1962319 lost RPC connection to ('kenep3', 'xxx.1.xxx.7', 5251).
2024-10-10 17:49:40 maasserver.ipc: [info] Worker pid:1962319 registered RPC connection to ('kenep3', 'xxx.1.xxx.7', 5251).
2024-10-10 17:49:40 maasserver.ipc: [info] Worker pid:1962319 registered RPC connection to ('kenep3', 'xxx.1.xxx.7', 5251).
2024-10-10 17:49:40 maasserver.ipc: [info] Worker pid:1962319 registered RPC connection to ('kenep3', 'xxx.1.xxx.7', 5251).
2024-10-10 17:49:42 maasserver.dhcp: [info] Successfully configured DHCPv4 on rack controller 'prod-web-maas-rack-controller-lon (kenep3)'.
2024-10-10 17:49:42 maasserver.dhcp: [info] Successfully configured DHCPv6 on rack controller 'prod-web-maas-rack-controller-lon (kenep3)'.
2024-10-10 17:49:45 maasserver.ipc: [info] Worker pid:1962318 lost RPC connection to ('kenep3', 'xxx.1.xxx.7', 5250).
2024-10-10 17:49:46 maasserver.ipc: [info] Worker pid:1962318 lost RPC connection to ('kenep3', 'xxx.1.xxx.7', 5250).

By the way, to give you a better overview of our stack, we have 21 racks, 2 regions and 1 database. There are 5569 machines managed by our MaaS stack.

I see you restarted the region at 2024-10-10 17:08:57. Have you tracked the memory increase from 2024-10-10 17:08:57 to 2024-10-10 17:35:33? Ideally we need to correlate what was happening in MAAS whenever there was a bump in memory usage.
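Something as simple as a periodic RSS sample would be enough to spot the bumps (interval and file name are arbitrary):

# Total resident memory of all regiond processes, sampled every minute.
while true; do
    printf '%s ' "$(date --iso-8601=seconds)"
    ps axo rss,args | grep '[r]egiond' | awk '{sum += $1} END {print sum " kB"}'
    sleep 60
done >> regiond-rss.log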

Hey @r00ta, yes, we had to restart it because the host’s free memory was dropping too fast and the VM could have died if we hadn’t restarted the instance.

This was the memory behavior before and after the restart, and the CPU usage by process:
These were the queries running and their durations:

We’ve noticed that these UPDATE queries cause locks all the time, and they are always present when these memory drains happen, but I really don’t know whether they’re causing this issue.

Hey @r00ta, how are you doing?

We’ve noticed that even in our newest MaaS instances (with a new database for them), there are a lot of database locks caused by those same UPDATE and INSERT queries on the region/rack controller tables.

Whenever the locks occur, the actions don’t work properly, and I get the following error:
{'error_code': 9, 'error_reason': 'GENERIC_MAAS_REQUEST_FAILED_EXCEPTION', 'error_msg': 'op=commission failed - Code 503 - Unable to connect to any rack controller dcktrk; no connections available.'}

As I said before, our stack has 21 racks, 2 regions and 1 database, with 5569 machines managed by MaaS. I don’t know whether the number of resources managed by the stack could be overwhelming MaaS’s internal architecture.

Could you clarify what kind of DB locks you are observing? Usually the only DB lock that can affect performance is held when you acquire a machine (also, when you deploy a machine, the acquire step is part of the deployment, so it holds the lock as well).

As the error message suggests, no RPC connection to your dcktrk rack was available, so MAAS could not perform some operations. Do you have all the racks inside the same DC, or are some of them in remote locations?

Hey @r00ta!

All of the following queries were identified as holding locks:

They happen whenever there’s a record update or delete in the regioncontrollerprocess and regionrackrpcconnection tables.
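For context, a query along the following lines (run on the database host; the database name maasdb is an assumption) pairs each blocked query with the query holding the lock:

sudo -u postgres psql maasdb -c "
SELECT blocked.pid              AS blocked_pid,
       left(blocked.query, 80)  AS blocked_query,
       blocking.pid             AS blocking_pid,
       left(blocking.query, 80) AS blocking_query
FROM pg_stat_activity AS blocked
JOIN LATERAL unnest(pg_blocking_pids(blocked.pid)) AS b(pid) ON true
JOIN pg_stat_activity AS blocking ON blocking.pid = b.pid;"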

As for the racks: of the 21, only 2 are local; the other 19 are remote.

From this snippet you can get a rough idea of the DB locks used in MAAS. Every DatabaseLock uses pg_advisory_lock, but from what I’ve seen so far it has never been an issue. If you have evidence of a performance problem, it would be great to get a full report and evidence of what’s actually happening.
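If you want to double-check whether advisory locks are even involved, something like this (database name is again an assumption) lists the ones currently held:

# Advisory locks are the kind taken by MAAS's DatabaseLock via pg_advisory_lock.
sudo -u postgres psql maasdb -c \
  "SELECT pid, classid, objid, granted FROM pg_locks WHERE locktype = 'advisory';"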

Regarding the 19 remote racks, I’m not surprised that you had some connection issues. Having the racks in a remote location is not a supported setup.

Got it, I’ll keep an eye out to see whether I can correlate these locks with any performance issue.

But my main concern right now is still the memory usage: it keeps rising at a constant rate and forces me to restart the instances every day. I’ve set up an auditctl rule to watch the regiond process, but it has given me nothing so far.

Where or what else would you recommend looking at?

If possible, could you try disabling the metrics monitoring in MAAS and see if you still experience the memory leak? I do suspect there might be a leak there with that many racks, as we track many metrics about RPC latencies and the like.

@r00ta, we’ve looked everywhere for how to properly disable the metrics monitoring, but with no success. We managed to block the metrics port on the firewall so the data is no longer exported, but I imagine the metrics are still being produced either way.

How can I properly disable them? I’m using the 3.4.0 snap version.

See https://maas.io/docs/monitoring-maas-activities, in the section about enabling Prometheus.

So I’d only have to set this to false?
maas $PROFILE maas set-config name=prometheus_enabled value=false
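And presumably I can confirm it took effect afterwards with:

maas $PROFILE maas get-config name=prometheus_enabled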

@r00ta, I’ve noticed something pretty weird today. I had a spike in memory usage, load and network input at a specific time:

I checked the logs for that same time window and saw something odd in the internal DNS logs:

2024-10-22 10:50:08 maasserver.region_controller: [critical] Failed configuring DNS.
	Traceback (most recent call last):
	  File "/snap/maas/32469/usr/lib/python3/dist-packages/twisted/internet/asyncioreactor.py", line 271, in _onTimer
	    self.runUntilCurrent()
	  File "/snap/maas/32469/usr/lib/python3/dist-packages/twisted/internet/base.py", line 991, in runUntilCurrent
	    call.func(*call.args, **call.kw)
	  File "/snap/maas/32469/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 700, in errback
	    self._startRunCallbacks(fail)
	  File "/snap/maas/32469/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 763, in _startRunCallbacks
	    self._runCallbacks()
	--- <exception caught here> ---
	  File "/snap/maas/32469/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 857, in _runCallbacks
	    current.result = callback(  # type: ignore[misc]
	  File "/snap/maas/32469/lib/python3.10/site-packages/maasserver/region_controller.py", line 403, in _onDNSReloadFailure
	    failure.trap(DNSReloadError)
	  File "/snap/maas/32469/usr/lib/python3/dist-packages/twisted/python/failure.py", line 451, in trap
	    self.raiseException()
	  File "/snap/maas/32469/usr/lib/python3/dist-packages/twisted/python/failure.py", line 475, in raiseException
	    raise self.value.with_traceback(self.tb)
	  File "/snap/maas/32469/usr/lib/python3/dist-packages/twisted/python/threadpool.py", line 244, in inContext
	    result = inContext.theWork()  # type: ignore[attr-defined]
	  File "/snap/maas/32469/usr/lib/python3/dist-packages/twisted/python/threadpool.py", line 260, in <lambda>
	    inContext.theWork = lambda: context.call(  # type: ignore[attr-defined]
	  File "/snap/maas/32469/usr/lib/python3/dist-packages/twisted/python/context.py", line 117, in callWithContext
	    return self.currentContext().callWithContext(ctx, func, *args, **kw)
	  File "/snap/maas/32469/usr/lib/python3/dist-packages/twisted/python/context.py", line 82, in callWithContext
	    return func(*args, **kw)
	  File "/snap/maas/32469/lib/python3.10/site-packages/provisioningserver/utils/twisted.py", line 856, in callInContext
	    return func(*args, **kwargs)
	  File "/snap/maas/32469/lib/python3.10/site-packages/provisioningserver/utils/twisted.py", line 203, in wrapper
	    result = func(*args, **kwargs)
	  File "/snap/maas/32469/lib/python3.10/site-packages/maasserver/utils/orm.py", line 771, in call_within_transaction
	    return func_outside_txn(*args, **kwargs)
	  File "/snap/maas/32469/lib/python3.10/site-packages/maasserver/utils/orm.py", line 574, in retrier
	    return func(*args, **kwargs)
	  File "/usr/lib/python3.10/contextlib.py", line 79, in inner
	    return func(*args, **kwds)
	  File "/snap/maas/32469/lib/python3.10/site-packages/provisioningserver/prometheus/utils.py", line 127, in wrapper
	    result = func(*args, **kwargs)
	  File "/snap/maas/32469/lib/python3.10/site-packages/maasserver/dns/config.py", line 125, in dns_update_all_zones
	    bind_write_zones(zones)
	  File "/snap/maas/32469/lib/python3.10/site-packages/provisioningserver/dns/actions.py", line 196, in bind_write_zones
	    zone.write_config()
	  File "/snap/maas/32469/lib/python3.10/site-packages/provisioningserver/dns/zoneconfig.py", line 661, in write_config
	    with freeze_thaw_zone(needs_freeze_thaw, zone=zi.zone_name):
	  File "/usr/lib/python3.10/contextlib.py", line 142, in __exit__
	    next(self.gen)
	  File "/snap/maas/32469/lib/python3.10/site-packages/provisioningserver/dns/actions.py", line 97, in freeze_thaw_zone
	    bind_thaw_zone(zone=zone, timeout=timeout)
	  File "/snap/maas/32469/lib/python3.10/site-packages/provisioningserver/dns/actions.py", line 73, in bind_thaw_zone
	    execute_rndc_command(cmd, timeout=timeout)
	  File "/snap/maas/32469/lib/python3.10/site-packages/provisioningserver/dns/config.py", line 310, in execute_rndc_command
	    call_and_check(rndc_cmd, timeout=timeout)
	  File "/snap/maas/32469/lib/python3.10/site-packages/provisioningserver/utils/shell.py", line 104, in call_and_check
	    stdout, stderr = process.communicate(timeout=timeout)
	  File "/usr/lib/python3.10/subprocess.py", line 1154, in communicate
	    stdout, stderr = self._communicate(input, endtime, timeout)
	  File "/usr/lib/python3.10/subprocess.py", line 2022, in _communicate
	    self._check_timeout(endtime, orig_timeout, stdout, stderr)
	  File "/usr/lib/python3.10/subprocess.py", line 1198, in _check_timeout
	    raise TimeoutExpired(
	subprocess.TimeoutExpired: Command '['rndc', '-c', '/var/snap/maas/32469/bind/rndc.conf.maas', 'thaw', 'b.3.0.0.4.0.0.2.0.4.4.6.5.0.6.2.ip6.arpa']' timed out after 2 seconds

That’s interesting. I wonder whether it’s the rackd process or named that is leaking all that memory.
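A quick way to narrow it down is to compare the resident memory of the candidate processes directly on the affected host, for example:

# Resident (RSS) and virtual (VSZ) memory of named, rackd and regiond.
ps axo pid,rss,vsz,args | grep -E '[n]amed|[r]ackd|[r]egiond'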

Do you have any insights? Can I provide any additional information to help you better analyze this issue?

We’ll be moving these two memory-hungry instances to better hosts with more memory, to give us a little room to breathe without worrying so much about restarting them all the time, but this behavior will probably continue, and our hands are tied with no idea how to proceed.

Unfortunately it’s impossible to provide enterprise support in this community forum. Some setups are very complex and can lead to very nasty bugs: without dedicated experts looking at things one by one it’s very hard to find the root of the problem.

Unfortunately, without all the details of your setup we are just shooting in the dark: my guess would be that it’s actually named that is leaking memory, perhaps due to the number of DNS records and updates in your environment. Or maybe it’s rackd, maybe not.

I’d be happy if you keep posting your findings here, but I think it’s important to set expectations about the potential outcome.

Ideally, in order to report a potential bug to us:

  • you should provide a full reproducer of the issue, OR
  • you should provide all the information from your environment (monitoring traces, full sos reports, network topology, etc.) AND enable debug mode to collect more logs.