Hello MaaS community,
I’m experiencing some serious issues with my MaaS setup and could use some help. Here’s a breakdown of my environment and the problems I’m facing:
Environment:
- MaaS version: 3.4.0 (snap installation)
- 2 MaaS instances on dedicated hosts
- Netbox instance triggering frequent MaaS API calls
- Ruby APIs interacting directly with MaaS API
- Dedicated PostgreSQL database host
Hardware Specs:
- Database host: 126GB RAM, Intel Xeon Silver 4210 CPU @ 2.20GHz
- Region instances: 32GB RAM, Intel Xeon E-2286G CPU @ 4.00GHz
Issues:
- Database Locks: Queries are freezing, causing numerous locks.
- Regiond Memory Consumption: regiond keeps spawning large numbers of child processes, and their memory usage keeps growing, which looks like a memory leak.
- Performance Impact: These issues are severely affecting system performance and stability.
Observations:
- regiond.conf is set to use only 2 workers.
- supervisord is spawning 2 workers as expected.
- Each worker spawns many regiond child processes.
- Child processes consume ~1.2GB virtual memory and ~500MB physical memory each.
- Parent regiond processes reach over 20GB of both virtual and physical memory.
- Memory consumption increases constantly without decreasing.
- Restarting the MaaS snap temporarily resolves the issues.
- Both MaaS regions seem to run queries simultaneously on the same database table, causing locks.
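For anyone who wants to compare numbers, the process counts and memory figures above can be collected with something like the following. This is only a minimal sketch: it assumes a Linux host with procps `ps`, and it assumes the relevant processes can be recognized by the substring "regiond" in their command line. The function names are my own, not part of MaaS.

```python
import subprocess
from collections import defaultdict


def parse_ps(output):
    """Parse `ps -eo pid,ppid,rss,vsz,args` output; sizes are in KiB."""
    procs = []
    for line in output.strip().splitlines()[1:]:  # skip the header row
        parts = line.split(None, 4)
        if len(parts) < 5:
            continue
        pid, ppid, rss, vsz, args = parts
        procs.append({
            "pid": int(pid),
            "ppid": int(ppid),
            "rss_kib": int(rss),
            "vsz_kib": int(vsz),
            "args": args,
        })
    return procs


def regiond_summary(procs, needle="regiond"):
    """Group matching processes by parent PID: process count and total RSS in MiB.

    `needle` is an assumption about how the processes appear in `ps`.
    """
    summary = defaultdict(lambda: {"count": 0, "rss_mib": 0.0})
    for p in procs:
        if needle in p["args"]:
            entry = summary[p["ppid"]]
            entry["count"] += 1
            entry["rss_mib"] += p["rss_kib"] / 1024
    return dict(summary)


def snapshot():
    """Run ps once and return the per-parent summary."""
    out = subprocess.run(
        ["ps", "-eo", "pid,ppid,rss,vsz,args"],
        capture_output=True, text=True, check=True,
    ).stdout
    return regiond_summary(parse_ps(out))
```

Calling `snapshot()` periodically (for example from cron) and logging the result should show whether the child count or the per-parent memory is what grows over time.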
Troubleshooting Steps Taken:
- Checked regiond logs for unusual behavior.
- Disabled the /metrics export to rule out constant export bursts (as suggested in the thread "MAAS regiond allocates an increasing amount of memory").
- Monitored query timestamps and database activity.
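The database monitoring can be approximated with a query against `pg_stat_activity` and `pg_blocking_pids()`, both standard in PostgreSQL 9.6 and later. The sketch below is an illustration, not the exact tooling used here: `conn` is assumed to be any DB-API connection (for example one from psycopg2), and the helper names are hypothetical. Grouping by `client_addr` is meant to show which region host the blocked sessions come from.

```python
# Sessions currently waiting on a lock, with the PIDs blocking them.
BLOCKED_SQL = """
SELECT a.pid,
       a.client_addr,
       pg_blocking_pids(a.pid) AS blocked_by,
       EXTRACT(EPOCH FROM now() - a.query_start) AS waiting_s,
       left(a.query, 80) AS query
FROM pg_stat_activity a
WHERE cardinality(pg_blocking_pids(a.pid)) > 0
ORDER BY waiting_s DESC;
"""


def fetch_blocked(conn):
    """Return the blocked sessions as a list of row tuples.

    `conn` is assumed to be an open DB-API connection to the MaaS database.
    """
    cur = conn.cursor()
    try:
        cur.execute(BLOCKED_SQL)
        return cur.fetchall()
    finally:
        cur.close()


def count_by_client(rows):
    """Count blocked sessions per client address (one address per region host)."""
    counts = {}
    for _pid, client_addr, _blocked_by, _waiting_s, _query in rows:
        key = str(client_addr)
        counts[key] = counts.get(key, 0) + 1
    return counts
```

Running `count_by_client(fetch_blocked(conn))` every few seconds would show whether the blocked sessions come from one region or from both.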
Evidence:
- Memory behavior before and after restarts: [attached image]
- Database locks (count and duration) and regiond parent and child process spawns: https://imgur.com/a/v1KOM7s (I can’t upload more than one image)
Questions:
- Has anyone encountered similar memory leak issues with regiond in MaaS 3.4.0?
- Are there known issues with database locking when multiple regions access the same table simultaneously?
- What additional debugging steps or configuration changes would you recommend?
- Are there any best practices for optimizing MaaS performance in a setup like mine?
Any insights, suggestions, or potential solutions would be greatly appreciated. I’m happy to provide any additional information that might be helpful in diagnosing this issue.
Thank you in advance for your help!