MAAS regiond allocates an increasing amount of memory

We discovered this when we stopped restarting the MAAS regiond services every night for the database backup.
I can see this behaviour both in 2.7 and after the upgrade to 2.9.1 (upgraded 2021-02-01).
After the upgrade we stopped restarting the MAAS region services every night.

And it is only one of the regiond processes that uses an increasing amount of memory.

After turning maas-metrics off in MAAS and removing it from Prometheus, memory consumption now looks OK. Memory utilization is also evenly distributed across the MAAS Python processes. So something is broken in the Prometheus MAAS metrics.
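For anyone who wants to check this on their own region controller, a quick way to see whether memory is skewed toward one worker is to list per-process resident memory. This is just a sketch; the `[r]egiond` pattern assumes a deb install where "regiond" appears on the command line, and may need adjusting for a snap install:

```shell
# List each regiond worker's PID and resident memory, largest first.
# The bracketed "[r]egiond" pattern keeps awk from matching unrelated
# processes that merely mention the pattern (e.g. this pipeline itself).
ps -eo pid=,rss=,args= --sort=-rss \
  | awk '/[r]egiond/ {printf "%s\t%.1f MiB\n", $1, $2 / 1024}'
```

If the leak is present, one worker's RSS stands out well above its siblings.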


nice catch, thank you! if you don’t mind telling me, how many machines are you running at once, on average?

Not sure what you mean by “how many machines are you running at once”.
I only run MAAS metrics on the region controller from one Prometheus server.

Prometheus conf:

```yaml
- job_name: maas-metrics
  honor_timestamps: true
  scrape_interval: 1m
  scrape_timeout: 10s
  metrics_path: /MAAS/metrics
  scheme: http
  static_configs:
    - targets:
```
actually, i was asking about the size of your MAAS site, as in how many machines do you have deployed at any given time? i’m particularly interested in the issues faced by customers with, say, 100 to 1000 machines deployed at any given time. :slight_smile:

1000 is pretty low actually. try 5000.


@billwear Right now we have 294 machines.


thanks, @glasse, that’s a good benchmark! do you have a feel for issues you see with that many machines, e.g., things you wouldn’t experience with fewer machines? special problems for such a large MAAS? performance nuances?

@evan.sikorski, please, do tell! e.g., how many deployed and operating at once? issues you experience with that many machines that you wouldn’t see with fewer machines? special problems for such a large MAAS? performance nuances?

at least 4900 of the 5000 are deployed.

no issues other than the machines list taking a very long time to load.

considering that opening a new tab with the machines list causes the entire list to reload, we have to be careful about opening new tabs.

other than that, it is stable like this.


thank you. that’s helpful. i’ll pass the tabs issue along to our UI team; that’s already an issue with smaller MAASes than yours.

Now it’s confirmed that every query to /MAAS/metrics allocates memory that is never released.
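A rough way to reproduce this is to sample the regiond workers' combined RSS, issue a batch of scrapes against the metrics endpoint, and sample again. A sketch, assuming a deb install on the default port; `MAAS_URL`, `N`, and the `[r]egiond` pattern are placeholders to adjust for your environment:

```shell
#!/bin/sh
# Sample regiond RSS, hit /MAAS/metrics N times, then sample again.
MAAS_URL="${MAAS_URL:-http://localhost:5240/MAAS/metrics}"
N="${N:-200}"

rss_kb() {
  # Sum RSS (KiB) across all regiond worker processes.
  ps -eo rss,args | awk '/[r]egiond/ {sum += $1} END {print sum + 0}'
}

before=$(rss_kb)
i=0
while [ "$i" -lt "$N" ]; do
  curl -fsS -o /dev/null "$MAAS_URL" || break
  i=$((i + 1))
done
after=$(rss_kb)
echo "RSS before: ${before} KiB, after: ${after} KiB, delta: $((after - before)) KiB"
```

On a leaking version the delta keeps growing with every run; with the metrics endpoint disabled it stays roughly flat.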


I used to poll these metrics in the past but have not done so in recent versions.

I’ll try to find time to test this myself to confirm, although I’m just another user like you.

Have you posted a bug yet?

yes, this should definitely be filed as a bug, even if it turns out to be something simpler, like a configuration thing. thanks for your diligence on this, @glasse!

I have a case with Canonical support and they have observed the same behavior. So this will definitely be filed as a bug. These metrics are a good way to keep track of the MAAS environment. We use them for things like HW (machine) and subnet utilization in our HW cloud.


Thanks @vtapia

This topic was automatically closed 2 days after the last reply. New replies are no longer allowed.