MAAS regiond allocates an increasing amount of memory

We discovered this when we stopped restarting the MAAS regiond services every night for the DB backup.
I can see this behaviour both in 2.7 and after the upgrade to 2.9.1 (upgraded 2021-02-01).
After the upgrade we stopped restarting the MAAS region services every night.

And only one of the regiond processes uses an increasing amount of memory.

After turning maas-metrics off in MAAS and removing it from Prometheus, memory consumption now looks OK. Memory utilization is also evenly distributed across the MAAS Python processes. So something is broken in the Prometheus MAAS metrics.


nice catch, thank you! if you don’t mind telling me, how many machines are you running at once, on average?

Not sure what you mean by “how many machines are you running at once”
I only run MAAS metrics on the region controller from one Prometheus server.

Prometheus conf:
    job_name: maas-metrics
    honor_timestamps: true
    scrape_interval: 1m
    scrape_timeout: 10s
    metrics_path: /MAAS/metrics
    scheme: http
    static_configs:
      - targets:
          - maas.rnd.scania.com:5240

actually, i was asking about the size of your MAAS site, as in how many machines do you have deployed at any given time? i’m particularly interested in the issues faced by customers with, say, 100 to 1000 machines deployed at any given time. :slight_smile:

1000 is pretty low actually. try 5000.


@billwear Right now we have 294 machines.


thanks, @glasse, that’s a good benchmark! do you have a feel for issues you see with that many machines, e.g., things you wouldn’t experience with fewer machines? special problems for such a large MAAS? performance nuances?

@evan.sikorski, please, do tell! e.g., how many deployed and operating at once? issues you experience with that many machines that you wouldn’t see with fewer machines? special problems for such a large MAAS? performance nuances?

at least 4900 of the 5000 are deployed.

no issues other than the machines list taking a very long time to load.

considering that opening a new tab with machines list causes it to reload the entire machines list, we have to be careful about opening new tabs.

other than that, it is stable like this.


thank you. that’s helpful. i’ll pass the tabs issues along to our UI team, that’s already an issue with smaller MAASes than yours.

Now it’s confirmed that every query to MAAS/metrics allocates memory that is never released.
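For anyone who wants to check this on their own region controller, here is a rough sketch of how the growth could be measured. It is not an official MAAS tool. It assumes a Linux host, that the script runs on the region controller itself, that the endpoint is reachable at `http://localhost:5240/MAAS/metrics` (the path and port come from the Prometheus config earlier in the thread; `localhost` is my assumption), and that the region processes have `regiond` somewhere in their command line. The function names are made up for illustration.

```python
import re
import time
import urllib.request
from pathlib import Path

# Assumption: run on the region controller; path and port are taken from
# the Prometheus config posted earlier in this thread.
METRICS_URL = "http://localhost:5240/MAAS/metrics"


def parse_rss_kib(status_text):
    """Return VmRSS in KiB from the text of /proc/<pid>/status, or None."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def regiond_rss(proc_root="/proc", needle="regiond"):
    """Map PID -> RSS (KiB) for processes whose command line contains `needle`."""
    result = {}
    for entry in Path(proc_root).iterdir():
        if not entry.name.isdigit():
            continue
        try:
            raw = (entry / "cmdline").read_bytes()
            cmdline = raw.replace(b"\0", b" ").decode(errors="replace")
            if needle not in cmdline:
                continue
            rss = parse_rss_kib((entry / "status").read_text())
        except OSError:
            continue  # process exited or is not readable by this user
        if rss is not None:
            result[int(entry.name)] = rss
    return result


def rss_growth(before, after):
    """Per-PID RSS change in KiB for PIDs present in both snapshots."""
    return {pid: after[pid] - before[pid] for pid in before if pid in after}


def scrape(url=METRICS_URL, count=100, pause=0.1):
    """Fetch the metrics endpoint `count` times, discarding the response body."""
    for _ in range(count):
        with urllib.request.urlopen(url, timeout=10) as response:
            response.read()
        time.sleep(pause)


def report(url=METRICS_URL, count=100):
    """Snapshot RSS, scrape `count` times, snapshot again, print the change."""
    before = regiond_rss()
    scrape(url, count)
    after = regiond_rss()
    for pid, delta in sorted(rss_growth(before, after).items()):
        print(f"pid {pid}: {delta:+d} KiB")
```

Calling `report()` a few times in a row, with Prometheus scraping disabled, should show whether one regiond process keeps growing while the others stay flat. RSS is only a coarse signal, so a single run proves little; a steady increase across repeated runs is what matches the behaviour described above.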


I used to poll these metrics in the past but have not done so in recent versions.

I’ll try to find time to test this myself to confirm, although I’m just another user like you.

Have you posted a bug yet?

yes, this should definitely be filed as a bug, even if it turns out to be something simpler, like a configuration thing. thanks for your diligence on this, @glasse!

I have a case with Canonical support and they have observed the same behavior. So this will definitely be filed as a bug. These metrics are a good way to keep track of the MAAS environment. We use them for things like HW (machine) and subnet utilization in our HW cloud.
