MAAS regiond allocates an increasing amount of memory

glasse · 9 February 2021 08:59

We discovered this when we stopped restarting the MAAS regiond services every night for the DB backup.
I see this behaviour both in 2.7 and in 2.9.1, which we upgraded to on 2021-02-01.
After that upgrade we stopped the nightly restart of the MAAS regiond services.

And only one of the regiond processes uses an increasing amount of memory.
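A minimal sketch of how the per-process memory could be compared, assuming a Linux host where `ps -eo pid,rss,args` is available and where the region controller workers have "regiond" somewhere in their command line (that match string is an assumption, adjust it to your install). The helper names `parse_ps` and `regiond_rss` are made up for this example.

```python
import subprocess


def parse_ps(text, needle="regiond"):
    """Return [(pid, rss_kib, command)] for ps lines whose command contains needle.

    Expects the output of `ps -eo pid,rss,args`; rows are sorted by RSS,
    largest first, so a single growing process stands out at the top.
    """
    rows = []
    for line in text.splitlines():
        parts = line.split(None, 2)
        # Skip the header line and anything that is not "pid rss command".
        if len(parts) < 3 or not parts[0].isdigit() or not parts[1].isdigit():
            continue
        pid, rss, cmd = parts
        if needle in cmd:
            rows.append((int(pid), int(rss), cmd))
    return sorted(rows, key=lambda row: row[1], reverse=True)


def regiond_rss(needle="regiond"):
    """Run ps once and return the matching processes with their RSS in KiB."""
    result = subprocess.run(
        ["ps", "-eo", "pid,rss,args"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_ps(result.stdout, needle)
```

Calling `regiond_rss()` on the region controller at intervals (for example from cron) and logging the result should show whether one PID keeps growing while the others stay flat.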

glasse · 22 April 2021 06:33

After turning maas-metrics off in MAAS and removing it from Prometheus, memory consumption looks OK. Memory utilization is also evenly distributed across the MAAS python processes. So something is broken in the Prometheus MAAS metrics.

billwear · 22 April 2021 14:07

nice catch, thank you! if you don’t mind telling me, how many machines are you running at once, on average?

glasse · 26 April 2021 06:40

Not sure what you mean by “how many machines are you running at once”
I only run MAAS metrics on the region controller from one Prometheus server.

Prometheus conf:
job_name: maas-metrics
honor_timestamps: true
scrape_interval: 1m
scrape_timeout: 10s
metrics_path: /MAAS/metrics
scheme: http
static_configs:
  - targets:
      - maas.rnd.scania.com:5240

billwear · 26 April 2021 18:42

actually, i was asking about the size of your MAAS site, as in how many machines do you have deployed at any given time? i’m particularly interested in the issues faced by customers with, say, 100 to 1000 machines deployed at any given time.

evan.sikorski · 27 April 2021 03:26

1000 is pretty low actually. try 5000.

glasse · 27 April 2021 06:18

@billwear Right now we have 294 machines.

billwear · 27 April 2021 12:08

thanks, @glasse, that’s a good benchmark! do you have a feel for issues you see with that many machines, e.g., things you wouldn’t experience with fewer machines? special problems for such a large MAAS? performance nuances?

billwear · 27 April 2021 12:09

@evan.sikorski, please, do tell! e.g., how many deployed and operating at once? issues you experience with that many machines that you wouldn’t see with fewer machines? special problems for such a large MAAS? performance nuances?

evan.sikorski · 30 April 2021 04:00

at least 4900 of the 5000 are deployed.

no issues other than the machines list taking a very long time to load.

considering that opening a new tab with machines list causes it to reload the entire machines list, we have to be careful about opening new tabs.

other than that, it is stable like this.

billwear · 30 April 2021 15:11

thank you. that’s helpful. i’ll pass the tabs issues along to our UI team, that’s already an issue with smaller MAASes than yours.

glasse · 3 May 2021 11:53

Now it’s confirmed that every query to MAAS/metrics does allocate memory that is never released.
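For anyone who wants to reproduce this, here is a rough sketch of one way to test it, not the exact procedure we used. It assumes it runs on the region controller itself (it reads `/proc/<pid>/status`), that you pass in the regiond PIDs yourself, and that the metrics URL is reachable without authentication. The function names are invented for this example.

```python
import time
import urllib.request


def read_rss_kib(pid):
    """Read VmRSS (resident memory, KiB) for one process from /proc."""
    with open(f"/proc/{pid}/status") as status:
        for line in status:
            if line.startswith("VmRSS:"):
                return int(line.split()[1])
    raise RuntimeError(f"no VmRSS entry for pid {pid}")


def growth_per_request(samples):
    """Least-squares slope of the samples, i.e. average KiB gained per request."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    num = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(samples))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


def probe(url, pids, requests=200, pause=0.1):
    """Query url repeatedly and record the summed RSS of pids after each request.

    The RSS is summed over all given PIDs because any of the regiond
    workers may end up serving a particular request.
    """
    samples = [sum(read_rss_kib(pid) for pid in pids)]
    for _ in range(requests):
        with urllib.request.urlopen(url, timeout=10) as response:
            response.read()  # drain the body like a real scrape would
        time.sleep(pause)
        samples.append(sum(read_rss_kib(pid) for pid in pids))
    return samples
```

With our setup the call would look like `growth_per_request(probe("http://maas.rnd.scania.com:5240/MAAS/metrics", pids))`. A slope that stays clearly positive over a few hundred requests, and does not fall back after the requests stop, would match what we see. A slope near zero would point elsewhere.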

evan.sikorski · 3 May 2021 15:10

I used to poll these metrics in the past but have not done so in recent versions.

I’ll try to find time to test this myself to confirm, although I’m just another user like you.

Have you posted a bug yet?

billwear · 3 May 2021 16:35

yes, this should definitely be filed as a bug, even if it turns out to be something simpler, like a configuration thing. thanks for your diligence on this, @glasse!

glasse · 4 May 2021 06:41

I have a case with Canonical support and they have observed the same behavior. So this will definitely be filed as a bug. These metrics are a good way to keep track of the MAAS environment. We use them for things like HW (machine) and subnet utilization in our HW cloud.

glasse · 8 June 2021 06:28

Thanks @vtapia

system · 18 May 2022 21:21

This topic was automatically closed 2 days after the last reply. New replies are no longer allowed.