Slow API response time

Hoping to get some help with the following problem:

My goal is to get information about machines enlisted in MAAS through API requests. When testing API response times, I see that a request takes anywhere between 0.2 and 2 seconds for a single machine's info and 1.2 to 4 seconds for all systems (currently 30 enlisted), which is quite slow. Is this the expected response time? If not, what could be the issue?
I'm planning to have at least 100 machines in MAAS at any time, so this performance is a problem (considering the response time is only going to increase with more systems).
MAAS version - 3.5.1
One region+rack controller connected to 2 subnets
Hardware is not overloaded with MAAS jobs
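
For context, a minimal way to reproduce the timing measurement with the MAAS CLI (a sketch only; it assumes a logged-in CLI profile, and $PROFILE and $SYSTEM_ID are placeholders):

time maas $PROFILE machine read $SYSTEM_ID > /dev/null
time maas $PROFILE machines read > /dev/null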

Hi @34t24tr ,

The API to get a single machine should be reasonably fast. I've never seen it take 2 seconds.

The machine listing performance, on the other hand, is a well-known problem that we are addressing.

On MAAS 3.5 (snap), we're seeing minutes of API response time to read a machine when 32 are in scope:

data.maas_machine.tst["iter.element"]: Read complete after 2m33s [id=8p8tyq]

The API problems with MAAS seem to "stack up" as requests come in, and it lags further and further behind.

This causes problems when commissioning or deploying large numbers of machines via the API, as MAAS effectively freezes and stops rendering responses for the node control plane, resulting in timed-out PXE/HTTP/etc. boot stages and other failures.

How many machines are you managing with MAAS and what’s the actual load on the servers? Also, do you mind running

select count(*) from maasserver_event;

on the db?
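
(For example, something along these lines, assuming the default maasdb database name and direct access to the system PostgreSQL; a snap install may need to connect to its bundled database instead:)

sudo -u postgres psql maasdb -c "select count(*) from maasserver_event;"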

As I said above, that happens with 32 machines (16 with enough disks and NICs will do it too when code iterates over them in nested sets). A Terraform plan using Canonical's own TF provider for 32 machines with 8 disks and 12 NICs each takes over an hour when just 4 of those NICs are bonded with a single L3 configuration on top of that bond. These are new machines which have done nothing yet; they were commissioned into MAAS and then provisioned via TF in small batches of 5-6 at a time, because the MAAS API times out when deploying any more than that.

I see squashfuse hitting a few cores pretty hard (if the machine already has squashfs loaded, just use the kmod; I say this as one of the people who helped with Slax back in the day: squashfs was written to stuff OSes into what are now considered minuscule USB drives, and we didn't have userspace namespaces and all that back then), but this is a rackd problem: rackd PIDs are pinned at 100% CPU usage.

It would be really nice to have a platform-agnostic packaging setup so we wouldn't be digging through snap insanity to try to find out what's going on.

Thanks for the report!

Given that you use TF, it would be great if you could provide a reproducer for the performance issues you are hitting.

@r00ta: enroll 16 machines with a few disks and NICs each, then have TF iterate (using variables, locals, and other data sources) over all of the machines to read their data.maas_machine types, create their resources, and then use those resources to feed iteration over maas_block_device, bond, vlan, link, and finally instance. Even with MAAS on a 192-core machine this takes 20+ minutes of API call iteration, as each call slows the whole thing down. You can do the same thing with shell scripts on the CLI, which hit the same API: read all the disks/NICs/machines and iterate over those IDs to run commands in parallel over whatever resource type you're iterating, using job control or another concurrency primitive.
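
A rough shell-only sketch of that access pattern (not the exact TF reproducer; it assumes a logged-in $PROFILE and that jq and GNU xargs are available):

# list every machine, then fan out parallel reads of each machine's disks and NICs
maas $PROFILE machines read | jq -r '.[].system_id' > ids.txt
xargs -a ids.txt -P 8 -I{} maas $PROFILE block-devices read {} > /dev/null
xargs -a ids.txt -P 8 -I{} maas $PROFILE interfaces read {} > /dev/null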

I suspect the issue is actually in the TF provider, as it might be making some very expensive calls that are not necessary. If you could provide the TF code or the monitoring profile of your MAAS instance, we can try to figure out which calls are being made and which are putting MAAS under stress.

Never mind, I think I found the issue: Many functions are calling the machine listing to retrieve a single machine, overloading the MAAS regions · Issue #239 · canonical/terraform-provider-maas · GitHub