Loading of machines still slow in 2.9 (better than 2.6.x though)

dandruczyk · 18 April 2021 18:24

Background: I have a large environment with 1400+ hosts, 3 region controllers (HA with TLS offload via haproxy and a floating VIP via keepalived, all provisioned with terraform and puppet), and 28 rack controllers, which I just rebuilt as 2.9.x.

With MAAS 2.6 it took 3-8 minutes to load the machines page (ouch! Our #1 user complaint). Now with 2.9 it takes just about a minute (they’ll complain eventually).

Questions:
Why does it need to load all of them? Search doesn’t work consistently until the FULL LIST is loaded, but in this modern web UI world, what is a JavaScript call to the region to do a search actually going to cost? This master machine list should be cacheable (e.g. memcached or similar) so that it’s not so painful to load on big environments. The search IMHO should avoid the cache for queries (sounds crazy, but bear with me), query the DB every time (freshest data), and UPDATE the cache, so that the “main list” of machines is kept current without needing to ask the DB for all of them again any time the tab is reloaded.
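To make the idea concrete, here is a minimal Python sketch of the "search hits the DB, then refreshes the cache" pattern. Everything in it is hypothetical: `MachineCache`, `fake_db_search`, and the in-memory dict standing in for memcached are illustrative assumptions, not MAAS code or its API.

```python
from typing import Callable, Dict, List

Machine = Dict[str, str]

# Hypothetical stand-in for the region database.
FAKE_DB: List[Machine] = [
    {"system_id": "abc001", "hostname": "web-01", "status": "Deployed"},
    {"system_id": "abc002", "hostname": "web-02", "status": "Ready"},
    {"system_id": "abc003", "hostname": "db-01", "status": "Deployed"},
]


def fake_db_search(substring: str) -> List[Machine]:
    """Return copies of the rows whose hostname contains the substring."""
    return [dict(m) for m in FAKE_DB if substring in m["hostname"]]


class MachineCache:
    """Per-region cache of the main machine list, keyed by system_id."""

    def __init__(self, search_db: Callable[[str], List[Machine]]) -> None:
        self._search_db = search_db
        self._machines: Dict[str, Machine] = {}

    def search(self, substring: str) -> List[Machine]:
        # Searches bypass the cache and always query the database...
        fresh = self._search_db(substring)
        # ...then write the fresh rows back so the main list stays current.
        for machine in fresh:
            self._machines[machine["system_id"]] = machine
        return fresh

    def main_list(self) -> List[Machine]:
        # Served purely from the cache, with no database round trip.
        return sorted(self._machines.values(), key=lambda m: m["hostname"])


cache = MachineCache(fake_db_search)
cache.search("web")                 # first search populates two entries
FAKE_DB[1]["status"] = "Deploying"  # state changes in the database
cache.search("web-02")              # a later search refreshes that entry
```

The point of the design is that the expensive "give me everything" query happens rarely, while every cheap targeted search keeps the cached list from going stale.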

Make the number of machines fetched from the DB a configurable setting (it seems to count by 100s, not sure if you’re querying for 100 at a time or not), as well as the number listed per page. Those of us with SSD-backed DBs can probably be more aggressive on some queries if the schema is optimal with proper indexes.
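As a rough illustration of what a configurable fetch size would change, here is a small Python sketch. The function name `fetch_in_batches`, the default of 100 (which only mirrors my guess above about the current behavior), and the generated hostnames are all assumptions for the example, not anything taken from MAAS.

```python
from typing import Iterator, List, Sequence


def fetch_in_batches(
    rows: Sequence[str], batch_size: int = 100
) -> Iterator[List[str]]:
    """Yield the rows in consecutive chunks of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(rows), batch_size):
        yield list(rows[start:start + batch_size])


# Hypothetical inventory of 1400 hosts.
hostnames = [f"node-{i:04d}" for i in range(1400)]

# Assumed default: 100 per request.
default_batches = list(fetch_in_batches(hostnames))

# A site with a fast database could opt for fewer, larger requests.
aggressive_batches = list(fetch_in_batches(hostnames, batch_size=500))
```

With 1400 hosts the default setting needs 14 round trips, while a batch size of 500 needs 3, which is the kind of trade-off a site with a fast DB would want to control.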

Why does clicking on a machine and then going back force it to RELOAD ALL OF THEM AGAIN, causing yet another thumb-twiddling session? (This should leverage the above-mentioned cache…)

NOTE: Based on your docs for an HA setup, a user is confined to a single region controller, so this cache does not need to be sharded/shared or replicated among other region controllers; it could in effect be a simple memcached process in the snap.

huwshimi · 19 April 2021 09:15

Hi, thanks for the info. We’re aware that fetching the machine list is slow, but it’s useful to hear the numbers of different node types you have.

There are a number of things in play that are resulting in the slow performance you’re seeing.

We’re in the process of migrating the UI to a modern architecture; this is the reason you’re seeing improved performance vs 2.6. However, parts of the UI are still using the old architecture, and this is the reason you’re seeing the machine list being refetched (navigating between the machine list and machines won’t reload the list in the next release).

Once this migration is complete it will allow us to make some changes to the API and DB queries to optimise for the new UI.

We’re also exploring some different ways to surface machines in the UI for large deployments like yours, instead of the current list of machines we have right now.

How do you currently search for machines? Do you search by hostname or some other parameters? Do you use any of the advanced search filters (https://maas.io/docs/snap/2.9/ui/interactive-search#heading--manual-filters)?

It’d also be interesting to hear if you navigate through the pages looking for a machine or change the grouping or if you only ever search for machines.

dandruczyk · 19 April 2021 16:00

Mostly our users search via hostname or substring of the host name.

In my personal use of MAAS, I tend to jump between the subnets and machines pages, and if I don’t do these in separate tabs I get my thumb-twiddling session(s), especially as the environment is more built out. Other users are usually only on the machines page or a page for a specific machine, and if they don’t open those in separate tabs they get hit with the same penalty when they go back to the list expecting it to already be there…

The internal automation tool I wrote (maasterblaster) largely keeps users from needing to interact with the GUI, mainly because the experience on 2.6 was so terribly sluggish and fraught with weird rendering anomalies. But in many cases they might need to go to the GUI to look at machine logs to find out why something wouldn’t image properly, or to throw something into rescue mode.

pjonason · 21 April 2021 13:54

I’ll pile on here as we also have multiple thousands of machines in MAAS. Searching is usually filtered by either hostname or resource pool or owner, sometimes by fabric or AZ, as those are generally the borders of a MAAS consumer team’s machines. We’ve always been frustrated that getting a machine means getting every bit of information about that machine, especially when getting all machines (which takes about ten minutes in our environment), but that isn’t necessarily a UI problem, per se. However, it seems like the population of the machines in the UI is from oldest to newest, and since we’re usually more interested in the newest machines (which are the ones being deployed or fixed), we have to wait til the end of the population to start working. It would be better for us if the newest machines populated first.
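A small Python sketch of the two things I’m asking for, a slim summary per machine and newest-first ordering, might look like this. The record layout, the field names, and `summarize_newest_first` are made up for illustration and are not MAAS’s actual schema or API.

```python
from datetime import datetime
from typing import Any, Dict, List, Sequence

# Hypothetical full records, including heavy fields the list view never shows.
FULL_RECORDS: List[Dict[str, Any]] = [
    {"hostname": "old-01", "created": datetime(2019, 1, 5), "owner": "team-a",
     "pool": "default", "interfaces": ["eth0", "eth1"], "events": ["..."] * 50},
    {"hostname": "new-01", "created": datetime(2021, 4, 20), "owner": "team-b",
     "pool": "gpu", "interfaces": ["eth0"], "events": ["..."] * 50},
    {"hostname": "mid-01", "created": datetime(2020, 6, 1), "owner": "team-a",
     "pool": "default", "interfaces": ["eth0"], "events": ["..."] * 50},
]

# Assumed set of fields that a list view and its filters actually need.
SUMMARY_FIELDS = ("hostname", "owner", "pool", "created")


def summarize_newest_first(
    records: Sequence[Dict[str, Any]],
    fields: Sequence[str] = SUMMARY_FIELDS,
) -> List[Dict[str, Any]]:
    """Return slim summaries, ordered from most to least recently created."""
    ordered = sorted(records, key=lambda r: r["created"], reverse=True)
    return [{name: record[name] for name in fields} for record in ordered]


summaries = summarize_newest_first(FULL_RECORDS)
```

Here the machines currently being deployed or fixed come back first, and the heavy per-machine detail is left for a separate request when someone opens a specific machine.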

billwear · 21 April 2021 16:51

@pjonason, if there’s more you can tell me about your architecture, I’d like to know.

I document MAAS, so I have a natural curiosity about large MAAS installs and how our users apply MAAS, especially from the perspective of providing good doc to make usage more efficient.

Can you tell me more about your application and maybe some of the pain points?

dandruczyk · 20 May 2021 02:05

UI and general Pain points: 2.9.2 (9165-g.c3e7848d1)
Slow loading on login: up to 2 minutes before everything loads (1700+ machines). This is 3 region controllers in an HA configuration with 25 rack controllers.

subnets tab:
When editing/adding a vlan for space, or fabric, they are UNSORTED (painful)
Subnet edit view: Fabrics APPEAR to be sorted, but vlans definitely are not

“managed allocation” toggle: The help for this is confusing, bordering on nonsensical. I’ve used MAAS for well over a year and even now its descriptions and behavior still don’t make complete sense. It should be something like “DHCP on/off”, as that’s what it tends to actually do; toggling it off should grey out any dynamic ranges, unless I’m completely misunderstanding this field/behavior.

Availability Zones: This doesn’t seem to provide much value (yet). You can add a name, whoopity-do, now what?

DNS: MAAS should STOP TRYING TO BE AUTHORITATIVE DNS. It might be fine for a small shop, but for an enterprise it’s worthless. An enterprise is going to have enterprise-grade DNS, possibly more than one (AD and external). If MAAS wants to be a caching DNS server, or have integrations with upstream authoritative servers, then add modular functionality to manage records within those systems (AD, Bind, powerdns, etc).

After-the-fact connection to MAAS: Once a machine is imaged I DO NOT WANT IT TO HAVE A DEPENDENCY ON MAAS. We’ve had nodes that hung on boot trying to phone home for cloud-init crap after they were provisioned and fully deployed in production. This is unacceptable, as it led to an outage while the machine hung on boot waiting for a response that could never come…

Controllers: The view breaks if you have more than 18 or so; when you scroll, the page rendering breaks in odd ways.

Switching tabs and back to the machines tab sometimes causes the machines view to reload all 1700+ (and counting) objects, making the user throw up his/her hands in disgust and go for a drink…

billwear · 20 May 2021 13:27

thanks very much, @dandruczyk. exactly the kind of feedback we need to hear.