Landing page takes multiple minutes to load 2000 nodes

seanhoughton · 15 September 2020 21:57

My group has been using MaaS for a few years and the UI has always been pretty slow. However, now that we have around 2,000 machines it’s basically unusable. Users must wait around two minutes for the machines to populate on the landing page. Then, after clicking on a node, users have to wait another 3 or 4 minutes for the node page to populate. Most users simply click on a node, switch to other work tasks, then come back a while later to see if the page has loaded. When setting up new machines, this seriously affects productivity, to the point that we’re considering alternative products.

Both the web service and the database are hosted on 72-thread servers with 768 GB of RAM, 2x40 Gb Ethernet, and RAID6 SSD storage. It seems unlikely that the server hardware is the problem.

Is anyone else running MaaS with this many resources? How quickly should 2,000 nodes be expected to load? I would be happy with anything under 5 seconds - but we’re a couple of orders of magnitude off at the moment.

MaaS version is 2.8.1 (8567-g.c4825ca06-0ubuntu1~18.04.1)

seanhoughton · 16 September 2020 16:23

It looks like the underlying problem is the slow pagination of objects through the websocket.

The machine detail page loads every vlan, every fabric, and every machine in serial with small pagination sizes. I thought this bug was fixed in a patch in 5.6 but we’re on 5.8.1.

The index page has other problems. The filtering happens on the client side so even if you filter on an exact hostname you still have to wait for 50% of the nodes to load on average. For our server each websocket request returns a response within 300ms, but there are sometimes long pauses of 2 or 3 seconds between each request.
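To illustrate the "50% on average" point, here is a minimal sketch of serial pagination with client-side filtering. The page size of 25 and the `node-NNNN` hostnames are assumptions for illustration only, and it assumes the target host is equally likely to sit anywhere in the load order.

```python
import math

# Hypothetical inventory: 2,000 hostnames, delivered in load order.
NODE_COUNT = 2000
PAGE_SIZE = 25  # assumed batch size per websocket request

hostnames = [f"node-{i:04d}" for i in range(NODE_COUNT)]

# With client-side filtering, a host only becomes visible once the page
# containing it has arrived. Page numbers are 1-based.
pages_needed = [index // PAGE_SIZE + 1 for index in range(len(hostnames))]

total_pages = math.ceil(NODE_COUNT / PAGE_SIZE)
average_pages = sum(pages_needed) / len(pages_needed)

print(f"total pages: {total_pages}")
print(f"average pages before an exact-match filter can show its result: {average_pages}")
print(f"fraction of the full load: {average_pages / total_pages:.1%}")
```

Under these assumptions the filter result shows up, on average, only after roughly half of all pages have been fetched, regardless of how specific the filter is.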

billwear · 16 September 2020 17:00

@seanhoughton, I assume you mean “fixed in a patch in 2.6” rather than 5.6.

kitrandel · 16 September 2020 22:14

Hi Sean, thanks for your post. We’ve been progressively migrating the UI to React over the last few releases, in part to help resolve performance issues. You may recall that in the past, the performance of the machine list was poor, even after machines had loaded.

Our upcoming 2.9 release has been focused largely on LXD support in the KVM view. In the next cycle, however, we’ll be rebuilding machine details, which will address the performance issues you’re experiencing, as there will no longer be any need to load the old AngularJS client or refetch data.

I’m sure there’s further work to be done to improve response times from the server on that initial load as well. I’m curious, are the 2-3 second pauses you’re seeing between requests from the client, or responses from the server?

seanhoughton · 17 September 2020 23:03

I misspoke about the 2 second interval between requests - it’s more like 500ms. However, each request appears to take around 2 seconds to complete for a total of around 2500ms per pagination iteration. The index page requires 91 requests on the websocket to fully populate which is why it takes around 3.5 minutes to load.
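As a rough check of that arithmetic, here is a small sketch using only the figures quoted above (91 requests, about 500ms between requests, about 2 seconds per response). The function name is just for illustration.

```python
def total_load_seconds(requests: int, gap_ms: float, response_ms: float) -> float:
    """Total time for strictly serial requests: each one waits, then responds."""
    return requests * (gap_ms + response_ms) / 1000.0


total = total_load_seconds(requests=91, gap_ms=500, response_ms=2000)
print(f"{total:.1f} s, or about {total / 60:.1f} minutes")
```

That comes out to a little under four minutes, in the same ballpark as the observed load time given that both per-request numbers are approximate.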

I took a look at the SQL queries running while the page was loading, and I think the machine SELECT query being used takes at least 400ms to execute. I’m not sure where the remaining 1600ms comes from. It could be Python marshaling, JSON serialization, and network latency, but that’s just speculation. For reference, it takes 4000ms to select every node and return them to pgadmin over the same network connection using the following simple query:

select * from maasserver_node;

sparkiegeek · 27 October 2020 15:42

https://github.com/canonical-web-and-design/maas-ui/pull/1750 was implemented to address some of the issues you’ve found. Previously the UI loaded machines in batches of 25. As of 2.9.0b7 it pulls in 25 for the first call, then 100 at a time in subsequent batches.
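To give a feel for what the batch size change means, here is a minimal sketch that counts the machine-list requests needed under each scheme. It assumes exactly 2,000 machines and counts only machine batches, so it will not match the 91 requests reported earlier in the thread; the function name is hypothetical and not part of maas-ui.

```python
import math


def request_count(total: int, first_batch: int, later_batch: int) -> int:
    """Number of serial requests needed to fetch `total` items."""
    if total <= 0:
        return 0
    if total <= first_batch:
        return 1
    remaining = total - first_batch
    return 1 + math.ceil(remaining / later_batch)


machines = 2000  # assumed inventory size
old_requests = request_count(machines, first_batch=25, later_batch=25)
new_requests = request_count(machines, first_batch=25, later_batch=100)

print(f"batches of 25 throughout: {old_requests} requests")
print(f"25 first, then 100:       {new_requests} requests")
```

Fewer round trips only helps if the per-request overhead dominates; if response time grows in proportion to batch size, the gain will be smaller than the request counts suggest.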

You can see the results of the investigation in MAAS Show and Tell: Improving UI performance for large MAAS installs