Performance improvements in 2.6

We got reports that 2.6.0 was slower for some users, especially when
loading the machine listing for machines with many recorded events. We
had trouble reproducing the issue, but we spent a few weeks getting to
the bottom of it.

The summary of that work is that we believe we found the culprit, and
if you are one of those users having problems with a slow-loading
machine listing, please try 2.6.1. It’s currently at RC1, but should go
final soon.

In addition to the main culprit, we also found and fixed a number of
other issues.

All of those fixes, and more, are in 2.6.1rc1.

In addition to fixing those issues, we also spent time improving our
framework for testing performance. We’re now in a much better
position when it comes to debugging performance, and catching
performance regressions in the future.

That’s the summary. If you want the details of what we did and how,
please read on.

Performance improvements across all of MAAS

One of those fixes deserves highlighting. We found that a piece of
code was responsible for almost 50% of the time in the websocket call to
get the machine list, and all that code did was prevent developers from
making mistakes that might slow MAAS down. By moving that check to a
different place in the code, we made MAAS significantly faster overall.
Just look at this graph of how long it takes to get 25 machines over
the websocket:

On the left side is MAAS 2.6.0, the gap with no data is when the
upgrade happened, and on the right side is 2.6.1rc1. I’d still like
that graph to go down significantly, but it’s a good step in the right
direction.
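
The post doesn’t name the actual check that was moved, but the general pattern, a per-call safety check that only exists to catch developer mistakes, can be moved behind a flag so production calls skip it. Here is a hypothetical sketch of that pattern; the names `DEBUG_CHECKS`, `guarded`, and `no_queries_from_event_loop` are all invented for illustration:

```python
import functools

# Hypothetical flag: enable only in development builds.
DEBUG_CHECKS = False

def guarded(check):
    """Run `check` before every call (the per-call pattern that can show
    up in profiles), but only when debug checks are enabled."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            if DEBUG_CHECKS:  # near-zero cost in production
                check(*args, **kwargs)
            return fn(*args, **kwargs)
        return wrapper
    return decorator

def no_queries_from_event_loop(*args, **kwargs):
    """Stand-in for a developer-mistake check, e.g. catching database
    queries made from the wrong thread."""

@guarded(no_queries_from_event_loop)
def list_machines():
    return ["machine-1", "machine-2"]
```

With the flag off, the wrapper only costs one attribute lookup and a branch per call, instead of running the full check every time.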

Performance framework

I mentioned that we had problems reproducing the issue where the machine
listing was slow. One of the problems was that this issue happened
mostly for long-running MAAS deployments with a decent number of
machines. Our test lab doesn’t fit that description, and we couldn’t
reproduce it by populating the DB with test data.

But we also couldn’t go and buy 25 machines and have a MAAS deployment
run for a couple of months. Instead, we created a daemon that can
simulate physical machines.

Now, that daemon can run even on an old laptop, and still be able to
simulate 25, 50, even 100 machines. It creates a virtual NIC for each
of the machines it simulates. When it starts up for the first time, it
gets an IP address from DHCP, and then it does everything a real machine
would do when PXE booting and enlisting with MAAS. The main difference
is that it doesn’t execute any of the scripts MAAS sends it; it just
reads them to find out what MAAS wants the machine to do, and then sends
the results back to MAAS.
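
The core ideas can be sketched roughly like this. This is a hypothetical illustration, not the daemon’s actual code; the class and method names are invented:

```python
import hashlib

class FakeMachine:
    """Simulated machine: a stable identity plus faked script results."""

    def __init__(self, name):
        self.name = name

    def fake_mac(self):
        """Derive a stable, locally administered MAC address for the
        machine's virtual NIC from its name."""
        digest = hashlib.md5(self.name.encode()).digest()
        return "02:00:00:%02x:%02x:%02x" % (digest[0], digest[1], digest[2])

    def run_scripts(self, scripts):
        """Don't execute the scripts MAAS sends; just parse enough to
        report a plausible result for each one."""
        return {script["name"]: {"status": "passed", "output": ""}
                for script in scripts}
```

The key point is the last method: the daemon only needs to produce results MAAS will accept, not actually run commissioning or testing scripts on real hardware.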

It also has a Redfish-compliant endpoint, so that MAAS can turn on and
off the machines. From MAAS’ point of view, it’s a physical machine. It
can’t tell the difference.
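
The power-control side can be sketched as a small handler behind that endpoint. The `ComputerSystem.Reset` action and the `ResetType` values come from the Redfish specification; everything else here is a hypothetical simplification:

```python
# Power state per simulated machine: "On" or "Off".
power_states = {}

def handle_reset(machine, payload):
    """Apply a Redfish ComputerSystem.Reset action to a simulated
    machine and return the resulting power state."""
    reset_type = payload.get("ResetType")
    if reset_type in ("On", "ForceOn"):
        power_states[machine] = "On"
    elif reset_type in ("ForceOff", "GracefulShutdown"):
        power_states[machine] = "Off"
    else:
        raise ValueError("unsupported ResetType: %r" % reset_type)
    return {"PowerState": power_states[machine]}
```

Served over HTTP at the Redfish action path, a handler like this is all MAAS needs to believe it is powering a real machine on and off.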

This means that deploying a machine is very fast: only a few seconds.
We wrote a script that used the MAAS API to constantly redeploy
the machines, and overnight we had simulated months of normal usage,
in a way that’s more realistic than putting fabricated test data
into the database.
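
The shape of such a churn script might look like the following. The `client` here is any object with `release()` and `deploy()` methods; the real script drove the MAAS API, whose exact calls aren’t reproduced here:

```python
def churn(client, machine_ids, cycles):
    """Release and redeploy each machine repeatedly, simulating months
    of normal usage in a short time. Returns the number of deployments
    performed."""
    deployments = 0
    for _ in range(cycles):
        for machine_id in machine_ids:
            client.release(machine_id)
            client.deploy(machine_id)
            deployments += 1
    return deployments
```

Left running overnight against simulated machines that deploy in seconds, a loop like this accumulates the event history that a long-running production deployment would have.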

Prometheus metrics export

We also improved the Prometheus metrics that we added in 2.6, fixing a
few issues:

  • Not all RPC call latencies were being exported to Prometheus.
  • Exporting Prometheus metrics got slower with each MAAS restart.
  • No query counts or latencies were exported to Prometheus for websocket calls.
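
For the last point, the shape of the fix can be sketched as a wrapper that records call counts and latencies for websocket handlers. This uses a plain dict as a stand-in for the metrics registry; the real exporter presumably goes through a Prometheus client library:

```python
import time
from collections import defaultdict

# Hypothetical in-process store standing in for Prometheus metrics:
# handler name -> list of observed latencies in seconds.
ws_latencies = defaultdict(list)

def timed_ws_call(name, handler, *args, **kwargs):
    """Run a websocket handler and record its latency, even when the
    handler raises."""
    start = time.monotonic()
    try:
        return handler(*args, **kwargs)
    finally:
        ws_latencies[name].append(time.monotonic() - start)
```

The call count falls out for free as `len(ws_latencies[name])`, and both can then be exported for every websocket call, not just RPC calls.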

Great work!
We tried the latest version and it fixes the listing issue very well.
It takes less than 15 seconds to load nearly 1000 nodes.

However, we are facing another issue, which we believe is caused by too
many rack controllers. Could this performance framework also simulate
the effects of a large number of rack controllers?

Details of the issue here:


Is your simulation daemon available for general users?

The code itself for the simulation daemon is available here:

But the code isn’t exactly production quality yet. To address the performance issue quicker, I did the work on a spike branch, and I’m planning to merge the code back to master, cleaning it up and adding proper tests.

So the code works, but it might not be fully documented or handle all edge cases.


To answer hyuwang’s question: yes, it might help debug the issue with many rack controllers.

The rack controllers themselves need to run somewhere, but it should be possible to run many rack controllers on a single machine in lxd containers. That alone might reproduce the issue.

If not, it would be possible to connect fake machine daemons to each of the rack controllers to make them do more work.