We got reports that 2.6.0 was slower for some users, especially loading
the machine listing when there were many events recorded for the
machines. We had problems reproducing the issue, but we spent a few
weeks getting to the bottom of it.
The summary of that work is that we now believe that we found the
culprit, and if you are one of those users having problems with a
slow-loading machine listing, please try 2.6.1. It’s currently in RC1,
but should go final soon.
In addition to finding the issue, we also found and fixed a number of
other issues.
- Deleting machines would take a long time
- Periodic checks for a machine’s power status caused unnecessary load
on the database
- Code for preventing developers from making mistakes that might slow MAAS
down actually slowed down MAAS overall
All of those fixes, and more, are in 2.6.1rc1.
In addition to fixing those issues, we also spent time improving our
framework for testing performance. We’re now in a much better
position when it comes to debugging performance, and catching
performance regressions in the future.
That’s the summary. If you want the details on what, and how, we did,
please read on.
Performance improvements across all of MAAS
That last fix deserves some highlighting. We found out that a piece of
code was responsible for almost 50% of the time in the websocket call to
get the machine list. And all the code did was to prevent developers
from making mistakes that might slow MAAS down. By moving that check to
a different place in the code, we managed to make MAAS significantly
faster overall. Just look at this graph of how long it takes to get 25
machines over the websocket:
On the left side is MAAS 2.6.0. The gap with no data is when the
upgrade happened, and the right side is 2.6.1rc1. I’d still like that
graph to go down significantly more, but it’s a good step in the right
direction.
Performance framework
I mentioned that we had problems reproducing the issue where the machine
listing was slow. One of the problems was that this issue happened
mostly for long-running MAAS deployments with a decent number of
machines. Our test lab doesn’t fit that description, and we couldn’t
reproduce it by populating the DB with test data.
But we also couldn’t go and buy 25 machines, and have a MAAS deployment
run for a couple of months. Instead, we created a daemon that could
simulate physical machines. Something like this:
Now, that daemon can run even on an old laptop, and still be able to
simulate 25, 50, even 100 machines. It creates a virtual NIC for each
machine to be simulated. When it starts up the first time, it gets an
IP from DHCP, and then it does everything a real machine would do when
PXE booting and enlisting with MAAS. The main difference is that it
doesn’t execute any of the scripts MAAS sends it. It just reads them to
find out what MAAS wants the machine to do, and then it sends the
results back to MAAS.
It also has a Redfish-compliant endpoint, so that MAAS can turn the
machines on and off. From MAAS’ point of view, it’s a physical machine. It
can’t tell the difference.
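To make the power-control part more concrete, here is a minimal sketch of how a simulator could track power state in response to Redfish reset requests. This is not the daemon's actual code: the `SimulatedChassis` class and its method names are made up for illustration. Only the URL shape and the `ResetType` and `PowerState` values come from the Redfish ComputerSystem schema.

```python
# Hypothetical sketch, not the real daemon: in-memory power state for
# simulated machines, changed by Redfish-style ComputerSystem.Reset actions.

SYSTEMS_PREFIX = "/redfish/v1/Systems/"
RESET_ACTION_SUFFIX = "/Actions/ComputerSystem.Reset"

# A subset of the ResetType values from the Redfish ComputerSystem schema,
# mapped to the PowerState the simulated machine ends up in.
RESET_TO_STATE = {
    "On": "On",
    "ForceOn": "On",
    "ForceOff": "Off",
    "GracefulShutdown": "Off",
    "ForceRestart": "On",
    "GracefulRestart": "On",
}


class SimulatedChassis:
    """Keeps one power state per simulated machine (all start powered off)."""

    def __init__(self, system_ids):
        self.power = {system_id: "Off" for system_id in system_ids}

    def get_system(self, system_id):
        # A real Redfish resource has many more fields; the power state is
        # the one that matters for this sketch.
        return {"Id": system_id, "PowerState": self.power[system_id]}

    def handle_post(self, path, body):
        """Apply a reset action and return (HTTP status, response body)."""
        if not (path.startswith(SYSTEMS_PREFIX)
                and path.endswith(RESET_ACTION_SUFFIX)):
            return 404, {}
        system_id = path[len(SYSTEMS_PREFIX):-len(RESET_ACTION_SUFFIX)]
        if system_id not in self.power:
            return 404, {}
        new_state = RESET_TO_STATE.get(body.get("ResetType"))
        if new_state is None:
            return 400, {}
        self.power[system_id] = new_state
        return 204, {}
```

Since no hardware is involved, a power change is just a dictionary update, which is part of why a simulated machine can react in no time at all.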
This means that it’s very fast to deploy a machine, only a few seconds.
We wrote a script that used the MAAS API to constantly redeploy
the machines, and overnight we had simulated months of normal usage.
And in a way that’s more realistic than putting fabricated test data
into the database.
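The redeploy loop could look roughly like the sketch below. It is not the script we ran. The `client` object is a hypothetical wrapper around the MAAS API, assumed to offer `list_machines()`, `deploy()` and `release()`, and the function name and parameters are made up as well. The status names "Ready" and "Deployed" are the ones MAAS reports for machines.

```python
import time


def redeploy_cycle(client, rounds, poll_interval=1.0, sleep=time.sleep):
    """Drive every machine through repeated deploy/release cycles.

    `client` is a hypothetical MAAS API wrapper (an assumption of this
    sketch) with three methods:
      list_machines() -> iterable of (system_id, status_name) pairs
      deploy(system_id)
      release(system_id)
    Returns the number of deploy/release requests that were sent.
    """
    actions = 0
    for _ in range(rounds):
        for system_id, status in client.list_machines():
            if status == "Ready":
                client.deploy(system_id)
                actions += 1
            elif status == "Deployed":
                client.release(system_id)
                actions += 1
            # Machines in a transitional state (deploying, releasing, ...)
            # are left alone until a later round.
        sleep(poll_interval)
    return actions
```

With simulated machines a deployment finishes in seconds, so a short poll interval is enough to keep every machine busy.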
Prometheus metrics export
We also improved the Prometheus metrics that we added for 2.6, fixing
these problems:
- Not all RPC call latencies were being exported to Prometheus
- Exporting Prometheus metrics got slower for each MAAS restart
- No query count or latency was exported to Prometheus for websocket calls
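To illustrate what per-call query count and latency means, here is a small sketch that records both for a named websocket method. It is not how MAAS does it: the `CallStats` class is made up, it uses only the standard library rather than a Prometheus client, and it assumes that the database layer calls `record_query()` once per query. It is also not thread-safe and does not handle nested calls.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class CallStats:
    """Collects latency and query count per call (hypothetical helper)."""

    def __init__(self, clock=time.perf_counter):
        self._clock = clock
        self._queries = 0
        self.latencies = defaultdict(list)     # method -> seconds per call
        self.query_counts = defaultdict(list)  # method -> queries per call

    def record_query(self):
        # Assumed to be called by the database layer for every query.
        self._queries += 1

    @contextmanager
    def track(self, method):
        # Reset the counter, time the wrapped block, and store both numbers
        # even if the handler raises.
        self._queries = 0
        start = self._clock()
        try:
            yield
        finally:
            self.latencies[method].append(self._clock() - start)
            self.query_counts[method].append(self._queries)
```

A handler would be wrapped as `with stats.track("machine.list"): ...`, and the collected numbers could then be fed into whatever metrics exporter is in use.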