Hello community
We have recently spotted a bug (link) that might be impacting the performance of your MAAS environment. This issue has repercussions on various functionalities, including HTTP endpoints, UI, and deployments.
Check if you’re affected
To determine if your system is affected, run the following command from one of your region controllers. Adjust it accordingly if you don’t have the postgres user or if your database name is not maasdb:
sudo -u postgres psql -d maasdb -c "
SELECT count(*)
from maasserver_staticipaddress
left join maasserver_interface_ip_addresses on maasserver_staticipaddress.id = maasserver_interface_ip_addresses.staticipaddress_id
left join maasserver_interface on maasserver_interface.id = maasserver_interface_ip_addresses.interface_id
where maasserver_staticipaddress.ip is NULL and maasserver_interface.type = 'unknown' and maasserver_staticipaddress.alloc_type = 6;
"
If the result is large (thousands or hundreds of thousands), the workaround provided here will significantly improve performance. If the number is low, your environment is likely running fine, and you can wait for the fix in the upcoming upstream releases.
When is it going to be fixed upstream?
The bug fix will be available in MAAS versions 3.4.1 and onwards.
Temporary workaround
Take a snapshot of your database for disaster recovery, just in case.
For MAAS version 3.2 and above, enter the shell with:
sudo snap run --shell maas.supervisor -c "maas-region shell"
Then execute the following script. Note that it may take minutes or even hours; in our environment, it took 2.5 hours to delete over 100K records.
from maasserver.enum import INTERFACE_TYPE, IPADDRESS_TYPE
from maasserver.models import Interface

# Orphaned interfaces: type "unknown" linked to a DISCOVERED IP record with no address.
interfaces = Interface.objects.filter(
    type=INTERFACE_TYPE.UNKNOWN,
    ip_addresses__ip__isnull=True,
    ip_addresses__alloc_type=IPADDRESS_TYPE.DISCOVERED,
)
len_interfaces = len(interfaces)
for index, interface in enumerate(interfaces, start=1):
    print(f"\rDeleting interface {index}/{len_interfaces}", end="")
    interface.delete()

print("")
During the cleanup you can expect 2 CPUs with relatively high usage.
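To plan a maintenance window, the count returned by the check query can be turned into a rough duration estimate. The sketch below is only an approximation: it assumes the cleanup scales linearly and reuses the single data point from our environment (about 100K records in 2.5 hours). The function names are hypothetical helpers, not part of MAAS, and your actual rate will depend on your hardware and database load.

```python
# Rough estimate only: the reference rate comes from one environment
# (about 100K records deleted in 2.5 hours) and will differ elsewhere.
REFERENCE_RECORDS = 100_000
REFERENCE_SECONDS = 2.5 * 3600


def estimate_cleanup_seconds(record_count: int) -> float:
    """Scale the reference run linearly to `record_count` records."""
    if record_count < 0:
        raise ValueError("record_count must be non-negative")
    seconds_per_record = REFERENCE_SECONDS / REFERENCE_RECORDS
    return record_count * seconds_per_record


def format_duration(seconds: float) -> str:
    """Render a duration in seconds as 'Hh MMm'."""
    total_minutes = round(seconds / 60)
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours}h {minutes:02d}m"


if __name__ == "__main__":
    # Replace these sample counts with the result of the check query.
    for count in (5_000, 50_000, 100_000):
        print(count, format_duration(estimate_cleanup_seconds(count)))
```

Run it on any machine with Python 3; it does not need access to MAAS or the database.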
Some numbers
We experienced the following improvements in one of our environments with more than 100K orphan resources:
- Machine release time decreased from 26 minutes to 30 seconds.
- Machine allocation time decreased from 12 minutes to 10 seconds.
- Machine deployment time decreased from 1 hour and 8 minutes to 11 minutes.
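For readers who prefer ratios, the timings above can be converted into speedup factors. This is plain arithmetic on the figures listed in this post; the dictionary and function names are illustrative only.

```python
# (before, after) durations in seconds, taken from the figures listed above.
TIMINGS = {
    "release": (26 * 60, 30),
    "allocation": (12 * 60, 10),
    "deployment": (68 * 60, 11 * 60),
}


def speedup(before_seconds: float, after_seconds: float) -> float:
    """Return how many times faster the operation became."""
    return before_seconds / after_seconds


if __name__ == "__main__":
    for name, (before, after) in TIMINGS.items():
        print(f"{name}: {speedup(before, after):.1f}x faster")
```

That works out to roughly 52x for release, 72x for allocation, and about 6x for deployment in that one environment.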