Boot resources redesign

billwear · 8 April 2024 20:51

Explanation

MAAS previously stored the boot resources (boot-loaders, kernels and disk images) in the MAAS database, and then replicated them on all Rack controllers. This make operations difficult and slow, as the database quickly became huge and files had to be transferred to all Racks before they were available for use. To address this issue, we have moved the resource storage from the database to the Region controller, and repurposed the storage in the Rack.

Storing boot resources in the Region Controllers

All boot resources are stored in the local disk in each Controller host (/var/lib/maas/image-storage for deb or $SNAP_COMMON/maas/image-storage for snap). MAAS checks the contents of these directories on every start-up, removing unknown/stale files and downloading any missing resource.

MAAS checks the amount of disk space available before downloading any resource, and stops synchronizing files if there isn’t enough free space. This error will be reported in the logs and a banner in the Web UI.

Storage use by the Rack Controller

Images are no longer copied from the MAAS database to the rack. Instead, the rack downloads images from the region on-demand. This works well with the redesign of the rack controller (now known as the MAAS agent), which has been reimagined as a LRU caching agent. The MAAS agent has limited storage space, managing cache carefully, but it is possible to configure the size of this cache if you need to do so.

As boot resources are now downloaded from a Region controller on-demand, a fast and reliable network connection between Regions and Racks is essential for a smooth operation. Adjusting the cache size might also be important for performance if you regularly deploy a large number of different systems.

One-time image migration process

The first Region controller that upgrades will try to move all images out of the database. This is a background operation performed after all database migrations are applied, and it’s not reversible. This is also a blocking operation, so MAAS might be un-available for some time (i.e., you should plan for some downtime during the upgrade process).

MAAS will check if the host has enough disk space before starting to export the resources, and it will not proceed otherwise. In order to discover how much disk space you need for all your images, you can run the following SQL query in MAAS database before upgrading:

SELECT SUM(N."size")
FROM (SELECT DISTINCT ON (F."sha256") F."size" FROM MAASSERVER_BOOTRESOURCEFILE F ORDER by F."sha256") N;

The controllers are no longer capable of serving boot resources directly from the database, and won’t be able to commission or to deploy machines until this migration succeeds. If the process fails, you must free enough disk space and restart the controller for the migration to be attempted again.

Sync works differently

When downloading boot resources from an upstream source (e.g. images.maas.io), MAAS divides the workload between all Region controllers available, so each file is downloaded only once, but not all by the same controller. After all external files were fetched, the controllers synchronize files among them in a peer-to-peer fashion. This requires direct communication between Regions to be allowed, so you should review your firewall rules before upgrading.

In this new model, a given image is only available for deployment after all regions have it, although stale versions can be used until everyone is up to date. This differs from previous versions where the boot resource needed to be copied to all Rack controllers before it was available, meaning that the images should be ready for use sooner.