Last year, I encountered several bugs in MAAS, and after discussions with r00ta, I’d like to share how we use MAAS in our environment.
We currently manage over 5,000 bare-metal servers distributed across more than 200 data centers worldwide. While 5,000 servers may not sound like a large fleet in itself, their global distribution significantly increases operational complexity.
We initially adopted MAAS version 2.9 in 2019, upgraded to 3.1 in 2021, and are currently migrating to version 3.5. Due to our highly distributed infrastructure—where some data centers host as few as 10 servers, while others handle upwards of 500—we operate approximately six independent MAAS regions globally. For instance, Frankfurt manages servers across EMEA, Singapore oversees Asia-Pacific, and Phoenix handles the Americas. In certain geographical areas, we even maintain two distinct MAAS regions to separately manage different server clusters.
Each MAAS region uses a dedicated /16 private IP block (e.g., 10.0.0.0/16), with the first /24 subnet (10.0.0.0/24) specifically reserved for the control plane.
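As a quick illustration of this plan (not MAAS code; the /16 below is just the example block above), Python’s ipaddress module shows how a region’s block divides up:

```python
import ipaddress

# One MAAS region owns a dedicated private /16 (example block from above).
region = ipaddress.ip_network("10.0.0.0/16")
subnets = list(region.subnets(new_prefix=24))

control_plane = subnets[0]  # 10.0.0.0/24, reserved for the control plane
bmc_subnets = subnets[1:]   # 10.0.1.0/24, 10.0.2.0/24, ... one per TOR switch

print(control_plane, bmc_subnets[0])  # 10.0.0.0/24 10.0.1.0/24
```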
Every top-of-rack (TOR) switch, each corresponding to a rack controller, has its own exclusive /24 subnet (e.g., 10.0.1.0/24) dedicated to BMC access; the rack controllers typically use addresses such as 10.0.1.1 or 10.0.1.2 on that subnet to reach the BMCs. Most servers use a “sharelink” setup, in which the BMC shares a physical network port with the host, and this introduces a management challenge: a server cannot directly manage its own BMC. To address this, we implemented High Availability (HA) rack controllers. We carried a custom patch that solved this in MAAS 3.1; however, migrating to MAAS 3.5 reintroduced complications due to the adoption of Temporal workflows, and a definitive fix is still under active development.
Previously, almost every server in our environment had a public IP address, with the region API exposed directly to the Internet, so rack controllers (rackd) had straightforward connectivity. With MAAS 3.5, we transitioned to OpenVPN combined with FRRouting (FRR) for network connectivity: BGP advertises a /32 route for each server’s BMC IP, which also effectively resolves the “sharelink” management issue. Within our OpenVPN setup, it is crucial that rack controllers keep consistent IP addresses when connecting to region controllers, so we use the ifconfig-pool-persist setting for IP stability. One important caveat: enabling duplicate-cn and ifconfig-pool-persist simultaneously is problematic. With duplicate-cn enabled, network instability can leave a rack with a changed VPN IP address while the region keeps using the outdated one; the DHCP service then stops functioning entirely until the region daemon (regiond) is manually restarted. This is exactly the problem I hit: MAAS DHCP Server Issue and Bug Report #2089222.
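To make the OpenVPN caveat concrete, here is a minimal sketch of the relevant server-side directives, assuming a 10.8.0.0/24 tunnel subnet (the subnet, path, and timers are illustrative, not our exact configuration):

```
# server.conf excerpt (illustrative)
dev tun
topology subnet
server 10.8.0.0 255.255.255.0
# Persist client leases so each rack keeps a stable tunnel IP across reconnects.
ifconfig-pool-persist /etc/openvpn/ipp.txt
# Do NOT combine duplicate-cn with ifconfig-pool-persist: with duplicate-cn,
# a flapping rack can reconnect with a different tunnel IP while regiond keeps
# using the stale one, and DHCP stops until regiond is restarted.
# duplicate-cn
keepalive 10 60
```

On the FRR side, a rack controller can originate /32 routes for the BMCs it reaches along these lines (the ASN, router ID, addresses, and the choice of redistribute static are placeholders; there are several equally valid ways to inject the routes):

```
! frr.conf excerpt (illustrative)
ip route 10.0.1.1/32 eth1
router bgp 64512
 bgp router-id 10.8.0.10
 neighbor 10.8.0.1 remote-as 64512
 address-family ipv4 unicast
  redistribute static
 exit-address-family
```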
We deploy rackd, OpenVPN, and FRR within a single privileged LXC container (a bootstrap sketch follows the list below), which offers several operational benefits:
- Provides MAAS with a “bare-metal-like” environment, greatly simplifying upgrades and migrations.
- Enables direct manipulation of the host’s routing table and iptables rules.
- Avoids noisy kernel logs typically associated with Snap packaging and AppArmor restrictions.
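As a rough sketch of how such a container can be bootstrapped (the image, names, URL, and the choice of deb packages are my assumptions for illustration, not our exact procedure):

```bash
# Launch a privileged container for the rack stack (names are placeholders).
lxc launch ubuntu:22.04 maas-rack -c security.privileged=true

# Install the three components inside it; MAAS 3.x debs come from the MAAS PPA.
lxc exec maas-rack -- apt-get update
lxc exec maas-rack -- apt-get install -y openvpn frr maas-rack-controller

# Register the rack controller with its region over the VPN (placeholders).
lxc exec maas-rack -- maas-rack register --url http://10.8.0.1:5240/MAAS --secret <shared-secret>
```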
Additionally, while we maintain our own Chrony configurations for time synchronization, MAAS has a tendency to overwrite these settings. By containing MAAS components within LXC containers, we isolate these configuration changes, thus preserving host settings.
We maintain numerous custom patches tailored to our operational needs. These include Single Sign-On (SSO) integration, dropping the set-PXE-boot command that is sent just before a machine is powered on (certain older BMC models may unexpectedly reset if they receive PXE-boot and power-on commands within a short interval), and many other adjustments that keep things running smoothly.
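For context, the normal power-on flow amounts to the following two BMC operations, shown here as ipmitool equivalents purely for illustration (MAAS drives them through its IPMI power driver; the address and credentials are placeholders):

```bash
# 1. Tell the BMC to PXE-boot on next startup.
ipmitool -I lanplus -H 10.0.1.10 -U admin -P secret chassis bootdev pxe
# 2. Power the machine on.
ipmitool -I lanplus -H 10.0.1.10 -U admin -P secret chassis power on
# Some older BMCs reset when these arrive back-to-back; our patch skips the
# bootdev call for those models.
```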
Finally, MAAS alone isn’t sufficient for managing our extensive infrastructure. To improve operational efficiency, we’ve developed additional tools that help identify the physical location of servers and facilitate operations such as OS reinstalls and system reboots.
This is just a brief overview of how we use MAAS in our environment. If you have any questions or want to discuss further, please feel free to reach out.