Insufficient configuration sanity check on subnets crashes dhcpd

Background:

We’re using MAAS 3.5.3 (snap) deployed on Ubuntu 22.04. We have several hundred bare-metal nodes managed by MAAS and use MAAS-provided DHCP both for provisioning bare-metal server nodes and for offering dynamic IP addresses to Kubernetes pods and application containers. The DHCP server built into rackd serves multiple subnets and works through DHCP relays configured on L3 switches.

While adding a new subnet in MAAS, we accidentally found a loophole that silently crashes dhcpd.

One team member created the subnet “10.1.66.128/21” via the web UI, then quickly realized he had entered the wrong prefix length: 21 instead of 26. He changed the prefix length from 21 to 26 and carried on with other configuration tasks.

A few minutes later, our managed platforms raised multiple alarms indicating that some nodes had lost connectivity. We diagnosed the problem and found that the affected nodes had lost their IP addresses because they could not renew their DHCP leases. We then found the following error lines repeating in the rack controller logs.

Feb 26 15:48:30 ul-maas-rackd-01 maas.pebble[3218666]: 2025-02-26T07:48:30.864Z [pebble] POST /v1/services 2.546059ms 202
Feb 26 15:48:30 ul-maas-rackd-01 maas.pebble[3218666]: 2025-02-26T07:48:30.884Z [pebble] Service "dhcpd" stopped
Feb 26 15:48:30 ul-maas-rackd-01 maas.pebble[3218666]: 2025-02-26T07:48:30.889Z [pebble] Service "dhcpd" starting: sh -c "exec systemd-cat -t dhcpd $SNAP/bin/run-dhcpd"
Feb 26 15:48:30 ul-maas-rackd-01 dhcpd[1644698]: Internet Systems Consortium DHCP Server 4.4.1
Feb 26 15:48:30 ul-maas-rackd-01 dhcpd[1644698]: Internet Systems Consortium DHCP Server 4.4.1
Feb 26 15:48:30 ul-maas-rackd-01 dhcpd[1644698]: Copyright 2004-2018 Internet Systems Consortium.
Feb 26 15:48:30 ul-maas-rackd-01 dhcpd[1644698]: All rights reserved.
Feb 26 15:48:30 ul-maas-rackd-01 dhcpd[1644698]: For info, please visit https://www.isc.org/software/dhcp/
Feb 26 15:48:30 ul-maas-rackd-01 dhcpd[1644698]: Copyright 2004-2018 Internet Systems Consortium.
Feb 26 15:48:30 ul-maas-rackd-01 dhcpd[1644698]: All rights reserved.
Feb 26 15:48:30 ul-maas-rackd-01 dhcpd[1644698]: For info, please visit https://www.isc.org/software/dhcp/
Feb 26 15:48:30 ul-maas-rackd-01 dhcpd[1644698]: bad range, address 10.1.69.255 not in subnet 10.1.66.128 netmask 255.255.255.192
Feb 26 15:48:30 ul-maas-rackd-01 dhcpd[1644698]: bad range, address 10.1.69.255 not in subnet 10.1.66.128 netmask 255.255.255.192
Feb 26 15:48:30 ul-maas-rackd-01 dhcpd[1644698]: If you think you have received this message due to a bug rather
Feb 26 15:48:30 ul-maas-rackd-01 dhcpd[1644698]: than a configuration issue please read the section on submitting
Feb 26 15:48:30 ul-maas-rackd-01 dhcpd[1644698]: bugs on either our web page at www.isc.org or in the README file
Feb 26 15:48:30 ul-maas-rackd-01 dhcpd[1644698]: before submitting a bug.  These pages explain the proper
Feb 26 15:48:30 ul-maas-rackd-01 dhcpd[1644698]: process and the information we find helpful for debugging.
Feb 26 15:48:30 ul-maas-rackd-01 dhcpd[1644698]: exiting.

It was clear that the dhcpd process was refusing to start because of a configuration error.

We reviewed the configuration in the MAAS web UI and found an automatically created DHCP dynamic range (the one used to provision machines). It had been derived from the wrong subnet IP range, at the time the subnet was created with prefix length 21.

It turns out that after my colleague changed the subnet prefix length, the reserved range was no longer valid. However, MAAS didn’t validate the modified configuration before writing it to dhcpd.conf, and didn’t have dhcpd test the configuration before restarting the service. So dhcpd failed to start, and the problem had to be inspected and fixed manually.
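To illustrate the kind of check that was missing, here is a minimal Python sketch using only the standard `ipaddress` module. It is not MAAS code; the function name `range_fits_subnet` is made up for this example, and the range start address `10.1.68.0` is a hypothetical value (only the end address `10.1.69.255` appears in the log above).

```python
import ipaddress


def range_fits_subnet(range_start, range_end, cidr):
    """Return True if both ends of an IP range lie inside the subnet.

    strict=False lets us pass a CIDR with host bits set, such as
    "10.1.66.128/21", which is normalized to 10.1.64.0/21.
    """
    network = ipaddress.ip_network(cidr, strict=False)
    start = ipaddress.ip_address(range_start)
    end = ipaddress.ip_address(range_end)
    return start <= end and start in network and end in network


# Range end is taken from the dhcpd log; the start is a hypothetical value.
print(range_fits_subnet("10.1.68.0", "10.1.69.255", "10.1.66.128/21"))  # True
print(range_fits_subnet("10.1.68.0", "10.1.69.255", "10.1.66.128/26"))  # False
```

With prefix length 21 the subnet covers 10.1.64.0 to 10.1.71.255, so the range fits. With prefix length 26 it covers only 10.1.66.128 to 10.1.66.191, which is exactly the mismatch dhcpd complains about.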

We managed to fix the problem by deleting the previously generated dynamic reserved range and creating a correct one instead. However, this incident took us some time to investigate and caused some negative impact.

Here are my suggestions:

  1. Provide a basic configuration sanity check (e.g. a subnet address range check) before generating the dhcpd.conf config file. If the configuration contains problematic content, refuse to save and apply it.
  2. Run the configuration tests provided by the underlying service components before restarting them. Collect the return value and report any error. This provides a safety guarantee - even if the configuration files have problems, the running service won’t be interrupted, which may be critical in large production environments.
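For suggestion 2, a rough sketch of the idea in Python, under a few assumptions: ISC dhcpd has a `-t` flag that tests the configuration file given with `-cf` and exits non-zero on errors, and a "bad range" error like the one above should be caught by it since it is raised while the config is parsed. The function names, the `restart` callback and the way the binary is located are hypothetical and not how MAAS actually drives dhcpd.

```python
import subprocess


def check_dhcpd_config(conf_path, command=("dhcpd", "-t", "-cf")):
    """Run a config test command against conf_path.

    Returns (ok, output). The default command assumes a dhcpd binary
    on PATH; inside the MAAS snap the real path will differ.
    """
    result = subprocess.run(
        [*command, conf_path], capture_output=True, text=True
    )
    return result.returncode == 0, result.stdout + result.stderr


def apply_config(conf_path, restart, command=("dhcpd", "-t", "-cf")):
    """Restart the service only if the new config passes the test.

    `restart` is a caller-supplied callable (hypothetical) that
    performs the actual service restart.
    """
    ok, output = check_dhcpd_config(conf_path, command)
    if not ok:
        # Keep the running service untouched and hand the error back.
        return False, output
    restart()
    return True, output
```

The point of the design is that a failed test leaves the currently running dhcpd alone, so existing leases keep being renewed while the error is reported back to the operator.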

Thanks.

Thanks for the detailed report! I’d say MAAS should prevent the user from modifying the CIDR if it’s incompatible with the existing dynamic range and/or other things. Do you mind opening a bug?

Sure, of course.

I submitted a bug on Launchpad: Bug #2103615 “Insufficient configuration sanity check on subnets...” : Bugs : MAAS

Thanks for your reply @r00ta .