Enlistment times out, fails with external DHCP server

I’m attempting to set up MAAS 3.3.2 via snap for the first time. If I allow MAAS to manage DHCP on one of my subnets, I’m able to get a node booted and visible in the UI (where presumably I could proceed to commission the system). If I use my existing DHCP servers instead, and point my client to download ipxe.cfg from my MAAS server, the client eventually fails with:

cloud-init[755]: 2023-05-02 18:43:45,807 - url_helper.py[ERROR]: Timed out, no response from urls: ['']
cloud-init[755]: 2023-05-02 18:43:45,808 - url_helper.py[CRITICAL]: Timed out, no response from urls: [''] after 120 seconds
cloud-init[755]: 2023-05-02 18:43:45,808 - util.py[WARNING]: No instance datasource found! Likely bad things to come!

When I do a packet capture from my MAAS server during this time I see this:

16:35:19.884514 IP myhost-u22-1.my.domain.53144 > maas.my.domain.http: Flags [S], seq 4218718519, win 64240, options [mss 1460,sackOK,TS val 1708093121 ecr 0,nop,wscale 7], length 0
16:35:19.884559 IP maas.my.domain.http > myhost-u22-1.my.domain.53144: Flags [R.], seq 0, ack 1, win 0, length 0

So, it looks like the client is trying to reach the server, but the server keeps resetting the connections. This happens in a loop for a few minutes before the client finishes booting into a host named “ubuntu” rather than the “maas-enlisting-node” host I get when I let MAAS manage DHCP (in which case the host also appears in the MAAS UI as expected).
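For what it’s worth, a RST in reply to a SYN usually means nothing is listening on that port (or a firewall is actively rejecting, rather than dropping). Here is a small, self-contained sketch (using a local listener on a throwaway port, not your actual MAAS host) showing how that RST surfaces in userspace as `ConnectionRefusedError`:

```python
import socket

def probe(host, port, timeout=2.0):
    """TCP connect attempt: 'open', 'refused' (peer sent RST), or 'timeout'."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.settimeout(timeout)
    try:
        s.connect((host, port))
        return "open"
    except ConnectionRefusedError:
        return "refused"   # the kernel saw a RST, like the capture above
    except socket.timeout:
        return "timeout"   # a firewall DROP rule would look like this instead
    finally:
        s.close()

# Demonstrate with a listener we control on a free port.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))    # OS picks a free port
listener.listen(1)
port = listener.getsockname()[1]

first = probe("127.0.0.1", port)   # "open": something is listening
listener.close()
second = probe("127.0.0.1", port)  # "refused": nothing listening -> RST
print(first, second)
```

Probing the MAAS host from the client on both port 80 and port 5240 (the region API port used by the snap) could show whether the RSTs simply mean “nothing listening on port 80”.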

I know that external DHCP is officially not supported, but I’m perplexed by this decision, as I’m not sure how that limitation is useful to most folks wanting to deploy MAAS into an existing network.

To provide a little more detail: when I’m using external DHCP, the client hangs during cloud-init before enlistment happens, and I see no new data showing up in /var/snap/maas/common/log/rsyslog/maas-enlisting-node/$ip_addr like I do when I have MAAS DHCP management enabled.

In comparing my client where DHCP is MAAS-managed vs. not, I see this in /var/snap/maas/common/log/httpd/access.log when it’s working:

- - [16/May/2023:10:10:59 -0700] "GET /ipxe.cfg HTTP/1.1" 200 238 "-" "iPXE/1.20.1+ (g4bd0)"
- - [16/May/2023:10:10:59 -0700] "GET /ipxe.cfg-3a%3A13%3A7b%3A4d%3A1a%3Acb HTTP/1.1" 404 5 "-" "iPXE/1.20.1+ (g4bd0)"
- - [16/May/2023:10:10:59 -0700] "GET /ipxe.cfg-default-amd64 HTTP/1.1" 200 593 "-" "iPXE/1.20.1+ (g4bd0)"
- - [16/May/2023:10:10:59 -0700] "GET /images/ubuntu/amd64/ga-22.04/jammy/stable/boot-kernel HTTP/1.1" 200 11570216 "-" "iPXE/1.20.1+ (g4bd0)"
- - [16/May/2023:10:11:00 -0700] "GET /images/ubuntu/amd64/ga-22.04/jammy/stable/boot-initrd HTTP/1.1" 200 116315156 "-" "iPXE/1.20.1+ (g4bd0)"
- - [16/May/2023:10:11:15 -0700] "GET /images/ubuntu/amd64/ga-22.04/jammy/stable/squashfs HTTP/1.1" 200 464191488 "-" "Wget"
- - [16/May/2023:10:11:28 -0700] "GET /MAAS/rpc/ HTTP/1.1" 200 297 "-" "provisioningserver.rpc.clusterservice.ClusterClientService"
- - [16/May/2023:10:11:29 -0700] "GET /MAAS/metadata/latest/enlist-preseed/?op=get_enlist_preseed HTTP/1.1" 200 1333 "-" "Cloud-Init/23.1.1-0ubuntu0~22.04.1"
- - [16/May/2023:10:11:29 -0700] "GET /MAAS/metadata/latest/enlist-preseed/?op=get_enlist_preseed HTTP/1.1" 200 1333 "-" "Cloud-Init/23.1.1-0ubuntu0~22.04.1"
- - [16/May/2023:10:11:29 -0700] "GET /MAAS/metadata/2012-03-01/meta-data/instance-id HTTP/1.1" 200 17 "-" "Cloud-Init/23.1.1-0ubuntu0~22.04.1"
- - [16/May/2023:10:11:29 -0700] "GET /MAAS/metadata/2012-03-01/meta-data/instance-id HTTP/1.1" 200 17 "-" "Cloud-Init/23.1.1-0ubuntu0~22.04.1"
- - [16/May/2023:10:11:29 -0700] "GET /MAAS/metadata/2012-03-01/meta-data/instance-id HTTP/1.1" 200 17 "-" "python-requests/2.25.1"
- - [16/May/2023:10:11:29 -0700] "GET /MAAS/metadata/2012-03-01/meta-data/instance-id HTTP/1.1" 200 17 "-" "python-requests/2.25.1"

But I see this when it’s not working:

- - [16/May/2023:10:46:30 -0700] "GET /ipxe.cfg-3a%3A13%3A7b%3A4d%3A1a%3Acb HTTP/1.1" 404 5 "-" "iPXE/1.20.1+ (g4bd0)"
- - [16/May/2023:10:46:30 -0700] "GET /ipxe.cfg-default-amd64 HTTP/1.1" 200 575 "-" "iPXE/1.20.1+ (g4bd0)"
- - [16/May/2023:10:46:30 -0700] "GET /images/ubuntu/amd64/ga-22.04/jammy/stable/boot-kernel HTTP/1.1" 200 11570216 "-" "iPXE/1.20.1+ (g4bd0)"
- - [16/May/2023:10:46:31 -0700] "GET /images/ubuntu/amd64/ga-22.04/jammy/stable/boot-initrd HTTP/1.1" 200 116315156 "-" "iPXE/1.20.1+ (g4bd0)"
- - [16/May/2023:10:46:46 -0700] "GET /images/ubuntu/amd64/ga-22.04/jammy/stable/squashfs HTTP/1.1" 200 464191488 "-" "Wget"
- - [16/May/2023:10:46:58 -0700] "GET /MAAS/rpc/ HTTP/1.1" 200 297 "-" "provisioningserver.rpc.clusterservice.ClusterClientService"
- - [16/May/2023:10:46:58 -0700] "GET /MAAS/metadata/latest/enlist-preseed/?op=get_enlist_preseed HTTP/1.1" 200 1288 "-" "Cloud-Init/23.1.1-0ubuntu0~22.04.1"
- - [16/May/2023:10:46:58 -0700] "GET /MAAS/metadata/latest/enlist-preseed/?op=get_enlist_preseed HTTP/1.1" 200 1288 "-" "Cloud-Init/23.1.1-0ubuntu0~22.04.1"
- - [16/May/2023:10:47:00 -0700] "GET /MAAS/metadata/latest/enlist-preseed/?op=get_enlist_preseed HTTP/1.1" 200 1288 "-" "Cloud-Init/23.1.1-0ubuntu0~22.04.1"
- - [16/May/2023:10:47:00 -0700] "GET /MAAS/metadata/latest/enlist-preseed/?op=get_enlist_preseed HTTP/1.1" 200 1288 "-" "Cloud-Init/23.1.1-0ubuntu0~22.04.1"
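To make the difference between the two runs easier to see, here’s a small sketch that diffs just the request paths from the two access.log traces above; the failing run loops on the enlist-preseed fetch and never requests the instance-id metadata:

```python
# Unique request paths, in order, extracted from the working trace above.
working = [
    "/ipxe.cfg",
    "/ipxe.cfg-3a%3A13%3A7b%3A4d%3A1a%3Acb",
    "/ipxe.cfg-default-amd64",
    "/images/ubuntu/amd64/ga-22.04/jammy/stable/boot-kernel",
    "/images/ubuntu/amd64/ga-22.04/jammy/stable/boot-initrd",
    "/images/ubuntu/amd64/ga-22.04/jammy/stable/squashfs",
    "/MAAS/rpc/",
    "/MAAS/metadata/latest/enlist-preseed/?op=get_enlist_preseed",
    "/MAAS/metadata/2012-03-01/meta-data/instance-id",
]
# Unique request paths from the failing (external DHCP) trace above.
failing = [
    "/ipxe.cfg-3a%3A13%3A7b%3A4d%3A1a%3Acb",
    "/ipxe.cfg-default-amd64",
    "/images/ubuntu/amd64/ga-22.04/jammy/stable/boot-kernel",
    "/images/ubuntu/amd64/ga-22.04/jammy/stable/boot-initrd",
    "/images/ubuntu/amd64/ga-22.04/jammy/stable/squashfs",
    "/MAAS/rpc/",
    "/MAAS/metadata/latest/enlist-preseed/?op=get_enlist_preseed",
]
# Requests the working run makes that the failing run never does:
missing = [p for p in working if p not in failing]
print(missing)
# -> ['/ipxe.cfg', '/MAAS/metadata/2012-03-01/meta-data/instance-id']
```

So the failing boot dies exactly at the point where cloud-init would start talking to the metadata service, which matches the cloud-init timeout at the top of the thread.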

One thing I’ve noticed in the enlistment logs for working clients is that cloud-init reports the client receiving an IPv6 address. I don’t currently provide IPv6 addresses via my external server, but I’m not sure whether this is relevant to the failures I’m seeing.

From your description, it seems like your MAAS server is not correctly configured to work with your existing DHCP server. When MAAS manages DHCP, it provides more than just IP addresses to the machines; it also supplies the PXE boot options, DNS servers, and other network settings that machines need in order to enlist with MAAS correctly.

Still, you can use an external DHCP server with MAAS as long as it is configured to provide those same settings. The problem you’re experiencing seems to be related to the network boot (PXE) process rather than to DHCP itself.

The cloud-init errors and the access log entries indicate that the machine is never able to fetch the instance-id metadata from the MAAS server, which is a necessary step for the machine to identify itself to MAAS. This suggests either a network configuration issue that prevents the machine from reaching the MAAS metadata service, or a MAAS server that is not configured to respond to those requests.

One possibility could be that your DHCP server is not correctly set up to provide the necessary boot options for MAAS. For MAAS to work with an external DHCP server, the DHCP server needs to be configured with specific boot options that tell the machines where to get their PXE boot images and metadata. The MAAS documentation provides more information about PXE and DHCP for MAAS.
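If the external server is ISC dhcpd, the boot options can be adapted from the dhcpd.conf that MAAS itself generates (under /var/snap/maas/data/ on a snap install). A minimal sketch, with placeholder IPs and only a subset of the architecture cases MAAS handles; the filenames here are illustrative, so copy the exact ones from your MAAS-generated config:

```text
# Client system architecture (RFC 4578, DHCP option 93).
option arch code 93 = unsigned integer 16;

subnet 192.168.1.0 netmask 255.255.255.0 {
  option routers 192.168.1.1;          # placeholder gateway
  range 192.168.1.100 192.168.1.199;   # placeholder pool

  next-server 192.168.1.10;            # MAAS rack controller (TFTP)
  if option arch = 00:07 {
    filename "bootx64.efi";            # example: UEFI amd64
  } else {
    filename "pxelinux.0";             # example: legacy BIOS
  }
}
```

MAAS’s generated file also special-cases iPXE clients (chainloading the ipxe.cfg you mentioned), so mirroring that file for your subnets is the safest starting point.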

Regarding the IPv6 addresses, MAAS can work with both IPv4 and IPv6, but if you’re not providing IPv6 addresses through your external DHCP, this shouldn’t be an issue.

If none of the above helps, I’d recommend you look for DEBUG events in the MAAS logs to see if you can sort out what’s wrong.

Finally, remember that while it’s technically possible to use an external DHCP server with MAAS, the recommended and supported configuration is for MAAS to manage DHCP, because this ensures that all the necessary network services are correctly configured. If you continue to have issues with your existing DHCP server, you may want to consider switching to MAAS DHCP management.

Thanks Bill, I’ll review the dhcpd.conf file that MAAS uses and see if I can work something out.

Hey @bmcnally-uw, did you make any progress on this?

No, as I said in the other thread here:

I haven’t worked on it recently, but I will some time soon.

Hello @bmcnally-uw
I am wondering if that’s the same issue as in “Cloud-init fails to fetch MAAS datasource from metadata_url missing port”.

If that’s the same in your case, then that’s indeed a bug, and we are already working on a fix.
You can track the progress at https://bugs.launchpad.net/maas/+bug/2022926
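That bug would also be consistent with the packet capture earlier in the thread: the SYN goes to the plain HTTP port (80) and gets an RST, while the snap’s region API actually listens on port 5240. If the metadata_url handed to cloud-init omits the port, HTTP clients fall back to 80. A quick illustration (hypothetical hostname) of how the default port falls out of the URL:

```python
from urllib.parse import urlsplit

# metadata_url as it might look with the port dropped (hypothetical host).
broken = urlsplit("http://maas.my.domain/MAAS/metadata/")
fixed = urlsplit("http://maas.my.domain:5240/MAAS/metadata/")

# With no explicit port, HTTP clients default to 80 -- where nothing
# answers on a snap install, hence RSTs like the ones in the capture.
port_broken = broken.port or 80
port_fixed = fixed.port

print(port_broken, port_fixed)  # 80 5240
```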

Thanks Anton! That does indeed look like it could be the same problem I was seeing. Should I be watching the milestone pages for an expected release date, to know when those fixes might land in the production branch of 3.3.x or 3.4.x? At the moment I’m running 3.3.4-13189-g.f88272d1e.