New MAAS 3.5 deployment not adding machines

Hi,

I deployed a new MAAS Region from scratch with the new 3.5 version.

I installed from the snap, configured the VLAN and subnet, and enabled DHCP, up to the point where the documentation states that “you should be able to add, commission, etc”.

$ maas status
Service          Startup   Current   Since
agent            disabled  active    yesterday at 10:37 UTC
apiserver        enabled   active    yesterday at 10:36 UTC
bind9            disabled  active    yesterday at 17:27 UTC
dhcpd            disabled  active    yesterday at 17:14 UTC
dhcpd6           disabled  inactive  -
http             disabled  active    yesterday at 10:36 UTC
ntp              disabled  active    yesterday at 10:36 UTC
proxy            disabled  active    yesterday at 14:15 UTC
rackd            enabled   active    yesterday at 10:36 UTC
regiond          enabled   active    yesterday at 10:36 UTC
syslog           disabled  active    yesterday at 10:36 UTC
temporal         disabled  active    yesterday at 10:36 UTC
temporal-worker  disabled  active    yesterday at 10:36 UTC

I boot new servers on the VLAN, and I can see DHCP, and PXE … until:

provisioningserver.rackdservices.http: [info] /images/3c025aba630a30f7d03d6b02f4947ebc1edf80fc1dcfc861bfc9d3a3352203bc/ubuntu/amd64/ga-22.04/jammy/stable/boot-kernel requested by 10.6.200.3

Which looks to me like the image has been downloaded.

However, no machine is added, or anything.

I have no idea how to debug the issue.

The machine has AMT management and I could enable serial-over-LAN, but I am not sure how to add boot options at this stage to enable a serial console.

I don't see any obvious error message in the logs either, though I find these suspicious:

Aug 16 11:56:15 tarzan dhcpd[83901]: execute_statement argv[0] = /snap/maas/36368/usr/sbin/maas-dhcp-helper
Aug 16 11:56:15 tarzan dhcpd[83901]: execute_statement argv[1] = notify
Aug 16 11:56:15 tarzan dhcpd[83901]: execute_statement argv[2] = --action
Aug 16 11:56:15 tarzan dhcpd[83901]: execute_statement argv[3] = expiry

Aug 16 11:56:15 tarzan maas-regiond[54525]: maasserver.rpc.leases: [info] Lease update: expiry for 10.6.200.3 on ec:a8:6b:f9:85:a1 at 2024-08-16 11:56:15
Aug 16 11:56:16 tarzan maas-regiond[54340]: maasserver.region_controller: [warn] The dynamic dns update notification ‘’ is not valid. It will be dropped.

Any ideas on how to debug this?

Hi @rvallel
Hm, normally it should request boot-kernel and boot-initrd:

Example:

/images/3c025aba630a30f7d03d6b02f4947ebc1edf80fc1dcfc861bfc9d3a3352203bc/ubuntu/amd64/ga-22.04/jammy/stable/boot-kernel
/images/b2e605796fbeca4801cf393a8db53d7c8557dc03c1015121ca949a6e6681254c/ubuntu/amd64/ga-22.04/jammy/stable/boot-initrd
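
To check which of the two files a given client actually asked for, the rack HTTP log lines can be counted per client. This is only a minimal sketch: `boot_requests` is a hypothetical helper name, and it assumes the log lines have been saved to a file and end in `<path> requested by <ip>`, as in the excerpt earlier in this thread.

```shell
# Count boot-kernel / boot-initrd requests made by one client IP.
# $1 = file containing rackd HTTP log lines, $2 = client IP address.
boot_requests() {
    awk -v ip="$2" '
        # Match only lines ending in "<path> requested by <ip>".
        $NF == ip && $(NF-1) == "by" && $(NF-2) == "requested" {
            if ($(NF-3) ~ /\/boot-kernel$/) kernel++
            if ($(NF-3) ~ /\/boot-initrd$/) initrd++
        }
        END { printf "boot-kernel=%d boot-initrd=%d\n", kernel, initrd }
    ' "$1"
}

# Example (rackd.log is a placeholder file name):
#   boot_requests rackd.log 10.6.200.3
```

A kernel count above zero with an initrd count of zero would match the symptom described here: the client starts fetching the kernel and never gets as far as the initrd.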

I guess the only way to debug this would be:

  1. snap restart maas and try again
  2. Collect tcpdump and inspect what happens on the wire
  3. Try to get serial console access and see if there are any errors on the machine side.
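
For step 2, one possible capture setup is sketched below. It filters on the client's MAC address rather than its IP, since the first DHCP packets are sent before the client has an address, and it also keeps all DHCP traffic so broadcast replies are not lost. `pxe_filter` is a hypothetical helper, and the interface name and output file in the example are placeholders.

```shell
# Build a tcpdump/pcap filter for one PXE client.
# $1 = client MAC address.
pxe_filter() {
    # All frames to/from the client, plus any DHCP traffic (UDP 67/68).
    printf 'ether host %s or (udp and (port 67 or port 68))\n' "$1"
}

# Example (needs root; eth0 and pxe.pcap are placeholders):
#   sudo tcpdump -n -i eth0 -w pxe.pcap "$(pxe_filter ec:a8:6b:f9:85:a1)"
```

The resulting file can then be opened in Wireshark or a similar tool to see where the boot sequence stalls.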

I will try to capture, and see if we can learn something from the traffic

I think I have used a serial console in the past to watch the boot process; it was very useful on several occasions.

However, I don’t remember where to add the parameters. I guess they belong in:

pxelinux.cfg/mac-addr

But I am not sure how to add them.

Do you know how to get serial console at this stage?

I don't think I can change kernel parameters for the first boot, when machines are added.

How is that done?
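
One option that may help here is the MAAS global kernel parameters setting (`kernel_opts`), which can be set from the CLI. The sketch below only composes the console arguments. `console_opts` is a hypothetical helper, the profile name `admin` is an assumption, and the serial device that AMT serial-over-LAN maps to varies between machines, so `ttyS0` is only a placeholder. Whether the global setting is applied during the very first enlistment boot should be verified on your install.

```shell
# Compose kernel console arguments: keep the local display, add a serial port.
# $1 = serial device (placeholder, e.g. ttyS0), $2 = baud rate.
console_opts() {
    printf 'console=tty0 console=%s,%sn8\n' "$1" "$2"
}

# Example, assuming a logged-in MAAS CLI profile named "admin":
#   maas admin maas set-config name=kernel_opts value="$(console_opts ttyS0 115200)"
```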

I see 2 potential issues:

[ 1752.582791] audit: type=1326 audit(1725537790.812:448): auid=4294967295 uid=0 gid=0 ses=4294967295 subj=snap.maas.pebble pid=4176 comm="lshw" exe="/snap/maas/36892/usr/bin/lshw" sig=0 arch=c00000b7 syscall=33 compat=0 ip=0xffff9c387518 code=0x50000

This appears to be a confinement problem in the snap. @billwear have you seen this before?

There is also a networking issue: downloads during PXE are very slow.

I am not sure if they are related.

The target device gets stuck during PXE while downloading the kernel and image; the downloads never finish.

If I attempt to download the URL by hand from the same network segment, data comes in very slowly.

However, downloading the image from the MAAS snap server itself works fine, with good connectivity.

Capturing network traffic shows TCP issues: duplicates and retransmissions.

I have no idea what this could be about, or why it affects some traffic and not other traffic.

Any ideas?

> if I attempt downloading the URL by hand, from the same network segment data comes very very slowly.

Is that the same network segment as the machine that fails to boot?

> capturing network traffic shows TCP issues Duplicates and Resending.

Hm, to me it sounds like a network configuration issue.

Can you describe your setup and network topology?

It was indeed a network issue. Just testing with a different network adapter fixed the problem.
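
For anyone seeing similar TCP duplicates and retransmissions, a quick first check before swapping hardware is to look at the interface error counters that Linux exposes under sysfs. This is a rough sketch: `nic_errors` is a hypothetical helper, the interface name in the example is a placeholder, and not every driver populates every counter.

```shell
# Print error/drop counters for a network interface.
# $1 = interface name, $2 = optional sysfs net directory (default /sys/class/net).
nic_errors() (
    dir="${2:-/sys/class/net}/$1/statistics"
    for counter in rx_errors tx_errors rx_dropped tx_dropped rx_crc_errors; do
        # Skip counters this interface does not expose.
        if [ -r "$dir/$counter" ]; then
            printf '%s=%s\n' "$counter" "$(cat "$dir/$counter")"
        fi
    done
)

# Example (eth0 is a placeholder):
#   nic_errors eth0
```

Counters that keep climbing while a download is running suggest a problem with the adapter, cable, or switch port rather than with MAAS itself.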
