Commissioning failure

Hi. I’m hoping for some help with a commissioning problem I’m having. When I have a new machine that I would like MaaS to see, I start by PXE booting the node. It gets a DHCP address from MaaS, comes up for the first time, begins commissioning, and eventually shuts down. At this point, MaaS shows most of the Commissioning tasks as Pending (with the exception of 30-maas-01-bmc-config which is Passed). And the node stays off forever. Unsure of what to do at this point, I PXE boot it again. It boots, does nothing, and then shuts back down again, and the commissioning tasks in MaaS remain Pending.

At this point, there are two things I can do that will make it succeed, but neither of them is ideal:

  1. Eventually (I’m not sure how long - maybe 30 minutes) the Pending tasks Time Out. At this point, I can request (through “Take Action”) for it to Commission again. This will cause it to boot the machine and the commissioning will succeed.
  2. The other way is to Delete the machine from MaaS and try the whole process over again. This time it will succeed.
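For anyone facing the same thing at scale, the first workaround could in principle be scripted instead of clicked through. The following is only a rough sketch, assuming the `maas` CLI is already logged in under a profile name you pass in, and assuming timed-out machines report a `status_name` of "Failed commissioning" in the output of `maas PROFILE machines read`. The helper names (`failed_system_ids`, `commission_command`, `recommission_failed`) are made up for illustration.

```python
import json
import subprocess

# Assumption: machines whose commissioning timed out report this status_name.
FAILED_STATUS = "Failed commissioning"


def failed_system_ids(machines):
    """Return the system_id of every machine dict whose status matches FAILED_STATUS."""
    return [m["system_id"] for m in machines if m.get("status_name") == FAILED_STATUS]


def commission_command(profile, system_id):
    """Build the CLI invocation that asks MAAS to commission one machine."""
    return ["maas", profile, "machine", "commission", system_id]


def recommission_failed(profile, dry_run=True):
    """List machines via the CLI, then print (or run) a commission command for each failed one."""
    listing = subprocess.run(
        ["maas", profile, "machines", "read"],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    commands = [
        commission_command(profile, sid)
        for sid in failed_system_ids(json.loads(listing))
    ]
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))  # show what would be run without touching any machine
        else:
            subprocess.run(cmd, check=True)
    return commands
```

Calling `recommission_failed("admin")` with the default `dry_run=True` only prints the commands, which seems the safer way to check the status filter before letting it power-cycle anything. This only automates the workaround; it does not address why the first attempt gets stuck.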

Of the three hosts that I have tried to commission in the last two days, this exact same thing has happened on all of them. It tries to commission them the first time but gets stuck in Pending and eventually times out. I have hundreds more hosts that I will eventually need to bring into MaaS, so I want to find a better way around this problem. The two workarounds that I noted above will take far too long if I have to use them every single time. Any help is appreciated. Thanks.

What version of MAAS are you running? I had a similar issue once, but I can’t remember if it was a weird network issue or a bug in the version of MAAS I was running.

We’re on 3.3 so that we can deploy Rocky.

Since 3.3 is still in beta, it’s quite possible that it’s a bug in the current version. I would try installing 3.2/stable and test a deployment of Ubuntu to see if you have the same issue with your current setup. If that doesn’t work, see the reply I made to your other recent post.
I would be suspicious that it’s your network config. My understanding is that MAAS needs to be able to see your IPMI interface and your PXE interface.

Wouldn’t the fact that it works fine after the first time indicate that it can see the IPMI interface?

One would sure think so. Is there any reason that MaaS can’t be on the same vlan as your servers?

We have many, many vlans for our client machines. We won’t have enough rack controllers to put one on every vlan. As for trunking, I don’t think our network admins will allow layer 2 adjacency to every vlan. I can push that issue with them, but if there is another solution, that might be preferable.

I can see why your network/security team wouldn’t want to open up MaaS to all vlans. Would it be possible instead to give MaaS access to just one of the vlans that a node is on, to see if that makes a difference? Also, you shouldn’t need to add a separate rack controller for every vlan. It sounds like you may have a bigger setup than I do, but I am also running the region and rack controllers on one box with multiple vlans. I just had to define them under the subnet tab.

If you can’t make any network changes, then I would say your next step is to downgrade to the latest stable version and see if the issue repeats itself.

Hi @dcaunt42. Could you please give some details on how exactly you try to commission the machine? Do you add the machine first in MAAS (using “Add Hardware”), or do you just PXE boot it and then it appears in the MAAS UI? Do you have any commissioning logs in the MAAS UI (“Logs” tab on the machine details page)?

For these instances, I was PXE booting and then they were appearing in MAAS UI. Since then, I have attempted to provision them again, so the logs in the UI are for the most recent (successful) operations.

[I tried to link to the pastebin output of the Logs tab, but this Discourse won’t let me post more than one link since I’m a new user. I think the following is more useful.]

As for the last unsuccessful commissioning attempt, here is the log file that I found in /var/snap/maas/common/log/rsyslog/maas-enlisting-node/2022-11-30:

Part of the problem is in reproducing the error. Since it only happens the first time that I attempt to commission a node, I only have one shot to attempt a new solution.

The one time I tried adding the machine first in MAAS (using “Add Hardware”), I got the output shown in the screenshot that I’m attaching. I’m not sure that’s the exact same problem that’s happening when I PXE boot them without inputting IPMI information first, but the pastebin output that I linked from the failed commission does contain an error about “Failed installing package(s) for 20-maas-01-install-lldpd” so maybe it is the same problem.
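Since there is only one shot per node to catch this, it may help to search the saved rsyslog output for that message rather than reading it by eye. Below is a minimal sketch, assuming the dated path mentioned above is a directory of plain-text log files and that the message text is exactly "Failed installing package(s) for <script name>". The function names are hypothetical.

```python
import re
from pathlib import Path

# Matches the message quoted above and captures the commissioning script name.
FAILED_INSTALL = re.compile(r"Failed installing package\(s\) for ([\w-]+)")


def failed_install_scripts(lines):
    """Return the distinct script names reported as failing a package install, in first-seen order."""
    found = []
    for line in lines:
        match = FAILED_INSTALL.search(line)
        if match and match.group(1) not in found:
            found.append(match.group(1))
    return found


def scan_log_dir(directory):
    """Map each log file name in `directory` to the failing scripts found in it (files with no hits are skipped)."""
    results = {}
    for path in sorted(Path(directory).iterdir()):
        if path.is_file():
            with path.open(errors="replace") as handle:
                hits = failed_install_scripts(handle)
            if hits:
                results[path.name] = hits
    return results
```

For example, `scan_log_dir("/var/snap/maas/common/log/rsyslog/maas-enlisting-node/2022-11-30")` would list which files in that day's directory mention a failed package install, if any do.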


For anyone else who may be experiencing this problem, we were finally able to solve it. The ephemeral image that MaaS 3.3 uses by default is Ubuntu 20.04. That image has this problem during the commissioning phase (specifically, during 20-maas-01-install-lldpd):

Err:2 focal-updates/main amd64 libmysqlclient21 amd64 8.0.31-0ubuntu0.20.04.2
  404  Not Found [IP: 80]

It must have an old mirror baked into it. Downloading and using 22.04 as the ephemeral image doesn’t have this issue and allowed us to commission our nodes without the above errors.
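For reference, switching the default commissioning release can be done from the CLI as well as the UI. This is a sketch rather than an exact record of what we ran: it assumes the 22.04 ("jammy") images have already been selected and synced, that the `maas` CLI is logged in under the profile name you pass, and that the relevant setting is the `commissioning_distro_series` config key. The Python wrappers are just illustrative helpers.

```python
import subprocess


def set_commissioning_release_command(profile, series="jammy"):
    """Build the CLI call that sets the default commissioning release (jammy = Ubuntu 22.04)."""
    return [
        "maas", profile, "maas", "set-config",
        "name=commissioning_distro_series",
        f"value={series}",
    ]


def get_commissioning_release_command(profile):
    """Build the CLI call that reads back the current default commissioning release."""
    return ["maas", profile, "maas", "get-config", "name=commissioning_distro_series"]


def apply_commissioning_release(profile, series="jammy"):
    """Set the release, then return the CLI's output for the read-back so it can be checked."""
    subprocess.run(set_commissioning_release_command(profile, series), check=True)
    result = subprocess.run(
        get_commissioning_release_command(profile),
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout
```

The set call fails if the chosen release has no synced image, so it is worth confirming the image import has finished before running it.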

I’m having a similar problem. I have a packet dump file that I can provide. cc @billwear

please provide it via pastebin, if possible? if not, let’s come up with another way. cc @billwear

thanks; someone is taking a look, and i will review it later today when i have time.

@cmills could you please also provide regiond/rackd.log files (or a paste of them) from around the time the issue happened?

Okay, give me 10 minutes.

Why do you want the region logs? We have rack controllers in all datacenters and one region controller in our main datacenter. Do you want the rack logs or the region controller logs?