Test MAAS installation fails to commission any machines

Hi folks -

I’ve been trying to familiarize myself with Ubuntu’s MAAS, but I consistently encounter failures when trying to get a test environment up and running. For reference, I’m following the guide “Build a MAAS and LXD environment in 30 minutes with Multipass on Ubuntu”, and after spending several hours on this “30 minute” experiment, it still won’t work as described.

I first tried with a (fresh) install of ubuntu-18.04.6. When that didn’t work, I then installed ubuntu-20.04.6. In either case, I followed the guide linked above step-by-step, but it ultimately fails whenever I try to commission a machine:

So far, nothing terribly insightful is showing up in the system logs. I also connected to the console of the lxc instance I’m attempting to commission, and it simply shows the message “Booting under MAAS direction…” and nothing else. Clearly, the PXE boot is failing, but I’m not sure why that would be.

Wondering if anyone might happen to know what the disconnect may be, or where to look for further diagnostics? Thanks!

I managed to get a console screenshot earlier in the boot:

That IPv4(0.0.0.0…) bit looks very suspicious. Some conversation with the DHCP server must have happened because PXE got an IP address, but the boot details seem to be missing.

I think you should check the MAAS DHCP log, /var/snap/maas/common/log/dhcpd.log, for errors. Another log file you could check is /var/snap/maas/current/supervisord/supervisord.log (supervisord coordinates the various daemons that MAAS is composed of).
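If it helps, here is a minimal Python sketch for skimming those two logs for suspicious lines. The paths are the ones mentioned above; the keyword list and the helper names (`find_suspect_lines`, `scan`) are my own assumptions for illustration, not anything MAAS provides, and reading the files may need sudo.

```python
from pathlib import Path

# Log paths as given in this thread (snap-based MAAS install).
LOG_PATHS = [
    Path("/var/snap/maas/common/log/dhcpd.log"),
    Path("/var/snap/maas/current/supervisord/supervisord.log"),
]

# Hypothetical keyword list; this is a guess at what is worth flagging.
KEYWORDS = ("error", "fail", "warn", "denied", "timeout")


def find_suspect_lines(lines, keywords=KEYWORDS):
    """Return (line_number, text) pairs whose text contains any keyword (case-insensitive)."""
    hits = []
    for number, text in enumerate(lines, start=1):
        lowered = text.lower()
        if any(keyword in lowered for keyword in keywords):
            hits.append((number, text.rstrip("\n")))
    return hits


def scan(path):
    """Scan one log file; report unreadable or missing files instead of crashing."""
    try:
        with path.open(errors="replace") as handle:
            return find_suspect_lines(handle)
    except OSError as exc:
        print(f"{path}: cannot read ({exc})")
        return []


if __name__ == "__main__":
    for log_path in LOG_PATHS:
        for number, text in scan(log_path):
            print(f"{log_path}:{number}: {text}")
```

A plain substring match like this will produce false positives, so treat the output as a starting point for reading the logs rather than a diagnosis.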

Thanks. At the moment, I don’t think the IPv4(0.0.0.0…) bit is relevant. After letting it sit for the long weekend, I started digging into it again today. After forcing the lxc instance to quit and restarting it, it did actually start booting this time, but not successfully. So I deleted that machine instance and tried a new one. Again, the new instance attempted to boot, but encountered the same problem as the previous one… a looping soft lockup:

[  563.060573] watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [swapper/0:1]
[  563.064407] Modules linked in:
[  563.064407] CPU: 1 PID: 1 Comm: swapper/0 Not tainted 5.4.0-169-generic #187-Ubuntu
[  563.064407] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009)/LXD, BIOS unknown 2/2/2022
[  563.064407] RIP: 0010:smp_call_function_single+0xdc/0x110

I’m not sure what the original issue was preventing the boot from starting, but I seem to be past it and am on to the new issue. Perhaps I need to try commissioning with a different kernel version or something. I’ll poke around a bit more.
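For comparing runs across kernels, a minimal sketch like the following can pull the soft lockup reports out of saved console output. It assumes the console text has been captured to a list of lines somehow (how you capture it is up to you), and the regex is written only against the watchdog line format shown above; `find_soft_lockups` is a hypothetical helper name.

```python
import re

# Matches kernel watchdog lines of the form seen in the console output above:
#   watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [swapper/0:1]
SOFT_LOCKUP = re.compile(
    r"watchdog: BUG: soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<seconds>\d+)s! "
    r"\[(?P<task>[^\]]+):(?P<pid>\d+)\]"
)


def find_soft_lockups(lines):
    """Return one dict per soft lockup report found in the given console lines."""
    events = []
    for text in lines:
        match = SOFT_LOCKUP.search(text)
        if match:
            events.append(
                {
                    "cpu": int(match["cpu"]),
                    "seconds": int(match["seconds"]),
                    "task": match["task"],
                    "pid": int(match["pid"]),
                }
            )
    return events


# Usage with the line from the console output above.
sample = "[  563.060573] watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [swapper/0:1]"
for event in find_soft_lockups([sample]):
    print(event)
```

Counting how many reports show up per CPU and per task across attempts would at least show whether the hang is always in the same place.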

Also, thanks for the info on the log paths… I was wondering where I might find those :slight_smile:

ETA: it’s possible this was my problem all along, but I just wasn’t looking at the console at the right time.

Not sure why this thread was marked “solved”, as it’s very much not solved yet.

So far I’ve tried commissioning with 20.04 and 22.04, with failures in either case. Sometimes the boot just seems to hang at random, and other times I get the same soft lockup as noted in the previous comment, despite using a completely different kernel. At this point, I’m beginning to wonder if there’s simply something about my hardware (HPE 325 Gen 10) that MAAS doesn’t like. I poked around in the hardware BIOS a bit looking for any obvious issues. I changed the default workload profile from “General Power Efficient Compute” to “Virtualization - Max Performance”, but I don’t think that really matters much at this point. In any case, the results are the same: still failing to commission.

Hi @guzzijason, my apologies. I marked this as solved because I misinterpreted the results of your testing as having solved the issue.
Out of interest, have you had success on different hardware?

No worries @lloydwaltersj. I haven’t had a chance to test on different hardware yet, but I’m going to be working on that (hopefully) today.

@lloydwaltersj I think we can actually mark this as “solved” at this point, although the failures I was seeing are still a bit mysterious.

Ultimately, I was able to get a successful machine commissioning after re-imaging my bare-metal host with ubuntu-22.04.3-live-server-amd64. It’s not clear why both 18.04.6 and 20.04.6 experienced similar failures. I see now, however, that this particular model (HPE ProLiant DL325 Gen10) was never actually certified by Ubuntu, so perhaps that has something to do with it. The later “Gen10 Plus” and “Gen10 Plus V2” variants were certified, however. On top of that, this particular server is running with an unsupported CPU variant (AMD Naples) rather than the AMD Rome CPU it is supposed to have. So perhaps pulling this one out of our lab pool to play with MAAS on was just a terrible decision.

At any rate, with jammy on the bare-metal host, it seems to be working at the moment. Thanks!

Just to add a bit more to this… even though I got the weirdly-built HPE DL325 Gen10 server mostly working with jammy, it still wasn’t good. Things were very slow, and deploying an OS to the new lxd machine actually timed out while the OS install was in progress.

Today I switched to a Dell 7515 and had an entirely different experience. The install went smoothly, and commissioning the lxd machine and deploying an OS to it took only a few minutes each.

Moral of the story: hardware matters :slight_smile: