MAAS failing to deploy on NVMe boot device on 2.3 with CentOS (storage is unsupported in 2.3)

It looks like MAAS is not installing a kernel to the machine for deployment or rescue mode.

Commissioning works great: I am able to monitor the machine’s console, and all is well.

Deployment seems to work at first, even allowing me to ssh into the system while the deployment is running. However, on the final boot, the machine boots up to a “boot:” prompt, which generally tells me that the machine can’t find the kernel.

When I try the rescue mode, it does the same thing.

I’ve been using MAAS for two years with great success, and this problem only started yesterday. Last week I deployed 8 systems flawlessly, as usual; today I can commission, but not deploy or rescue.

This is true regardless of the version of Ubuntu/CentOS I try to install.

Interestingly when I run rescue mode I get a message:
Loading ubuntu/amd64/ga-1804/xenial/no-such-image/boot-kernel… failed: No such file or directory
boot: _
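
For comparison, the path in that message can be split into its components and checked against a path that the rack controller does serve (the working one is taken from the rackd.log excerpt further down this thread). This is only a minimal sketch in Python: the field names (os, arch, subarch, release, label, file) are my own guess at the layout, not confirmed MAAS terminology, and the helper functions are hypothetical.

```python
from typing import Dict, List

# Assumed field names, for illustration only; MAAS may name these differently.
FIELDS = ["os", "arch", "subarch", "release", "label", "file"]


def split_boot_path(path: str) -> Dict[str, str]:
    """Split a TFTP boot path into named components."""
    parts = path.strip("/").split("/")
    if len(parts) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} components, got {len(parts)}: {path}")
    return dict(zip(FIELDS, parts))


def differing_fields(a: str, b: str) -> List[str]:
    """Return the names of the components that differ between two paths."""
    pa, pb = split_boot_path(a), split_boot_path(b)
    return [name for name in FIELDS if pa[name] != pb[name]]


# Path from the rescue-mode error message above.
failing = "ubuntu/amd64/ga-1804/xenial/no-such-image/boot-kernel"
# Path that was served successfully during deployment (from rackd.log).
working = "ubuntu/amd64/ga-18.04/bionic/daily/boot-kernel"

print(differing_fields(failing, working))
```

Every component after the architecture differs, including the "no-such-image" placeholder where the working path has "daily", which suggests the rescue request was not matched to any synced image.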

This may be a side effect of another problem that I had about a month ago. During an image sync, one of the images got stuck at 12% because my ISP had an outage. Nothing I did could get the sync to restart. However, deployments were still working; they didn’t start failing until yesterday (4/23). After deployments started failing, I decided to try to fix the stuck sync. Telling MAAS to stop syncing that image did not work, but after an update which brought MAAS up to 2.4.2, I was able to deselect all of my images and re-select them to fetch fresh ones. However, that did not fix my deploy problem.

We mainly deploy CentOS 7 as our application is certified for RHEL, but when CentOS fails I can generally get Ubuntu to work… But now it seems like nothing works: every deployment or rescue results in a system that boots up to the point where it can’t find the kernel. The kernel used for commissioning and deployment does work, though, so the problem seems to be in the part of the deployment scripts where the kernel is installed, and it seems to impact everything.

There is nothing custom in my MAAS, other than simple configuration.

Thanks!

I have a few questions:

  1. Did you upgrade MAAS right before these failures?
  2. If 1 is not true, did you only update images?
  3. Have you tried to force the sync of images again and see what happens?
  4. Check your Settings page, and see for the commissioning image. What’s selected there?
  5. Are there any error messages in the logs? What does rackd.log have?

Thanks for taking the time to look at my post!

  • Did you upgrade MAAS right before these failures?
    Actually I did not. I did upgrade after the failures started, because I know from experience that sometimes new images cause new bugs until you update… And any bugs you find are often fixed by an update, so I generally update when I have a problem.

  • If 1 is not true, did you only update images?
    Great question, but no. As I was investigating, I found my images had stopped updating on March 21, after an ISP outage. The images did not update again until after the update, which was after the problem started.

  • Have you tried to force the sync of images again and see what happens?
    Yes, I deleted all of the images, then rechecked them, forcing fresh images… and I confirmed the timestamps on the images had updated.

  • Check your Settings page, and see for the commissioning image. What’s selected there?
    I /was/ commissioning with 18.04, and I tried 16.04 just for good measure.

  • Are there any error messages in the logs? What does rackd.log have?

Rackd.log on my region controller only shows the normal probe info lines. On the top of rack server:

2019-04-24 20:51:26 provisioningserver.rackdservices.tftp: [info] pxelinux.0 requested by ac:1f:6b:0f:ff:8e
2019-04-24 20:51:26 provisioningserver.rackdservices.tftp: [info] pxelinux.0 requested by ac:1f:6b:0f:ff:8e
2019-04-24 20:51:26 provisioningserver.rackdservices.tftp: [info] ldlinux.c32 requested by ac:1f:6b:0f:ff:8e
2019-04-24 20:51:26 provisioningserver.rackdservices.tftp: [info] pxelinux.cfg/01-ac-1f-6b-0f-ff-8e requested by ac:1f:6b:0f:ff:8e
2019-04-24 20:51:26 provisioningserver.rackdservices.tftp: [info] ubuntu/amd64/ga-18.04/bionic/daily/boot-kernel requested by ac:1f:6b:0f:ff:8e
2019-04-24 20:51:28 provisioningserver.rackdservices.tftp: [info] ubuntu/amd64/ga-18.04/bionic/daily/boot-initrd requested by ac:1f:6b:0f:ff:8e
2019-04-24 20:52:17 rackd: [info] 10.37.132.10 GET /images/ubuntu/amd64/ga-18.04/bionic/daily/squashfs HTTP/1.1 --> 200 OK (referrer: -; agent: Wget)
2019-04-24 20:52:35 provisioningserver.rackdservices.dhcp_probe_service: [info] Probe for external DHCP servers started on interfaces: ens160.
2019-04-24 20:52:45 provisioningserver.rackdservices.dhcp_probe_service: [info] External DHCP probe complete.
2019-04-24 20:53:07 rackd: [info] 10.37.132.10 GET /images/centos/amd64/generic/centos70/daily/root-tgz HTTP/1.1 --> 200 OK (referrer: -; agent: Wget/1.19.4 (linux-gnu))
2019-04-24 20:54:09 sstreams: [info] maas:v2:download/maas:boot:centos:amd64:generic:centos70: to_add=['20180901_02'] to_remove=[]
2019-04-24 20:54:09 sstreams: [info] maas:v2:download/maas:boot:grub-efi-signed:amd64:generic:uefi: to_add=['20190404.0'] to_remove=[]
2019-04-24 20:54:09 sstreams: [info] maas:v2:download/maas:boot:grub-efi:arm64:generic:uefi: to_add=['20190404.0'] to_remove=[]
2019-04-24 20:54:09 sstreams: [info] maas:v2:download/maas:boot:grub-ieee1275:ppc64el:generic:open-firmware: to_add=['20190403.0'] to_remove=[]
2019-04-24 20:54:09 sstreams: [info] maas:v2:download/maas:boot:pxelinux:i386:generic:pxe: to_add=['20180807.0'] to_remove=[]
2019-04-24 20:54:09 sstreams: [info] maas:v2:download/maas:boot:ubuntu:amd64:ga-16.04-lowlatency:xenial: to_add=['20190424'] to_remove=[]
2019-04-24 20:54:09 sstreams: [info] maas:v2:download/maas:boot:ubuntu:amd64:ga-16.04:xenial: to_add=['20190424'] to_remove=[]
2019-04-24 20:54:09 sstreams: [info] maas:v2:download/maas:boot:ubuntu:amd64:ga-18.04-lowlatency:bionic: to_add=['20190419'] to_remove=[]
2019-04-24 20:54:09 sstreams: [info] maas:v2:download/maas:boot:ubuntu:amd64:ga-18.04:bionic: to_add=['20190419'] to_remove=[]
2019-04-24 20:54:09 sstreams: [info] maas:v2:download/maas:boot:ubuntu:amd64:ga-19.04-lowlatency:disco: to_add=['20190420'] to_remove=[]
2019-04-24 20:54:09 sstreams: [info] maas:v2:download/maas:boot:ubuntu:amd64:ga-19.04:disco: to_add=['20190420'] to_remove=[]
2019-04-24 20:54:09 sstreams: [info] maas:v2:download/maas:boot:ubuntu:amd64:hwe-16.04-edge:xenial: to_add=['20190424'] to_remove=[]
2019-04-24 20:54:09 sstreams: [info] maas:v2:download/maas:boot:ubuntu:amd64:hwe-16.04-lowlatency-edge:xenial: to_add=['20190424'] to_remove=[]
2019-04-24 20:54:09 sstreams: [info] maas:v2:download/maas:boot:ubuntu:amd64:hwe-16.04-lowlatency:xenial: to_add=['20190424'] to_remove=[]
2019-04-24 20:54:09 sstreams: [info] maas:v2:download/maas:boot:ubuntu:amd64:hwe-16.04:xenial: to_add=['20190424'] to_remove=[]
2019-04-24 20:54:09 sstreams: [info] maas:v2:download/maas:boot:ubuntu:amd64:hwe-18.04-edge:bionic: to_add=['20190419'] to_remove=[]
2019-04-24 20:54:09 sstreams: [info] maas:v2:download/maas:boot:ubuntu:amd64:hwe-18.04-lowlatency-edge:bionic: to_add=['20190419'] to_remove=[]
2019-04-24 20:54:09 sstreams: [info] maas:v2:download/maas:boot:ubuntu:amd64:hwe-18.04-lowlatency:bionic: to_add=['20190419'] to_remove=[]
2019-04-24 20:54:09 sstreams: [info] maas:v2:download/maas:boot:ubuntu:amd64:hwe-18.04:bionic: to_add=['20190419'] to_remove=[]
2019-04-24 20:54:09 sstreams: [info] maas:v2:download/maas:boot:ubuntu:amd64:hwe-p:precise: to_add=['20170424.1'] to_remove=[]
2019-04-24 20:54:09 sstreams: [info] maas:v2:download/maas:boot:ubuntu:amd64:hwe-q:precise: to_add=['20170424'] to_remove=[]
2019-04-24 20:54:09 sstreams: [info] maas:v2:download/maas:boot:ubuntu:amd64:hwe-r:precise: to_add=['20170424'] to_remove=[]
2019-04-24 20:54:09 sstreams: [info] maas:v2:download/maas:boot:ubuntu:amd64:hwe-s:precise: to_add=['20170424'] to_remove=[]
2019-04-24 20:54:09 sstreams: [info] maas:v2:download/maas:boot:ubuntu:amd64:hwe-t:precise: to_add=['20170424'] to_remove=[]
2019-04-24 20:54:09 sstreams: [info] maas:v2:download/maas:boot:ubuntu:amd64:hwe-t:trusty: to_add=['20190409'] to_remove=[]
2019-04-24 20:54:09 sstreams: [info] maas:v2:download/maas:boot:ubuntu:amd64:hwe-u:trusty: to_add=['20190409'] to_remove=[]
2019-04-24 20:54:09 sstreams: [info] maas:v2:download/maas:boot:ubuntu:amd64:hwe-v:trusty: to_add=['20190409'] to_remove=[]
2019-04-24 20:54:09 sstreams: [info] maas:v2:download/maas:boot:ubuntu:amd64:hwe-w:trusty: to_add=['20190409'] to_remove=[]
2019-04-24 20:54:09 sstreams: [info] maas:v2:download/maas:boot:ubuntu:amd64:hwe-x-lowlatency:trusty: to_add=['20190409'] to_remove=[]
2019-04-24 20:54:09 sstreams: [info] maas:v2:download/maas:boot:ubuntu:amd64:hwe-x:trusty: to_add=['20190409'] to_remove=[]
2019-04-24 20:54:32 provisioningserver.rackdservices.tftp: [info] pxelinux.0 requested by ac:1f:6b:0f:ff:8e
2019-04-24 20:54:32 provisioningserver.rackdservices.tftp: [info] pxelinux.0 requested by ac:1f:6b:0f:ff:8e
2019-04-24 20:54:32 provisioningserver.rackdservices.tftp: [info] ldlinux.c32 requested by ac:1f:6b:0f:ff:8e
2019-04-24 20:54:32 provisioningserver.rackdservices.tftp: [info] pxelinux.cfg/01-ac-1f-6b-0f-ff-8e requested by ac:1f:6b:0f:ff:8e
2019-04-24 20:54:32 provisioningserver.rackdservices.tftp: [info] chain.c32 requested by ac:1f:6b:0f:ff:8e
2019-04-24 20:54:32 provisioningserver.rackdservices.tftp: [info] libcom32.c32 requested by ac:1f:6b:0f:ff:8e
2019-04-24 20:54:32 provisioningserver.rackdservices.tftp: [info] libutil.c32 requested by ac:1f:6b:0f:ff:8e

Of course, that was CentOS 7.0, ending with the “boot:” problem once it tried to boot from the local disk.
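
To make the two boots in that log easier to tell apart, here is a rough Python sketch that groups the TFTP requests into boot sessions per MAC address and labels each one. It assumes the log line format shown above, treats a fresh pxelinux.0 request as the start of a new boot, and assumes that a request for chain.c32 (the Syslinux chainloading module) means the machine was told to boot from its local disk. The function names and labels are hypothetical, not part of MAAS.

```python
import re

# Matches the TFTP request lines in rackd.log as they appear in this thread.
TFTP_LINE = re.compile(
    r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2} "
    r"provisioningserver\.rackdservices\.tftp: \[info\] "
    r"(?P<file>\S+) requested by (?P<mac>[0-9a-f:]{17})$"
)


def boot_sessions(lines):
    """Return a list of (mac, [files]) tuples, one per PXE boot attempt."""
    sessions = []
    current = {}  # mac -> index of that machine's latest session
    for line in lines:
        m = TFTP_LINE.match(line.strip())
        if not m:
            continue  # skip non-TFTP lines (sstreams, DHCP probe, HTTP)
        mac, name = m["mac"], m["file"]
        idx = current.get(mac)
        # A pxelinux.0 request after other files marks a new boot attempt.
        new_boot = name == "pxelinux.0" and (
            idx is None or sessions[idx][1][-1] != "pxelinux.0"
        )
        if idx is None or new_boot:
            sessions.append((mac, []))
            idx = current[mac] = len(sessions) - 1
        sessions[idx][1].append(name)
    return sessions


def classify(files):
    """Label a session as 'network', 'local', or 'unknown' (assumed heuristic)."""
    if any(f.endswith("/boot-kernel") for f in files):
        return "network"  # kernel fetched over TFTP
    if "chain.c32" in files:
        return "local"  # chainload to the local disk
    return "unknown"


SAMPLE = """\
2019-04-24 20:51:26 provisioningserver.rackdservices.tftp: [info] pxelinux.0 requested by ac:1f:6b:0f:ff:8e
2019-04-24 20:51:26 provisioningserver.rackdservices.tftp: [info] ubuntu/amd64/ga-18.04/bionic/daily/boot-kernel requested by ac:1f:6b:0f:ff:8e
2019-04-24 20:52:45 provisioningserver.rackdservices.dhcp_probe_service: [info] External DHCP probe complete.
2019-04-24 20:54:32 provisioningserver.rackdservices.tftp: [info] pxelinux.0 requested by ac:1f:6b:0f:ff:8e
2019-04-24 20:54:32 provisioningserver.rackdservices.tftp: [info] chain.c32 requested by ac:1f:6b:0f:ff:8e
""".splitlines()

for mac, files in boot_sessions(SAMPLE):
    print(mac, classify(files), files)
```

On the excerpt above this gives one network boot (the deployment environment) followed by one local boot, which is the boot that ends at the “boot:” prompt.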

I’ll see if I still have a log from a successful deployment, to see if I can spot a difference…

Shoot! The top of the log file starts a day and a half after the last successful deployment…

Is there a minimum kernel set for the machine in the machine details page? What’s the setting for ‘Commissioning’ in the general settings? Ubuntu + what?

I have tried several values for minimum kernel. But until the problem occurred, I had it set to the default, “no minimum”.

In general I left most of the settings alone. I have two MAAS systems. One was installed in the last 6 months as an 18.04 system (this is the one causing the issues). The other is my original system, a combined region and top-of-rack controller installed together on 16.04. I don’t have any free systems on the 16.04 (MAAS 2.3.5) system, but I can move a system over to that stack just for a test, in case you think it is a hardware issue.

All of our systems are mostly identical, though purchased 20 to 40 at a time, so each batch may have slight variations in motherboard/CPU/memory. They are all SuperMicro systems with dual CPUs, 128 GB or 256 GB of memory, and up to 8 SSDs. The system I’m installing to is slightly odd, as it is the first one based on NVMe SSDs instead of SATA, and it has just one drive. So it is a change in that it is a bit faster, but it is still very much a standard commodity system from a big name in cloud servers. I doubt that is the problem, as it commissions without any trouble. But it IS something new, the last thing that changed, and the last thing that changed is where you look first for a problem.

I’ll move the new server to the 16.04/2.3.5 system, just to see what happens there. I’ll also see if someone can free up a machine so that I can test a machine known to be working before… This is the first of a new rack of servers, so if it is a hardware problem, I’m in for a world of hurt when the rest of my order starts to arrive.

Well, I didn’t expect this, but under 2.3.5 it does the same thing. I really expected it to be a new problem with the MAAS stack, but instead it seems the issue is something subtle with the hardware. Commissioning works fine. How could commissioning work and not deployment?

I have a theory as to the nature of the problem… This system is new in that it does not use a hard disk or SATA SSD; it uses a 1TB NVMe drive. This is intended to be a high-speed database system, so the engineers asked for NVMe.

I will add a hard disk configured as the boot device and see if that solves it. A co-worker theorizes that there is a configuration problem in the bootloader, which does not handle NVMe.

Worst case, since this is a dedicated system, we can just manually configure it and not use MAAS for this one system. I’m pretty sure the spec on the rest of the order is SATA SSD; I have to go look at that next… Fingers crossed… Shoulda really crossed the fingers: the whole order is NVMe… If the hard disk solves the problem I’ll need to order 30 hard drives… Rassin frazzin ig dig dagnabbit!

MAAS insists the NVMe drive is the drive for the root partition. Setting the SATA drive as the boot device failed to convince MAAS 2.3.5 to accept the configuration. I’ve opened a ticket with SuperMicro to see if there is a way for the NVMe drive to show up as the second drive. I will move the system back to my newer MAAS stack in case the UI has a solution there.

Solution:

NVMe firmware source: AMI Native Support
NVMe option ROM: Legacy
AOC-URN2-i2XS NVMe1 OPROM: Legacy

The MAAS storage UI will only deploy a root filesystem for CentOS, so multi-disk configurations that work for Ubuntu did not work for CentOS.

Support for CentOS storage configuration is only available from MAAS 2.5 onward. See “MAAS 2.5.0 beta 1 released”.

Good to know! Thanks!