PXE not working after upgrading to 2.5.0


#1

Hi all,

I installed MAAS yesterday from the default package repos; apt installed v2.4.0.
I configured it and added a server for testing, and everything worked fine.
Later I added the MAAS PPA to update to 2.5.0, and since then PXE booting no longer works.

It always hangs after loading lpxelinux:

After the Boot failed message nothing happens anymore.
In the maas rackd.log I see this:

2019-02-21 13:18:59 provisioningserver.rackdservices.dhcp_probe_service: [info] Probe for external DHCP servers started on interfaces: ens18.
2019-02-21 13:19:09 provisioningserver.rackdservices.dhcp_probe_service: [info] External DHCP probe complete.
2019-02-21 13:28:59 provisioningserver.rackdservices.dhcp_probe_service: [info] Probe for external DHCP servers started on interfaces: ens18.
2019-02-21 13:29:09 provisioningserver.rackdservices.dhcp_probe_service: [info] External DHCP probe complete.
2019-02-21 13:29:43 provisioningserver.rackdservices.tftp: [info] lpxelinux.0 requested by 10...91
2019-02-21 13:29:43 provisioningserver.rackdservices.tftp: [info] lpxelinux.0 requested by 10...91

To me it looks like the request from the machine to load ldlinux.c32 is never received by the rack controller.
If I’m not mistaken, lpxelinux has to load a config file first, which tells it what to do before it loads anything else? Unfortunately I couldn’t find where to check that.
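For anyone wanting to check the same thing in their own rackd.log, here is a rough sketch that pulls out the TFTP request lines and reports whether the bootloader was requested without ldlinux.c32 ever following. The regex is based only on the log lines pasted above; the function names are made up for illustration, and I am assuming a request for ldlinux.c32 would be logged in the same format.

```python
import re

# Matches rackd.log TFTP lines in the format shown in the post above, e.g.
#   2019-02-21 13:29:43 provisioningserver.rackdservices.tftp: [info] lpxelinux.0 requested by 10...91
TFTP_LINE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"provisioningserver\.rackdservices\.tftp: \[info\] "
    r"(?P<file>\S+) requested by (?P<client>\S+)$"
)


def tftp_requests(lines):
    """Return (timestamp, filename, client) for every TFTP request line."""
    found = []
    for line in lines:
        m = TFTP_LINE.match(line.strip())
        if m:
            found.append((m["ts"], m["file"], m["client"]))
    return found


def stalled_after_bootloader(lines, bootloader="lpxelinux.0", next_file="ldlinux.c32"):
    """True if the bootloader was requested but the follow-up file never was.

    Assumption: the follow-up request would appear in the same log format.
    """
    files = {name for _, name, _ in tftp_requests(lines)}
    return bootloader in files and next_file not in files
```

You would feed it the lines of the log file, for example `stalled_after_bootloader(open(path))`, where `path` points at your rackd.log.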

I already tried:

  • rebooting server
  • dpkg-reconfigure the rack controller
  • maas boot-resources import
  • completely removing/deleting the node from MAAS and re-adding it
  • removing the boot resources folder in /var/lib/maas and restarting the server to force a redownload / recreation of all boot resources

Maybe someone is able to assist with this problem, thanks in advance.


#2

Further troubleshooting results:

  1. I booted a live CD on the server and connected via TFTP to the rack controller.
    Looks fine so far (I think):

  2. I compared the logs from before upgrading to 2.5.0 and after upgrading to 2.5.0 and noted:
    v2.4.0:

2019-02-20 13:44:51 provisioningserver.rackdservices.tftp: [info] pxelinux.0 requested by 9c:b6:54:99:12:f0
2019-02-20 13:44:51 provisioningserver.rackdservices.tftp: [info] pxelinux.0 requested by 9c:b6:54:99:12:f0

    v2.5.0:

2019-02-21 15:22:50 provisioningserver.rackdservices.tftp: [info] lpxelinux.0 requested by 10...91
2019-02-21 15:22:50 provisioningserver.rackdservices.tftp: [info] lpxelinux.0 requested by 10...91

On 2.4.0, pxelinux.0 is used instead of lpxelinux.0.
I edited the dhcpd.conf in /var/lib/maas to use pxelinux.0 and restarted maas-dhcpd.
The machine then loaded pxelinux.0, but the result was the same as in the image in my last post. After that I reverted dhcpd.conf to its original state.
You can also see that on 2.4.0 the log shows a MAC address after “requested by”, whereas on 2.5.0 it shows an IP address.
Does this mean anything, or is it just a change to the log message?
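If you want to compare a larger set of log lines across versions, a small helper like the one below can label the value after “requested by” as a MAC or an IP address. This is only a sketch: `requester_kind` is a name I made up, and a redacted value such as the `10...91` shown above will come back as "unknown" because it is not a complete address.

```python
import ipaddress
import re

# Six colon-separated hex octets, e.g. 9c:b6:54:99:12:f0
MAC_RE = re.compile(r"^[0-9a-fA-F]{2}(:[0-9a-fA-F]{2}){5}$")


def requester_kind(value):
    """Classify the 'requested by' field of a TFTP log line."""
    if MAC_RE.match(value):
        return "mac"
    try:
        ipaddress.ip_address(value)  # accepts IPv4 and IPv6 literals
        return "ip"
    except ValueError:
        return "unknown"
```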


#3

It looks like only my test machine is affected by this problem.
I set another machine to PXE boot, and it booted fine.


#4

I tried a lot more now.
I did a tcpdump on the rack controller and captured the dhcp request and the tftp transfer of lpxelinux.0.
I was also able to verify the hashes of lpxelinux.0 file on the disk of the rack controller and the udp datagrams of the tftp transfer.
It is also interesting that the rack controller never receives a TFTP request for ldlinux.c32.
The server just downloads lpxelinux.0 and nothing happens anymore.
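In case it helps someone repeat the hash check, this is roughly how the comparison can be done once the UDP payloads of the transfer have been exported from the capture. It is a sketch under a few assumptions: the payloads are already available as a list of bytes objects (how you export them from tcpdump/Wireshark is up to you), the function names are hypothetical, and the file is small enough that TFTP block numbers do not wrap around.

```python
import hashlib
import struct


def reassemble_tftp_data(payloads):
    """Join TFTP DATA packets (opcode 3) into the transferred file bytes.

    Each DATA packet is: 2-byte opcode, 2-byte block number, then the data.
    Retransmitted blocks are ignored; non-DATA packets (e.g. ACKs) are skipped.
    Assumes block numbers do not wrap, which holds for small files.
    """
    blocks = {}
    for payload in payloads:
        if len(payload) < 4:
            continue
        opcode, block_no = struct.unpack("!HH", payload[:4])
        if opcode == 3:
            blocks.setdefault(block_no, payload[4:])
    return b"".join(blocks[n] for n in sorted(blocks))


def sha256_hex(data):
    """Hex SHA-256 digest of a bytes object."""
    return hashlib.sha256(data).hexdigest()
```

Comparing `sha256_hex(reassemble_tftp_data(payloads))` with `sha256_hex(open(path, "rb").read())` for the file on the rack controller then tells you whether the bytes on the wire match the bytes on disk.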

After restarting the machine many, many times, it worked once. On the next try, however, I had the same problem again.

If I read through my previous findings, I would actually say that the problem is caused by the machine.
However, this does not fit with the fact that the problem has occurred since the update to 2.5.0 and that it worked perfectly before.


#5

I downgraded back to 2.4.2 and it’s working again.


#6

I’m also having issues with some 10Gb NICs on 2.5, so I went back to 2.4; there are too many bugs in 2.5. Here’s my original post: MAAS 2.5 won’t PXE Qlogic 10G nic HP model# NC523SFP


#7

I have the exact same error, however, my MAAS version is 2.4.2.


#8

I’m seeing this with MAAS 2.6 Beta 4 (installed via apt) but only on one box (a Dell R900).


#9

I ended up replacing my 10G NICs.


#10

I guess this might have to do with the card in this box (bnx2) being unsupported by iPXE (although I thought MAAS doesn’t actually use iPXE and uses Pxelinux instead, no?)


#11

I wonder if this could be related to Pxelinux: I am able to successfully netboot the troublesome box using a separate PXE server running Pxelinux version 4.06. Apparently, from version 5.0 onwards some file dependencies were introduced for pxelinux.0 and the c32 modules (and since version 6.0 they are also arch specific).

Nosing around in /var/lib/maas/boot-resources/current/bootloader/, there do seem to be several versions of the boot resources, but they differ from the docs (or at least appear to).

This may be something that’s already handled within the MAAS code, but I haven’t delved in to see. I’m posting this here in case it hadn’t been considered :)
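A quick way to check the dependency point is to see whether ldlinux.c32 sits next to the bootloader in whichever directory is actually served. The sketch below takes the directory as a parameter because I don’t know the exact layout under the bootloader folder; `missing_bootloader_files` is a made-up helper name, and the default file list just reflects the two files discussed in this thread.

```python
from pathlib import Path


def missing_bootloader_files(directory, required=("lpxelinux.0", "ldlinux.c32")):
    """Return the names from `required` that are not regular files in `directory`.

    Symlinks are followed, so a link to an existing file counts as present.
    """
    root = Path(directory)
    return [name for name in required if not (root / name).is_file()]
```

An empty list means both files are present; anything returned is a file the bootloader may try to load and fail to find.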


#12

After some more digging, it does seem like the root of the problem is in the bnx2 NIC support (or lack thereof)…


#13

Would either of the following work?:

1: With my (working) external PXE server, set MAAS as next-server for commissioning and deployment, or does the machine need to PXE boot from MAAS directly?

2: Could I replace the pxelinux.0 that MAAS uses with my older version, or with a new version that has the bnx2 driver compiled in? (And probably replace the initrd also)?


#14

I have the same issue; my version is 2.5.3. I built the same lab on another server with 2.4.x and it works fine!


#15

Hi,
I tried upgrading from 2.5 to 2.6 as suggested here, and that resolved the issue.
But before I set up my lab, I’ll wait until 2.6 reaches the stable PPA.


#16

2.6 hasn’t fixed it for me. I think in my case the issue is that the firmware for the bnx2 is not distributable, as it’s under a proprietary license.

I found an old article describing a workaround here that adds the bnx2 firmware to the initrd so that PXE booting works. It’s not clear to me which initrd MAAS uses for PXE boot. Is it /var/lib/maas/boot-resources/current/boot-initrd? Would this approach still work, even though the advice is 10 years old?

Here’s how my boot looks right now:

[Screenshot hosted on Imgur]


#17

More head scratching here. I managed to get a second NIC (a modern Intel NIC) installed in one box to test, and I still see the same issue.


#18

Right, I have one of these boxes in front of me now, and am able to reproduce this issue 100% of the time. It seems like some other network issue may be causing this: I’ve installed a test MAAS region/rack on bare metal on a simple LAN, and the issue still occurs as described above, regardless of whether I boot from the bnx2 card or from the second NIC I’ve installed for testing (an Intel 82574L card). I guess the next step is to run some Wireshark captures and see what’s actually happening.


#19

I wasn’t able to glean much from Wireshark. 2.4.x successfully commissions/deploys the servers however.

I need pod deployment for those machines to complete my environment buildout. Is it possible to restore a 2.5.x package to a ppa please? I assume some iteration on 2.6 is in the works to improve the compatibility regression that was introduced, but I’d really appreciate an interim solution that I might get from a 2.5.x version.


#20

The difference between MAAS 2.4 and MAAS 2.5/2.6 is that 2.4 uses TFTP for everything, whereas in 2.5/2.6 the PXE process for legacy systems (which seems to be the case here) leverages HTTP boot. That means lpxelinux.0 is downloaded over TFTP, and the kernel and initrd are then fetched over HTTP.
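Since the second stage depends on HTTP, one simple sanity check is whether the rack controller’s HTTP port accepts TCP connections from the machine’s network at all. Below is a minimal sketch; `port_open` is just an illustrative helper, and the host and port are whatever applies to your setup (I believe the rack’s HTTP service listens on 5248 in 2.5+, but treat that as an assumption and verify it on your install).

```python
import socket


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

Note that this only shows the port is reachable from wherever you run it; it says nothing about whether the PXE firmware on the machine manages to bring up its own network stack for the HTTP stage.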

This was introduced in MAAS 2.5. What’s interesting here is that MAAS 2.6 has not made any changes in this respect for legacy systems. The changes that MAAS 2.6 has introduced all relate to:

  1. KVM VMs will now use iPXE
  2. For EFI machines and arm64 with older firmware, we do something similar in which grub is downloaded over TFTP and the rest over HTTP.
  3. For EFI machines with newer firmware, it supports full HTTP boot.
  4. Changes in the DHCP config to support all of the above.

So I’m surprised this won’t work on 2.6 when it did work on 2.5. While the machine is PXE booting, can you paste the output of rackd.log?