Problem with PXE 2.1 v6.0.40

phrancesco · 15 September 2020 09:42

Hi,

I’m experiencing a problem with MAAS 2.8.2 and PXE boot.
The box (HP BL460c G6) is pretty old, but it seems to work correctly with a standard TFTP server and a CentOS 7.7 image.
In this case I used a pure TFTP deploy, no HTTP at all.

In MAAS this seems to be difficult, at least for me.

Here’s what I noticed:

Standard PXE boot switches to HTTP after a couple of TFTP transactions, and everything gets stuck at a random point after a few seconds.
Looking in /var/lib/maas/boot-resources/current, it seems that pxelinux.0 is never used because of the symlink to lpxelinux.0 [Xenial and Bionic Beaver].
Changing the symlinks so that both point to pxelinux.0 means that ldlinux.c32 is not available during PXE boot.
It’s not clear how to get a clean TFTP server log, even with rackd in debug mode.
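
For anyone who wants to check the same symlinks on their own rack controller, here is a minimal sketch. The function name is hypothetical, and the three file names are simply the ones mentioned in this post; other releases may ship a different set.

```shell
#!/bin/sh
# Report what each bootloader name in a boot-resources directory resolves to.
# The default directory is the one mentioned in this post; pass another as $1.
show_bootloader_links() {
    dir="${1:-/var/lib/maas/boot-resources/current}"
    for name in pxelinux.0 lpxelinux.0 ldlinux.c32; do
        path="$dir/$name"
        if [ -L "$path" ]; then
            # Symlink: print its immediate target
            printf '%s -> %s\n' "$name" "$(readlink "$path")"
        elif [ -e "$path" ]; then
            printf '%s (not a symlink)\n' "$name"
        else
            printf '%s (missing)\n' "$name"
        fi
    done
}
```

Calling show_bootloader_links with no argument inspects the default MAAS directory.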

So here are my questions:

Is there a way to force legacy PXE boot in MAAS, with no HTTP involved?
How can I enable the TFTP server log? It looks like it is part of rackd.

Found these bugs:

This stuff is really similar to my problem… but looks like there’s no solution yet.

Any help will be really appreciated.

BR
Francesco

ltrager · 15 September 2020 20:22

MAAS has always strived to be the fastest bare metal installer that exists. We discovered a few years ago that one area of slowdown was booting. TFTP is a much slower protocol than HTTP no matter what server you use. We decided the best way to improve performance was to move as much of the boot process over to HTTP as possible. PXELinux allows us to do that. The system firmware requests the bootloader via TFTP, and then PXELinux takes over and uses HTTP for the rest of the process.

There is no supported way to go back to full TFTP. The following would have to be done:

When dhcpd.conf is written, the path-prefix shouldn’t be given; it tells PXELinux to get everything over HTTP.
The rendered pxelinux.cfg would need to be modified to use TFTP again.

TFTP/HTTP requests are logged in a couple of places:

/var/log/maas/rackd.log
/var/log/maas/http/*.log
As a node event
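
As a rough sketch of how to pull just the TFTP activity out of the first of those logs: the helper below is hypothetical, and it assumes the relevant rackd.log entries contain the string "tftp", which may not hold for every MAAS version.

```shell
#!/bin/sh
# Print only the lines of a log file that mention TFTP (case-insensitive).
# Assumption: the interesting rackd.log entries contain the string "tftp".
tftp_lines() {
    log="${1:-/var/log/maas/rackd.log}"
    grep -i 'tftp' "$log"
}
```

Run it with sudo if the log is not readable by your user, or pass a copy of the log as the first argument.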

Two things you could try to fix this:

Update the system’s firmware
Try using lpxelinux from Focal. We currently use all bootloaders from Bionic and are planning on making this switch soon but haven’t had time to fully test all bootloaders.

Feel free to file a new bug on this. Please include all MAAS logs as well as the hardware you are experiencing this problem with.

phrancesco · 17 September 2020 09:12

Thanks for your kind and complete reply.

Here are some considerations.

Changing the whole dhcpd.conf to drop the path prefix didn’t give me a full TFTP boot; that is probably my fault, since I’m not that experienced with DHCP snippets.
The NIC and box firmware are up to date.
Focal is not working either.

So I’m curious whether there is a way to make these old boxes work with MAAS 2.8.2.
It seems the problem may be the same on G7 and G8; for us this means roughly 100 blades, which could really compromise MAAS’s future in our infrastructure…

BR
Francesco

seffyroff · 17 September 2020 16:19

I had many PXE issues after the HTTP switchover happened with MAAS initially, and I am also using some older hardware in my racks. However I was able to pretty much overcome all of them by changing the network configuration to be simpler and more in line with what MAAS would expect. I think TFTP based PXE is less fussy than HTTP, and making small changes like having a rack controller on the same L2 space as the racks in question was a big step towards resolving this myself. I am afraid I don’t recall specifics but I hope you can take some assurance that there’s likely a solution to the issue you’re facing by reworking your deployment placements.

phrancesco · 21 September 2020 07:41

Thanks,

Unfortunately the network configuration is already as you described; the rack controller is on the same subnet as the box.

Best regards

knaledge · 21 September 2020 14:18

@phrancesco - You may find some success by piecing together what you are able from this thread about PXE-booting UEFI

In your case (and with a healthy dose of willingness-to-experiment - even if it’s not your “ideal”), the major deviations would likely be:

Ensure that the enlist-able machine(s) you want to commission are, in fact, set to boot legacy BIOS
Modify the dnsmasq.conf entries to map to each enlist-able machine’s architecture (RFC4578) (see this comment for background/more info; you’re likely fine with arch,7 and arch,9)
Modify the dnsmasq.conf entry for dhcp-boot to be lpxelinux.0 (the “legacy” version of PXE boot loader - i.e. non-UEFI version)
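
For reference, the client architecture types defined in RFC 4578 can be sketched as a small lookup. Legacy BIOS clients report type 0, while 7 and 9 are EFI types, so a BIOS-only machine would be matched on 0. Only a subset of the codes is shown, and the function name is hypothetical.

```shell
#!/bin/sh
# Map a DHCP option 93 (client system architecture) code to its RFC 4578 name.
# Only the codes relevant to this thread are listed.
arch_name() {
    case "$1" in
        0) echo "Intel x86PC (legacy BIOS)" ;;
        6) echo "EFI IA32" ;;
        7) echo "EFI BC" ;;
        9) echo "EFI x86-64" ;;
        *) echo "other/unknown" ;;
    esac
}
```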

That said, if your enlist-able machine(s) are capable of supporting UEFI, definitely give that a go since things are made much smoother (and faster) by doing so - especially if you modify the UEFI-BIOS setting for boot order to be PXE HTTP. That said, you mentioned them being “old” machines, so I outlined the above with their likely lack of UEFI support in mind.

As for logging (namely to watch MAAS attempt to PXE-boot your enlist-able machines), simply log in to your MAAS host (via SSH) and then follow along:

If you installed via snap:

sudo tail -f /var/snap/maas/common/log/rackd.log | grep "provisioningserver\|sstreams"

If you installed via apt:

sudo tail -f /var/log/maas/rackd.log

phrancesco · 23 September 2020 10:05

Thanks, I’ll take a close look at this.

In the meantime I found this…

Just found this in the syslinux wiki:

Broadcom 57711
HP Proliant BL460c G6 servers with Broadcom BCM 57711 10Gbit NICs were reported to have an issue with gpxelinux.0 (gPXE + PXELINUX) as of v4.02. The workaround (implemented in v4.04 as gpxe/gpxelinuxk.0; emphasis on the single “k”) was to:

Edit gpxe/Makefile to use undionly.kpxe instead of undionly.kkpxe and;
Edit gpxe/pxelinux.gpxe to change “set use-cached 1” to “set use-cached 0” (as “set use-cached 1” is not supported with .kpxe image).
http://www.mail-archive.com/gpxe@etherboot.org/msg01002.html
http://www.syslinux.org/archives/2010-October/015782.html

They are talking about the exact model of the Server and the NIC I’m using…

https://wiki.syslinux.org/wiki/index.php?title=Hardware_Compatibility

phrancesco · 23 September 2020 15:40

Dear All,

I found a working solution.

So, a small recap:

HP Proliant BL460c G6 servers with Broadcom BCM 57711 10Gbit NIC can’t PXE boot using lpxelinux or pxelinux, using 16.04 LTS, 18.04 LTS or 20.04 LTS.

The solution I found after looking into old gPXE posts is:

wget http://boot.ipxe.org/undionly.kpxe
mv undionly.kpxe /var/lib/maas/boot-resources/current/

then use a DHCP snippet keyed on the MAC address of the old box:

host oldGen6Box {
    fixed-address 10.102.XX.YY;
    hardware ethernet D8:D3:85:AA:BB:ZZ;
    server-name "10.102.XX.ZZ";
    if exists user-class and option user-class = "iPXE" {
        filename "http://10.102.XX.ZZ:5248/ipxe.cfg";
    } else {
        filename "undionly.kpxe";
    }
}
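
To spell out why the snippet has two branches: the machine asks DHCP for a boot file twice, first from the NIC's PXE ROM and then again from iPXE once undionly.kpxe is running, and only the second request carries the "iPXE" user-class. The sketch below mirrors that decision in shell purely as an illustration. The function name is hypothetical and 192.0.2.10 is a placeholder for the rack controller address.

```shell
#!/bin/sh
# Mirror the if/else in the DHCP snippet: choose the boot filename from the
# user-class the client sent. rack_ip stands in for the 10.102.XX.ZZ address.
boot_filename() {
    user_class="$1"
    rack_ip="$2"
    if [ "$user_class" = "iPXE" ]; then
        # Second request: iPXE is running and can fetch its config over HTTP
        printf 'http://%s:5248/ipxe.cfg\n' "$rack_ip"
    else
        # First request: plain PXE ROM, hand it the iPXE bootloader over TFTP
        printf 'undionly.kpxe\n'
    fi
}

# First request (no iPXE user-class), then the request made by iPXE itself.
boot_filename "" "192.0.2.10"
boot_filename "iPXE" "192.0.2.10"
```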

Using 18.04 LTS I can commission and deploy CentOS 8.1 with MAAS 2.8.2.
16.04 LTS hits a kernel panic during commissioning.
20.04 LTS can commission but loops forever during deploy (maybe I need to test it again… it looks odd).

Basically I changed the bootloader to support that odd card (Broadcom BCM 57711) and then switched back to the MAAS iPXE workflow.

Does this make sense?
Has anyone considered the idea of supporting an “undionly” PXE device?

The machines use IPMI LAN 2.0 with the legacy boot option.

BR
Francesco
