Enlistment port restriction

Hello everyone,
I hope you’re doing fine!

We have this weird issue here which we can’t seem to sort out by ourselves…

When we try to enlist this type of machine through its admin Ethernet port, it enlists, but when we try through another available Ethernet port on an embedded Ethernet switch, it does not.
It boots correctly via DHCP/PXE/BOOTP/TFTP… but fails on 30-maas-01-bmc-config, 40-maas-01-machine-resources, and 50-maas-01-commissioning.

Are there any restrictions on port usage for enlistment?
Could something in those scripts cut the network the machine is sitting on?

In the end, the machine appears in the machine list of the MAAS (3.1) interface and is correctly powered off, but with “failed commissioning” and no hardware info.

Those machines have two IPMI/(Open)BMC Ethernet interfaces (paths):
one on the Management port, and one accessible from any other port we try to enlist from (to avoid a connection on the Management port):
https://www.kontron.com/en/products/me1210-high-performance-vran-mec-platform/p160518

Thanks in advance,
Have a nice day,
Best Regards,
Mickaël.

hi, @mkl1,

We need to know the version and build (and packaging format) that you’re running. I think you said Version 3.1, but we need additional information. We may need even more information, but won’t know that until we have the following items.

If you’re using a snap

If you’re using a snap, execute snap list maas at the command line, which will return some lines like this:

Name  Version                       Rev    Tracking     Publisher   Notes
maas  3.0.0~beta2-9796-g.2182ab55f  13292  latest/edge  canonical✓  -

Just send us the results of that command.

If you’re using a Debian package

If you’re using a deb, execute apt list maas at the command line, and tell us what it returns.
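
For anyone unsure which packaging format is in use, the two checks above can be combined into one small script. This is only a sketch: the function name maas_packaging is made up for illustration, and it assumes the package and snap are both simply called maas.

```shell
#!/bin/sh
# Print "snap", "deb", or "none" depending on how MAAS appears to be installed.
# maas_packaging is a hypothetical helper name, not part of MAAS.
maas_packaging() {
    if command -v snap >/dev/null 2>&1 && snap list maas >/dev/null 2>&1; then
        echo snap
    elif command -v dpkg-query >/dev/null 2>&1 \
        && dpkg-query -W -f='${Status}' maas 2>/dev/null \
            | grep -q 'install ok installed'; then
        echo deb
    else
        echo none
    fi
}

# Show the version details for whichever format was found.
case "$(maas_packaging)" in
    snap) snap list maas ;;
    deb)  apt list maas 2>/dev/null ;;
    none) echo "MAAS does not appear to be installed as a snap or a deb" ;;
esac
```

The output of whichever branch runs is what should be pasted into the thread.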

How you accessed MAAS

Also, you’ll need to specify which interface you’re using (CLI, UI, or API), and generally what command(s) you were attempting or what screens were involved. It looks like you were using the Web UI, but we want to be sure we have that right. Screenshots may help.

Get us this, and we can try to move forward answering your question.

Hello Bill,

I hope you’re doing fine!

Thanks for your answer.

The version is:

dpkg -l | grep maas
ii  maas                                  1:3.1.0-10901-g.f1f8f1505-0ubuntu1~20.04.1  all          "Metal as a Service" is a physical cloud and IPAM
ii  maas-cli                              1:3.1.0-10901-g.f1f8f1505-0ubuntu1~20.04.1  all          MAAS client and command-line interface
ii  maas-common                           1:3.1.0-10901-g.f1f8f1505-0ubuntu1~20.04.1  all          MAAS server common files
ii  maas-dhcp                             1:3.1.0-10901-g.f1f8f1505-0ubuntu1~20.04.1  all          MAAS DHCP server
ii  maas-proxy                            1:3.1.0-10901-g.f1f8f1505-0ubuntu1~20.04.1  all          MAAS Caching Proxy
ii  maas-rack-controller                  1:3.1.0-10901-g.f1f8f1505-0ubuntu1~20.04.1  all          Rack Controller for MAAS
ii  maas-region-api                       1:3.1.0-10901-g.f1f8f1505-0ubuntu1~20.04.1  all          Region controller API service for MAAS
ii  maas-region-controller                1:3.1.0-10901-g.f1f8f1505-0ubuntu1~20.04.1  all          Region Controller for MAAS
ii  python3-django-maas                   1:3.1.0-10901-g.f1f8f1505-0ubuntu1~20.04.1  all          MAAS server Django web framework (Python 3)
ii  python3-maas-client                   1:3.1.0-10901-g.f1f8f1505-0ubuntu1~20.04.1  all          MAAS python API client (Python 3)
ii  python3-maas-provisioningserver       1:3.1.0-10901-g.f1f8f1505-0ubuntu1~20.04.1  all          MAAS server provisioning libraries (Python 3)

apt list -a maas
Listing... Done
maas/focal,now 1:3.1.0-10901-g.f1f8f1505-0ubuntu1~20.04.1 all [installed]
maas/focal-updates 1:0.7 all
maas/focal 1:0.6 all

I use the web GUI with no fancy configuration; it’s for a learning phase.

I do nothing special; the machines enlist correctly via their management interface ONLY.
All ports do DHCP / PXE / BOOTP / TFTP correctly and get the correct payload, and the machine boots (via whichever port does that first).
Are there any restrictions on Ethernet port usage with MAAS?
What if all the ports are connected to MAAS?

Thanks in advance,
Have a nice day,
Best Regards,
Mickaël.

PS:

uname -a
Linux maas2 5.4.0-104-generic #118-Ubuntu SMP Wed Mar 2 19:02:41 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
lsb_release --all
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 20.04.4 LTS
Release:        20.04
Codename:       focal

@mkl1, sorry this one fell thru the cracks. did you get it resolved yet?

Hello Bill,

I was off :wink:

We have identified some restrictions linked to our layer 2 networks, which run many different variants of spanning tree protocols. These cause link loss several times, during different phases, on the interface(s) used to boot from the network (BIOS/EFI/PXE/GRUB/KERNEL…).
We are still struggling to find a solution, maybe in GRUB: repeat, retry, wait, whatever makes the process always work.

The admin port was directly connected to our MAAS prototype, but plugging it into the network shows the same issues.

It would be great if you could point me to the path, inside MAAS, of the grub.cfg file that MAAS serves, so we could study it and increase its reliability during network boot in case of network failures.

I am not sure whether we will then run into other boot vs. network availability issues.

Thanks in advance,
Have a nice day,
Best Regards,
Mickaël.

PS: Then what about 802.1x… I have some work to do ;-(

@mkl1, I’m not quite sure what you’re asking here. can you try asking again?

Hello Bill,

I hope you’re doing fine!

The issue is that during the different phases of PXE booting we lose the network.
(As explained, maybe due to spanning tree vs. network driver load, etc.)
And so, sometimes we end up at the grub prompt forever.
We would like to read the GRUB configuration file served during PXE boot to try to increase its reliability.
Where, on a MAAS server, are the grub.cfg file(s) served during PXE booting located?

Thanks in advance,
Have a nice day,
Best Regards,
Mickaël.

ah, i get it. i’m asking around. let you know.

-best.

Hello Bill,

Not sure why (I have to check whether someone changed my network environment), but machines are not booting anymore and I get stuck at grub>.

So I took the time to look around:

env
prefix=(tftp,rack-ip)/grub

ls
(memdisk) …
ls (memdisk)/
grub.cfg

cat (memdisk)/grub.cfg
if [ -e $prefix/x86_64-efi/grub.cfg ]; then
source $prefix/x86_64-efi/grub.cfg
elif [ -e $prefix/grub.cfg-amd64 ] ; then
source $prefix/grub.cfg-default-amd64
else
source $prefix/grub.cfg
fi

I really don’t like this default GRUB config, which is so prone to network failures.
I also can’t find it in MAAS because I guess it is built on the fly.
I would rather see a kind of while, not if…
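
To make the "while, not if" idea concrete, here is a rough GRUB script fragment. It is only a sketch and not what MAAS ships: the ten attempts and the 5-second delay are arbitrary assumptions, it only covers the final fallback path from the embedded file, and it does not re-run DHCP if the link drop also lost the address. Since the file lives inside the bootloader image, it would have to be embedded in a rebuilt image to be tried at all.

```
# Hypothetical retry wrapper around the last branch of the embedded grub.cfg.
for attempt in 1 2 3 4 5 6 7 8 9 10; do
    if [ -e $prefix/grub.cfg ]; then
        # The file is reachable: load it and stop retrying.
        source $prefix/grub.cfg
        break
    fi
    # Give spanning tree time to move the port to forwarding.
    sleep 5
done
```

A fixed list of attempts is used so the loop is bounded and cannot spin forever if the rack controller is really down.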

Could you please run some trials on networks with spanning tree enabled so you can reproduce the issues? And maybe open a case?

Thanks in advance,
Have a nice day,
Best Regards,
Mickaël.

Hi @mkl1,

You mentioned that when trying to commission your machines are “getting stuck on grub”. If you watch the boot process, are they getting stuck on “Fetching netboot image”?

I noticed in the spec sheet of your server that the Ethernet ports are 10G Intel (X722 controllers). There is a race condition bug on some Intel NICs (#1437353). I have personally seen it on i350s and X540s and have seen it in the forums recently. I don’t know if it’s applicable to the X722 controllers, but the solution has been to flash the firmware with Intel’s flash utility.

Hello,

We are stuck at "grub> " because, my guess is, GRUB is not able to fetch $prefix/whatever…
And it never retries, which is stupid…

I will look into the Intel Ethernet chip race condition.
But I guess the issue is spanning tree…
If you had read the whole thread…

Have a nice day,
Best Regards,
Mickaël.

well, @mkl1, for some odd reason, i decided to draw a very crude picture:

on a snap, the grub file (if that’s what it’s using) would be located in:

/var/snap/maas/common/maas/boot-resources/current/bootloader/

i don’t have a package install of MAAS handy atm, but you can find the bootloader directory easily enough with:

find / -name bootloader -print 2>/dev/null

you’ll need to figure out which bootloader you’re using, as i think there are 3 directories under that one, each with some number of bootloaders, possibly.

Hello Bill,

Thanks for your answer.

That still doesn’t help, because I guess grub.cfg is embedded in grubx64.efi.
So I could not run any trials modifying grub.cfg.
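
One way to check that guess is to search the bootloader binary for a string from the embedded file shown earlier. This is a minimal sketch: has_embedded_string is a made-up helper, the default path grubx64.efi is a placeholder for wherever the bootloader directory turns out to be, and it assumes the embedded file is stored uncompressed in the image.

```shell
#!/bin/sh
# Return 0 if the file contains the given literal string anywhere in its bytes.
# -a treats binary data as text, -q suppresses output, -F matches literally.
# has_embedded_string is a hypothetical helper name.
has_embedded_string() {
    grep -a -q -F -- "$2" "$1"
}

# Placeholder path: pass the real location of the image as the first argument.
image="${1:-grubx64.efi}"
if [ -f "$image" ] && has_embedded_string "$image" 'x86_64-efi/grub.cfg'; then
    echo "$image appears to embed its own grub.cfg"
else
    echo "no embedded grub.cfg marker found in $image"
fi
```

A match only shows that the string is present in the binary; it does not say how the image was built.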

Another lead: judging from the logs, maybe the lease times for PXE are too short.

Are there any CLI command examples for setting lease times, first for PXE and then for normal operation?

Thanks in advance,
Have a nice day,
Best Regards,
Mickaël.

@mkl1, i’m not sure i know of any, but i’ll ask around real quick.

i think you’ll have to construct a dhcp snippet to handle that for you – it’s a standard thing, not something MAAS provides a handle for, but you should be able to pull it off.

Thanks for your answer Bill.

But:

sudo grep lease /var/lib/maas/dhcpd.conf
# Shorter lease time for PXE booting
   default-lease-time 30;
   max-lease-time 30;
# Define lease time globally (can be overriden globally or per subnet
default-lease-time 600;
max-lease-time 600;
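
To see which block each of those lease lines belongs to, the same grep can print some context above every match. A small sketch: show_lease_context is a made-up helper name, and four lines of context is an arbitrary choice.

```shell
#!/bin/sh
# Print every lease-time directive with its line number and the four lines
# above it, so the enclosing class or subnet block is visible.
# show_lease_context is a hypothetical helper name.
show_lease_context() {
    grep -n -B4 'lease-time' "$1"
}

# Path taken from the grep command above; override with the first argument.
conf="${1:-/var/lib/maas/dhcpd.conf}"
if [ -r "$conf" ]; then
    show_lease_context "$conf"
else
    echo "cannot read $conf (try running with sudo)"
fi
```

Knowing the enclosing block makes it clearer what a snippet would need to target.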

How do I build a snippet only for PXE?
I tried to use the CLI to get the info (using -h / --help), but it does not give me good information on how to build snippets…

I just noticed something while inspecting the traffic with Wireshark.

It looks like this:
https://osqa-ask.wireshark.org/questions/22519/tftp-transfer-option-negotiation-failed-error-8-packet-trace/

Why does the MAAS TFTP server build answers that contain fields not requested by the clients?

@mkl1, hmm…

@mkl1, wdym in this particular statement?