PXE booting on a tagged VLAN

Hello,

I’m looking to PXE boot on a specific tagged VLAN that is the only network with DHCP, and it looks like the provided initramfs image does not properly support VLANs. So far, I have found that you guys are using a customized copy of Debian’s initramfs-tools that adds VLAN support; the source can be found at https://launchpad.net/ubuntu/+source/initramfs-tools. The problem with the existing VLAN implementation is that it requires a definition of the form vlan=interface.vlanid:interface and, while that interface name is predictable on a single machine, it is not predictable across a wide variety of machines. It would be neat if the definition accepted a placeholder that is replaced with the interface detected from the BOOTIF field.

After configuring the boot args as vlan=vlan.22:enp1s0f0np0 ip=::::maas-enlist:vlan.22, I ran into a problem with netplan and cloud-init. Reviewing the code of the initramfs script, it looks like it has a way to generate a proper netplan config, but that code never runs because there is no /run/"net-${DEVICE}.conf" file. I could modify the functions script to generate a /run/"net-${DEVICE}.conf" file if it doesn’t exist, based on information obtained from DHCP, but that would require building custom initramfs images.
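For readers unfamiliar with the ip= syntax used above, here is a minimal Python sketch of how the colon-separated value breaks down. It assumes the usual kernel field order (client IP, server IP, gateway, netmask, hostname, device, autoconf) and IPv4-style values without embedded colons. The function names are made up for illustration and are not part of initramfs-tools.

```python
# Field order of the kernel's colon-separated ip= parameter (assumed here).
IP_FIELDS = ("client_ip", "server_ip", "gateway", "netmask",
             "hostname", "device", "autoconf")


def parse_ip_param(value: str) -> dict:
    """Split an ip= value into named fields; missing fields become ''."""
    parts = value.split(":")
    parts += [""] * (len(IP_FIELDS) - len(parts))
    return dict(zip(IP_FIELDS, parts))


def set_ip_device(value: str, device: str) -> str:
    """Return the ip= value with its sixth field (the device) replaced.

    Values that are not colon-separated lists (for example "dhcp")
    are returned unchanged.
    """
    if ":" not in value:
        return value
    parts = value.split(":")
    parts += [""] * (6 - len(parts))  # pad so that index 5 exists
    parts[5] = device
    return ":".join(parts)


fields = parse_ip_param("::::maas-enlist:vlan.22")
print(fields["hostname"], fields["device"])
print(set_ip_device("::::maas-enlist:enp1s0f0np0", "vlan.22"))
```

In the boot args above, the hostname field is maas-enlist and the device field is vlan.22, so the address is requested on the VLAN interface rather than on its parent.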

While I wait to see if anyone here has a better solution, I’ll go and poke at the MAAS "How to build MAAS images" guide to see what’s involved in modifying /scripts/functions to fix the issues noted above. But I’d love it if I didn’t have to.

An update on the situation: I made a patch for the functions script in the initramfs, which I’m applying in the maas-images maas-cloudimg2ephemeral script before the chroot update is processed.

--- ../main/scripts/functions   2024-08-23 11:50:09.362269934 -0500
+++ functions   2024-08-23 13:53:55.235159261 -0500
@@ -235,6 +235,25 @@
        fi
 }
 
+_handle_vlan_vs_ip()
+{
+       # If the ip= parameter is present and is a colon-separated list,
+       # and the VLAN's parent link is the current boot device, then
+       # rewrite the device (sixth) field of ip= to the VLAN interface
+       # name, so that the address is configured on the VLAN rather
+       # than on the parent link.
+       local IFS=:
+       set -f
+       # shellcheck disable=SC2086
+       set -- ${IP}
+       set +f
+       if [ $# -ge 2 ] && [ -n "${vname}" ] && [[ "$DEVICE" == "$vlink" ]]; then
+               IP="$1:$2:$3:$4:$5:${vname}"
+               shift 6 || shift $#
+               IP="${IP}:$*"
+       fi
+}
+
 run_dhclient() {
         local timeout conffile pidfile pid
 
@@ -288,6 +307,7 @@
 
 configure_networking()
 {
+       BOOTIFDEVICE=""
        if [ -n "${BOOTIF}" ]; then
                # pxelinux sets BOOTIF to a value based on the mac address of the
                # network card used to PXE boot, so use this value for DEVICE rather
@@ -319,6 +339,7 @@
                                if [ "$bootif_mac" = "$current_mac" ]; then
                                        DEVICE=${device##*/}
                                        DEVICE6=${device##*/}
+                                       BOOTIFDEVICE=${device##*/}
                                        break
                                fi
                        fi
@@ -329,6 +350,9 @@
 
        for v in $VLAN; do
                vlink=${v##*:}
+               if [[ "$vlink" == "BOOTIF" ]]; then
+                       vlink=$BOOTIFDEVICE
+               fi
                VLAN_LINK="$VLAN_LINK $vlink"
                VLAN_NAMES="$VLAN_NAMES ${v%:*}"
        done
@@ -351,10 +375,16 @@
 
        for v in $VLAN; do
                vlink=${v##*:}
+               if [[ "$vlink" == "BOOTIF" ]]; then
+                       vlink=$BOOTIFDEVICE
+               fi
                vname=${v%:*}

This patch has worked great in testing. While working on it, I also found that the initramfs does generate the net-device.conf file after all; I just have not researched enough to know how. With the patch, you simply set your kernel parameters to vlan=vlan.22:BOOTIF and the parent interface is resolved automatically to whichever interface was detected as the boot interface.
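The placeholder logic in the patch can be restated as a short Python sketch. This is only an illustration of the same splitting rules the shell code uses (link after the last colon, VLAN ID after the first dot of the name). VlanSpec and resolve_vlan are hypothetical names, and the stricter error handling is an addition that the shell patch does not have.

```python
from typing import NamedTuple


class VlanSpec(NamedTuple):
    name: str  # VLAN interface name, e.g. "vlan.22"
    vid: int   # VLAN ID taken from the name, e.g. 22
    link: str  # parent interface, e.g. "enp1s0f0np0"


def resolve_vlan(spec: str, bootif_device: str) -> VlanSpec:
    """Parse 'name.id:link', replacing a link of 'BOOTIF' with bootif_device."""
    # Same split as vlink=${v##*:} and vname=${v%:*} in the shell patch.
    name, sep, link = spec.rpartition(":")
    if not sep or not name or not link:
        raise ValueError(f"expected name.id:link, got {spec!r}")
    if link == "BOOTIF":
        if not bootif_device:
            raise ValueError("BOOTIF placeholder used but no boot interface detected")
        link = bootif_device
    # Same split as vid=${vname#*.}: everything after the first dot.
    _, dot, vid = name.partition(".")
    if not dot or not vid.isdigit():
        raise ValueError(f"cannot derive a VLAN ID from {name!r}")
    return VlanSpec(name, int(vid), link)


print(resolve_vlan("vlan.22:BOOTIF", "enp1s0f0np0"))
```

An explicit interface name still works as before, for example resolve_vlan("vlan.22:enp1s0f0np0", "") keeps enp1s0f0np0 as the link.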

The (hopefully) last problem I am trying to fix is the cloud-init/netplan issue. I found the cause after patching the image so that I could poke at the booted enlistment environment: cloud-init writes the following netplan config to /etc/netplan/50-cloud-init.yaml:

# This file is generated from information provided by the datasource.  Changes
# to it will not persist across an instance reboot.  To disable cloud-init's
# network config capabilities, write a file
# /etc/cloud/cloud.cfg.d/99-disable-network-config.cfg with the following:
# network: {config: disabled}
network:
    version: 2
    ethernets:
        vlan.22:
            dhcp4: true
            match:
                macaddress: 52:54:00:e3:43:ab
            set-name: vlan.22

The problem here is that it defines the VLAN interface as an Ethernet port, while the netplan config generated by the initramfs is:

network:
    version: 2
    ethernets:
        enp1s0f0np0:
            {}
    vlans:
        vlan.22:
            id: 22
            link: enp1s0f0np0
            dhcp4: true
            dhcp-identifier: mac
            critical: true
            nameservers:
                addresses: ["10.0.0.4"]
                search: ["maas."]

As can be seen, the initramfs config defines vlan.22 as a VLAN, which conflicts with cloud-init’s definition of the same name as an Ethernet device.

The error shown in console by cloud-init is:

/run/netplan/vlan.22.yaml:7:5: Error in network defintion: Updated definition 'vlan.22' changes device type
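To make the conflict concrete, here is a small Python sketch that compares the two configs above as plain dictionaries and reports any interface name that appears under different device sections. It is not how netplan itself performs the check. The function names and the list of sections are assumptions made for the example.

```python
# Netplan sections considered here (an illustrative subset).
DEVICE_SECTIONS = ("ethernets", "vlans", "bonds", "bridges")


def device_types(config: dict) -> dict:
    """Map each interface name to the netplan section that defines it."""
    network = config.get("network", {})
    return {
        name: section
        for section in DEVICE_SECTIONS
        for name in network.get(section, {})
    }


def type_conflicts(first: dict, second: dict) -> list:
    """List (name, section_in_first, section_in_second) for mismatched names."""
    a, b = device_types(first), device_types(second)
    return sorted(
        (name, a[name], b[name])
        for name in a.keys() & b.keys()
        if a[name] != b[name]
    )


# The two configs shown above, reduced to the relevant keys.
cloud_init = {"network": {"version": 2, "ethernets": {
    "vlan.22": {"dhcp4": True, "set-name": "vlan.22"}}}}
initramfs = {"network": {"version": 2,
    "ethernets": {"enp1s0f0np0": {}},
    "vlans": {"vlan.22": {"id": 22, "link": "enp1s0f0np0", "dhcp4": True}}}}

print(type_conflicts(initramfs, cloud_init))
```

Only vlan.22 is reported, since enp1s0f0np0 appears in just one of the two files.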

To try to solve this, I did as the comment suggested and added network: {config: disabled} to the file it names. That did not change the result. I then tried adding the line under system_info->network in cloud.cfg, with the same result. I then tried appending network: {config: disabled} after the preserve_hostname section of cloud.cfg, and it made no difference. Finally, I tried modifying the cloud-init config provided by MAAS as follows:

[root@util01 ~]# cat /var/snap/maas/current/preseeds/enlist
{{preseed_data}}
network: {config: disabled}

Nothing I have tried has worked. From what I can see, the file is written by cloud-init-local.service, which runs /usr/bin/cloud-init init --local. I haven’t had a chance to research what that actually does or where it gets its config from. But I am on track to getting boot from a tagged VLAN to work.

Update: I have been able to fix the cloud-init issue by adding network-config=disabled to the kernel parameters, and it works with the existing cloud.cfg on the system. However, I seem to have encountered a bug in MAAS itself.

Aug 24 12:40:39 util01 maas-log[300465]: maas.rpc.rackcontrollers: message repeated 2 times: [ [info] Existing rack controller 'util01' running version 3.5.1-16317-g.409891638 has connected to region 'util01'.]
Aug 24 12:40:39 util01 maas-log[300465]: maas.node: [info] : Status transition from NEW to COMMISSIONING
Aug 24 12:40:39 util01 maas-regiond[300302]: maasserver: [error] ################################ Exception: 'ValidationError' object has no attribute 'message' ################################
Aug 24 12:40:39 util01 maas-regiond[300302]: maasserver: [error] Traceback (most recent call last):
Aug 24 12:40:39 util01 maas-regiond[300302]:   File "/snap/maas/36889/usr/lib/python3/dist-packages/django/db/models/query.py", line 581, in get_or_create
Aug 24 12:40:39 util01 maas-regiond[300302]:     return self.get(**kwargs), False
Aug 24 12:40:39 util01 maas-regiond[300302]:   File "/snap/maas/36889/usr/lib/python3/dist-packages/django/db/models/query.py", line 435, in get
Aug 24 12:40:39 util01 maas-regiond[300302]:     raise self.model.DoesNotExist(
Aug 24 12:40:39 util01 maas-regiond[300302]: maasserver.models.interface.PhysicalInterface.DoesNotExist: PhysicalInterface matching query does not exist.
Aug 24 12:40:39 util01 maas-regiond[300302]: During handling of the above exception, another exception occurred:
Aug 24 12:40:39 util01 maas-regiond[300302]: Traceback (most recent call last):
Aug 24 12:40:39 util01 maas-regiond[300302]:   File "/snap/maas/36889/lib/python3.10/site-packages/maasserver/forms/__init__.py", line 1479, in save
Aug 24 12:40:39 util01 maas-regiond[300302]:     node.add_physical_interface(mac)
Aug 24 12:40:39 util01 maas-regiond[300302]:   File "/snap/maas/36889/lib/python3.10/site-packages/maasserver/models/node.py", line 2146, in add_physical_interface
Aug 24 12:40:39 util01 maas-regiond[300302]:     iface, created = PhysicalInterface.objects.get_or_create(
Aug 24 12:40:39 util01 maas-regiond[300302]:   File "/snap/maas/36889/lib/python3.10/site-packages/maasserver/models/interface.py", line 434, in get_or_create
Aug 24 12:40:39 util01 maas-regiond[300302]:     interface, created = super().get_or_create(*args, **kwargs)
Aug 24 12:40:39 util01 maas-regiond[300302]:   File "/snap/maas/36889/usr/lib/python3/dist-packages/django/db/models/manager.py", line 85, in manager_method
Aug 24 12:40:39 util01 maas-regiond[300302]:     return getattr(self.get_queryset(), name)(*args, **kwargs)
Aug 24 12:40:39 util01 maas-regiond[300302]:   File "/snap/maas/36889/usr/lib/python3/dist-packages/django/db/models/query.py", line 588, in get_or_create
Aug 24 12:40:39 util01 maas-regiond[300302]:     return self.create(**params), True
Aug 24 12:40:39 util01 maas-regiond[300302]:   File "/snap/maas/36889/usr/lib/python3/dist-packages/django/db/models/query.py", line 453, in create
Aug 24 12:40:39 util01 maas-regiond[300302]:     obj.save(force_insert=True, using=self.db)
Aug 24 12:40:39 util01 maas-regiond[300302]:   File "/snap/maas/36889/lib/python3.10/site-packages/maasserver/models/interface.py", line 1763, in save
Aug 24 12:40:39 util01 maas-regiond[300302]:     super().save(*args, **kwargs)
Aug 24 12:40:39 util01 maas-regiond[300302]:   File "/snap/maas/36889/lib/python3.10/site-packages/maasserver/models/interface.py", line 1636, in save
Aug 24 12:40:39 util01 maas-regiond[300302]:     super().save(*args, **kwargs)
Aug 24 12:40:39 util01 maas-regiond[300302]:   File "/snap/maas/36889/lib/python3.10/site-packages/maasserver/models/cleansave.py", line 46, in save
Aug 24 12:40:39 util01 maas-regiond[300302]:     self.full_clean(exclude=exclude_clean_fields, validate_unique=False)
Aug 24 12:40:39 util01 maas-regiond[300302]:   File "/snap/maas/36889/usr/lib/python3/dist-packages/django/db/models/base.py", line 1251, in full_clean
Aug 24 12:40:39 util01 maas-regiond[300302]:     raise ValidationError(errors)
Aug 24 12:40:39 util01 maas-regiond[300302]: django.core.exceptions.ValidationError: {'mac_address': ["'04:32:01:c5:d0:10,c4:5a:b1:bf:04:bc,04:32:01:c5:d0:11,c4:5a:b1:bf:04:bb' is not a valid MAC address."]}
Aug 24 12:40:39 util01 maas-regiond[300302]: During handling of the above exception, another exception occurred:
Aug 24 12:40:39 util01 maas-regiond[300302]: Traceback (most recent call last):
Aug 24 12:40:39 util01 maas-regiond[300302]:   File "/snap/maas/36889/usr/lib/python3/dist-packages/django/core/handlers/base.py", line 181, in _get_response
Aug 24 12:40:39 util01 maas-regiond[300302]:     response = wrapped_callback(request, *callback_args, **callback_kwargs)
Aug 24 12:40:39 util01 maas-regiond[300302]:   File "/snap/maas/36889/lib/python3.10/site-packages/maasserver/utils/views.py", line 298, in view_atomic_with_post_commit_savepoint
Aug 24 12:40:39 util01 maas-regiond[300302]:     return view_atomic(*args, **kwargs)
Aug 24 12:40:39 util01 maas-regiond[300302]:   File "/usr/lib/python3.10/contextlib.py", line 79, in inner
Aug 24 12:40:39 util01 maas-regiond[300302]:     return func(*args, **kwds)
Aug 24 12:40:39 util01 maas-regiond[300302]:   File "/snap/maas/36889/lib/python3.10/site-packages/maasserver/api/support.py", line 62, in __call__
Aug 24 12:40:39 util01 maas-regiond[300302]:     response = super().__call__(request, *args, **kwargs)
Aug 24 12:40:39 util01 maas-regiond[300302]:   File "/snap/maas/36889/usr/lib/python3/dist-packages/django/views/decorators/vary.py", line 20, in inner_func
Aug 24 12:40:39 util01 maas-regiond[300302]:     response = func(*args, **kwargs)
Aug 24 12:40:39 util01 maas-regiond[300302]:   File "/snap/maas/36889/usr/lib/python3/dist-packages/piston3/resource.py", line 196, in __call__
Aug 24 12:40:39 util01 maas-regiond[300302]:     result = self.error_handler(e, request, meth, em_format)
Aug 24 12:40:39 util01 maas-regiond[300302]:   File "/snap/maas/36889/usr/lib/python3/dist-packages/piston3/resource.py", line 194, in __call__
Aug 24 12:40:39 util01 maas-regiond[300302]:     result = meth(request, *args, **kwargs)
Aug 24 12:40:39 util01 maas-regiond[300302]:   File "/snap/maas/36889/lib/python3.10/site-packages/maasserver/api/support.py", line 371, in dispatch
Aug 24 12:40:39 util01 maas-regiond[300302]:     return function(self, request, *args, **kwargs)
Aug 24 12:40:39 util01 maas-regiond[300302]:   File "/snap/maas/36889/lib/python3.10/site-packages/maasserver/api/machines.py", line 1937, in create
Aug 24 12:40:39 util01 maas-regiond[300302]:     machine = create_machine(request, requires_arch=True)
Aug 24 12:40:39 util01 maas-regiond[300302]:   File "/snap/maas/36889/lib/python3.10/site-packages/maasserver/api/machines.py", line 1752, in create_machine
Aug 24 12:40:39 util01 maas-regiond[300302]:     machine = form.save()
Aug 24 12:40:39 util01 maas-regiond[300302]:   File "/snap/maas/36889/lib/python3.10/site-packages/maasserver/forms/__init__.py", line 401, in save
Aug 24 12:40:39 util01 maas-regiond[300302]:     node = super().save()
Aug 24 12:40:39 util01 maas-regiond[300302]:   File "/snap/maas/36889/lib/python3.10/site-packages/maasserver/forms/__init__.py", line 1481, in save
Aug 24 12:40:39 util01 maas-regiond[300302]:     mac_addresses_errors.append(e.message)
Aug 24 12:40:39 util01 maas-regiond[300302]: AttributeError: 'ValidationError' object has no attribute 'message'
Aug 24 12:40:39 util01 maas-http[300465]:  10.0.0.4 - - [24/Aug/2024:12:40:39 -0400] "POST /MAAS/api/2.0/machines/ HTTP/1.1" 500 51 "-" "Python-urllib/3.10"

The request sent is:

architecture=amd64&mac_addresses=c4:5a:b1:bf:04:bc,04:32:01:c5:d0:10,c4:5a:b1:bf:04:bb,04:32:01:c5:d0:11&commission=True&power_type=ipmi&power_parameters={"cipher_suite_id":+"3",+"k_g":+"",+"mac_address":+"C4:5A:B1:BF:04:B5",+"power_address":+"10.0.0.29",+"power_boot_type":+"efi",+"power_driver":+"LAN_2_0",+"power_pass":+"XXXXXXXXX",+"power_user":+"maas",+"privilege_level":+"ADMIN"}
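Reading the ValidationError above, the whole comma-joined string is being validated as a single MAC address. The sketch below uses Python’s standard urllib.parse to show how the request decodes, and how it would look if mac_addresses were repeated once per address. That the API expects the repeated form is an assumption drawn from the error message, not something confirmed in this thread. The query string is shortened to the relevant parameters.

```python
from urllib.parse import parse_qs, urlencode

macs = ["c4:5a:b1:bf:04:bc", "04:32:01:c5:d0:10",
        "c4:5a:b1:bf:04:bb", "04:32:01:c5:d0:11"]

# As sent: one mac_addresses parameter holding a comma-joined string.
sent = "architecture=amd64&mac_addresses=" + ",".join(macs) + "&commission=True"
print(parse_qs(sent)["mac_addresses"])

# Alternative (assumed to be what the server wants): one parameter per address.
repeated = urlencode(
    {"architecture": "amd64", "mac_addresses": macs, "commission": "True"},
    doseq=True,
)
print(repeated)
print(parse_qs(repeated)["mac_addresses"])
```

The final AttributeError looks secondary: as far as I can tell, a Django ValidationError built from a dict of field errors has no single message attribute, so the error handler fails while trying to report the original validation problem.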

I finally got enlisting to work over a VLAN. I had to re-enable the network config in cloud-init and modify cloud-init to accept vlan as an interface type from the klibc config. As the klibc ipconfig format does not support this by default, I made up my own way of defining it.

/scripts/functions

--- ../main/scripts/functions   2024-08-23 11:50:09.362269934 -0500
+++ functions   2024-08-26 06:53:46.625845257 -0500
@@ -235,6 +235,25 @@
        fi
 }
 
+_handle_vlan_vs_ip()
+{
+       # If the ip= parameter is present and is a colon-separated list,
+       # and the VLAN's parent link is the current boot device, then
+       # rewrite the device (sixth) field of ip= to the VLAN interface
+       # name, so that the address is configured on the VLAN rather
+       # than on the parent link.
+       local IFS=:
+       set -f
+       # shellcheck disable=SC2086
+       set -- ${IP}
+       set +f
+       if [ $# -ge 2 ] && [ -n "${vname}" ] && [[ "$DEVICE" == "$vlink" ]]; then
+               IP="$1:$2:$3:$4:$5:${vname}"
+               shift 6 || shift $#
+               IP="${IP}:$*"
+       fi
+}
+
 run_dhclient() {
         local timeout conffile pidfile pid
 
@@ -288,6 +307,7 @@
 
 configure_networking()
 {
+       BOOTIFDEVICE=""
        if [ -n "${BOOTIF}" ]; then
                # pxelinux sets BOOTIF to a value based on the mac address of the
                # network card used to PXE boot, so use this value for DEVICE rather
@@ -319,6 +339,7 @@
                                if [ "$bootif_mac" = "$current_mac" ]; then
                                        DEVICE=${device##*/}
                                        DEVICE6=${device##*/}
+                                       BOOTIFDEVICE=${device##*/}
                                        break
                                fi
                        fi
@@ -329,6 +350,9 @@
 
        for v in $VLAN; do
                vlink=${v##*:}
+               if [[ "$vlink" == "BOOTIF" ]]; then
+                       vlink=$BOOTIFDEVICE
+               fi
                VLAN_LINK="$VLAN_LINK $vlink"
                VLAN_NAMES="$VLAN_NAMES ${v%:*}"
        done
@@ -349,12 +373,22 @@
                esac
        done
 
+       TYPE="physical"
        for v in $VLAN; do
                vlink=${v##*:}
+               if [[ "$vlink" == "BOOTIF" ]]; then
+                       vlink=$BOOTIFDEVICE
+               fi
+               if [[ "$vlink" == "$DEVICE" ]]; then
+                       TYPE="vlan"
+               fi
                vname=${v%:*}
                vid=${vname#*.}
                ip link set up dev "$vlink"
                ip link add name "$vname" link "$vlink" type vlan id "$vid"
+
+               # Update device in IP= to the vlan name if the vlink matches device.
+               _handle_vlan_vs_ip
        done
 
        if [ -n "${DEVICE}" ]; then
@@ -453,11 +487,22 @@
        # but no IPv4 conf files exist.
        for conf in /run/"net-$DEVICE.conf" /run/net-*.conf; do
                if [ -e "$conf" ]; then
+                       echo "TYPE=$TYPE" >> "$conf"
+                       if [ -n "${BOOTIFDEVICE}" ]; then
+                               echo "BOOTIFDEVICE=$BOOTIFDEVICE" >> "$conf"
+                       fi
                        # source specific bootdevice
                        . "$conf"
                        break
                fi
        done
+
+       # Create a boot interface config, if this is a vlan it will not exist.
+       conf="/run/net-$BOOTIFDEVICE.conf"
+       if [ -n "${BOOTIFDEVICE}" ] && [ ! -e $conf ]; then
+               echo "DEVICE=$BOOTIFDEVICE" > $conf
+               echo "PROTO=none" >> $conf
+       fi
 
        netinfo_to_resolv_conf /etc/resolv.conf \
                /run/"net-${DEVICE}.conf" /run/net-*.conf /run/net6-*.conf
@@ -660,6 +705,7 @@
                unset DEVICE DEVICE6 PROTO IPV6PROTO
                unset IPV6ADDR IPV6NETMASK IPV6GATEWAY
                unset IPV4ADDR IPV4NETMASK IPV4GATEWAY
+               unset VLINK VID
                . "$f" || { echo "WARN: failed '. \"$f\"'" 1>&2; return 1; }
                local name=""
                name=${DEVICE:-${DEVICE6}}
@@ -681,9 +727,16 @@
                {
                for v in $VLAN; do
                        vlink=${v##*:}
+                       if [[ "$vlink" == "BOOTIF" ]]; then
+                               vlink=$BOOTIFDEVICE
+                       fi
                        vname=${v%:*}
                        vid=${vname#*.}
                        if [ "$name" = "$vname" ]; then
+                               if [ -z "$VLINK" ]; then
+                                       echo "VLINK=$vlink" >> $f
+                                       echo "VID=$vid" >> $f
+                               fi
                                echo "vlink=$vlink"
                                echo "vname=$vname"
                                echo "vid=$vid"

/scripts/init-bottom/cloud-initramfs-dyn-netconf

--- ../main/scripts/init-bottom/cloud-initramfs-dyn-netconf     2020-08-17 11:00:58.000000000 -0500
+++ cloud-initramfs-dyn-netconf 2024-08-26 04:35:49.492437237 -0500
@@ -177,6 +177,20 @@
                fi
        fi
 
+       # Add vlan link device.
+       for v in $VLAN; do
+               vlink=${v##*:}
+               if [[ "$vlink" == "BOOTIF" ]]; then
+                       vlink=$BOOTIFDEVICE
+               fi
+               vname=${v%:*}
+               vid=${vname#*.}
+               if [ "$DEVICE" = "$vname" ]; then
+                       printf "\t%s\n" "vlan-raw-device $vlink"
+                       break
+               fi
+       done
+
        nsline=""
        if [ -n "$IPV4DNS0" -a "$IPV4DNS0" != "0.0.0.0" ]; then
                nsline="${IPV4DNS0}"

/usr/lib/python3/dist-packages/cloudinit/net/cmdline.py

--- ../ephemeral/usr/lib/python3/dist-packages/cloudinit/net/cmdline.py 2024-03-27 08:14:04.000000000 -0500
+++ cmdline.py  2024-08-26 06:56:02.921289498 -0500
@@ -117,6 +117,7 @@
         name = data["DEVICE"] if "DEVICE" in data else data["DEVICE6"]
     except KeyError as e:
         raise ValueError("no 'DEVICE' or 'DEVICE6' entry in data") from e
+    iftype = data.get("TYPE", "physical")
 
     # ipconfig on precise does not write PROTO
     # IPv6 config gives us IPV6PROTO, not PROTO.
@@ -131,11 +132,15 @@
         raise ValueError("Unexpected value for PROTO: %s" % proto)
 
     iface = {
-        "type": "physical",
+        "type": iftype,
         "name": name,
         "subnets": [],
     }
 
+    if iftype == "vlan":
+        iface["vlan_link"] = data.get("VLINK")
+        iface["vlan_id"] = int(data.get("VID"))
+
     if name in mac_addrs:
         iface["mac_address"] = mac_addrs[name]

After applying these patches to cloud-init and the initramfs scripts, I can add just vlan=vlan.22:BOOTIF to the kernel parameters and the machine boots and enlists. I haven’t tested commissioning or provisioning yet, but I hope to get some testing done today.
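As a rough illustration of the data flow in the patches, the Python sketch below parses a net-*.conf style file carrying the extra TYPE, VLINK and VID keys and turns it into an interface dictionary shaped like the one built in the cmdline.py patch. The sample file contents and both function names are hypothetical. Unlike the patch, the sketch raises a clear error when VLINK or VID is missing instead of calling int() on a missing value.

```python
# Hypothetical contents of /run/net-vlan.22.conf after the patched scripts run.
SAMPLE_CONF = """\
DEVICE='vlan.22'
PROTO='dhcp'
TYPE=vlan
VLINK=enp1s0f0np0
VID=22
"""


def parse_klibc_conf(text: str) -> dict:
    """Parse KEY=value lines, stripping optional surrounding quotes."""
    data = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition("=")
        if sep and key:
            data[key] = value.strip("'\"")
    return data


def klibc_to_iface(data: dict) -> dict:
    """Build an interface dict; TYPE defaults to 'physical' as in the patch."""
    name = data.get("DEVICE") or data.get("DEVICE6")
    if not name:
        raise ValueError("no 'DEVICE' or 'DEVICE6' entry in data")
    iftype = data.get("TYPE", "physical")
    iface = {"type": iftype, "name": name, "subnets": []}
    if iftype == "vlan":
        if "VLINK" not in data or "VID" not in data:
            raise ValueError("vlan interface needs both VLINK and VID")
        iface["vlan_link"] = data["VLINK"]
        iface["vlan_id"] = int(data["VID"])
    return iface


print(klibc_to_iface(parse_klibc_conf(SAMPLE_CONF)))
```

A file without TYPE yields a plain physical interface, which matches the behavior before the patch.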


An update: it looks like I can commission and deploy a server with the updates above. However, for deploys, an IP address assigned from the native VLAN gets wiped as soon as the machine PXE boots off the tagged VLAN. So I need to look into what is doing that and fix it so that it doesn’t wipe the configuration that was set.

A side note: my changes to the cloud-init code cause a schema error because the VLAN entry has a MAC address. I tried fixing that, but doing so breaks enlisting, as the commissioning scripts go back to reporting all MAC addresses to MAAS. So I’m just going to ignore the schema error, since it works despite it.

OK, I traced down where the IP assignment is erased.

10.0.0.85 - - [27/Aug/2024:12:15:48 -0400] "GET /grub/grub.cfg-04:32:01:c5:d0:10 HTTP/1.1" 200 839 "-" "UefiHttpBoot/1.0"

After that call, it then logs the following:

Reloaded DNS configuration; ip 1.1.1.10 disconnected from clean-gnu on enp1s0f0np0
Reloaded DNS configuration; ip 10.0.0.85 connected to clean-gnu on enp1s0f0np0

I booted the machine from a system rescue CD and confirmed the behavior by manually curling the GRUB config: after the request, the network config is erased.

I am going to stop researching this as I do not know the reasons behind you guys implementing a feature like this. I have gone ahead and moved the PXE vlan to the native vlan and just added our public vlan as a tag, and that works for now. I would love to see if the team at Canonical can take some of my work here, and finish the process of making this work. I know there are others looking for this kind of functionality, and it’ll be awesome if this setup just worked.

Another posting that I believe is related to this: PXE boot with vlan without native vlan set