Network troubleshooting: Environment and topology gotchas

Not all problems come from MAAS itself. The surrounding environment (switches, routers, firewalls, overlays, and virtualization layers) often creates the hardest troubleshooting puzzles. This document catalogs the most common topological “gotchas” and how to recognize, verify, and fix them.

Campus and enterprise networks

Campus and enterprise networks often use large-scale routing and segmentation across multiple VRFs or subnets. When DHCP relay is not configured correctly, discovery traffic never reaches MAAS. Understanding these boundaries is essential for confirming which subnets MAAS can serve directly.

Symptom

DHCP requests never reach MAAS.

Verify

# capture dhcp packets on the rack uplink to confirm none are received
sudo tcpdump -ni <rack-if> port 67 or 68
# test the dhcp relay path from a node subnet (one-shot request, no config applied)
sudo dhclient -v -sf /bin/true -1 <iface>

Cause

Core routers segment traffic into VRFs; DHCP relay not configured.

Fix

Ensure DHCP relay (IP helper) is set for each VRF or subnet where MAAS nodes live.

Example fix

# verify vlan dhcp settings in maas
maas $PROFILE vlan read <fabric-id> <vid> | jq '.dhcp_on'

# cisco: enable dhcp relay to region/rack subnet
# interface Vlan<vid>
#  ip helper-address <rack-or-region-ip>

# juniper: add dhcp relay agent
# set forwarding-options helpers bootp interface vlan.<vid> server <rack-or-region-ip>

# verify relay operation
sudo tcpdump -ni <rack-uplink-if> port 67 or 68

DHCP snooping and source guard

DHCP snooping and source guard protect networks from rogue DHCP servers. These features can block legitimate MAAS rack responses if not configured for trusted ports. When enabled without exceptions, PXE clients may receive no offers or inconsistent responses.

Symptom

PXE offers do not reach nodes, or commissioning fails randomly.

Verify

# on the rack controller interface, capture dhcp offers
sudo tcpdump -ni <rack-if> port 67 or 68
# on the node port, confirm packets are not received
sudo tcpdump -ni <node-if> port 67 or 68

Cause

Switch enforces DHCP snooping or IP source guard, blocking MAAS rack replies.

Fix

Mark rack controller access ports as trusted DHCP ports and relax source guard for expected MACs.

Example fix

# cisco: trust dhcp and disable ip source guard
# interface GigabitEthernet0/1
#  ip dhcp snooping trust
#  no ip verify source

# juniper: trust dhcp on rack port
# set ethernet-switching-options secure-access-port interface ge-0/0/1 dhcp-trusted
# delete ethernet-switching-options secure-access-port vlan <vlan-name> ip-source-guard

# verify dhcp handshake
sudo tcpdump -ni <rack-if> port 67 or 68

Port security and MAC limits

Many access switches restrict the number of MAC addresses learned per port. During PXE boot, nodes temporarily use a different MAC for the ephemeral environment. If limits are exceeded or stale entries remain, packets from new MACs are silently dropped.

Symptom

Node never appears in MAAS, despite PXE attempt.

Verify

# capture ephemeral mac address on rack interface
sudo tcpdump -ni <rack-if> ether host <ephemeral-mac>
# view switch mac table for the access port
# show mac address-table interface <port>

Cause

Switchport MAC limit exceeded; ephemeral boot MAC rejected or aged out.

Fix

Increase per-port MAC limit and clear stale entries.

Example fix

# cisco: raise limit and clear mac table
# interface GigabitEthernet0/2
#  switchport port-security maximum 10
#  switchport port-security aging time 5
# clear mac address-table dynamic interface gi0/2

# juniper: bump mac limit
# set ethernet-switching-options secure-access-port interface ge-0/0/2 mac-limit 10

# verify connectivity restoration
sudo tcpdump -ni <rack-if> ether host <ephemeral-mac>
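When port security is the suspect, it helps to know how many distinct source MACs actually appear on the port. A small sketch, assuming a `tcpdump -e` capture; the sample lines below are hypothetical stand-ins for real output:

```shell
# Count distinct source MACs from `tcpdump -e` output
# (field 2 of each line is the source MAC).
count_src_macs() {
  awk '{ macs[$2] = 1 } END { n = 0; for (m in macs) n++; print n }'
}

# Hypothetical capture: a node's NIC MAC plus its ephemeral boot MAC
printf '%s\n' \
  '12:00:01.000000 52:54:00:aa:bb:01 > ff:ff:ff:ff:ff:ff, ethertype IPv4' \
  '12:00:02.000000 52:54:00:aa:bb:02 > ff:ff:ff:ff:ff:ff, ethertype IPv4' \
  '12:00:03.000000 52:54:00:aa:bb:01 > ff:ff:ff:ff:ff:ff, ethertype IPv4' \
  | count_src_macs    # prints 2 — compare against the port's MAC limit
```

On a live port: sudo tcpdump -e -ni <rack-if> -c 100 | count_src_macs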

Spanning Tree and link delays

Spanning Tree Protocol (STP) prevents loops but introduces startup delays. When ports transition through listening or blocking states, PXE clients may time out before the link is active. Servers should connect to ports configured for immediate forwarding to avoid delay.

Symptom

PXE client times out before link becomes active.

Verify

# observe interface state changes on client
sudo journalctl -b -u systemd-networkd | grep eth
# confirm link activation delay
sudo dmesg | grep eth

Cause

Classic STP holds the port in listening/learning for ~30 seconds before forwarding; PXE firmware often times out first.

Fix

Enable portfast or edge mode on access ports used by servers and PXE clients.

Example fix

# cisco
# interface GigabitEthernet0/3
#  spanning-tree portfast

# juniper
# set protocols rstp interface ge-0/0/3 edge

# verify quick link activation
sudo ethtool <iface> | grep Link
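To measure how long the port actually takes to come up, a minimal polling sketch can time the delay, assuming the kernel's carrier flag at /sys/class/net/<iface>/carrier (the interface name is a placeholder):

```shell
# Poll a carrier flag until it reads 1 or a timeout expires; returns 0 on link-up.
# A rough measurement tool: 1-second granularity is enough to spot a ~30 s STP hold.
wait_for_carrier() {
  flag="$1"; timeout="${2:-45}"; waited=0
  while [ "$waited" -lt "$timeout" ]; do
    if [ "$(cat "$flag" 2>/dev/null)" = "1" ]; then
      echo "link up after ${waited}s"
      return 0
    fi
    sleep 1; waited=$((waited + 1))
  done
  echo "no carrier after ${timeout}s"
  return 1
}

# Usage on real hardware (hypothetical interface name):
# wait_for_carrier /sys/class/net/eno1/carrier 45
```

With portfast/edge enabled, the reported delay should drop to a second or two.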

Mirror/monitor/SPAN sessions

SPAN sessions duplicate traffic for analysis. A misconfigured or oversubscribed session drops or duplicates packets, so captures mislead rather than inform. Accurate visibility requires correct direction settings and adequate buffer capacity.

Symptom

Captures show missing packets or odd duplication.

Verify

# check capture interface for drops
sudo ethtool -S <capture-if> | grep -i drop
# validate both directions visible
sudo tcpdump -ni <capture-if> -c 10 port 67 or 68

Cause

SPAN misconfigured or oversubscribed.

Fix

Ensure capture port mirrors both ingress and egress for correct VLANs and avoid oversubscription.

Example fix

# cisco
# monitor session 1 source vlan <vid> both
# monitor session 1 destination interface gi0/10

# verify capture continuity
sudo tcpdump -ni <capture-if> port 67 or 68

Overlay networks (VXLAN, EVPN, GRE)

Overlay networks encapsulate traffic for tenant isolation or tunneling. Encapsulation increases packet size, often exceeding standard MTU limits. Without sufficient headroom, packets fragment or drop, disrupting large transfers and PXE boot.

Symptom

Image downloads stall; packet fragmentation occurs.

Verify

# check path mtu between rack and node (8972-byte payload + 28 bytes of headers = 9000)
ping -M do -s 8972 -c 3 <rack-ip>

Cause

MTU mismatch across tunnel overlays.

Fix

Raise underlay or overlay MTU, or lower host MTU for consistent operation end-to-end.

Example fix

# verify current mtu
ip link show <iface> | grep mtu

# set mtu appropriately
sudo ip link set dev <iface> mtu 9000

# update maas vlan mtu
maas $PROFILE vlan update <fabric-id> <vid> mtu=9000

# confirm mtu alignment
ip link show <iface> | grep mtu
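The right ping payload depends on the encapsulation in the path. A small helper, using the usual IPv4 overhead figures (VXLAN adds 50 bytes, GRE 24, and ping itself needs 28 bytes for IP + ICMP headers):

```shell
# Compute the largest ping payload that should pass unfragmented for a given
# underlay MTU and overlay type, so MTU tests use the right -s value.
max_ping_payload() {
  underlay_mtu="$1"; overlay="$2"
  case "$overlay" in
    vxlan) overhead=50 ;;   # outer eth 14 + IP 20 + UDP 8 + VXLAN 8
    gre)   overhead=24 ;;   # outer IP 20 + GRE 4
    none)  overhead=0  ;;
    *) echo "unknown overlay: $overlay" >&2; return 1 ;;
  esac
  echo $(( underlay_mtu - overhead - 28 ))
}

max_ping_payload 9000 none    # prints 8972, the jumbo-frame test size used above
max_ping_payload 1500 vxlan   # prints 1422, the largest safe inner ping payload
```

If ping succeeds at the computed size but fails one byte larger, the MTU math is consistent end to end.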

WAN or remote sites

Remote sites often depend on high-latency WAN links to central MAAS controllers. Packet delay and loss affect DHCP, metadata, and image download reliability. Deploying local rack controllers reduces round trips and improves consistency.

Symptom

Commissioning/deployment fails intermittently; long delays.

Verify

# measure latency and packet loss
ping -c 5 <region-ip>
mtr -rwc 10 <region-ip>

Cause

High latency and packet loss to central controllers; proxy/mirror too far away.

Fix

Deploy a local rack controller and configure nodes to use local proxy and mirror.

Example fix

# install and register a local rack controller
sudo snap install maas --channel=3.4/stable
sudo maas init rack --maas-url http://<region-ip>:5240/MAAS/ --secret <shared-secret>

# point maas to local proxy
maas $PROFILE maas set-config name=http_proxy value=http://<local-rack-ip>:3128/

# verify proxy reachability
curl -I http://<local-rack-ip>:3128/
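The latency numbers from mtr translate directly into boot-time cost, because TFTP is a lockstep protocol: one block is acknowledged per round trip. A rough estimate, assuming the 512-byte default block size unless the client negotiates a larger one:

```shell
# Estimate TFTP transfer time over a link: blocks * RTT, since each 512-byte
# block waits for its ACK. Integer arithmetic; result in whole seconds.
tftp_seconds() {
  size_bytes="$1"; rtt_ms="$2"; blksize="${3:-512}"
  blocks=$(( (size_bytes + blksize - 1) / blksize ))
  echo $(( blocks * rtt_ms / 1000 ))
}

tftp_seconds 10485760 1    # ~10 MiB boot image at 1 ms LAN RTT -> 20
tftp_seconds 10485760 80   # same image at 80 ms WAN RTT -> 1638
```

The two results show why a local rack controller matters: the same image that loads in seconds on a LAN takes tens of minutes over a WAN.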

Virtualization labs (KVM, LXD, Multipass)

Virtualized environments often include built-in DHCP services. These interfere with MAAS-controlled subnets when default dnsmasq instances remain active. Disabling host NAT and bridging directly to the physical network prevents DHCP conflicts.

Symptom

Rogue DHCP seen; conflicts with MAAS.

Verify

# identify rogue dhcp servers
sudo nmap --script broadcast-dhcp-discover

Cause

Libvirt/LXD default dnsmasq running on host bridges.

Fix

Disable default NAT networks and bridge host NICs to the real VLAN.

Example fix

# kvm/libvirt
virsh net-list --all
sudo virsh net-destroy default || true
sudo virsh net-autostart default --disable || true

# lxd
sudo lxc network set lxdbr0 ipv4.dhcp false
sudo lxc network set lxdbr0 ipv6.dhcp false

# create linux bridge
sudo nmcli connection add type bridge ifname br0
sudo nmcli connection add type bridge-slave ifname <nic> master br0
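On hosts with many libvirt networks, the `virsh net-list --all` output above can be filtered to show only the ones still active (and therefore still running dnsmasq). A sketch; the here-lines are hypothetical stand-ins for real virsh output:

```shell
# Print the names of active libvirt networks from `virsh net-list --all`
# output (skip the two header lines; field 2 is the state).
active_nets() {
  awk 'NR > 2 && $2 == "active" { print $1 }'
}

# Hypothetical `virsh net-list --all` output
printf '%s\n' \
  ' Name      State     Autostart   Persistent' \
  '--------------------------------------------' \
  ' default   active    yes         yes' \
  ' isolated  inactive  no          yes' \
  | active_nets    # prints: default
```

On a real host: virsh net-list --all | active_nets — any name printed is a candidate for net-destroy.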

Nested virtualization

Nested virtualization allows virtual machines to host other VMs. PXE traffic may fail due to NAT or unsupported virtual NIC models. Bridging and compatible NIC models ensure DHCP and TFTP visibility.

Symptom

PXE boot works inconsistently in nested VMs.

Verify

# confirm traffic passes bridge
sudo tcpdump -ni br0 port 67 or 68

Cause

Host-only interfaces, NAT, or missing PXE support in virtual NIC model.

Fix

Use bridged networking with PXE-capable NIC models.

Example fix

virt-install \
  --name nested-vm \
  --memory 4096 \
  --vcpus 2 \
  --disk size=20 \
  --network bridge=br0,model=virtio \
  --pxe

# verify dhcp visibility
sudo tcpdump -ni br0 port 67 or 68

Security appliances and firewalls

Firewalls and security gateways often block traffic beyond DHCP. If HTTP, proxy, or NTP are filtered, commissioning stops after PXE boot. Verifying port reachability ensures MAAS communication across all layers.

Symptom

Commissioning fails after boot.

Verify

# check service ports from node
nc -zv <region-ip> 5240
curl -I http://<rack-ip>:3128/
chronyc sources

Cause

Firewall permits DHCP but blocks HTTP, proxy, or metadata.

Fix

Open required ports between racks, regions, and nodes.

Example fix

# allow required ports using ufw
sudo ufw allow 80/tcp
sudo ufw allow 443/tcp
sudo ufw allow 5240/tcp
sudo ufw allow 3128/tcp
sudo ufw allow 123/udp

# verify open ports
sudo ss -tuln | grep -E ':(80|443|5240|3128|123)\b'
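From the node side, the same port list can be swept in one pass with nc. A sketch; the host and port arguments are placeholders:

```shell
# Probe a list of TCP ports on a host and report each as open or closed.
# nc -z only tests connectability; a 2-second timeout keeps the sweep quick.
check_ports() {
  host="$1"; shift
  for port in "$@"; do
    if nc -z -w 2 "$host" "$port" 2>/dev/null; then
      echo "$port open"
    else
      echo "$port closed"
    fi
  done
}

# Real usage: check_ports <region-ip> 80 443 5240 3128
check_ports 127.0.0.1 1   # loopback demo; expect "1 closed" unless something listens there
```

Any port reported closed that the firewall rules above should permit points at a filter between the node and the controller.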

Next steps

With these environmental pitfalls in mind, the next section turns to performance troubleshooting, where the network is technically “up,” but throughput, latency, or reliability are degraded.