Network troubleshooting: Environment and topology gotchas
Not all problems come from MAAS itself. The surrounding environment (switches, routers, firewalls, overlays, and virtualization layers) often creates the hardest troubleshooting puzzles. This document catalogs the most common topological “gotchas” and how to recognize, verify, and fix them.
Campus and enterprise networks
Campus and enterprise networks often use large-scale routing and segmentation across multiple VRFs or subnets. When DHCP relay is not configured correctly, discovery traffic never reaches MAAS. Understanding these boundaries is essential for confirming which subnets MAAS can serve directly.
Symptom
DHCP requests never reach MAAS.
Verify
# capture dhcp packets on the rack uplink to confirm none are received
sudo tcpdump -ni <rack-if> port 67 or 68
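# test dhcp relay path from a node subnet: request a lease once (-1)
# without applying it (-sf /bin/true); note that -r would only release a lease
sudo dhclient -v -1 -sf /bin/true <iface>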
Cause
Core routers segment traffic into VRFs; DHCP relay not configured.
Fix
Ensure DHCP relay (IP helper) is set for each VRF or subnet where MAAS nodes live.
Example fix
# verify vlan dhcp settings in maas
maas $PROFILE vlan read <fabric-id> <vid> | jq '.dhcp_on'
# cisco: enable dhcp relay to region/rack subnet
# interface Vlan<vid>
# ip helper-address <rack-or-region-ip>
# juniper: add dhcp relay agent
# set forwarding-options helpers bootp interface vlan.<vid> server <rack-or-region-ip>
# verify relay operation
sudo tcpdump -ni <rack-uplink-if> port 67 or 68
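To decide which subnets need a relay in the first place, it can help to check whether a node address and the rack controller address fall inside the same network. The following is a minimal sketch, not part of MAAS: the function names `ip_to_int`, `same_subnet`, and `relay_needed` are hypothetical, it handles IPv4 only, and it assumes octets are written without leading zeros. Sharing a subnet is used here as a stand-in for sharing a broadcast domain, which is an assumption that does not hold if the same subnet is split across VLANs or VRFs.

```shell
# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
    # Split on dots inside a subshell so the caller's IFS is untouched.
    (
        IFS=.
        set -- $1
        echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
    )
}

# Return 0 if both addresses fall inside the same /prefix network.
same_subnet() {
    local a b mask
    a=$(ip_to_int "$1")
    b=$(ip_to_int "$2")
    mask=$(( (0xFFFFFFFF << (32 - $3)) & 0xFFFFFFFF ))
    [ $(( a & mask )) -eq $(( b & mask )) ]
}

# Print "direct" if the node shares the rack's subnet, "relay" otherwise.
relay_needed() {
    if same_subnet "$1" "$2" "$3"; then
        echo "direct"
    else
        echo "relay"
    fi
}
```

For example, `relay_needed 10.0.2.20 10.0.1.5 24` prints `relay`, which suggests that the VLAN holding 10.0.2.20 needs an IP helper pointing at the rack controller.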
DHCP snooping and source guard
DHCP snooping and source guard protect networks from rogue DHCP servers. These features can block legitimate MAAS rack responses if not configured for trusted ports. When enabled without exceptions, PXE clients may receive no offers or inconsistent responses.
Symptom
PXE offers do not reach nodes, or commissioning fails randomly.
Verify
# on the rack controller interface, capture dhcp offers
sudo tcpdump -ni <rack-if> port 67 or 68
# on the node port, confirm packets are not received
sudo tcpdump -ni <node-if> port 67 or 68
Cause
Switch enforces DHCP snooping or IP source guard, blocking MAAS rack replies.
Fix
Mark rack controller access ports as trusted DHCP ports and relax source guard for expected MACs.
Example fix
# cisco: trust dhcp and disable ip source guard
# interface GigabitEthernet0/1
# ip dhcp snooping trust
# no ip verify source
# juniper: trust dhcp on rack port
# set ethernet-switching-options secure-access-port interface ge-0/0/1 dhcp-trusted
# delete ethernet-switching-options secure-access-port interface ge-0/0/1 ip-source-guard
# verify dhcp handshake
sudo tcpdump -ni <rack-if> port 67 or 68
Port security and MAC limits
Many access switches restrict the number of MAC addresses learned per port. During PXE boot, nodes temporarily use a different MAC for the ephemeral environment. If limits are exceeded or stale entries remain, packets from new MACs are silently dropped.
Symptom
Node never appears in MAAS, despite PXE attempt.
Verify
# capture ephemeral mac address on rack interface
sudo tcpdump -ni <rack-if> ether host <ephemeral-mac>
# view switch mac table for the access port
# show mac address-table interface <port>
Cause
Switchport MAC limit exceeded; ephemeral boot MAC rejected or aged out.
Fix
Increase per-port MAC limit and clear stale entries.
Example fix
# cisco: raise limit and clear mac table
# interface GigabitEthernet0/2
# switchport port-security maximum 10
# switchport port-security aging time 5
# clear mac address-table dynamic interface gi0/2
# juniper: bump mac limit
# set ethernet-switching-options secure-access-port interface ge-0/0/2 mac-limit 10
# verify connectivity restoration
sudo tcpdump -ni <rack-if> ether host <ephemeral-mac>
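To compare what a port actually sees against its configured limit, the distinct source MACs in a capture can be counted. This is a rough sketch under one assumption: the input is `tcpdump -e` text output in which the second field is the source MAC and the third field is `>`. The function names `count_src_macs` and `check_mac_limit` are hypothetical.

```shell
# Count distinct source MACs in `tcpdump -e` text read from stdin.
# Assumes lines look like: <time> <src-mac> > <dst-mac>, ethertype ...
count_src_macs() {
    awk '$3 == ">" { seen[tolower($2)] = 1 }
         END { n = 0; for (m in seen) n++; print n }'
}

# Compare the observed count with the port-security maximum given as $1.
check_mac_limit() {
    local limit count
    limit=$1
    count=$(count_src_macs)
    if [ "$count" -gt "$limit" ]; then
        echo "over limit: $count MACs seen, limit $limit"
    else
        echo "within limit: $count MACs seen, limit $limit"
    fi
}
```

A possible invocation is `sudo tcpdump -eni <node-if> -c 50 | check_mac_limit 1`, run on a capture point that sees the same traffic as the access port.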
Spanning Tree and link delays
Spanning Tree Protocol (STP) prevents loops but introduces startup delays. While a port passes through the listening and learning states it forwards no traffic, so PXE clients may time out before the port starts forwarding. Servers should connect to ports configured for immediate forwarding to avoid the delay.
Symptom
PXE client times out before link becomes active.
Verify
# observe interface state changes on client
sudo journalctl -b -u systemd-networkd | grep eth
# confirm link activation delay
sudo dmesg | grep eth
Cause
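STP holds the port in the listening and learning states for about 30 seconds (two 15-second forward-delay intervals with default timers) before it forwards traffic.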
Fix
Enable portfast or edge mode on access ports used by servers and PXE clients.
Example fix
# cisco
# interface GigabitEthernet0/3
# spanning-tree portfast
# juniper
# set protocols rstp interface ge-0/0/3 edge
# verify quick link activation
sudo ethtool <iface> | grep Link
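Because a port can report carrier while the switch is still not forwarding, one way to estimate the real delay is to time how long it takes until traffic first gets through. The helper below is a sketch only: `seconds_until_success` is a hypothetical name, and it simply retries whatever command it is given about once per second and reports the elapsed time.

```shell
# Retry a command until it succeeds or <timeout> seconds have elapsed.
# Prints the elapsed seconds on success, or "timeout" and returns 1.
seconds_until_success() {
    local timeout start now
    timeout=$1
    shift
    start=$(date +%s)
    while :; do
        if "$@" >/dev/null 2>&1; then
            now=$(date +%s)
            echo $(( now - start ))
            return 0
        fi
        now=$(date +%s)
        if [ $(( now - start )) -ge "$timeout" ]; then
            echo "timeout"
            return 1
        fi
        sleep 1
    done
}
```

Run immediately after bringing the link up, `seconds_until_success 60 ping -c 1 -W 1 <gateway-ip>` gives a figure to compare before and after enabling portfast or edge mode. A result near 30 is consistent with default STP timers, though other causes of delay are possible.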
Mirror/monitor/SPAN sessions
SPAN sessions duplicate traffic for analysis. If misconfigured or oversubscribed, packet loss and duplication can occur during capture. Accurate visibility requires correct direction settings and adequate buffer capacity.
Symptom
Captures show missing packets or odd duplication.
Verify
# check capture interface for drops
sudo ethtool -S <capture-if> | grep -i drop
# validate both directions visible
sudo tcpdump -ni <capture-if> -c 10 port 67 or 68
Cause
SPAN misconfigured or oversubscribed.
Fix
Ensure the capture port mirrors both ingress and egress traffic for the correct VLANs, and avoid oversubscribing the destination port.
Example fix
# cisco
# monitor session 1 source vlan <vid> both
# monitor session 1 destination interface gi0/10
# verify capture continuity
sudo tcpdump -ni <capture-if> port 67 or 68
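To tell whether drops occur during a capture rather than before it, the drop counters can be summed before and after. This sketch assumes `ethtool -S` prints one `name: value` pair per line and that drop counters contain the string "drop" in their name; counter names vary by driver, so treat the total as indicative only. The function name `sum_drop_counters` is hypothetical.

```shell
# Sum every counter whose name contains "drop" from `ethtool -S` output
# read on stdin. Prints 0 if no such counter is present.
sum_drop_counters() {
    awk -F: 'tolower($1) ~ /drop/ { sum += $2 } END { print sum + 0 }'
}
```

One way to use it is to store `sudo ethtool -S <capture-if> | sum_drop_counters` in a variable, run the capture, take a second reading, and subtract the first from the second.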
Overlay networks (VXLAN, EVPN, GRE)
Overlay networks encapsulate traffic for tenant isolation or tunneling. Encapsulation increases packet size, often exceeding standard MTU limits. Without sufficient headroom, packets fragment or drop, disrupting large transfers and PXE boot.
Symptom
Image downloads stall; packet fragmentation occurs.
Verify
# check path mtu between rack and node with fragmentation prohibited;
# 8972 = 9000-byte mtu minus 28 bytes of ipv4 and icmp headers
ping -M do -s 8972 -c 3 <rack-ip> || true
Cause
MTU mismatch across tunnel overlays.
Fix
Raise underlay or overlay MTU, or lower host MTU for consistent operation end-to-end.
Example fix
# verify current mtu
ip link show <iface> | grep mtu
# set mtu appropriately
sudo ip link set dev <iface> mtu 9000
# update maas vlan mtu
maas $PROFILE vlan update <fabric-id> <vid> mtu=9000
# confirm mtu alignment
ip link show <iface> | grep mtu
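The arithmetic behind these MTU choices can be sketched as follows. The function names `ping_payload` and `overlay_inner_mtu` are hypothetical, and the overhead figures assume an IPv4 underlay without VLAN tags, a basic 4-byte GRE header, and no additional options; other encapsulation variants need different values.

```shell
# Largest ICMP payload that fits in one unfragmented IPv4 packet:
# MTU minus 20 bytes of IPv4 header and 8 bytes of ICMP header.
ping_payload() {
    echo $(( $1 - 28 ))
}

# Inner MTU left for tenant traffic after encapsulation over IPv4.
overlay_inner_mtu() {
    local underlay_mtu encap overhead
    underlay_mtu=$1
    encap=$2
    case $encap in
        vxlan) overhead=50 ;;  # outer IPv4 20 + UDP 8 + VXLAN 8 + inner Ethernet 14
        gre)   overhead=24 ;;  # outer IPv4 20 + basic GRE header 4
        *) echo "unknown encapsulation: $encap" >&2; return 1 ;;
    esac
    echo $(( underlay_mtu - overhead ))
}
```

For example, `overlay_inner_mtu 1500 vxlan` prints `1450`, so hosts behind a VXLAN overlay on a 1500-byte underlay would need an MTU of 1450 or lower, and `ping -M do -s "$(ping_payload 1450)"` tests exactly that size.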
WAN or remote sites
Remote sites often depend on high-latency WAN links to central MAAS controllers. Packet delay and loss affect DHCP, metadata, and image download reliability. Deploying local rack controllers reduces round trips and improves consistency.
Symptom
Commissioning/deployment fails intermittently; long delays.
Verify
# measure latency and packet loss
ping -c 5 <region-ip>
mtr -rwc 10 <region-ip>
Cause
High latency and packet loss to central controllers; proxy/mirror too far away.
Fix
Deploy a local rack controller and configure nodes to use local proxy and mirror.
Example fix
# install and register a local rack controller
sudo snap install maas --channel=3.4/stable
sudo maas init rack --maas-url http://<region-ip>:5240/MAAS/ --secret <shared-secret>
# point maas to local proxy
maas $PROFILE maas set-config name=http_proxy value=http://<local-rack-ip>:3128/
# verify proxy reachability
curl -I http://<local-rack-ip>:3128/
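To track latency and loss over time, the summary that `ping` prints can be reduced to two numbers. This sketch assumes the Linux iputils summary format (a line containing "packet loss" and a line starting with "rtt min/avg/max/mdev"); the function name `parse_ping_summary` is hypothetical.

```shell
# Extract packet loss (percent) and average RTT (ms) from ping output
# read on stdin. Prints "n/a" for the RTT when no reply was received.
parse_ping_summary() {
    awk '
        BEGIN { loss = "unknown"; avg = "n/a" }
        /packet loss/ {
            for (i = 1; i <= NF; i++)
                if ($i ~ /%$/) { loss = $i; sub(/%/, "", loss) }
        }
        /^(rtt|round-trip) / {
            split($4, parts, "/")
            avg = parts[2]
        }
        END { printf "loss=%s avg_ms=%s\n", loss, avg }
    '
}
```

A typical use would be `ping -c 5 <region-ip> | parse_ping_summary`, run from the remote site at intervals and compared against the same measurement taken toward a local rack controller.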
Virtualization labs (KVM, LXD, Multipass)
Virtualized environments often include built-in DHCP services. These interfere with MAAS-controlled subnets when default dnsmasq instances remain active. Disabling host NAT and bridging directly to the physical network prevents DHCP conflicts.
Symptom
Rogue DHCP seen; conflicts with MAAS.
Verify
# identify rogue dhcp servers
sudo nmap --script broadcast-dhcp-discover
Cause
Libvirt/LXD default dnsmasq running on host bridges.
Fix
Disable default NAT networks and bridge host NICs to the real VLAN.
Example fix
# kvm/libvirt
virsh net-list --all
sudo virsh net-destroy default || true
sudo virsh net-autostart default --disable || true
# lxd
sudo lxc network set lxdbr0 ipv4.dhcp false
sudo lxc network set lxdbr0 ipv6.dhcp false
# create linux bridge
sudo nmcli connection add type bridge ifname br0
sudo nmcli connection add type bridge-slave ifname <nic> master br0
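After disabling the built-in DHCP services, the nmap discovery output can be filtered to list any server other than the rack controller. This is a sketch that assumes the script output contains lines of the form `Server Identifier: <ip>`; the function name `find_rogue_dhcp` is hypothetical.

```shell
# Print each DHCP server identifier found on stdin that differs from the
# expected rack controller address given as $1 (one address per line).
find_rogue_dhcp() {
    awk -v expected="$1" '
        /Server Identifier:/ {
            ip = $NF
            if (ip != expected && !(ip in seen)) {
                seen[ip] = 1
                print ip
            }
        }
    '
}
```

For example, `sudo nmap --script broadcast-dhcp-discover | find_rogue_dhcp <rack-ip>` should print nothing once only MAAS answers on the segment.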
Nested virtualization
Nested virtualization allows virtual machines to host other VMs. PXE traffic may fail due to NAT or unsupported virtual NIC models. Bridging and compatible NIC models ensure DHCP and TFTP visibility.
Symptom
PXE boot works inconsistently in nested VMs.
Verify
# confirm traffic passes bridge
sudo tcpdump -ni br0 port 67 or 68
Cause
Host-only interfaces, NAT, or missing PXE support in virtual NIC model.
Fix
Use bridged networking with PXE-capable NIC models.
Example fix
virt-install --name nested-vm --memory 4096 --vcpus 2 --disk size=20 --network bridge=br0,model=virtio --pxe
# verify dhcp visibility
sudo tcpdump -ni br0 port 67 or 68
Security appliances and firewalls
Firewalls and security gateways often block traffic beyond DHCP. If HTTP, proxy, or NTP are filtered, commissioning stops after PXE boot. Verifying port reachability ensures MAAS communication across all layers.
Symptom
Commissioning fails after boot.
Verify
# check service ports from node
nc -zv <region-ip> 5240
curl -I http://<rack-ip>:3128/
chronyc sources
Cause
Firewall permits DHCP but blocks HTTP, proxy, or metadata.
Fix
Open required ports between racks, regions, and nodes.
Example fix
# allow required ports using ufw
sudo ufw allow 80/tcp
sudo ufw allow 443/tcp
sudo ufw allow 5240/tcp
sudo ufw allow 3128/tcp
sudo ufw allow 123/udp
# verify open ports (trailing whitespace match avoids hits such as :8000 or :1234)
sudo ss -tuln | grep -E ':(80|443|5240|3128|123)[[:space:]]'
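The rules above only show what the local host allows; checking from the node side confirms that nothing in between filters the traffic. The following sketch probes TCP ports using bash's `/dev/tcp` redirection and the coreutils `timeout` command, both assumed to be available. The function names `probe_tcp` and `probe_ports` are hypothetical, and "unreachable" covers both closed and filtered ports.

```shell
# Report whether a TCP connection to host:port succeeds within 2 seconds.
probe_tcp() {
    local host port
    host=$1
    port=$2
    if timeout 2 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null; then
        echo "open"
    else
        echo "unreachable"
    fi
}

# Probe several TCP ports on one host, printing one result per line.
probe_ports() {
    local host port
    host=$1
    shift
    for port in "$@"; do
        printf '%s:%s %s\n' "$host" "$port" "$(probe_tcp "$host" "$port")"
    done
}
```

Run from a node, `probe_ports <region-ip> 80 443 5240` and `probe_ports <rack-ip> 3128` cover the TCP services listed above. NTP uses UDP and cannot be checked this way, so `chronyc sources` remains the test for it.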
Next steps
With these environmental pitfalls in mind, the next section turns to performance troubleshooting, where the network is technically “up,” but throughput, latency, or reliability are degraded.