Network troubleshooting: Checklists and runbooks

Structured checklists and runbooks keep troubleshooting consistent and repeatable. This document provides intake, failure-specific, and periodic health-check procedures tailored for MAAS.

Intake checklist

Use this checklist to establish a clear network baseline before troubleshooting.

Before you begin, capture key environmental details:

  • Diagram: fabrics, VLANs, and switchport mappings
  • Confirm: DHCP mode (MAAS, relay, external)
  • Firmware: PXE or HTTP network boot enabled and first in the boot order
  • NTP: sources reachable from nodes and racks
  • DNS: MAAS authoritative zones and forwarders configured
  • Proxy: rack proxy reachable on TCP 3128
  • Images: synced and available in MAAS
  • Logs: collect journalctl -u 'snap.maas.*' (quote the pattern so the shell does not expand it)
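
The checklist above can be turned into a repeatable pass/fail run. The following is a minimal sketch of such a runner, not part of MAAS itself; the check function, the failures counter, and the example commands in the comments are illustrative assumptions.

```shell
#!/bin/sh
# Minimal checklist runner (hypothetical helper, not a MAAS tool).
# check LABEL COMMAND [ARGS...] runs the command quietly and prints
# PASS or FAIL next to the label. Failures are tallied in $failures.

failures=0

check() {
    label=$1
    shift
    if "$@" >/dev/null 2>&1; then
        printf 'PASS  %s\n' "$label"
    else
        printf 'FAIL  %s\n' "$label"
        failures=$((failures + 1))
    fi
}

# Example usage (replace the placeholders with your own values):
#   check "rack proxy reachable on TCP 3128" nc -z -w 3 <rack-ip> 3128
#   check "MAAS DNS server responds"         dig +short @<maas-dns-ip> <node>.maas
#   echo "$failures item(s) failed"
```

Keeping each item as a single command makes it easy to add site-specific checks without changing the runner.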

PXE failure runbook

Use this runbook to verify PXE readiness before investigating deeper DHCP or boot issues.

  1. Verify link and VLAN on the client port:
    ip link show
    sudo tcpdump -ni <iface> port 67 or port 68
    
  2. Confirm only MAAS DHCP is active on the VLAN by checking for competing servers, such as a stray dnsmasq:
    pgrep -a dnsmasq
    
  3. Check relay or IP helper configuration on the upstream switch or router.
  4. Ensure the rack controller has DHCP enabled for the VLAN:
    maas $PROFILE vlan read <fabric-id> <vid> | jq '.dhcp_on'
    
  5. Capture DHCP discover/offer/ack packets to confirm the full handshake.
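
To make step 5 easier to read at a glance, the sketch below summarizes which DHCP message types appear in a capture. It assumes tcpdump is run with -v, so that option 53 is printed as a line containing "DHCP-Message" followed by the type name. The function name dhcp_handshake_summary is a hypothetical helper.

```shell
#!/bin/sh
# Report which DHCP message types appear in verbose tcpdump text read
# from stdin. Returns non-zero if any step of the handshake is missing.
# Assumption: tcpdump -v prints option 53 as "DHCP-Message ...: <Type>".

dhcp_handshake_summary() {
    capture=$(cat)
    missing=0
    for type in Discover Offer Request ACK; do
        if printf '%s\n' "$capture" | grep -Eq "DHCP-Message.*: ${type}"; then
            printf '%s: seen\n' "$type"
        else
            printf '%s: MISSING\n' "$type"
            missing=1
        fi
    done
    return "$missing"
}

# Example usage (stops after 20 packets):
#   sudo tcpdump -nvi <iface> -c 20 port 67 or port 68 | dhcp_handshake_summary
```

A Discover with no Offer usually points at the server or relay side, while an Offer with no Request points back at the client or its firmware.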

Commissioning failure runbook

Use this runbook when nodes fail to commission or stop early in the process.

  1. Confirm the kernel and initrd were downloaded successfully.
  2. Test metadata reachability from the ephemeral environment:
    curl -I http://<region-ip>:5240/MAAS/
    
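  3. Check rack logs for DHCP, proxy, or TFTP errors (unit names vary by MAAS version and install method; list them with systemctl list-units 'snap.maas.*'):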
    journalctl -u snap.maas.rackd
    
  4. Validate DNS resolution from the ephemeral environment:
    dig @<maas-dns-ip> rackd.maas
    
  5. Review /var/log/cloud-init.log and /var/log/curtin/install.log for errors.
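
The reachability test in step 2 can be reduced to a one-line verdict. This sketch relies on curl's -w '%{http_code}' output, which reports 000 when no HTTP response was received. The function names classify_http_code and probe_url are hypothetical, and the wording of each verdict is an assumption about what is useful to report.

```shell
#!/bin/sh
# Turn an HTTP status code into a short reachability verdict.

classify_http_code() {
    case $1 in
        2??|3??) echo "reachable" ;;
        000)     echo "no response (routing, firewall, or proxy problem)" ;;
        4??|5??) echo "reachable, but the server returned an error" ;;
        *)       echo "unexpected status: $1" ;;
    esac
}

# Fetch only the status code for a URL, with a 5 second limit.
probe_url() {
    code=$(curl -s -o /dev/null -m 5 -w '%{http_code}' "$1")
    printf '%s -> %s (%s)\n' "$1" "$code" "$(classify_http_code "$code")"
}

# Example usage from the ephemeral environment:
#   probe_url "http://<region-ip>:5240/MAAS/"
```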

Deploy failure runbook (curtin/apt)

Use this runbook to diagnose deployment failures related to proxy or package retrieval.

  1. Confirm proxy configuration:
    apt-config dump | grep -i proxy
    
  2. Test package mirror reachability:
    curl -I http://archive.ubuntu.com/
    
  3. If TLS (SSL) interception is present, ensure the corporate CA is installed in the node's trust store.
  4. Review installer and curtin logs:
    less /var/log/installer/syslog
    less /var/log/curtin/install.log
    
  5. Verify NTP synchronization to prevent clock-skew issues.
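
For step 1, the output of apt-config dump can be narrowed to the proxy URLs themselves. The sketch below assumes the usual dump format of Acquire::http::Proxy "URL"; and skips empty values. The function name list_apt_proxies is a hypothetical helper.

```shell
#!/bin/sh
# Print "<scheme> <proxy-url>" for each non-empty global APT proxy setting
# found in `apt-config dump` output read from stdin.

list_apt_proxies() {
    sed -n 's/^Acquire::\(https\{0,1\}\)::[Pp]roxy "\(..*\)";$/\1 \2/p'
}

# Example usage on the deploying node:
#   apt-config dump | list_apt_proxies
```

If the function prints nothing, no global proxy is configured and APT will try to reach the mirror directly.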

Post-deploy network down runbook

Use this runbook when a deployed system comes up without network connectivity.

  1. Log in to the console and inspect netplan configuration:
    netplan get
    
  2. Review systemd-networkd logs for errors:
    journalctl -u systemd-networkd
    
  3. Compare interface names between MAAS and the node.
  4. Adjust and reapply netplan if necessary:
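    sudo netplan try     # applies with automatic rollback unless confirmed
    sudo netplan apply   # applies immediately, without rollback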
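
The name comparison in step 3 can be sketched as a small set difference. The function name missing_ifaces is hypothetical, and so is the workflow of copying the expected names over from a machine that can still reach MAAS. Matching on MAC addresses would be stricter, but this follows the step as written.

```shell
#!/bin/sh
# Print every interface name that MAAS expects but the node does not have.
#   $1: expected names, one per line (for example, copied from MAAS)
#   $2: names present on the node, one per line

missing_ifaces() {
    printf '%s\n' "$1" | while IFS= read -r name; do
        [ -n "$name" ] || continue
        printf '%s\n' "$2" | grep -Fxq -- "$name" || printf '%s\n' "$name"
    done
}

# Example usage:
#   expected=$(maas $PROFILE interfaces read <system-id> | jq -r '.[].name')  # on a host with MAAS access
#   actual=$(ls /sys/class/net)                                               # on the node console
#   missing_ifaces "$expected" "$actual"
```

Any name printed here is a candidate for a renamed or missing interface in the rendered netplan configuration.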
    

Periodic health-check runbook

Use this runbook to perform regular validation of MAAS and network health.

Run these checks weekly or monthly to maintain reliability:

  • List recent warnings:
    maas $PROFILE events query level=WARNING limit=50
    
  • Confirm rack controllers are online (UI or CLI).
  • Test DHCP offers on each VLAN using tcpdump.
  • Test metadata endpoint from an isolated host.
  • Check NTP offset with chronyc tracking.
  • Inspect rack controller CPU, memory, and disk I/O load with top or iostat.
  • Rotate logs and archive old pcaps or evidence.
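
The NTP offset check lends itself to a scripted threshold. The sketch below reads the "System time" line from chronyc tracking output and compares the reported offset with a limit in seconds. The function name ntp_offset_within and the 0.1 second limit in the example are illustrative assumptions; choose a limit that suits your environment.

```shell
#!/bin/sh
# Read `chronyc tracking` output on stdin and compare the "System time"
# offset (in seconds) with the limit given as $1.
# Exit status: 0 within the limit, 1 above it, 2 if the line was not found.

ntp_offset_within() {
    awk -v max="$1" '
        /^System time/ {
            found = 1
            offset = $4 + 0
            printf "offset=%s s\n", $4
            exit
        }
        END {
            if (!found) exit 2
            exit ((offset > max) ? 1 : 0)
        }'
}

# Example usage:
#   chronyc tracking | ntp_offset_within 0.1 || echo "clock offset needs attention"
```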

Escalation bundle

Use this procedure to collect standard artifacts for escalation to higher-level support.

sudo maas dumpstate  # if available
journalctl -u snap.maas.regiond -u snap.maas.rackd > /tmp/maas-services.log
ip a > /tmp/ip-a.txt
ip r > /tmp/ip-r.txt
bridge vlan show > /tmp/bridge-vlan.txt
resolvectl status > /tmp/resolvectl.txt
dig @<maas-dns-ip> <node>.maas A > /tmp/dns.txt
tar czf /tmp/maas-evidence-$(date +%F).tar.gz /tmp/*.txt /tmp/*.log
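
A variation on the commands above is to collect everything into a dedicated directory, so the archive does not sweep up unrelated .txt or .log files from /tmp, and to keep going when one command fails. This is a minimal sketch; the capture function and the EVIDENCE_DIR variable are hypothetical names.

```shell
#!/bin/sh
# Save the output of each command to its own file under $EVIDENCE_DIR.
# A failing command is reported but does not stop the collection.

EVIDENCE_DIR=${EVIDENCE_DIR:-/tmp/maas-evidence}

capture() {
    name=$1
    shift
    mkdir -p "$EVIDENCE_DIR" || return 1
    if "$@" >"$EVIDENCE_DIR/$name" 2>&1; then
        echo "ok      $name"
    else
        echo "failed  $name (partial output kept)"
    fi
}

# Example usage:
#   capture ip-a.txt ip a
#   capture ip-r.txt ip r
#   capture bridge-vlan.txt bridge vlan show
#   tar czf "/tmp/maas-evidence-$(date +%F).tar.gz" \
#       -C "$(dirname "$EVIDENCE_DIR")" "$(basename "$EVIDENCE_DIR")"
```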

Include an sosreport to capture complete system diagnostics for kernel, storage, and network context:

# install the sos tooling, then collect a report non-interactively
sudo apt install sosreport -y
sudo sos report --batch --tmp-dir /tmp  # on sos releases before 4.0, run "sosreport" instead

Next steps

The next version of this document will include appendices such as a port table, BPF filter cribsheet, switch configuration checklist, known-bads, and lab patterns.