Network troubleshooting: Evidence bundle for escalation

When a network or deployment issue cannot be resolved locally, a complete and consistent evidence bundle accelerates escalation and diagnosis. This document defines what to collect, how to collect it, and how to package it for higher-level support. Note the exact wall-clock time when symptoms start, so that logs and events can be aligned during analysis.

Why evidence matters

Use this section to understand why disciplined evidence collection is critical.

  • Ensures the problem is reproducible and visible to others
  • Prevents back-and-forth about missing details
  • Speeds triage by showing both symptoms and baseline state
  • Supports audits and incident response requirements

Standard evidence set

Use this section to collect all relevant MAAS and system-level data before escalation.

MAAS logs

Collect controller and event logs to capture MAAS activity at the time of failure.

  • Region and rack controller journals:

    journalctl -u snap.maas.regiond -u snap.maas.rackd > /tmp/maas-services.log
    
  • Recent MAAS events (warnings and errors):

    maas $PROFILE events query level=WARNING limit=200 > /tmp/maas-events.json
    
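  • Commissioning and deploy logs (copy these on the affected node; some paths may not exist, depending on how far the deployment got):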

    cp /var/log/cloud-init.log /tmp/
    cp /var/log/cloud-init-output.log /tmp/
    cp /var/log/curtin/install.log /tmp/
    cp /var/log/installer/syslog /tmp/installer-syslog.txt
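
A plain cp stops being informative when a log is absent, and a missing log is itself evidence. The following is a minimal sketch of a copy helper that records which paths were not found. The function name collect_logs and the file missing-logs.txt are illustrative assumptions, not MAAS conventions.

```shell
# Sketch: copy each readable log into a destination directory and record the rest.
# collect_logs and missing-logs.txt are illustrative names, not MAAS conventions.
collect_logs() {
    local dest="$1"; shift
    mkdir -p "$dest"
    local f
    for f in "$@"; do
        if [ -r "$f" ]; then
            # -p keeps the original modification time, which helps correlation
            cp -p "$f" "$dest/"
        else
            printf '%s\n' "$f" >> "$dest/missing-logs.txt"
        fi
    done
}

# Example (nothing runs until the function is called):
#   collect_logs /tmp /var/log/cloud-init.log /var/log/cloud-init-output.log /var/log/curtin/install.log
```

Files keep their original base names, so the installer syslog would still need an explicit copy if it should be renamed.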
    

Timebox and filter

Use these steps to capture only the time window around the failure and focus on relevant lines.

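# mark incident window (note the exact wall-clock times)
date -Is | tee /tmp/incident-start.iso
sleep 120  # reproduce / wait ~2 minutes
date -Is | tee /tmp/incident-end.iso

START=$(cat /tmp/incident-start.iso)
END=$(cat /tmp/incident-end.iso)

# local "YYYY-MM-DD HH:MM:SS" forms of the same instants: older journalctl
# versions may reject ISO 8601 input, and this form matches cloud-init timestamps
SINCE=$(date -d "$START" '+%F %T')
UNTIL=$(date -d "$END" '+%F %T')

# confirm clock and timezone for correlation
timedatectl | tee /tmp/timedatectl.txt

# focus controller journals to the incident window and grep likely culprits
journalctl --since "$SINCE" --until "$UNTIL" -u snap.maas.regiond -u snap.maas.rackd | grep -Ei 'dhcp|tftp|pxe|metadata|proxy|curtin|cloud-init|commission|deploy|dns' > /tmp/maas-services-window.log

# slice node logs to the same window by comparing the leading timestamp of each line
# (assumes cloud-init logs in the host's local timezone; untimestamped lines are dropped)
awk -v s="$SINCE" -v e="$UNTIL" 'substr($0, 1, 19) >= s && substr($0, 1, 19) <= e' /var/log/cloud-init.log | grep -Ei 'error|fail|timeout|metadata|network|dns|ntp' > /tmp/cloud-init-window.log

# curtin's install.log lines are typically not timestamped, so filter the whole file
grep -Ei 'error|fail|timeout|mirror|proxy|apt|dpkg' /var/log/curtin/install.log > /tmp/curtin-window.log

# filter MAAS events by timestamp window
# (assumes the response has the form {"events": [...]} and that "created" looks like
#  "Mon, 01 Jan. 2024 12:00:00" in UTC; adjust the format string if yours differs)
maas $PROFILE events query level=WARNING limit=1000 > /tmp/maas-events.json
jq --argjson s "$(date -d "$START" +%s)" --argjson e "$(date -d "$END" +%s)" '[.events[] | select((.created | strptime("%a, %d %b. %Y %H:%M:%S") | mktime) as $t | $t >= $s and $t <= $e)]' /tmp/maas-events.json > /tmp/maas-events-window.json

# add incident markers to system logs (optional but helpful)
logger -t incident "marker: repro start @ $START"
logger -t incident "marker: repro end   @ $END"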
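
Window slicing is easy to get subtly wrong, so it can help to wrap it in a small function and try it on a synthetic log first. The sketch below assumes log lines that begin with a "YYYY-MM-DD HH:MM:SS" timestamp (as cloud-init.log lines usually do) and keeps untimestamped continuation lines, such as tracebacks, when the preceding timestamped line is inside the window. The name slice_window is hypothetical.

```shell
# Sketch: print the part of a log that falls inside [from, to].
# from/to use the form "YYYY-MM-DD HH:MM:SS", the same as the log's line prefix.
# slice_window is an illustrative name.
slice_window() {
    local from="$1" to="$2" file="$3"
    awk -v s="$from" -v e="$to" '
        # A timestamped line decides whether we are inside the window.
        /^[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9] [0-9][0-9]:[0-9][0-9]:[0-9][0-9]/ {
            ts = substr($0, 1, 19)
            keep = (ts >= s && ts <= e)
        }
        # Untimestamped lines inherit the decision of the line before them.
        keep
    ' "$file"
}

# Example (GNU date converts an ISO 8601 marker to the expected form):
#   slice_window "$(date -d "$START" '+%F %T')" "$(date -d "$END" '+%F %T')" /var/log/cloud-init.log
```

The comparison is a plain string comparison, which orders correctly only because the timestamp format is fixed-width and most-significant-field first.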

System state

Collect baseline system configuration to reveal network and routing context.

  • Interfaces and addresses:

    ip a > /tmp/ip-a.txt
    ip -d link > /tmp/ip-dlink.txt
    
  • Routing table:

    ip r > /tmp/ip-r.txt
    
  • Bridges and VLANs:

    bridge vlan show > /tmp/bridge-vlan.txt
    
  • DNS and resolver state:

    resolvectl status > /tmp/resolvectl.txt
    dig @<maas-dns-ip> <node>.maas A +noall +answer > /tmp/dns.txt
    
  • Listening sockets and services:

    ss -ltnup > /tmp/listeners.txt
    
  • Firewall configuration:

    nft list ruleset > /tmp/nftables.txt
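
Some of these tools may be missing or may need root on a given host. One possible approach, sketched below, is a wrapper that saves each command's output and exit status so that a single failure neither aborts the collection nor goes unnoticed. The names capture, OUT_DIR, and collection-status.txt are assumptions made for this example.

```shell
# Sketch: run a command, save stdout and stderr to a file, and log its exit status.
# capture, OUT_DIR and collection-status.txt are illustrative names.
OUT_DIR=${OUT_DIR:-/tmp}

capture() {
    local name="$1"; shift
    local rc=0
    "$@" > "$OUT_DIR/$name" 2>&1 || rc=$?
    # One tab-separated line per command: exit status, output file, command line.
    printf '%s\t%s\t%s\n' "$rc" "$name" "$*" >> "$OUT_DIR/collection-status.txt"
    return 0
}

# Example (nothing runs until the function is called):
#   capture ip-a.txt ip a
#   capture ip-r.txt ip r
#   capture bridge-vlan.txt bridge vlan show
#   capture nftables.txt nft list ruleset
```

A non-zero status in collection-status.txt tells the reviewer that an empty or short output file reflects a failed command rather than an empty configuration.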
    

Targeted captures

Use packet captures to verify DHCP, TFTP, and HTTP traffic flow during commissioning and PXE boot.

  • DHCP handshake:

    tcpdump -n -e -vv -i <iface> -c 50 -w /tmp/dhcp.pcap 'port 67 or port 68'
    
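  • TFTP and HTTP boot attempts (port 69 carries only the initial TFTP request; the data transfer continues on ephemeral UDP ports):

    tcpdump -n -i <iface> -c 50 -w /tmp/tftp.pcap udp port 69
    tcpdump -n -i <iface> -c 200 -w /tmp/httpboot.pcap tcp port 80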
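
A capture limited only by packet count waits indefinitely if the expected traffic never arrives, which is a common outcome when PXE boot is the thing that is failing. The sketch below adds a time limit with the coreutils timeout command. The helper name bounded_capture and the DRY_RUN variable are hypothetical; with DRY_RUN=1 the function only prints the command it would run.

```shell
# Sketch: stop a capture after COUNT packets or SECONDS seconds, whichever comes first.
# bounded_capture and DRY_RUN are illustrative names.
bounded_capture() {
    local iface="$1" out="$2" count="$3" seconds="$4"; shift 4
    # Remaining arguments are the capture filter; options go before the filter.
    set -- timeout "$seconds" tcpdump -n -i "$iface" -c "$count" -w "$out" "$@"
    if [ "${DRY_RUN:-0}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Example (nothing runs until the function is called):
#   bounded_capture <iface> /tmp/dhcp.pcap 50 120 'port 67 or port 68'
```

When the time limit is reached, timeout exits with status 124; a pcap file with few or no packets is still worth including, because it shows that the traffic did not appear on that interface.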
    

Performance snapshots

Use performance data to detect system-level constraints contributing to network symptoms.

  • CPU, memory, and disk usage:

    top -b -n1 > /tmp/top.txt
    iostat -x 5 3 > /tmp/iostat.txt
    
  • Network throughput:

    iperf3 -c <server> -P 4 -t 20 > /tmp/iperf3.txt
    

Full system diagnostics

Collect an sosreport to include kernel, storage, and network context.

sudo apt install sosreport -y
sudo sosreport --batch --tmp-dir /tmp

Packaging procedure

Use this procedure to assemble and compress the collected data into a single archive for handoff.

tar czf /tmp/maas-evidence-$(date +%F).tar.gz \
  /tmp/maas-services.log /tmp/maas-services-window.log \
  /tmp/maas-events.json /tmp/maas-events-window.json \
  /tmp/cloud-init*.log /tmp/install.log /tmp/curtin-window.log /tmp/installer-syslog.txt \
  /tmp/ip-*.txt /tmp/bridge-vlan.txt /tmp/resolvectl.txt /tmp/dns.txt \
  /tmp/listeners.txt /tmp/nftables.txt /tmp/*.pcap \
  /tmp/top.txt /tmp/iostat.txt /tmp/iperf3.txt \
  /tmp/timedatectl.txt /tmp/incident-*.iso /tmp/sosreport*

Verify the archive contents:

tar tzf /tmp/maas-evidence-$(date +%F).tar.gz | less
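
Since the archive may pass through ticketing systems or file shares, a checksum lets the recipient confirm that it arrived intact. A minimal sketch, assuming the coreutils sha256sum tool is available; the function name write_manifest is illustrative.

```shell
# Sketch: write a SHA-256 checksum file next to the archive.
# write_manifest is an illustrative name.
write_manifest() {
    local archive="$1"
    local dir base
    dir=$(dirname "$archive")
    base=$(basename "$archive")
    # Run inside the archive's directory so the checksum file holds a relative name.
    ( cd "$dir" && sha256sum "$base" > "$base.sha256" )
}

# Example (nothing runs until the function is called):
#   write_manifest /tmp/maas-evidence-$(date +%F).tar.gz
# Recipient, from the directory holding both files:
#   sha256sum -c maas-evidence-<date>.tar.gz.sha256
```

Recording a relative file name keeps the check working after the files are moved to another directory or host.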

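Redact sensitive data (passwords, keys, user information) before sharing. If you redact files after packaging, rebuild the archive from the sanitized copies so the unredacted originals are not handed over.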

Redaction helpers

Use these tools to sanitize or extract data when confidentiality is required.

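  • Mask IP and MAC addresses in packet captures (tcprewrite, from the tcpreplay suite, pseudo-randomizes IP addresses from a seed and overwrites Ethernet addresses; addresses carried inside payloads, such as the DHCP client hardware address, are not rewritten):

    tcprewrite --seed=<number> --enet-smac=<mac> --enet-dmac=<mac> --fixcsum --infile=in.pcap --outfile=out.pcap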
    
  • Extract only DHCP, DNS, or HTTP streams for analysis:

    tshark -r in.pcap -Y 'dhcp or dns or http' -T fields -e frame.time -e ip.src -e ip.dst   # Wireshark releases before 3.0 use 'bootp' in place of 'dhcp'
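
Text logs in the bundle can leak the same details as packet captures. The following is a rough sketch of a sed filter that masks IPv4 addresses, MAC addresses, and a few obvious credential patterns in text files. The function name redact_text and the keyword list are assumptions; the patterns are deliberately crude and will also mask things that merely look like addresses, such as four-part version numbers.

```shell
# Sketch: mask MAC addresses, IPv4 addresses, and simple key=value secrets in text.
# Reads stdin, writes stdout. redact_text and the keyword list are illustrative.
redact_text() {
    sed -E \
        -e 's/([0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2}/XX:XX:XX:XX:XX:XX/g' \
        -e 's/([0-9]{1,3}\.){3}[0-9]{1,3}/x.x.x.x/g' \
        -e 's/([Pp]assword|[Tt]oken|[Ss]ecret)([=:][[:space:]]*)[^[:space:]]+/\1\2REDACTED/g'
}

# Example (nothing runs until the function is called):
#   redact_text < /tmp/cloud-init-window.log > /tmp/cloud-init-window.redacted.log
```

Every address is replaced with the same placeholder, so the redacted log no longer shows which host talked to which. If support needs that correlation, agree on consistent pseudonyms instead of blanket masking.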
    

Escalation handoff

Use this checklist to prepare the final escalation package.

Include the following with your tarball:

  • Brief description of the issue (symptoms, time of first occurrence, actions taken)
  • Affected node or system IDs
  • Relevant fabric and VLAN identifiers
  • Whether the issue is reproducible and under what conditions
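
The checklist above can be turned into a small text file that travels with the tarball. The sketch below writes such a file from environment variables and leaves a visible placeholder for anything not supplied. The function name write_summary, the variable names, and the field labels are illustrative, not a required format.

```shell
# Sketch: write a handoff summary file; unset fields keep a visible placeholder.
# write_summary and the variable names are illustrative.
write_summary() {
    local out="$1"
    cat > "$out" <<EOF
Issue summary: ${SUMMARY:-<symptoms and actions taken>}
First occurrence: ${FIRST_SEEN:-<wall-clock time with timezone>}
Affected nodes: ${NODES:-<node or system IDs>}
Fabric / VLAN: ${FABRIC_VLAN:-<fabric and VLAN identifiers>}
Reproducible: ${REPRODUCIBLE:-<yes or no, and under what conditions>}
EOF
}

# Example (nothing runs until the function is called):
#   FIRST_SEEN="$(cat /tmp/incident-start.iso)" NODES="<system-id>" write_summary /tmp/handoff-summary.txt
```

Placeholders that survive into the written file make it obvious to the sender which checklist items are still unanswered.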

Next steps

After evidence collection, the next appendices provide quick references such as port tables, switch configuration checklists, known bad patterns, and lab network templates.