Network troubleshooting: Mental models

Troubleshooting MAAS networking starts with a mental model of how MAAS uses the network. Provisioning bare metal isn’t like debugging a single host’s connectivity: it depends on a chain of services, spread across multiple layers of the network stack, that all have to line up. This document walks through those layers, the flows between them, and where MAAS fits in.

Layers and flows

MAAS touches nearly every layer of the OSI/TCP‑IP model:

  • Layers 1 and 2 (Physical, Data link): links, switchports, VLANs, bonds, bridges.
  • Layer 3 (Network): subnets, gateways, DHCP, routing, MTU sizing.
  • Layers 4 to 7 (Transport through Application): TFTP, HTTP(S), DNS, NTP, metadata services, apt mirrors.

The life of a MAAS machine

When a machine moves from “Powered off” to “Deployed,” it passes through a sequence of network‑dependent steps:

  1. Boot request: PXE or iPXE/HTTP boot on the access VLAN.
  2. DHCP exchange: Machine requests an address, MAAS provides an IP plus boot parameters (next‑server, bootfile, or iPXE script).
  3. Bootloader fetch: The node retrieves bootloaders or scripts via TFTP or HTTP.
  4. Ephemeral OS boot: The kernel/initrd boots and the ephemeral environment comes up.
  5. Commissioning: The node fetches metadata, runs scripts, and reports results.
  6. Deployment: Curtin/image install begins over HTTP/proxy, configuring storage and network.
  7. Reboot: The node boots into its permanent OS, configured via Netplan or cloud‑init.
  8. Steady state: The node relies on DNS, NTP, package mirrors/proxies, and MAAS API heartbeats.

If any one of these steps fails, the process halts or stalls. The visible symptom often does not point at the real cause: for example, a PXE timeout may actually be caused by a switchport VLAN mismatch.
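The lifecycle above can be written down as an ordered table, which makes "where is the machine stuck?" a simple lookup. The following is a minimal Python sketch and not part of MAAS: `Stage`, `LIFECYCLE`, and `stuck_stage` are hypothetical names, and the stage labels paraphrase the numbered list rather than reproduce MAAS status strings.

```python
from typing import NamedTuple, Optional, Tuple


class Stage(NamedTuple):
    name: str
    depends_on: Tuple[str, ...]


# Stage names and dependencies paraphrase the numbered list above;
# they are illustrative labels, not MAAS status strings.
LIFECYCLE: Tuple[Stage, ...] = (
    Stage("Boot request", ("access VLAN", "PXE or iPXE/HTTP boot")),
    Stage("DHCP exchange", ("DHCP address", "boot parameters")),
    Stage("Bootloader fetch", ("TFTP", "HTTP")),
    Stage("Ephemeral OS boot", ("kernel/initrd",)),
    Stage("Commissioning", ("metadata service",)),
    Stage("Deployment", ("HTTP/proxy",)),
    Stage("Reboot", ("Netplan or cloud-init config",)),
    Stage("Steady state", ("DNS", "NTP", "package mirrors/proxies", "MAAS API")),
)


def stuck_stage(last_completed: Optional[str]) -> Optional[Stage]:
    """Return the stage after the last one known to have completed.

    Pass None if nothing has completed yet. Returns None when the final
    stage has completed, and raises ValueError for an unknown name.
    """
    if last_completed is None:
        return LIFECYCLE[0]
    names = [stage.name for stage in LIFECYCLE]
    if last_completed not in names:
        raise ValueError(f"unknown stage: {last_completed!r}")
    index = names.index(last_completed) + 1
    return LIFECYCLE[index] if index < len(LIFECYCLE) else None
```

For example, `stuck_stage("DHCP exchange")` returns the "Bootloader fetch" stage, which suggests checking TFTP and HTTP before anything else.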

MAAS components and their roles

MAAS itself has distinct pieces, each tied to network flows:

  • Region controller:

    • Hosts the API, database, metadata service, and image repository.
    • Provides the UI and API endpoints that machines contact during commissioning.
  • Rack controller:

    • Manages DHCP, TFTP, and PXE boot services for machines in assigned fabrics.
    • Proxies HTTP(S) for image downloads.
    • Acts as the first point of contact for commissioning and deployed machines.

The region and rack controllers need unobstructed Layer 3 reachability to each other, and each rack controller must be able to reach every VLAN it manages for DHCP and PXE.
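One quick way to confirm the region-to-rack reachability described above is a TCP connect test run from one controller's host against the other. The sketch below is a minimal Python example and not part of MAAS: `tcp_reachable` is a hypothetical helper, and the hostname and port in the usage note are assumptions to replace with your own values. It only shows that a TCP handshake succeeds; it says nothing about DHCP or PXE traffic, which relies on broadcast and has to be verified on the VLAN itself.

```python
import socket


def tcp_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to (host, port) can be opened.

    Any socket-level failure (refused, timed out, unresolvable name,
    no route) is reported as False rather than raised.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

A call such as `tcp_reachable("region.example.com", 5240)` would test the region controller's API port, assuming the commonly used port 5240 and a placeholder hostname; check the actual address and port for your installation.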

Why this model helps

When you troubleshoot, don’t just ask “why didn’t DHCP work?” Ask where in the lifecycle the machine is stuck, then map it back to the network service at that layer. For example:

  • Stuck at “PXE boot not found”: suspect Layer 2 VLAN mis‑tagging first.
  • Stuck after “Loading kernel…”: suspect an HTTP reachability problem.
  • Stuck after “Commissioning”: suspect a metadata or DNS failure.
  • Deployed but without network: suspect an incorrectly rendered Netplan configuration, or DHCP disabled on a VLAN that needs it.

By keeping this mental model in mind, you can move systematically instead of guessing.
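The example mappings in this section can also be kept as a small lookup table, so that an observed symptom leads to a first thing to check. This is a minimal Python sketch under stated assumptions: `TRIAGE` and `triage` are hypothetical names, the phrases and suspects come from the examples above, and the layer labels on the last three rows are inferred from the layer list rather than stated in the examples.

```python
from typing import Optional, Tuple

# (phrase to look for, layer, first thing to suspect).
# Phrases and suspects come from the examples above; the layer labels on
# the last three rows are inferred from the layer list, not stated there.
TRIAGE: Tuple[Tuple[str, str, str], ...] = (
    ("pxe boot not found", "data link", "VLAN mis-tagging"),
    ("loading kernel", "transport/application", "HTTP reachability"),
    ("commissioning", "transport/application", "metadata or DNS failure"),
    ("networkless", "network", "Netplan rendering or DHCP disabled on a VLAN"),
)


def triage(observation: str) -> Optional[Tuple[str, str]]:
    """Return (layer, suspect) for the first matching phrase, else None."""
    text = observation.lower()
    for phrase, layer, suspect in TRIAGE:
        if phrase in text:
            return layer, suspect
    return None
```

The table gives a starting point only. A match tells you where to look first, not what is actually wrong.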

Next steps

With this model in place, the next document covers “Quickstart: 10 high‑confidence fixes in 10 minutes,” to give you the fastest path to solving the most common problems.