General MAAS Server VLAN setup questions and help with figuring out how to debug problems

Hi folks,

I am trying to build 1x Physical MAAS Server, and 5x Physical nodes for a test OpenStack deployment. But I am running into intermittent problems and I can’t pinpoint exactly what the issues are.

My setup is MAAS 3.1 installed via snap on Ubuntu 20.04.

I am using 2x VLANs on physical switches: one subnet with the BMC / iDRAC IPs and one PXE boot VLAN. The MAAS server has 2x NICs, one in each subnet. Eth0 connects to the internet side of my network, and Eth1 is where the nodes should get their PXE boot instructions from. The PXE VLAN / fabric is set to use the MAAS proxy and to use MAAS as its DNS server. The PXE subnet/VLAN is allowed to use the DNS, and DNSSEC is off. DNS forwarding is set up and works (intermittently of course, which is part of the problem).

I have successfully “commissioned” and “deployed” Ubuntu 20.04 to 4 of those nodes at various times, but just not consistently. I’m doing something wrong but not sure what it is.

(a.) Sometimes “Commissioning” would fail with a “failed installing dependencies” error when running “20-maas-01-install-lldpd”. When I SSHed into the node, “apt update” would work fine, but the logs would show errors “fetching” from http://archive.ubuntu.org during commissioning.

(b.) During “deployment”, it would sometimes fail and the logs would show “Connection failed [IP: 172.27.41.2 8000]”, but not consistently. Sometimes deploying a second time would work without a hitch.

(c.) IPv6 DNS requests seem to have been a problem sometimes. I disabled IPv6 on the MAAS server, which made “nslookup” happier.

(d.) On my firewall I sometimes see attempts at direct connections from the node being “commissioned” or “deployed” to external DNS servers on port 53, even though, as far as I can tell, the nodes should be using MAAS as their DNS, with MAAS using 172.27.10.2 as a forwarder. I also see the nodes trying to access Ubuntu servers directly, bypassing the MAAS proxy for updates.
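One way to put numbers on an intermittent failure like (b.) is to probe the proxy port repeatedly from a machine on the PXE VLAN and count how often the TCP connection fails. The following is only a minimal sketch: the function name `probe` and the default attempt count, delay and timeout are assumptions chosen for illustration, and the address and port in the usage note are simply the ones from the error message in (b.).

```python
import socket
import time


def probe(host, port, attempts=20, delay=1.0, timeout=3.0):
    """Try to open a TCP connection `attempts` times; return (ok, failed)."""
    ok = failed = 0
    for _ in range(attempts):
        try:
            # create_connection completes the TCP handshake or raises OSError.
            with socket.create_connection((host, port), timeout=timeout):
                ok += 1
        except OSError:
            # Covers refused connections, timeouts and unreachable networks.
            failed += 1
        time.sleep(delay)
    return ok, failed
```

Calling `probe("172.27.41.2", 8000)` during a commissioning run would return a pair such as successes and failures. A non-zero failure count would point at the network path or the proxy service rather than at the node image, though a clean run does not rule those out.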

I have 3 questions anyway:

  1. Looking at the setup I have in the diagram below, am I doing something wrong with my gateway, nameserver and VLAN settings? Should the nodes be connected to “trunked” switchports with a native VLAN set? In my scenario, should my MAAS server have a VLAN interface set to 2741 on eth1?

  2. Fabrics, subnets, VLANs and “untagged” in MAAS all seem to mean something different from what my intuition and past experience in other areas suggest. Does anyone have pointers to a good tutorial or explainer with examples in different scenarios? I’ve searched but I’m coming up empty…

  3. Can anyone point me to a “how to debug a failed x, y, z” blog post for MAAS? For example, knowing which logfiles are best to check first (and where they are) would be a great starting place.

Thanks,

Network diagram with settings etc. The only DHCP server in play is on VLAN 2741, for PXE booting.

Subnets:

FABRIC    VLAN      DHCP           SUBNET          AVAILABLE IPS  SPACE
fabric-0  untagged  No DHCP        172.27.40.0/24  96%            space_MaaS_Network
fabric-1  untagged  MAAS-provided  172.27.41.0/24  54%            space_MaaS_Network

Hi @duibhneach, thanks for the detailed explanation of your setup.

Could you please confirm that no other DNS/DHCP/proxy service is running on the machine hosting MAAS?

With regard to debugging MAAS, log files for the snap install are under /var/snap/maas/common/logs, which contains logs for all services as well as rsyslog logs for deployed hosts. Specifically, you can start by looking at regiond.log and rackd.log to see if MAAS is reporting any issue.
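As a first pass over those two files, something like the sketch below could pull out the lines worth reading. It is only an illustration under a few assumptions: the directory is the one quoted in this reply and may differ on your install, the function name `find_problem_lines` is made up, and the keyword list is a guess at a useful filter rather than anything MAAS defines.

```python
from pathlib import Path

# Directory as quoted in this thread; adjust it if your install differs.
LOG_DIR = Path("/var/snap/maas/common/logs")
LOG_NAMES = ("regiond.log", "rackd.log")
# Assumption: a case-insensitive match on these words is a reasonable first filter.
KEYWORDS = ("error", "critical", "traceback")


def find_problem_lines(log_dir=LOG_DIR, names=LOG_NAMES, keywords=KEYWORDS):
    """Yield (file name, line number, text) for lines containing any keyword."""
    for name in names:
        path = Path(log_dir) / name
        if not path.is_file():
            continue  # skip logs that do not exist on this host
        with path.open(errors="replace") as handle:
            for number, line in enumerate(handle, start=1):
                lowered = line.lower()
                if any(keyword in lowered for keyword in keywords):
                    yield name, number, line.rstrip("\n")
```

Looping over `find_problem_lines()` and printing each tuple gives a short list to compare against the time of a failed commissioning or deployment.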

In your network diagram, 172.27.42.2 is mentioned, is that a typo for .41.2?

Gaaah, yes, that is a typo. The DNS server for the nodes is 172.27.41.2 (the MAAS server) and the proxy server is 172.27.41.2 as well. The gateway is 172.27.41.1 on the L3 switch.
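A quick way to catch this kind of slip is to check each configured address against the PXE subnet. This is just a small sketch using Python's standard `ipaddress` module; the addresses are the ones quoted in this thread, and the helper name `in_subnet` and the labels are made up for illustration.

```python
import ipaddress

# Values quoted in this thread.
PXE_SUBNET = ipaddress.ip_network("172.27.41.0/24")
ADDRESSES = {
    "MAAS DNS and proxy": "172.27.41.2",
    "gateway on the L3 switch": "172.27.41.1",
    "address shown in the diagram": "172.27.42.2",
}


def in_subnet(address, network=PXE_SUBNET):
    """Return True if `address` belongs to `network`."""
    return ipaddress.ip_address(address) in network


for label, address in ADDRESSES.items():
    print(f"{label}: {address} in {PXE_SUBNET} -> {in_subnet(address)}")
```

The first two addresses come back as inside 172.27.41.0/24 and the one from the diagram does not, which is consistent with it being a typo.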

I have seen examples where the MAAS server is the gateway for these other subnets and iptables is used to NAT internet access for the nodes. I didn’t do this and tried instead to use the inter-VLAN routing on the L3 switch. This generally works, I think, but I have a feeling it is not best practice?

The machine hosting MAAS is 100% new install of Ubuntu Server 20.04 with the bare minimum installed, so definitely no other DNS, Proxy or DHCP running on it. MAAS installed via snap.

I don’t see anything wrong with the setup you described.
Could you please check the MAAS logs mentioned above, and also the log for the machine you’re trying to deploy under rsyslog/$hostname/$date/messages, in case they show errors or relevant info?

Thanks for your ideas up to now. I am getting a better idea of it now. I have gone through some of the logs on a new “Commission” (which failed installing dependencies for “smartctl-validate”) and looking at the rsyslog for it I have a timeline like this:

Shortly after loading ephemeral, loading snap service succeeds, time and date starts etc.:

2022-01-25T02:06:07+00:00 fair-mammal systemd[1]: snap.lxd.activate.service: Succeeded.
2022-01-25T02:06:07+00:00 fair-mammal systemd[1]: Finished Service for snap application lxd.activate.
2022-01-25T02:06:07+00:00 fair-mammal systemd[1]: Started snap.lxd.hook.configure.32c6306c-897f-4995-808b-b0762ebf69a4.scope.
2022-01-25T02:06:07+00:00 fair-mammal systemd[1]: snap.lxd.hook.configure.32c6306c-897f-4995-808b-b0762ebf69a4.scope: Succeeded.
2022-01-25T02:06:07+00:00 fair-mammal dbus-daemon[1720]: [system] Activating via systemd: service name='org.freedesktop.timedate1' unit='dbus-org.freedesktop.timedate1.service' requested by ':1.12' (uid=0 pid=1733 comm="/usr/lib/snapd/snapd ")
2022-01-25T02:06:07+00:00 fair-mammal systemd[1]: Starting Time & Date Service...
2022-01-25T02:06:07+00:00 fair-mammal dbus-daemon[1720]: [system] Successfully activated service 'org.freedesktop.timedate1'
2022-01-25T02:06:07+00:00 fair-mammal systemd[1]: Started Time & Date Service.
2022-01-25T02:06:11+00:00 fair-mammal systemd[1]: dmesg.service: Succeeded.

A few seconds later, something tries to connect directly to the internet (IP 91.189.94.10). I assume this fails because of our firewall:

2022-01-25T02:06:12+00:00 fair-mammal pollinate[1729]: WARNING: Network communication failed [28] ...
*   Trying 91.189.94.10:443.

A few seconds after that network failure, the proxy is set and we get what seems like a successful apt-get update:

2022-01-25T02:06:49+00:00 fair-mammal systemd-timesyncd[2545]: Initial synchronization to time server 172.27.41.2:123 (172.27.41.2).
2022-01-25T02:06:49+00:00 fair-mammal cloud-init[2514]: Cloud-init v. 21.4-0ubuntu1~20.04.1 running 'modules:config' at Tue, 25 Jan 2022 02:06:48 +0000. Up 68.00 seconds.
2022-01-25T02:06:49+00:00 fair-mammal cloud-init[2514]: Begin run command: snap set system proxy.http="http://172.27.41.2:8000/" proxy.https="http://172.27.41.2:8000/"
2022-01-25T02:06:49+00:00 fair-mammal cloud-init[2514]: End run command: exit(0)
2022-01-25T02:06:49+00:00 fair-mammal systemd[1]: Finished Apply the settings specified in cloud-config.
2022-01-25T02:06:49+00:00 fair-mammal systemd[1]: Starting Execute cloud user/final scripts...
2022-01-25T02:06:49+00:00 fair-mammal cloud-init[2557]: Hit:1 http://archive.ubuntu.com/ubuntu focal InRelease
2022-01-25T02:06:49+00:00 fair-mammal cloud-init[2557]: Get:2 http://archive.ubuntu.com/ubuntu focal-updates InRelease [114 kB]

A few minutes later it successfully installs the packages for lldpd as far as I can tell, but fails at the end for smartctl-validate:

2022-01-25T02:09:06+00:00 fair-mammal cloud-init[2557]: Installing apt packages for 20-maas-01-install-lldpd (id: 1198, script_version_id: 1)
2022-01-25T02:09:06+00:00 fair-mammal cloud-init[2557]: Starting 20-maas-01-install-lldpd (id: 1198, script_version_id: 1)
2022-01-25T02:09:06+00:00 fair-mammal cloud-init[2557]: Finished 20-maas-01-install-lldpd (id: 1198, script_version_id: 1): 0
2022-01-25T02:09:06+00:00 fair-mammal cloud-init[2557]: Starting 20-maas-02-dhcp-unconfigured-ifaces (id: 1199, script_version_id: 2)
2022-01-25T02:09:06+00:00 fair-mammal cloud-init[2557]: Finished 20-maas-02-dhcp-unconfigured-ifaces (id: 1199, script_version_id: 2): 0
2022-01-25T02:09:06+00:00 fair-mammal cloud-init[2557]: Starting 40-maas-01-machine-resources (id: 1201, script_version_id: 4)
2022-01-25T02:09:06+00:00 fair-mammal cloud-init[2557]: Finished 40-maas-01-machine-resources (id: 1201, script_version_id: 4): 0
2022-01-25T02:09:06+00:00 fair-mammal cloud-init[2557]: Starting 50-maas-01-commissioning (id: 1202, script_version_id: 5)
2022-01-25T02:09:06+00:00 fair-mammal cloud-init[2557]: Finished 50-maas-01-commissioning (id: 1202, script_version_id: 5): 0
2022-01-25T02:09:06+00:00 fair-mammal cloud-init[2557]: Starting maas-capture-lldpd (id: 1209, script_version_id: 12)
2022-01-25T02:09:06+00:00 fair-mammal cloud-init[2557]: Starting maas-get-fruid-api-data (id: 1206, script_version_id: 9)
2022-01-25T02:09:06+00:00 fair-mammal cloud-init[2557]: Starting maas-kernel-cmdline (id: 1207, script_version_id: 10)
2022-01-25T02:09:06+00:00 fair-mammal cloud-init[2557]: Starting maas-list-modaliases (id: 1205, script_version_id: 8)
2022-01-25T02:09:06+00:00 fair-mammal cloud-init[2557]: Starting maas-lshw (id: 1204, script_version_id: 7)
2022-01-25T02:09:06+00:00 fair-mammal cloud-init[2557]: Starting maas-serial-ports (id: 1208, script_version_id: 11)
2022-01-25T02:09:06+00:00 fair-mammal cloud-init[2557]: Starting maas-support-info (id: 1203, script_version_id: 6)
2022-01-25T02:09:06+00:00 fair-mammal cloud-init[2557]: Finished maas-serial-ports (id: 1208, script_version_id: 11): 0
2022-01-25T02:09:06+00:00 fair-mammal cloud-init[2557]: Finished maas-list-modaliases (id: 1205, script_version_id: 8): 0
2022-01-25T02:09:06+00:00 fair-mammal cloud-init[2557]: Finished maas-support-info (id: 1203, script_version_id: 6): 0
2022-01-25T02:09:06+00:00 fair-mammal cloud-init[2557]: Finished maas-get-fruid-api-data (id: 1206, script_version_id: 9): 0
2022-01-25T02:09:06+00:00 fair-mammal cloud-init[2557]: Finished maas-kernel-cmdline (id: 1207, script_version_id: 10): 0
2022-01-25T02:09:06+00:00 fair-mammal cloud-init[2557]: Finished maas-lshw (id: 1204, script_version_id: 7): 0
2022-01-25T02:09:06+00:00 fair-mammal cloud-init[2557]: Finished maas-capture-lldpd (id: 1209, script_version_id: 12): 0
2022-01-25T02:09:06+00:00 fair-mammal cloud-init[2557]: Installing apt packages for smartctl-validate (id: 1211, script_version_id: 13)
2022-01-25T02:09:07+00:00 fair-mammal cloud-init[2557]: Failed installing package(s) for smartctl-validate (id: 1211, script_version_id: 13)
2022-01-25T02:09:07+00:00 fair-mammal cloud-init[2557]: 1 test scripts failed to run
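With this many lines it is easy to miss the one script that failed, so a small filter over the rsyslog messages file can help. The sketch below is only an illustration: the regular expressions are modelled on the lines quoted above, the function name `failed_scripts` is made up, and it assumes the trailing number on a "Finished" line is the script's exit status.

```python
import re

# Patterns modelled on the cloud-init lines quoted in this post.
FINISHED = re.compile(
    r"Finished (?P<name>\S+) \(id: \d+, script_version_id: \d+\): (?P<code>-?\d+)"
)
FAILED_INSTALL = re.compile(r"Failed installing package\(s\) for (?P<name>\S+) \(id: \d+")


def failed_scripts(lines):
    """Return {script name: reason} for scripts that did not finish cleanly."""
    failures = {}
    for line in lines:
        match = FAILED_INSTALL.search(line)
        if match:
            failures[match["name"]] = "package install failed"
            continue
        match = FINISHED.search(line)
        # Assumption: the trailing number is the exit status, 0 meaning success.
        if match and match["code"] != "0":
            failures[match["name"]] = f"exit code {match['code']}"
    return failures


sample = [
    "2022-01-25T02:09:06+00:00 fair-mammal cloud-init[2557]: Finished maas-lshw (id: 1204, script_version_id: 7): 0",
    "2022-01-25T02:09:07+00:00 fair-mammal cloud-init[2557]: Failed installing package(s) for smartctl-validate (id: 1211, script_version_id: 13)",
]
print(failed_scripts(sample))  # {'smartctl-validate': 'package install failed'}
```

Passing an open messages file instead of `sample` works the same way, since the function only iterates over lines.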

The proxy access.log has a TCP_MISS / 304 for the first line but is fine after that:

1643076409.862 36 172.27.41.196 TCP_MISS/304 370 GET http://archive.ubuntu.com/ubuntu/dists/focal/InRelease - HIER_DIRECT/91.189.88.152 -
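To line proxy entries up with the rsyslog timeline, it helps to split the line into fields and convert the epoch timestamp. This is a minimal sketch that assumes the MAAS proxy writes Squid's native access.log layout, which the TCP_MISS and HIER_DIRECT tokens suggest; the function name `parse_access_line` and the dictionary keys are made up for illustration.

```python
from datetime import datetime, timezone


def parse_access_line(line):
    """Split one Squid native-format access.log line into named fields."""
    (timestamp, elapsed_ms, client, result, size,
     method, url, user, hierarchy, content_type) = line.split()
    cache_result, http_status = result.split("/")
    return {
        # The first field is seconds since the Unix epoch.
        "time": datetime.fromtimestamp(float(timestamp), tz=timezone.utc),
        "elapsed_ms": int(elapsed_ms),
        "client": client,
        "cache_result": cache_result,
        "http_status": int(http_status),
        "bytes": int(size),
        "method": method,
        "url": url,
        "hierarchy": hierarchy,
    }


entry = parse_access_line(
    "1643076409.862 36 172.27.41.196 TCP_MISS/304 370 GET "
    "http://archive.ubuntu.com/ubuntu/dists/focal/InRelease - HIER_DIRECT/91.189.88.152 -"
)
print(entry["time"].strftime("%Y-%m-%dT%H:%M:%S"), entry["cache_result"], entry["http_status"])
# 2022-01-25T02:06:49 TCP_MISS 304
```

The converted time matches the 02:06:49 "Hit:1" line from apt in the rsyslog output. HTTP 304 means "Not Modified", so this first entry looks more like a normal revalidation of InRelease than an error, although that alone does not explain the later package install failure.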

Then we have a TCP_TUNNEL to snapcraft in the access.log, but I assume that is transparent to the firewall. So it seems to be correct in general, but something like a cache miss in the proxy may be causing that script to fail to install its dependencies the first time.

Either way, I have kicked off another round of commissioning and will check it tomorrow!

@duibhneach, did you get this working, by chance?