Cannot Commission with 2 Separate Interfaces


#1

Hi Everyone,

I am running into a really odd problem here. Below is the network setup:

NETWORK 1 (10.X.X.X)
Internet access
DHCP/DNS Managed upstream
Unmanaged by MaaS
PXE for A and B Servers
BMC for A Servers

NETWORK 2 (192.168.0.X)
Completely private network
DHCP/DNS is managed by MaaS
BMC of B Servers

The cluster and region controllers are on the same node, which has 2 NICs: one on NETWORK 1 and one on NETWORK 2. When I set up DHCP only for NETWORK 2, I can see all of the BMCs of the B Servers populate in the subnet in MaaS. However, as soon as I attempt to add and commission a B Server, the list of IPs for the subnet disappears and a huge blank space appears in its place (see the attached screenshot), and the B Server cannot commission. It also cannot enter Rescue Mode, although I can see it power on and off.

I then tried to add “A Servers”, and they could not commission either. After reinstalling MaaS I am able to commission and deploy “A Servers”, but as soon as I add one of the “B Servers”, none of the nodes can commission. Only another reinstall of MaaS gets the “A Servers” commissioning again; disabling interfaces on the MaaS server and removing every reference to and setting for the “B Servers” is not enough.


#2

Uploading the network schematic. Orange is the BMC connections (using IPMI) and blue is data/PXE.

[Attached image: Capture]


#3

Additional info.

I turned off all MaaS DHCP services and ran my own isc-dhcp-server on the same MaaS node, so MaaS is no longer managing DHCP. I can now get the A Servers working and commissioning fine. I was able to bring up a B Server only once, and now it will no longer commission. Looking at the console of the B Server, I found that PXE immediately grabs an IP and boots into the ephemeral kernel, but the kernel then cannot get an IP via DHCP. This does not make sense: if DHCP worked for PXE, it should work again, since it is the same interface. The bug linked below describes the issue I am seeing:

https://bugs.launchpad.net/ubuntu/+source/klibc/+bug/1327412

However, according to the squashfs manifest found here:

https://images.maas.io/ephemeral-v3/daily/bionic/amd64/20190828/squashfs.manifest

for bionic, klibc had been updated to “klibc-utils 2.0.4-9ubuntu2”. Someone else also mentioned a secondary issue regarding “portfast” on the switch, which I will look into to see if it helps.
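
Since DHCP is now served by isc-dhcp-server on the MaaS node, one more thing worth checking is how many leases the server currently holds. Below is a minimal Python sketch, assuming the leases file uses the usual `lease <ip> { ... binding state <state>; ... }` layout and that later entries for the same IP supersede earlier ones. The `active_leases` helper and the sample text are made up for illustration; on Ubuntu the real file is normally `/var/lib/dhcp/dhcpd.leases`, but treat that path as an assumption for your install.

```python
import re

# One "lease <ip> { ... }" entry; the body is everything up to the first "}".
LEASE_RE = re.compile(r"lease\s+(\d+\.\d+\.\d+\.\d+)\s*\{(.*?)\}", re.DOTALL)
# The current state line starts with "binding state"; "next binding state"
# lines are deliberately not matched.
STATE_RE = re.compile(r"^\s*binding state\s+(\w+);", re.MULTILINE)


def active_leases(text):
    """Return the set of IPs whose most recent entry is in state 'active'."""
    latest = {}
    for ip, body in LEASE_RE.findall(text):
        match = STATE_RE.search(body)
        if match:
            latest[ip] = match.group(1)  # later entries overwrite earlier ones
    return {ip for ip, state in latest.items() if state == "active"}


# Hypothetical sample data, not taken from a real server.
SAMPLE = """
lease 192.168.0.10 {
  starts 4 2019/08/29 10:00:00;
  ends 4 2019/08/29 10:10:00;
  binding state active;
  next binding state free;
  hardware ethernet 00:11:22:33:44:55;
}
lease 192.168.0.11 {
  binding state free;
}
lease 192.168.0.10 {
  binding state free;
}
lease 192.168.0.12 {
  binding state active;
}
"""

if __name__ == "__main__":
    print(sorted(active_leases(SAMPLE)))
```

To run it against a real server, read the leases file into a string and pass it to `active_leases`; comparing the count with the size of the configured range shows how close the pool is to being full.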


#4

Found the issue. Our DHCP pool was too small, which is why DHCP was failing.
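
For anyone hitting the same thing, here is a rough Python sketch of the sanity check that would have caught it: compare the number of addresses in the DHCP range with the number of leases you expect to be held at once. The range, the client count, and the `leases_per_client` parameter are all hypothetical values for illustration, not our actual configuration.

```python
import ipaddress


def pool_size(first, last):
    """Number of addresses in an inclusive DHCP range."""
    start = ipaddress.IPv4Address(first)
    end = ipaddress.IPv4Address(last)
    if end < start:
        raise ValueError("range end is below range start")
    return int(end) - int(start) + 1


def pool_is_sufficient(first, last, clients, leases_per_client=1):
    """True if the range can cover every expected lease at the same time.

    leases_per_client is an assumption you have to choose yourself, e.g. 2
    if each machine may hold a BMC lease and a PXE lease simultaneously.
    """
    return pool_size(first, last) >= clients * leases_per_client


if __name__ == "__main__":
    # Hypothetical range of 10 addresses and 8 machines with 2 leases each.
    print(pool_size("192.168.0.100", "192.168.0.109"))
    print(pool_is_sufficient("192.168.0.100", "192.168.0.109",
                             clients=8, leases_per_client=2))
```

Keep in mind that leases stay reserved until they expire, so a pool can run out even if only a few machines are powered on at any given moment.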