Rackd + HA

vasartori · 5 April 2024 14:53

Hello all,

I’m running some tests with MAAS in HA.

My test environment:

2x regiond (connected to a non-HA database, which is fine for now)
2x rackd

I followed this guide to set up HA mode.

The VLAN is configured to use both rackd instances, as follows:

While “maas-rackd-01” is running, everything works fine. If I do a hard shutdown of rackd-01, PXE boot stops working.

The only log line I see on rackd-02 is:

2024-04-05 14:38:13 provisioningserver.rackdservices.tftp: [info] bootx64.efi requested by 172.16.0.50

I don’t see any errors in the regiond logs (on either node), and no errors in the rackd-02 logs.
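
To see which boot files each client actually requested from a rack, the TFTP entries in the rackd log can be grouped by client address. This is a minimal sketch: the pattern is inferred only from the single rackd-02 log line quoted above, so other MAAS versions may format these entries differently, and the function name is hypothetical.

```python
import re
from collections import defaultdict

# Pattern inferred from the one log line quoted in this thread (assumption);
# it is not taken from MAAS documentation.
TFTP_LINE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"provisioningserver\.rackdservices\.tftp: \[info\] "
    r"(?P<file>\S+) requested by (?P<client>\S+)$"
)


def tftp_requests_by_client(lines):
    """Group requested file names by client address, in log order."""
    requests = defaultdict(list)
    for line in lines:
        match = TFTP_LINE.match(line.strip())
        if match:
            requests[match.group("client")].append(match.group("file"))
    return dict(requests)


# The line from rackd-02 shown above.
sample = [
    "2024-04-05 14:38:13 provisioningserver.rackdservices.tftp: "
    "[info] bootx64.efi requested by 172.16.0.50",
]
print(tftp_requests_by_client(sample))
```

If a client only ever appears with bootx64.efi and no later request, that would suggest it stalls right after fetching the first-stage loader, which matches the symptom described here.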

One thing I noticed:
When the rackd is powered off abruptly, the controller still shows as “alive” in the UI.

Also, and I don’t know whether this is related to the issue: all rackd instances show the task status “Region Importing”.

Thanks in advance.

r00ta · 5 April 2024 17:35

Hi @vasartori ,

Could you try to query http://<region_ip>:5240/MAAS/metadata/latest/by-id/<machine_system_id>/?op=get_preseed

and paste the result here?

vasartori · 5 April 2024 18:46

Hi @r00ta

Hmm, it doesn’t seem to return anything…

Here is how I got the system_id:

maas admin rack-controllers read |jq '.[] | select(.hostname == "maas-rackd-01") |.system_id'
"fhfxxf"

maas-rackd-01 is the server that was hard shut down.

*   Trying 172.16.0.20:5240...
* Connected to 172.16.0.20 (172.16.0.20) port 5240 (#0)
> GET /MAAS/metadata/latest/by-id/fhfxxf/?op=get_preseed HTTP/1.1
> Host: 172.16.0.20:5240
> User-Agent: curl/7.81.0
> Accept: */*
> 
* Mark bundle as not supporting multiuse
< HTTP/1.1 302 Found
< Server: nginx/1.18.0 (Ubuntu)
< Date: Fri, 05 Apr 2024 18:36:32 GMT
< Content-Type: text/html; charset=utf-8
< Content-Length: 0
< Connection: keep-alive
< Location: /MAAS/metadata/latest/by-id/fhfxxf/
< X-Frame-Options: DENY
< 
* Connection #0 to host 172.16.0.20 left intact
* Issue another request to this URL: 'http://172.16.0.20:5240/MAAS/metadata/latest/by-id/fhfxxf/'
* Found bundle for host 172.16.0.20: 0x610ab958ebb0 [serially]
* Can not multiplex, even if we wanted to!
* Re-using existing connection! (#0) with host 172.16.0.20
* Connected to 172.16.0.20 (172.16.0.20) port 5240 (#0)
> GET /MAAS/metadata/latest/by-id/fhfxxf/ HTTP/1.1
> Host: 172.16.0.20:5240
> User-Agent: curl/7.81.0
> Accept: */*
> 
* Mark bundle as not supporting multiuse
< HTTP/1.1 500 Internal Server Error
< Server: nginx/1.18.0 (Ubuntu)
< Date: Fri, 05 Apr 2024 18:36:32 GMT
< Content-Type: text/plain; charset=utf-8
< Content-Length: 73
< Connection: keep-alive
< X-Frame-Options: DENY
< 
* Connection #0 to host 172.16.0.20 left intact
VersionIndexHandler.read() got an unexpected keyword argument 'system_id'

r00ta · 5 April 2024 18:52

I mean the system ID of the machine you are trying to commission or deploy. You can get it with ‘maas admin machines read’, or click on the machine in the UI and look at the URL.
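
A minimal sketch of that lookup, mirroring the jq filter used earlier for rack controllers. It assumes the output of ‘maas admin machines read’ is a JSON array whose entries carry `hostname` and `system_id` fields, as the rack-controllers output above does; the function name and the sample data are hypothetical.

```python
import json


def system_id_for(machines_json, hostname):
    """Return the system_id of the first entry whose hostname matches, else None."""
    for entry in json.loads(machines_json):
        if entry.get("hostname") == hostname:
            return entry.get("system_id")
    return None


# Hypothetical sample; real output has many more fields per machine.
sample = (
    '[{"hostname": "node-01", "system_id": "abc123"},'
    ' {"hostname": "node-02", "system_id": "def456"}]'
)
print(system_id_for(sample, "node-02"))
```

A machine that has never enlisted will not appear in that list at all, in which case the function returns None.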

vasartori · 5 April 2024 18:58

The machine doesn’t exist in MAAS yet; this is its first boot.
When rackd-01 is enabled and fully functional, I’m able to enlist any server (by powering on a server with PXE boot).
I’m not sure I understand the purpose of HA in MAAS: if you lose one rackd instance, the other rackd takes over the job. Is that correct?

vasartori · 5 April 2024 19:23

Also, rereading my first post, I’m not sure I was clear.

My goal is to check the behaviour of HA on rackd. What happens when a rackd “has gone away”?

So, first, I’m able to enlist a server (start the server, boot using PXE, load the ephemeral OS and enlist it in MAAS).

If I turn off a rackd server (the first rack I enabled), I’m not able to boot any server…

In the UI, the powered-off rackd still appears to be enabled, and booting stops working for every server.

r00ta · 5 April 2024 19:31

Your understanding is correct. If one rack goes away, the other one should take over. However, you must allow DNS resolution on the subnet. Can you send a screenshot of the entire subnet page and the output of http://maas_ip:5248/MAAS/metadata/latest/enlist-preseed/?op=get_enlist_preseed ?

vasartori · 5 April 2024 19:49

output:

root@maas-regiond-01:~# curl http://172.16.0.24:5248/MAAS/metadata/latest/enlist-preseed/?op=get_enlist_preseed
#cloud-config
apt:
  preserve_sources_list: false
  primary:
  - arches:
    - amd64
    - i386
    uri: http://archive.ubuntu.com/ubuntu
  - arches:
    - default
    uri: http://ports.ubuntu.com/ubuntu-ports
  security:
  - arches:
    - amd64
    - i386
    uri: http://archive.ubuntu.com/ubuntu
  - arches:
    - default
    uri: http://ports.ubuntu.com/ubuntu-ports
  sources_list: 'deb $PRIMARY $RELEASE restricted main universe multiverse

    # deb-src $PRIMARY $RELEASE restricted main universe multiverse

    deb $PRIMARY $RELEASE-updates restricted main universe multiverse

    # deb-src $PRIMARY $RELEASE-updates restricted main universe multiverse

    deb $PRIMARY $RELEASE-backports restricted main universe multiverse

    # deb-src $PRIMARY $RELEASE-backports restricted main universe multiverse

    deb $SECURITY $RELEASE-security restricted main universe multiverse

    # deb-src $SECURITY $RELEASE-security restricted main universe multiverse

    '
datasource:
  MAAS:
    metadata_url: http://172.16.0.24:5248/MAAS/metadata/
manage_etc_hosts: true
packages:
- python3-yaml
- python3-oauthlib
power_state:
  condition: test ! -e /tmp/block-poweroff
  delay: now
  mode: poweroff
  timeout: 1800
rsyslog:
  remotes:
    maas: 172.16.0.20:5247

Screenshot of subnet page:

Sorry about the number of screenshots. I’m using a jumpbox that has a small screen.

r00ta · 5 April 2024 19:56

Your subnet has 8.8.8.8 as its DNS server. This forces MAAS to use IP addresses for the racks (as you can see in the preseed output I asked you to extract), which means that your setup can’t work in HA mode.

If you remove the DNS server setting, MAAS will use a domain name for the racks and your environment should work in HA mode.

Once you remove the DNS server, check the preseed file again. Let me know if that worked.
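
A minimal sketch of that check: it pulls `metadata_url` out of the enlist preseed text and reports whether the host part is an IP literal (the non-HA case described above) or a name. The helper names are hypothetical, the extraction is a plain regex rather than a full YAML parse, and the hostname in the usage lines is a made-up placeholder, not what MAAS actually generates.

```python
import ipaddress
import re
from urllib.parse import urlsplit


def metadata_url(preseed_text):
    """Extract the metadata_url value from the preseed text, or None."""
    match = re.search(r"^\s*metadata_url:\s*(\S+)\s*$", preseed_text, re.MULTILINE)
    return match.group(1) if match else None


def uses_ip_literal(url):
    """True if the URL's host is an IP address rather than a DNS name."""
    host = urlsplit(url).hostname
    if host is None:
        return False
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


# Fragment of the preseed output posted earlier in this thread.
preseed = """\
datasource:
  MAAS:
    metadata_url: http://172.16.0.24:5248/MAAS/metadata/
"""
url = metadata_url(preseed)
print(url, uses_ip_literal(url))

# Placeholder hostname, for illustration only.
print(uses_ip_literal("http://rack.example.test:5248/MAAS/metadata/"))
```

With the preseed posted above, the host is 172.16.0.24, so the check flags it as pinned to a single rack's address.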

vasartori · 8 April 2024 13:41

Hi @r00ta, it worked!

So, I think I have an idea about this “issue”…
I built this demo environment in VirtualBox, and its UEFI PXE client seems to have some kind of problem.

Today I moved this config to a PoC environment (using real servers, etc.) and it works like a charm…