What is the proper way to make MAAS (3.6) HA?

Hello folks!

We are evaluating MAAS for a new infrastructure we are deploying along with Ubuntu Pro, and we started by building a dry run on the new servers themselves.

We went through the motions of deploying Postgres HA with Patroni and have a floating VIP using Keepalived, which tracks the Patroni leader. That part works perfectly fine.

After that, we deployed HAProxy on top, which provides a single Postgres endpoint that always points to the leader and should be used by all MAAS nodes.

So, with the database layer ready, we went to the actual MAAS installation and configuration. We’ve followed the https://canonical.com/maas/docs/how-to-get-maas-up-and-running guide and then enabled TLS providing the certificates issued by our internal CA. As a single node, it works perfectly fine.

For reference this is the setup in terms of networking:

  • maas.mydomain.internal - The VIP DNS name, which points to the HAProxy floating IP 10.200.20.9. TLS is configured as passthrough here, so it is not terminated at HAProxy but at the MAAS node itself. The certificate is issued with SANs for the VIP and all node FQDNs, along with the VIP and node IPs, so it is the same certificate everywhere.
  • maas-01.mydomain.internal - The first MAAS node on 10.200.20.10.
  • maas-02.mydomain.internal - The second MAAS node on 10.200.20.11.
  • maas-03.mydomain.internal - The third MAAS node on 10.200.20.12.

All nodes are part of the Patroni cluster and can connect to the database VIP without any issues.

The deployment of the initial node was made using the following:

sudo maas init region+rack --database-uri "postgres://maas:maaspw@maas.mydomain.internal:5000/maas" --maas-url http://maas.mydomain.internal:5240/MAAS

Then, after the deployment, we created the initial user and went through the wizard on the Web UI without any issues.

After that, we enabled the TLS using the following:

sudo maas config-tls enable -p 5443 --cacert /var/snap/maas/common/ca-chain.crt /var/snap/maas/common/cert.key /var/snap/maas/common/cert.crt

Then accessing it through the browser using https://maas.mydomain.internal:5443/MAAS works perfectly fine, and so does the VIP at https://maas.mydomain.internal/MAAS through HAProxy.

Everything works well so far, and extremely smoothly.

Then we started trying to follow the steps for the HA setup, adding the other nodes (from https://canonical.com/maas/docs/how-to-manage-high-availability and How to enable high availability (deb/3.1/CLI)), and used the same command to init both region and rack as on the first node.

The command completes, and I can see on the UI the controllers being added for maas-01 and maas-03. Eventually (after a while) they report “green”. Then I ran the same TLS command on both nodes to enable TLS.

After a few seconds of trying to use the UI we start to get timeouts and sluggish responses. After a while, the UI becomes completely unresponsive, whether through the VIP or directly on any of the nodes. I have the impression that something is being lost when adding new nodes. I can’t even reach the database anymore with an external tool like DataGrip or the psql CLI (and no, the machines are not overloaded; CPU and RAM are fine).

We tried to check the logs on all nodes but there is nothing really useful there.

I’m a little confused after reading the docs, and given the number of unresolved conversations in the forum regarding HA setups, I think there is something missing in the docs or some step that is not being properly executed because of the lack of information.

The major question is which maas-url to use when calling init on the other nodes. Should we use the original maas-url, which was not using TLS yet (it still appears like that in the maasserver_node table)? Should we use the TLS endpoint now that it is enabled on the first node? Or should init be called on each node with its individual FQDN instead of the VIP, leaving the VIP only for HAProxy? The docs are very confusing on this point and I believe this may be the source of the problem.

Can someone please clarify the proper steps to have a fully working MAAS HA setup with TLS enabled on all nodes?

For reference, the HAProxy config is as follows:

global
    maxconn 100
    log /dev/log local0
    chroot /var/lib/haproxy
    stats socket /run/haproxy/admin.sock mode 660 level admin
    stats timeout 30s
    user haproxy
    group haproxy
    daemon

# DNS resolver for FQDN resolution
resolvers dns_resolver
    nameserver dns1 10.200.20.1:53
    nameserver dns2 1.1.1.1:53
    resolve_retries 3
    timeout resolve 1s
    timeout retry 1s
    hold valid 10s

defaults
    log     global
    retries 3
    option  redispatch
    timeout client 30m
    timeout connect 4s
    timeout server 30m
    timeout tunnel 1h
    timeout check 5s

# HAProxy Stats Page
listen stats
    mode http
    bind *:7000
    stats enable
    stats uri /
    stats refresh 5s

# MAAS UI/API (HTTPS - SSL Passthrough with Sticky Sessions)
frontend maas_https
    mode tcp
    bind *:443
    default_backend maas_https

backend maas_https
    mode tcp
    balance source
    hash-type consistent
    # Sticky sessions for WebSocket support
    stick-table type ip size 200k expire 30m
    stick on src
    # SSL health check
    option ssl-hello-chk
    # Longer timeout for WebSocket connections
    timeout server 3600s
    timeout tunnel 3600s
    server maas-01 maas-01.mydomain.internal:5443 check inter 5s fall 3 rise 2 resolvers dns_resolver
    server maas-02 maas-02.mydomain.internal:5443 check inter 5s fall 3 rise 2 resolvers dns_resolver
    server maas-03 maas-03.mydomain.internal:5443 check inter 5s fall 3 rise 2 resolvers dns_resolver

# PostgreSQL Write (Primary only)
listen postgres_write
    mode tcp
    bind *:5000
    option httpchk
    http-check expect status 200
    default-server inter 3s fall 3 rise 2 on-marked-down shutdown-sessions
    server maas-01 maas-01.mydomain.internal:5432 maxconn 100 check port 8008 resolvers dns_resolver
    server maas-02 maas-02.mydomain.internal:5432 maxconn 100 check port 8008 resolvers dns_resolver
    server maas-03 maas-03.mydomain.internal:5432 maxconn 100 check port 8008 resolvers dns_resolver

# PostgreSQL Read (Replicas)
listen postgres_read
    mode tcp
    bind *:5001
    balance roundrobin
    option httpchk GET /replica
    http-check expect status 200
    default-server inter 3s fall 3 rise 2 on-marked-down shutdown-sessions
    server maas-01 maas-01.mydomain.internal:5432 maxconn 100 check port 8008 resolvers dns_resolver
    server maas-02 maas-02.mydomain.internal:5432 maxconn 100 check port 8008 resolvers dns_resolver
    server maas-03 maas-03.mydomain.internal:5432 maxconn 100 check port 8008 resolvers dns_resolver

Thank you! I appreciate any help.

Best regards,
Gutemberg

You might be hitting https://bugs.launchpad.net/maas/+bug/2130237 . Could you retry with 3.6/edge?

Thanks for the reply. But do you think the steps we took are correct? More importantly, when calling init on the second and third node.

Do not init other nodes while other controllers are running. This is a known bug, fixed in 3.7 (not released yet). You’d better stop the nodes, add the new ones, and then start everything.

Ok, good. I’ll change to 3.6/edge and upgrade, then stop the first node, add the second, then stop it and add the 3rd and then start everything.

But when running init, should I use the VIP or the node FQDN as the maas-url? With or without TLS?

Thanks

Yeah, same behavior with 3.6/edge and adding nodes one by one. If 3 nodes are “running”, nothing works. The UI only gets back to work once we stop nodes 2 and 3.

It must be without TLS, and you can use the VIP.
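For anyone following along, the sequence suggested in the replies above (stop the running controllers, init the new one with the plain-HTTP VIP URL, then start everything) could be sketched roughly like this. It is only a dry run that prints the commands. The function name, the use of ssh, and the assumption that MAAS is installed as a snap are mine, not from the docs; the database URI and maas-url are copied from the original post. This is one reading of the advice, not a verified fix.

```shell
#!/usr/bin/env bash
# Sketch only: prints the commands instead of running them (dry run).
# Assumptions: snap-based MAAS on every node, ssh access between nodes.

DB_URI="postgres://maas:maaspw@maas.mydomain.internal:5000/maas"
MAAS_URL="http://maas.mydomain.internal:5240/MAAS"   # plain HTTP, VIP name

# Hypothetical helper: first argument is the node being added,
# remaining arguments are the controllers that are already initialised.
print_add_controller_plan() {
    local new_node="$1"; shift
    local node
    # 1. Stop every controller that is currently running.
    for node in "$@"; do
        echo "ssh ${node} sudo snap stop maas"
    done
    # 2. Init the new controller against the shared database, without TLS.
    echo "ssh ${new_node} sudo maas init region+rack --database-uri \"${DB_URI}\" --maas-url ${MAAS_URL}"
    # 3. Start the previously stopped controllers again.
    for node in "$@"; do
        echo "ssh ${node} sudo snap start maas"
    done
}
```

Calling `print_add_controller_plan maas-02.mydomain.internal maas-01.mydomain.internal` prints the three steps for adding the second node; review the output before running anything by hand.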

Hi @gutemberg-veezla ,

What about PostgreSQL max_connections? Did you raise them over the default value which is 100? For example 300?

This is a related section from MAAS docs

Hey! I did raise it to 800. This was my first attempt. I know Temporal uses tons of connections (we use it in production for our system).
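It may also be worth confirming that the limit is effective end to end. The HAProxy config posted above has `maxconn 100` in the `global` section and on every Postgres `server` line, which might cap connections well before the Postgres limit of 800 is reached. A minimal sketch for comparing the two, assuming the config lives at /etc/haproxy/haproxy.cfg (an assumption, adjust as needed); the function names are made up for illustration and the connection string is the one from the init command:

```shell
#!/usr/bin/env bash
# Sketch: compare HAProxy maxconn settings with the Postgres connection limit.

# Print "<line-number>:<value>" for every non-comment maxconn setting
# in the HAProxy config file given as the first argument.
haproxy_maxconn_values() {
    awk '!/^[[:space:]]*#/ {
        for (i = 1; i < NF; i++)
            if ($i == "maxconn") print NR ":" $(i + 1)
    }' "$1"
}

# Show the effective Postgres limit and the number of open connections,
# queried through the HAProxy write endpoint on port 5000.
check_postgres_connections() {
    psql "postgres://maas:maaspw@maas.mydomain.internal:5000/maas" \
        -c 'SHOW max_connections;' \
        -c 'SELECT count(*) AS open_connections FROM pg_stat_activity;'
}
```

Running `haproxy_maxconn_values /etc/haproxy/haproxy.cfg` followed by `check_postgres_connections` shows both sides. If the HAProxy values are much lower than `max_connections`, the proxy rather than Postgres may be the bottleneck.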

I’d suggest looking at the logs of the regions that are not working.
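As a starting point, something like the following could narrow the region logs down to lines that look like failures. This is a sketch that assumes a snap install of MAAS 3.6 logging to the systemd journal under the `snap.maas.pebble` unit with the `maas-regiond` identifier; if your install logs elsewhere, swap the journalctl call accordingly. The keyword list and function names are arbitrary choices of mine.

```shell
#!/usr/bin/env bash
# Sketch: show recent region controller log lines that look like failures.
# Assumption: snap-based MAAS logging to journald (unit snap.maas.pebble).

# Keep only lines containing common failure keywords (case-insensitive).
# Reads from stdin so it can be reused on any log source.
filter_suspect_lines() {
    grep -iE 'error|timeout|too many|refused|traceback'
}

# Pull the last 30 minutes of regiond output and filter it.
recent_region_problems() {
    sudo journalctl -u snap.maas.pebble -t maas-regiond \
        --since "30 minutes ago" --no-pager | filter_suspect_lines
}
```

Run `recent_region_problems` on each of the unresponsive nodes right after the UI starts timing out, so the window covers the moment the extra controllers come up.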