Hello folks!
We are evaluating MAAS for a new infrastructure we are deploying along with Ubuntu Pro, and we started by building a dry run on the new servers themselves.
We went through the motions of deploying Postgres HA with Patroni and set up a floating VIP using Keepalived, which tracks the Patroni leader. That part works perfectly fine.
After that, we deployed HAProxy on top, which provides a single Postgres endpoint that always points to the leader and is meant to be used by all MAAS nodes.
So, with the database layer ready, we moved on to the actual MAAS installation and configuration. We followed the https://canonical.com/maas/docs/how-to-get-maas-up-and-running guide and then enabled TLS with certificates issued by our internal CA. As a single node, it works perfectly fine.
For reference, this is the networking setup:
- maas.mydomain.internal - The VIP DNS name, which points to the HAProxy floating IP 10.200.20.9. HAProxy passes TLS through for this endpoint, so TLS is terminated at the MAAS node itself rather than at HAProxy. The certificate is issued with SANs for the VIP and all node FQDNs, along with the VIP and node IPs, so it is the same certificate everywhere.
- maas-01.mydomain.internal - The first MAAS node on 10.200.20.10.
- maas-02.mydomain.internal - The second MAAS node on 10.200.20.11.
- maas-03.mydomain.internal - The third MAAS node on 10.200.20.12.
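To double-check the SAN coverage on each node, something like the following could be used. It is only a sketch: `missing_sans` is a helper name made up for this post, and the commented live call assumes OpenSSL 1.1.1 or newer (for `-ext`).

```shell
# Hypothetical helper: print every expected name that is absent from a SAN dump.
# $1 is the text printed by `openssl x509 -noout -ext subjectAltName`;
# the remaining arguments are the DNS names / IPs we expect to find.
missing_sans() {
    local san_text="$1"; shift
    local name
    for name in "$@"; do
        # Split the comma-separated SAN list into one trimmed entry per line,
        # then look for an exact "DNS:<name>" or "IP Address:<name>" entry.
        if ! printf '%s\n' "$san_text" | tr ',' '\n' \
            | sed 's/^[[:space:]]*//; s/[[:space:]]*$//' \
            | grep -qxF -e "DNS:$name" -e "IP Address:$name"; then
            printf '%s\n' "$name"
        fi
    done
}

# Live usage against one node (not executed here):
# san=$(openssl s_client -connect maas-01.mydomain.internal:5443 </dev/null 2>/dev/null \
#       | openssl x509 -noout -ext subjectAltName)
# missing_sans "$san" maas.mydomain.internal maas-01.mydomain.internal 10.200.20.9 10.200.20.10
```

An empty result would mean every expected name and IP is present in the certificate served by that node.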
All nodes are part of the Patroni cluster and can connect to the database VIP without any issues.
We deployed the initial node with the following command:
sudo maas init region+rack --database-uri "postgres://maas:maaspw@maas.mydomain.internal:5000/maas" --maas-url http://maas.mydomain.internal:5240/MAAS
After the deployment, we created the initial user and went through the wizard in the Web UI without any issues.
After that, we enabled TLS using the following:
sudo maas config-tls enable -p 5443 --cacert /var/snap/maas/common/ca-chain.crt /var/snap/maas/common/cert.key /var/snap/maas/common/cert.crt
Accessing it through the browser at https://maas.mydomain.internal:5443/MAAS works perfectly fine, as does the VIP at https://maas.mydomain.internal/MAAS through HAProxy.
Everything works well so far, and extremely smoothly.
Then we started following the steps for the HA setup, adding the other nodes (from https://canonical.com/maas/docs/how-to-manage-high-availability and How to enable high availability (deb/3.1/CLI)), and used the same command to init both region and rack as on the first node.
The command completes, and I can see in the UI the controllers being added for maas-01 and maas-03. Eventually (after a while) they report “green”. Then I ran the same TLS command on both nodes to enable TLS.
After a few seconds of trying to use the UI, we start to get timeouts and sluggish responses. After a while, the UI becomes completely unresponsive, whether through the VIP or directly on any of the nodes. I have the impression that something is being lost when adding new nodes. I can no longer even reach the database with an external tool like DataGrip or the psql CLI (and no, the machines are not overloaded; CPU and RAM are fine).
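One possibility we have not ruled out is that the database endpoint simply runs out of connection slots once three region controllers are active. Our HAProxy config (further down) sets `maxconn 100` in the global section, which caps concurrent connections for the whole HAProxy process, and also `maxconn 100` per Postgres server. A rough sketch of the arithmetic follows. The figure of 40 connections per region controller is an assumption based on what we remember from the MAAS HA docs, not something we measured, and `db_conn_headroom` is just a name made up for this post.

```shell
# Rough connection budget: how many slots are left under a proxy limit
# once N region controllers each open up to per_region connections?
# per_region is an ASSUMPTION (40 is the value we recall from the MAAS docs).
db_conn_headroom() {
    local regions="$1" per_region="$2" proxy_maxconn="$3"
    echo $(( proxy_maxconn - regions * per_region ))
}

db_conn_headroom 1 40 100   # prints 60  (single node: fits)
db_conn_headroom 3 40 100   # prints -20 (three nodes: over the limit)
```

If that assumption holds, a negative headroom would be consistent with both the UI timeouts and external clients like psql no longer getting a connection through the VIP, but we have not confirmed this is the cause.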
We checked the logs on all nodes, but there is nothing really useful there.
I’m a little confused after reading the docs, and given the number of unresolved conversations in the forum regarding HA setups, I think there is something missing in the docs, or some step is not being executed properly because of the lack of information.
The main question is whether we should call init on the other nodes using the original maas-url, which did not use TLS yet (it still appears that way in the maasserver_node table), or use the TLS endpoint now that it is enabled on the first node, or call init on each node with its individual FQDN instead of the VIP and leave the VIP only for HAProxy. The docs are very confusing on this point, and I believe this may be the source of the problem.
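To make the three options concrete, here is a small sketch that only prints the candidate commands and does not run them. The database URI and flags are the ones from our first node. The TLS and per-node URLs are illustrative guesses at what each option would look like (for example, whether the per-node variant should use port 5240 or 5443 is exactly the part we are unsure about), and `print_init_cmd` is a made-up helper name.

```shell
# Prints (does not execute) the init command for a given --maas-url.
print_init_cmd() {
    local maas_url="$1"
    printf 'sudo maas init region+rack --database-uri "%s" --maas-url %s\n' \
        "postgres://maas:maaspw@maas.mydomain.internal:5000/maas" "$maas_url"
}

# Option 1: the original non-TLS VIP URL (what we actually used).
print_init_cmd "http://maas.mydomain.internal:5240/MAAS"
# Option 2: the TLS endpoint, now that TLS is enabled on the first node (assumed URL).
print_init_cmd "https://maas.mydomain.internal:5443/MAAS"
# Option 3: the node's own FQDN instead of the VIP (shown for maas-02, assumed URL).
print_init_cmd "http://maas-02.mydomain.internal:5240/MAAS"
```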
Can someone please clarify the proper steps to have a fully working MAAS HA setup with TLS enabled on all nodes?
For reference, the HAProxy config is as follows:
global
    maxconn 100
    log /dev/log local0
    chroot /var/lib/haproxy
    stats socket /run/haproxy/admin.sock mode 660 level admin
    stats timeout 30s
    user haproxy
    group haproxy
    daemon

# DNS resolver for FQDN resolution
resolvers dns_resolver
    nameserver dns1 10.200.20.1:53
    nameserver dns2 1.1.1.1:53
    resolve_retries 3
    timeout resolve 1s
    timeout retry 1s
    hold valid 10s

defaults
    log global
    retries 3
    option redispatch
    timeout client 30m
    timeout connect 4s
    timeout server 30m
    timeout tunnel 1h
    timeout check 5s

# HAProxy Stats Page
listen stats
    mode http
    bind *:7000
    stats enable
    stats uri /
    stats refresh 5s

# MAAS UI/API (HTTPS - SSL Passthrough with Sticky Sessions)
frontend maas_https
    mode tcp
    bind *:443
    default_backend maas_https

backend maas_https
    mode tcp
    balance source
    hash-type consistent
    # Sticky sessions for WebSocket support
    stick-table type ip size 200k expire 30m
    stick on src
    # SSL health check
    option ssl-hello-chk
    # Longer timeout for WebSocket connections
    timeout server 3600s
    timeout tunnel 3600s
    server maas-01 maas-01.mydomain.internal:5443 check inter 5s fall 3 rise 2 resolvers dns_resolver
    server maas-02 maas-02.mydomain.internal:5443 check inter 5s fall 3 rise 2 resolvers dns_resolver
    server maas-03 maas-03.mydomain.internal:5443 check inter 5s fall 3 rise 2 resolvers dns_resolver

# PostgreSQL Write (Primary only)
listen postgres_write
    mode tcp
    bind *:5000
    option httpchk
    http-check expect status 200
    default-server inter 3s fall 3 rise 2 on-marked-down shutdown-sessions
    server maas-01 maas-01.mydomain.internal:5432 maxconn 100 check port 8008 resolvers dns_resolver
    server maas-02 maas-02.mydomain.internal:5432 maxconn 100 check port 8008 resolvers dns_resolver
    server maas-03 maas-03.mydomain.internal:5432 maxconn 100 check port 8008 resolvers dns_resolver

# PostgreSQL Read (Replicas)
listen postgres_read
    mode tcp
    bind *:5001
    balance roundrobin
    option httpchk GET /replica
    http-check expect status 200
    default-server inter 3s fall 3 rise 2 on-marked-down shutdown-sessions
    server maas-01 maas-01.mydomain.internal:5432 maxconn 100 check port 8008 resolvers dns_resolver
    server maas-02 maas-02.mydomain.internal:5432 maxconn 100 check port 8008 resolvers dns_resolver
    server maas-03 maas-03.mydomain.internal:5432 maxconn 100 check port 8008 resolvers dns_resolver
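For anyone who wants to reproduce what the two Postgres health checks see, the sketch below mirrors them by hand. It relies on Patroni's REST API on port 8008 answering 200 on `/` only on the leader and 200 on `/replica` only on a running replica. The helper name `patroni_role` is made up for this post, and the commented loop assumes curl is installed on the machine running it.

```shell
# Map two HTTP status codes to a role, mirroring the HAProxy checks above.
# $1: status code of GET /         (200 only on the leader)
# $2: status code of GET /replica  (200 only on a running replica)
patroni_role() {
    if [ "$1" = "200" ]; then
        echo "primary"
    elif [ "$2" = "200" ]; then
        echo "replica"
    else
        echo "unavailable"
    fi
}

# Live usage (not executed here); curl reports 000 when a node is unreachable:
# for h in maas-01 maas-02 maas-03; do
#     p=$(curl -s -o /dev/null -w '%{http_code}' "http://$h.mydomain.internal:8008/")
#     r=$(curl -s -o /dev/null -w '%{http_code}' "http://$h.mydomain.internal:8008/replica")
#     echo "$h: $(patroni_role "$p" "$r")"
# done
```

In a healthy cluster this should report exactly one primary and two replicas, which is what the postgres_write and postgres_read listeners depend on.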
Thank you! I appreciate any help.
Best regards,
Gutemberg