Dual Region Controllers

I’m testing a two-server HA configuration. Both controllers were initialised as region+rack controllers and point to an external PostgreSQL database.
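
Roughly how each controller was set up, assuming a snap install; the database host, credentials and URLs below are placeholders for my environment:

```
# On each controller: install the MAAS snap and point it at the shared
# external PostgreSQL database (URI and URL values are placeholders).
sudo snap install maas
sudo maas init region+rack \
    --database-uri "postgres://maas:<password>@<db-host>/maasdb" \
    --maas-url "http://<this-controller-ip>:5240/MAAS"
```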

If I stop MAAS on one of the controllers, I can still access the web dashboard through the second one, but actions such as commissioning do not work.

Commissioning hangs indefinitely, and I see messages in the logs that refer to the other controller:

maas-agent[1264426]: ERR Temporal client error error="failed reaching server: context deadline exceeded"
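
One generic check I can run on the surviving controller is to look at the agent’s open TCP connections, to see which Temporal endpoint it is stuck on:

```
# Show the maas-agent's TCP connections; the peer address indicates which
# Temporal endpoint it is trying to reach (e.g. the stopped controller).
sudo ss -tnp | grep maas-agent
```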

Is this expected behaviour? Both controllers are running MAAS 3.5.

Is the machine you are trying to commission reachable from both rack controllers?

Also, it would help if you could provide the full steps to reproduce this (including your network setup/topology).

My testing has involved composing KVM instances through virsh.

I’ve verified that both controllers can connect to virsh on the KVM nodes via SSH.
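
For example (the user and host names are placeholders for my KVM host):

```
# Run from each MAAS controller; a successful domain listing confirms the
# qemu+ssh connection to the KVM host works from that controller.
virsh -c qemu+ssh://virsh-user@kvm-host/system list --all
```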

Network traffic is routed through a single management interface. I’ve flushed the iptables rules to confirm the firewall is not a factor.
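
The checks were along these lines:

```
# Review current rules and counters, then flush them (test hosts only)
# to rule the firewall out.
sudo iptables -L -n -v
sudo iptables -F
```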

Reproduction steps:

  1. Stop MAAS on the first controller with snap stop maas.
  2. Use the second controller’s dashboard to compose a new KVM virsh instance.
  3. From the virsh host, run virsh list --all.
    I can see the instance has been defined in virsh on the node.
  4. The dashboard is stuck in Commissioning (see the CLI check after this list).
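
A rough way to confirm the stuck state from the CLI as well (the "admin" profile name and the jq filter are just my local setup):

```
# List each machine's hostname and status; the composed machine stays in
# "Commissioning" indefinitely.
maas admin machines read | jq '.[] | {hostname, status_name}'
```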

From the logs I see:

Aug 15 11:47:54 maashead2-test maas-log[1426907]: maas.node: [info] test: Status transition from NEW to COMMISSIONING
Aug 15 11:48:09 maashead2-test maas-temporal[1426894]: {"level":"error","ts":"2024-08-15T11:48:09.532+1000","msg":"service failures","operation":"GetTaskQueueUserData","wf-namespace":"default","error":"task queue closed","logging-call-at":"telemetry.go:341","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/build/temporal-8dYnj9/temporal-1.22.5/src/common/log/zap_logger.go:156\ngo.temporal.io/server/common/rpc/interceptor.(*TelemetryInterceptor).handleError\n\t/build/temporal-8dYnj9/temporal-1.22.5/src/common/rpc/interceptor/telemetry.go:341\ngo.temporal.io/server/common/rpc/interceptor.(*TelemetryInterceptor).UnaryIntercept\n\t/build/temporal-8dYnj9/temporal-1.22.5/src/common/rpc/interceptor/telemetry.go:174\ngoogle.golang.org/grpc.getChainUnaryHandler.func1\n\t/build/temporal-8dYnj9/temporal-1.22.5/src/vendor/google.golang.org/grpc/server.go:1195\ngo.temporal.io/server/common/metrics.NewServerMetricsTrailerPropagatorInterceptor.func1\n\t/build/temporal-8dYnj9/temporal-1.22.5/src/common/metrics/grpc.go:113\ngoogle.golang.org/grpc.getChainUnaryHandler.func1\n\t/build/temporal-8dYnj9/temporal-1.22.5/src/vendor/google.golang.org/grpc/server.go:1195\ngo.temporal.io/server/common/metrics.NewServerMetricsContextInjectorInterceptor.func1\n\t/build/temporal-8dYnj9/temporal-1.22.5/src/common/metrics/grpc.go:66\ngoogle.golang.org/grpc.getChainUnaryHandler.func1\n\t/build/temporal-8dYnj9/temporal-1.22.5/src/vendor/google.golang.org/grpc/server.go:1195\ngo.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc.UnaryServerInterceptor.func1\n\t/build/temporal-8dYnj9/temporal-1.22.5/src/vendor/go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc/interceptor.go:344\ngoogle.golang.org/grpc.getChainUnaryHandler.func1\n\t/build/temporal-8dYnj9/temporal-1.22.5/src/vendor/google.golang.org/grpc/server.go:1195\ngo.temporal.io/server/common/rpc.ServiceErrorInterceptor\n\t/build/temporal-8dYnj9/temporal-1.22.5/src/common/rpc/grpc.go:145\ngoogle.golang.org/grpc.chainUnaryInterceptors.func1\n\t/build/temporal-8dYnj9/temporal-1.22.5/src/vendor/google.golang.org/grpc/server.go:1186\ngo.temporal.io/server/api/matchingservice/v1._MatchingService_GetTaskQueueUserData_Handler\n\t/build/temporal-8dYnj9/temporal-1.22.5/src/api/matchingservice/v1/service.pb.go:660\ngoogle.golang.org/grpc.(*Server).processUnaryRPC\n\t/build/temporal-8dYnj9/temporal-1.22.5/src/vendor/google.golang.org/grpc/server.go:1376\ngoogle.golang.org/grpc.(*Server).handleStream\n\t/build/temporal-8dYnj9/temporal-1.22.5/src/vendor/google.golang.org/grpc/server.go:1753\ngoogle.golang.org/grpc.(*Server).serveStreams.func1.1\n\t/build/temporal-8dYnj9/temporal-1.22.5/src/vendor/google.golang.org/grpc/server.go:998"}

Further testing reveals that the second controller is unable to commission when the first is down, but the first controller can commission when the second is down.
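
This makes me suspect only the first controller is actually serving Temporal. A quick way I compared the two hosts (service names may vary between releases):

```
# On each controller: list the MAAS-managed services and look for a running
# Temporal server process; if it only appears on the first controller, that
# would explain the one-way failure.
sudo maas status
ps -ef | grep -i [t]emporal
```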