MAAS 3.6.3 migration: maas-agent TLS error after backup & restore (is this supported?)

Hi all,

I’m trying to migrate a MAAS 3.6.3 setup (PostgreSQL 16) to a new instance and would appreciate some guidance.

I followed the steps in “How to back up MAAS”:

  • Backed up the database and /var/lib/maas from the source instance
  • Restored both onto a new VM running the same MAAS and PostgreSQL versions
  • Services start, but the UI keeps reconnecting and isn’t usable

Issue:
maas-agent is failing with a TLS/certificate error:

ERR Temporal client error error="failed reaching server: last connection error: connection error: desc = \"transport: authentication handshake failed: tls: failed to verify certificate: x509: certificate signed by unknown authority (possibly because of \\\"crypto/rsa: verification error\\\" while trying to verify candidate authority certificate \\\"maas-ca\\\")"

I am not sure if this is related to restoring /var/lib/maas/certificates from the old instance, so I tried:

  • Removing /var/lib/maas/certificates/*
  • Running maas-region dbupgrade
  • Restarting services

But I either get missing cert errors (e.g. cluster.pem not found) or the TLS issue persists.

Questions:

  1. Is this type of migration (backup + restore to a new instance) officially supported?
  2. What is the correct way to handle certs when migrating MAAS to a new instance?

Any advice or recommended migration approach would be greatly appreciated. Thanks!

Hi @yings17,

A couple of questions:

  • Are you running in HA mode (multiple region controllers)? If so, the certificate situation is more complex, since each region controller has its own key material that needs to be consistent.
  • What MAAS version were you migrating from?

About the TLS error: “x509: certificate signed by unknown authority” suggests that the certificates on the new instance are inconsistent. Most likely the region generated a new CA/cert set, while the maas-agent/rack controller is still presenting certificates signed by the old CA from the backup. You need to make sure both sides share the same CA.
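One way to test the "same CA" theory is to verify a certificate against the CA file with openssl. This is a minimal sketch, not a MAAS-provided tool: the directory and cluster.pem come from this thread, but the CA file name cacerts.pem is an assumption and may differ on your install. The helper name cert_signed_by is hypothetical.

```shell
#!/bin/sh
# Return 0 if cert_file chains to the CA in ca_file, non-zero otherwise.
# Uses only `openssl verify -CAfile`, which checks the signature chain.
cert_signed_by() {
    ca_file=$1
    cert_file=$2
    openssl verify -CAfile "$ca_file" "$cert_file" >/dev/null 2>&1
}

# ASSUMPTION: cacerts.pem is a guess at the CA file name; cluster.pem and
# the directory are the ones mentioned in this thread.
CERT_DIR=${CERT_DIR:-/var/lib/maas/certificates}
if [ -f "$CERT_DIR/cacerts.pem" ] && [ -f "$CERT_DIR/cluster.pem" ]; then
    if cert_signed_by "$CERT_DIR/cacerts.pem" "$CERT_DIR/cluster.pem"; then
        echo "cluster.pem is signed by the CA in cacerts.pem"
    else
        echo "cluster.pem is NOT signed by the CA in cacerts.pem"
    fi
fi
```

If the check fails, that would be consistent with the "crypto/rsa: verification error" in the agent log, where a CA with the expected name exists but its key does not match the one that signed the certificate.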

Hey, thanks for the prompt reply! This is a single node setup. Same versions on both sides, I’m migrating from 3.6.3.

How can I ensure that both sides share the same CA?

Specifically, is there a supported method to fully regenerate the internal CA and propagate it to all components (region, rack, maas-agent)?

The steps you followed seem correct to me. When you start the services, can you start the region first, wait until it is up and running and the certificates have been created, and only then start the rest of the services?

So the sequence would be like this:

  • stop all services
  • remove /var/lib/maas/certificates/
  • run maas-region dbupgrade
  • start only maas-regiond, and wait until you see that the certificates have been regenerated
  • then, start the rest of the services (maas-rackd, maas-agent…)
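The sequence above can be sketched as a small script. Treat it as an illustration under assumptions rather than a supported procedure: the function names are hypothetical, the unit list only covers services named in this thread (extend it for your install), and waiting for cluster.pem is just one guess at how to tell that the certificates were regenerated. As a precaution that is not part of the original steps, it copies the old certificates aside before clearing the directory.

```shell
#!/bin/sh
# Poll until a file exists or the timeout (in seconds) expires.
# Returns 0 if the file appeared, 1 on timeout.
wait_for_file() {
    path=$1
    timeout=${2:-60}
    elapsed=0
    while [ ! -e "$path" ]; do
        if [ "$elapsed" -ge "$timeout" ]; then
            return 1
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 0
}

# Run as root. Only defined here; nothing happens until you call it.
regenerate_certs() {
    cert_dir=/var/lib/maas/certificates
    backup_dir="$cert_dir.bak.$(date +%s)"

    # 1. Stop services (ASSUMPTION: only the units named in this thread).
    systemctl stop maas-agent maas-rackd maas-temporal maas-regiond || return 1
    # 2. Keep a copy of the old certificates, then clear the directory.
    cp -a "$cert_dir" "$backup_dir" || return 1
    find "$cert_dir" -mindepth 1 -delete || return 1
    # 3. Run the database upgrade step.
    maas-region dbupgrade || return 1
    # 4. Start only the region and wait for the certificate to reappear.
    systemctl start maas-regiond || return 1
    wait_for_file "$cert_dir/cluster.pem" 120 || return 1
    # 5. Start the remaining services.
    systemctl start maas-temporal maas-rackd maas-agent
}
```

Calling regenerate_certs runs the whole sequence and stops at the first failing step, so the backup copy stays available if the region never recreates cluster.pem.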

I tried the sequence again and now I’m seeing a “Failed to synchronise boot resources: Child Workflow execution failed” error in the UI, while it keeps trying to reconnect.

maas-agent keeps retrying with this error:
ERR Workflow configure-agent failed error="workflow execution error (type: configure-agent, workflowID: configure-agent:b6kr3d, runID: 07d1c23d-5ae6-4757-a6e7-d7578fba270b): Workflow timeout (type: StartToClose)"

Is Temporal itself healthy?
sudo journalctl -u maas-temporal --since "10 minutes ago" | grep -i "error\|fatal\|panic"

If Temporal is working correctly, the issue might be in the communication between the agent and Temporal. Check connectivity with the command below:

nc -zv -w 3 $MAAS_IP 5241
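If nc is not installed, the same reachability check can be done with bash's built-in /dev/tcp. A minimal sketch: port 5241 is the one from the command above, while the function names port_open and check_port are hypothetical, and the script assumes bash and the coreutils timeout command are available.

```shell
#!/bin/sh
# Return 0 if a TCP connection to host:port succeeds within 3 seconds.
# Host and port are passed as arguments to the inner bash, not spliced
# into the command string.
port_open() {
    host=$1
    port=$2
    timeout 3 bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$host" "$port" 2>/dev/null
}

# Print a one-line verdict for host:port.
check_port() {
    host=$1
    port=$2
    if port_open "$host" "$port"; then
        echo "port $port reachable on $host"
    else
        echo "port $port NOT reachable on $host"
    fi
}
```

For example, `check_port "$MAAS_IP" 5241` reports whether the agent's host can open a TCP connection to that port. A successful connection only shows that something is listening; it does not prove that the TLS handshake will succeed.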

Temporal is not healthy. I’m seeing a bunch of errors:

Apr 20 16:26:11 maas temporal-server[4134576]: {"level":"warn","ts":"2026-04-20T16:26:11.287+0800","msg":"Failed to poll for task.","service":"worker","Namespace":"temporal-system","TaskQueue":"temporal-sys-history-scanner-taskqueue-0","WorkerID":"4134576@maas-dev-1@","WorkerType":"ActivityWorker","Error":"closing transport due to: connection error: desc = \"error reading from server: EOF\", received prior goaway: code: NO_ERROR, debug data: \"graceful_stop\"","logging-call-at":"internal_worker_base.go:333"}