[solved] maas-agent constantly failing

Hi,

I’m having issues deploying MaaS (with the help of FCE): the deployment constantly fails because maas-agent fails.
I have had many issues with this component. It may be meant as an improvement, but compared to older MaaS releases I find it a lot less reliable.

MaaS version : 3.6.1-17573-g.bc3a12219 (snap version, 3.6/stable channel)

Here is the error I get. It looks like it fails on the proxy configuration, as can be seen in this log snippet (we’re behind a proxy, but I checked our proxy configuration and see no problems there):

Sep 04 07:35:27 hpdl1v03802 maas-agent[66172]: INF Logger is configured with log level "info"
Sep 04 07:35:27 hpdl1v03802 maas-agent[66172]: INF Started Worker Namespace=default TaskQueue=cygabp@agent:main WorkerID=cygabp@agent:66172
Sep 04 07:35:28 hpdl1v03802 maas-agent[66172]: INF Configuring power-service Attempt=1 Namespace=default RunID=ef07ee01-9417-4fac-aee4-c3402ef19256 SpanID=0000000000000000 TaskQueue=cygabp@agent:main TraceID=00000000000000000000000000000000 WorkerID=cygabp@agent:66172 WorkflowID=configure-power-service:cygabp WorkflowType=configure-power-service
Sep 04 07:35:28 hpdl1v03802 maas-agent[66172]: INF Started Worker Namespace=default TaskQueue=cygabp@agent:power WorkerID=cygabp@agent:66172
Sep 04 07:35:28 hpdl1v03802 maas-agent[66172]: INF Started power-service Attempt=1 Namespace=default RunID=ef07ee01-9417-4fac-aee4-c3402ef19256 SpanID=0000000000000000 TaskQueue=cygabp@agent:main TraceID=00000000000000000000000000000000 WorkerID=cygabp@agent:66172 WorkflowID=configure-power-service:cygabp WorkflowType=configure-power-service
Sep 04 07:35:28 hpdl1v03802 maas-agent[66172]: INF Configuring httpproxy-service Attempt=1 Namespace=default RunID=0c07df26-2450-4833-9fbf-1a220961a012 SpanID=0000000000000000 TaskQueue=cygabp@agent:main TraceID=00000000000000000000000000000000 WorkerID=cygabp@agent:66172 WorkflowID=configure-httpproxy-service:cygabp WorkflowType=configure-httpproxy-service
Sep 04 07:35:31 hpdl1v03802 maas-agent[66172]: ERR Workflow configure-agent failed error="workflow execution error (type: configure-agent, workflowID: configure-agent:cygabp, runID: f82252b3-a47d-4efa-9d6b-a44a6e3e1f5f): child workflow execution error (type: configure-httpproxy-service, workflowID: configure-httpproxy-service:cygabp, runID: 0c07df26-2450-4833-9fbf-1a220961a012, initiatedEventID: 14, startedEventID: 15): targets cannot be empty"
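
The ERR line above nests one workflow error inside another, so the actual cause sits at the very end of the message. As a reading aid, here is a minimal Python sketch that pulls out the chain of workflow types and the trailing message. The function name is made up for illustration, and the parsing assumes the message layout seen in this particular log line; it is not a documented maas-agent format.

```python
import re


def summarize_workflow_error(line):
    """Return (workflow_types, root_message) for a maas-agent ERR line.

    workflow_types lists the nested workflow types from outermost to
    innermost; root_message is the text after the last "): " separator.
    The layout is inferred from the log line in this thread (assumption).
    """
    match = re.search(r'error="(.*)"\s*$', line)
    if match is None:
        return [], None
    error_text = match.group(1)
    workflow_types = re.findall(r"\(type: ([\w-]+),", error_text)
    root_message = error_text.rsplit("): ", 1)[-1]
    return workflow_types, root_message


# The ERR line from the log above, without the journald prefix.
sample = (
    'ERR Workflow configure-agent failed error="workflow execution error '
    "(type: configure-agent, workflowID: configure-agent:cygabp, "
    "runID: f82252b3-a47d-4efa-9d6b-a44a6e3e1f5f): "
    "child workflow execution error (type: configure-httpproxy-service, "
    "workflowID: configure-httpproxy-service:cygabp, "
    "runID: 0c07df26-2450-4833-9fbf-1a220961a012, "
    'initiatedEventID: 14, startedEventID: 15): targets cannot be empty"'
)

types, root = summarize_workflow_error(sample)
print(" -> ".join(types))
print(root)
```

For this line it prints the chain configure-agent -> configure-httpproxy-service followed by "targets cannot be empty", which shows that the failure originates in the httpproxy-service child workflow rather than in configure-agent itself.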

How can I debug maas-agent in a snap environment? How can I check whether the configuration it receives is good? I suspect it tries to configure a proxy without having a value for it, but I don’t see where it takes that value from.

What I don’t understand is that this is not my first deployment of this kind: same bundle, same proxy settings, only different networks/environments. I managed to deploy my previous clusters, but not the new ones I’m working on right now.

Any help would be greatly appreciated.

This usually happens when the region/rack controllers do not yet have an IP showing in the UI. My suggestion is to ensure that all the controllers show an IP on the network page and, if any of them do not, to restart MAAS on those infra nodes.

Hi r00ta, thanks for your help!

Indeed, the 3 controllers appear up and green (and in sync) in the UI, but when I display the details of a controller, the “Region importing” process never ends and the details are stuck at “unknown”.

I restarted MaaS on the 3 nodes many times, but this doesn’t change anything.
I suspect this is the root cause of my issue, but I don’t know how to debug it.
I have the same network configuration on my other clusters, which were deployed successfully: one bond, multiple bridges, but only one default gateway, so the default interface should be easy to discover and detect.
As I said, this is deployed through FCE and everything is done through the bundle. I can see my MaaS VIP and PostgreSQL VIP for HA, and both are accessible. Everything looks “normal” until you look at the logs and see that there are some unknown issues.

Are there any logs I can grab to try to understand this?

Probably the commissioning scripts for the infra nodes are not being processed.

We know that there are issues when the host has a very large number of network interfaces. Is that the case?

Output of ip a?

Hmm… you might have pointed out something. I had an LXD instance already configured prior to the MaaS deployment (it is apparently required for MaaS LXD support), and I have a few LXD VMs already running, which created a few tap interfaces.
I’ll remove all of them and try again. I’ll keep you posted.
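
For anyone wanting to check the same thing on their own infra node, here is a minimal Python sketch that lists the interfaces whose names start with a given prefix, based on the JSON that "ip -j link show" emits. The function name is made up, the sample data is hypothetical and trimmed to two fields, and the assumption that the LXD VM interfaces are the ones named tap... comes from this thread, not from a documented rule.

```python
import json


def interfaces_with_prefix(ip_json_text, prefix="tap"):
    """Return interface names starting with `prefix`.

    ip_json_text is the JSON text produced by `ip -j link show`, which is
    a list of objects that each carry an "ifname" field.
    """
    links = json.loads(ip_json_text)
    names = [link.get("ifname", "") for link in links]
    return [name for name in names if name.startswith(prefix)]


# Hypothetical, trimmed sample standing in for real `ip -j link show` output.
sample = json.dumps([
    {"ifindex": 1, "ifname": "lo"},
    {"ifindex": 2, "ifname": "bond0"},
    {"ifindex": 3, "ifname": "br-mgmt"},
    {"ifindex": 7, "ifname": "tap3f2a1b9c"},
    {"ifindex": 8, "ifname": "tapd41c77e0"},
])

taps = interfaces_with_prefix(sample)
print(f"{len(json.loads(sample))} interfaces, {len(taps)} tap: {taps}")
```

On a real host, the output of "ip -j link show" can be saved to a file and its contents passed to the function in place of the sample.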

Great!
You were right, those LXD network interfaces were messing with the deployment.
And in fact, there is no need to have LXD pre-configured beforehand, despite what the documentation says: I cleaned up everything and redeployed from a base server install, and FCE deployed everything properly without issues, including LXD.

Many thanks for your help. There is probably something to be done about this kind of situation: perhaps an improved MaaS deployment script or, at least, a KB article about it somewhere.

Best regards!

For the record, this is the bug: https://bugs.launchpad.net/maas/+bug/2114255