maas-temporal service failed to start configure-httpproxy-service workflow

Hi,

I have a case where maas-agent occasionally fails to spawn the configure-httpproxy-service workflow. The service still runs, but nodes on this rack can’t get bootx64.efi from the region during PXE boot.

I’m not sure what’s causing the issue or how to fix it. Does anyone have any idea what might be going wrong?

root@bmaas-rackd-al-1:~# curl http://<bmaas-rackd-al-1>:5248/images/bootx64.efi
<html>
<head><title>502 Bad Gateway</title></head>
<body>
<center><h1>502 Bad Gateway</h1></center>
<hr><center>nginx/1.18.0 (Ubuntu)</center>
</body>
</html>
Progress:
--------------- [1] WorkflowExecutionStarted ---------------
attempt: 1
eventTime: 2025-06-12T08:19:49.470545554Z
firstExecutionRunId: 0ab2b7a3-6a84-4d4a-ba20-dff9d8f564b2
firstWorkflowTaskBackoff: 0s
input.payloads[0].data: a2wrFOvmHaA0zNCkCEyxyv0a1dp1upiBQ9DuapJcd8bUjSdDsirvVnYvawMjHGVbRPl/8y7NbuGDj8aRQ7c=
input.payloads[0].metadata.encoding: YmluYXJ5L2VuY3J5cHRlZA==
originalExecutionRunId: 0ab2b7a3-6a84-4d4a-ba20-dff9d8f564b2
parentInitiatedEventId: 14
parentWorkflowExecution.runId: bc808909-d838-4297-8101-9f875363441f
parentWorkflowExecution.workflowId: e47f476c-174c-48bb-908a-3b87836df74c
parentWorkflowNamespace: default
parentWorkflowNamespaceId: 61e7dad0-b9a0-4469-86b7-eedbbfbc581e
retryPolicy.backoffCoefficient: 2
retryPolicy.initialInterval: 1s
retryPolicy.maximumAttempts: 1
retryPolicy.maximumInterval: 100s
taskId: 35353897
taskQueue.kind: TASK_QUEUE_KIND_NORMAL
taskQueue.name: hsdw4k@agent:main
workflowId: configure-httpproxy-service:hsdw4k
workflowRunTimeout: 0s
workflowTaskTimeout: 10s
workflowType.name: configure-httpproxy-service

--------------- [2] WorkflowTaskScheduled ---------------
attempt: 1
eventTime: 2025-06-12T08:19:49.490110756Z
startToCloseTimeout: 10s
taskId: 35353902
taskQueue.kind: TASK_QUEUE_KIND_NORMAL
taskQueue.name: hsdw4k@agent:main

--------------- [3] WorkflowTaskStarted ---------------
eventTime: 2025-06-12T08:19:49.500487518Z
historySizeBytes: 988
identity: hsdw4k@agent:209557
requestId: eea26969-b1f7-4048-ae72-66224dc94b0f
scheduledEventId: 2
taskId: 35353905

--------------- [4] WorkflowTaskCompleted ---------------
eventTime: 2025-06-12T08:19:49.511233422Z
identity: hsdw4k@agent:209557
scheduledEventId: 2
sdkMetadata.langUsedFlags[0]: 3
sdkMetadata.sdkName: temporal-go
sdkMetadata.sdkVersion: 1.25.1
startedEventId: 3
taskId: 35353909
workerVersion.buildId: 3164ae3d8ca1ff1b7f8ec4aa864f4625

--------------- [5] ActivityTaskScheduled ---------------
activityId: 5
activityType.name: get-region-controller-endpoints
eventTime: 2025-06-12T08:19:49.511279664Z
heartbeatTimeout: 0s
retryPolicy.backoffCoefficient: 2
retryPolicy.initialInterval: 1s
retryPolicy.maximumInterval: 100s
scheduleToCloseTimeout: 60s
scheduleToStartTimeout: 60s
startToCloseTimeout: 60s
taskId: 35353910
taskQueue.kind: TASK_QUEUE_KIND_NORMAL
taskQueue.name: region
workflowTaskCompletedEventId: 4

--------------- [6] ActivityTaskStarted ---------------
attempt: 1
eventTime: 2025-06-12T08:19:49.520809777Z
identity: hp44kr@region:1036
requestId: cc32be9e-c78b-410a-b846-1836b7f20f31
scheduledEventId: 5
taskId: 35353951

--------------- [7] ActivityTaskTimedOut ---------------
eventTime: 2025-06-12T08:20:49.512539539Z
failure.message: activity ScheduleToClose timeout
failure.source: Server
failure.timeoutFailureInfo.timeoutType: TIMEOUT_TYPE_SCHEDULE_TO_CLOSE
retryState: RETRY_STATE_NON_RETRYABLE_FAILURE
scheduledEventId: 5
startedEventId: 6
taskId: 35353952

--------------- [8] WorkflowTaskScheduled ---------------
attempt: 1
eventTime: 2025-06-12T08:20:49.512546829Z
startToCloseTimeout: 10s
taskId: 35353953
taskQueue.kind: TASK_QUEUE_KIND_STICKY
taskQueue.name: bmaas-rackd-al-1:994ea484-8844-47f5-8dc1-baa1bf8be4af
taskQueue.normalName: hsdw4k@agent:main

--------------- [9] WorkflowTaskStarted ---------------
eventTime: 2025-06-12T08:20:49.521378147Z
historySizeBytes: 1592
identity: hsdw4k@agent:209557
requestId: 79c20cd0-0c7a-437d-a2a9-3482eb8a7099
scheduledEventId: 8
taskId: 35353957

--------------- [10] WorkflowTaskCompleted ---------------
eventTime: 2025-06-12T08:20:49.531829098Z
identity: hsdw4k@agent:209557
scheduledEventId: 8
startedEventId: 9
taskId: 35353961
workerVersion.buildId: 3164ae3d8ca1ff1b7f8ec4aa864f4625

--------------- [11] WorkflowExecutionFailed ---------------
eventTime: 2025-06-12T08:20:49.531859623Z
failure.activityFailureInfo.activityId: 5
failure.activityFailureInfo.activityType.name: get-region-controller-endpoints
failure.activityFailureInfo.retryState: RETRY_STATE_NON_RETRYABLE_FAILURE
failure.activityFailureInfo.scheduledEventId: 5
failure.activityFailureInfo.startedEventId: 6
failure.cause.message: activity ScheduleToClose timeout
failure.cause.source: Server
failure.cause.timeoutFailureInfo.timeoutType: TIMEOUT_TYPE_SCHEDULE_TO_CLOSE
failure.message: activity error
failure.source: GoSDK
retryState: RETRY_STATE_MAXIMUM_ATTEMPTS_REACHED
taskId: 35353962
workflowTaskCompletedEventId: 10

Results:
  Status   FAILED
  Failure
    Message: activity error
    Cause:
        Message: activity ScheduleToClose timeout

Hi @huy123

That doesn’t look normal, but there may be an explanation for what exactly is failing. To set up httpproxy-service, the workflow fetches the IP addresses of the Region controller via an API call, and that is where it might be timing out.
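The event history you posted seems consistent with that. As a rough check, here is a small Python sketch that uses only the eventTime values from events [5], [6] and [7] of that history (the helper name parse_ts is made up for illustration). It measures how long the region worker held the get-region-controller-endpoints activity before the server gave up:

```python
from datetime import datetime


def parse_ts(ts: str) -> datetime:
    """Parse a Temporal eventTime such as 2025-06-12T08:19:49.511279664Z.

    strptime's %f accepts at most 6 fractional digits, so the nanosecond
    part is truncated to microseconds.
    """
    head, frac = ts.rstrip("Z").split(".")
    return datetime.strptime(f"{head}.{frac[:6]}", "%Y-%m-%dT%H:%M:%S.%f")


# eventTime values copied from events [5], [6] and [7] of the history.
scheduled = parse_ts("2025-06-12T08:19:49.511279664Z")  # ActivityTaskScheduled
started = parse_ts("2025-06-12T08:19:49.520809777Z")    # ActivityTaskStarted
timed_out = parse_ts("2025-06-12T08:20:49.512539539Z")  # ActivityTaskTimedOut

pickup_delay = (started - scheduled).total_seconds()
total_wait = (timed_out - scheduled).total_seconds()

print(f"schedule -> start:   {pickup_delay:.3f}s")
print(f"schedule -> timeout: {total_wait:.3f}s (scheduleToCloseTimeout is 60s)")
```

The worker hp44kr@region:1036 picked the activity up after roughly 10 ms, and the timeout fired about 60.001 s after scheduling, which matches the 60s scheduleToCloseTimeout. So the activity was started on the region side but apparently never returned a result.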

Do you have more logs from maas-agent?

Also what MAAS version are you running?

Hi @troyanov

maas-agent seems to keep crashing from time to time:

Jun 06 19:16:12 bmaas-rackd-al-1 systemd[1]: Started MAAS Agent daemon.
Jun 06 19:16:12 bmaas-rackd-al-1 maas-agent[524794]: INF Logger is configured with log level "info"
Jun 06 19:16:13 bmaas-rackd-al-1 maas-agent[524794]: INF Started Worker Namespace=default TaskQueue=hsdw4k@agent:main WorkerID=hsdw4k@agent:524794
Jun 06 19:16:13 bmaas-rackd-al-1 maas-agent[524794]: ERR Workflow configure-agent failed error="workflow execution error (type: configure-agent, workflowID: a38a6cc5-9247-4d72-9b47-13b5edf53238, runID: e93699e7-b090-4282-95a6-ccb0d94f615d): Workflow execution already >
Jun 06 19:16:13 bmaas-rackd-al-1 systemd[1]: maas-agent.service: Main process exited, code=exited, status=1/FAILURE
Jun 06 19:16:13 bmaas-rackd-al-1 systemd[1]: maas-agent.service: Failed with result 'exit-code'.
Jun 06 19:16:42 bmaas-rackd-al-1 systemd[1]: Started MAAS Agent daemon.
Jun 06 19:16:42 bmaas-rackd-al-1 maas-agent[524877]: INF Logger is configured with log level "info"
Jun 06 19:16:43 bmaas-rackd-al-1 maas-agent[524877]: INF Started Worker Namespace=default TaskQueue=hsdw4k@agent:main WorkerID=hsdw4k@agent:524877
Jun 06 19:16:43 bmaas-rackd-al-1 maas-agent[524877]: ERR Workflow configure-agent failed error="workflow execution error (type: configure-agent, workflowID: 860a8edc-d4a7-4f5a-8d20-f133459fd0ad, runID: 6bcbdf4c-3377-4ad6-b756-3cdd47fb74a1): Workflow execution already >
Jun 06 19:16:43 bmaas-rackd-al-1 systemd[1]: maas-agent.service: Main process exited, code=exited, status=1/FAILURE
Jun 06 19:16:43 bmaas-rackd-al-1 systemd[1]: maas-agent.service: Failed with result 'exit-code'.
Jun 06 19:17:17 bmaas-rackd-al-1 systemd[1]: Started MAAS Agent daemon.
Jun 06 19:17:17 bmaas-rackd-al-1 maas-agent[525283]: INF Logger is configured with log level "info"
Jun 06 19:17:17 bmaas-rackd-al-1 maas-agent[525283]: INF Started Worker Namespace=default TaskQueue=hsdw4k@agent:main WorkerID=hsdw4k@agent:525283
Jun 06 19:17:18 bmaas-rackd-al-1 maas-agent[525283]: INF Started Worker Namespace=default TaskQueue=agent:power@vlan-1 WorkerID=hsdw4k@agent:525283
Jun 06 19:17:18 bmaas-rackd-al-1 maas-agent[525283]: INF Started Worker Namespace=default TaskQueue=agent:power@vlan-250 WorkerID=hsdw4k@agent:525283
Jun 06 19:17:18 bmaas-rackd-al-1 maas-agent[525283]: INF Started Worker Namespace=default TaskQueue=hsdw4k@agent:power WorkerID=hsdw4k@agent:525283
Jun 06 19:17:18 bmaas-rackd-al-1 maas-agent[525283]: INF Starting power-service Attempt=1 Namespace=default RunID=ccc20ff0-cbbe-4281-b98c-bbe83ae2d1a9 TaskQueue=hsdw4k@agent:main WorkerID=hsdw4k@agent:525283 WorkflowID=configure-power-service:hsdw4k WorkflowType=confi>
Jun 06 19:17:18 bmaas-rackd-al-1 maas-agent[525283]: INF Starting httpproxy-service Attempt=1 Namespace=default RunID=89a9875a-785e-4cd4-b69a-6740f044b601 TaskQueue=hsdw4k@agent:main WorkerID=hsdw4k@agent:525283 WorkflowID=configure-httpproxy-service:hsdw4k WorkflowTy>
Jun 06 19:17:18 bmaas-rackd-al-1 maas-agent[525283]: INF Service MAAS Agent started
Jun 09 10:41:49 bmaas-rackd-al-1 maas-agent[525283]: INF Stopped Worker Namespace=default TaskQueue=agent:power@vlan-1 WorkerID=hsdw4k@agent:525283
Jun 09 10:41:49 bmaas-rackd-al-1 maas-agent[525283]: INF Stopped Worker Namespace=default TaskQueue=agent:power@vlan-250 WorkerID=hsdw4k@agent:525283
Jun 09 10:41:49 bmaas-rackd-al-1 maas-agent[525283]: INF Stopped Worker Namespace=default TaskQueue=hsdw4k@agent:power WorkerID=hsdw4k@agent:525283
Jun 09 10:41:49 bmaas-rackd-al-1 maas-agent[525283]: INF Started Worker Namespace=default TaskQueue=agent:power@vlan-1 WorkerID=hsdw4k@agent:525283
Jun 09 10:41:49 bmaas-rackd-al-1 maas-agent[525283]: INF Started Worker Namespace=default TaskQueue=hsdw4k@agent:power WorkerID=hsdw4k@agent:525283
Jun 09 10:41:49 bmaas-rackd-al-1 maas-agent[525283]: INF Starting power-service Attempt=1 Namespace=default RunID=4d6fd153-f58c-433d-8d16-9af6d740145f TaskQueue=hsdw4k@agent:main WorkerID=hsdw4k@agent:525283 WorkflowID=configure-power-service:hsdw4k WorkflowType=confi>
Jun 09 10:42:01 bmaas-rackd-al-1 systemd[1]: Stopping MAAS Agent daemon...
Jun 09 10:42:01 bmaas-rackd-al-1 systemd[1]: maas-agent.service: Deactivated successfully.
Jun 09 10:42:01 bmaas-rackd-al-1 systemd[1]: Stopped MAAS Agent daemon.
Jun 09 10:42:01 bmaas-rackd-al-1 systemd[1]: maas-agent.service: Consumed 1min 19.204s CPU time.
Jun 09 10:42:01 bmaas-rackd-al-1 systemd[1]: Started MAAS Agent daemon.
Jun 09 10:42:01 bmaas-rackd-al-1 maas-agent[1050153]: INF Logger is configured with log level "info"
Jun 09 10:42:01 bmaas-rackd-al-1 maas-agent[1050153]: INF Started Worker Namespace=default TaskQueue=hsdw4k@agent:main WorkerID=hsdw4k@agent:1050153
Jun 09 10:42:01 bmaas-rackd-al-1 maas-agent[1050153]: INF Started Worker Namespace=default TaskQueue=agent:power@vlan-1 WorkerID=hsdw4k@agent:1050153
Jun 09 10:42:01 bmaas-rackd-al-1 maas-agent[1050153]: INF Started Worker Namespace=default TaskQueue=hsdw4k@agent:power WorkerID=hsdw4k@agent:1050153
Jun 09 10:42:01 bmaas-rackd-al-1 maas-agent[1050153]: INF Starting power-service Attempt=1 Namespace=default RunID=c1c1cb7d-17e7-4ab9-bceb-460ec40616ca TaskQueue=hsdw4k@agent:main WorkerID=hsdw4k@agent:1050153 WorkflowID=configure-power-service:hsdw4k WorkflowType=con>
Jun 09 10:42:01 bmaas-rackd-al-1 maas-agent[1050153]: ERR Workflow configure-agent failed error="workflow execution error (type: configure-agent, workflowID: 87ea81c2-b039-48f8-8f09-1b7507edaa1d, runID: 186eac6f-59fd-4c06-b4d5-3750cdf76af9): Workflow execution already>
Jun 09 10:42:01 bmaas-rackd-al-1 systemd[1]: maas-agent.service: Main process exited, code=exited, status=1/FAILURE
Jun 09 10:42:01 bmaas-rackd-al-1 systemd[1]: maas-agent.service: Failed with result 'exit-code'.
Jun 09 10:42:13 bmaas-rackd-al-1 systemd[1]: Started MAAS Agent daemon.
Jun 09 10:42:13 bmaas-rackd-al-1 maas-agent[1050242]: INF Logger is configured with log level "info"
Jun 09 10:42:13 bmaas-rackd-al-1 maas-agent[1050242]: INF Started Worker Namespace=default TaskQueue=hsdw4k@agent:main WorkerID=hsdw4k@agent:1050242
Jun 09 10:43:01 bmaas-rackd-al-1 systemd[1]: Stopping MAAS Agent daemon...
Jun 09 10:43:01 bmaas-rackd-al-1 systemd[1]: maas-agent.service: Deactivated successfully.

And I’m using MAAS version 3.5.4.

Jun 06 19:16:13 bmaas-rackd-al-1 maas-agent[524794]: ERR Workflow configure-agent failed error="workflow execution error (type: configure-agent, workflowID: a38a6cc5-9247-4d72-9b47-13b5edf53238, runID: e93699e7-b090-4282-95a6-ccb0d94f615d): Workflow execution already >
Jun 06 19:16:13 bmaas-rackd-al-1 systemd[1]: maas-agent.service: Main process exited, code=exited, status=1/FAILURE

I guess that one is about Workflow execution already started.
May I ask you to do the following:

  1. stop maas-agent
  2. wait for 5 minutes
  3. start maas-agent

There is logic that doesn’t allow duplicate workflows to be spawned, and I am curious whether, and why, you are hitting it.
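If it helps, the three steps above can be scripted. This is only a sketch in Python: it assumes a deb install where the unit is called maas-agent (as in your journal output), and the function name restart_with_pause is made up. The run and sleep parameters exist only so the sequence can be dry-run without touching systemd.

```python
import subprocess
import time


def restart_with_pause(unit="maas-agent", pause_s=300,
                       run=subprocess.run, sleep=time.sleep):
    """Stop a systemd unit, wait, then start it again.

    `run` and `sleep` are injectable so the sequence can be checked
    without actually calling systemctl.
    """
    run(["systemctl", "stop", unit], check=True)
    sleep(pause_s)  # 300 seconds = the 5 minute wait from step 2
    run(["systemctl", "start", unit], check=True)
```

Calling restart_with_pause() as root on the rack controller would perform the stop, the 5 minute wait, and the start in order.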

Yes, I’ve tried it and it worked, but this issue has occurred multiple times. Sometimes it takes four or five attempts for maas-agent to successfully start the workflow again. I suspect that maas-rackd is trying to restart maas-agent while the maas-temporal service attempts to create a new workflow, resulting in duplicate workflows.

Is there a way to detect and terminate any duplicate processes before starting a new one, or is this something the maas-temporal service is supposed to handle automatically? This issue is seriously impacting production, so a more reliable solution would be greatly appreciated.

I’m using the Temporal CLI to query all the workflows related to that rackd, and these are the results:

  Completed   configure-httpproxy-service:hsdw4k    configure-httpproxy-service  3 hours ago
  Failed      configure-httpproxy-service:hsdw4k    configure-httpproxy-service  3 hours ago
  Failed      configure-httpproxy-service:hsdw4k    configure-httpproxy-service  3 hours ago
  Failed      configure-httpproxy-service:hsdw4k    configure-httpproxy-service  3 hours ago
  Failed      configure-httpproxy-service:hsdw4k    configure-httpproxy-service  3 hours ago
  Completed   configure-httpproxy-service:hsdw4k    configure-httpproxy-service  5 hours ago
  Failed      configure-httpproxy-service:hsdw4k    configure-httpproxy-service  5 hours ago
  Completed   configure-httpproxy-service:hsdw4k    configure-httpproxy-service  5 hours ago
  Completed   configure-httpproxy-service:hsdw4k    configure-httpproxy-service  15 hours ago
  Completed   configure-httpproxy-service:hsdw4k    configure-httpproxy-service  15 hours ago
  Completed   configure-httpproxy-service:hsdw4k    configure-httpproxy-service  1 day ago
  Completed   configure-httpproxy-service:hsdw4k    configure-httpproxy-service  1 day ago
  Completed   configure-httpproxy-service:hsdw4k    configure-httpproxy-service  1 day ago
  Completed   configure-httpproxy-service:hsdw4k    configure-httpproxy-service  2 days ago
  Failed      configure-httpproxy-service:hsdw4k    configure-httpproxy-service  2 days ago
  Failed      configure-httpproxy-service:hsdw4k    configure-httpproxy-service  2 days ago
  Completed   configure-httpproxy-service:hsdw4k    configure-httpproxy-service  2 days ago
  Completed   configure-httpproxy-service:hsdw4k    configure-httpproxy-service  2 days ago
  Completed   configure-httpproxy-service:hsdw4k    configure-httpproxy-service  2 days ago
  Completed   configure-httpproxy-service:hsdw4k    configure-httpproxy-service  2 days ago
  Completed   configure-httpproxy-service:hsdw4k    configure-httpproxy-service  2 days ago
  Failed      configure-httpproxy-service:hsdw4k    configure-httpproxy-service  2 days ago
  Failed      configure-httpproxy-service:hsdw4k    configure-httpproxy-service  2 days ago
  Failed      configure-httpproxy-service:hsdw4k    configure-httpproxy-service  2 days ago
  Completed   configure-httpproxy-service:hsdw4k    configure-httpproxy-service  2 days ago
  Failed      configure-httpproxy-service:hsdw4k    configure-httpproxy-service  2 days ago
  Completed   configure-httpproxy-service:hsdw4k    configure-httpproxy-service  2 days ago
  Completed   configure-httpproxy-service:hsdw4k    configure-httpproxy-service  2 days ago
  Completed   configure-httpproxy-service:hsdw4k    configure-httpproxy-service  2 days ago
  Completed   configure-httpproxy-service:hsdw4k    configure-httpproxy-service  2 days ago
  Failed      configure-httpproxy-service:hsdw4k    configure-httpproxy-service  2 days ago
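
To get a feel for how often this happens, the listing can be tallied by status. A quick Python sketch, with only a few rows pasted in as sample data (the tally helper is just something I wrote for this, not part of MAAS or the Temporal CLI):

```python
from collections import Counter

# A few rows copied from the listing above; paste the full output here
# to tally every run.
listing = """\
  Completed   configure-httpproxy-service:hsdw4k    configure-httpproxy-service  3 hours ago
  Failed      configure-httpproxy-service:hsdw4k    configure-httpproxy-service  3 hours ago
  Failed      configure-httpproxy-service:hsdw4k    configure-httpproxy-service  3 hours ago
  Completed   configure-httpproxy-service:hsdw4k    configure-httpproxy-service  5 hours ago
"""


def tally(text: str) -> Counter:
    """Count rows by their first column (the workflow status)."""
    return Counter(line.split()[0] for line in text.splitlines() if line.strip())


counts = tally(listing)
print(dict(counts))
```

By my count, the full listing above has 12 Failed runs out of 31.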

@huy123 may I ask you to file a bug at MAAS in Launchpad?

This behaviour happens because id_reuse_policy is missing on every child workflow at src/maasserver/workflow/configure.py:186. It should be set to:

id_reuse_policy=WorkflowIDReusePolicy.TERMINATE_IF_RUNNING,

If you are using deb packages, maybe you can patch it yourself?
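
To illustrate what that policy changes, here is a toy in-memory model in Python. It is not the Temporal SDK and not MAAS code: ReusePolicy, AlreadyStarted and ToyServer are made-up names, and the model only mimics the behaviour as I understand it. With the default policy, a start request for a workflow ID that still has an open run is rejected; with TERMINATE_IF_RUNNING, the open run is terminated and the new one takes over the ID.

```python
from enum import Enum


class ReusePolicy(Enum):
    ALLOW_DUPLICATE = "allow_duplicate"            # Temporal's default
    TERMINATE_IF_RUNNING = "terminate_if_running"


class AlreadyStarted(Exception):
    """Stands in for the 'Workflow execution already started' error."""


class ToyServer:
    """In-memory stand-in for the server; tracks one open run per ID."""

    def __init__(self):
        self.running = {}      # workflow ID -> run number currently open
        self.terminated = []   # (workflow ID, run number) pairs
        self._next_run = 0

    def start(self, workflow_id, policy=ReusePolicy.ALLOW_DUPLICATE):
        if workflow_id in self.running:
            if policy is not ReusePolicy.TERMINATE_IF_RUNNING:
                raise AlreadyStarted(workflow_id)
            # Close the stale run so the new one can take over the ID.
            self.terminated.append((workflow_id, self.running.pop(workflow_id)))
        self._next_run += 1
        self.running[workflow_id] = self._next_run
        return self._next_run


server = ToyServer()
wf_id = "configure-httpproxy-service:hsdw4k"
first = server.start(wf_id)  # run left open by a previous agent process

try:
    server.start(wf_id)      # default policy: rejected while run 1 is open
    rejected = False
except AlreadyStarted:
    rejected = True

second = server.start(wf_id, ReusePolicy.TERMINATE_IF_RUNNING)
print(rejected, server.terminated, server.running)
```

In this model the second start fails the same way maas-agent does in your logs, while the third one succeeds because the stale run is closed first.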

@troyanov

Hi,
I have submitted a bug report: Bug #2114240, “duplicate “configure-httpproxy-service” workflow”.

I will also try to patch the code.

Thank you for the help

Hi @troyanov,
I also have a question. I have about 50 pairs of maas-rackd. Will having that many affect the performance of MAAS if I only have 4 maas-regiond nodes?

In theory it should work just fine, but it really depends on the usage scenario.