Hi,
I have a case where the maas-agent occasionally fails to spawn the configure-httpproxy-service workflow. The service still runs, but nodes on this rack can’t get bootx64.efi from the region during PXE boot.
I’m not sure what’s causing the issue or how to fix it. Does anyone have any idea what might be going wrong?
root@bmaas-rackd-al-1:~# curl http://<bmaas-rackd-al-1>:5248/images/bootx64.efi
<html>
<head><title>502 Bad Gateway</title></head>
<body>
<center><h1>502 Bad Gateway</h1></center>
<hr><center>nginx/1.18.0 (Ubuntu)</center>
</body>
</html>
Progress:
--------------- [1] WorkflowExecutionStarted ---------------
attempt: 1
eventTime: 2025-06-12T08:19:49.470545554Z
firstExecutionRunId: 0ab2b7a3-6a84-4d4a-ba20-dff9d8f564b2
firstWorkflowTaskBackoff: 0s
input.payloads[0].data: a2wrFOvmHaA0zNCkCEyxyv0a1dp1upiBQ9DuapJcd8bUjSdDsirvVnYvawMjHGVbRPl/8y7NbuGDj8aRQ7c=
input.payloads[0].metadata.encoding: YmluYXJ5L2VuY3J5cHRlZA==
originalExecutionRunId: 0ab2b7a3-6a84-4d4a-ba20-dff9d8f564b2
parentInitiatedEventId: 14
parentWorkflowExecution.runId: bc808909-d838-4297-8101-9f875363441f
parentWorkflowExecution.workflowId: e47f476c-174c-48bb-908a-3b87836df74c
parentWorkflowNamespace: default
parentWorkflowNamespaceId: 61e7dad0-b9a0-4469-86b7-eedbbfbc581e
retryPolicy.backoffCoefficient: 2
retryPolicy.initialInterval: 1s
retryPolicy.maximumAttempts: 1
retryPolicy.maximumInterval: 100s
taskId: 35353897
taskQueue.kind: TASK_QUEUE_KIND_NORMAL
taskQueue.name: hsdw4k@agent:main
workflowId: configure-httpproxy-service:hsdw4k
workflowRunTimeout: 0s
workflowTaskTimeout: 10s
workflowType.name: configure-httpproxy-service
--------------- [2] WorkflowTaskScheduled ---------------
attempt: 1
eventTime: 2025-06-12T08:19:49.490110756Z
startToCloseTimeout: 10s
taskId: 35353902
taskQueue.kind: TASK_QUEUE_KIND_NORMAL
taskQueue.name: hsdw4k@agent:main
--------------- [3] WorkflowTaskStarted ---------------
eventTime: 2025-06-12T08:19:49.500487518Z
historySizeBytes: 988
identity: hsdw4k@agent:209557
requestId: eea26969-b1f7-4048-ae72-66224dc94b0f
scheduledEventId: 2
taskId: 35353905
--------------- [4] WorkflowTaskCompleted ---------------
eventTime: 2025-06-12T08:19:49.511233422Z
identity: hsdw4k@agent:209557
scheduledEventId: 2
sdkMetadata.langUsedFlags[0]: 3
sdkMetadata.sdkName: temporal-go
sdkMetadata.sdkVersion: 1.25.1
startedEventId: 3
taskId: 35353909
workerVersion.buildId: 3164ae3d8ca1ff1b7f8ec4aa864f4625
--------------- [5] ActivityTaskScheduled ---------------
activityId: 5
activityType.name: get-region-controller-endpoints
eventTime: 2025-06-12T08:19:49.511279664Z
heartbeatTimeout: 0s
retryPolicy.backoffCoefficient: 2
retryPolicy.initialInterval: 1s
retryPolicy.maximumInterval: 100s
scheduleToCloseTimeout: 60s
scheduleToStartTimeout: 60s
startToCloseTimeout: 60s
taskId: 35353910
taskQueue.kind: TASK_QUEUE_KIND_NORMAL
taskQueue.name: region
workflowTaskCompletedEventId: 4
--------------- [6] ActivityTaskStarted ---------------
attempt: 1
eventTime: 2025-06-12T08:19:49.520809777Z
identity: hp44kr@region:1036
requestId: cc32be9e-c78b-410a-b846-1836b7f20f31
scheduledEventId: 5
taskId: 35353951
--------------- [7] ActivityTaskTimedOut ---------------
eventTime: 2025-06-12T08:20:49.512539539Z
failure.message: activity ScheduleToClose timeout
failure.source: Server
failure.timeoutFailureInfo.timeoutType: TIMEOUT_TYPE_SCHEDULE_TO_CLOSE
retryState: RETRY_STATE_NON_RETRYABLE_FAILURE
scheduledEventId: 5
startedEventId: 6
taskId: 35353952
--------------- [8] WorkflowTaskScheduled ---------------
attempt: 1
eventTime: 2025-06-12T08:20:49.512546829Z
startToCloseTimeout: 10s
taskId: 35353953
taskQueue.kind: TASK_QUEUE_KIND_STICKY
taskQueue.name: bmaas-rackd-al-1:994ea484-8844-47f5-8dc1-baa1bf8be4af
taskQueue.normalName: hsdw4k@agent:main
--------------- [9] WorkflowTaskStarted ---------------
eventTime: 2025-06-12T08:20:49.521378147Z
historySizeBytes: 1592
identity: hsdw4k@agent:209557
requestId: 79c20cd0-0c7a-437d-a2a9-3482eb8a7099
scheduledEventId: 8
taskId: 35353957
--------------- [10] WorkflowTaskCompleted ---------------
eventTime: 2025-06-12T08:20:49.531829098Z
identity: hsdw4k@agent:209557
scheduledEventId: 8
startedEventId: 9
taskId: 35353961
workerVersion.buildId: 3164ae3d8ca1ff1b7f8ec4aa864f4625
--------------- [11] WorkflowExecutionFailed ---------------
eventTime: 2025-06-12T08:20:49.531859623Z
failure.activityFailureInfo.activityId: 5
failure.activityFailureInfo.activityType.name: get-region-controller-endpoints
failure.activityFailureInfo.retryState: RETRY_STATE_NON_RETRYABLE_FAILURE
failure.activityFailureInfo.scheduledEventId: 5
failure.activityFailureInfo.startedEventId: 6
failure.cause.message: activity ScheduleToClose timeout
failure.cause.source: Server
failure.cause.timeoutFailureInfo.timeoutType: TIMEOUT_TYPE_SCHEDULE_TO_CLOSE
failure.message: activity error
failure.source: GoSDK
retryState: RETRY_STATE_MAXIMUM_ATTEMPTS_REACHED
taskId: 35353962
workflowTaskCompletedEventId: 10
Results:
Status FAILED
Failure
Message: activity error
Cause:
Message: activity ScheduleToClose timeout
Hi @huy123
That doesn’t look normal, but there may be an explanation for what exactly is failing. To set up the httpproxy-service, the workflow fetches the IP addresses of the Region controller via an API call, and that call is where it can time out.
Do you have more logs from maas-agent?
Also, what MAAS version are you running?
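The timestamps in the event history above appear consistent with that reading. A small Python sketch, using only the eventTime values of events 5, 6 and 7 from that history (the helper name parse_event_time is made up for illustration), measures the gap between scheduling, pickup and timeout:

```python
from datetime import datetime, timezone


def parse_event_time(stamp: str) -> datetime:
    """Parse a Temporal eventTime such as 2025-06-12T08:19:49.511279664Z.

    datetime only keeps microseconds, so the nanosecond digits are truncated.
    """
    date_part, frac = stamp.rstrip("Z").split(".")
    base = datetime.strptime(date_part, "%Y-%m-%dT%H:%M:%S")
    micros = int(frac[:6].ljust(6, "0"))
    return base.replace(microsecond=micros, tzinfo=timezone.utc)


# eventTime values copied from the history above
scheduled = parse_event_time("2025-06-12T08:19:49.511279664Z")  # [5] ActivityTaskScheduled
started = parse_event_time("2025-06-12T08:19:49.520809777Z")    # [6] ActivityTaskStarted
timed_out = parse_event_time("2025-06-12T08:20:49.512539539Z")  # [7] ActivityTaskTimedOut

pickup_delay = (started - scheduled).total_seconds()
total = (timed_out - scheduled).total_seconds()
print(f"picked up after {pickup_delay:.3f}s, timed out after {total:.3f}s")
```

So the get-region-controller-endpoints activity was picked up by a region worker within roughly 10 ms, but it did not finish within the 60 s scheduleToCloseTimeout. A ScheduleToClose timeout covers all attempts, so it is not retried, which matches the RETRY_STATE_NON_RETRYABLE_FAILURE in event 7.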
Hi @troyanov
maas-agent seems to crash from time to time:
Jun 06 19:16:12 bmaas-rackd-al-1 systemd[1]: Started MAAS Agent daemon.
Jun 06 19:16:12 bmaas-rackd-al-1 maas-agent[524794]: INF Logger is configured with log level "info"
Jun 06 19:16:13 bmaas-rackd-al-1 maas-agent[524794]: INF Started Worker Namespace=default TaskQueue=hsdw4k@agent:main WorkerID=hsdw4k@agent:524794
Jun 06 19:16:13 bmaas-rackd-al-1 maas-agent[524794]: ERR Workflow configure-agent failed error="workflow execution error (type: configure-agent, workflowID: a38a6cc5-9247-4d72-9b47-13b5edf53238, runID: e93699e7-b090-4282-95a6-ccb0d94f615d): Workflow execution already >
Jun 06 19:16:13 bmaas-rackd-al-1 systemd[1]: maas-agent.service: Main process exited, code=exited, status=1/FAILURE
Jun 06 19:16:13 bmaas-rackd-al-1 systemd[1]: maas-agent.service: Failed with result 'exit-code'.
Jun 06 19:16:42 bmaas-rackd-al-1 systemd[1]: Started MAAS Agent daemon.
Jun 06 19:16:42 bmaas-rackd-al-1 maas-agent[524877]: INF Logger is configured with log level "info"
Jun 06 19:16:43 bmaas-rackd-al-1 maas-agent[524877]: INF Started Worker Namespace=default TaskQueue=hsdw4k@agent:main WorkerID=hsdw4k@agent:524877
Jun 06 19:16:43 bmaas-rackd-al-1 maas-agent[524877]: ERR Workflow configure-agent failed error="workflow execution error (type: configure-agent, workflowID: 860a8edc-d4a7-4f5a-8d20-f133459fd0ad, runID: 6bcbdf4c-3377-4ad6-b756-3cdd47fb74a1): Workflow execution already >
Jun 06 19:16:43 bmaas-rackd-al-1 systemd[1]: maas-agent.service: Main process exited, code=exited, status=1/FAILURE
Jun 06 19:16:43 bmaas-rackd-al-1 systemd[1]: maas-agent.service: Failed with result 'exit-code'.
Jun 06 19:17:17 bmaas-rackd-al-1 systemd[1]: Started MAAS Agent daemon.
Jun 06 19:17:17 bmaas-rackd-al-1 maas-agent[525283]: INF Logger is configured with log level "info"
Jun 06 19:17:17 bmaas-rackd-al-1 maas-agent[525283]: INF Started Worker Namespace=default TaskQueue=hsdw4k@agent:main WorkerID=hsdw4k@agent:525283
Jun 06 19:17:18 bmaas-rackd-al-1 maas-agent[525283]: INF Started Worker Namespace=default TaskQueue=agent:power@vlan-1 WorkerID=hsdw4k@agent:525283
Jun 06 19:17:18 bmaas-rackd-al-1 maas-agent[525283]: INF Started Worker Namespace=default TaskQueue=agent:power@vlan-250 WorkerID=hsdw4k@agent:525283
Jun 06 19:17:18 bmaas-rackd-al-1 maas-agent[525283]: INF Started Worker Namespace=default TaskQueue=hsdw4k@agent:power WorkerID=hsdw4k@agent:525283
Jun 06 19:17:18 bmaas-rackd-al-1 maas-agent[525283]: INF Starting power-service Attempt=1 Namespace=default RunID=ccc20ff0-cbbe-4281-b98c-bbe83ae2d1a9 TaskQueue=hsdw4k@agent:main WorkerID=hsdw4k@agent:525283 WorkflowID=configure-power-service:hsdw4k WorkflowType=confi>
Jun 06 19:17:18 bmaas-rackd-al-1 maas-agent[525283]: INF Starting httpproxy-service Attempt=1 Namespace=default RunID=89a9875a-785e-4cd4-b69a-6740f044b601 TaskQueue=hsdw4k@agent:main WorkerID=hsdw4k@agent:525283 WorkflowID=configure-httpproxy-service:hsdw4k WorkflowTy>
Jun 06 19:17:18 bmaas-rackd-al-1 maas-agent[525283]: INF Service MAAS Agent started
Jun 09 10:41:49 bmaas-rackd-al-1 maas-agent[525283]: INF Stopped Worker Namespace=default TaskQueue=agent:power@vlan-1 WorkerID=hsdw4k@agent:525283
Jun 09 10:41:49 bmaas-rackd-al-1 maas-agent[525283]: INF Stopped Worker Namespace=default TaskQueue=agent:power@vlan-250 WorkerID=hsdw4k@agent:525283
Jun 09 10:41:49 bmaas-rackd-al-1 maas-agent[525283]: INF Stopped Worker Namespace=default TaskQueue=hsdw4k@agent:power WorkerID=hsdw4k@agent:525283
Jun 09 10:41:49 bmaas-rackd-al-1 maas-agent[525283]: INF Started Worker Namespace=default TaskQueue=agent:power@vlan-1 WorkerID=hsdw4k@agent:525283
Jun 09 10:41:49 bmaas-rackd-al-1 maas-agent[525283]: INF Started Worker Namespace=default TaskQueue=hsdw4k@agent:power WorkerID=hsdw4k@agent:525283
Jun 09 10:41:49 bmaas-rackd-al-1 maas-agent[525283]: INF Starting power-service Attempt=1 Namespace=default RunID=4d6fd153-f58c-433d-8d16-9af6d740145f TaskQueue=hsdw4k@agent:main WorkerID=hsdw4k@agent:525283 WorkflowID=configure-power-service:hsdw4k WorkflowType=confi>
Jun 09 10:42:01 bmaas-rackd-al-1 systemd[1]: Stopping MAAS Agent daemon...
Jun 09 10:42:01 bmaas-rackd-al-1 systemd[1]: maas-agent.service: Deactivated successfully.
Jun 09 10:42:01 bmaas-rackd-al-1 systemd[1]: Stopped MAAS Agent daemon.
Jun 09 10:42:01 bmaas-rackd-al-1 systemd[1]: maas-agent.service: Consumed 1min 19.204s CPU time.
Jun 09 10:42:01 bmaas-rackd-al-1 systemd[1]: Started MAAS Agent daemon.
Jun 09 10:42:01 bmaas-rackd-al-1 maas-agent[1050153]: INF Logger is configured with log level "info"
Jun 09 10:42:01 bmaas-rackd-al-1 maas-agent[1050153]: INF Started Worker Namespace=default TaskQueue=hsdw4k@agent:main WorkerID=hsdw4k@agent:1050153
Jun 09 10:42:01 bmaas-rackd-al-1 maas-agent[1050153]: INF Started Worker Namespace=default TaskQueue=agent:power@vlan-1 WorkerID=hsdw4k@agent:1050153
Jun 09 10:42:01 bmaas-rackd-al-1 maas-agent[1050153]: INF Started Worker Namespace=default TaskQueue=hsdw4k@agent:power WorkerID=hsdw4k@agent:1050153
Jun 09 10:42:01 bmaas-rackd-al-1 maas-agent[1050153]: INF Starting power-service Attempt=1 Namespace=default RunID=c1c1cb7d-17e7-4ab9-bceb-460ec40616ca TaskQueue=hsdw4k@agent:main WorkerID=hsdw4k@agent:1050153 WorkflowID=configure-power-service:hsdw4k WorkflowType=con>
Jun 09 10:42:01 bmaas-rackd-al-1 maas-agent[1050153]: ERR Workflow configure-agent failed error="workflow execution error (type: configure-agent, workflowID: 87ea81c2-b039-48f8-8f09-1b7507edaa1d, runID: 186eac6f-59fd-4c06-b4d5-3750cdf76af9): Workflow execution already>
Jun 09 10:42:01 bmaas-rackd-al-1 systemd[1]: maas-agent.service: Main process exited, code=exited, status=1/FAILURE
Jun 09 10:42:01 bmaas-rackd-al-1 systemd[1]: maas-agent.service: Failed with result 'exit-code'.
Jun 09 10:42:13 bmaas-rackd-al-1 systemd[1]: Started MAAS Agent daemon.
Jun 09 10:42:13 bmaas-rackd-al-1 maas-agent[1050242]: INF Logger is configured with log level "info"
Jun 09 10:42:13 bmaas-rackd-al-1 maas-agent[1050242]: INF Started Worker Namespace=default TaskQueue=hsdw4k@agent:main WorkerID=hsdw4k@agent:1050242
Jun 09 10:43:01 bmaas-rackd-al-1 systemd[1]: Stopping MAAS Agent daemon...
Jun 09 10:43:01 bmaas-rackd-al-1 systemd[1]: maas-agent.service: Deactivated successfully.
I’m using MAAS version 3.5.4.
Jun 06 19:16:13 bmaas-rackd-al-1 maas-agent[524794]: ERR Workflow configure-agent failed error="workflow execution error (type: configure-agent, workflowID: a38a6cc5-9247-4d72-9b47-13b5edf53238, runID: e93699e7-b090-4282-95a6-ccb0d94f615d): Workflow execution already >
Jun 06 19:16:13 bmaas-rackd-al-1 systemd[1]: maas-agent.service: Main process exited, code=exited, status=1/FAILURE
I guess that one is about "Workflow execution already started".
May I ask you to do the following:
- stop maas-agent
- wait for 5 minutes
- start maas-agent
There is logic that does not allow duplicate workflows to be spawned, and I am curious whether, and why, you are hitting it.
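The three steps above could be scripted roughly as follows. This is only a sketch: it assumes a deb install where the agent runs as the systemd unit maas-agent.service (the unit name seen in the journal output), and the function name restart_with_pause and its parameters are invented for illustration.

```python
import subprocess
import time


def restart_with_pause(unit="maas-agent.service", pause_seconds=300,
                       run=subprocess.run, sleep=time.sleep):
    """Stop the unit, wait, then start it again.

    `run` and `sleep` are injectable so the sequence can be checked
    without touching systemd.
    """
    run(["systemctl", "stop", unit], check=True)   # stop maas-agent
    sleep(pause_seconds)                           # wait (5 minutes by default)
    run(["systemctl", "start", unit], check=True)  # start maas-agent
```

Calling restart_with_pause() as root performs the stop, the 5-minute wait and the start in order. Snap-based installs manage the service differently, so this would not apply there as written.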
Yes, I’ve tried it and it worked, but this issue has occurred multiple times. Sometimes it takes four or five attempts for the maas-agent to successfully start the workflow again. I suspect that maas-rackd is trying to restart the maas-agent while the maas-temporal service attempts to create a new workflow, resulting in duplicate workflows.
Is there a way to detect and terminate any duplicate processes before starting a new one, or is this something the maas-temporal service is supposed to handle automatically? This issue is seriously impacting production, so a more reliable solution would be greatly appreciated.
I’m using the Temporal CLI to query all the workflows related to that rackd, and these are the results:
Completed configure-httpproxy-service:hsdw4k configure-httpproxy-service 3 hours ago
Failed configure-httpproxy-service:hsdw4k configure-httpproxy-service 3 hours ago
Failed configure-httpproxy-service:hsdw4k configure-httpproxy-service 3 hours ago
Failed configure-httpproxy-service:hsdw4k configure-httpproxy-service 3 hours ago
Failed configure-httpproxy-service:hsdw4k configure-httpproxy-service 3 hours ago
Completed configure-httpproxy-service:hsdw4k configure-httpproxy-service 5 hours ago
Failed configure-httpproxy-service:hsdw4k configure-httpproxy-service 5 hours ago
Completed configure-httpproxy-service:hsdw4k configure-httpproxy-service 5 hours ago
Completed configure-httpproxy-service:hsdw4k configure-httpproxy-service 15 hours ago
Completed configure-httpproxy-service:hsdw4k configure-httpproxy-service 15 hours ago
Completed configure-httpproxy-service:hsdw4k configure-httpproxy-service 1 day ago
Completed configure-httpproxy-service:hsdw4k configure-httpproxy-service 1 day ago
Completed configure-httpproxy-service:hsdw4k configure-httpproxy-service 1 day ago
Completed configure-httpproxy-service:hsdw4k configure-httpproxy-service 2 days ago
Failed configure-httpproxy-service:hsdw4k configure-httpproxy-service 2 days ago
Failed configure-httpproxy-service:hsdw4k configure-httpproxy-service 2 days ago
Completed configure-httpproxy-service:hsdw4k configure-httpproxy-service 2 days ago
Completed configure-httpproxy-service:hsdw4k configure-httpproxy-service 2 days ago
Completed configure-httpproxy-service:hsdw4k configure-httpproxy-service 2 days ago
Completed configure-httpproxy-service:hsdw4k configure-httpproxy-service 2 days ago
Completed configure-httpproxy-service:hsdw4k configure-httpproxy-service 2 days ago
Failed configure-httpproxy-service:hsdw4k configure-httpproxy-service 2 days ago
Failed configure-httpproxy-service:hsdw4k configure-httpproxy-service 2 days ago
Failed configure-httpproxy-service:hsdw4k configure-httpproxy-service 2 days ago
Completed configure-httpproxy-service:hsdw4k configure-httpproxy-service 2 days ago
Failed configure-httpproxy-service:hsdw4k configure-httpproxy-service 2 days ago
Completed configure-httpproxy-service:hsdw4k configure-httpproxy-service 2 days ago
Completed configure-httpproxy-service:hsdw4k configure-httpproxy-service 2 days ago
Completed configure-httpproxy-service:hsdw4k configure-httpproxy-service 2 days ago
Completed configure-httpproxy-service:hsdw4k configure-httpproxy-service 2 days ago
Failed configure-httpproxy-service:hsdw4k configure-httpproxy-service 2 days ago
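To get a quick failure ratio from a listing in this shape, the status column can be tallied with a few lines of Python. This is a sketch that assumes the status is the first whitespace-separated field on each line, as in the output above; tally_statuses is a hypothetical helper, and the sample is just the first five lines of the listing.

```python
from collections import Counter


def tally_statuses(listing: str) -> Counter:
    """Count workflow runs per status (first field of each non-empty line)."""
    counts = Counter()
    for line in listing.splitlines():
        fields = line.split()
        if fields:
            counts[fields[0]] += 1
    return counts


# First five lines of the listing above
sample = """\
Completed configure-httpproxy-service:hsdw4k configure-httpproxy-service 3 hours ago
Failed configure-httpproxy-service:hsdw4k configure-httpproxy-service 3 hours ago
Failed configure-httpproxy-service:hsdw4k configure-httpproxy-service 3 hours ago
Failed configure-httpproxy-service:hsdw4k configure-httpproxy-service 3 hours ago
Failed configure-httpproxy-service:hsdw4k configure-httpproxy-service 3 hours ago
"""

counts = tally_statuses(sample)
print(dict(counts))
```

Run over the full listing above, the same tally gives 19 Completed and 12 Failed out of 31 runs, with the failures tending to come in consecutive bursts before a run finally completes.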
@huy123 may I ask you to file a bug against MAAS on Launchpad?
This behaviour happens because id_reuse_policy is missing on every child workflow at src/maasserver/workflow/configure.py:186. The missing argument is:
id_reuse_policy=WorkflowIDReusePolicy.TERMINATE_IF_RUNNING,
If you are using deb packages, maybe you can patch it yourself?
@troyanov
Hi,
I have submitted a bug report here: Bug #2114240 “duplicate “configure-httpproxy-service” workflow”, and I will try to patch the code.
Thank you for the help.
Hi @troyanov,
I also have a question. I have about 50 pairs of maas-rackd. Will having that many affect the performance of MAAS if I only have 4 maas-regiond nodes?
In theory it should work just fine, but it really depends on the usage scenario.