MAAS restarting node during deployment - tested with Intel AMT but affects others too

Hi all,

I am building my new homelab with MAAS after having a successful homelab with ARM64 based boards.
For this one I have Dell Optiplex 5070 Small Form Factor nodes with Intel AMT enabled on them.

I am trying the latest MAAS 3.6 version from Snap and having issues with deployment.

When I try to deploy, the following happens:

  1. The machine is powered on.
  2. The machine attempts to PXE boot.
  3. At some random time (between PXE boot and installation complete) the machine reboots.
  4. The machine starts the installation again over PXE.

I was going crazy trying to figure out whether the cause was

  • Kernel crashing
  • BIOS Watchdog Timer
  • any other possibility.

I decided to power the node on myself using the “meshcmd” tool from MeshCommander, with the MAAS power type set to Manual. This made things work as expected: no unexpected reboots between deployment start and finish.

meshcmd AmtPower --host 192.168.20.101 --pass MyPAAS!123 --poweron --bootdevice pxe

To narrow down the issue, I set up a test instance of maaspower (see the maaspower 1.0.1.dev2+g5923ff0 documentation) in front of the node and set the MAAS power configuration to webhook.

The maaspower config file I used:

name: my maas power control webhooks
ip_address: 0.0.0.0
port: 5000
username: maas_user
password: maas_pass

devices:
- type: CommandLine
  name: '192.168.\d{1,3}.\d{1,3}'
  on: 'meshcmd AmtPower --host \g<0> --pass MyPAAS!123  --poweron --bootdevice pxe'
  off: 'meshcmd AmtPower --host \g<0> --pass MyPAAS!123 --poweroff'
  query: 'meshcmd AmtPower --host \g<0> --pass MyPAAS!123'
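
For anyone unfamiliar with the `\g<0>` placeholder in the config above: it is Python regular-expression syntax for "the whole matched text". The sketch below assumes maaspower matches the device name from the webhook URL against `name` and then expands the command template with that match. The helper `build_command` is a made-up name for illustration, not part of maaspower.

```python
import re

# Pattern and command template taken from the config above.
NAME_PATTERN = r"192.168.\d{1,3}.\d{1,3}"
ON_TEMPLATE = r"meshcmd AmtPower --host \g<0> --pass MyPAAS!123 --poweron --bootdevice pxe"


def build_command(device, pattern=NAME_PATTERN, template=ON_TEMPLATE):
    """Hypothetical helper: return the expanded command, or None if no match."""
    match = re.fullmatch(pattern, device)
    if match is None:
        return None
    # \g<0> in the template is replaced by the entire matched device name.
    return match.expand(template)


print(build_command("192.168.20.101"))
print(build_command("10.0.0.5"))
```

This is why one `devices` entry can serve every node in the subnet: the address in the request URL becomes the `--host` argument.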

This also worked as expected, with no reboots while deploying. While watching the maaspower logs, I saw that MAAS actually tries to power the node on more than once when the deployment starts.

device: 192.168.20.101 command: query
EXECUTE command line: meshcmd AmtPower --host 192.168.20.101 --pass MyPAAS!123
Current power state: Soft off

response: status : stopped
192.168.20.2 - - [14/Jun/2025 10:40:12] "GET /maaspower/192.168.20.101/query HTTP/1.1" 200 -
device: 192.168.20.101 command: on
EXECUTE command line: meshcmd AmtPower --host 192.168.20.101 --pass MyPAAS!123 --poweron --bootdevice pxe
SUCCESS

response: None
192.168.20.2 - - [14/Jun/2025 10:40:25] "POST /maaspower/192.168.20.101/on HTTP/1.1" 200 -
device: 192.168.20.101 command: query
EXECUTE command line: meshcmd AmtPower --host 192.168.20.101 --pass MyPAAS!123
Current power state: Power on

response: status : running
192.168.20.2 - - [14/Jun/2025 10:40:46] "GET /maaspower/192.168.20.101/query HTTP/1.1" 200 -
device: 192.168.20.101 command: on
EXECUTE command line: meshcmd AmtPower --host 192.168.20.101 --pass MyPAAS!123 --poweron --bootdevice pxe
SUCCESS

response: None
192.168.20.2 - - [14/Jun/2025 10:40:56] "POST /maaspower/192.168.20.101/on HTTP/1.1" 200 -
device: 192.168.20.101 command: query
EXECUTE command line: meshcmd AmtPower --host 192.168.20.101 --pass MyPAAS!123
Current power state: Power on

response: status : running

This isn’t right, but I still wasn’t sure why deployment works through meshcmd while the node restarts when MAAS drives AMT directly.

Looking at the MAAS power driver for AMT, things made sense:

            if (
                self.wsman_query_state(
                    ip_address, power_user, power_pass, port
                )
                == "on"
            ):
                self.wsman_power_on(
                    ip_address, power_user, power_pass, port, restart=True
                )
            else:
                self.wsman_power_on(ip_address, power_user, power_pass, port)

The AMT power driver tries to be smart: it checks whether the machine is already powered on and, if so, tells it to restart.

Combine the two issues and you get a machine that keeps restarting during deployment:

  1. MAAS is trying to power on a machine twice when deployment is started.
  2. AMT Power driver is trying to restart machine when MAAS is asking it to poweron and the machine is already powered on.
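
The interaction of these two points can be reproduced with a toy model. This is only a sketch: `FakeAmtMachine` is an invented stand-in and does not talk to AMT, and `driver_power_on` merely mirrors the branch in the driver snippet quoted earlier.

```python
class FakeAmtMachine:
    """Toy stand-in for a node; not real AMT or MAAS code."""

    def __init__(self):
        self.state = "off"
        self.boots = 0  # how many times the machine has started booting

    def query_state(self):
        return self.state

    def power_on(self, restart=False):
        # A cold start, or a restart of a running machine, begins a new boot.
        if restart or self.state == "off":
            self.boots += 1
        self.state = "on"


def driver_power_on(machine):
    """Mirrors the on/restart branch of the AMT driver snippet."""
    if machine.query_state() == "on":
        machine.power_on(restart=True)
    else:
        machine.power_on()


node = FakeAmtMachine()
driver_power_on(node)  # first request from MAAS: cold start
driver_power_on(node)  # second request: machine is on, so it is restarted
print(node.boots)
```

Two power-on requests give two boots, which matches the PXE installation starting over.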

Looking at other power drivers, it seems the expected behaviour of a power driver is to restart the machine if it's already powered on, so I take it that the issue is MAAS trying to power on the machine twice.

Please let me know if I am doing something incorrect.

Thanks

Will be available in 3.6.1. Or if you are brave enough you can use 3.6/edge

Already on 3.6/edge refreshed last night
I wouldn’t expect the node to power on at all if I were hitting that bug.

Also, the AMT version I am using uses wsman not amttool

I believe it’s worth capturing a tcpdump, extracting the logs and opening a bug.

I am experiencing the exact same problem with the Redfish and IPMI power drivers.
I also tested 3.6/edge, but the bug is still present.

what servers do you have? HPE?

Yes, the bug is present when I use HPE

Can you try comment #3 on MAAS bug #2112206?

No, I haven’t. I will give it a go, thanks.
Ideally I would like to use the Redfish power driver, so I hope this bug will be fixed in the near future.

If the comment above helps, then it’s not a MAAS bug but rather a stricter requirement of these newer firmware versions.

@r00ta - This bug affects multiple power drivers. @peterw71 confirms it for IPMI, and I have seen the same with the Intel AMT and Webhook drivers.
With the Webhook driver and maaspower, the maaspower logs make it clear that MAAS requested power on even when the power was already on.
All power drivers check whether the power is on; if it already is, they perform a reset.

Check the logs below

This is when deployment starts

  1. MAAS queries power state which is - Soft off
  2. MAAS requests power to be turned on
  3. MAAS queries power state again - Now running
  4. MAAS again requests power to be turned on
  5. MAAS queries again.

device: 192.168.20.101 command: query
EXECUTE command line: meshcmd AmtPower --host 192.168.20.101 --pass MyPAAS!123
Current power state: Soft off

response: status : stopped
192.168.20.2 - - [14/Jun/2025 10:40:12] "GET /maaspower/192.168.20.101/query HTTP/1.1" 200 -
device: 192.168.20.101 command: on
EXECUTE command line: meshcmd AmtPower --host 192.168.20.101 --pass MyPAAS!123 --poweron --bootdevice pxe
SUCCESS

response: None
192.168.20.2 - - [14/Jun/2025 10:40:25] "POST /maaspower/192.168.20.101/on HTTP/1.1" 200 -
device: 192.168.20.101 command: query
EXECUTE command line: meshcmd AmtPower --host 192.168.20.101 --pass MyPAAS!123
Current power state: Power on

response: status : running
192.168.20.2 - - [14/Jun/2025 10:40:46] "GET /maaspower/192.168.20.101/query HTTP/1.1" 200 -
device: 192.168.20.101 command: on
EXECUTE command line: meshcmd AmtPower --host 192.168.20.101 --pass MyPAAS!123 --poweron --bootdevice pxe
SUCCESS

response: None
192.168.20.2 - - [14/Jun/2025 10:40:56] "POST /maaspower/192.168.20.101/on HTTP/1.1" 200 -
device: 192.168.20.101 command: query
EXECUTE command line: meshcmd AmtPower --host 192.168.20.101 --pass MyPAAS!123
Current power state: Power on

response: status : running
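
The sequence above can also be checked mechanically. The following is a rough sketch that keeps only the `device:` and `response:` lines from the excerpt and flags any `on` command sent while the last reported status was `running`. The function name `redundant_power_ons` is illustrative only.

```python
import re

# Device and response lines trimmed from the maaspower log excerpt.
LOG = """\
device: 192.168.20.101 command: query
response: status : stopped
device: 192.168.20.101 command: on
response: None
device: 192.168.20.101 command: query
response: status : running
device: 192.168.20.101 command: on
response: None
device: 192.168.20.101 command: query
response: status : running
"""


def redundant_power_ons(log_text):
    """Return devices that received 'on' while last reported as running."""
    last_status = {}  # device -> most recent status from a query response
    current = None    # device named by the most recent command line
    flagged = []
    for line in log_text.splitlines():
        cmd = re.match(r"device: (\S+) command: (\w+)", line)
        if cmd:
            device, command = cmd.groups()
            current = device
            if command == "on" and last_status.get(device) == "running":
                flagged.append(device)
            continue
        status = re.match(r"response: status : (\w+)", line)
        if status and current is not None:
            last_status[current] = status.group(1)
    return flagged


print(redundant_power_ons(LOG))
```

Only the second `on` is flagged, since the first one arrives while the node is still reported as stopped.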

For IPMI please see the link I posted above

The Redfish power driver is also affected: it sends a restart on the second power-on request during deployment.
Big thanks to @shantur for the detailed bug information.

@r00ta -

I was looking through what may be causing this issue and I came across - https://github.com/canonical/maas/blob/master/workflows/deploy/deploy_flow.mmd

It mentions that there is a timeout that may lead to PowerCycle while deployment happens.

Looking at the code here -

Does this mean that the power state should transition within 30 seconds? In my logs, the second power request comes in after around 34 seconds.
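
To make the timing concrete, here is the arithmetic on the timestamps from the maaspower log posted earlier. It only compares the gaps against the 30-second value asked about above; whether the timeout is what actually triggers the second request is still a hypothesis at this point.

```python
from datetime import datetime, timedelta

# Request timestamps copied from the maaspower log (all on 14/Jun/2025).
FIRST_QUERY, FIRST_ON = "10:40:12", "10:40:25"
SECOND_QUERY, SECOND_ON = "10:40:46", "10:40:56"

# The 30-second value in question.
TIMEOUT = timedelta(seconds=30)


def gap(start, end):
    """Time elapsed between two HH:MM:SS stamps on the same day."""
    fmt = "%H:%M:%S"
    return datetime.strptime(end, fmt) - datetime.strptime(start, fmt)


query_gap = gap(FIRST_QUERY, SECOND_QUERY)
on_gap = gap(FIRST_ON, SECOND_ON)

print(query_gap.total_seconds(), query_gap > TIMEOUT)
print(on_gap.total_seconds(), on_gap > TIMEOUT)
```

The two queries are 34 seconds apart and the two `on` requests 31 seconds apart, so both gaps land just past a 30-second window.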

That makes sense. Thank you @shantur, I’ll send a patch.

@peterw71 while @r00ta is making the patch and getting it merged, I used the following to try the fix.

sudo cp /snap/maas/current/lib/python3.12/site-packages/maastemporalworker/workflow/deploy.py .
sudo vi deploy.py
# Change 
# DEFAULT_DEPLOY_ACTIVITY_TIMEOUT = timedelta(seconds=30) to DEFAULT_DEPLOY_ACTIVITY_TIMEOUT = timedelta(seconds=300)
# Save
sudo mount -o bind,ro deploy.py  /snap/maas/current/lib/python3.12/site-packages/maastemporalworker/workflow/deploy.py
sudo snap restart maas 

You should be able to deploy now.

PS: This change will be reverted when the host running MAAS reboots.
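
If you would rather not edit the copied file by hand, the same change can be scripted. This is a minimal sketch that assumes the constant appears on a single line exactly as quoted in the steps above; `raise_timeout` is a name made up for this example.

```python
import re

# Assumed form of the line, as quoted in the manual steps.
SAMPLE = "DEFAULT_DEPLOY_ACTIVITY_TIMEOUT = timedelta(seconds=30)\n"


def raise_timeout(source, seconds=300):
    """Return source with the deploy activity timeout set to `seconds`."""
    pattern = r"^(DEFAULT_DEPLOY_ACTIVITY_TIMEOUT\s*=\s*timedelta\(seconds=)\d+(\))"
    patched, count = re.subn(
        pattern, rf"\g<1>{seconds}\g<2>", source, flags=re.MULTILINE
    )
    # Refuse to guess if the file does not look the way we assumed.
    if count != 1:
        raise ValueError(f"expected exactly one match, found {count}")
    return patched


print(raise_timeout(SAMPLE), end="")
```

To use it on the copy, read `deploy.py` into a string, pass it through `raise_timeout`, and write the result back before doing the bind mount.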

For visibility, it is being worked on here - https://code.launchpad.net/~r00ta/maas/+git/maas/+merge/487647/+index?

@r00ta - Thanks for merging in the fix.
When should we expect this to be available in 3.6/edge ?

In the next 3-4 hours.