Maas-rackd hangs after a day in LXD container on rpi4

arif-ali · 20 February 2022 16:33

I probably have a unique setup, but I am hopeful someone can shed some light and get me some guidance.

I have an rpi4 now running impish, which hosts several LXD containers, including one for MAAS
MAAS runs as a snap
The DB was the maas-test-db snap, but it was recently migrated to a dedicated PostgreSQL setup

MAAS version: 2.9.2
OS: impish (in LXD container and on rpi4)
LXD version: latest/stable (currently 4.23)

Versions I have tried:

MAAS: 2.9.2, 3.0 and 3.1
OS: focal, groovy, hirsute, impish
DBs: maas-test-db snap and now dedicated postgres setup

I have bootstrapped MAAS from fresh ~5 times in the last 6 months, with similar issues.

My lab setup is such that I turn all machines off in the evening and turn them on the next working day using the MAAS CLI/python-libmaas. I only discover the issue in the morning, when none of the machines have started to boot.

In my most recent case, it stopped responding at 14:04 (based on logs) after 1.5 of approx uptime.

Below are some steps that I have tried in order to recover MAAS, which typically haven’t worked:

Killing the rackd process using kill
Running systemctl stop snap.maas.supervisor.service
Running snap stop maas
Stopping the LXD container, which never completes

Each time, I still see the following 2 processes, which I am unable to kill

root@maas:/var/snap/maas/common/log# ps -ef | grep rackd
root        4209       1 96 Feb18 ?        2-06:40:45 python3 /snap/maas/12552/sbin/rackd
root       29216       1 96 Feb18 ?        2-00:39:43 python3 /snap/maas/12552/sbin/rackd
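For anyone debugging a similar case, it can help to record what state the stuck processes are in, since a process blocked inside the kernel (state D) will not act on a kill until it returns. The following is only a minimal sketch that reads the standard Linux /proc/<pid>/status file; the helper names parse_state and describe_pid are hypothetical and not part of MAAS.

```python
from pathlib import Path

# Short explanations for the one-letter codes on the "State:" line
# of /proc/<pid>/status (standard Linux procfs format).
STATE_HINTS = {
    "R": "running (busy on CPU)",
    "S": "interruptible sleep",
    "D": "uninterruptible sleep (usually blocked in the kernel)",
    "Z": "zombie (exited, waiting to be reaped by its parent)",
    "T": "stopped",
}


def parse_state(status_text):
    """Return the one-letter state code from /proc/<pid>/status text."""
    for line in status_text.splitlines():
        if line.startswith("State:"):
            # The line looks like "State:\tD (disk sleep)".
            return line.split()[1]
    return None


def describe_pid(pid):
    """Read /proc/<pid>/status and return (state code, hint)."""
    text = Path(f"/proc/{pid}/status").read_text()
    code = parse_state(text)
    return code, STATE_HINTS.get(code, "unknown")
```

Calling describe_pid(4209) inside the container (or with the host-side PID on the rpi4) returns the state code and a short hint, which is useful to capture before rebooting.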

The final thing that I typically have to do is reboot the rpi4, and once rebooted everything works as per usual. So, at the moment during my working week, I have a cron job that reboots the rpi4 before my working day, so that the system is usable.

Below are the last few lines from rackd before it stopped responding, which I don’t think give much useful information.

Any thoughts or ideas on further debugging or resolving this issue would be appreciated.

alexsander-souza · 21 February 2022 19:35

Hi @arif-ali,

I don’t have a setup like this to test, but I can give you some ideas to debug:

MAAS is only supported on LTS releases, so stick to Focal
Check the free memory and available storage when the system hangs
Check the network configuration before and after the problem
Check /var/snap/maas/common/log/{regiond,rackd,maas}.log for anything unusual
Test if LXD is responsive (e.g. lxc list)
Test if the DB is responding (using psql or other similar tool)
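A few of the checks above can be collected in one go when the hang occurs. The following is a rough sketch using only the Python standard library; the function names, the default path and the 10-second timeout are assumptions chosen for illustration. It reads memory from /proc/meminfo, free disk space from shutil.disk_usage, and treats a command that does not finish in time as unresponsive. A psql invocation with your own connection details could be added to the same dictionary of commands.

```python
import shutil
import subprocess


def parse_meminfo(text):
    """Parse /proc/meminfo text into {field: number} (kB for most fields)."""
    values = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[name.strip()] = int(fields[0])
    return values


def free_disk_mib(path):
    """Return the free space on the filesystem holding path, in MiB."""
    return shutil.disk_usage(path).free // (1024 * 1024)


def responds(cmd, timeout=10):
    """Return True if cmd exits with status 0 within timeout seconds."""
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        # Command missing, not executable, or hung past the timeout.
        return False
    return result.returncode == 0


def snapshot(path="/", commands=None):
    """Collect memory, disk and responsiveness checks into one dict."""
    if commands is None:
        # Assumption: run on the LXD host, where the lxc client is available.
        commands = {"lxd": ["lxc", "list"]}
    with open("/proc/meminfo") as handle:
        mem = parse_meminfo(handle.read())
    report = {
        "mem_available_kb": mem.get("MemAvailable"),
        "free_disk_mib": free_disk_mib(path),
    }
    for name, cmd in commands.items():
        report[name + "_responds"] = responds(cmd)
    return report
```

Running snapshot() once while the system is healthy and again during the hang gives two dictionaries that can be compared directly.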

arif-ali · 11 March 2022 19:34

So, I had to restart the MAAS service today, and as I was restarting the services via snap restart maas, the main one failing was rackd.

Now, looking at the logs, I see the following

2022-03-11 15:12:16 -: [info] Received SIGTERM, shutting down.
2022-03-11 15:12:16 asyncio: [error] Exception in callback AsyncioSelectorReactor.callLater.<locals>.run() at /snap/maas/12552/usr/lib/python3/dist-packages/twisted/internet/asyncioreactor.py:287
handle: <TimerHandle when=26025.375752958 AsyncioSelectorReactor.callLater.<locals>.run() at /snap/maas/12552/usr/lib/python3/dist-packages/twisted/internet/asyncioreactor.py:287>
Traceback (most recent call last):
  File "/usr/lib/python3.8/asyncio/events.py", line 81, in _run
    self._context.run(self._callback, *self._args)
  File "/snap/maas/12552/usr/lib/python3/dist-packages/twisted/internet/asyncioreactor.py", line 290, in run
    f(*args, **kwargs)
  File "/snap/maas/12552/usr/lib/python3/dist-packages/twisted/internet/asyncioreactor.py", line 273, in stop
    super().stop()
  File "/snap/maas/12552/usr/lib/python3/dist-packages/twisted/internet/base.py", line 635, in stop
    raise error.ReactorNotRunning(
twisted.internet.error.ReactorNotRunning: Can't stop reactor that isn't running.

So, basically, the rackd daemon somehow hangs, and kill doesn’t work, as per my first comment.

Below is an excerpt from the journal logs:

Mar 11 15:12:58 maas systemd[1]: snap.maas.supervisor.service: State 'final-sigterm' timed out. Killing.
Mar 11 15:12:58 maas systemd[1]: snap.maas.supervisor.service: Killing process 11966 (python3) with signal SIGKILL.
Mar 11 15:12:58 maas systemd[1]: snap.maas.supervisor.service: Failed with result 'timeout'.
Mar 11 15:12:58 maas systemd[1]: snap.maas.supervisor.service: Unit process 11966 (python3) remains running after unit stopped.
Mar 11 15:12:58 maas systemd[1]: Stopped Service for snap application maas.supervisor.
Mar 11 15:12:58 maas systemd[1]: snap.maas.supervisor.service: Consumed 13h 2min 11.443s CPU time.
Mar 11 15:12:58 maas systemd[1]: snap.maas.supervisor.service: Found left-over process 11966 (python3) in control group while starting unit. Ignoring.
Mar 11 15:12:58 maas systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.

So, I wonder what is causing it not to stop properly.
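To track which processes systemd gave up on across several incidents, the PIDs can be pulled out of journal text like the excerpt above. This is only a sketch: the regular expression is based solely on the two message formats visible in that excerpt ("Unit process N (name) remains running" and "Found left-over process N (name)"), and the function name leftover_processes is hypothetical.

```python
import re

# Matches the two systemd messages seen in the journal excerpt that
# name a process still alive after the unit was stopped.
LEFTOVER = re.compile(
    r"(?:Unit process|Found left-over process) (\d+) \(([^)]+)\)"
)


def leftover_processes(journal_text):
    """Return sorted unique (pid, name) pairs reported as left running."""
    found = set()
    for match in LEFTOVER.finditer(journal_text):
        found.add((int(match.group(1)), match.group(2)))
    return sorted(found)
```

Feeding it the saved journal output returns a list of (pid, name) pairs, which can then be compared against ps output to confirm that the same process is still present.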

billwear · 15 March 2022 22:16

hey, @arif-ali,

kudos for the rPi4 work. fun little boxes, aren’t they?

i’ve asked someone with more Pi experience to take a look. were this not an rPi4 issue, i’d immediately classify it as a bug. you can certainly file one if you want, tho it might be rejected as an unsupported configuration.

i’ll let you know when i’ve gotten some kind of answer here.

arif-ali · 6 May 2022 07:04

So, it seems that after upgrading both the rpi4 and the LXD container to jammy, the issue is now resolved.

I now have an uptime of 10 days, and this morning (as well as on the last few days), python-libmaas was able to turn all my MAAS nodes on.
