I probably have a unique setup, but I am hopeful someone can shed some light on this and give me some guidance.
- I have an rpi4 now running impish, which hosts several LXD containers, including MAAS
- MAAS runs as a SNAP
- The DB was the maas-test-db snap, but this was recently migrated to a dedicated postgresql setup
MAAS version: 2.9.2
OS: impish (in LXD container and on rpi4)
LXD version: latest/stable (currently 4.23)
Versions I have tried:
MAAS: 2.9.2, 3.0 and 3.1
OS: focal, groovy, hirsute, impish
DBs: maas-test-db snap and now a dedicated postgres setup
I have bootstrapped MAAS from fresh ~5 times in the last 6 months, with similar issues.
In my lab setup I turn all machines off in the evening and turn them on again the next working day using the MAAS CLI/python-libmaas. I only discover the issue in the morning, when none of the machines have started to boot.
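To catch the hang before the morning power-on, a small probe could be run periodically from another host. This is only a minimal sketch: the URL is an assumption (5240 is the default MAAS port, the host name is hypothetical), and `is_responsive` is an illustrative helper, not part of MAAS or python-libmaas.

```python
import http.client
import urllib.request

# Assumed address of the MAAS region; replace with the real host.
MAAS_URL = "http://maas.example:5240/MAAS/"


def is_responsive(url, timeout=10.0, opener=urllib.request.urlopen):
    """Return True if an HTTP GET on url completes within timeout seconds.

    opener is injectable so the check can be exercised without a network.
    """
    try:
        with opener(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (OSError, http.client.HTTPException):
        # Covers connection refused, timeouts, HTTP error statuses
        # and malformed responses.
        return False
```

Calling `is_responsive(MAAS_URL)` from a cron job and logging the first failure would at least narrow down when the service stops answering.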
In my most recent case, it stopped responding at 14:04 (based on logs) after 1.5 of approx uptime.
Below are some steps that I have tried in order to recover MAAS, which typically haven't worked:
- killing the rackd process using kill
- using systemctl stop snap.maas.supervisor.service and snap stop maas
- stopping the LXD container, which never actually stops
Each time, I still see the following 2 processes, which I am unable to kill:
root@maas:/var/snap/maas/common/log# ps -ef | grep rackd
root 4209 1 96 Feb18 ? 2-06:40:45 python3 /snap/maas/12552/sbin/rackd
root 29216 1 96 Feb18 ? 2-00:39:43 python3 /snap/maas/12552/sbin/rackd
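One thing that may be worth checking is the kernel state of those two processes. On Linux, a process in uninterruptible sleep (state D) does not act on signals, including SIGKILL, until it leaves that state, and a zombie (state Z) is already dead and only waiting to be reaped. The sketch below reads the state from /proc/<pid>/stat; `parse_state` and `describe` are hypothetical helper names, and whether rackd is actually in one of these states here is not known.

```python
from pathlib import Path

# Common single-letter process states as reported by the kernel.
STATE_NAMES = {
    "R": "running",
    "S": "interruptible sleep",
    "D": "uninterruptible sleep (signals are deferred)",
    "Z": "zombie (waiting to be reaped by its parent)",
    "T": "stopped",
}


def parse_state(stat_line: str) -> str:
    """Extract the state letter from the contents of /proc/<pid>/stat.

    The format is "pid (comm) state ...". comm may itself contain spaces
    or parentheses, so split on the last ')' rather than on whitespace.
    """
    return stat_line.rsplit(")", 1)[1].split()[0]


def describe(pid: int) -> str:
    """Return a short human-readable description of a process state."""
    state = parse_state(Path(f"/proc/{pid}/stat").read_text())
    return f"{pid}: {state} ({STATE_NAMES.get(state, 'other')})"
```

For the processes above this would be called as `describe(4209)` and `describe(29216)` inside the container.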
The final thing that I typically have to do is reboot the rpi4, and once rebooted everything works as per usual. So, at the moment during my working week, I have a cron job that reboots the rpi4 before my working day, so that the system is usable.
Below are the last few lines from the rackd log before it stopped responding, which I don't think give much useful information.
Any thoughts or ideas on further debugging or on resolving this issue would be appreciated.