Clear cache? (/var/snap/maas/common/maas/boot-resources/cache)

jeffberrymrc · 26 November 2024 11:20

I’ve got an issue quite similar to this one from a few years ago -

Essentially, my cache is now about 20GB and bumping into my file system size. It looks like that’s causing the rack sync to hang.

I suspect that it’s not deleting/removing obsolete images from the cache, even if I bin them from the webui.

Is there a way to clean the cache?

Kind regards,
Jeff Berry, MRC Cognition and Brain Sciences Unit
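
For anyone wanting to confirm how much of the file system the cache is using before deciding what to do, a minimal sketch follows. It only measures; it does not delete anything. The cache path is the one from the thread title, and the helper names `dir_size_bytes` and `cache_report` are made up for illustration.

```python
import os
import shutil


def dir_size_bytes(path):
    """Sum the sizes of regular files under path, skipping symlinks."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            try:
                if not os.path.islink(full):
                    total += os.path.getsize(full)
            except OSError:
                # File vanished or is unreadable; ignore it for the estimate.
                pass
    return total


def cache_report(path):
    """Return cache size next to the totals for the file system holding it."""
    usage = shutil.disk_usage(path)
    return {
        "cache_bytes": dir_size_bytes(path),
        "fs_total_bytes": usage.total,
        "fs_free_bytes": usage.free,
    }


if __name__ == "__main__":
    # Path taken from the thread title; adjust if your install differs.
    CACHE = "/var/snap/maas/common/maas/boot-resources/cache"
    if os.path.isdir(CACHE):
        print(cache_report(CACHE))
```

Reading the snap's directories may need root, so the script would likely have to be run with sudo.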

r00ta · 26 November 2024 13:00

Hi Jeff,

3.5 has a new mechanism to handle the images between the regions and racks. It will solve your issue by design, so I’d suggest upgrading to 3.5.

jeffberrymrc · 26 November 2024 13:17

Excellent. I’m on 3.3.9, so I’ll give that a try. Thanks …

Jeff

jeffberrymrc · 27 November 2024 12:13

I upgraded to 3.5.2 and my images (both larger than 2GB) were hanging at Queued for Download. I rebuilt /var to give the snap more room to work, but they’re still hanging.

The webui is up and running, and the snap connections look OK, I think.

Interface              Plug                        Slot                    Notes
avahi-observe          maas:avahi-observe          :avahi-observe          -
content                -                           maas:maas-logs          -
content                maas:test-db-socket         -                       -
hardware-observe       maas:hardware-observe       :hardware-observe       -
home                   maas:home                   :home                   -
kernel-module-observe  maas:kernel-module-observe  :kernel-module-observe  -
mount-observe          maas:mount-observe          :mount-observe          -
network                maas:network                :network                -
network-bind           maas:network-bind           :network-bind           -
network-control        maas:network-control        :network-control        -
network-observe        maas:network-observe        :network-observe        -
snap-refresh-control   maas:snap-refresh-control   :snap-refresh-control   -
system-observe         maas:system-observe         :system-observe         -
time-control           maas:time-control           :time-control           -

But when I try to load an image, it times out with [Errno 111] Connection refused.
There’s something else funny going on, I think. Since I rebuilt and moved /var, the logs in /var/snap/maas/common/log aren’t being written - they may be being written to the regular syslog.

The snap was throwing some apparmor errors, but setting it to complain didn’t do the trick. I’m not terribly familiar with snap - did moving /var break the snap somehow, or is there something else going on?
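
An `[Errno 111] Connection refused` means nothing accepted the TCP connection on the target port, so one quick check is whether the expected ports are listening at all. The sketch below is an assumption-laden example: the thread does not say which endpoint refused the upload, so the host and ports shown (5252 and 5253, the regiond RPC ports that appear in the log excerpt later in this thread) are placeholders to replace with whatever your client is actually calling. `port_open` is a hypothetical helper name.

```python
import socket


def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds, else False."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers ConnectionRefusedError (errno 111), timeouts and DNS errors.
        return False


if __name__ == "__main__":
    # Placeholder host and ports; substitute the endpoint that is failing.
    for port in (5252, 5253):
        state = "open" if port_open("127.0.0.1", port) else "closed/refused"
        print(f"127.0.0.1:{port} {state}")
```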

r00ta · 27 November 2024 12:53

wdym?

Could you run df -h on your region controller? Also, if you could extract the regiond logs (see here) we might understand if there is anything wrong.

jeffberrymrc · 27 November 2024 13:32

Surely -

Filesystem             Size  Used Avail Use% Mounted on
tmpfs                   26G  3.0M   26G   1% /run
efivarfs               304K  193K  107K  65% /sys/firmware/efi/efivars
/dev/mapper/vg0-lv--0  373G   42G  312G  12% /
tmpfs                  126G  1.1M  126G   1% /dev/shm
tmpfs                  5.0M     0  5.0M   0% /run/lock
tmpfs                  126G     0  126G   0% /run/qemu
/dev/mapper/vg0-lv--1   20G  183M   19G   1% /boot
/dev/mapper/vg0-lv--4  4.9G  735M  3.9G  16% /var/log
/dev/sda1              1.1G  6.2M  1.1G   1% /boot/efi
/dev/mapper/vg0-lv--2   15G  5.3G  8.7G  38% /home
/dev/mapper/vg0-lv--5  2.0G   48K  1.8G   1% /var/tmp
/dev/mapper/vg0-lv--3   20G   11G  7.7G  59% /var.old
/dev/sda2              1.5T   86G  1.4T   6% /install
tmpfs                   26G   16K   26G   1% /run/user/1000

So /var is now just part of / with oodles of room.

The regiond logs look like they’re proxying through pebble since the upgrade - is that the default for 3.5, btw? Here’s a pull from syslog. I can pull more if needed. There’s a ‘lost burst connection’ error showing up.

Nov 27 13:02:04 lsr-cluster-01 maas-regiond[4504]: RegionServer,9,::ffff:172.31.120.3: [info] RegionServer connection lost (HOST:IPv6Address(type='TCP', host='::ffff:172.31.120.3', port=5252, flowInfo=0, scopeID=0) PEER:IPv6Address(type='TCP', host='::ffff:172.31.120.3', port=42500, flowInfo=0, scopeID=0))
Nov 27 13:02:04 lsr-cluster-01 maas-regiond[4467]: maasserver.ipc: [info] Worker pid:4504 lost burst connection to ('172.31.120.3', 5252).
Nov 27 13:02:04 lsr-cluster-01 maas-regiond[4504]: RegionServer,10,::ffff:172.31.120.3: [info] RegionServer connection lost (HOST:IPv6Address(type='TCP', host='::ffff:172.31.120.3', port=5252, flowInfo=0, scopeID=0) PEER:IPv6Address(type='TCP', host='::ffff:172.31.120.3', port=42506, flowInfo=0, scopeID=0))
Nov 27 13:02:04 lsr-cluster-01 maas-regiond[4467]: maasserver.ipc: [info] Worker pid:4504 lost burst connection to ('172.31.120.3', 5252).
Nov 27 13:02:24 lsr-cluster-01 maas-regiond[4504]: twisted.internet.protocol.Factory: [info] RegionServer connection established (HOST:IPv6Address(type='TCP', host='::ffff:172.31.120.3', port=5252, flowInfo=0, scopeID=0) PEER:IPv6Address(type='TCP', host='::ffff:172.31.120.3', port=52780, flowInfo=0, scopeID=0))
Nov 27 13:02:24 lsr-cluster-01 maas-regiond[4504]: twisted.internet.protocol.Factory: [info] RegionServer connection established (HOST:IPv6Address(type='TCP', host='::ffff:172.31.120.3', port=5252, flowInfo=0, scopeID=0) PEER:IPv6Address(type='TCP', host='::ffff:172.31.120.3', port=52782, flowInfo=0, scopeID=0))
Nov 27 13:02:24 lsr-cluster-01 maas-regiond[4504]: maasserver.rpc.regionservice: [info] Rack controller authenticated from '::ffff:172.31.120.3:52780'.
Nov 27 13:02:24 lsr-cluster-01 maas-regiond[4504]: maasserver.rpc.regionservice: [info] Connection 39a32cb5-9abe-44c0-a278-59d708534db7 is trusted and ready to respond/serve commands.
Nov 27 13:02:24 lsr-cluster-01 maas-regiond[4504]: maasserver.rpc.regionservice: [info] Rack controller authenticated from '::ffff:172.31.120.3:52782'.
Nov 27 13:02:24 lsr-cluster-01 maas-regiond[4504]: maasserver.rpc.regionservice: [info] Connection f588ceba-d23b-4be9-9929-c744ced8a754 is trusted and ready to respond/serve commands.
Nov 27 13:02:25 lsr-cluster-01 maas-regiond[4467]: maasserver.ipc: [info] Worker pid:4504 registered RPC connection to ('d6dgg8', '172.31.120.3', 5252).
Nov 27 13:02:26 lsr-cluster-01 maas-regiond[4467]: maasserver.ipc: [info] Worker pid:4504 registered RPC connection to ('d6dgg8', '172.31.120.3', 5252).
Nov 27 13:02:34 lsr-cluster-01 maas-regiond[4504]: RegionServer,11,::ffff:172.31.120.3: [info] RegionServer connection lost (HOST:IPv6Address(type='TCP', host='::ffff:172.31.120.3', port=5252, flowInfo=0, scopeID=0) PEER:IPv6Address(type='TCP', host='::ffff:172.31.120.3', port=42924, flowInfo=0, scopeID=0))
Nov 27 13:02:34 lsr-cluster-01 maas-regiond[4467]: maasserver.ipc: [info] Worker pid:4504 lost burst connection to ('172.31.120.3', 5252).
Nov 27 13:02:34 lsr-cluster-01 maas-regiond[4504]: RegionServer,12,::ffff:172.31.120.3: [info] RegionServer connection lost (HOST:IPv6Address(type='TCP', host='::ffff:172.31.120.3', port=5252, flowInfo=0, scopeID=0) PEER:IPv6Address(type='TCP', host='::ffff:172.31.120.3', port=42934, flowInfo=0, scopeID=0))
Nov 27 13:02:34 lsr-cluster-01 maas-regiond[4467]: maasserver.ipc: [info] Worker pid:4504 lost burst connection to ('172.31.120.3', 5252).
Nov 27 13:02:35 lsr-cluster-01 maas-regiond[4500]: regiond: [info] 127.0.0.1 GET /MAAS/rpc/ HTTP/1.1 --> 200 OK (referrer: -; agent: provisioningserver.rpc.clusterservice.ClusterClientService)
Nov 27 13:02:54 lsr-cluster-01 maas-regiond[4504]: twisted.internet.protocol.Factory: [info] RegionServer connection established (HOST:IPv6Address(type='TCP', host='::ffff:172.31.120.3', port=5252, flowInfo=0, scopeID=0) PEER:IPv6Address(type='TCP', host='::ffff:172.31.120.3', port=55176, flowInfo=0, scopeID=0))
Nov 27 13:02:54 lsr-cluster-01 maas-regiond[4504]: twisted.internet.protocol.Factory: [info] RegionServer connection established (HOST:IPv6Address(type='TCP', host='::ffff:172.31.120.3', port=5252, flowInfo=0, scopeID=0) PEER:IPv6Address(type='TCP', host='::ffff:172.31.120.3', port=55182, flowInfo=0, scopeID=0))
Nov 27 13:02:54 lsr-cluster-01 maas-regiond[4504]: maasserver.rpc.regionservice: [info] Rack controller authenticated from '::ffff:172.31.120.3:55176'.
Nov 27 13:02:54 lsr-cluster-01 maas-regiond[4504]: maasserver.rpc.regionservice: [info] Connection 7aed6894-9d2d-446e-a971-6be1972fdfa9 is trusted and ready to respond/serve commands.
Nov 27 13:02:54 lsr-cluster-01 maas-regiond[4504]: maasserver.rpc.regionservice: [info] Rack controller authenticated from '::ffff:172.31.120.3:55182'.
Nov 27 13:02:54 lsr-cluster-01 maas-regiond[4504]: maasserver.rpc.regionservice: [info] Connection e5aaae5d-8cb7-4a3d-a0fe-222d27c8e45b is trusted and ready to respond/serve commands.
Nov 27 13:02:56 lsr-cluster-01 maas-regiond[4467]: maasserver.ipc: [info] Worker pid:4504 registered RPC connection to ('d6dgg8', '172.31.120.3', 5252).
Nov 27 13:03:34 lsr-cluster-01 maas-regiond[4506]: RegionServer,10,::ffff:172.31.120.3: [info] RegionServer connection lost (HOST:IPv6Address(type='TCP', host='::ffff:172.31.120.3', port=5253, flowInfo=0, scopeID=0) PEER:IPv6Address(type='TCP', host='::ffff:172.31.120.3', port=54106, flowInfo=0, scopeID=0))
Nov 27 13:03:34 lsr-cluster-01 maas-regiond[4467]: maasserver.ipc: [info] Worker pid:4506 lost burst connection to ('172.31.120.3', 5253).
Nov 27 13:03:34 lsr-cluster-01 maas-regiond[4506]: RegionServer,11,::ffff:172.31.120.3: [info] RegionServer connection lost (HOST:IPv6Address(type='TCP', host='::ffff:172.31.120.3', port=5253, flowInfo=0, scopeID=0) PEER:IPv6Address(type='TCP', host='::ffff:172.31.120.3', port=54114, flowInfo=0, scopeID=0))
Nov 27 13:03:34 lsr-cluster-01 maas-regiond[4467]: maasserver.ipc: [info] Worker pid:4506 lost burst connection to ('172.31.120.3', 5253).
Nov 27 13:03:54 lsr-cluster-01 maas-regiond[4506]: twisted.internet.protocol.Factory: [info] RegionServer connection established (HOST:IPv6Address(type='TCP', host='::ffff:172.31.120.3', port=5253, flowInfo=0, scopeID=0) PEER:IPv6Address(type='TCP', host='::ffff:172.31.120.3', port=58066, flowInfo=0, scopeID=0))
Nov 27 13:03:54 lsr-cluster-01 maas-regiond[4506]: twisted.internet.protocol.Factory: [info] RegionServer connection established (HOST:IPv6Address(type='TCP', host='::ffff:172.31.120.3', port=5253, flowInfo=0, scopeID=0) PEER:IPv6Address(type='TCP', host='::ffff:172.31.120.3', port=58074, flowInfo=0, scopeID=0))
Nov 27 13:03:55 lsr-cluster-01 maas-regiond[4506]: maasserver.rpc.regionservice: [info] Rack controller authenticated from '::ffff:172.31.120.3:58074'.
Nov 27 13:03:55 lsr-cluster-01 maas-regiond[4506]: maasserver.rpc.regionservice: [info] Connection 75e59673-b975-48a1-a30e-8942889ac771 is trusted and ready to respond/serve commands.
Nov 27 13:03:55 lsr-cluster-01 maas-regiond[4467]: maasserver.ipc: [info] Worker pid:4506 registered RPC connection to ('d6dgg8', '172.31.120.3', 5253).
Nov 27 13:03:56 lsr-cluster-01 maas-regiond[4506]: maasserver.rpc.regionservice: [info] Rack controller authenticated from '::ffff:172.31.120.3:58066'.
Nov 27 13:03:56 lsr-cluster-01 maas-regiond[4506]: maasserver.rpc.regionservice: [info] Connection 89ab068e-16e8-4492-9467-ade990ff8f46 is trusted and ready to respond/serve commands.
Nov 27 13:03:56 lsr-cluster-01 maas-regiond[4467]: maasserver.ipc: [info] Worker pid:4506 registered RPC connection to ('d6dgg8', '172.31.120.3', 5253).
Nov 27 13:04:04 lsr-cluster-01 maas-regiond[4506]: RegionServer,12,::ffff:172.31.120.3: [info] RegionServer connection lost (HOST:IPv6Address(type='TCP', host='::ffff:172.31.120.3', port=5253, flowInfo=0, scopeID=0) PEER:IPv6Address(type='TCP', host='::ffff:172.31.120.3', port=57598, flowInfo=0, scopeID=0))
Nov 27 13:04:04 lsr-cluster-01 maas-regiond[4467]: maasserver.ipc: [info] Worker pid:4506 lost burst connection to ('172.31.120.3', 5253).
Nov 27 13:04:04 lsr-cluster-01 maas-regiond[4506]: RegionServer,13,::ffff:172.31.120.3: [info] RegionServer connection lost (HOST:IPv6Address(type='TCP', host='::ffff:172.31.120.3', port=5253, flowInfo=0, scopeID=0) PEER:IPv6Address(type='TCP', host='::ffff:172.31.120.3', port=57600, flowInfo=0, scopeID=0))
Nov 27 13:04:04 lsr-cluster-01 maas-regiond[4467]: maasserver.ipc: [info] Worker pid:4506 lost burst connection to ('172.31.120.3', 5253).
Nov 27 13:04:24 lsr-cluster-01 maas-regiond[4506]: twisted.internet.protocol.Factory: [info] RegionServer connection established (HOST:IPv6Address(type='TCP', host='::ffff:172.31.120.3', port=5253, flowInfo=0, scopeID=0) PEER:IPv6Address(type='TCP', host='::ffff:172.31.120.3', port=33934, flowInfo=0, scopeID=0))
Nov 27 13:04:24 lsr-cluster-01 maas-regiond[4506]: twisted.internet.protocol.Factory: [info] RegionServer connection established (HOST:IPv6Address(type='TCP', host='::ffff:172.31.120.3', port=5253, flowInfo=0, scopeID=0) PEER:IPv6Address(type='TCP', host='::ffff:172.31.120.3', port=33948, flowInfo=0, scopeID=0))
Nov 27 13:04:24 lsr-cluster-01 maas-regiond[4506]: maasserver.rpc.regionservice: [info] Rack controller authenticated from '::ffff:172.31.120.3:33934'.
Nov 27 13:04:24 lsr-cluster-01 maas-regiond[4506]: maasserver.rpc.regionservice: [info] Connection 9c79fbe1-4dba-42c0-89b8-6b9bb6304e70 is trusted and ready to respond/serve commands.
Nov 27 13:04:24 lsr-cluster-01 maas-regiond[4506]: maasserver.rpc.regionservice: [info] Rack controller authenticated from '::ffff:172.31.120.3:33948'.
Nov 27 13:04:24 lsr-cluster-01 maas-regiond[4506]: maasserver.rpc.regionservice: [info] Connection 91f0e45c-8cf9-4ea0-a2f9-2561a8512700 is trusted and ready to respond/serve commands.

(And thank you!)
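
To see at a glance whether the "lost burst connection" messages in an excerpt like the one above are balanced by re-registrations, a rough tally per port can help. This is only a sketch: it assumes syslog-style lines that start with a `Mon DD HH:MM:SS` stamp, rejoins any hard-wrapped continuation lines first, and matches the two `maasserver.ipc` message shapes seen in this excerpt. The function names are invented for the example.

```python
import re
from collections import Counter

# A new entry starts with a syslog timestamp such as "Nov 27 13:02:04 ".
STAMP = re.compile(r"^[A-Z][a-z]{2} [ \d]\d \d{2}:\d{2}:\d{2} ")
# The two maasserver.ipc messages in the excerpt; the port is the last tuple item.
EVENT = re.compile(r"(lost burst|registered RPC) connection to \(.*, (\d+)\)")


def unwrap(text):
    """Rejoin lines that a fixed-width terminal split in the middle of an entry."""
    entries = []
    for line in text.splitlines():
        if STAMP.match(line) or not entries:
            entries.append(line)
        else:
            entries[-1] += line
    return entries


def count_events(text):
    """Count (event, port) pairs across all log entries in text."""
    counts = Counter()
    for entry in unwrap(text):
        match = EVENT.search(entry)
        if match:
            counts[(match.group(1), int(match.group(2)))] += 1
    return counts
```

Feeding it the pasted syslog text, for example `count_events(open("syslog-excerpt.txt").read())` with a file name of your choosing, returns a `Counter` keyed by event type and port.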

r00ta · 27 November 2024 13:51

These are looking good. How many regions and racks do you have in your environment? What images are stuck in the Downloading state? Have you tried to stop/start the download process in the UI?

jeffberrymrc · 27 November 2024 14:10

Just one controller - region+rack. Two images stuck downloading, both custom .tgz. One 2.3 GB, one 2.9 GB. I was wondering about the size, because a couple of <1 GB images did sync.

If I delete the big one via the webUI, it disappears. Then if I kick it off from the CLI with ‘maas admin boot-resources,’ it shows up again as ‘Queued for download’ and then … it worked this time.

Bloody hell. I wonder if the apparmor was causing the issue? When I tried earlier, it failed with the err 111 timeout. But maybe I didn’t try again after disabling apparmor on the snap.

jeffberrymrc · 27 November 2024 15:17

Slightly different problem - releasing machines is now failing, sort of. They are being released, but marked as release failed. If I mark them broken, and then as fixed, I can deploy to them.

 Wed, 27 Nov. 2024 14:57:32	Node status event - 'cloudinit' running modules for final
 Wed, 27 Nov. 2024 14:57:32	Marking node failed - Failed to release machine.
 Wed, 27 Nov. 2024 14:57:32	Node changed status - From 'Releasing' to 'Releasing failed'
 Wed, 27 Nov. 2024 14:57:32	Node status event - 'cloudinit' running config-scripts_user with frequency once-per-instance
 Wed, 27 Nov. 2024 14:57:30	Script result - wipe-disks changed status from 'Installing dependencies' to 'Running'
 Wed, 27 Nov. 2024 14:57:30	Script result - wipe-disks changed status from 'Running' to 'Passed'
 Wed, 27 Nov. 2024 14:57:30	User releasing node
 Wed, 27 Nov. 2024 14:57:30	Releasing

This fails after a redeploy and rerelease. There’s a workaround, as I say, but it’s odd.

r00ta · 27 November 2024 15:20

I’d say to look at the regiond logs as there might be additional info about the failure

jeffberrymrc · 28 November 2024 09:37

Looks like it might be the fact that we’ve got a self-signed certificate. At least, there’s a ‘SSL: CERTIFICATE_VERIFY_FAILED’ error. Releasing was working in 3.3.9 with this certificate, and the UI is still working with https. Deploys are working, and there’s a workaround for release.

I suppose I’d better check commissioning.

FOLLOWUP - Commissioning seems to work.
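
One way to reproduce a `CERTIFICATE_VERIFY_FAILED` outside MAAS is to attempt a verified TLS handshake from Python against the same endpoint, first with the system trust store and then with the self-signed certificate supplied as a CA file. The sketch below is an assumption, not a description of what MAAS does internally during release; the host name, port and certificate path in the usage lines are placeholders, and `verify_tls` is a hypothetical helper.

```python
import socket
import ssl


def verify_tls(host, port, cafile=None, timeout=5.0):
    """Attempt a verified TLS handshake and describe the outcome as a string."""
    # With cafile=None the system trust store is used; otherwise only cafile.
    ctx = ssl.create_default_context(cafile=cafile)
    try:
        with socket.create_connection((host, port), timeout=timeout) as raw:
            with ctx.wrap_socket(raw, server_hostname=host):
                return "ok"
    except ssl.SSLCertVerificationError as exc:
        # Raised when the chain or host name cannot be verified.
        return f"verify failed: {exc.verify_message}"
    except OSError as exc:
        # Refused connections, timeouts and other TLS or socket errors.
        return f"connection error: {exc}"


if __name__ == "__main__":
    # Placeholder endpoint and certificate path; replace with your own.
    print(verify_tls("maas.example.internal", 443))
    print(verify_tls("maas.example.internal", 443, cafile="/path/to/maas-cert.pem"))
```

If the first call reports a verification failure and the second returns "ok", the certificate itself is usable and the failing client simply does not trust it.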