Failed deployment - Device or resource busy/No space left on device


#1

Hi,
from time to time a node deployment (CentOS 7) fails on MAAS 2.4.2.
The deployment log shows entries similar to these:

Sep 2 14:26:09 calm-mouse systemd[1]: Reached target ZFS startup target.
Sep 2 14:26:09 calm-mouse cloud-init[1642]: Processing triggers for man-db (2.8.3-2ubuntu0.1) …
Sep 2 14:26:11 calm-mouse cloud-init[1642]: Processing triggers for libc-bin (2.27-3ubuntu1) …
Sep 2 14:26:13 calm-mouse cloud-init[1642]: curtin: Installation started. (19.1-7-g37a7a0f4-0ubuntu1~18.04.1)
Sep 2 14:26:14 calm-mouse cloud-init[1642]: blockdev: ioctl error on BLKRRPART: Device or resource busy
Sep 2 14:26:15 calm-mouse kernel: [ 65.236879] EXT4-fs (sda1): mounted filesystem with ordered data mode. Opts: (null)
Sep 2 14:26:15 calm-mouse cloud-init[1642]: blockdev: ioctl error on BLKRRPART: Device or resource busy
Sep 2 14:26:16 calm-mouse cloud-init[1642]: --2019-09-02 14:26:16-- http://MAASCONTROLLER_IP:5248/images/centos/amd64/generic/centos70/daily/root-tgz
Sep 2 14:26:16 calm-mouse cloud-init[1642]: Connecting to MAASCONTROLLER_IP:5248… connected.
Sep 2 14:26:16 calm-mouse cloud-init[1642]: HTTP request sent, awaiting response… 200 OK
Sep 2 14:26:16 calm-mouse cloud-init[1642]: Length: 481532459 (459M) [text/html]
Sep 2 14:26:16 calm-mouse cloud-init[1642]: Saving to: ‘STDOUT’
Sep 2 14:26:17 calm-mouse cloud-init[1642]: 0K … … … … … … 0% 6.49M 70s
Sep 2 14:26:17 calm-mouse cloud-init[1642]: 3072K … … … … … … 1% 51.0M 39s
[…]
Sep 2 14:26:34 calm-mouse cloud-init[1642]: 362496K … … … … … … 77% 34.6M 5s
Sep 2 14:26:34 calm-mouse cloud-init[1642]: 365568K … … … …tar: ./boot/grub2/i386-pc/gdb.mod: Wrote only 9216 of 10240 bytes
Sep 2 14:26:34 calm-mouse cloud-init[1642]: tar: ./boot/grub2/i386-pc/blscfg.mod: Cannot write: No space left on device

So, the problem seems related to cloud-init.
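
To check quickly whether a failed node hit the same errors, a saved copy of the deployment log can be scanned for the messages shown above. This is only a minimal sketch: the function name `scan_log` and the signature labels are made up for illustration, and the patterns are copied from this one excerpt, so other failures may look different.

```python
import re

# Error signatures copied from the log excerpt above.
# The dictionary keys are illustrative labels, not MAAS or curtin terms.
SIGNATURES = {
    "device_busy": re.compile(r"ioctl error on BLKRRPART: Device or resource busy"),
    "short_write": re.compile(r"Wrote only \d+ of \d+ bytes"),
    "no_space": re.compile(r"No space left on device"),
}


def scan_log(lines):
    """Count how many log lines match each known error signature."""
    counts = {name: 0 for name in SIGNATURES}
    for line in lines:
        for name, pattern in SIGNATURES.items():
            if pattern.search(line):
                counts[name] += 1
    return counts
```

Passing an open file object (or any iterable of lines) to `scan_log` returns a dictionary of counts per signature, which makes it easier to compare failed and successful deployments across many nodes.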

If I retry the deployment, it runs successfully.
However, since I have to repurpose 20-30 nodes of an HPC cluster, retrying every failed deployment means two days of attempts.

I searched the MAAS and cloud-init bug trackers but found nothing apart from this CoreOS bug, which looks similar to the problem I hit:
https://github.com/coreos/bugs/issues/152

I suspect this is related to a timeout caused by a firmware upgrade (run via a MAAS commissioning script).
Is there a configurable timeout?

Has anyone else had this problem?
How did you resolve it?

Roberto