Default cloud-init (/usr/lib/python3/dist-packages/cloudinit/config/cc_scripts_user.py) fails

snafuxnj · 11 October 2019 06:36

What could cause this? This machine boots with the default curtin/cloud-init configs but fails to deploy. I found the following error in /var/log/maas/rsyslog/<machine>/2019-10-11/messages:

2019-10-11T06:19:20+00:00 njdcsbcpu01 cloud-init[3191]: Cloud-init v. 18.5-45-g3554ffe8-0ubuntu1~16.04.1 running 'modules:final' at Fri, 11 Oct 2019 06:18:38 +0000. Up 147.33 seconds.
2019-10-11T06:19:20+00:00 njdcsbcpu01 cloud-init[3191]: 2019-10-11 06:19:20,647 - util.py[WARNING]: Failed running /var/lib/cloud/instance/scripts/part-001 [100]
2019-10-11T06:19:20+00:00 njdcsbcpu01 cloud-init[3191]: 2019-10-11 06:19:20,652 - cc_scripts_user.py[WARNING]: Failed to run module scripts-user (scripts in /var/lib/cloud/instance/scripts)
2019-10-11T06:19:20+00:00 njdcsbcpu01 cloud-init[3191]: 2019-10-11 06:19:20,661 - util.py[WARNING]: Running module scripts-user (<module 'cloudinit.config.cc_scripts_user' from '/usr/lib/python3/dist-packages/cloudinit/config/cc_scripts_user.py'>) failed . <<================= HERE
...
...
...
2019-10-11T06:19:20+00:00 njdcsbcpu01 cloud-init[3191]: Cloud-init v. 18.5-45-g3554ffe8-0ubuntu1~16.04.1 finished at Fri, 11 Oct 2019 06:19:20 +0000. Datasource DataSourceMAAS [http://192.168.2.1:5248/MAAS/metadata/curtin].  Up 188.82 seconds

I booted the failed machine into rescue mode and found that there are no scripts, directories, or files in /var/lib/cloud/instance/scripts.

On deploy, this machine was not passed any userdata, and all preseeds on the MAAS controller are still the defaults. The only thing I can think of that might be responsible is the package repository: it is hosted locally, inside a network with no authorized access to the internet.
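
For anyone triaging a similar log, here is a minimal Python sketch that pulls the failing script and its exit status out of cloud-init's "Failed running" warnings. The function name `failed_scripts` and the regular expression are illustrative assumptions, not part of cloud-init or MAAS. The status in brackets above is 100, which is the value apt-get documents for exiting on error, so an apt problem is one plausible reading, although a script could return 100 for other reasons.

```python
import re

# Matches cloud-init's "Failed running <path> [<exit status>]" warning.
# The pattern is an assumption based on the log lines quoted in this thread.
FAILED_SCRIPT = re.compile(r"Failed running (?P<path>\S+) \[(?P<code>-?\d+)\]")


def failed_scripts(log_lines):
    """Return (script_path, exit_status) pairs for every failed script."""
    results = []
    for line in log_lines:
        match = FAILED_SCRIPT.search(line)
        if match:
            results.append((match.group("path"), int(match.group("code"))))
    return results


sample = [
    "2019-10-11T06:19:20+00:00 njdcsbcpu01 cloud-init[3191]: 2019-10-11 06:19:20,647 - "
    "util.py[WARNING]: Failed running /var/lib/cloud/instance/scripts/part-001 [100]",
    "2019-10-11T06:19:20+00:00 njdcsbcpu01 cloud-init[3191]: 2019-10-11 06:19:20,652 - "
    "cc_scripts_user.py[WARNING]: Failed to run module scripts-user "
    "(scripts in /var/lib/cloud/instance/scripts)",
]

for path, code in failed_scripts(sample):
    print(f"{path} exited with status {code}")
```

Only the first sample line matches; the second is the module-level warning and carries no exit status.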

ltrager · 11 October 2019 18:04

Could you post the contents of /var/log/cloud-init.log and /var/log/cloud-init-output.log?

We also recently had a bug in cloud-init which caused deployments to fail if bonds or bridges were used. Try resyncing your images to ensure you have the fixed version.

snafuxnj · 11 October 2019 20:57

logs:

https://drive.google.com/open?id=1_3WWqJ-IDuIS0NwOPlWc6dRx-KRjDX7m

We’re using MAAS version 2.6.1-7832-g17912cdc9.

snafuxnj · 11 October 2019 22:10

Just checked our MAAS. It looks like we’re running the latest version of 2.6.1.

snafuxnj · 14 October 2019 19:32

I’m still trying to figure this out. I’m really not seeing a smoking gun here but I have a couple of clues.

The first one is the error I pasted at the beginning of this thread (see above). The second one might be:

Oct 14 16:57:42 njdcsbcpu02 systemd[1]: Started Execute cloud user/final scripts.
Oct 14 16:57:42 njdcsbcpu02 systemd[1]: Reached target Cloud-init target.
Oct 14 16:57:42 njdcsbcpu02 systemd[1]: Startup finished in 12.773s (kernel) + 2min 16.493s (userspace) = 2min 29.267s.
Oct 14 17:03:18 njdcsbcpu02 ntpd[3184]: kernel reports TIME_ERROR: 0x41: Clock Unsynchronized
Oct 14 17:10:48 njdcsbcpu02 systemd[1]: Starting Cleanup of Temporary Directories...

Could NTP be causing this issue?

snafuxnj · 14 October 2019 22:25

Until today, I thought this might just be a red herring, but I’m now wondering whether the LD_PRELOAD error that occurs when libeatmydata is loaded might be the root cause. I was already seeing this error often before machines stopped deploying, though, so I’m not yet convinced. I’m also attaching a video of the machine booting. It doesn’t seem to get past booting the ephemeral image.

https://drive.google.com/open?id=1p1_SPKHD6vfooHWBK1wRFJoqymBmY0Qb

snafuxnj · 15 October 2019 22:16

I believe I found the root cause. It turns out that MAAS runs explicit apt install commands during stage 1 boot, and the internal apt repo I put together did not have a completely valid package index. Once that was fixed, everything worked. This is a preliminary diagnosis, though, and I’m still working out the kinks. If anything else comes up, I’ll post my findings here.
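
The thread does not say what exactly was wrong with the index, so the following is only a rough sketch of the kind of sanity check that can catch a broken one: it parses a Packages file and verifies that every stanza has the expected fields and points at a .deb whose size and SHA256 match. The names `parse_stanzas`, `check_index`, and `REQUIRED_FIELDS` are hypothetical, and the list of required fields is an assumption chosen for illustration rather than the full rule set apt enforces.

```python
import hashlib
import re
from pathlib import Path

# Assumed minimum set of fields for this sketch; apt's real requirements may differ.
REQUIRED_FIELDS = ("Package", "Version", "Architecture", "Filename", "Size", "SHA256")


def parse_stanzas(text):
    """Split a Packages index into dicts, one per blank-line-separated stanza."""
    stanzas = []
    for block in re.split(r"\n\s*\n", text.strip()):
        fields, key = {}, None
        for line in block.splitlines():
            if line[:1] in (" ", "\t") and key:
                # Continuation of the previous field (e.g. a long Description).
                fields[key] += "\n" + line.strip()
            elif ":" in line:
                key, value = line.split(":", 1)
                fields[key] = value.strip()
        if fields:
            stanzas.append(fields)
    return stanzas


def check_index(repo_root, packages_text):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    for stanza in parse_stanzas(packages_text):
        name = stanza.get("Package", "<unnamed>")
        missing = [f for f in REQUIRED_FIELDS if f not in stanza]
        if missing:
            problems.append(f"{name}: missing fields {missing}")
            continue
        deb = Path(repo_root) / stanza["Filename"]
        if not deb.is_file():
            problems.append(f"{name}: {stanza['Filename']} not found")
            continue
        data = deb.read_bytes()
        if len(data) != int(stanza["Size"]):
            problems.append(f"{name}: size mismatch")
        if hashlib.sha256(data).hexdigest() != stanza["SHA256"]:
            problems.append(f"{name}: SHA256 mismatch")
    return problems
```

To use it, pass the directory that the Filename entries are relative to, together with the text of the uncompressed Packages file. It does not verify the Release file or its signature, which apt also checks.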