Erase is not actually...erasing

snafuxnj · 15 August 2019 20:25

maas 2.5.3

Our nodes were set to quick erase.

About a week ago I noticed that our nodes were deploying, but wonky things were happening with software deployments via cloud-init; things I’d never seen before. I chalked the issue up to our software, which is in alpha, and put the project down.

Yesterday I picked the project back up and started digging into it. I learned that cloud-init was failing to run, which led me to the problem: the machines don’t do anything during quick erase. In other words, the last image written is still on the machine after a release/deploy, so cloud-init, which has already run on that image, does nothing to deploy our software.

So, I have tried a couple of things. I tried switching from quick erase to full erase. No dice. I wondered if it might be the fact that we’re using LVM storage, so I changed it to flat, and that didn’t work either (still using full erase).

Basically, nothing is working. So I watched the console of an erase session. DHCP is failing to serve up whatever image is used for erasing, I think, so the machine is falling back to the installed OS, which is why we are still seeing the old OS every reboot.

Any idea how to approach this?

dvnt · 16 August 2019 00:37

Hey @snafuxnj

I noticed a similar thing in my lab.
I have some older hardware where I use IPMI to hook into the BMC.
What I noticed is that on ‘release’ MAAS doesn’t set the next boot device to PXE, so the host just boots whatever is on disk. After a while MAAS reports ‘disk erasing failed’ because the host never checks in.
I hacked up a workaround by running a loop in the background while the server rebooted. This forces the BMC to set the next boot to PXE:

while true ; do ipmitool -I lanplus -H 192.168.123.25 -U Administrator -P yourpassword chassis bootdev pxe ; sleep 10; done
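A bounded variant of the same loop might look like the sketch below, so it stops on its own instead of running until killed. This is only an illustration: the function name `force_pxe` and the variables `IPMITOOL`, `BMC_HOST` and `BMC_USER` are made up here, and the password is assumed to be supplied through the `IPMI_PASSWORD` environment variable, which ipmitool reads when given `-E`.

```shell
#!/bin/sh
# Sketch: bounded version of the "keep forcing PXE" loop.
# IPMITOOL, BMC_HOST and BMC_USER are illustrative names, not MAAS settings.
IPMITOOL=${IPMITOOL:-ipmitool}
BMC_HOST=${BMC_HOST:-192.168.123.25}
BMC_USER=${BMC_USER:-Administrator}

# force_pxe ATTEMPTS DELAY
# Ask the BMC to PXE boot on the next boot, ATTEMPTS times, DELAY seconds
# apart. Prints how many of the calls succeeded.
force_pxe() {
    attempts=$1
    delay=$2
    ok=0
    i=0
    while [ "$i" -lt "$attempts" ]; do
        # -E makes ipmitool take the password from $IPMI_PASSWORD.
        if "$IPMITOOL" -I lanplus -H "$BMC_HOST" -U "$BMC_USER" -E \
                chassis bootdev pxe >/dev/null 2>&1; then
            ok=$((ok + 1))
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    echo "$ok"
}
```

After `export IPMI_PASSWORD=yourpassword`, calling `force_pxe 30 10` would cover roughly five minutes of rebooting at the same 10 second interval as the loop above.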

I never logged a bug; I just assumed my hardware was crap and old.

ggoldman · 16 August 2019 14:04

Hi, we started to notice this as well, but only for Ubuntu builds (RHEL was unaffected). We determined that quick erase only erases the first and last 1MB of the disk, but an Ubuntu prep partition is 8MB, so the installer still saw data. Why it suddenly showed up is somewhat of a mystery. This is supposed to be fixed in a newer version of curtin (it should be in MAAS 2.6.x, but performance issues had us back it out). Our current workaround (which 100% resolves the issue for now) is to put the following at the end of the early_commands in your preseeds, wiping 10M on each installed disk:
dd_00: ["sh", "-c", "for d in $(lsblk -dlpno NAME | grep -Ev '/loop|/sr|/scd') ; do echo Wiping $d ; dd if=/dev/zero of=$d bs=1M count=10 ; done"]
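To see why a first-and-last-1MB wipe can leave data behind, here is a minimal sketch that runs against a scratch image file rather than a real disk. The 16 MiB image size, the fill byte and the helper name `nonzero_in_mib` are all illustrative choices; it only mimics the behaviour described above and does not call MAAS or curtin.

```shell
#!/bin/sh
# Sketch: simulate a quick erase (first and last 1 MiB zeroed) on a scratch
# image and check whether data 2 MiB into the image survives.
set -eu

img=$(mktemp)
trap 'rm -f "$img"' EXIT

# 16 MiB image filled with 'X' bytes so leftover data is easy to spot.
dd if=/dev/zero bs=1M count=16 2>/dev/null | tr '\0' 'X' > "$img"

# Print the number of non-zero bytes in MiB number $1 of the image.
nonzero_in_mib() {
    dd if="$img" bs=1M skip="$1" count=1 2>/dev/null | tr -d '\0' | wc -c | tr -d ' '
}

# Quick erase as described in the post: zero the first and last 1 MiB only.
dd if=/dev/zero of="$img" bs=1M count=1 conv=notrunc 2>/dev/null
dd if=/dev/zero of="$img" bs=1M seek=15 count=1 conv=notrunc 2>/dev/null
after_quick=$(nonzero_in_mib 2)

# The workaround: zero the first 10 MiB.
dd if=/dev/zero of="$img" bs=1M count=10 conv=notrunc 2>/dev/null
after_wipe=$(nonzero_in_mib 2)

echo "after quick erase: $after_quick non-zero bytes in MiB 2"
echo "after 10M wipe:    $after_wipe non-zero bytes in MiB 2"
```

The third MiB stands in for data inside an 8MB prep partition: the quick erase never touches it, while the 10M wipe clears it.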

Geoff

snafuxnj · 16 August 2019 17:35

I’m gonna try this. I’ll let you know if this solves our problem.

snafuxnj · 16 August 2019 17:35

I can see that our hosts are actually PXE booting so I don’t think this is our problem, but I’ll double-check to make sure. Thank you for your reply!!

ggoldman · 16 August 2019 19:57

I don’t think the issue is PXE boot. It seems that the installer is detecting data where there should be none. I forgot to mention, and should point out, that most of our builds are to Power (IBM :-)) and not Intel, but it’s never a bad thing to do a better wipe of the disk before starting a build.

snafuxnj · 4 September 2019 21:55

The problem was actually that the DHCP server stopped running and would not come back up. As soon as we restarted the regiond controller and restarted DHCP, everything recovered.
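For anyone landing here with the same symptom, a quick check along these lines might show whether a service has stopped. This is a sketch only: the unit names `maas-regiond`, `maas-rackd` and `maas-dhcpd` are assumptions for a deb-based MAAS 2.x install and may differ on other setups, and `SYSTEMCTL`, `MAAS_UNITS` and `inactive_units` are names made up for this example.

```shell
#!/bin/sh
# Sketch: list MAAS-related systemd units that are not active.
# The unit names are assumptions; adjust them for your install.
SYSTEMCTL=${SYSTEMCTL:-systemctl}
MAAS_UNITS=${MAAS_UNITS:-"maas-regiond maas-rackd maas-dhcpd"}

# Print the name of every unit in MAAS_UNITS that is not active.
inactive_units() {
    for unit in $MAAS_UNITS; do
        if ! "$SYSTEMCTL" is-active --quiet "$unit"; then
            echo "$unit"
        fi
    done
}

# Possible follow-up, run by hand, if DHCP shows up as inactive:
#   sudo systemctl restart maas-regiond maas-dhcpd
```

Running `inactive_units` prints nothing when all listed units are active.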