Addressing duplicate UUIDs ('hardware_uuid') in MAAS 2.9

@ltrager (et al)

Following up on the Google-top-hit post for “MAAS duplicate UUID” (link): what is the likelihood of incorporating a change akin to @georcon’s contribution in that thread?

The change in that thread captures a relatively elegant solution to something that we’ve encountered a few times now - duplicate ‘hardware_uuid’ values being consumed by MAAS (at least in 2.7 and 2.8.1).

In our deployment, 6 physical nodes built from identical hardware all yield the same UUID: 03000200-0400-0500-0006-000700080009. We’ve had to resort to modifying the database directly as the most expedient workaround - but that is cumbersome when scaling/re-commissioning/experimenting (and there are more nodes coming).

The change incorporates an absolutely-unique value (the motherboard’s serial number) into the 36-character ‘hardware_uuid’, ensuring the stored value is always unique.
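For illustration, here is a minimal sketch of that idea - not the actual patch from the thread; the namespace choice and serial values below are made up - showing how a motherboard serial can be folded into a well-formed, deterministic 36-character UUID:

```python
import uuid

# Hypothetical namespace constant (illustrative only; the real patch may differ).
NAMESPACE = uuid.NAMESPACE_DNS

def derive_hardware_uuid(firmware_uuid: str, board_serial: str) -> str:
    """Deterministically fold the board serial into a valid 36-char UUID string."""
    return str(uuid.uuid5(NAMESPACE, f"{firmware_uuid}:{board_serial}"))

dup = "03000200-0400-0500-0006-000700080009"
node1 = derive_hardware_uuid(dup, "SERIAL-0001")   # made-up serials
node2 = derive_hardware_uuid(dup, "SERIAL-0002")
assert node1 != node2                              # the serial breaks the tie
assert len(node1) == 36                            # still a well-formed UUID string
assert node1 == derive_hardware_uuid(dup, "SERIAL-0001")  # stable across runs
```

Because uuid.uuid5 is deterministic, re-commissioning the same board yields the same value - which preserves the “this entry tracks this machine” property discussed below.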

Most importantly, that change seemingly does not jeopardize what we perceive as a core operating tenet of MAAS: “this Machine entry in MAAS means something - it represents what MAAS knows to be true: that this entry is the ‘tracked’ machine, hardware makeup and all”. This is the perceivable and tangible value of the UUID - a major step up from MAC address-only tracking (which is implied to be one of the potential fallback solutions when None’ing the ‘hardware_uuid’).

@ltrager - is there any chance, at all, that something like this change - an option, a toggle, something - could be incorporated into 2.9 before the beta ends around Sept 25, 2020?

Thank you for your time! We’re really liking MAAS!

@knaledge You have no idea how relieved I am that others have been having this issue!

I am starting to hate Dell with a passion. There seems to be no way to change the Service Tag from which the UUID is generated. I tried flashing the BIOS (and bricked 2 blades in the process), and had to unsolder the chip and reflash it from a healthy node. Unacceptable!

In any case, regarding this issue:

I understand that early MAAS versions used the MAC address of the first enumerated NIC. That seems unsatisfactory: NICs get replaced, and the order of enumeration may also change.

I believe the solution to this is two-fold (even though I agree with the MAAS developers that this issue should never, ever appear in proper production systems with proper hardware):

  1. The duplicate UUID warning/error should be shown in the UI. I had to spend hours going through logs and the db to find the issue.

  2. An option to select how the metal is ID’ed. On a UUID collision, MAAS could check other identifying factors (in my case it was the motherboard S/N - the chassis and server IDs were also the same). This would allow verification that the metal being enlisted is different/unique - and possibly also allow configuration of some solution.
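As a sketch of point 2 - field names here are illustrative, not MAAS’s actual data model - a collision check could fall back to a secondary identifier such as the motherboard S/N before treating two entries as the same metal:

```python
from dataclasses import dataclass

@dataclass
class Machine:
    hardware_uuid: str
    board_serial: str  # hypothetical field; in practice read from lshw/dmidecode

def same_metal(a: Machine, b: Machine) -> bool:
    """Treat two entries as the same machine only when the UUID match is
    confirmed by a secondary identifier."""
    if a.hardware_uuid != b.hardware_uuid:
        return False
    # UUIDs collide: confirm with the board serial before merging entries.
    return a.board_serial == b.board_serial

dup = "03000200-0400-0500-0006-000700080009"
assert not same_metal(Machine(dup, "SN-1"), Machine(dup, "SN-2"))  # distinct metal
assert same_metal(Machine(dup, "SN-1"), Machine(dup, "SN-1"))      # genuinely the same
```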

In the meantime, my solution was to install MAAS via snap, as per usual.

(As you can’t modify files within a snap mount)

Before doing maas init,

Copy the hooks.py file from /snap/maas/current/lib/python3.6/metadataserver/builtin-scripts/hooks.py to another folder (e.g. /usr/src/maas_custom/hooks.py)

Apply the patch (or some solution for UUIDs) to hooks.py.

Bind-mount the file over the original hooks.py (this allows the custom file to be used instead of the original). I tried to add this to fstab, with no luck - so I have the mount happening on boot.

Best of luck!


@georcon - You have our most heartfelt gratitude, by the way. We too were like, “Oookaaay… so ‘commissioning failed…’ but only on ‘lshw’. Why?” Like the experience you had, ours was a bit arduous as well. Despite having found your (initial) thread, we went through the following before “figuring it out”:

  • Re-commission node (hey, why not?)
  • Delete then re-commission the node
  • More searching via Google, calling it a day and eventually resigning to sleep
  • Coming at it fresh, we then removed MAAS (2.7), re-built the MAAS-hosting VM, then deployed 2.8.1
  • Commission - no go
  • Then we thought, “Maybe it’s the physical hardware composition? Let’s move the troubleshooting-GPU to another node”
  • Same result (no-go on commissioning)

We then finally examined the “UUID” when parsing the logs - and there it was. On MAAS 2.7, we captured the ‘lshw’ commissioning log for “perceived-as-bad Node1” and then compared it line by line to “perceived-as-bad, somehow, Node2” in MAAS 2.8.1.

Duplicate UUID. Your initial thread made sense, doubly-so for your follow-up thread - and here we are.

In our case, we’re running MAAS 2.8.1 on a VM (Ubuntu Server 20.04), and we have 6x physical nodes consisting of:

  • ASRock B450M (mini-ITX; FW: P3.30)
  • AMD Ryzen 5 1600 (AM4; 6c/12t)
  • G.Skill DDR4 32GB (2x16GB)
  • ADATA SX8200 Pro (NVMe; 512GB)
  • Mean Well 200W PSU
  • Custom power controller (160W DC-DC)

Some (if not all) of that hardware, in combination, results in an identical UUID of “03000200-0400-0500-0006-000700080009” for each node.

We realize our cluster is certainly not the Enterprise-class hardware that MAAS is capable of handling, though it’s a testament to the endeavors of the MAAS-related folks that it has gotten us this far.

And that’s what makes it all the more desirable to see a solution become first-class within MAAS itself: from consumer-grade to production-grade, MAAS is more than capable. If these duplicate UUIDs could be remedied with some help from MAAS itself, we’d be golden (and likely many others too, including those prototyping with Enterprise-class hardware).

@georcon - is it possible to employ your patch without needing to start from scratch on MAAS (init, etc.)? Snap is a bit new to us, so we’re unsure how to “rebuild” the Snap (as suggested in that follow-up thread of yours). Your suggestion here (wrt Snap) seems viable in lieu of an official patch, though avoiding a re-init would be ideal.

If not, so be it! :slight_smile: Your contributions are appreciated, and it’s helped us a ton. So thank you, either way.

@knaledge

My apologies, I thought I had responded - looks like I forgot to hit reply.

Yes, there is a way to ‘apply’ this patch without rebuilding the snap (or installing via the deprecated Debian packages).

The general issue with snap packages is that the files are mounted read-only, which prevents modifying the software in place.

You can, however, have a file ‘shadow’ the original on the filesystem. I don’t know how safe or stable this is, so your mileage may vary.

To use a different hooks.py (or to modify any other file, for that matter):

  1. Copy hooks.py from /snap/maas/current/lib/python3.6/metadataserver/builtin-scripts/hooks.py to a writable folder. I copied it to /usr/src/maas-hooks/:

sudo mkdir -p /usr/src/maas-hooks/
sudo cp /snap/maas/current/lib/python3.6/metadataserver/builtin-scripts/hooks.py /usr/src/maas-hooks/

  2. Modify hooks.py as you see fit.

  3. Bind-mount the modified hooks.py over the location of the original:

sudo mount --bind /usr/src/maas-hooks/hooks.py /snap/maas/current/lib/python3.6/metadataserver/builtin-scripts/hooks.py

NOTE: This mount will not persist across reboots. Either add the mount command to rc.local to have it execute on boot (not tested), or remount and restart MAAS after every reboot.

  4. Restart MAAS for good measure:
    sudo snap restart maas.supervisor
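Regarding the persistence note above: one alternative to rc.local is a small systemd oneshot service that re-applies the bind mount at boot. This is an untested sketch - the unit name and the ordering on snapd are my own guesses, not anything official - so verify it on your system:

```ini
# /etc/systemd/system/maas-hooks-bind.service  (hypothetical unit; untested)
[Unit]
Description=Bind-mount patched hooks.py over the read-only MAAS snap copy
# Ordering after snapd is a guess at "the snap is mounted by now".
After=snapd.service

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/usr/bin/mount --bind /usr/src/maas-hooks/hooks.py /snap/maas/current/lib/python3.6/metadataserver/builtin-scripts/hooks.py

[Install]
WantedBy=multi-user.target
```

Enable it with sudo systemctl enable maas-hooks-bind.service, then restart MAAS afterwards as described above.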

Turn on your nodes so they can PXE boot and enlist.

You can monitor the process via the logs.
/var/snap/maas/common/log/{regiond,rackd,maas}.log

While debugging, I cleared the logs beforehand:
echo "" | sudo tee /var/snap/maas/common/log/regiond.log (as an example)

Then had separate terminals showing (and following) the logs

tail -f /var/snap/maas/common/log/regiond.log (as an example)

If you used my version, I prepended all UUID-related logging with [UUID], so you can filter on that:

tail -f /var/snap/maas/common/log/regiond.log | grep -F "[UUID]" (as an example; -F is needed so grep treats the brackets literally rather than as a character class).

I hope this helps! MAAS is great once it works. I deployed Kubernetes with Juju in less than 5 minutes!


While we await an official solution, this is so helpful @georcon! Thanks again :slight_smile: I’ll report back once we’ve had a go at it.

It’s such an interesting hurdle, too, these duplicate UUIDs. In the vein of “MAAS is great” - we agree. It’s just a bit ironic that the very thing MAAS helps with (especially in our case) - rapid experimentation and eventual deployment - is partially encumbered/impeded by this dupe UUID handling.

For example, our latest endeavor the other night was to deploy bare-metal k8s on the 6-node set I described earlier. Well, one very unfortunate-and-broadcasted keystroke saw ‘ufw’ enabled - without also allowing ssh/22 traffic through it. All the effort to mod the UUIDs in the database serially aaaaaaaand-- it’s gone. “Rescue Mode” didn’t seem to let us modify the actual data on the remote host, despite allowing ssh - though we absolutely concede we may not yet grok “Rescue Mode” - so we’re taking the L, and we’ll just have to be extra careful while scaffolding the provisioning script next go-round, heh.

Having this “patch” while an official solution is pending certainly helps. It makes those mistakes and that experimentation way less painful :slight_smile: Again, thank you!

Looking forward to response from MAAS folks (@billwear, @ltrager, et al)! I commit to beta time and pizzas should we find a way to rally together and get this addressed in the code permanently.


@georcon , @ltrager, @billwear - Just a quick check-in to report success!

The patch referenced/outlined here (and described in detail in the original thread) absolutely works for us :slight_smile: We went from duplicate UUIDs on enlistment to fully-unique UUIDs, no hassle.

I sincerely hope this solution - or something like it - makes its way into MAAS officially. I’ll follow up with more details here this coming week, since the original thread is closed.

Seriously - there’s food and donations and gratitude coming your way when this makes it into MAAS :wink:


Thanks for the patch and for figuring out the root cause! I think this is a firmware bug that Dell should fix; however, I’ve come up with a workaround in the meantime.

A little background on how MAAS uses the UUID. Some firmware and boot loaders identify themselves using the UUID. IBM Z series LPARs identify themselves only by UUID and never provide a MAC address. PXELinux, which is used for legacy BIOS booting, tries the UUID first and then falls back to the MAC address. GRUB UEFI currently only tries the MAC address. Thus the UUID MAAS stores must be the same UUID the system will provide during boot.
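To illustrate why the stored UUID must match what the firmware presents at boot, here is a simplified model - not MAAS or PXELinux source - of PXELinux's documented config-file search order: client UUID first, then the MAC with its "01-" ARP-type prefix, then progressively shorter hex-IP prefixes, then "default":

```python
from typing import List, Optional

def pxelinux_config_candidates(uuid: Optional[str], mac: Optional[str], ip: str) -> List[str]:
    """Filenames a PXELinux client tries under pxelinux.cfg/, in order."""
    names = []
    if uuid:
        names.append(uuid.lower())                           # UUID first, when supplied
    if mac:
        names.append("01-" + mac.lower().replace(":", "-"))  # then ARP-type-prefixed MAC
    hexip = "".join(f"{int(octet):02X}" for octet in ip.split("."))
    for i in range(len(hexip), 0, -1):                       # then shrinking hex-IP prefixes
        names.append(hexip[:i])
    names.append("default")                                  # final fallback
    return names

cands = pxelinux_config_candidates(
    "03000200-0400-0500-0006-000700080009", "52:54:00:12:34:56", "10.0.0.68")
assert cands[0] == "03000200-0400-0500-0006-000700080009"
assert cands[1] == "01-52-54-00-12-34-56"
assert cands[2] == "0A000044"        # 10.0.0.68 as uppercase hex
assert cands[-1] == "default"
```

If MAAS stored a regenerated UUID, the first candidate the client asks for would never match, so booting would fall through to the MAC entry - or fail entirely on platforms that only present a UUID.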

MAAS still supports booting using MAC addresses only, so if the UUID isn’t really unique there isn’t any value in storing a random one. I created a patch which detects this case and removes the UUID from machines that duplicate it.

There is an edge case which may be triggered here. If the machines are set to boot using a legacy BIOS, then because the UUIDs are duplicated, MAAS may think the machines are the same. Thus if you commission and deploy one machine and, after that is done, try to commission and deploy another, MAAS will think the second machine is the first and try to local boot. You should be able to avoid this by sticking with UEFI netbooting.

That’s great to hear, @ltrager! One point of clarity: our experience with MAAS, on very not-Dell hardware, was exactly as described by @georcon (minus all the “Service Tag” Dell-ness).

We have 6x ASRock B450M mini-ITX motherboards that report the same UUID during the “Commissioning” phase.

Interestingly - and perhaps an ironic bit of self-induced cargo-culting - we only figured out what was causing our issue with “MAAS + PXE + UEFI” after we had already encountered the issue @georcon outlined and provided a “patch” for.

In other words, by the time we finally got MAAS to PXE our UEFI nodes without issue, we had already considered applying @georcon’s patch a necessity - as our only experience with MAAS, up to that point, had included encountering duped UUIDs without it.

Now, in hindsight, we never tried MAAS without @georcon’s “patch” - even though we’ve definitely got our nodes PXE’ing UEFI now.

Does this help add some clarity? We could try removing @georcon’s patch and re-provisioning a node to see whether it gets the “dupe” UUID (what we know as the UUID-that-can-only-be-used-once).

As a follow-up:

Without @georcon’s “patch”, the UUIDs for our nodes are duplicated despite having been PXE booted in UEFI.

Commissioning a node results in maasserver_node.hardware_uuid having the value “03000200-0400-0500-0006-000700080009”.

dnsmasq.conf (general content)

...
...
dhcp-host=52:54:00:GG:QQ:XD,set:52:54:00:GG:QQ:XD,10.0.1.43
dhcp-name-match=set:wpad-ignore,wpad
dhcp-ignore-names=tag:wpad-ignore
dhcp-script=/sbin/dhcpc_lease
script-arp
dhcp-boot=pxelinux.0,,10.0.1.43
dhcp-match=set:efi-x86_64,option:client-arch,7
dhcp-boot=tag:efi-x86_64,bootx64.efi,,10.0.1.43

Commissioning (/var/snap/maas/common/log/rackd.log)

2020-09-01 05:36:11 provisioningserver.rackdservices.tftp: [info] bootx64.efi requested by 10.0.0.68
2020-09-01 05:36:11 provisioningserver.rackdservices.tftp: [info] bootx64.efi requested by 10.0.0.68
2020-09-01 05:36:14 provisioningserver.rackdservices.tftp: [info] grubx64.efi requested by 10.0.0.68
2020-09-01 05:36:24 provisioningserver.rackdservices.tftp: [info] /grub/x86_64-efi/command.lst requested by 10.0.0.68
2020-09-01 05:36:24 provisioningserver.rackdservices.tftp: [info] /grub/x86_64-efi/fs.lst requested by 10.0.0.68
2020-09-01 05:36:24 provisioningserver.rackdservices.tftp: [info] /grub/x86_64-efi/crypto.lst requested by 10.0.0.68
2020-09-01 05:36:24 provisioningserver.rackdservices.tftp: [info] /grub/x86_64-efi/terminal.lst requested by 10.0.0.68
2020-09-01 05:36:24 provisioningserver.rackdservices.tftp: [info] /grub/grub.cfg requested by 10.0.0.68
2020-09-01 05:36:24 provisioningserver.rackdservices.tftp: [info] /grub/grub.cfg-70:85:c2:LO:LN:O1 requested by 10.0.0.68
2020-09-01 05:36:24 provisioningserver.rackdservices.tftp: [info] /grub/grub.cfg-default-amd64 requested by 10.0.0.68
2020-09-01 05:36:24 provisioningserver.rackdservices.http: [info] /images/ubuntu/amd64/ga-20.04/focal/daily/boot-kernel requested by 10.0.0.68
2020-09-01 05:36:27 provisioningserver.rackdservices.http: [info] /images/ubuntu/amd64/ga-20.04/focal/daily/boot-initrd requested by 10.0.0.68
2020-09-01 05:36:55 provisioningserver.rackdservices.http: [info] /images/ubuntu/amd64/ga-20.04/focal/daily/squashfs requested by 10.0.0.68

Is commissioning/deploying working on all machines? The way my patch works is by always collecting the given UUID but removing it when a duplicate is found. It assumes that any vendor that has duplicated UUIDs won’t use a UUID to identify itself during booting. So if you commission 3 machines serially the first one will have the UUID stored, the second one will detect the conflict and remove the UUID from the first and not store it on the second, the third machine won’t see a conflict and keep it stored. Keeping the duplicated UUID shouldn’t be a problem on UEFI as UEFI netboot doesn’t use it.
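A toy model of the serial-commissioning behavior described above - my own sketch, not @ltrager’s actual code: the UUID is always collected, a conflict strips it from both the earlier and the new machine, and the next duplicate therefore sees no conflict and keeps it stored:

```python
from typing import Dict, Optional

def commission(db: Dict[str, Optional[str]], machine_id: str, reported_uuid: str) -> None:
    """db maps machine_id -> stored hardware_uuid (None once stripped)."""
    conflicts = [m for m, u in db.items() if u == reported_uuid]
    if conflicts:
        for m in conflicts:       # strip the UUID from the earlier machine(s)
            db[m] = None
        db[machine_id] = None     # and don't store it on the new machine
    else:
        db[machine_id] = reported_uuid

db: Dict[str, Optional[str]] = {}
dup = "03000200-0400-0500-0006-000700080009"
for node in ("node1", "node2", "node3"):
    commission(db, node, dup)
# node1 stored it, node2 triggered the strip, node3 saw no conflict:
assert db == {"node1": None, "node2": None, "node3": dup}
```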

The reason MAAS can’t use @georcon’s patch is that some systems identify themselves during netboot using the UUID. If the UUID is regenerated it won’t match which will cause net booting to fail.


Thanks for the insight @ltrager! It’s been fun engaging with the MAAS folks (and the community) this past week.

I’ll follow up when we have the opportunity to re-commission the cluster - with your patch, with no patch, and with @georcon’s patch - to demo the results.


Thanks @knaledge for this info, it helped me a LOT.

Most vendors don’t use this UUID anymore and leave it at the default :frowning:

Also, the system does not log this in the GUI, so you search for the problem forever. It would be great if the server-side logs were merged into the GUI-visible logs of a node.

Looks like the issue is addressed in 2.9-rc1.