Expected behaviour of booting machines when MAAS is down

szeestraten · 12 November 2021 10:18

Hey folks, another simple question for you this Friday.

I was wondering about the expected behaviour when booting a MAAS deployed machine if MAAS is down (stopped service or not reachable).

I tested with a machine which had the following boot order when deployed by MAAS 2.9.2 (not sure if this is is by MAAS, curtin or something else changes it):

PXE
HDD
Others

Observing the boot sequence:

Machine tries to PXE boot
PXE boot fails
Machine tries to HDD boot
HDD boot starts
Machine brings up network
Machine is then unavailable (connection refused for SSH for example) for at least 10 mins while cloud-init init-network runs and fails to post events to MAAS (see logs cloud-init-output.log below)
cloud-init init-network finally finishes
Machine becomes available

So the machine manages to successfully boot which is great.
But I was wondering if there’s a way to shorten this period of unavailability due to cloud-init?

2021-11-12 09:57:17,771 - handlers.py[WARNING]: failed posting event: finish: init-local: SUCCESS: searching for local datasources
Cloud-init v. 21.3-1-g6803368d-0ubuntu1~20.04.4 running 'init' at Fri, 12 Nov 2021 09:57:23 +0000. Up 14.48 seconds.
ci-info: +++++++++++++++++++++++++++++++++++++++++Net device info+++++++++++++++++++++++++++++++++++++++++
ci-info: +-----------+-------+------------------------------+---------------+--------+-------------------+
ci-info: |   Device  |   Up  |           Address            |      Mask     | Scope  |     Hw-Address    |
ci-info: +-----------+-------+------------------------------+---------------+--------+-------------------+
ci-info: |    eno1   |  True |         10.42.198.18         | 255.255.254.0 | global | 4c:d9:8f:a4:6c:1f |
ci-info: |    eno1   |  True | fe80::4ed9:8fff:fea4:6c1f/64 |       .       |  link  | 4c:d9:8f:a4:6c:1f |
ci-info: |    eno2   | False |              .               |       .       |   .    | 4c:d9:8f:a4:6c:20 |
ci-info: | ens2f0np0 |  True |         10.42.200.2          | 255.255.254.0 | global | b8:59:9f:cf:f5:3a |
ci-info: | ens2f0np0 |  True | fe80::ba59:9fff:fecf:f53a/64 |       .       |  link  | b8:59:9f:cf:f5:3a |
ci-info: | ens2f1np1 |  True | fe80::ba59:9fff:fecf:f53b/64 |       .       |  link  | b8:59:9f:cf:f5:3b |
ci-info: |     lo    |  True |          127.0.0.1           |   255.0.0.0   |  host  |         .         |
ci-info: |     lo    |  True |           ::1/128            |       .       |  host  |         .         |
ci-info: +-----------+-------+------------------------------+---------------+--------+-------------------+
ci-info: +++++++++++++++++++++++++++++Route IPv4 info+++++++++++++++++++++++++++++
ci-info: +-------+-------------+-------------+---------------+-----------+-------+
ci-info: | Route | Destination |   Gateway   |    Genmask    | Interface | Flags |
ci-info: +-------+-------------+-------------+---------------+-----------+-------+
ci-info: |   0   |   0.0.0.0   | 10.42.198.1 |    0.0.0.0    |    eno1   |   UG  |
ci-info: |   1   | 10.42.198.0 |   0.0.0.0   | 255.255.254.0 |    eno1   |   U   |
ci-info: |   2   | 10.42.200.0 |   0.0.0.0   | 255.255.254.0 | ens2f0np0 |   U   |
ci-info: +-------+-------------+-------------+---------------+-----------+-------+
ci-info: +++++++++++++++++++Route IPv6 info+++++++++++++++++++
ci-info: +-------+-------------+---------+-----------+-------+
ci-info: | Route | Destination | Gateway | Interface | Flags |
ci-info: +-------+-------------+---------+-----------+-------+
ci-info: |   1   |  fe80::/64  |    ::   | ens2f1np1 |   U   |
ci-info: |   2   |  fe80::/64  |    ::   | ens2f0np0 |   U   |
ci-info: |   3   |  fe80::/64  |    ::   |    eno1   |   U   |
ci-info: |   5   |    local    |    ::   |    eno1   |   U   |
ci-info: |   6   |    local    |    ::   | ens2f0np0 |   U   |
ci-info: |   7   |    local    |    ::   | ens2f1np1 |   U   |
ci-info: |   8   |  multicast  |    ::   | ens2f1np1 |   U   |
ci-info: |   9   |  multicast  |    ::   | ens2f0np0 |   U   |
ci-info: |   10  |  multicast  |    ::   |    eno1   |   U   |
ci-info: +-------+-------------+---------+-----------+-------+
2021-11-12 09:57:47,873 - handlers.py[WARNING]: failed posting event: start: init-network/check-cache: attempting to read from cache [trust]
2021-11-12 09:58:03,349 - handlers.py[WARNING]: failed posting event: finish: init-network/check-cache: SUCCESS: restored from cache with run check: DataSourceMAAS [http://10-42-198-0--23.maas-internal:5248/MAAS/metadata/]
2021-11-12 09:58:28,499 - handlers.py[WARNING]: failed posting event: start: init-network/setup-datasource: setting up datasource
2021-11-12 09:58:48,524 - handlers.py[WARNING]: failed posting event: finish: init-network/setup-datasource: SUCCESS: setting up datasource
2021-11-12 09:59:13,568 - handlers.py[WARNING]: failed posting event: start: init-network/consume-user-data: reading and applying user-data
2021-11-12 09:59:33,596 - handlers.py[WARNING]: failed posting event: finish: init-network/consume-user-data: SUCCESS: reading and applying user-data
2021-11-12 10:00:03,629 - handlers.py[WARNING]: failed posting event: start: init-network/consume-vendor-data: reading and applying vendor-data
2021-11-12 10:00:23,666 - handlers.py[WARNING]: failed posting event: finish: init-network/consume-vendor-data: SUCCESS: reading and applying vendor-data
2021-11-12 10:00:43,690 - handlers.py[WARNING]: failed posting event: start: init-network/consume-vendor-data2: reading and applying vendor-data2
2021-11-12 10:01:03,714 - handlers.py[WARNING]: failed posting event: finish: init-network/consume-vendor-data2: SUCCESS: reading and applying vendor-data2
2021-11-12 10:01:23,813 - handlers.py[WARNING]: failed posting event: start: init-network/activate-datasource: activating datasource
2021-11-12 10:01:43,842 - handlers.py[WARNING]: failed posting event: finish: init-network/activate-datasource: SUCCESS: activating datasource
2021-11-12 10:02:03,896 - handlers.py[WARNING]: failed posting event: start: init-network/config-migrator: running config-migrator with frequency always
2021-11-12 10:02:23,922 - handlers.py[WARNING]: failed posting event: finish: init-network/config-migrator: SUCCESS: config-migrator ran successfully
2021-11-12 10:02:43,948 - handlers.py[WARNING]: failed posting event: start: init-network/config-seed_random: running config-seed_random with frequency once-per-instance
2021-11-12 10:03:03,973 - handlers.py[WARNING]: failed posting event: finish: init-network/config-seed_random: SUCCESS: config-seed_random previously ran
2021-11-12 10:03:23,998 - handlers.py[WARNING]: failed posting event: start: init-network/config-bootcmd: running config-bootcmd with frequency always
2021-11-12 10:03:44,022 - handlers.py[WARNING]: failed posting event: finish: init-network/config-bootcmd: SUCCESS: config-bootcmd ran successfully
2021-11-12 10:04:04,048 - handlers.py[WARNING]: failed posting event: start: init-network/config-write-files: running config-write-files with frequency once-per-instance
2021-11-12 10:04:24,073 - handlers.py[WARNING]: failed posting event: finish: init-network/config-write-files: SUCCESS: config-write-files previously ran
2021-11-12 10:04:44,098 - handlers.py[WARNING]: failed posting event: start: init-network/config-growpart: running config-growpart with frequency always
2021-11-12 10:05:04,176 - handlers.py[WARNING]: failed posting event: finish: init-network/config-growpart: SUCCESS: config-growpart ran successfully
2021-11-12 10:05:24,202 - handlers.py[WARNING]: failed posting event: start: init-network/config-resizefs: running config-resizefs with frequency always
2021-11-12 10:05:44,284 - handlers.py[WARNING]: failed posting event: finish: init-network/config-resizefs: SUCCESS: config-resizefs ran successfully
2021-11-12 10:06:04,310 - handlers.py[WARNING]: failed posting event: start: init-network/config-disk_setup: running config-disk_setup with frequency once-per-instance
2021-11-12 10:06:24,335 - handlers.py[WARNING]: failed posting event: finish: init-network/config-disk_setup: SUCCESS: config-disk_setup previously ran
2021-11-12 10:06:44,360 - handlers.py[WARNING]: failed posting event: start: init-network/config-mounts: running config-mounts with frequency once-per-instance
2021-11-12 10:07:04,385 - handlers.py[WARNING]: failed posting event: finish: init-network/config-mounts: SUCCESS: config-mounts previously ran
2021-11-12 10:07:24,410 - handlers.py[WARNING]: failed posting event: start: init-network/config-set_hostname: running config-set_hostname with frequency once-per-instance
2021-11-12 10:07:44,435 - handlers.py[WARNING]: failed posting event: finish: init-network/config-set_hostname: SUCCESS: config-set_hostname previously ran
2021-11-12 10:08:04,460 - handlers.py[WARNING]: failed posting event: start: init-network/config-update_hostname: running config-update_hostname with frequency always
2021-11-12 10:08:24,487 - handlers.py[WARNING]: failed posting event: finish: init-network/config-update_hostname: SUCCESS: config-update_hostname ran successfully
2021-11-12 10:08:44,512 - handlers.py[WARNING]: failed posting event: start: init-network/config-update_etc_hosts: running config-update_etc_hosts with frequency always
2021-11-12 10:09:04,547 - handlers.py[WARNING]: failed posting event: finish: init-network/config-update_etc_hosts: SUCCESS: config-update_etc_hosts ran successfully
2021-11-12 10:09:24,572 - handlers.py[WARNING]: failed posting event: start: init-network/config-ca-certs: running config-ca-certs with frequency once-per-instance
2021-11-12 10:09:44,597 - handlers.py[WARNING]: failed posting event: finish: init-network/config-ca-certs: SUCCESS: config-ca-certs previously ran
2021-11-12 10:10:04,622 - handlers.py[WARNING]: failed posting event: start: init-network/config-rsyslog: running config-rsyslog with frequency once-per-instance
2021-11-12 10:10:24,647 - handlers.py[WARNING]: failed posting event: finish: init-network/config-rsyslog: SUCCESS: config-rsyslog previously ran
2021-11-12 10:10:44,672 - handlers.py[WARNING]: failed posting event: start: init-network/config-users-groups: running config-users-groups with frequency once-per-instance
2021-11-12 10:11:04,696 - handlers.py[WARNING]: failed posting event: finish: init-network/config-users-groups: SUCCESS: config-users-groups previously ran
2021-11-12 10:11:24,721 - handlers.py[WARNING]: failed posting event: start: init-network/config-ssh: running config-ssh with frequency once-per-instance
2021-11-12 10:11:44,746 - handlers.py[WARNING]: failed posting event: finish: init-network/config-ssh: SUCCESS: config-ssh previously ran
2021-11-12 10:12:04,772 - handlers.py[WARNING]: failed posting event: finish: init-network: SUCCESS: searching for network datasources
Cloud-init v. 21.3-1-g6803368d-0ubuntu1~20.04.4 running 'modules:config' at Fri, 12 Nov 2021 10:12:37 +0000. Up 928.13 seconds.

anton5mith · 12 November 2021 19:39

good question. will come back to you hopefully soon. Ideally, servers should be able to start and run normally in the event MAAS isn’t available - but there might be reasons why cloud-init takes a long time to time out.

afreiberger · 17 November 2021 14:22

The cloud-init config file example notes that you can add timeout variables to the datasource configuration.

# Example datasource config
# datasource:
#    Ec2:
#      metadata_urls: [ 'blah.com' ]
#      timeout: 5 # (defaults to 50 seconds)
#      max_wait: 10 # (defaults to 120 seconds)

There’s a file that MAAS lands on systems it builds at /etc/cloud/cloud.cfg.d/90_dpkg_maas.cfg

# written by cloud-init debian package per preseed entries
# cloud-init/{maas-metadata-url,/maas-metadata-credentials}
datasource:
  MAAS:
    consumer_key: **KEY**
    metadata_url: http://**MAAS_URL**:5248/MAAS/metadata/
    token_key: **TOKEN**
    token_secret: **SECRET**

I imagine there may be a way to customize the preseed data to add in timeout and max_wait in this section of the cloud-init config to adjust the standard timeout values. The default timeout is 50 seconds, and the max_wait is 120 seconds in the code.

I am seeing that your logs show that MAAS provider in cloudinit appears to try and times out after 20 seconds each time it starts and ends running a module. I’m wondering if there’s somthing that can be improved in the cloud-init MAAS datasource code to change the notification timeouts to MAAS API for start/end of cloud-init modules if it detects that it’s using cached metadata.

Ultimately, it appears you’re having a 2 minute timeout trying to read the metadata URL, it falls back to cached metadata, then you have 44 different 20 second timeouts for publish_event notifications, which is resulting in the ~10 minute boot-time delay.

What I’m hoping the MAAS team can determine is where publish_event or report_event is called from, and how to add MAAS to the list of excluded handlers if it’s determined to be using cached metadata.

dandruczyk · 19 November 2021 14:17

My personal preference is that once the first boot is able to phone home to MaaS to get its initial configuration that the node should NO LONGER HAVE ANY DEPENDENCY ON MAAS and thus be independent if MaaS lives or dies. I supposed I could write a script to rip out cloudinit after first boot to accomplish that but I prefer not to.

bjornt · 22 November 2021 11:00

Yeah, I agree that after the first boot, the machine should be able to boot alright even when MAAS is down.

I’m actually not sure why we configure cloud-init for every boot, but I suspect it’s because we rely on cloud-init to post to MAAS to indicate that the first boot went ok and the machine is deployed.

I think we can simply put a script in place that does that, though, and not rely on cloud-init to do the reporting.

anton5mith · 23 November 2021 08:30

Could you create a bug for this please and we can look into it.

szeestraten · 23 November 2021 08:41

@anton5mith, who do you mean should create the bug? I think @afreiberger and @bjornt have more details/insight than me. Just checking as I haven’t had my coffee yet.

afreiberger · 23 November 2021 17:59

@szeestraten, I think it would be beneficial for this bug to be filed with your user story as stated in the original post. You can reference this discourse topic in the bug description as a reference to the underlying process that results in your experience and the discussion of how to address the root cause.

szeestraten · 23 November 2021 19:53

@afreiberger OK. I created https://bugs.launchpad.net/maas/+bug/1951995

Please edit/change as you see fit.

anton5mith · 24 November 2021 10:47

Thanks. exactly, well put @afreiberger. @szeestraten - I’ll discuss internally with the team on this one, because it really doesn’t sound right with this behaviour.

cmills · 10 May 2022 16:40

Has there been a fix for this topic?

billwear · 16 May 2022 20:21

judging by the bug report, this hasn’t been taken up again. care to add some weight to the bug? actually, if you are experiencing this, can you use this page to give us better info, like log files, etc? that might help us find a handle.