400 Bad Request errors during deploy starting yesterday

Over the past 24 hours or so, I’ve started seeing a lot of deploy failures. I’m
using the maas Python API, and the failures do not appear to be specific to any
particular machine.

CLIENT

The client script uses python3-libmaas version 0.6.4-0ubuntu1 and is essentially this:

self.__client = maas.client.connect(MAAS_BASE_URI, apikey=self.config['apikey'])
machines = self.__client.machines.list()
# Some filtering of machines list occurs, then:
for machine in machines:
    logger.info('Asking MAAS to deploy %s', machine.fqdn)
    machine.deploy()

This tool has been working pretty reliably for the past year or two, and has not changed in months.

The failures look like this on the client side:

INFO     egd: Asking MAAS to deploy vm-008-4269.maas.eng.exagrid.com
POST https://maas.eng.exagrid.com/MAAS/api/2.0/machines/hsnt44/?op=deploy -> HTTP 400 Bad Request (Failed to render preseed: Unable to find MAAS ser…)

SERVER

# snap list maas
Name  Version                  Rev    Tracking    Publisher   Notes
maas  3.1.0-10901-g.f1f8f1505  19835  3.1/stable  canonical✓  -

On the server, if I do tail -F /var/snap/maas/common/log/*.log, I see stuff like this:

==> /var/snap/maas/common/log/maas.log <==
2022-06-14T16:35:46.247632+00:00 eng1713 maas.api: [info] Request from user gitlab-runner to acquire machine: vm-008-4269.maas.eng.exagrid.com (hsnt44)
2022-06-14T16:35:46.282865+00:00 eng1713 maas.node: [info] vm-008-4269: Status transition from READY to ALLOCATED
2022-06-14T16:35:46.295518+00:00 eng1713 maas.node: [info] vm-008-4269: allocated to user gitlab-runner

==> /var/snap/maas/common/log/regiond.log <==
2022-06-14 16:35:46 maasserver.preseed: [warn] WARNING: '/snap/maas/19835/etc/maas/preseeds/curtin_userdata' contains deprecated preseed variables. Please remove: main_archive_directory, ports_archive_directory
2022-06-14 16:35:47 regiond: [info] 127.0.0.1 POST /MAAS/api/2.0/machines/hsnt44/?op=deploy HTTP/1.1 --> 400 BAD_REQUEST (referrer: -; agent: Python/3.8 aiohttp/3.6.2)

Can I turn up tracing somewhere to get more details about why maas thinks these are bad requests?

did you also scan rackd.log with that tail? what did you get there?

Yes, rackd.log was included in that tail. No messages near the time of the bad request message. Here’s what I see in rackd.log for the few minutes before and after:

2022-06-14 16:29:49 provisioningserver.rackdservices.dhcp_probe_service: [info] Probe for external DHCP servers started on interfaces: bond0.
2022-06-14 16:29:59 provisioningserver.rackdservices.dhcp_probe_service: [info] External DHCP probe complete.
2022-06-14 16:39:49 provisioningserver.rackdservices.dhcp_probe_service: [info] Probe for external DHCP servers started on interfaces: bond0.
2022-06-14 16:39:59 provisioningserver.rackdservices.dhcp_probe_service: [info] External DHCP probe complete.
2022-06-14 16:40:28 provisioningserver.rpc.clusterservice: [info] Failure on ping dropping connection to event-loop: eng1713:pid=1490880
2022-06-14 16:40:28 ClusterClient,client: [info] ClusterClient connection lost (HOST:IPv6Address(type='TCP', host='2602:fbab:e0:3004::4', port=34024, flowInfo=0, scopeID=0) PEER:IPv6Address(type='TCP', host='2602:fbab:e0:3004::4', port=5252, flowInfo=0, s
copeID=0))
2022-06-14 16:40:28 provisioningserver.rackdservices.service_monitor_service: [critical] Failed to monitor services and update region.
        Traceback (most recent call last):
        --- <exception caught here> ---
          File "/snap/maas/19835/lib/python3.8/site-packages/provisioningserver/rackdservices/service_monitor_service.py", line 92, in _updateRegion
            yield client(
          File "/snap/maas/19835/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 654, in _runCallbacks
            current.result = callback(current.result, *args, **kw)
          File "/snap/maas/19835/usr/lib/python3/dist-packages/twisted/protocols/amp.py", line 1994, in _massageError
            error.trap(RemoteAmpError)
          File "/snap/maas/19835/usr/lib/python3/dist-packages/twisted/python/failure.py", line 439, in trap
            self.raiseException()
          File "/snap/maas/19835/usr/lib/python3/dist-packages/twisted/python/failure.py", line 467, in raiseException
            raise self.value.with_traceback(self.tb)
        twisted.internet.error.ConnectionDone: Connection was closed cleanly.
        
2022-06-14 16:40:29 provisioningserver.rpc.clusterservice: [info] Making connections to event-loops: eng1713:pid=1490880
2022-06-14 16:40:29 Uninitialized: [info] ClusterClient connection established (HOST:IPv6Address(type='TCP', host='2602:fbab:e0:3004::4', port=44988, flowInfo=0, scopeID=0) PEER:IPv6Address(type='TCP', host='2602:fbab:e0:3004::4', port=5252, flowInfo=0, s
copeID=0))
2022-06-14 16:40:34 provisioningserver.rpc.clusterservice: [info] Fully connected to all 4 event-loops on all 1 region controllers (eng1713).
2022-06-14 16:41:07 provisioningserver.rpc.clusterservice: [info] Event-loop 'eng1713:pid=1490880' authenticated.
2022-06-14 16:41:08 provisioningserver.rpc.clusterservice: [info] Rack controller 'dygakk' registered (via eng1713:pid=1490880) with MAAS version 3.1.0-10901-g.f1f8f1505.

That stack trace seems a little odd, but seem likely unrelated. Thoughts?

not off the top of my head. i’m going to ask someone in CET in the morning, they might have seen this.