400 Bad Request errors during deploy starting yesterday

nturner · 14 June 2022 17:16

Over the past 24 hours or so, I’ve started seeing a lot of deploy failures. I’m
using the maas Python API, and the failures do not appear to be specific to any
particular machine.

CLIENT

The client script uses python3-libmaas version 0.6.4-0ubuntu1 and is essentially this:

self.__client = maas.client.connect(MAAS_BASE_URI, apikey=self.config['apikey'])
machines = self.__client.machines.list()
# Some filtering of machines list occurs, then:
for machine in machines:
    logger.info('Asking MAAS to deploy %s', machine.fqdn)
    machine.deploy()

This tool has been working pretty reliably for the past year or two, and has not changed in months.

The failures look like this on the client side:

INFO     egd: Asking MAAS to deploy vm-008-4269.maas.eng.exagrid.com
POST https://maas.eng.exagrid.com/MAAS/api/2.0/machines/hsnt44/?op=deploy -> HTTP 400 Bad Request (Failed to render preseed: Unable to find MAAS ser…)

SERVER

# snap list maas
Name  Version                  Rev    Tracking    Publisher   Notes
maas  3.1.0-10901-g.f1f8f1505  19835  3.1/stable  canonical✓  -

On the server, if I do tail -F /var/snap/maas/common/log/*.log, I see stuff like this:

==> /var/snap/maas/common/log/maas.log <==
2022-06-14T16:35:46.247632+00:00 eng1713 maas.api: [info] Request from user gitlab-runner to acquire machine: vm-008-4269.maas.eng.exagrid.com (hsnt44)
2022-06-14T16:35:46.282865+00:00 eng1713 maas.node: [info] vm-008-4269: Status transition from READY to ALLOCATED
2022-06-14T16:35:46.295518+00:00 eng1713 maas.node: [info] vm-008-4269: allocated to user gitlab-runner

==> /var/snap/maas/common/log/regiond.log <==
2022-06-14 16:35:46 maasserver.preseed: [warn] WARNING: '/snap/maas/19835/etc/maas/preseeds/curtin_userdata' contains deprecated preseed variables. Please remove: main_archive_directory, ports_archive_directory
2022-06-14 16:35:47 regiond: [info] 127.0.0.1 POST /MAAS/api/2.0/machines/hsnt44/?op=deploy HTTP/1.1 --> 400 BAD_REQUEST (referrer: -; agent: Python/3.8 aiohttp/3.6.2)

Can I turn up tracing somewhere to get more details about why maas thinks these are bad requests?

billwear · 14 June 2022 18:35

did you also scan rackd.log with that tail? what did you get there?

nturner · 14 June 2022 18:44

Yes, rackd.log was included in that tail. No messages near the time of the bad request message. Here’s what I see in rackd.log for the few minutes before and after:

2022-06-14 16:29:49 provisioningserver.rackdservices.dhcp_probe_service: [info] Probe for external DHCP servers started on interfaces: bond0.
2022-06-14 16:29:59 provisioningserver.rackdservices.dhcp_probe_service: [info] External DHCP probe complete.
2022-06-14 16:39:49 provisioningserver.rackdservices.dhcp_probe_service: [info] Probe for external DHCP servers started on interfaces: bond0.
2022-06-14 16:39:59 provisioningserver.rackdservices.dhcp_probe_service: [info] External DHCP probe complete.
2022-06-14 16:40:28 provisioningserver.rpc.clusterservice: [info] Failure on ping dropping connection to event-loop: eng1713:pid=1490880
2022-06-14 16:40:28 ClusterClient,client: [info] ClusterClient connection lost (HOST:IPv6Address(type='TCP', host='2602:fbab:e0:3004::4', port=34024, flowInfo=0, scopeID=0) PEER:IPv6Address(type='TCP', host='2602:fbab:e0:3004::4', port=5252, flowInfo=0, s
copeID=0))
2022-06-14 16:40:28 provisioningserver.rackdservices.service_monitor_service: [critical] Failed to monitor services and update region.
        Traceback (most recent call last):
        --- <exception caught here> ---
          File "/snap/maas/19835/lib/python3.8/site-packages/provisioningserver/rackdservices/service_monitor_service.py", line 92, in _updateRegion
            yield client(
          File "/snap/maas/19835/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 654, in _runCallbacks
            current.result = callback(current.result, *args, **kw)
          File "/snap/maas/19835/usr/lib/python3/dist-packages/twisted/protocols/amp.py", line 1994, in _massageError
            error.trap(RemoteAmpError)
          File "/snap/maas/19835/usr/lib/python3/dist-packages/twisted/python/failure.py", line 439, in trap
            self.raiseException()
          File "/snap/maas/19835/usr/lib/python3/dist-packages/twisted/python/failure.py", line 467, in raiseException
            raise self.value.with_traceback(self.tb)
        twisted.internet.error.ConnectionDone: Connection was closed cleanly.
        
2022-06-14 16:40:29 provisioningserver.rpc.clusterservice: [info] Making connections to event-loops: eng1713:pid=1490880
2022-06-14 16:40:29 Uninitialized: [info] ClusterClient connection established (HOST:IPv6Address(type='TCP', host='2602:fbab:e0:3004::4', port=44988, flowInfo=0, scopeID=0) PEER:IPv6Address(type='TCP', host='2602:fbab:e0:3004::4', port=5252, flowInfo=0, s
copeID=0))
2022-06-14 16:40:34 provisioningserver.rpc.clusterservice: [info] Fully connected to all 4 event-loops on all 1 region controllers (eng1713).
2022-06-14 16:41:07 provisioningserver.rpc.clusterservice: [info] Event-loop 'eng1713:pid=1490880' authenticated.
2022-06-14 16:41:08 provisioningserver.rpc.clusterservice: [info] Rack controller 'dygakk' registered (via eng1713:pid=1490880) with MAAS version 3.1.0-10901-g.f1f8f1505.

That stack trace seems a little odd, but seem likely unrelated. Thoughts?

billwear · 14 June 2022 18:57

not off the top of my head. i’m going to ask someone in CET in the morning, they might have seen this.

billwear · 17 August 2022 21:50

@nturner, did this get resolved? if not, please file a bug, since we’re not able to get to a resolution here.