MAAS to Provision Moonshot Nodes

I’m trying to deploy a number of chassis of Moonshot cartridges (15-20), but have run into numerous problems. I’m not sure whether I’m misusing MaaS, whether there are configuration issues, or indeed whether I’ve hit some bugs. All help is appreciated!

IP Addresses

We like to have our nodes named and IP’d sequentially, so ‘node10a45’ would be the 45th cartridge in the 10th chassis. This presents a challenge for IP addressing: the network is sized close to capacity, so I only have about 30 free IPs with which to commission hundreds of nodes.

The solution I came up with was to use DHCP snippets to control the IP addresses, with each node using the same snippet-controlled IP both for PXE booting and, once deployed, at runtime. This seemed to be working well until, I think, I hit a limit on the number of snippets I can add.

I add them to the subnet, since I can’t add a snippet to a machine until that machine has been commissioned, but after a certain number (something like 225) I get this error:

2021-12-22 12:52:24 maasserver: [error] ################################ Exception:  ################################
2021-12-22 12:52:24 maasserver: [error] Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/django/core/handlers/base.py", line 113, in _get_response
    response = wrapped_callback(request, *callback_args, **callback_kwargs)
  File "/usr/lib/python3/dist-packages/maasserver/utils/views.py", line 284, in view_atomic_with_post_commit_savepoint
    return view_atomic(*args, **kwargs)
  File "/usr/lib/python3.8/contextlib.py", line 75, in inner
    return func(*args, **kwds)
  File "/usr/lib/python3/dist-packages/maasserver/api/support.py", line 56, in __call__
    response = super().__call__(request, *args, **kwargs)
  File "/usr/lib/python3/dist-packages/django/views/decorators/vary.py", line 20, in inner_func
    response = func(*args, **kwargs)
  File "/usr/lib/python3.8/dist-packages/piston3/resource.py", line 197, in __call__
    result = self.error_handler(e, request, meth, em_format)
  File "/usr/lib/python3.8/dist-packages/piston3/resource.py", line 195, in __call__
    result = meth(request, *args, **kwargs)
  File "/usr/lib/python3/dist-packages/maasserver/api/support.py", line 308, in dispatch
    return function(self, request, *args, **kwargs)
  File "/usr/lib/python3/dist-packages/maasserver/api/support.py", line 158, in wrapper
    return func(self, request, *args, **kwargs)
  File "/usr/lib/python3/dist-packages/maasserver/api/dhcpsnippets.py", line 278, in create
    if form.is_valid():
  File "/usr/lib/python3/dist-packages/maasserver/forms/dhcpsnippet.py", line 133, in is_valid
    for error in validate_dhcp_config(self.instance):
  File "/usr/lib/python3/dist-packages/maasserver/dhcp.py", line 1077, in validate_dhcp_config
    v4_response = client(
  File "/usr/lib/python3/dist-packages/crochet/_eventloop.py", line 231, in wait
    result.raiseException()
  File "/usr/lib/python3/dist-packages/twisted/python/failure.py", line 467, in raiseException
    raise self.value.with_traceback(self.tb)
twisted.protocols.amp.TooLong

2021-12-22 12:52:24 regiond: [info] xx.xx.xx.xx POST /MAAS/api/2.0/dhcp-snippets/ HTTP/1.1 --> 500 INTERNAL_SERVER_ERROR (referrer: -; agent: Python-httplib2/0.14.0 (gzip))

Is there a better way for me to do this?
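For reference, the snippet bodies I generate follow the naming scheme above. One idea I’ve considered is consolidating the per-node snippets into one snippet per chassis to cut the count; a rough sketch of that (the MACs, base address, and slot numbering are placeholders, not real values):

```python
# Sketch: build one ISC dhcpd snippet body covering a whole chassis,
# instead of one snippet per cartridge, to keep the snippet count down.
# The MACs and base address below are placeholders, not real values.

def snippet_for_chassis(chassis, macs, base_octets=(10, 0, 0, 0)):
    """Return host declarations for every cartridge in one chassis.

    `macs` maps cartridge slot (1-based) to its boot NIC's MAC address;
    slot N gets the IP `base + N`, matching the nodeXXaYY scheme.
    """
    a, b, c, d = base_octets
    lines = []
    for slot, mac in sorted(macs.items()):
        lines.append(
            f"host node{chassis}a{slot} "
            f"{{ hardware ethernet {mac}; fixed-address {a}.{b}.{c}.{d + slot}; }}"
        )
    return "\n".join(lines)

# Two cartridges in chassis 10, with made-up MACs:
body = snippet_for_chassis(10, {45: "aa:bb:cc:00:00:45", 1: "aa:bb:cc:00:00:01"})
```

I don’t know whether consolidating actually avoids the TooLong limit, since validation presumably still covers the whole generated DHCP config, but it would at least reduce the number of snippets and API calls.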

Mass Deployment (forgive the pun)

When I try to commission a large batch of nodes at once, some of them fail. I can’t fully explain why, but part of it may be MAAS sending too many commands to the Moonshot Chassis Manager simultaneously. Is this expected?

2021-12-22 10:11:25 paramiko.transport: [info] Connected (version 2.0, client mpSSH_0.2.0)
2021-12-22 10:11:25 paramiko.transport: [info] Authentication (password) successful!
2021-12-22 10:11:26 paramiko.transport: [info] Connected (version 2.0, client mpSSH_0.2.0)
2021-12-22 10:11:26 paramiko.transport: [info] Authentication (password) successful!
2021-12-22 10:11:26 paramiko.transport: [info] Connected (version 2.0, client mpSSH_0.2.0)
2021-12-22 10:11:26 paramiko.transport: [info] Authentication (password) successful!
2021-12-22 10:11:27 paramiko.transport: [info] Authentication (password) successful!
2021-12-22 10:11:28 paramiko.transport: [info] Connected (version 2.0, client mpSSH_0.2.0)
2021-12-22 10:11:28 paramiko.transport: [info] Connected (version 2.0, client mpSSH_0.2.0)
2021-12-22 10:11:29 paramiko.transport: [info] Authentication (password) failed.
2021-12-22 10:11:29 paramiko.transport: [info] Connected (version 2.0, client mpSSH_0.2.0)
2021-12-22 10:11:29 paramiko.transport: [info] Authentication (password) successful!
2021-12-22 10:11:29 paramiko.transport: [info] Connected (version 2.0, client mpSSH_0.2.0)
2021-12-22 10:11:30 paramiko.transport: [info] Authentication (password) failed.
2021-12-22 10:11:30 paramiko.transport: [info] Connected (version 2.0, client mpSSH_0.2.0)
2021-12-22 10:11:31 paramiko.transport: [info] Connected (version 2.0, client mpSSH_0.2.0)
2021-12-22 10:11:31 paramiko.transport: [info] Authentication (password) failed.
2021-12-22 10:11:32 paramiko.transport: [info] Connected (version 2.0, client mpSSH_0.2.0)
2021-12-22 10:11:32 paramiko.transport: [info] Authentication (password) successful!
2021-12-22 10:11:32 paramiko.transport: [info] Authentication (password) failed.
2021-12-22 10:11:32 paramiko.transport: [info] Connected (version 2.0, client mpSSH_0.2.0)

All chassis use the same password, and it seems to be random which ones fail to authenticate.

DNS Errors

Finally, and this only seems to happen when deploying many nodes simultaneously, I get errors in rackd.log.

2021-12-22 10:34:14 provisioningserver.rpc.clusterservice: [critical] Failed to contact region. (While requesting RPC info at http://region.maas.internal:5240/MAAS/).
        Traceback (most recent call last):
          File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 460, in callback
            self._startRunCallbacks(result)
          File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 568, in _startRunCallbacks
            self._runCallbacks()
          File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 654, in _runCallbacks
            current.result = callback(current.result, *args, **kw)
          File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 1475, in gotResult
            _inlineCallbacks(r, g, status)
        --- <exception caught here> ---
          File "/usr/lib/python3/dist-packages/provisioningserver/rpc/clusterservice.py", line 1294, in _doUpdate
            eventloops, maas_url = yield self._get_rpc_info(urls)
          File "/usr/lib/python3/dist-packages/provisioningserver/rpc/clusterservice.py", line 1558, in _get_rpc_info
            raise config_exc
          File "/usr/lib/python3/dist-packages/provisioningserver/rpc/clusterservice.py", line 1529, in _get_rpc_info
            eventloops, maas_url = yield self._parallel_fetch_rpc_info(urls)
          File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 654, in _runCallbacks
            current.result = callback(current.result, *args, **kw)
          File "/usr/lib/python3/dist-packages/provisioningserver/rpc/clusterservice.py", line 1503, in handle_responses
            errors[0].raiseException()
          File "/usr/lib/python3/dist-packages/twisted/python/failure.py", line 467, in raiseException
            raise self.value.with_traceback(self.tb)
          File "/usr/lib/python3/dist-packages/provisioningserver/rpc/clusterservice.py", line 1464, in _serial_fetch_rpc_info
            raise last_exc
          File "/usr/lib/python3/dist-packages/provisioningserver/rpc/clusterservice.py", line 1456, in _serial_fetch_rpc_info
            response = yield self._fetch_rpc_info(url, orig_url)
          File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 1416, in _inlineCallbacks
            result = result.throwExceptionIntoGenerator(g)
          File "/usr/lib/python3/dist-packages/twisted/python/failure.py", line 491, in throwExceptionIntoGenerator
            return g.throw(self.type, self.value, self.tb)
          File "/usr/lib/python3/dist-packages/provisioningserver/rpc/clusterservice.py", line 1558, in _get_rpc_info
            raise config_exc
          File "/usr/lib/python3/dist-packages/provisioningserver/rpc/clusterservice.py", line 1529, in _get_rpc_info
            eventloops, maas_url = yield self._parallel_fetch_rpc_info(urls)
          File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 654, in _runCallbacks
            current.result = callback(current.result, *args, **kw)
          File "/usr/lib/python3/dist-packages/provisioningserver/rpc/clusterservice.py", line 1503, in handle_responses
            errors[0].raiseException()
          File "/usr/lib/python3/dist-packages/twisted/python/failure.py", line 467, in raiseException
            raise self.value.with_traceback(self.tb)
          File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 1416, in _inlineCallbacks
            result = result.throwExceptionIntoGenerator(g)
          File "/usr/lib/python3/dist-packages/twisted/python/failure.py", line 491, in throwExceptionIntoGenerator
            return g.throw(self.type, self.value, self.tb)
          File "/usr/lib/python3/dist-packages/provisioningserver/rpc/clusterservice.py", line 1464, in _serial_fetch_rpc_info
            raise last_exc
          File "/usr/lib/python3/dist-packages/provisioningserver/rpc/clusterservice.py", line 1456, in _serial_fetch_rpc_info
            response = yield self._fetch_rpc_info(url, orig_url)
        twisted.internet.error.DNSLookupError: DNS lookup failed: Couldn't find the hostname 'region.maas.internal'.

I can confirm that DNS works on the host and that region.maas.internal resolves correctly; the error also appears only intermittently.
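To get a feel for how often resolution actually fails, I’ve been probing it in a loop. The resolver is passed in so the counting logic can be checked in isolation; in real use it would be `socket.gethostbyname`:

```python
import socket

def count_failures(resolve, host, attempts):
    """Call resolve(host) `attempts` times and count the lookups that fail."""
    failures = 0
    for _ in range(attempts):
        try:
            resolve(host)
        except OSError:  # socket.gaierror subclasses OSError
            failures += 1
    return failures

# Real use (commented out here, as it needs the live network):
# count_failures(socket.gethostbyname, "region.maas.internal", 100)
```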

Any help appreciated, and any improvements to the question are welcome.

Hi there,

So the first error appears to be caused by too large a payload being sent when generating the DHCP configuration. If I understand correctly, is the subnet in question addressing the boot interface of each cartridge, or the node itself? It would be helpful if you could provide some more detail on the intended network topology.

The second error is likely some of the commands timing out. To confirm: when powering on/off a smaller number of machines, do these commands succeed consistently?

The third issue, while I’m not certain, is likely due to MAAS querying the underlying BIND server while it is reloading to add new DNS entries for each node. Is region.maas.internal a DNS record you’ve provided for the rack controller(s) to reach the region?
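If it is, and it’s only resolvable via the MAAS-managed BIND, one thing worth trying (just a suggestion, not a confirmed fix) is pointing the rack controller at the region by IP in /etc/maas/rackd.conf, which would sidestep the reload window entirely:

```yaml
# /etc/maas/rackd.conf -- hypothetical example; substitute your region's IP
maas_url: http://10.0.0.2:5240/MAAS
```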