I’m trying to deploy a number of chassis of Moonshot cartridges (15-20), but have run into several problems. I’m not sure whether I’m misusing MAAS, whether there are configuration issues, or whether I’ve hit some bugs. All help is appreciated!
IP Addresses
We like to have our nodes named and IP’d sequentially, so ‘node10a45’ would be the 45th cartridge in the 10th chassis. This makes IP addressing a challenge: the network is sized to fit the nodes closely, so I only have about 30 free IPs with which to commission hundreds of nodes.
The solution I came up with was to use DHCP snippets to control the IP addresses, so that each node both PXE boots and, once deployed, runs on the same IP assigned by its snippet. This seemed to work well until I hit what I think is a limit on the number of snippets I can add.
I add them to the subnet, as I can’t add them to a machine until the machine is commissioned, but after a certain number of snippets, around 225, I get this error:
2021-12-22 12:52:24 maasserver: [error] ################################ Exception: ################################
2021-12-22 12:52:24 maasserver: [error] Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/django/core/handlers/base.py", line 113, in _get_response
response = wrapped_callback(request, *callback_args, **callback_kwargs)
File "/usr/lib/python3/dist-packages/maasserver/utils/views.py", line 284, in view_atomic_with_post_commit_savepoint
return view_atomic(*args, **kwargs)
File "/usr/lib/python3.8/contextlib.py", line 75, in inner
return func(*args, **kwds)
File "/usr/lib/python3/dist-packages/maasserver/api/support.py", line 56, in __call__
response = super().__call__(request, *args, **kwargs)
File "/usr/lib/python3/dist-packages/django/views/decorators/vary.py", line 20, in inner_func
response = func(*args, **kwargs)
File "/usr/lib/python3.8/dist-packages/piston3/resource.py", line 197, in __call__
result = self.error_handler(e, request, meth, em_format)
File "/usr/lib/python3.8/dist-packages/piston3/resource.py", line 195, in __call__
result = meth(request, *args, **kwargs)
File "/usr/lib/python3/dist-packages/maasserver/api/support.py", line 308, in dispatch
return function(self, request, *args, **kwargs)
File "/usr/lib/python3/dist-packages/maasserver/api/support.py", line 158, in wrapper
return func(self, request, *args, **kwargs)
File "/usr/lib/python3/dist-packages/maasserver/api/dhcpsnippets.py", line 278, in create
if form.is_valid():
File "/usr/lib/python3/dist-packages/maasserver/forms/dhcpsnippet.py", line 133, in is_valid
for error in validate_dhcp_config(self.instance):
File "/usr/lib/python3/dist-packages/maasserver/dhcp.py", line 1077, in validate_dhcp_config
v4_response = client(
File "/usr/lib/python3/dist-packages/crochet/_eventloop.py", line 231, in wait
result.raiseException()
File "/usr/lib/python3/dist-packages/twisted/python/failure.py", line 467, in raiseException
raise self.value.with_traceback(self.tb)
twisted.protocols.amp.TooLong
2021-12-22 12:52:24 regiond: [info] xx.xx.xx.xx POST /MAAS/api/2.0/dhcp-snippets/ HTTP/1.1 --> 500 INTERNAL_SERVER_ERROR (referrer: -; agent: Python-httplib2/0.14.0 (gzip))
Is there a better way for me to do this?
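To make the scheme concrete, here is a minimal sketch of how a node name could be mapped to an IP and turned into a DHCP host declaration. The base network, the octet mapping (third octet for the chassis, fourth for the cartridge) and the MAC address are assumptions for illustration only, not the values actually in use.

```python
import ipaddress
import re

# Hypothetical base network; the real subnet is not given in this post.
BASE_NETWORK = ipaddress.ip_network("10.0.0.0/16")

# Matches names such as 'node10a45' (chassis 10, cartridge 45).
NODE_NAME = re.compile(r"^node(\d+)a(\d+)$")


def parse_node_name(name):
    """Split a node name into (chassis, cartridge) integers."""
    match = NODE_NAME.match(name)
    if match is None:
        raise ValueError(f"unexpected node name: {name!r}")
    return int(match.group(1)), int(match.group(2))


def node_ip(chassis, cartridge):
    """Assumed mapping: third octet is the chassis, fourth is the cartridge."""
    return BASE_NETWORK.network_address + chassis * 256 + cartridge


def host_declaration(name, mac):
    """Build an ISC dhcpd host block that pins a MAC address to the node's IP."""
    chassis, cartridge = parse_node_name(name)
    return (
        f"host {name} {{\n"
        f"  hardware ethernet {mac};\n"
        f"  fixed-address {node_ip(chassis, cartridge)};\n"
        f"}}\n"
    )


if __name__ == "__main__":
    # Placeholder MAC address, for illustration only.
    print(host_declaration("node10a45", "aa:bb:cc:dd:ee:ff"))
```

Each generated block corresponds to one of the snippets described above, so several hundred nodes means several hundred such declarations.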
Mass Deployment (forgive the pun)
When I try to commission a large batch of nodes at once, it fails. I can’t explain why, but part of the cause might be that too many commands are being sent to the Moonshot Chassis Manager simultaneously. Is this expected?
2021-12-22 10:11:25 paramiko.transport: [info] Connected (version 2.0, client mpSSH_0.2.0)
2021-12-22 10:11:25 paramiko.transport: [info] Authentication (password) successful!
2021-12-22 10:11:26 paramiko.transport: [info] Connected (version 2.0, client mpSSH_0.2.0)
2021-12-22 10:11:26 paramiko.transport: [info] Authentication (password) successful!
2021-12-22 10:11:26 paramiko.transport: [info] Connected (version 2.0, client mpSSH_0.2.0)
2021-12-22 10:11:26 paramiko.transport: [info] Authentication (password) successful!
2021-12-22 10:11:27 paramiko.transport: [info] Authentication (password) successful!
2021-12-22 10:11:28 paramiko.transport: [info] Connected (version 2.0, client mpSSH_0.2.0)
2021-12-22 10:11:28 paramiko.transport: [info] Connected (version 2.0, client mpSSH_0.2.0)
2021-12-22 10:11:29 paramiko.transport: [info] Authentication (password) failed.
2021-12-22 10:11:29 paramiko.transport: [info] Connected (version 2.0, client mpSSH_0.2.0)
2021-12-22 10:11:29 paramiko.transport: [info] Authentication (password) successful!
2021-12-22 10:11:29 paramiko.transport: [info] Connected (version 2.0, client mpSSH_0.2.0)
2021-12-22 10:11:30 paramiko.transport: [info] Authentication (password) failed.
2021-12-22 10:11:30 paramiko.transport: [info] Connected (version 2.0, client mpSSH_0.2.0)
2021-12-22 10:11:31 paramiko.transport: [info] Connected (version 2.0, client mpSSH_0.2.0)
2021-12-22 10:11:31 paramiko.transport: [info] Authentication (password) failed.
2021-12-22 10:11:32 paramiko.transport: [info] Connected (version 2.0, client mpSSH_0.2.0)
2021-12-22 10:11:32 paramiko.transport: [info] Authentication (password) successful!
2021-12-22 10:11:32 paramiko.transport: [info] Authentication (password) failed.
2021-12-22 10:11:32 paramiko.transport: [info] Connected (version 2.0, client mpSSH_0.2.0)
All chassis use the same password, and it seems to be random which ones fail to authenticate.
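If the chassis manager really is being overloaded, one way to test that would be to limit how many commands hit each chassis at once. The sketch below is only an illustration of that idea: `run_per_chassis` and the `action` callable are hypothetical names, and `action` stands in for whatever actually starts commissioning for a node (for example an API call). It is not something MAAS provides.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def run_per_chassis(jobs, action, per_chassis_limit=1, max_workers=16):
    """Run action(chassis_id, node_name) for every job.

    At most per_chassis_limit calls run against the same chassis at any
    moment; different chassis may still be worked on in parallel.
    """
    jobs = list(jobs)
    # One semaphore per chassis, created up front so worker threads only read the dict.
    gates = {
        chassis_id: threading.BoundedSemaphore(per_chassis_limit)
        for chassis_id in {chassis_id for chassis_id, _ in jobs}
    }

    def guarded(chassis_id, node_name):
        with gates[chassis_id]:
            return action(chassis_id, node_name)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(guarded, c, n) for c, n in jobs]
        # Results come back in the same order as the input jobs.
        return [future.result() for future in futures]
```

If the authentication failures stop with `per_chassis_limit=1` and return as the limit is raised, that would point at the chassis manager rather than at the credentials.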
DNS errors
Finally, and this only seems to happen when deploying many nodes simultaneously, I get errors in rackd.log:
2021-12-22 10:34:14 provisioningserver.rpc.clusterservice: [critical] Failed to contact region. (While requesting RPC info at http://region.maas.internal:5240/MAAS/).
Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 460, in callback
self._startRunCallbacks(result)
File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 568, in _startRunCallbacks
self._runCallbacks()
File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 654, in _runCallbacks
current.result = callback(current.result, *args, **kw)
File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 1475, in gotResult
_inlineCallbacks(r, g, status)
--- <exception caught here> ---
File "/usr/lib/python3/dist-packages/provisioningserver/rpc/clusterservice.py", line 1294, in _doUpdate
eventloops, maas_url = yield self._get_rpc_info(urls)
File "/usr/lib/python3/dist-packages/provisioningserver/rpc/clusterservice.py", line 1558, in _get_rpc_info
raise config_exc
File "/usr/lib/python3/dist-packages/provisioningserver/rpc/clusterservice.py", line 1529, in _get_rpc_info
eventloops, maas_url = yield self._parallel_fetch_rpc_info(urls)
File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 654, in _runCallbacks
current.result = callback(current.result, *args, **kw)
File "/usr/lib/python3/dist-packages/provisioningserver/rpc/clusterservice.py", line 1503, in handle_responses
errors[0].raiseException()
File "/usr/lib/python3/dist-packages/twisted/python/failure.py", line 467, in raiseException
raise self.value.with_traceback(self.tb)
File "/usr/lib/python3/dist-packages/provisioningserver/rpc/clusterservice.py", line 1464, in _serial_fetch_rpc_info
raise last_exc
File "/usr/lib/python3/dist-packages/provisioningserver/rpc/clusterservice.py", line 1456, in _serial_fetch_rpc_info
response = yield self._fetch_rpc_info(url, orig_url)
File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 1416, in _inlineCallbacks
result = result.throwExceptionIntoGenerator(g)
File "/usr/lib/python3/dist-packages/twisted/python/failure.py", line 491, in throwExceptionIntoGenerator
return g.throw(self.type, self.value, self.tb)
File "/usr/lib/python3/dist-packages/provisioningserver/rpc/clusterservice.py", line 1558, in _get_rpc_info
raise config_exc
File "/usr/lib/python3/dist-packages/provisioningserver/rpc/clusterservice.py", line 1529, in _get_rpc_info
eventloops, maas_url = yield self._parallel_fetch_rpc_info(urls)
File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 654, in _runCallbacks
current.result = callback(current.result, *args, **kw)
File "/usr/lib/python3/dist-packages/provisioningserver/rpc/clusterservice.py", line 1503, in handle_responses
errors[0].raiseException()
File "/usr/lib/python3/dist-packages/twisted/python/failure.py", line 467, in raiseException
raise self.value.with_traceback(self.tb)
File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 1416, in _inlineCallbacks
result = result.throwExceptionIntoGenerator(g)
File "/usr/lib/python3/dist-packages/twisted/python/failure.py", line 491, in throwExceptionIntoGenerator
return g.throw(self.type, self.value, self.tb)
File "/usr/lib/python3/dist-packages/provisioningserver/rpc/clusterservice.py", line 1464, in _serial_fetch_rpc_info
raise last_exc
File "/usr/lib/python3/dist-packages/provisioningserver/rpc/clusterservice.py", line 1456, in _serial_fetch_rpc_info
response = yield self._fetch_rpc_info(url, orig_url)
twisted.internet.error.DNSLookupError: DNS lookup failed: Couldn't find the hostname 'region.maas.internal'.
I can confirm that I can use DNS on the host and it resolves correctly. This error also doesn’t appear all the time.
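Since the failure is intermittent, a single successful lookup does not rule out DNS problems under load. A small probe like the following sketch could be left running on the rack controller during a deployment to count how often resolution fails. `probe_dns` is a hypothetical helper, and the `resolver` and `sleep` parameters exist only so it can be exercised without a network.

```python
import socket
import time


def probe_dns(hostname, count=100, interval=0.1,
              resolver=socket.getaddrinfo, sleep=time.sleep):
    """Resolve hostname count times and return a list of (attempt, error) failures."""
    failures = []
    for attempt in range(count):
        try:
            resolver(hostname, None)
        except socket.gaierror as exc:
            # Record which attempt failed and why, then keep probing.
            failures.append((attempt, str(exc)))
        if attempt + 1 < count:
            sleep(interval)
    return failures


if __name__ == "__main__":
    failed = probe_dns("region.maas.internal", count=50)
    print(f"{len(failed)} of 50 lookups failed")
    for attempt, error in failed:
        print(attempt, error)
```

An empty result while the rackd.log errors continue would suggest the problem lies in how the rack controller performs its lookups rather than in the host’s resolver.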
Any help appreciated, and any improvements to the question are welcome.