Weird 504 timeout issue when running maas-enlist, with regiond running at 100% CPU

see full issue here:

I noticed that when this happens, the node doesn't get a random hostname, but is instead called something like maas-enlistment at the login prompt. Has anyone run into this?

What could be the reason the random hostname is missing?

It seems the random hostname is unrelated to this issue; the enlistment hangs because those servers were previously deployed and there are some uncleaned records left in the database.

Some nodes report a duplicate static IP address error when I add the machine manually.
Some nodes just hang when added through the web console, with no error, just a timeout, and regiond CPU running at 100%.
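
For anyone hitting the same thing, this is roughly how I looked for duplicate IP rows from the region's Django shell (assuming the maas-region shell command is available on the region controller, and that maasserver.models exports the StaticIPAddress model named in the error). The query itself is just standard Django ORM, so treat it as a sketch rather than an official procedure:

    from django.db.models import Count
    from maasserver.models import StaticIPAddress

    # Group the recorded addresses by IP and print any IP that appears
    # more than once, i.e. the duplicates behind the error above.
    dupes = (StaticIPAddress.objects
             .exclude(ip__isnull=True)
             .values('ip')
             .annotate(n=Count('id'))
             .filter(n__gt=1))
    for row in dupes:
        print(row['ip'], row['n'])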

regiond.log:

2019-02-18 12:43:55 maasserver.utils.views: [error] Attempt #1 for /MAAS/api/2.0/machines/ failed; giving up (624.4s elapsed in total)
2019-02-18 12:43:55 -: [critical] WSGI application error
Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/twisted/python/context.py", line 122, in callWithContext
return self.currentContext().callWithContext(ctx, func, *args, **kw)
File "/usr/lib/python3/dist-packages/twisted/python/context.py", line 87, in callWithContext
self.contexts.pop()
File "/usr/lib/python3/dist-packages/provisioningserver/utils/twisted.py", line 885, in callInContext
return func(*args, **kwargs)
File "/usr/lib/python3/dist-packages/twisted/web/wsgi.py", line 522, in run
self.started = True
--- <exception caught here> ---
File "/usr/lib/python3/dist-packages/twisted/web/wsgi.py", line 500, in run
self.write(elem)
File "/usr/lib/python3/dist-packages/twisted/web/wsgi.py", line 455, in write
self.reactor, wsgiWrite, self.started)
File "/usr/lib/python3/dist-packages/twisted/internet/threads.py", line 122, in blockingCallFromThread
result.raiseException()
File "/usr/lib/python3/dist-packages/twisted/python/failure.py", line 385, in raiseException
raise self.value.with_traceback(self.tb)
builtins.AttributeError: 'NoneType' object has no attribute 'writeHeaders'

2019-02-18 12:43:55 asyncio: [error] Exception in callback <function AsyncioSelectorReactor.callLater.<locals>.run at 0x7f236ed82620>
handle: <Handle AsyncioSelectorReactor.callLater.<locals>.run>
Traceback (most recent call last):
File "uvloop/cbhandles.pyx", line 47, in uvloop.loop.Handle._run
File "/usr/lib/python3/dist-packages/twisted/internet/asyncioreactor.py", line 290, in run
f(*args, **kwargs)
File "/usr/lib/python3/dist-packages/twisted/web/wsgi.py", line 510, in wsgiError
self.request.loseConnection()
File "/usr/lib/python3/dist-packages/twisted/web/http.py", line 1474, in loseConnection
self.channel.loseConnection()
AttributeError: 'NoneType' object has no attribute 'loseConnection'

Update: I found this issue happens when DHCP doesn't assign a random hostname to the node; a node running enlistment with an empty hostname will hang and time out.

I'm trying to understand what goes wrong here; I'm not sure if my understanding is correct:

  1. The node gets its DHCP boot correctly (with an empty hostname at this point).
  2. The initramfs bootstraps; it should get a random hostname here, but instead gets the fixed name maas-enlist.
  3. cloud-init runs maas-enlist with the empty hostname, and the enlistment gets stuck.

It looks like something goes wrong in step 2. Can anyone help explain where the random hostname is generated?
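
(For what it's worth, the random enlistment hostnames I've seen on healthy nodes look like two-word petname-style names. Purely as an illustration of that format, and assuming the python3-petname package is installed, such a name can be generated like this; I don't know for certain that this is what MAAS itself uses:)

    import petname  # assumption: the python3-petname package

    # Generate a two-word random name such as "wanted-falcon".
    print(petname.Generate(2, "-"))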

OK, I tried creating the node manually and got this error:

get() returned more than one StaticIPAddress -- it returned 2!
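
That error is Django's QuerySet.get() refusing to pick between two matching rows. A minimal sketch of the failure mode (the exact query MAAS runs is my guess, and the IP is just a placeholder; only the exception handling is standard Django):

    from django.core.exceptions import MultipleObjectsReturned
    from maasserver.models import StaticIPAddress

    try:
        # get() requires exactly one matching row; two StaticIPAddress
        # rows recorded for the same address make it raise instead.
        sip = StaticIPAddress.objects.get(ip='10.0.0.5')  # placeholder IP
    except MultipleObjectsReturned:
        # The lookup stays ambiguous until one of the duplicate rows
        # is removed from the database.
        raise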


I think I know the reason now.

Those nodes were running fine until, a few days ago, the rack controller lost its connection to them because of a network topology change. I tried to release the deployed nodes, but the release hung forever.

After I brought up a new rack controller, the old nodes could no longer be enlisted, so I think there must be some dirty records left in the database.
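
In case it helps anyone else, this is roughly how I poked around for leftover entries from the region's Django shell; the Machine model name is an assumption on my part based on the maasserver package seen in the traceback above:

    from maasserver.models import Machine

    # List every machine the region still knows about so that stale
    # entries from the old rack controller stand out.
    for machine in Machine.objects.all():
        print(machine.system_id, machine.hostname, machine.status)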

I wonder if this is the same as bug #1816651. Can you look in /var/log/maas/regiond.log and check for the traceback associated with the "get() returned more than one StaticIPAddress -- it returned 2" message?

If you can test the patch I posted as a comment on the bug to see if that’s a fix, that would be great, too.

It does feel like the same issue.
I'm unable to reproduce it after upgrading to 2.5.2.

Thanks a lot!