Issues after 3.7 upgrade

Hi all,

I have upgraded MAAS to 3.7.0 and since this upgrade I have some issues.

The first one concerns powering on, powering off, and checking the power state of my machines: whichever of these actions I try, it always times out.

The second one concerns adding a new machine. When I turn on a machine and network boot it, the machine shows this message:

>>Checking Media Presence.........
>>Media Presence........
>>Start PXE over IPv4.
  Station IP address is xxx
  
  Server IP address is xxx
  NBP filename is bootx64.efi
  NBP filesize is 0 bytes
  PXE-E23: Client received TFTP error from server

The IP address given by DHCP is correct and the server IP address is my MAAS server, so DHCP is working.
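
One way to test TFTP separately from the PXE firmware is to send a read request by hand from another host. Below is a minimal Python sketch, assuming the standard TFTP port 69 and the bootx64.efi filename from the output above. The function names are made up for illustration, and the packet layout follows RFC 1350.

```python
import socket
import struct

# TFTP opcodes from RFC 1350.
OP_RRQ, OP_DATA, OP_ERROR = 1, 3, 5


def build_rrq(filename: str, mode: str = "octet") -> bytes:
    """Build a read request: opcode 1, filename, NUL, transfer mode, NUL."""
    return (
        struct.pack("!H", OP_RRQ)
        + filename.encode("ascii") + b"\0"
        + mode.encode("ascii") + b"\0"
    )


def parse_reply(packet: bytes) -> tuple[str, int, bytes]:
    """Return (kind, number, payload) for a packet received from the server.

    For DATA, number is the block number and payload the file bytes.
    For ERROR, number is the error code and payload the error message.
    """
    if len(packet) < 4:
        raise ValueError("TFTP packet too short")
    opcode, number = struct.unpack("!HH", packet[:4])
    if opcode == OP_DATA:
        return ("data", number, packet[4:])
    if opcode == OP_ERROR:
        return ("error", number, packet[4:].rstrip(b"\0"))
    return ("other", opcode, packet[2:])


def probe(server: str, filename: str = "bootx64.efi", timeout: float = 5.0):
    """Send one read request and report the first packet that comes back."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_rrq(filename), (server, 69))
        packet, _ = sock.recvfrom(2048)
    return parse_reply(packet)
```

Calling probe() with the MAAS server address should give either a "data" tuple containing the first block of the file or an "error" tuple with the server's message. A socket timeout would suggest the TFTP service is not answering at all.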

After looking at the docs, I think my rackd has a problem, but I cannot work out what it is or how to fix it. My regiond and rackd are on the same server.

Thanks in advance for any help!

Hi @hugobert1
Can you please post the logs you get when you do those operations? You can follow this guide about logging

Hi, thanks for your answer.

Here are the rackd and regiond logs from when I turn on a new machine:

Jan 26 15:40:11 user maas-regiond[12592]: regiond: [info] 127.0.0.1 GET /MAAS/rpc/ HTTP/1.1 --> 200 OK (referrer: -; agent: provisioningserver.rpc.clusterservice.ClusterClientService)
Jan 26 15:40:27 user maas-regiond[12593]: regiond: [info] 127.0.0.1 GET /MAAS/accounts/login/ HTTP/1.1 --> 200 OK (referrer: http://<IP_MAAS>:5240/MAAS/r/machines; agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:145.0) Gecko/20100101 Firefox/145.0)
Jan 26 15:40:35 user maas-rackd[12513]: provisioningserver.rackdservices.tftp: [info] bootx64.efi requested by <IP_client>
Jan 26 15:40:35 user maas-rackd[12513]: provisioningserver.rackdservices.http: [info] /images/bootx64.efi requested by 127.0.0.1
Jan 26 15:40:38 user maas-regiond[12592]: regiond: [info] 127.0.0.1 GET /MAAS/accounts/login/ HTTP/1.1 --> 200 OK (referrer: http://<IP_MAAS>:5240/MAAS/r/machines; agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:145.0) Gecko/20100101 Firefox/145.0)
Jan 26 15:40:41 user maas-regiond[12592]: regiond: [info] 127.0.0.1 GET /MAAS/rpc/ HTTP/1.1 --> 200 OK (referrer: -; agent: provisioningserver.rpc.clusterservice.ClusterClientService)
Jan 26 15:40:58 user maas-regiond[12593]: regiond: [info] 127.0.0.1 GET /MAAS/accounts/login/ HTTP/1.1 --> 200 OK (referrer: http://<IP_MAAS>:5240/MAAS/r/machines; agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:145.0) Gecko/20100101 Firefox/145.0)
Jan 26 15:41:02 user maas-regiond[11881]: maasserver.regiondservices.dns: [info] BIND is already up to date. Skipping update and reload.

I am also including the maas-agent logs because I noticed something strange there:

Jan 26 15:43:08 user maas-agent[62886]: ERR Failed to setup cluster-service Attempt=1 Namespace=default RunID=6d1152ac-343a-4ea9-96db-60bccb628b6d SpanID=0000000000000000 TaskQueue=ckeq3e@agent:main TraceID=00000000000000000000000000000000 WorkerID=ckeq3e@agent:62886 WorkflowID=configure-cluster-service:ckeq3e WorkflowType=configure-cluster-service err="deadline exceeded (type: ScheduleToClose)"
Jan 26 15:43:08 user maas-agent[62886]: ERR Workflow configure-agent failed error="workflow execution error (type: configure-agent, workflowID: configure-agent:ckeq3e, runID: 1b8d9732-964d-4c43-a738-fc4759f3756c): child workflow execution error (type: configure-cluster-service, workflowID: configure-cluster-service:ckeq3e, runID: 6d1152ac-343a-4ea9-96db-60bccb628b6d, initiatedEventID: 5, startedEventID: 6): deadline exceeded (type: ScheduleToClose)"
Jan 26 15:43:09 user maas-agent[64008]: INF Logger is configured with log level "info"
Jan 26 15:43:09 user maas-agent[64008]: INF Started Worker Namespace=default TaskQueue=ckeq3e@agent:main WorkerID=ckeq3e@agent:64008
Jan 26 15:43:09 user maas-agent[64008]: INF Configuring cluster-service Attempt=1 Namespace=default RunID=953ac6c7-d632-45c6-9a56-34d92d6ec843 SpanID=0000000000000000 TaskQueue=ckeq3e@agent:main TraceID=00000000000000000000000000000000 WorkerID=ckeq3e@agent:64008 WorkflowID=configure-cluster-service:ckeq3e WorkflowType=configure-cluster-service
Jan 26 15:43:09 user maas-agent[64008]: ERR Failed to start Microcluster Attempt=1 Namespace=default RunID=953ac6c7-d632-45c6-9a56-34d92d6ec843 SpanID=0000000000000000 TaskQueue=ckeq3e@agent:main TraceID=00000000000000000000000000000000 WorkerID=ckeq3e@agent:64008 WorkflowID=configure-cluster-service:ckeq3e WorkflowType=configure-cluster-service err="Daemon stopped with error: Daemon failed to start: Failed to initialize trust store: too many open files"

Here are the regiond and rackd logs from when I check the power state of one of my machines. Nothing shows up for rackd:

Jan 26 15:57:07 user maas-regiond[12592]: regiond: [info] 127.0.0.1 GET /MAAS/accounts/login/ HTTP/1.1 --> 200 OK (referrer: http://<IP_MAAS>:5240/MAAS/r/machines; agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:145.0) Gecko/20100101 Firefox/145.0)
Jan 26 15:57:11 user maas-regiond[12593]: regiond: [info] 127.0.0.1 GET /MAAS/rpc/ HTTP/1.1 --> 200 OK (referrer: -; agent: provisioningserver.rpc.clusterservice.ClusterClientService)
Jan 26 15:57:58 user maas-regiond[12593]: regiond: [info] 127.0.0.1 GET /MAAS/accounts/login/ HTTP/1.1 --> 200 OK (referrer: http://<IP_MAAS>:5240/MAAS/r/machines; agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:145.0) Gecko/20100101 Firefox/145.0)
Jan 26 15:58:02 user maas-regiond[11881]: maasserver.regiondservices.dns: [info] BIND is already up to date. Skipping update and reload.
Jan 26 15:58:11 user maas-regiond[12592]: regiond: [info] 127.0.0.1 GET /MAAS/rpc/ HTTP/1.1 --> 200 OK (referrer: -; agent: provisioningserver.rpc.clusterservice.ClusterClientService)
Jan 26 15:58:19 user maas-regiond[12592]: regiond: [info] 127.0.0.1 GET /MAAS/accounts/login/ HTTP/1.1 --> 200 OK (referrer: http://<IP_MAAS>:5240/MAAS/r/machines; agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:145.0) Gecko/20100101 Firefox/145.0)
Jan 26 15:58:21 user maas-regiond[12591]: root: [error]
Jan 26 15:58:21 user maas-regiond[12591]: Traceback (most recent call last):
Jan 26 15:58:21 user maas-regiond[12591]:   File "/snap/maas/40917/usr/lib/python3/dist-packages/django/core/handlers/base.py", line 197, in _get_response
Jan 26 15:58:21 user maas-regiond[12591]:     response = wrapped_callback(request, *callback_args, **callback_kwargs)
Jan 26 15:58:21 user maas-regiond[12591]:                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Jan 26 15:58:21 user maas-regiond[12591]:   File "/snap/maas/40917/lib/python3.12/site-packages/maasserver/utils/views.py", line 297, in view_atomic_with_post_commit_savepoint
Jan 26 15:58:21 user maas-regiond[12591]:     return view_atomic(*args, **kwargs)
Jan 26 15:58:21 user maas-regiond[12591]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Jan 26 15:58:21 user maas-regiond[12591]:   File "/usr/lib/python3.12/contextlib.py", line 81, in inner
Jan 26 15:58:21 user maas-regiond[12591]:     return func(*args, **kwds)
Jan 26 15:58:21 user maas-regiond[12591]:            ^^^^^^^^^^^^^^^^^^^
Jan 26 15:58:21 user maas-regiond[12591]:   File "/snap/maas/40917/lib/python3.12/site-packages/maasserver/api/support.py", line 62, in __call__
Jan 26 15:58:21 user maas-regiond[12591]:     response = super().__call__(request, *args, **kwargs)
Jan 26 15:58:21 user maas-regiond[12591]:                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Jan 26 15:58:21 user maas-regiond[12591]:   File "/snap/maas/40917/usr/lib/python3/dist-packages/django/views/decorators/vary.py", line 21, in inner_func
Jan 26 15:58:21 user maas-regiond[12591]:     response = func(*args, **kwargs)
Jan 26 15:58:21 user maas-regiond[12591]:                ^^^^^^^^^^^^^^^^^^^^^
Jan 26 15:58:21 user maas-regiond[12591]:   File "/snap/maas/40917/usr/lib/python3/dist-packages/piston3/resource.py", line 208, in __call__
Jan 26 15:58:21 user maas-regiond[12591]:     result = self.error_handler(e, request, meth, em_format)
Jan 26 15:58:21 user maas-regiond[12591]:              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Jan 26 15:58:21 user maas-regiond[12591]:   File "/snap/maas/40917/usr/lib/python3/dist-packages/piston3/resource.py", line 206, in __call__
Jan 26 15:58:21 user maas-regiond[12591]:     result = meth(request, *args, **kwargs)
Jan 26 15:58:21 user maas-regiond[12591]:              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Jan 26 15:58:21 user maas-regiond[12591]:   File "/snap/maas/40917/lib/python3.12/site-packages/maasserver/api/support.py", line 371, in dispatch
Jan 26 15:58:21 user maas-regiond[12591]:     return function(self, request, *args, **kwargs)
Jan 26 15:58:21 user maas-regiond[12591]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Jan 26 15:58:21 user maas-regiond[12591]:   File "/snap/maas/40917/lib/python3.12/site-packages/maasserver/api/nodes.py", line 964, in query_power_state
Jan 26 15:58:21 user maas-regiond[12591]:     return {"state": node.power_query().wait(60)}
Jan 26 15:58:21 user maas-regiond[12591]:                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^
Jan 26 15:58:21 user maas-regiond[12591]:   File "/snap/maas/40917/usr/lib/python3/dist-packages/crochet/_eventloop.py", line 194, in wait
Jan 26 15:58:21 user maas-regiond[12591]:     result = self._result(timeout)
Jan 26 15:58:21 user maas-regiond[12591]:              ^^^^^^^^^^^^^^^^^^^^^
Jan 26 15:58:21 user maas-regiond[12591]:   File "/snap/maas/40917/usr/lib/python3/dist-packages/crochet/_eventloop.py", line 173, in _result
Jan 26 15:58:21 user maas-regiond[12591]:     raise TimeoutError()
Jan 26 15:58:21 user maas-regiond[12591]: crochet._eventloop.TimeoutError
Jan 26 15:58:21 user maas-regiond[12591]: regiond: [info] 127.0.0.1 GET /MAAS/api/2.0/machines/d78e3g/?op=query_power_state HTTP/1.1 --> 504 GATEWAY_TIMEOUT (referrer: -; agent: Python-httplib2/0.20.4 (gzip))
Jan 26 15:58:29 user maas-regiond[12590]: regiond: [info] 127.0.0.1 GET /MAAS/accounts/login/ HTTP/1.1 --> 200 OK (referrer: http://<IP_MAAS>:5240/MAAS/r/machines; agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:145.0) Gecko/20100101 Firefox/145.0)
 

Regarding the maas-agent logs, can you check the list of open files with lsof?
The only bug I remember about it is this one, maybe you are affected.
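
If lsof is not handy, a rough equivalent is to count the entries under /proc/<pid>/fd. The Python sketch below is only an illustration: the function names and the "maas-agent" search string are assumptions, and it needs to run as root to see file descriptors of processes owned by other users.

```python
import os


def count_open_fds(pid: int, proc_root: str = "/proc") -> int:
    """Count the open file descriptors of one process (Linux only)."""
    return len(os.listdir(os.path.join(proc_root, str(pid), "fd")))


def fd_usage(name_fragment: str, proc_root: str = "/proc"):
    """Return (pid, command line, open fd count) for matching processes."""
    results = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "sys"
        pid = int(entry)
        try:
            with open(os.path.join(proc_root, entry, "cmdline"), "rb") as fh:
                raw = fh.read()
            # Arguments are NUL-separated in /proc/<pid>/cmdline.
            cmdline = raw.replace(b"\0", b" ").decode(errors="replace").strip()
            if name_fragment in cmdline:
                results.append((pid, cmdline, count_open_fds(pid, proc_root)))
        except OSError:
            continue  # process exited or access was denied
    return sorted(results)
```

For example, fd_usage("maas-agent") lists each matching process with its descriptor count, which can then be compared against the "Max open files" row in /proc/<pid>/limits.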

Regarding the power checks: I see a timeout error, can you double check that the power drivers are configured correctly and that the IPs are reachable?

I found the solution, and all my problems were related. Because of the errors shown in the maas-agent logs, I could not add new machines, check their power state, or turn them on/off.

It was an inotify limit issue, so I raised the limit and restarted snap.maas.pebble.service, and it’s all working now!
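
For anyone hitting the same "too many open files" error, here is a rough Python sketch for comparing the kernel's inotify limits with current usage. It assumes a Linux host and the standard /proc layout, the function names are made up for illustration, and it needs root to see other users' processes. It does not reproduce the exact limit or value changed in this thread.

```python
import os
from collections import Counter


def read_inotify_limits(sys_root: str = "/proc/sys/fs/inotify") -> dict:
    """Read the kernel's inotify limits (the same values sysctl reports)."""
    limits = {}
    for name in ("max_user_instances", "max_user_watches", "max_queued_events"):
        with open(os.path.join(sys_root, name)) as fh:
            limits[name] = int(fh.read().strip())
    return limits


def count_inotify_instances(proc_root: str = "/proc") -> Counter:
    """Count inotify instances per PID.

    An inotify instance appears in /proc/<pid>/fd as a symlink whose
    target is "anon_inode:inotify".
    """
    counts = Counter()
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        fd_dir = os.path.join(proc_root, entry, "fd")
        try:
            fds = os.listdir(fd_dir)
        except OSError:
            continue  # process exited or access was denied
        for fd in fds:
            try:
                target = os.readlink(os.path.join(fd_dir, fd))
            except OSError:
                continue
            if target == "anon_inode:inotify":
                counts[int(entry)] += 1
    return counts
```

The instance limit applies per user rather than per process, so the per-PID counts have to be summed for all processes owned by the same user before comparing them with max_user_instances. The limits correspond to the sysctl keys fs.inotify.max_user_instances and fs.inotify.max_user_watches.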

Thanks for your help
