MAAS 2.9: named crashes due to too-low open files limit

Hello,

I’ve recently switched to MAAS 2.9 because of MAAS changing the boot order. However, I’ve noticed that after some time of correct operation, the MAAS-managed bind crashes with the following log message:

28-Oct-2020 13:24:54.653 uv_export failed: permission denied
28-Oct-2020 13:24:54.653 uv_export failed: permission denied
28-Oct-2020 13:24:54.653 listening on IPv6 interface vnet8, fe80::fc54:ff:fe03:7f67%28#53
28-Oct-2020 13:24:54.673 udp.c:83: INSIST(csock->fd >= 0) failed, back trace
28-Oct-2020 13:24:54.673 #0 0x555e9f8c2e43 in ??
28-Oct-2020 13:24:54.673 #1 0x7fc90796eac0 in ??
28-Oct-2020 13:24:54.673 #2 0x7fc90798bf4d in ??
28-Oct-2020 13:24:54.673 #3 0x7fc907c5f82b in ??
28-Oct-2020 13:24:54.673 #4 0x7fc907c605d0 in ??
28-Oct-2020 13:24:54.673 #5 0x7fc907c60c1e in ??
28-Oct-2020 13:24:54.673 #6 0x555e9f8e0a6b in ??
28-Oct-2020 13:24:54.673 #7 0x555e9f8e406e in ??
28-Oct-2020 13:24:54.673 #8 0x7fc907995fe1 in ??
28-Oct-2020 13:24:54.673 #9 0x7fc90745e609 in ??
28-Oct-2020 13:24:54.673 #10 0x7fc90737f293 in ??
28-Oct-2020 13:24:54.673 exiting (due to assertion failure)

It happens on all 3 machines I have my MAAS cluster on.
MAAS version is: 2.9.0~beta5 (9002-g.2a342196f) (snap)

Additionally, I’ve noticed that named complains about the max open files limit (it is set to 65535 at the OS level, but the snap seems to have its own limit of 4096):
28-Oct-2020 12:34:37.872 max open files (4096) is smaller than max sockets (21000)
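To check whether the other MAAS hosts show the same two symptoms, a small log scan can help. This is a minimal Python sketch, assuming you already have named's log lines in hand (the log location inside the snap is not given here); the function and pattern names are hypothetical. It only matches the exact message texts quoted above.

```python
import re
from typing import Iterable, List, Tuple

# Matches named's startup warning, e.g.
# "max open files (4096) is smaller than max sockets (21000)"
WARNING_RE = re.compile(
    r"max open files \((\d+)\) is smaller than max sockets \((\d+)\)"
)
# Matches the assertion that preceded the crash in the log above.
CRASH_RE = re.compile(r"INSIST\(csock->fd >= 0\) failed")


def scan_named_log(lines: Iterable[str]) -> Tuple[List[Tuple[int, int]], int]:
    """Return the (max open files, max sockets) pair from every warning line
    and the number of csock->fd assertion failures found."""
    warnings: List[Tuple[int, int]] = []
    crashes = 0
    for line in lines:
        match = WARNING_RE.search(line)
        if match:
            warnings.append((int(match.group(1)), int(match.group(2))))
        if CRASH_RE.search(line):
            crashes += 1
    return warnings, crashes
```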


It seems that BIND has changed its max-sockets setting. According to https://kb.isc.org/docs/aa-01314, ISC_SOCKET_MAXSOCKETS changed from 4096 to 21000.

The snap profile, however, still keeps the limit at 4096:

Limit                     Soft Limit           Hard Limit           Units     
Max cpu time              unlimited            unlimited            seconds   
Max file size             unlimited            unlimited            bytes     
Max data size             unlimited            unlimited            bytes     
Max stack size            unlimited            unlimited            bytes     
Max core file size        unlimited            unlimited            bytes     
Max resident set          unlimited            unlimited            bytes     
Max processes             1543416              1543416              processes 
Max open files            4096                 4096                 files     
Max locked memory         16777216             16777216             bytes     
Max address space         unlimited            unlimited            bytes     
Max file locks            unlimited            unlimited            locks     
Max pending signals       1543416              1543416              signals   
Max msgqueue size         819200               819200               bytes     
Max nice priority         0                    0                    
Max realtime priority     0                    0                    
Max realtime timeout      unlimited            unlimited            us  

That poses a problem when named tries to start up.
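The table above is in the text format of /proc/<pid>/limits. As a minimal sketch, assuming you read that file for named's PID, the following Python pulls out each soft/hard pair and reports how far the soft "Max open files" value falls short of the 21000 sockets named expects. The helper names are hypothetical, and "unlimited" is mapped to None.

```python
import re
from typing import Dict, Optional, Tuple

# (soft, hard); None means "unlimited"
Limit = Tuple[Optional[int], Optional[int]]


def _to_int(value: str) -> Optional[int]:
    return None if value == "unlimited" else int(value)


def parse_limits(text: str) -> Dict[str, Limit]:
    """Parse /proc/<pid>/limits text into {limit name: (soft, hard)}."""
    limits: Dict[str, Limit] = {}
    for line in text.splitlines():
        if not line.startswith("Max "):
            continue  # skip the header row and blank lines
        # Columns are separated by runs of 2+ spaces; names contain single spaces.
        fields = re.split(r"\s{2,}", line.strip())
        limits[fields[0]] = (_to_int(fields[1]), _to_int(fields[2]))
    return limits


def nofile_shortfall(limits: Dict[str, Limit], required: int = 21000) -> int:
    """How many descriptors the soft open-files limit is short of `required`."""
    soft, _hard = limits["Max open files"]
    if soft is None:
        return 0
    return max(0, required - soft)
```

Usage: parse_limits(open(f"/proc/{pid}/limits").read()) for named's PID. With the table above, the shortfall is 21000 - 4096 = 16904 descriptors.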

The root cause is that, by default on Bionic, the following limits apply to snap.maas.supervisor:

LimitNOFILE=4096
LimitNOFILESoft=4096

As the log message above shows, named expects 21k sockets to be available (https://kb.isc.org/docs/aa-01314).

In this case, when the upstream DNS server was temporarily unreachable, stalled forwarded queries used up all 4096 available sockets. However, named expected 21k sockets to be available and crashed while trying to allocate more.
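The exhaustion step can be reproduced in miniature. This is a Python sketch, not named's code: it temporarily lowers the calling process's RLIMIT_NOFILE soft limit and opens UDP sockets until socket() fails with EMFILE ("Too many open files"). At the C level a failed socket() call returns -1, which is consistent with, though not proof of, the INSIST(csock->fd >= 0) assertion in the log. The function name and the limit of 128 are illustrative assumptions; it needs a Unix system because of the resource module.

```python
import errno
import resource
import socket


def open_udp_sockets_until_exhausted(soft_limit: int) -> int:
    """Lower the soft RLIMIT_NOFILE to `soft_limit`, open UDP sockets until
    socket() fails with EMFILE, and return how many were opened.
    The original limits are restored before returning."""
    old_soft, old_hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    socks = []
    # Lowering the soft limit needs no privileges (it must stay <= hard).
    resource.setrlimit(resource.RLIMIT_NOFILE, (soft_limit, old_hard))
    try:
        while True:
            try:
                socks.append(socket.socket(socket.AF_INET, socket.SOCK_DGRAM))
            except OSError as exc:
                if exc.errno != errno.EMFILE:
                    raise
                return len(socks)  # descriptor table is full
    finally:
        for sock in socks:
            sock.close()
        resource.setrlimit(resource.RLIMIT_NOFILE, (old_soft, old_hard))
```

The count returned is a little below the limit because stdin, stdout, stderr and any other already-open descriptors occupy part of the table, just as named's other files do.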

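Workaround:
run systemctl edit snap.maas.supervisor.service and add:

[Service]
LimitNOFILE=65535

A single LimitNOFILE value sets both the soft and the hard limit, and drop-in settings must sit under a [Service] section. Then restart the service (systemctl restart snap.maas.supervisor.service) so that named is started with the new limit.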

This ensures that named can allocate all 21k sockets it expects.
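To confirm the new limit actually reached named after applying the override and restarting, you can read the running process's limits. A minimal Python sketch, assuming Linux /proc and that the process name (comm) is "named"; the function names are hypothetical. The 21000 threshold is the ISC_SOCKET_MAXSOCKETS value from the ISC KB article.

```python
import os
from typing import Dict, Optional


def soft_nofile(limits_text: str) -> Optional[int]:
    """Return the soft 'Max open files' value from /proc/<pid>/limits text
    (None means unlimited)."""
    for line in limits_text.splitlines():
        if line.startswith("Max open files"):
            soft = line.split()[3]  # "Max", "open", "files", soft, hard, units
            return None if soft == "unlimited" else int(soft)
    raise ValueError("no 'Max open files' line found")


def named_nofile_limits(proc: str = "/proc") -> Dict[int, Optional[int]]:
    """Map the PID of every process whose comm is 'named' to its soft limit."""
    result: Dict[int, Optional[int]] = {}
    if not os.path.isdir(proc):
        return result  # not Linux, or /proc not mounted
    for entry in os.listdir(proc):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc, entry, "comm")) as f:
                if f.read().strip() != "named":
                    continue
            with open(os.path.join(proc, entry, "limits")) as f:
                result[int(entry)] = soft_nofile(f.read())
        except OSError:
            continue  # process exited or is not readable
    return result


if __name__ == "__main__":
    REQUIRED = 21000  # ISC_SOCKET_MAXSOCKETS per the ISC KB article
    for pid, soft in sorted(named_nofile_limits().items()):
        ok = soft is None or soft >= REQUIRED
        shown = "unlimited" if soft is None else soft
        print(f"named pid {pid}: soft NOFILE={shown} {'OK' if ok else 'TOO LOW'}")
```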
