MaaS 3.5.3 Prometheus Metrics - Process State Inconsistencies

Hi,

I’m encountering an issue with Prometheus metrics in my all-in-one MaaS 3.5.3 setup (installed via apt). I’ve enabled Prometheus and am scraping metrics from MaaSIP:5239.

The maas-regiond service spawns four child processes, which is confirmed by the output of ps aux:

maas     61271  1.3  0.6 1633224 206156 ?    Ssl  00:25   0:35 /usr/bin/python3 /usr/sbin/regiond
maas     61385  0.6  0.6 679580 201756 ?     Sl   00:25   0:17 \_ /usr/bin/python3 /usr/sbin/regiond
maas     61386  0.6  0.6 621688 207568 ?     Sl   00:25   0:18 \_ /usr/bin/python3 /usr/sbin/regiond
maas     61387  0.5  0.6 619676 201332 ?     Sl   00:25   0:13 \_ /usr/bin/python3 /usr/sbin/regiond
maas     61388  0.5  0.5 685260 187032 ?     Sl   00:25   0:13 \_ /usr/bin/python3 /usr/sbin/regiond

The Prometheus metrics also reflect data from each of these child processes, as seen in the maas_service_availability metric:

maas_service_availability{maas_id="61091eef-44ef-4989-addc-7817a93fa0b1",pid="61388",service="rackd",system_id="bx6466"} 1.0
maas_service_availability{maas_id="61091eef-44ef-4989-addc-7817a93fa0b1",pid="61385",service="rackd",system_id="bx6466"} 1.0
maas_service_availability{maas_id="61091eef-44ef-4989-addc-7817a93fa0b1",pid="61386",service="rackd",system_id="bx6466"} 1.0
maas_service_availability{maas_id="61091eef-44ef-4989-addc-7817a93fa0b1",pid="61387",service="rackd",system_id="bx6466"} 1.0

However, I’ve observed a delay in the metric updates when a service’s state changes. For example, stopping the rackd service resulted in inconsistent values across the child processes’ metrics for a short period:

$ curl -s http://MaaSIP:5239/metrics | grep maas_service_availability | grep rackd

maas_service_availability{maas_id="61091eef-44ef-4989-addc-7817a93fa0b1",pid="61388",service="rackd",system_id="bx6466"} 3.0
maas_service_availability{maas_id="61091eef-44ef-4989-addc-7817a93fa0b1",pid="61385",service="rackd",system_id="bx6466"} 1.0
maas_service_availability{maas_id="61091eef-44ef-4989-addc-7817a93fa0b1",pid="61386",service="rackd",system_id="bx6466"} 3.0
maas_service_availability{maas_id="61091eef-44ef-4989-addc-7817a93fa0b1",pid="61387",service="rackd",system_id="bx6466"} 1.0

This inconsistency makes it challenging to accurately represent service states in Grafana using queries like:

avg by (service) (maas_service_availability{maas_id="61091eef-44ef-4989-addc-7817a93fa0b1", service="rackd"})
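To make the disagreement explicit instead of averaging it away, the scraped text can be checked per service for workers that report different values. The following is only a minimal sketch: the function names and the `SAMPLE` constant are hypothetical helpers, the input is the `curl` output shown above, and nothing is assumed about what the values 1.0 and 3.0 mean.

```python
import re
from collections import defaultdict

# Matches one exposition line of the form: name{label="value",...} number
SAMPLE_RE = re.compile(r'^(?P<name>\w+)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$')
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')

# The four rackd samples from the curl output above (hypothetical constant name).
SAMPLE = """\
maas_service_availability{maas_id="61091eef-44ef-4989-addc-7817a93fa0b1",pid="61388",service="rackd",system_id="bx6466"} 3.0
maas_service_availability{maas_id="61091eef-44ef-4989-addc-7817a93fa0b1",pid="61385",service="rackd",system_id="bx6466"} 1.0
maas_service_availability{maas_id="61091eef-44ef-4989-addc-7817a93fa0b1",pid="61386",service="rackd",system_id="bx6466"} 3.0
maas_service_availability{maas_id="61091eef-44ef-4989-addc-7817a93fa0b1",pid="61387",service="rackd",system_id="bx6466"} 1.0
"""


def values_by_service(metrics_text, metric="maas_service_availability"):
    """Collect the value reported by each pid, grouped by the service label."""
    grouped = defaultdict(dict)
    for line in metrics_text.splitlines():
        match = SAMPLE_RE.match(line.strip())
        if not match or match.group("name") != metric:
            continue  # skip comments, blank lines, and other metrics
        labels = dict(LABEL_RE.findall(match.group("labels")))
        grouped[labels.get("service")][labels.get("pid")] = float(match.group("value"))
    return dict(grouped)


def disagreeing_services(metrics_text):
    """Return only the services whose pids report more than one distinct value."""
    return {
        service: per_pid
        for service, per_pid in values_by_service(metrics_text).items()
        if len(set(per_pid.values())) > 1
    }


print(sorted(disagreeing_services(SAMPLE)))  # prints ['rackd']
```

On the Grafana side, a comparable check might compare `max by (service)` with `min by (service)` over the same selector, since the two differ only while the processes disagree.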

Is this delay in metric propagation an expected behaviour? Has anyone else experienced this, and are there recommended approaches to mitigate this issue or work around it for reliable service state monitoring in Grafana?

Any insights or suggestions would be greatly appreciated.

Thank you.

I would consider this a bug. Could you please report it on Launchpad?

Sure thing. Submitted the bug: Bug #2105469 “MaaS 3.5.x Prometheus Metrics - Process State Inco...” : Bugs : MAAS
