Hi,
I’m encountering an issue with Prometheus metrics in my all-in-one MAAS 3.5.3 setup (installed via apt). I’ve enabled Prometheus and am scraping metrics from MaaSIP:5239.
The maas-regiond service spawns four child processes, as confirmed by the output of ps aux:
maas 61271 1.3 0.6 1633224 206156 ? Ssl 00:25 0:35 /usr/bin/python3 /usr/sbin/regiond
maas 61385 0.6 0.6 679580 201756 ? Sl 00:25 0:17 \_ /usr/bin/python3 /usr/sbin/regiond
maas 61386 0.6 0.6 621688 207568 ? Sl 00:25 0:18 \_ /usr/bin/python3 /usr/sbin/regiond
maas 61387 0.5 0.6 619676 201332 ? Sl 00:25 0:13 \_ /usr/bin/python3 /usr/sbin/regiond
maas 61388 0.5 0.5 685260 187032 ? Sl 0:25 0:13 \_ /usr/bin/python3 /usr/sbin/regiond
The Prometheus metrics are reported separately by each of these child processes (note the pid label), as seen in the maas_service_availability metric:
maas_service_availability{maas_id="61091eef-44ef-4989-addc-7817a93fa0b1",pid="61388",service="rackd",system_id="bx6466"} 1.0
maas_service_availability{maas_id="61091eef-44ef-4989-addc-7817a93fa0b1",pid="61385",service="rackd",system_id="bx6466"} 1.0
maas_service_availability{maas_id="61091eef-44ef-4989-addc-7817a93fa0b1",pid="61386",service="rackd",system_id="bx6466"} 1.0
maas_service_availability{maas_id="61091eef-44ef-4989-addc-7817a93fa0b1",pid="61387",service="rackd",system_id="bx6466"} 1.0
However, I’ve observed a delay in the metric updates when a service’s state changes. For example, stopping the rackd service resulted in inconsistent values across the child processes’ metrics for a short period:
$ curl -s http://MaaSIP:5239/metrics | grep maas_service_availability | grep rackd
maas_service_availability{maas_id="61091eef-44ef-4989-addc-7817a93fa0b1",pid="61388",service="rackd",system_id="bx6466"} 3.0
maas_service_availability{maas_id="61091eef-44ef-4989-addc-7817a93fa0b1",pid="61385",service="rackd",system_id="bx6466"} 1.0
maas_service_availability{maas_id="61091eef-44ef-4989-addc-7817a93fa0b1",pid="61386",service="rackd",system_id="bx6466"} 3.0
maas_service_availability{maas_id="61091eef-44ef-4989-addc-7817a93fa0b1",pid="61387",service="rackd",system_id="bx6466"} 1.0
This inconsistency makes it challenging to accurately represent service states in Grafana using queries like:
avg by (service) (maas_service_availability{maas_id="61091eef-44ef-4989-addc-7817a93fa0b1", service="rackd"})
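To illustrate why the average is a problem, here is a minimal Python sketch that parses the four sample lines from the curl output above and aggregates them per service. It is only an illustration, not MAAS code: the function names parse_samples and summarize_by_service are made up for this example, and the parser handles only the simple label syntax shown here.

```python
import re
from statistics import mean

# The four rackd samples from the curl output above, captured while the
# child processes disagreed about the service state.
SAMPLE = """\
maas_service_availability{maas_id="61091eef-44ef-4989-addc-7817a93fa0b1",pid="61388",service="rackd",system_id="bx6466"} 3.0
maas_service_availability{maas_id="61091eef-44ef-4989-addc-7817a93fa0b1",pid="61385",service="rackd",system_id="bx6466"} 1.0
maas_service_availability{maas_id="61091eef-44ef-4989-addc-7817a93fa0b1",pid="61386",service="rackd",system_id="bx6466"} 3.0
maas_service_availability{maas_id="61091eef-44ef-4989-addc-7817a93fa0b1",pid="61387",service="rackd",system_id="bx6466"} 1.0
"""

# Matches lines of the form: metric_name{label="value",...} number
LINE_RE = re.compile(r'^(?P<name>\w+)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$')
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_samples(text):
    """Return a list of (labels_dict, float_value) for each metric line."""
    samples = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip blank lines and comments
        labels = dict(LABEL_RE.findall(match.group("labels")))
        samples.append((labels, float(match.group("value"))))
    return samples


def summarize_by_service(samples):
    """Group values by the service label and compute avg, min and max."""
    by_service = {}
    for labels, value in samples:
        by_service.setdefault(labels["service"], []).append(value)
    return {
        service: {"avg": mean(values), "min": min(values), "max": max(values)}
        for service, values in by_service.items()
    }


summary = summarize_by_service(parse_samples(SAMPLE))
print(summary)
```

With these samples the average for rackd is 2.0, a value that none of the four processes actually reported, whereas min and max return 1.0 and 3.0. I don't know which of those would be the right aggregation for this metric, but it shows why avg by (service) gives a misleading reading during the transition.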
Is this delay in metric propagation expected behaviour? Has anyone else experienced this, and are there recommended ways to mitigate or work around it for reliable service state monitoring in Grafana?
Any insights or suggestions would be greatly appreciated.
Thank you.