New queries in 3.4(.1) result in Postgresql running 100% cpu, maas ui/api hangs

Since upgrading to v3.4, and continuing after v3.4.1 applied, we see the MAAS UI/API hang and the postgres server CPU at 100% at various times throughout the day. A restart of regiond is usually necessary.

The behavior seems to be related to two new expensive (UI?) queries that are both called frequently and take a long time to complete. Can we get some understanding of what these queries are and how to avoid them taking down the system?

SELECT DISTINCT "maasserver_node"."id", "maasserver_node"."created", "maasserver_node"."updated", "maasserver_node"."system_id", "maasserver_node"."hardware_uuid", "maasserver_node"."hostname", "maasserver_node"."description", "maasserver_node"."pool_id", "maasserver_node"."domain_id", "maasserver_node"."address_ttl", "maasserver_node"."status", "maasserver_node"."previous_status", "maasserver_node"."status_expires", "maasserver_node"."owner_id", "maasserver_node"."bios_boot_method", "maasserver_node"."osystem", "maasserver_node"."distro_series", "maasserver_node"."architecture", "maasserver_node"."min_hwe_kernel", "maasserver_node"."hwe_kernel", "maasserver_node"."node_type", "maasserver_node"."parent_id", "maasserver_node"."agent_name", "maasserver_node"."error_description", "maasserver_node"."zone_id", "maasserver_node"."cpu_count", "maasserver_node"."cpu_speed", "maasserver_node"."memory", "maasserver_node"."swap_size", "maasserver_node"."bmc_id", "maasserver_node"."instance_power_parameters", "maasserver_node"."power_state", "maasserver_node"."power_state_queried", "maasserver_node"."power_state_updated", "maasserver_node"."last_image_sync", "maasserver_node"."error", "maasserver_node"."netboot", "maasserver_node"."ephemeral_deploy", "maasserver_node"."license_key", "maasserver_node"."dynamic", "maasserver_node"."boot_interface_id", "maasserver_node"."boot_cluster_ip", "maasserver_node"."boot_disk_id", "maasserver_node"."gateway_link_ipv4_id", "maasserver_node"."gateway_link_ipv6_id", "maasserver_node"."default_user", "maasserver_node"."install_rackd", "maasserver_node"."install_kvm", "maasserver_node"."register_vmhost", "maasserver_node"."enable_ssh", "maasserver_node"."skip_networking", "maasserver_node"."skip_storage", "maasserver_node"."url", "maasserver_node"."dns_process_id", "maasserver_node"."managing_process_id", "maasserver_node"."current_commissioning_script_set_id", "maasserver_node"."current_installation_script_set_id", "maasserver_node"."current_testing_script_set_id", "maasserver_node"."locked", "maasserver_node"."last_applied_storage_layout", "maasserver_node"."current_config_id", "maasserver_node"."enable_hw_sync", "maasserver_node"."sync_interval", "maasserver_node"."last_sync", (SELECT CONCAT(U2."description", CONCAT(' - ', U0."description")) AS "message" FROM "maasserver_event" U0 INNER JOIN "maasserver_eventtype" U2 ON (U0."type_id" = U2."id") WHERE (U0."node_id" = "maasserver_node"."id" AND U2."level" >= 20) ORDER BY U0."created" DESC, U0."id" DESC LIMIT 1) AS "status_message_text", COUNT(DISTINCT "maasserver_physicalblockdevice"."blockdevice_ptr_id") AS "physical_disk_count", (SELECT SUM(U4."size") AS "storage" FROM "maasserver_node" U0 LEFT OUTER JOIN "maasserver_nodeconfig" U1 ON (U0."current_config_id" = U1."id") LEFT OUTER JOIN "maasserver_blockdevice" U2 ON (U1."id" = U2."node_config_id") LEFT OUTER JOIN "maasserver_physicalblockdevice" U3 ON (U2."id" = U3."blockdevice_ptr_id") LEFT OUTER JOIN "maasserver_blockdevice" U4 ON (U3."blockdevice_ptr_id" = U4."id") WHERE (U0."node_type" = 0 AND U0."id" = "maasserver_node"."id") GROUP BY U0."id") AS "total_storage", "maasserver_interface"."mac_address" AS "pxe_mac", "maasserver_fabric"."name" AS "fabric_name", CONCAT("maasserver_node"."hostname", CONCAT('.', "maasserver_domain"."name")) AS "node_fqdn", CASE WHEN "maasserver_node"."status" IN (10) THEN 'Allocated' WHEN "maasserver_node"."status" IN (8) THEN 'Broken' WHEN "maasserver_node"."status" IN (1) THEN 'Commissioning' WHEN "maasserver_node"."status" IN (6) THEN 'Deployed' WHEN "maasserver_node"."status" IN (9) THEN 'Deploying' WHEN "maasserver_node"."status" IN (2, 11, 15, 18, 20, 13, 22) THEN 'Failed' WHEN "maasserver_node"."status" IN (0) THEN 'New' WHEN "maasserver_node"."status" IN (4) THEN 'Ready' WHEN "maasserver_node"."status" IN (14, 12) THEN 'Releasing' WHEN "maasserver_node"."status" IN (17, 19, 16) THEN 'Rescue Mode' WHEN "maasserver_node"."status" IN (21) THEN 'Testing' ELSE 'Other' END AS "simple_status", "maasserver_domain"."id", "maasserver_domain"."created", "maasserver_domain"."updated", "maasserver_domain"."name", "maasserver_domain"."authoritative", "maasserver_domain"."ttl", "auth_user"."id", "auth_user"."password", "auth_user"."last_login", "auth_user"."is_superuser", "auth_user"."username", "auth_user"."first_name", "auth_user"."last_name", "auth_user"."email", "auth_user"."is_staff", "auth_user"."is_active", "auth_user"."date_joined", "maasserver_zone"."id", "maasserver_zone"."created", "maasserver_zone"."updated", "maasserver_zone"."name", "maasserver_zone"."description", "maasserver_bmc"."id", "maasserver_bmc"."created", "maasserver_bmc"."updated", "maasserver_bmc"."bmc_type", "maasserver_bmc"."ip_address_id", "maasserver_bmc"."power_type", "maasserver_bmc"."power_parameters", "maasserver_bmc"."name", "maasserver_bmc"."version", "maasserver_bmc"."architectures", "maasserver_bmc"."capabilities", "maasserver_bmc"."cores", "maasserver_bmc"."cpu_speed", "maasserver_bmc"."memory", "maasserver_bmc"."local_storage", "maasserver_bmc"."pool_id", "maasserver_bmc"."zone_id", "maasserver_bmc"."tags", "maasserver_bmc"."cpu_over_commit_ratio", "maasserver_bmc"."memory_over_commit_ratio", "maasserver_bmc"."default_storage_pool_id", "maasserver_bmc"."default_macvlan_mode", "maasserver_bmc"."created_with_trust_password", "maasserver_bmc"."created_with_maas_generated_cert", "maasserver_bmc"."created_with_cert_expiration_days", "maasserver_bmc"."created_by_commissioning", "maasserver_nodeconfig"."id", "maasserver_nodeconfig"."created", "maasserver_nodeconfig"."updated", "maasserver_nodeconfig"."name", "maasserver_nodeconfig"."node_id" FROM "maasserver_node" LEFT OUTER JOIN "maasserver_nodeconfig" ON ("maasserver_node"."current_config_id" = "maasserver_nodeconfig"."id") LEFT OUTER JOIN "maasserver_blockdevice" ON ("maasserver_nodeconfig"."id" = "maasserver_blockdevice"."node_config_id") LEFT OUTER JOIN "maasserver_physicalblockdevice" ON ("maasserver_blockdevice"."id" = "maasserver_physicalblockdevice"."blockdevice_ptr_id") LEFT OUTER JOIN "maasserver_interface" ON ("maasserver_node"."boot_interface_id" = "maasserver_interface"."id") LEFT OUTER JOIN "maasserver_vlan" ON ("maasserver_interface"."vlan_id" = "maasserver_vlan"."id") LEFT OUTER JOIN "maasserver_fabric" ON ("maasserver_vlan"."fabric_id" = "maasserver_fabric"."id") LEFT OUTER JOIN "maasserver_domain" ON ("maasserver_node"."domain_id" = "maasserver_domain"."id") LEFT OUTER JOIN "auth_user" ON ("maasserver_node"."owner_id" = "auth_user"."id") LEFT OUTER JOIN "maasserver_bmc" ON ("maasserver_node"."bmc_id" = "maasserver_bmc"."id") LEFT OUTER JOIN "maasserver_interface" T11 ON ("maasserver_nodeconfig"."id" = T11."node_config_id") LEFT OUTER JOIN "maasserver_vlan" T12 ON (T11."vlan_id" = T12."id") LEFT OUTER JOIN "maasserver_space" ON (T12."space_id" = "maasserver_space"."id") LEFT OUTER JOIN "maasserver_resourcepool" ON ("maasserver_node"."pool_id" = "maasserver_resourcepool"."id") LEFT OUTER JOIN "maasserver_node_tags" ON ("maasserver_node"."id" = "maasserver_node_tags"."node_id") LEFT OUTER JOIN "maasserver_tag" ON ("maasserver_node_tags"."tag_id" = "maasserver_tag"."id") LEFT OUTER JOIN "maasserver_ownerdata" ON ("maasserver_node"."id" = "maasserver_ownerdata"."node_id") INNER JOIN "maasserver_zone" ON ("maasserver_node"."zone_id" = "maasserver_zone"."id") WHERE ("maasserver_node"."node_type" = 0 AND "maasserver_node"."node_type" = 0 AND NOT ("maasserver_node"."node_type" IN (2, 3, 4)) AND ("maasserver_node"."owner_id" IS NULL OR "maasserver_node"."owner_id" = 112) AND (UPPER("maasserver_node"."distro_series"::text) LIKE UPPER('%81nn7m2%') OR UPPER("maasserver_fabric"."name"::text) LIKE UPPER('%81nn7m2%') OR UPPER(CONCAT("maasserver_node"."hostname", CONCAT('.', "maasserver_domain"."name"))::text) LIKE UPPER('%81nn7m2%') OR UPPER("maasserver_node"."osystem"::text) LIKE UPPER('%81nn7m2%') OR UPPER("auth_user"."username"::text) LIKE UPPER('%81nn7m2%') OR UPPER("maasserver_bmc"."power_type"::text) LIKE UPPER('%81nn7m2%') OR UPPER("maasserver_bmc"."name"::text) LIKE UPPER('%81nn7m2%') OR UPPER("maasserver_space"."name"::text) LIKE UPPER('%81nn7m2%') OR UPPER("maasserver_resourcepool"."name"::text) LIKE UPPER('%81nn7m2%') OR UPPER("maasserver_interface"."mac_address"::text) LIKE UPPER('%81nn7m2%') OR UPPER("maasserver_tag"."name"::text) LIKE UPPER('%81nn7m2%') OR UPPER("maasserver_ownerdata"."value"::text) LIKE UPPER('%81nn7m2%') OR UPPER("maasserver_zone"."name"::text) LIKE UPPER('%81nn7m2%'))) GROUP BY "maasserver_node"."id", "maasserver_interface"."mac_address", "maasserver_fabric"."name", CONCAT("maasserver_node"."hostname", CONCAT('.', "maasserver_domain"."name")), CASE WHEN "maasserver_node"."status" IN (10) THEN 'Allocated' WHEN "maasserver_node"."status" IN (8) THEN 'Broken' WHEN "maasserver_node"."status" IN (1) THEN 'Commissioning' WHEN "maasserver_node"."status" IN (6) THEN 'Deployed' WHEN "maasserver_node"."status" IN (9) THEN 'Deploying' WHEN "maasserver_node"."status" IN (2, 11, 15, 18, 20, 13, 22) THEN 'Failed' WHEN "maasserver_node"."status" IN (0) THEN 'New' WHEN "maasserver_node"."status" IN (4) THEN 'Ready' WHEN "maasserver_node"."status" IN (14, 12) THEN 'Releasing' WHEN "maasserver_node"."status" IN (17, 19, 16) THEN 'Rescue Mode' WHEN "maasserver_node"."status" IN (21) THEN 'Testing' ELSE 'Other' END, "maasserver_domain"."id", "auth_user"."id", "maasserver_zone"."id", "maasserver_bmc"."id", "maasserver_nodeconfig"."id" ORDER BY "maasserver_node"."status" ASC, "maasserver_node"."hostname" DESC, "maasserver_node"."id" ASC LIMIT 1; args=(' - ', 20, 0, '.', 10, 'Allocated', 8, 'Broken', 1, 'Commissioning', 6, 'Deployed', 9, 'Deploying', 2, 11, 15, 18, 20, 13, 22, 'Failed', 0, 'New', 4, 'Ready', 14, 12, 'Releasing', 17, 19, 16, 'Rescue Mode', 21, 'Testing', 'Other', 0, 0, 2, 3, 4, 112, '%81nn7m2%', '%81nn7m2%', '.', '%81nn7m2%', '%81nn7m2%', '%81nn7m2%', '%81nn7m2%', '%81nn7m2%', '%81nn7m2%', '%81nn7m2%', '%81nn7m2%', '%81nn7m2%', '%81nn7m2%', '%81nn7m2%', '.', 10, 'Allocated', 8, 'Broken', 1, 'Commissioning', 6, 'Deployed', 9, 'Deploying', 2, 11, 15, 18, 20, 13, 22, 'Failed', 0, 'New', 4, 'Ready', 14, 12, 'Releasing', 17, 19, 16, 'Rescue Mode', 21, 'Testing', 'Other')

SELECT COUNT(*) FROM (SELECT DISTINCT "maasserver_node"."id" AS Col1, "maasserver_node"."created" AS Col2, "maasserver_node"."updated" AS Col3, "maasserver_node"."system_id" AS Col4, "maasserver_node"."hardware_uuid" AS Col5, "maasserver_node"."hostname" AS Col6, "maasserver_node"."description" AS Col7, "maasserver_node"."pool_id" AS Col8, "maasserver_node"."domain_id" AS Col9, "maasserver_node"."address_ttl" AS Col10, "maasserver_node"."status" AS Col11, "maasserver_node"."previous_status" AS Col12, "maasserver_node"."status_expires" AS Col13, "maasserver_node"."owner_id" AS Col14, "maasserver_node"."bios_boot_method" AS Col15, "maasserver_node"."osystem" AS Col16, "maasserver_node"."distro_series" AS Col17, "maasserver_node"."architecture" AS Col18, "maasserver_node"."min_hwe_kernel" AS Col19, "maasserver_node"."hwe_kernel" AS Col20, "maasserver_node"."node_type" AS Col21, "maasserver_node"."parent_id" AS Col22, "maasserver_node"."agent_name" AS Col23, "maasserver_node"."error_description" AS Col24, "maasserver_node"."zone_id" AS Col25, "maasserver_node"."cpu_count" AS Col26, "maasserver_node"."cpu_speed" AS Col27, "maasserver_node"."memory" AS Col28, "maasserver_node"."swap_size" AS Col29, "maasserver_node"."bmc_id" AS Col30, "maasserver_node"."instance_power_parameters" AS Col31, "maasserver_node"."power_state" AS Col32, "maasserver_node"."power_state_queried" AS Col33, "maasserver_node"."power_state_updated" AS Col34, "maasserver_node"."last_image_sync" AS Col35, "maasserver_node"."error" AS Col36, "maasserver_node"."netboot" AS Col37, "maasserver_node"."ephemeral_deploy" AS Col38, "maasserver_node"."license_key" AS Col39, "maasserver_node"."dynamic" AS Col40, "maasserver_node"."boot_interface_id" AS Col41, "maasserver_node"."boot_cluster_ip" AS Col42, "maasserver_node"."boot_disk_id" AS Col43, "maasserver_node"."gateway_link_ipv4_id" AS Col44, "maasserver_node"."gateway_link_ipv6_id" AS Col45, "maasserver_node"."default_user" AS Col46, "maasserver_node"."install_rackd" AS Col47, "maasserver_node"."install_kvm" AS Col48, "maasserver_node"."register_vmhost" AS Col49, "maasserver_node"."enable_ssh" AS Col50, "maasserver_node"."skip_networking" AS Col51, "maasserver_node"."skip_storage" AS Col52, "maasserver_node"."url" AS Col53, "maasserver_node"."dns_process_id" AS Col54, "maasserver_node"."managing_process_id" AS Col55, "maasserver_node"."current_commissioning_script_set_id" AS Col56, "maasserver_node"."current_installation_script_set_id" AS Col57, "maasserver_node"."current_testing_script_set_id" AS Col58, "maasserver_node"."locked" AS Col59, "maasserver_node"."last_applied_storage_layout" AS Col60, "maasserver_node"."current_config_id" AS Col61, "maasserver_node"."enable_hw_sync" AS Col62, "maasserver_node"."sync_interval" AS Col63, "maasserver_node"."last_sync" AS Col64, (SELECT CONCAT(U2."description", CONCAT(' - ', U0."description")) AS "message" FROM "maasserver_event" U0 INNER JOIN "maasserver_eventtype" U2 ON (U0."type_id" = U2."id") WHERE (U0."node_id" = "maasserver_node"."id" AND U2."level" >= 20) ORDER BY U0."created" DESC, U0."id" DESC LIMIT 1) AS "status_message_text", COUNT(DISTINCT "maasserver_physicalblockdevice"."blockdevice_ptr_id") AS "physical_disk_count", (SELECT SUM(U4."size") AS "storage" FROM "maasserver_node" U0 LEFT OUTER JOIN "maasserver_nodeconfig" U1 ON (U0."current_config_id" = U1."id") LEFT OUTER JOIN "maasserver_blockdevice" U2 ON (U1."id" = U2."node_config_id") LEFT OUTER JOIN "maasserver_physicalblockdevice" U3 ON (U2."id" = U3."blockdevice_ptr_id") LEFT OUTER JOIN "maasserver_blockdevice" U4 ON (U3."blockdevice_ptr_id" = U4."id") WHERE (U0."node_type" = 0 AND U0."id" = "maasserver_node"."id") GROUP BY U0."id") AS "total_storage", "maasserver_interface"."mac_address" AS "pxe_mac", "maasserver_fabric"."name" AS "fabric_name", CONCAT("maasserver_node"."hostname", CONCAT('.', "maasserver_domain"."name")) AS "node_fqdn", CASE WHEN "maasserver_node"."status" IN (10) THEN 'Allocated' WHEN "maasserver_node"."status" IN (8) THEN 'Broken' WHEN "maasserver_node"."status" IN (1) THEN 'Commissioning' WHEN "maasserver_node"."status" IN (6) THEN 'Deployed' WHEN "maasserver_node"."status" IN (9) THEN 'Deploying' WHEN "maasserver_node"."status" IN (2, 11, 15, 18, 20, 13, 22) THEN 'Failed' WHEN "maasserver_node"."status" IN (0) THEN 'New' WHEN "maasserver_node"."status" IN (4) THEN 'Ready' WHEN "maasserver_node"."status" IN (14, 12) THEN 'Releasing' WHEN "maasserver_node"."status" IN (17, 19, 16) THEN 'Rescue Mode' WHEN "maasserver_node"."status" IN (21) THEN 'Testing' ELSE 'Other' END AS "simple_status" FROM "maasserver_node" LEFT OUTER JOIN "maasserver_nodeconfig" ON ("maasserver_node"."current_config_id" = "maasserver_nodeconfig"."id") LEFT OUTER JOIN "maasserver_blockdevice" ON ("maasserver_nodeconfig"."id" = "maasserver_blockdevice"."node_config_id") LEFT OUTER JOIN "maasserver_physicalblockdevice" ON ("maasserver_blockdevice"."id" = "maasserver_physicalblockdevice"."blockdevice_ptr_id") LEFT OUTER JOIN "maasserver_interface" ON ("maasserver_node"."boot_interface_id" = "maasserver_interface"."id") LEFT OUTER JOIN "maasserver_vlan" ON ("maasserver_interface"."vlan_id" = "maasserver_vlan"."id") LEFT OUTER JOIN "maasserver_fabric" ON ("maasserver_vlan"."fabric_id" = "maasserver_fabric"."id") LEFT OUTER JOIN "maasserver_domain" ON ("maasserver_node"."domain_id" = "maasserver_domain"."id") LEFT OUTER JOIN "auth_user" ON ("maasserver_node"."owner_id" = "auth_user"."id") LEFT OUTER JOIN "maasserver_bmc" ON ("maasserver_node"."bmc_id" = "maasserver_bmc"."id") LEFT OUTER JOIN "maasserver_interface" T11 ON ("maasserver_nodeconfig"."id" = T11."node_config_id") LEFT OUTER JOIN "maasserver_vlan" T12 ON (T11."vlan_id" = T12."id") LEFT OUTER JOIN "maasserver_space" ON (T12."space_id" = "maasserver_space"."id") LEFT OUTER JOIN "maasserver_resourcepool" ON ("maasserver_node"."pool_id" = "maasserver_resourcepool"."id") LEFT OUTER JOIN "maasserver_node_tags" ON ("maasserver_node"."id" = "maasserver_node_tags"."node_id") LEFT OUTER JOIN "maasserver_tag" ON ("maasserver_node_tags"."tag_id" = "maasserver_tag"."id") LEFT OUTER JOIN "maasserver_ownerdata" ON ("maasserver_node"."id" = "maasserver_ownerdata"."node_id") INNER JOIN "maasserver_zone" ON ("maasserver_node"."zone_id" = "maasserver_zone"."id") WHERE ("maasserver_node"."node_type" = 0 AND "maasserver_node"."node_type" = 0 AND NOT ("maasserver_node"."node_type" IN (2, 3, 4)) AND ("maasserver_node"."owner_id" IS NULL OR "maasserver_node"."owner_id" = 112) AND (UPPER("maasserver_node"."distro_series"::text) LIKE UPPER('%81nn7m2%') OR UPPER("maasserver_fabric"."name"::text) LIKE UPPER('%81nn7m2%') OR UPPER(CONCAT("maasserver_node"."hostname", CONCAT('.', "maasserver_domain"."name"))::text) LIKE UPPER('%81nn7m2%') OR UPPER("maasserver_node"."osystem"::text) LIKE UPPER('%81nn7m2%') OR UPPER("auth_user"."username"::text) LIKE UPPER('%81nn7m2%') OR UPPER("maasserver_bmc"."power_type"::text) LIKE UPPER('%81nn7m2%') OR UPPER("maasserver_bmc"."name"::text) LIKE UPPER('%81nn7m2%') OR UPPER("maasserver_space"."name"::text) LIKE UPPER('%81nn7m2%') OR UPPER("maasserver_resourcepool"."name"::text) LIKE UPPER('%81nn7m2%') OR UPPER("maasserver_interface"."mac_address"::text) LIKE UPPER('%81nn7m2%') OR UPPER("maasserver_tag"."name"::text) LIKE UPPER('%81nn7m2%') OR UPPER("maasserver_ownerdata"."value"::text) LIKE UPPER('%81nn7m2%') OR UPPER("maasserver_zone"."name"::text) LIKE UPPER('%81nn7m2%'))) GROUP BY "maasserver_node"."id", "maasserver_interface"."mac_address", "maasserver_fabric"."name", CONCAT("maasserver_node"."hostname", CONCAT('.', "maasserver_domain"."name")), CASE WHEN "maasserver_node"."status" IN (10) THEN 'Allocated' WHEN "maasserver_node"."status" IN (8) THEN 'Broken' WHEN "maasserver_node"."status" IN (1) THEN 'Commissioning' WHEN "maasserver_node"."status" IN (6) THEN 'Deployed' WHEN "maasserver_node"."status" IN (9) THEN 'Deploying' WHEN "maasserver_node"."status" IN (2, 11, 15, 18, 20, 13, 22) THEN 'Failed' WHEN "maasserver_node"."status" IN (0) THEN 'New' WHEN "maasserver_node"."status" IN (4) THEN 'Ready' WHEN "maasserver_node"."status" IN (14, 12) THEN 'Releasing' WHEN "maasserver_node"."status" IN (17, 19, 16) THEN 'Rescue Mode' WHEN "maasserver_node"."status" IN (21) THEN 'Testing' ELSE 'Other' END) subquery; args=(' - ', 20, 0, '.', 10, 'Allocated', 8, 'Broken', 1, 'Commissioning', 6, 'Deployed', 9, 'Deploying', 2, 11, 15, 18, 20, 13, 22, 'Failed', 0, 'New', 4, 'Ready', 14, 12, 'Releasing', 17, 19, 16, 'Rescue Mode', 21, 'Testing', 'Other', 0, 0, 2, 3, 4, 112, '%81nn7m2%', '%81nn7m2%', '.', '%81nn7m2%', '%81nn7m2%', '%81nn7m2%', '%81nn7m2%', '%81nn7m2%', '%81nn7m2%', '%81nn7m2%', '%81nn7m2%', '%81nn7m2%', '%81nn7m2%', '%81nn7m2%', '.', 10, 'Allocated', 8, 'Broken', 1, 'Commissioning', 6, 'Deployed', 9, 'Deploying', 2, 11, 15, 18, 20, 13, 22, 'Failed', 0, 'New', 4, 'Ready', 14, 12, 'Releasing', 17, 19, 16, 'Rescue Mode', 21, 'Testing', 'Other')

Hi @pjonason there are no changes in 3.4.1 respect to 3.4.0 related to machine listing and UI-related queries.

When you upgraded to 3.4.1 it’s expected to have high cpu usage for a small period of time because some cleanups are run. But after that, you should expect MAAS to be much faster according to our investigation !Action required: this cleanup might speedup your MAAS (3.4.0 and below) environments

Do you experience peaks of CPU usage periodically? Do you have monitoring enabled so that you can provide some numbers and graphs?


Here’s an overview graph of CPU. The first short peak around midday on 3/11 is the 3.4 upgrade. Prior to that you can see nothing out of the ordinary. Since then, multiple spikes where the queries above seem…angry. There’s also a little reprieve over the weekend. We increased the number of CPUs from 4 to 6 at the beginning of this week which reduced the intensity of the spikes, but MAAS is still unusable during those times.

We did look at the link you posted, and our current count is 0 on that query.

Yeah that’s because you upgraded to 3.4.1 which had run the cleanup described in that post automatically.

Do you see such spikes only when you interact with the UI? Is there anything in the logs? Exceptions? How many machines do you have in this environment and how many concurrent deployments do you usually start?

OK, that makes sense.

I believe it’s fair to say it only spikes with the UI active. Seems like those queries relate to someone (slowly) typing a hostname (in our case) into the search bar in the UI. Nothing in the logs jumps out, even with debug on. No exceptions seen.

We have 7500+ machines in this environment. Concurrent deployments would be fairly low, between 0 and 5 at any given time, I’d say. I’d have to look up a number.

with 7500 machines it’s reasonable to see these spikes for the current implementation. In 3.5 (not released yet) we have improved the machine listing with the UI by 50% of the time so you should see an improvement there when you’ll upgrade to that version.

Also, I’d expect high cpu usage if you call the /machines endpoints with the rest api. We are reworking these endpoints as well but the changes will be included in the next releases (3.6 onwards)

Thanks for interacting with us. I can accept CPU spikes here and there, but when they take out my API and UI, I either have to find a solution or revert back to 3.3.

We do see these queries in 3.3, they just don’t stall out like they do in 3.4. So they could be victims of another issue rather than the cause. We will continue to investigate. If you have any other ideas where we might search for clues, we’d be appreciative.

Thanks for the sneak peek at future version improvements.

Actually, there are a couple of other things to check. For example, can you share the size of the following table

SELECT pg_size_pretty( pg_total_relation_size('maasserver_event') );

and the number of neighbours

select count(*) from maasserver_neighbour;

?

We did clear out a sizable number of older events to see what effect it would have.

maasserver_event was ~3600 MB and 13 million records, we reduced down to 1402 MB and 2,472,047 records

SELECT pg_size_pretty( pg_total_relation_size('maasserver_event') ); Size: 1402 MB

and

select count(*) from maasserver_neighbour; Count: 1043

Have you just cleared the events table or did you do that in the past with no improvements?

Also, I would clear the neighbours as in some cases (I can’t say if you are in this case or not without looking into your env) they cause CPU spikes when you start the deployment of nodes

We cleared the events early in the week, and though it may have helped, it did not directly impact the hanging up of the API/UI.

Neighbours table cleared. We will see if it has any effect. We also turned network discovery off.

Ok. Did you find any call to the machines endpoint in the logs? Also, it would be interesting to correlate the calls you find in the access.log and the spikes.

Hi there!

I’m using 3.4.0 and my env has 3700+ machines (3 regions and 20 racks) and sometimes our API/UI hangs as well and nothing works. API starts to timeout and we have to restart everything and basically pray to return in a better state :confused:

It would be great to see some documentation about how to properly scale MaaS because sometimes we don’t know if its our setup or the application that can’t handle too much work/machines. I tend to believe that its the second because every time that I see someone else with a big env we see performance problems related to API/UI.

Some performance problems (like the API one) are well known and are currently being addressed by the team.

It would be great to see some documentation about how to properly scale MaaS because sometimes we don’t know if its our setup or the application that can’t handle too much work/machines.

I’d love that could be possible! Unfortunately, it’s really hard to tell people how they should operate maas since the environments can be really different to each other depending on the needs. Also, MAAS users have different background and knowledge and our docs can’t be the replacement of an IT expert that would design the proper solution for a given problem/environment.

This is why Canonical can provide enterprise support for such complex scenarios.

What we observe is that this behavior generally manifests in the morning when people are logging in to their laptops. We suspect most people keep browser tabs open to MAAS, which seems to “ping” MAAS every 11 seconds until the page is refreshed. At scale, this takes down the corresponding regiond thread. Incidentally, killing off the pegged thread does not improve MAAS behavior, and the managing thread is apparently unaware of the removed thread - no new thread is spawned to replace it. After refreshing the tab, the ping every 11 seconds goes away.

snippet from the access.log:

10.11.12.13 - - [25/Mar/2024:10:07:56 -0500] "GET /MAAS/accounts/login/ HTTP/1.1" 200 70 "https://maas-prod.target.com:8443/MAAS/r/machine/tx7tx6/summary" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36"
10.11.12.13 - - [25/Mar/2024:10:08:07 -0500] "GET /MAAS/accounts/login/ HTTP/1.1" 200 70 "https://maas-prod.target.com:8443/MAAS/r/machine/tx7tx6/summary" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36"
10.11.12.13 - - [25/Mar/2024:10:08:18 -0500] "GET /MAAS/accounts/login/ HTTP/1.1" 200 70 "https://maas-prod.target.com:8443/MAAS/r/machine/tx7tx6/summary" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36"
10.11.12.13 - - [25/Mar/2024:10:08:29 -0500] "GET /MAAS/accounts/login/ HTTP/1.1" 200 70 "https://maas-prod.target.com:8443/MAAS/r/machine/tx7tx6/summary" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36"
10.11.12.13 - - [25/Mar/2024:10:08:39 -0500] "GET /MAAS/accounts/login/ HTTP/1.1" 200 70 "https://maas-prod.target.com:8443/MAAS/r/machine/tx7tx6/summary" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36"
10.11.12.13 - - [25/Mar/2024:10:08:50 -0500] "GET /MAAS/accounts/login/ HTTP/1.1" 200 70 "https://maas-prod.target.com:8443/MAAS/r/machine/tx7tx6/summary" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36"
10.11.12.13 - - [25/Mar/2024:10:09:01 -0500] "GET /MAAS/accounts/login/ HTTP/1.1" 200 70 "https://maas-prod.target.com:8443/MAAS/r/machine/tx7tx6/summary" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36"
10.11.12.13 - - [25/Mar/2024:10:09:05 -0500] "GET /MAAS/r/machine/tx7tx6/summary HTTP/1.1" 304 0 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36"
10.11.12.13 - - [25/Mar/2024:10:09:05 -0500] "GET /MAAS/r/static/js/main.cc28e418.js HTTP/1.1" 304 0 "https://maas-prod.target.com:8443/MAAS/r/machine/tx7tx6/summary" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36"
10.11.12.13 - - [25/Mar/2024:10:09:05 -0500] "GET /MAAS/r/static/css/main.1913b5b2.css HTTP/1.1" 304 0 "https://maas-prod.target.com:8443/MAAS/r/machine/tx7tx6/summary" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36"
10.11.12.13 - - [25/Mar/2024:10:09:06 -0500] "GET /MAAS/accounts/login/ HTTP/1.1" 200 70 "https://maas-prod.target.com:8443/MAAS/r/machine/tx7tx6/summary" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36"
10.11.12.13 - - [25/Mar/2024:10:09:06 -0500] "GET /MAAS/r/maas-favicon-32px.png HTTP/1.1" 304 0 "https://maas-prod.target.com:8443/MAAS/r/machine/tx7tx6/summary" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36"
10.11.12.13 - - [25/Mar/2024:10:10:33 -0500] "POST /MAAS/accounts/login/ HTTP/1.1" 204 0 "https://maas-prod.target.com:8443/MAAS/r/machine/tx7tx6/summary" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36"
10.11.12.13 - - [25/Mar/2024:10:10:33 -0500] "GET /MAAS/accounts/login/ HTTP/1.1" 200 69 "https://maas-prod.target.com:8443/MAAS/r/machine/tx7tx6/summary" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36"

We are backing out to 3.3 to avoid further customer impact.

Can you clarify wdym by regiond thread? From a technical perspective, every region consists of n python processes, and every python process has some numbers of threads.