Slow to perform terraform state refresh with 46 servers

Refreshing Terraform state during terraform plan takes a very long time (about 10 minutes) when I use the MAAS Terraform provider to manage HP servers with the Redfish power driver.

Test setup:

I have 46 servers. Each server resource in my module uses the following maas terraform resources:

  • 1x maas_machine
  • 2x maas_network_interface_physical
  • 1x maas_network_interface_bond
  • 1x maas_instance

Command for test: time TF_LOG=trace terragrunt plan -parallelism=10
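
To repeat the test command at several parallelism values, a small timing harness can help. This is only a sketch: the function names (build_plan_command, time_plan, sweep) are hypothetical, and it assumes terragrunt is on the PATH and is run from the module directory.

```python
import os
import subprocess
import time


def build_plan_command(parallelism):
    """Return the argv for one plan run at the given parallelism."""
    return ["terragrunt", "plan", f"-parallelism={parallelism}"]


def time_plan(parallelism, runner=subprocess.run):
    """Run one plan with TF_LOG=trace and return the wall-clock seconds.

    `runner` is injectable (an assumption of this sketch) so the timing
    logic can be exercised without terragrunt installed.
    """
    env = dict(os.environ, TF_LOG="trace")
    start = time.monotonic()
    runner(build_plan_command(parallelism), env=env, check=True)
    return time.monotonic() - start


def sweep(levels, runner=subprocess.run):
    """Time one plan per parallelism level and return {level: seconds}."""
    return {level: time_plan(level, runner) for level in levels}
```

Calling sweep([1, 2, 3, 10, 60]) would run one plan per listed parallelism value and return the elapsed seconds for each.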

Results:

| MAAS CPU (cores) | MAAS RAM | Terraform parallelism | Time | Insights |
|---|---|---|---|---|
| 2 | 8 | 10 (default) | 9m29s | Original setup. |
| 2 | 8 | 60 | 9m44s | Increasing parallelism did not improve performance. |
| 2 | 8 | 1 | 16m48s | Reducing parallelism reduced performance, so parallelism does contribute to scalability. |
| 2 | 8 | 2 | 8m30s | Going from 1 to 2 roughly halves the time, a near-linear increase in performance. |
| 2 | 8 | 3 | 8m21s | Performance plateaus at a parallelism of about 2, after which parallelism is no longer the rate-limiting factor. This looks similar to the number of CPU cores. MAAS CPU utilization was also high. |
| 8 | 8 | 10 | 5m30s | Increasing CPU cores on the MAAS VM improved performance. MAAS CPU utilization was low (typically <50% on all cores). |
| 8 | 16 | 10 | 5m53s | MAAS RAM is not the rate-limiting factor. |
| 16 | 8 | 10 | 6m17s | MAAS CPU is no longer the rate-limiting factor. |
| 16 | 8 | 20 | 6m10s | Parallelism is not the rate-limiting factor. |
| 16 | 8 | 4 | 6m42s | A parallelism of 4 is about as good as 20. |
| 16 | 8 | 2 | 8m12s | Reducing parallelism reduces performance. A parallelism of 4 seems like a good value. |

Throughout the tests, disk I/O and network throughput were not high, so they are likely not the rate-limiting factors.
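
The scaling in the 2-CPU results can be checked with a quick calculation. The sketch below takes the five 2-CPU timings from the results and computes the speedup relative to parallelism=1. The helper names (to_seconds, speedups) are illustrative only.

```python
def to_seconds(duration):
    """Convert a string such as '9m29s' into whole seconds."""
    minutes, seconds = duration.rstrip("s").split("m")
    return int(minutes) * 60 + int(seconds)


# Timings for the 2-CPU MAAS VM, keyed by Terraform parallelism.
RUNS_2CPU = {1: "16m48s", 2: "8m30s", 3: "8m21s", 10: "9m29s", 60: "9m44s"}


def speedups(runs, baseline_level=1):
    """Return {parallelism: speedup factor} relative to the baseline run."""
    baseline = to_seconds(runs[baseline_level])
    return {level: baseline / to_seconds(t) for level, t in runs.items()}


for level, factor in sorted(speedups(RUNS_2CPU).items()):
    print(f"parallelism={level:>2}: x{factor:.2f}")
```

The factor reaches roughly 2 at a parallelism of 2 and does not improve beyond that, which is the plateau described in the results.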

Performance breakdown:

# parallelism=1
start
- 2s (0.20%)
maas_machine
- 54s (5.36%)
maas_network_interface_physical
- 10m5s or 605s (60.14%)
maas_network_interface_bond
- 4m51s or 291s (28.93%)
maas_instance
- 53s (5.27%)

The largest contributors are maas_network_interface_physical and maas_network_interface_bond, which together account for 89% of the total time.
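
The 89% figure can be reproduced from the per-resource durations. In this sketch the shares are computed from the whole-second values listed in the breakdown, so they may differ from the quoted percentages in the second decimal place.

```python
# Per-stage refresh durations in seconds at parallelism=1, from the breakdown.
DURATIONS = {
    "start": 2,
    "maas_machine": 54,
    "maas_network_interface_physical": 605,
    "maas_network_interface_bond": 291,
    "maas_instance": 53,
}

total = sum(DURATIONS.values())

# Percentage of the total refresh time spent in each stage.
shares = {name: 100 * secs / total for name, secs in DURATIONS.items()}

# Combined share of the two network-related resource types.
network_share = (
    shares["maas_network_interface_physical"]
    + shares["maas_network_interface_bond"]
)

for name, pct in shares.items():
    print(f"{name}: {pct:.1f}%")
print(f"network resources combined: {network_share:.0f}%")
```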

Is there any way I can further reduce the state refresh time, given that parallelism, CPU and RAM no longer seem to be the limiting factors? Is there a cap on the number of Temporal workers? If so, is there a way to increase it?

One approach is to move the network and bond configuration out of the MAAS Terraform configuration and into the OS configuration, since it is the largest contributor to the time. This is not my preferred option, but I am trying it now. I am still hoping there is a way to improve the time by increasing parallelism and CPU, since both have a significant effect at lower values.

I found that the rate-limiting factor was the number of MAAS server workers.

According to "How to manage regions", the default is 4 workers, which matches the results of my experiment above.

After increasing the workers to 16 on the MAAS server with 16 CPUs, a parallelism of 10 gave a time of 2m28s (compared to the previous value of 6m17s).