Slow to perform terraform state refresh with 46 servers

I am seeing very long times (~10 min) just to refresh Terraform state while running `terraform plan`, using the MAAS Terraform provider to manage HP servers with the Redfish power driver.

Test setup:

I have 46 servers. Each server resource in my module uses the following maas terraform resources:

  • 1x maas_machine
  • 2x maas_network_interface_physical
  • 1x maas_network_interface_bond
  • 1x maas_instance

Command for test: `time TF_LOG=trace terragrunt plan -parallelism=10`

Results:

| MaaS CPUs | MaaS RAM (GB) | Terraform parallelism | Time | Insights |
|---|---|---|---|---|
| 2 | 8 | 10 (default) | 9m29s | Original setup |
| 2 | 8 | 60 | 9m44s | Increasing parallelism did not improve performance |
| 2 | 8 | 1 | 16m48s | Reducing parallelism to 1 hurt performance, so parallelism does contribute to scalability |
| 2 | 8 | 2 | 8m30s | Going from 1 to 2 parallelism roughly halved the time |
| 2 | 8 | 3 | 8m21s | Performance plateaus at parallelism 2; parallelism is no longer the rate-limiting factor. This matches the number of CPU cores, and MaaS CPU utilization was high |
| 8 | 8 | 10 | 5m30s | More CPU cores on the MaaS VM improved performance; MaaS CPU utilization was low (typically <50% on all cores) |
| 8 | 16 | 10 | 5m53s | MaaS RAM is not the rate-limiting factor |
| 16 | 8 | 10 | 6m17s | MaaS CPU is no longer the rate-limiting factor |
| 16 | 8 | 20 | 6m10s | Parallelism was not the rate-limiting factor |
| 16 | 8 | 4 | 6m42s | Parallelism 4 is about as good as 20 |
| 16 | 8 | 2 | 8m12s | Reducing parallelism to 2 hurt performance; 4 seems like a good number |

Throughout the tests, disk io and network throughput were not high and likely not the rate limiting factors.

Performance breakdown:

# parallelism=1
  • start: 2s (0.20%)
  • maas_machine: 54s (5.36%)
  • maas_network_interface_physical: 10m5s or 605s (60.14%)
  • maas_network_interface_bond: 4m51s or 291s (28.93%)
  • maas_instance: 53s (5.27%)

The largest contributors are maas_network_interface_physical and maas_network_interface_bond, which together account for 89% of the total time.
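Double-checking that 89% figure with plain shell arithmetic on the numbers from the breakdown above:

```shell
# Per-resource refresh times from the parallelism=1 breakdown, in seconds.
START=2; MACHINE=54; PHYS=605; BOND=291; INSTANCE=53
TOTAL=$((START + MACHINE + PHYS + BOND + INSTANCE))   # 1005s, i.e. ~16m45s
SHARE=$(( (PHYS + BOND) * 100 / TOTAL ))              # integer percent
echo "interfaces take ${SHARE}% of ${TOTAL}s"
```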

Is there any way I can further reduce the time to refresh Terraform state, given that parallelism, CPU, and RAM no longer seem to be the limiting factors? Is there a cap on the number of temporal workers? Is there a way I can increase this cap?

One approach is to move the network and bond configuration out of MAAS Terraform and into the OS configuration, since that is the largest contributor to the time. It is not my preferred option, but it is something I am trying now. I'm still hoping the time can be improved by increasing parallelism and CPU, since those had a significant effect at lower values.

I managed to find the rate-limiting factor: the number of MaaS region workers.

According to How to manage regions, the default is 4 workers, which matches the results of my experiments above.

After increasing the workers to 16 on the MaaS server with 16 CPUs, a parallelism of 10 gave a time of 2m28s (down from the previous 6m17s).
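For anyone else landing here, this is roughly how I'd script the change. Assumptions to verify against the docs for your MAAS version: a deb-based install keeps region settings as `key: value` lines in `/etc/maas/regiond.conf`, and regiond honours a `num_workers` key (snap installs differ):

```shell
# Sketch: set the MAAS region worker count in regiond.conf.
# Paths and the num_workers key are assumptions; check your MAAS version.
set_regiond_workers() {
  conf="$1"
  workers="$2"   # values above 8 made MaaS throw errors in my testing
  if grep -q '^num_workers:' "$conf" 2>/dev/null; then
    sed -i "s/^num_workers:.*/num_workers: ${workers}/" "$conf"
  else
    printf 'num_workers: %s\n' "$workers" >> "$conf"
  fi
}

# Demo against a scratch file; point it at /etc/maas/regiond.conf for real,
# then restart the region controller (sudo systemctl restart maas-regiond).
demo=$(mktemp)
printf 'database_host: localhost\n' > "$demo"
set_regiond_workers "$demo" 8
grep '^num_workers' "$demo"
```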

Hi, can I check why MaaS can only work with a maximum of 8 workers (assuming 1 worker per core), as mentioned at the end of How to manage regions - #2? I have also had problems assigning more than 8 workers; MaaS starts throwing errors. This wastes an entire server's resources, and it is also counterproductive to have to set up VMs before setting up MaaS.

My main issue is still the problem statement in this topic: refreshing Terraform state is very slow for only tens of servers.

How many machines do you have in that env?

Now I have 46 servers, and the terraform state refresh alone takes 5-6 min. It is possible to reduce that time with additional MaaS workers and increased Terraform parallelism, but MaaS can only scale vertically up to 8 workers. I suppose horizontal scaling might work, but I haven't tried clustered MaaS yet.

I'm curious about why the worker count was capped at 8. I assume that MaaS is typically installed on bare-metal servers, so it is surprising that it cannot make use of an entire server's resources.

It's because each worker binds a port for RPC, so by design you can have at most 8.

You might try adding another region and putting HAProxy in front.

You might be hitting Many functions are calling the machine listing to retrieve a single machine, overloading the MAAS regions · Issue #239 · canonical/terraform-provider-maas · GitHub. I would be curious to know how long `maas admin machines read` takes in your env.

I am not keen on adding another region yet, as this would mean another server for MaaS, which is a large overhead, especially since the existing server still has idle cores.

Running `time maas admin machines read` results in:

`real 8.894s, user 3.761s, sys 0.333s`

Is this a bad time?

I need to correct what I said earlier: terraform refresh takes about 3 min (after changing MaaS workers to 8), instead of 5-6 min (with the default of 4 workers).

Also, can we configure additional RPC ports for additional workers to bind to? Was the 8-port limit arbitrarily hardcoded, or does it have other implications for the system?

It feels like this problem could be better solved in code somewhere (I'm not sure where) rather than by brute-force adding more workers. 8 cores should be more than sufficient for these tasks. I'm just not sure why getting machine state takes so much time.

And I believe you are hitting the bug I pasted above: many resources call the machine listing, and each call takes ~10 seconds. That is how you end up with 3 minutes.
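The arithmetic lines up, roughly. Assuming each of the ~5 resources per server triggers a full machine listing at ~9 s a call (the `maas admin machines read` timing above), spread across 8 region workers:

```shell
# Back-of-envelope estimate; all inputs are rough numbers from this thread.
SERVERS=46
RES_PER_SERVER=5        # machine + 2x physical NIC + bond + instance
SECS_PER_LISTING=9      # ~= the 8.9s 'maas admin machines read' above
WORKERS=8
CALLS=$((SERVERS * RES_PER_SERVER))
EST=$((CALLS * SECS_PER_LISTING / WORKERS))
echo "~${EST}s"         # same order of magnitude as the observed ~3 minutes
```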

You might try to patch the codebase of course.

But for the future: we are currently working on a new, performant API, and we are also reworking the whole architecture so that the MAAS components scale better. The only improvement you can try without patching the code is to add another region.

Ok, thanks for your prompt response.