Hi Dan,
Which model of GPU(s) are you using? Are you able to share information about your setup/test bed?
Unfortunately, there is currently no way to “carve up” or share a single GPU across multiple KVM instances, at least not with NVIDIA cards on Ubuntu. The functionality you are describing is known as vGPU support, and we hope to support it in Ubuntu at some point in the future.
However, certain models of NVIDIA card, such as the A100, allow a GPU to be “split” into several “segments”, similar to vGPU, without the need for additional software such as a licence server. These are presented to the host as several individual GPUs, which makes resource sharing easier.
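As a rough sketch of what this looks like in practice (I have not run this myself; it assumes an A100 with a recent NVIDIA driver, and the profile ID used is only illustrative), MIG mode is managed through nvidia-smi:

```shell
# Sketch only: enabling MIG on an A100 and carving it into instances.
# A GPU reset may be required after enabling MIG mode.
sudo nvidia-smi -i 0 -mig 1          # enable MIG mode on GPU 0
sudo nvidia-smi mig -lgip            # list the available GPU instance profiles
sudo nvidia-smi mig -cgi 19,19 -C    # create two instances (profile ID 19 is illustrative)
nvidia-smi -L                        # the MIG devices now appear with their own UUIDs
```

Each created instance then shows up with its own UUID and can be assigned to a workload independently.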
Here is a direct quote from NVIDIA on the subject:
MIG Capability of NVIDIA Ampere GPU Architecture
The new MIG feature can partition each A100 into as many as seven GPU Instances for optimal utilization, effectively expanding access to every user and application.

The A100 GPU new MIG capability can divide a single GPU into multiple GPU partitions called GPU Instances. Each instance’s SMs have separate and isolated paths through the entire memory system — the on-chip crossbar ports, L2 cache banks, memory controllers and DRAM address busses are all assigned uniquely to an individual instance. This ensures that an individual user’s workload can run with predictable throughput and latency, with the same L2 cache allocation and DRAM bandwidth even if other tasks are thrashing their own caches or saturating their DRAM interface.

Using this capability, MIG can partition available GPU compute resources to provide a defined quality of service (QoS) with fault isolation for different clients (such as VMs, containers, processes, and so on). It enables multiple GPU Instances to run in parallel on a single, physical A100 GPU. MIG also keeps the CUDA programming model unchanged to minimize programming effort.

CSPs can use MIG to raise utilization rates on their GPU servers, delivering up to 7x more GPU Instances at no additional cost. MIG supports the necessary QoS and isolation guarantees needed by CSPs to ensure that one client (VM, container, process) cannot impact the work or scheduling from another client.

CSPs often partition their hardware based on customer usage patterns. Effective partitioning only works if hardware resources are providing consistent bandwidth, proper isolation, and good performance during runtime.

With NVIDIA Ampere architecture-based GPU, users will be able to see and schedule jobs on their new virtual GPU Instances as if they were physical GPUs. MIG works with Linux operating systems and their hypervisors. Users can run containers with MIG using runtimes such as Docker Engine, with support for container orchestration using Kubernetes coming soon.
Source: https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/nvidia-ampere-architecture-whitepaper.pdf
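Building on the Docker Engine support mentioned in the quote: once MIG instances exist, a single instance can be passed to a container via the NVIDIA container toolkit. Again, a sketch I have not tested myself; the UUID below is a placeholder you would replace with a real one:

```shell
# Substitute a real MIG device UUID, obtained from `nvidia-smi -L`, for the placeholder.
docker run --rm --gpus '"device=MIG-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"' \
    nvidia/cuda:11.0-base nvidia-smi
```

Inside the container, `nvidia-smi` should then report only that one MIG instance rather than the whole GPU.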
I have not personally tested this, but it should work. If you’re using an AMD card, I would expect similar functionality to be available.