Enable dynamic MIG feature

HAMi now supports dynamic MIG, using mig-parted to adjust MIG devices on demand. This includes:

Dynamic MIG Instance Management: Users no longer need to operate directly on GPU nodes or use commands like nvidia-smi -i 0 -mig 1 to manage MIG instances. HAMi-device-plugin will handle this automatically.
Dynamic MIG Adjustment: Each MIG device managed by HAMi will dynamically adjust its MIG template according to the jobs submitted, as needed.
Device MIG Observation: Each MIG instance generated by HAMi will be displayed in the scheduler monitor, along with job information, providing a clear overview of MIG nodes.
Compatibility with HAMi-Core Nodes: HAMi can manage a unified GPU pool across both HAMi-core nodes and MIG nodes. A job can be scheduled to either type of node unless one is selected explicitly with the nvidia.com/vgpu-mode annotation.
Unified API with HAMi-Core: No additional work is required to make jobs compatible with the dynamic MIG feature.

Prerequisites

NVIDIA Blackwell, Hopper™, and Ampere GPUs
HAMi > v2.5.0
nvidia-container-toolkit

Enable dynamic MIG support

Install the chart using Helm; see "Enabling vGPU support in Kubernetes".

In the device-plugin configMap, set the operating mode to mig for each MIG node:

kubectl describe cm hami-device-plugin -n kube-system

{
  "nodeconfig": [
    {
      "name": "MIG-NODE-A",
      "operatingmode": "mig",
      "filterdevices": {
        "uuid": [],
        "index": []
      }
    }
  ]
}
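The same change can be scripted when many nodes need switching. The following is a minimal Python sketch, assuming the JSON payload of the configMap has already been extracted to a string. The helper name `set_operating_mode` is hypothetical, and the starting value `hami-core` in the demo is an assumption; only the field names `nodeconfig`, `name`, `operatingmode`, and `filterdevices` come from the snippet above.

```python
import json


def set_operating_mode(config_text: str, node_name: str, mode: str = "mig") -> str:
    """Return config_text with node_name's operatingmode set to mode.

    Hypothetical helper: appends a new nodeconfig entry if the node is absent.
    """
    config = json.loads(config_text)
    nodes = config.setdefault("nodeconfig", [])
    for node in nodes:
        if node.get("name") == node_name:
            node["operatingmode"] = mode
            break
    else:
        # Node not listed yet: add it with empty device filters.
        nodes.append({
            "name": node_name,
            "operatingmode": mode,
            "filterdevices": {"uuid": [], "index": []},
        })
    return json.dumps(config, indent=2)


if __name__ == "__main__":
    # The previous mode shown here ("hami-core") is an assumed starting value.
    original = json.dumps({
        "nodeconfig": [{
            "name": "MIG-NODE-A",
            "operatingmode": "hami-core",
            "filterdevices": {"uuid": [], "index": []},
        }]
    })
    print(set_operating_mode(original, "MIG-NODE-A"))
```

The result still has to be written back to the configMap by whatever means you normally use to edit it.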

Restart the following pods for the change to take effect:
- hami-scheduler
- hami-device-plugin on 'MIG-NODE-A'

Custom MIG configuration (optional)

HAMi currently ships with a built-in MIG configuration.

You can customize the MIG configuration by following the steps below:

Edit `device-configmap.yaml` in charts/hami/templates/scheduler

nvidia:
  resourceCountName: {{ .Values.resourceName }}
  resourceMemoryName: {{ .Values.resourceMem }}
  resourceMemoryPercentageName: {{ .Values.resourceMemPercentage }}
  resourceCoreName: {{ .Values.resourceCores }}
  resourcePriorityName: {{ .Values.resourcePriority }}
  overwriteEnv: false
  defaultMemory: 0
  defaultCores: 0
  defaultGPUNum: 1
  memoryFactor: 1
  deviceSplitCount: {{ .Values.devicePlugin.deviceSplitCount }}
  deviceMemoryScaling: {{ .Values.devicePlugin.deviceMemoryScaling }}
  deviceCoreScaling: {{ .Values.devicePlugin.deviceCoreScaling }}
  knownMigGeometries:
    - models: ["A30"]
      allowedGeometries:
        - name: 1g.6gb
          memory: 6144
          count: 4
        - name: 2g.12gb
          memory: 12288
          count: 2
        - name: 4g.24gb
          memory: 24576
          count: 1

    - models: ["A100-SXM4-40GB", "A100-40GB-PCIe", "A100-PCIE-40GB"]
      allowedGeometries:
        - name: 1g.5gb
          memory: 5120
          count: 7
        - name: 2g.10gb
          memory: 10240
          count: 3
        - name: 1g.5gb
          memory: 5120
          count: 1
        - name: 3g.20gb
          memory: 20480
          count: 2
        - name: 7g.40gb
          memory: 40960
          count: 1

    - models: ["A100-SXM4-80GB", "A100-80GB-PCIe", "A100-PCIE-80GB"]
      allowedGeometries:
        - name: 1g.10gb
          memory: 10240
          count: 7
        - name: 2g.20gb
          memory: 20480
          count: 3
        - name: 1g.10gb
          memory: 10240
          count: 1
        - name: 3g.40gb
          memory: 40960
          count: 2
        - name: 7g.79gb
          memory: 80896
          count: 1

Helm installations and updates will follow the configuration specified in this file, overriding the default Helm settings.

Note: HAMi uses the first MIG template that matches the job, in the order defined in this configMap.
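To make the ordering rule concrete, here is a minimal Python sketch of first-match selection over the A100 40GB entries listed above. It is not HAMi's implementation: the matching rule (a template matches when a single instance has at least the requested memory) is an assumption, and the names `MigTemplate` and `first_matching_template` are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class MigTemplate:
    name: str    # MIG profile name, e.g. "2g.10gb"
    memory: int  # memory per instance, in the same unit as nvidia.com/gpumem
    count: int   # number of instances of this profile in the template


# Entries copied, in order, from the A100 40GB block of the configuration.
A100_40GB = [
    MigTemplate("1g.5gb", 5120, 7),
    MigTemplate("2g.10gb", 10240, 3),
    MigTemplate("1g.5gb", 5120, 1),
    MigTemplate("3g.20gb", 20480, 2),
    MigTemplate("7g.40gb", 40960, 1),
]


def first_matching_template(
    templates: Sequence[MigTemplate], gpumem: int
) -> Optional[MigTemplate]:
    """Return the first template, in list order, that satisfies the request.

    Assumed rule: one instance must offer at least `gpumem` memory.
    """
    for template in templates:
        if template.memory >= gpumem:
            return template
    return None


if __name__ == "__main__":
    # A request for 8000 skips 1g.5gb and lands on the next entry that fits.
    print(first_matching_template(A100_40GB, 8000))
```

Under this assumed rule, list order matters: placing a larger profile ahead of a smaller one would cause small requests to consume the larger profile first.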

Running MIG jobs

A MIG instance can now be requested by a container in the same way as hami-core, by specifying the nvidia.com/gpu and nvidia.com/gpumem resource types.

apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
  annotations:
    nvidia.com/vgpu-mode: "mig" # Optional. If not set, this pod can be assigned to either a MIG instance or a hami-core instance.
spec:
  containers:
    - name: ubuntu-container
      image: ubuntu:22.04
      command: ["bash", "-c", "sleep 86400"]
      resources:
        limits:
          nvidia.com/gpu: 2
          nvidia.com/gpumem: 8000

In the example above, the job allocates two MIG instances, each with at least 8 GB (the requested gpumem of 8000) of device memory.

Monitor MIG Instance

MIG instances managed by HAMi are displayed in the scheduler monitor (<scheduler node IP>:31993/metrics), as follows:

# HELP nodeGPUMigInstance GPU Sharing mode. 0 for hami-core, 1 for mig, 2 for mps
# TYPE nodeGPUMigInstance gauge
nodeGPUMigInstance{deviceidx="0",deviceuuid="GPU-936619fc-f6a1-74a8-0bc6-ecf6b3269313",migname="3g.20gb-0",nodeid="aio-node15",zone="vGPU"} 1
nodeGPUMigInstance{deviceidx="0",deviceuuid="GPU-936619fc-f6a1-74a8-0bc6-ecf6b3269313",migname="3g.20gb-1",nodeid="aio-node15",zone="vGPU"} 0
nodeGPUMigInstance{deviceidx="1",deviceuuid="GPU-30f90f49-43ab-0a78-bf5c-93ed41ef2da2",migname="3g.20gb-0",nodeid="aio-node15",zone="vGPU"} 1
nodeGPUMigInstance{deviceidx="1",deviceuuid="GPU-30f90f49-43ab-0a78-bf5c-93ed41ef2da2",migname="3g.20gb-1",nodeid="aio-node15",zone="vGPU"} 1
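Output in this form is easy to post-process. The sketch below, in Python using only the standard library, groups the `nodeGPUMigInstance` samples by device UUID. The function names are hypothetical, the sample text is copied from the output above, and the code deliberately does not interpret what the sample values mean.

```python
import re
from collections import defaultdict

# One sample line: metric name, {label="value",...}, then the sample value.
LINE_RE = re.compile(r'^(?P<name>\w+)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$')
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')

SAMPLE = """\
# HELP nodeGPUMigInstance GPU Sharing mode. 0 for hami-core, 1 for mig, 2 for mps
# TYPE nodeGPUMigInstance gauge
nodeGPUMigInstance{deviceidx="0",deviceuuid="GPU-936619fc-f6a1-74a8-0bc6-ecf6b3269313",migname="3g.20gb-0",nodeid="aio-node15",zone="vGPU"} 1
nodeGPUMigInstance{deviceidx="0",deviceuuid="GPU-936619fc-f6a1-74a8-0bc6-ecf6b3269313",migname="3g.20gb-1",nodeid="aio-node15",zone="vGPU"} 0
"""


def parse_mig_metrics(text):
    """Yield (labels, value) for each nodeGPUMigInstance sample in text."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = LINE_RE.match(line)
        if match and match.group("name") == "nodeGPUMigInstance":
            labels = dict(LABEL_RE.findall(match.group("labels")))
            yield labels, float(match.group("value"))


def instances_by_device(text):
    """Map device UUID -> {MIG instance name: sample value}."""
    grouped = defaultdict(dict)
    for labels, value in parse_mig_metrics(text):
        grouped[labels["deviceuuid"]][labels["migname"]] = value
    return dict(grouped)


if __name__ == "__main__":
    for uuid, instances in instances_by_device(SAMPLE).items():
        print(uuid, instances)
```

In practice the text would come from an HTTP GET against the metrics endpoint rather than from an inline string.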

Note

No action is required on MIG nodes - everything is managed by mig-parted in hami-device-plugin.
NVIDIA devices older than the Ampere architecture do not support MIG mode.
MIG resources (e.g., nvidia.com/mig-1g.10gb) won’t be visible on the node. HAMi uses a unified resource name for both MIG and hami-core nodes.
