vGPU Manager VIB Installation

This chapter describes how to install the NVIDIA vGPU Manager VIB. It provides a step-by-step guide that includes downloading and preparing the VIB file for installation on the host.

For demonstration purposes, these steps use the VMware vSphere web interface to upload the VIB to the server host.

The following sections describe how to perform these tasks:

  • Uploading the VIB in the vSphere Web Client

  • Installing the VIB

  • Updating the VIB

  • Verifying the Installation of the VIB

  • Uninstalling the VIB

  • Changing the Default Graphics Type in VMware vSphere 6.5 and Later

  • Changing the vGPU Scheduling Policy

  • Disabling and Enabling ECC Memory

Uploading the VIB in the vSphere Web Client

For demonstration purposes, these steps use the VMware vSphere web interface to upload the VIB to the server host.

Before you begin, download the archive containing the VIB file and extract its contents to a folder. The file with the .vib extension is the file that you must upload to the host data store for installation.

Use the following procedure to upload the file to the data store using vSphere Web Client.

  1. Click the Related Objects tab for the desired server.

  2. Select Datastores.

  3. Either right-click the data store and select Browse Files, or left-click the icon in the toolbar. The Datastore Browser window displays.

  4. Click the New Folder icon.

    • The Create a new folder window displays.

  5. Name the new folder vib and then click Create.

  6. Select the vib folder in the Datastore Browser window.

  7. Click the Upload icon.

    • The Client Integration Access Control window displays.

  8. Select Allow. The .vib file is uploaded to the data store on the host.

Note

If you do not click Allow before the timer runs out, then further attempts to upload a file will silently fail. If this happens, exit and restart vSphere Web Client, repeat this procedure, and be sure to click Allow before the timer runs out.
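Alternatively, if you prefer to copy the VIB from a management workstation rather than through the browser upload, the data store is also reachable over SSH once the ESXi Shell and SSH are enabled (see the next section). A minimal sketch, assuming a data store named datastore1, the vib folder created above, and a host reachable as esxi-host; adjust the names and paths for your environment:

    # From the management workstation (hypothetical host name and data store):
    $ scp NVIDIA-VMware-460.32.04-1OEM.670.0.0.8169922.x86_64.vib root@esxi-host:/vmfs/volumes/datastore1/vib/

    # On the ESXi host, confirm that the file arrived:
    [root@esxi:~] ls -l /vmfs/volumes/datastore1/vib/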

Installing vGPU Manager with the .vib File

The NVIDIA Virtual GPU Manager runs on the ESXi host. It is provided in the following formats:

  • As a VIB file, which must be copied to the ESXi host and then installed.

  • As an offline bundle that you can import manually, as explained in Import Patches Manually in the VMware vSphere documentation.

Important

Prior to vGPU release 11, the NVIDIA Virtual GPU Manager and guest VM drivers must be matched from the same main driver branch. If you update the vGPU Manager to a release from another driver branch, guest VMs will boot with vGPU disabled until their guest vGPU driver is updated to match the vGPU Manager version. Consult the Virtual GPU Software for VMware vSphere Release Notes for further details.

To install the vGPU Manager VIB, you need to access the ESXi host via the ESXi Shell or SSH. Refer to VMware’s documentation on how to enable ESXi Shell or SSH for an ESXi host.

Note

Before proceeding with the vGPU Manager installation make sure that all VMs are powered off and the ESXi host is placed in maintenance mode. Refer to VMware’s documentation on how to place an ESXi host in maintenance mode.

  1. Place the host into maintenance mode by right-clicking it and then selecting Maintenance Mode > Enter Maintenance Mode.


    Note

    Alternatively, you can place the host into maintenance mode from the command prompt by entering esxcli system maintenanceMode set --enable=true. This command does not return a response. Making this change from the command prompt does not refresh the vSphere Web Client UI; click the Refresh icon in the upper right corner of the vSphere Web Client window.

    Important

    Placing the host into maintenance mode disables any vCenter appliance running on this host until you exit maintenance mode and then restart that vCenter appliance.

  2. Click OK to confirm your selection.

  3. Use the esxcli command to install the vGPU Manager package, where directory is the absolute pathname of the directory that contains the .vib file. You must specify the absolute path even if the .vib file is in the current working directory.

    [root@esxi:~] esxcli software vib install -v directory/NVIDIA-VMware-460.32.04-1OEM.670.0.0.8169922.x86_64.vib
    Installation Result
       Message: Operation finished successfully.
       Reboot Required: false
       VIBs Installed: NVIDIA-VMware-460.32.04-1OEM.670.0.0.8169922.x86_64
       VIBs Removed:
       VIBs Skipped:

  4. Reboot the ESXi host and remove it from maintenance mode.

    Note

    Although the output states “Reboot Required: false”, a reboot is necessary for the VIB to load and for Xorg to start.

  5. From the vSphere Web Client, exit maintenance mode by right-clicking the host and selecting Exit Maintenance Mode.

    Note

    Alternatively, you can exit maintenance mode from the command prompt by entering esxcli system maintenanceMode set --enable=false. This command does not return a response. Making this change from the command prompt does not refresh the vSphere Web Client UI; click the Refresh icon in the upper right corner of the vSphere Web Client window.

  6. Reboot the host from the vSphere Web Client by right-clicking the host and then selecting Reboot. The Reboot Host window displays.

    Note

    Alternatively, you can reboot the host by entering reboot at the command prompt. This command does not return a response.

  7. When rebooting from the vSphere Web Client, enter a descriptive reason for the reboot in the Log a reason for this reboot operation field, and then click OK to proceed.

Updating vGPU Manager with the .vib File

Update the vGPU Manager VIB package if you want to install a new version of NVIDIA Virtual GPU Manager on a system where an existing version is already installed.

To update the vGPU Manager VIB, you need to access the ESXi host via the ESXi Shell or SSH. Refer to VMware’s documentation on how to enable ESXi Shell or SSH for an ESXi host.

Note

Before proceeding with the vGPU Manager update, make sure that all VMs are powered off and the ESXi host is placed in maintenance mode. Refer to VMware’s documentation on how to place an ESXi host in maintenance mode.

  1. Use the esxcli command to update the vGPU Manager package:

    [root@esxi:~] esxcli software vib update -v directory/NVIDIA-vGPU-VMware_ESXi_6.0_Host_Driver_390.72-1OEM.600.0.0.2159203.vib
    Installation Result
       Message: Operation finished successfully.
       Reboot Required: false
       VIBs Installed: NVIDIA-vGPU-VMware_ESXi_6.0_Host_Driver_390.72-1OEM.600.0.0.2159203
       VIBs Removed: NVIDIA-vGPU-VMware_ESXi_6.0_Host_Driver_390.57-1OEM.600.0.0.2159203
       VIBs Skipped:

    Note

    directory is the path to the directory that contains the .vib file.

  2. Reboot the ESXi host and remove it from maintenance mode.

Verifying the Installation of vGPU Manager

  1. After the ESXi host has rebooted, verify the installation of the NVIDIA vGPU software package.

  2. Verify that the NVIDIA vGPU software package installed and loaded correctly by checking for the NVIDIA kernel driver in the list of loaded kernel modules:

    [root@esxi:~] vmkload_mod -l | grep nvidia
    nvidia                   5    8420

  3. If the NVIDIA driver is not listed in the output, check dmesg for any load-time errors reported by the driver.

  4. Verify that the NVIDIA kernel driver can successfully communicate with the NVIDIA physical GPUs in your system by running the nvidia-smi command, which is described in more detail in NVIDIA System Management Interface nvidia-smi. Running the nvidia-smi command should produce a listing of the GPUs in your platform.

    [root@esxi:~] nvidia-smi
    Wed Jan 13 19:48:05 2021
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 460.32.04    Driver Version: 460.32.04    CUDA Version: N/A      |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |                               |                      |               MIG M. |
    |===============================+======================+======================|
    |   0  Tesla T4            On   | 00000000:81:00.0 Off |                  Off |
    | N/A   33C    P8    15W /  70W |     79MiB / 16383MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
    |   1  Tesla T4            On   | 00000000:C5:00.0 Off |                  Off |
    | N/A   31C    P8    15W /  70W |     79MiB / 16383MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
    +-----------------------------------------------------------------------------+
    | Processes:                                                                  |
    |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
    |        ID   ID                                                   Usage      |
    |=============================================================================|
    |  No running processes found                                                 |
    +-----------------------------------------------------------------------------+
    [root@esxi:~]
    
  5. If nvidia-smi fails to report the expected output for all the NVIDIA GPUs in your system, see Troubleshooting for troubleshooting steps.

  6. nvidia-smi also allows GPU monitoring using the following command: $ nvidia-smi -l

  7. This command switch adds a loop, automatically refreshing the display at a fixed interval.
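For example, to refresh the listing at a chosen interval rather than the default, pass the interval in seconds to the loop option (the 5-second interval below is an arbitrary choice):

    # Refresh the GPU listing every 5 seconds; press Ctrl+C to stop.
    [root@esxi:~] nvidia-smi -l 5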

Uninstalling vGPU Manager

Use the following procedure to uninstall the vGPU Manager VIB.

  1. Determine the name of the vGPU driver bundle:

    $ esxcli software vib list | grep -i nvidia

    This command returns output like the following:

    NVIDIA-VMware_ESXi_6.7_Host_Driver   390.72-1OEM.600.0.0.2159203   NVIDIA   VMwareAccepted   2018-07-20

  2. Run the following command to uninstall the driver package:

    $ esxcli software vib remove -n NVIDIA-VMware_ESXi_6.7_Host_Driver --maintenance-mode

  3. The following message displays when removal is successful:

    Removal Result
        Message: Operation finished successfully.
        Reboot Required: false
    VIBs Installed:
    VIBs Removed: NVIDIA_bootbank_NVIDIA-VMware_ESXi_6.7_Host_Driver_390.72-1OEM.600.0.0.2159203
    VIBs Skipped:

  4. Reboot the host to complete the uninstallation process.
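After the reboot, you can optionally confirm that the NVIDIA VIB is no longer present. A minimal check, reusing the listing command from step 1 (no output from grep means the VIB was removed):

    [root@esxi:~] esxcli software vib list | grep -i nvidia
    [root@esxi:~]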
    

Changing the Default Graphics Type in VMware vSphere 6.5 and Later

The vGPU Manager VIBs for VMware vSphere 6.5 and later provide vSGA and vGPU functionality in a single VIB. After this VIB is installed, the default graphics type is Shared, which provides vSGA functionality. To enable vGPU support for VMs in VMware vSphere 6.5, you must change the default graphics type to Shared Direct. If you do not change the default graphics type, VMs to which a vGPU is assigned fail to start and the following error message is displayed:

The amount of graphics resource available in the parent resource pool is insufficient for the operation.

Note

If you are using a supported version of VMware vSphere earlier than 6.5, or are configuring a VM to use vSGA, omit this task.

Change the default graphics type before configuring vGPU. Output from the VM console in the VMware vSphere Web Client is not available for VMs that are running vGPU.

Before changing the default graphics type, ensure that the ESXi host is running and that all VMs on the host are powered off.

  1. Log in to vCenter Server by using the vSphere Web Client.

  2. In the navigation tree, select your ESXi host and click the Configure tab.

  3. From the menu, choose Graphics and then click the Host Graphics tab.

  4. On the Host Graphics tab, click Edit.

  5. In the Edit Host Graphics Settings dialog box that opens, select Shared Direct and click OK.

  6. After you click OK, the default graphics type changes to Shared Direct.

    Note

    In this dialog box, you can also change the allocation scheme for vGPU-enabled VMs. For more information, see Modifying GPU Allocation Policy on VMware vSphere.

  7. Restart the ESXi host, or stop and restart the Xorg service and nv-hostengine on the ESXi host. To stop and restart the Xorg service and nv-hostengine, perform the following steps (consolidated in the sketch after this list):

    • Stop the Xorg service. [root@esxi:~] /etc/init.d/xorg stop

    • Stop nv-hostengine. [root@esxi:~] nv-hostengine -t

    • Wait for 1 second to allow nv-hostengine to stop.

    • Start nv-hostengine. [root@esxi:~] nv-hostengine -d

    • Start the Xorg service. [root@esxi:~] /etc/init.d/xorg start
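The same sequence is shown below as a single ESXi shell session for reference (a sketch only; the one-second pause mirrors the wait recommended above):

    [root@esxi:~] /etc/init.d/xorg stop     # stop the Xorg service
    [root@esxi:~] nv-hostengine -t          # stop nv-hostengine
    [root@esxi:~] sleep 1                   # wait for nv-hostengine to stop
    [root@esxi:~] nv-hostengine -d          # start nv-hostengine
    [root@esxi:~] /etc/init.d/xorg start    # start the Xorg service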

After changing the default graphics type, configure vGPU as explained in Configuring a vSphere VM with Virtual GPU.

See also the following topics in the VMware vSphere documentation:

  • Log in to vCenter Server by Using the vSphere Web Client

  • Configuring Host Graphics

Changing the vGPU Scheduling Policy

GPUs, starting with the NVIDIA® Maxwell™ architecture, implement a best effort vGPU scheduler that aims to balance performance across vGPUs by default. The best-effort scheduler allows a vGPU to use GPU processing cycles that are not being used by other vGPUs. Under some circumstances, a VM running a graphics-intensive application may adversely affect the performance of graphics-light applications running in other VMs.

GPUs, starting with the NVIDIA® Pascal™ architecture, also support equal share and fixed share vGPU schedulers. These schedulers impose limits on the GPU processing cycles used by a vGPU, which prevents graphics-intensive applications running in one VM from affecting the performance of graphics-light applications running in other VMs. The best effort scheduler is the default scheduler for all supported GPU architectures.

  • The GPUs that are based on the Pascal architecture are the NVIDIA P4, NVIDIA P6, NVIDIA P40, and NVIDIA P100.

  • The GPUs that are based on the Volta™ architecture are the NVIDIA V100 SXM2, NVIDIA V100 PCIe, NVIDIA V100 FHHL, and NVIDIA V100S.

  • The GPUs that are based on the Turing™ architecture are the NVIDIA T4, RTX 6000 and RTX 8000.

  • The GPUs that are based on the Ampere™ architecture are the NVIDIA A100, and A40.

vGPU Scheduling Policies

NVIDIA RTX vWS provides three GPU scheduling options to accommodate a variety of customer QoS requirements. Additional information regarding GPU scheduling can be found in the NVIDIA Virtual GPU Software documentation.

  • Fixed share scheduling always guarantees the same dedicated quality of service. The fixed share scheduling policy guarantees equal GPU performance across all vGPUs sharing the same physical GPU. Dedicated quality of service simplifies a proof of concept (POC) because it allows the use of common benchmarks for measuring physical workstation performance, such as SPECviewperf, to compare the performance with current physical or virtual workstations.

  • Best effort scheduling provides consistent performance at a higher scale and therefore reduces the TCO per user. The best effort scheduler leverages a round-robin scheduling algorithm that shares GPU resources based on actual demand, which results in optimal utilization of resources and consistent performance with optimized user density. The best effort scheduling policy best utilizes the GPU during idle and not fully utilized times, allowing for optimized density and a good QoS.

  • Equal share scheduling provides equal GPU resources to each running VM. As vGPUs are added or removed, the share of GPU processing cycles allocated to each vGPU changes accordingly, so performance increases when utilization is low and decreases when utilization is high.

Organizations typically leverage the best effort GPU scheduler policy for their deployments to achieve better utilization of the GPU, which usually results in supporting more users per server with a lower quality of service (QoS) and better TCO per user.

RmPVMRL Registry Key

The RmPVMRL registry key sets the scheduling policy for NVIDIA vGPUs.

Note

You can change the vGPU scheduling policy only on GPUs based on the Pascal, Volta, Turing, and Ampere architectures.

Type     Value            Meaning
DWORD    0x00 (default)   Best effort scheduler
DWORD    0x01             Equal share scheduler with the default time slice length
DWORD    0x00TT0001       Equal share scheduler with a user-defined time slice length TT
DWORD    0x11             Fixed share scheduler with the default time slice length
DWORD    0x00TT0011       Fixed share scheduler with a user-defined time slice length TT

The default time slice length depends on the maximum number of vGPUs per physical GPU allowed for the vGPU type.

Maximum Number of vGPUs    Default Time Slice Length
Less than or equal to 8    2 ms
Greater than 8             1 ms

TT
  • Two hexadecimal digits in the range 01 to 1E that set the length of the time slice in milliseconds (ms) for the equal share and fixed share schedulers. The minimum length is 1 ms and the maximum length is 30 ms.

  • If TT is 00, the length is set to the default length for the vGPU type.

  • If TT is greater than 1E, the length is set to 30 ms.

Examples

  • This example sets the vGPU scheduler to equal share scheduler with the default time slice length.

    RmPVMRL=0x01

  • This example sets the vGPU scheduler to equal share scheduler with a time slice that is 3 ms long.

    RmPVMRL=0x00030001

  • This example sets the vGPU scheduler to fixed share scheduler with the default time slice length.

    RmPVMRL=0x11

  • This example sets the vGPU scheduler to fixed share scheduler with a time slice that is 24 (0x18) ms long.

    RmPVMRL=0x00180011
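As an additional worked encoding (a hypothetical value, not taken from the examples above): to request a 12 ms time slice with the fixed share scheduler, convert 12 to hexadecimal and substitute it for TT in the 0x00TT0011 pattern.

    # 12 ms -> TT = 0x0C (12 decimal = C hexadecimal)
    # Fixed share pattern 0x00TT0011 -> RmPVMRL = 0x000C0011
    RmPVMRL=0x000C0011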

Changing the vGPU Scheduling Policy for All GPUs

Perform this task in your hypervisor command shell.

  1. Open a command shell as the root user on your hypervisor host machine. On all supported hypervisors, you can use secure shell (SSH) for this purpose.

  2. Set the RmPVMRL registry key to the scheduling policy you want by adding the following entry to the /etc/modprobe.d/nvidia.conf file (see the example configuration after these steps):

    options nvidia NVreg_RegistryDwords="RmPVMRL=<value>"

    Where <value> is the value that sets the vGPU scheduling policy you want, for example:

    • 0x01 - Equal Share Scheduler with the default time slice length

    • 0x00030001 - Equal Share Scheduler with a time slice of 3 ms

    • 0x11 - Fixed Share Scheduler with the default time slice length

    • 0x00180011 - Fixed Share Scheduler with a time slice of 24 ms (0x18)

    The default time slice length depends on the maximum number of vGPUs per physical GPU allowed for the vGPU type.

    Maximum Number of vGPUs    Default Time Slice Length
    Less than or equal to 8    2 ms
    Greater than 8             1 ms

    For all supported values, see RmPVMRL Registry Key.

  3. Reboot your hypervisor host machine.
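For example, the completed /etc/modprobe.d/nvidia.conf entry for the equal share scheduler with the default time slice length would look like the following (a minimal sketch; substitute any value from the list above):

    # /etc/modprobe.d/nvidia.conf
    # Equal share scheduler with the default time slice length (RmPVMRL=0x01)
    options nvidia NVreg_RegistryDwords="RmPVMRL=0x01"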

Changing the vGPU Scheduling Policy for Select GPUs

Note

You can change the vGPU scheduling policy only on GPUs based on the Pascal, Volta, Turing, and Ampere architectures.

Perform this task in your hypervisor command shell.

  1. Open a command shell as the root user on your hypervisor host machine. On all supported hypervisors, you can use secure shell (SSH) for this purpose.

  2. Use the lspci command to obtain the PCI domain and bus/device/function (BDF) of each GPU for which you want to change the scheduling behavior. Add the -D option to display the PCI domain and the -d 10de: option to display information only for NVIDIA GPUs:

    # lspci -D -d 10de:

    The NVIDIA GPUs listed in this example have the PCI domain 0000 and BDFs 85:00.0 and 86:00.0.

    0000:85:00.0 VGA compatible controller: NVIDIA Corporation GM204GL [M60] (rev a1)
    0000:86:00.0 VGA compatible controller: NVIDIA Corporation GM204GL [M60] (rev a1)
    
  3. Use the module parameter NVreg_RegistryDwordsPerDevice to set the pci and RmPVMRL registry keys for each GPU by adding the following entry to the /etc/modprobe.d/nvidia.conf file:

    options nvidia NVreg_RegistryDwordsPerDevice="pci=pci-domain:pci-bdf;RmPVMRL=value
    [;pci=pci-domain:pci-bdf;RmPVMRL=value...]"
    

    For each GPU, provide the following information:

    • pci-domain: The PCI domain of the GPU.

    • pci-bdf: The PCI device BDF of the GPU.

    • RmPVMRL=value: The value that sets the vGPU scheduling policy you want:

      • 0x01 - Sets the vGPU scheduling policy to Equal Share Scheduler with the default time slice length.

      • 0x00030001 - Sets the vGPU scheduling policy to Equal Share Scheduler with a time slice that is 3 ms long.

      • 0x11 - Sets the vGPU scheduling policy to Fixed Share Scheduler with the default time slice length.

      • 0x00180011 - Sets the vGPU scheduling policy to Fixed Share Scheduler with a time slice that is 24 ms (0x18) long.

    For all supported values, see RmPVMRL Registry Key.

    This example adds an entry to the /etc/modprobe.d/nvidia.conf file to change the scheduling behavior of two GPUs as follows:

    • For the GPU at PCI domain 0000 and BDF 85:00.0, the vGPU scheduling policy is set to Equal Share Scheduler.

    • For the GPU at PCI domain 0000 and BDF 86:00.0, the vGPU scheduling policy is set to Fixed Share Scheduler.

    options nvidia NVreg_RegistryDwordsPerDevice="pci=0000:85:00.0;RmPVMRL=0x01;pci=0000:86:00.0;RmPVMRL=0x11"
    
  4. Reboot your hypervisor host machine.

Restoring Default vGPU Scheduler Settings

Perform this task in your hypervisor command shell.

  1. Open a command shell as the root user on your hypervisor host machine. On all supported hypervisors, you can use secure shell (SSH) for this purpose.

  2. Unset the RmPVMRL registry key by commenting out the entries in the /etc/modprobe.d/nvidia.conf file that set RmPVMRL by prefixing each entry with the # character.

  3. Reboot your hypervisor host machine.
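For example, a previously added entry that has been commented out to restore the default best effort scheduler might look like this (a sketch; the value shown is illustrative):

    # /etc/modprobe.d/nvidia.conf
    # options nvidia NVreg_RegistryDwords="RmPVMRL=0x01"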

Disabling and Enabling ECC Memory

Some GPUs that support NVIDIA vGPU software support error correcting code (ECC) memory with NVIDIA vGPU. ECC memory improves data integrity by detecting and handling double-bit errors. However, not all GPUs, vGPU types, and hypervisor software versions support ECC memory with NVIDIA vGPU.

On GPUs that support ECC memory with NVIDIA vGPU, ECC memory is supported with C-series and Q-series vGPUs, but not with A-series and B-series vGPUs. Although A-series and B-series vGPUs start on physical GPUs on which ECC memory is enabled, enabling ECC with vGPUs that do not support it might incur some costs.

On physical GPUs that do not have HBM2 memory, enabling ECC memory reduces the amount of frame buffer that is usable by vGPUs. All types of vGPU are affected, not just vGPUs that support ECC memory.

The effects of enabling ECC memory on a physical GPU are as follows:

  • ECC memory is exposed as a feature on all supported vGPUs on the physical GPU.

  • In VMs that support ECC memory, ECC memory is enabled, with the option to disable ECC in the VM.

  • ECC memory can be enabled or disabled for individual VMs. Enabling or disabling ECC memory in a VM does not affect the amount of frame buffer that is usable by vGPUs.

GPUs based on the Pascal GPU architecture and later GPU architectures support ECC memory with NVIDIA vGPU. These GPUs are supplied with ECC memory enabled.

M60 and M6 GPUs support ECC memory when used without GPU virtualization, but NVIDIA vGPU does not support ECC memory with these GPUs. In graphics mode, these GPUs are supplied with ECC memory disabled by default.

Some hypervisor software versions do not support ECC memory with NVIDIA vGPU.

If you are using a hypervisor software version or GPU that does not support ECC memory with NVIDIA vGPU and ECC memory is enabled, NVIDIA vGPU fails to start. In this situation, you must ensure that ECC memory is disabled on all GPUs if you are using NVIDIA vGPU.

Disabling ECC Memory

If ECC memory is unsuitable for your workloads but is enabled on your GPUs, disable it. You must also ensure that ECC memory is disabled on all GPUs if you are using NVIDIA vGPU with a hypervisor software version or a GPU that does not support ECC memory with NVIDIA vGPU. If your hypervisor software version or GPU does not support ECC memory and ECC memory is enabled, NVIDIA vGPU fails to start.

Where to perform this task from depends on whether you are changing ECC memory settings for a physical GPU or a vGPU.

  • For a physical GPU, perform this task from the hypervisor host.

  • For a vGPU, perform this task from the VM to which the vGPU is assigned.

Note

ECC memory must be enabled on the physical GPU on which the vGPUs reside.

Before you begin, ensure that NVIDIA Virtual GPU Manager is installed on your hypervisor. If you are changing ECC memory settings for a vGPU, also ensure that the NVIDIA vGPU software graphics driver is installed in the VM to which the vGPU is assigned.

  1. Use nvidia-smi to list the status of all physical GPUs or vGPUs and check for ECC noted as enabled.

    # nvidia-smi -q
    ==============NVSMI LOG==============
    
    Timestamp                           : Mon Jul 13 18:36:45 2020
    Driver Version                      : 450.55
    Attached GPUs                       : 1
    GPU 0000:02:00.0
    [...]
    Ecc Mode
    Current                     : Enabled
    Pending                     : Enabled
    [...]
    
  2. Change the ECC status to off for each GPU for which ECC is enabled.

    • If you want to change the ECC status to off for all GPUs on your host machine or all vGPUs assigned to the VM, run this command: # nvidia-smi -e 0

    • If you want to change the ECC status to off for a specific GPU or vGPU, run this command: # nvidia-smi -i id -e 0

      Where id is the index of the GPU or vGPU as reported by nvidia-smi. This example disables ECC for the GPU with index 0000:02:00.0: # nvidia-smi -i 0000:02:00.0 -e 0

  3. Reboot the host or restart the VM.

  4. Confirm that ECC is now disabled for the GPU or vGPU.

    # nvidia-smi -q
    
    ==============NVSMI LOG==============
    
    Timestamp                           : Mon Jul 13 18:37:53 2020
    Driver Version                      : 450.55
    Attached GPUs                       : 1
    GPU 0000:02:00.0
    [...]
    Ecc Mode
    Current                     : Disabled
    Pending                     : Disabled
    [...]
    

Enabling ECC Memory

If ECC memory is suitable for your workloads and is supported by your hypervisor software and GPUs, but is disabled on your GPUs or vGPUs, enable it. Where to perform this task from depends on whether you are changing ECC memory settings for a physical GPU or a vGPU.

  • For a physical GPU, perform this task from the hypervisor host.

  • For a vGPU, perform this task from the VM to which the vGPU is assigned.

Note

ECC memory must be enabled on the physical GPU on which the vGPUs reside.

Before you begin, ensure that NVIDIA Virtual GPU Manager is installed on your hypervisor. If you are changing ECC memory settings for a vGPU, also ensure that the NVIDIA vGPU software graphics driver is installed in the VM to which the vGPU is assigned.

  1. Use nvidia-smi to list the status of all physical GPUs or vGPUs and check for ECC noted as disabled.

    # nvidia-smi -q
    
    ==============NVSMI LOG==============
    
    Timestamp                           : Mon Jul 13 18:36:45 2020
    Driver Version                      : 450.55
    Attached GPUs                       : 1
    GPU 0000:02:00.0
    
    [...]
    Ecc Mode
    Current                     : Disabled
    Pending                     : Disabled
    [...]
    
  2. Change the ECC status to on for each GPU or vGPU for which ECC is disabled.

    • If you want to change the ECC status to on for all GPUs on your host machine or all vGPUs assigned to the VM, run this command: # nvidia-smi -e 1

    • If you want to change the ECC status to on for a specific GPU or vGPU, run this command: # nvidia-smi -i id -e 1

      Where id is the index of the GPU or vGPU as reported by nvidia-smi. This example enables ECC for the GPU with index 0000:02:00.0: # nvidia-smi -i 0000:02:00.0 -e 1

  3. Reboot the host or restart the VM.

  4. Confirm that ECC is now enabled for the GPU or vGPU.

    # nvidia-smi -q
    
    ==============NVSMI LOG==============
    
    Timestamp                           : Mon Jul 13 18:37:53 2020
    Driver Version                      : 450.55
    Attached GPUs                       : 1
    GPU 0000:02:00.0
    
    [...]
    Ecc Mode
    Current                     : Enabled
    Pending                     : Enabled
    [...]