Troubleshooting#
Use the following information to troubleshoot issues that may arise during deployment and use of Omniverse Kit App Streaming.
Azure Deployment#
Azure Marketplace Deployment: The entire deployment failed to create.
This is likely a permissions issue. Ensure you have the ability to create resource groups, role assignments, and resources.
Virtual Machine: The initial virtual machine failed to start or execute the deployment script.
Occasionally in cloud environments, virtual machines fail their health checks and cannot start. Delete the deployment and start over.
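One way to confirm the failure before deleting is to query the virtual machine's instance view from the Azure CLI. This is only a quick check; the resource group and virtual machine names below are placeholders for your own deployment:
az vm get-instance-view --resource-group <resource-group> --name <vm-name> --query "instanceView.statuses[].displayStatus" -o table
A status such as "VM stopped" or a failed provisioning state confirms the machine never became healthy.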
Role Assignments: The deployment has started, but is unable to create role assignments.
Role assignments must have unique names within the entire subscription and are generated using the virtual machine and resource group names. Delete this deployment, start over, and then provide a new, unique “Project Name” in the marketplace deployment UI.
Ensure you have permissions to create role assignments within your subscription.
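Creating role assignments generally requires the Owner or User Access Administrator role at the subscription or resource group scope. As a quick sanity check (assuming you are signed in with the same account used for the marketplace deployment), you can list the roles currently assigned to you:
az role assignment list --assignee $(az ad signed-in-user show --query id -o tsv) --all -o table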
AKS Cluster: The cluster fails to create.
The cluster takes time to build. If the deployment completes and you still have no AKS cluster, review the run_command stderr logs on the virtual machine (see Virtual Machine Deployment Logs). A common problem is the unavailability of GPU instances. Check your regional quota for the A10_v5 instance type; an example quota check follows the table below.
Minimum availability required per instance type and for public IP addresses:
Standard_D2s_v3: 2
Standard_D8s_v3: 1
Standard_NV36ads_A10_v5: 2
Public IP addresses: 2
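The availability above can be verified from the Azure CLI. The region placeholder and the grep filters here are illustrative; adjust them to your deployment and instance families:
az vm list-usage --location <region> -o table | grep -i -E "A10|DSv3"
az network list-usages --location <region> -o table | grep -i "Public IP"
Compare the CurrentValue and Limit columns against the minimums listed above, and request a quota increase if the remaining headroom is insufficient.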
Omniverse Resources: Deployment fails to create Omniverse resources.
If the deployment completes and you do not see any Omniverse workloads in the AKS cluster, check the run_command logs on the virtual machine (see Virtual Machine Deployment Logs). This may be because the NGINX load balancer creation timed out. If this occurs, delete the deployment and all the resources in the resource group and start over.
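If you want to confirm whether the load balancer was the problem before deleting, one quick check is to inspect the ingress controller's pods and service. The namespace below assumes the NGINX ingress controller was installed into a namespace such as ingress-nginx; adjust it to match your deployment:
kubectl --cluster <AKS-CLUSTER-NAME> --namespace ingress-nginx get pods,svc
A LoadBalancer service whose EXTERNAL-IP remains in the <pending> state indicates that the public IP was never provisioned.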
Virtual Machine Deployment Logs#
If you see no directory called ‘run-command-handler’ inside of /var/lib/waagent/, it likely means that the VM itself failed its health check. Delete the resource group and start over.
Obtain the virtual machine IP address from the Azure portal. The virtual machine will have the same name as the project name you specified. (The admin username and password for logging into the virtual machine were specified during the initial deployment steps.)
Enter:
ssh <username>@<IP>
Type ‘yes’ when prompted, and then enter the password when prompted.
Once you are logged into the virtual machine, type:
sudo su
to become the root user.
The error logs are located at:
/var/lib/waagent/run-command-handler/download/customScript<projectname>/0/stderr
To see the parameters of the script that you ran, enter:
cat /var/lib/waagent/run-command-handler/download/customScript<projectname>/0/script.sh
To see the output of the script that you ran, enter:
cat /var/lib/waagent/run-command-handler/download/customScript<projectname>/0/stdout
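If you are unsure of the exact customScript directory name, you can list the download directory and then tail the error log. These are plain shell commands run as root on the virtual machine:
ls /var/lib/waagent/run-command-handler/download/
tail -n 50 /var/lib/waagent/run-command-handler/download/customScript<projectname>/0/stderr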
Kubernetes Deployment#
NVIDIA Container Registry API Key: You can test the validity of your NGC API key by following the steps to create one, using it to log in to the NVIDIA container registry (nvcr.io) with docker login, and then attempting a local docker pull of a Kit app container. For example:
docker pull nvcr.io/nvidia/omniverse/kit-appstreaming-rmcp:1.9.0
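For the login step, NGC expects the literal username $oauthtoken, with your NGC API key supplied as the password. A minimal interactive example:
docker login nvcr.io
Username: $oauthtoken
Password: <NGC-API-KEY>
If the login fails, generate a new API key in NGC before retrying the deployment.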
If you are seeing “ImagePullBackOff” for the kit-app pods, it likely means that the NGC API key was not able to log in to the nvcr.io repository:
kubectl --cluster <AKS-CLUSTER-NAME> --namespace omni-streaming get pods -o wide
NAME                           READY   STATUS             RESTARTS   AGE    IP             NODE                                NOMINATED NODE   READINESS GATES
applications-b95996b7f-ckxw6   0/2     ImagePullBackOff   0          104m   10.244.3.15    aks-gpupool-24796148-vmss000001     <none>           <none>
memcached-0                    1/1     Running            0          105m   10.244.0.206   aks-cachepool-24796148-vmss000000   <none>           <none>
rmcp-6d65cdd765-ctgfp          0/1     ImagePullBackOff   0          104m   10.244.1.175   aks-agentpool-24796148-vmss000000   <none>           <none>
streaming-7bdbcc6475-smnwk     0/1     ImagePullBackOff   0          104m   10.244.2.69    aks-gpupool-24796148-vmss000000     <none>           <none>
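To see the exact pull error, describe one of the failing pods and check the Events section at the bottom of the output. If the error is an authorization failure, recreate the registry pull secret with a valid key. The secret name regcred below is illustrative; use whatever name your Helm values reference:
kubectl --cluster <AKS-CLUSTER-NAME> --namespace omni-streaming describe pod <pod-name>
kubectl --cluster <AKS-CLUSTER-NAME> --namespace omni-streaming create secret docker-registry regcred --docker-server=nvcr.io --docker-username='$oauthtoken' --docker-password=<NGC-API-KEY>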
Web Viewer: If the Web Viewer fails to load the image from the kit-app pod:
View the logs from the kit-app pod that the web viewer creates.
If you see a message about CUDA memory, check the annotations of the GPU node it is running on:
kubectl get node <node name> -o yaml
The GPU operator version should be 535.161.08. If it is not, it is possible the GPU operator failed to install or was overwritten. Delete this deployment and start over. If the issue continues, file a ticket with Azure Technical Support and verify that the GPU operator installed on the A10_v5 family is correct.
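To narrow down the node output, you can also filter for the NVIDIA labels and annotations applied by the GPU operator components, and check that the operator pods themselves are healthy. The gpu-operator namespace below is the operator's common default and may differ in your deployment:
kubectl describe node <node name> | grep -i nvidia
kubectl get pods --namespace gpu-operator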