Deploying Omniverse Farm on Kubernetes#
The following document describes configuring and deploying Omniverse Farm on Kubernetes. These instructions are generic and should apply to standard Kubernetes clusters as well as the various cloud flavours of Kubernetes. The intended audience for this document is experienced systems administrators familiar with Kubernetes, their CSP's Kubernetes offering if applicable, and the deployment of Helm charts.
1. Prerequisites#
A. NVIDIA Device Plugin#
The Kubernetes cluster must have the NVIDIA Device Plugin installed. This plugin provides a daemonset that automatically exposes the number of GPUs available, keeps track of GPU health, and runs GPU enabled containers.
The NVIDIA Device Plugin runs as a daemonset on all nodes in the cluster by default. A nodeSelector
can be used to restrict the daemonset to GPU nodes only.
On Helm install:
--set "nodeSelector.<LABEL>=<VALUE>"
Or via a values file:
# nvidia-device-plugin-values.yaml
nodeSelector:
  <LABEL>: <VALUE>
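The values file can then be used when installing or upgrading the plugin. A minimal sketch, assuming the plugin is installed from NVIDIA's k8s-device-plugin Helm repository (the release name and namespace below are illustrative choices):
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update
helm upgrade --install nvidia-device-plugin nvdp/nvidia-device-plugin \
  --namespace nvidia-device-plugin \
  --create-namespace \
  --values nvidia-device-plugin-values.yaml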
Once the NVIDIA Device Plugin has been installed, you can verify the number of GPUs available on the nodes via the following command:
kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"
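The output lists each node alongside its allocatable GPU count; nodes that do not expose GPUs report <none>. The node names and counts below are purely illustrative:
NAME           GPU
gpu-node-1     4
cpu-node-1     <none>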
B. Kubernetes version#
Omniverse Farm has been tested on Kubernetes versions 1.22 and higher. Where possible, we recommend using Kubernetes 1.24 or higher.
2. Considerations#
A. Security#
It is strongly recommended not to expose Omniverse Farm to the public internet. Farm does not ship with authN/authZ and offers only limited token-based authentication for job submission. If exposure is a technical requirement for your organization, be sure to restrict access to public endpoints (e.g. security groups in cloud deployments, firewalls and VPN access for on-premises deployments).
B. Capacity Tuning#
Tuning the Omniverse Farm controller's maximum job capacity can be achieved by configuring farm-values.yaml. This limits the number of jobs that can run in parallel and may be useful in mixed environments where the Kubernetes cluster is shared with other workloads.
controller:
  serviceConfig:
    capacity:
      max_capacity: 32
C. Number of GPUs#
Omniverse Farm will parallelize work based on the number of available GPUs. Once work has been assigned to a GPU, it will occupy the GPU until it completes.
In a production environment, it will take some experimentation to determine the optimal number of GPUs for the work being performed.
D. Storage#
Hard drive size selection must take into consideration both the containers being used and the types of jobs being executed.
Omniverse USD Composer (used for running various jobs) executes inside a large container (multiple gigabytes in size), so there must be sufficient storage for the container image. Generally, a volume of around 100GB (for example, an EBS volume on AWS) is a good starting point, but this is highly coupled with the requirements and workflow of your project.
Depending on the workload, data may also be stored locally before being uploaded to its final destination. As such, the instance must have sufficient storage for any temporary files (these can be fairly large for rendering-related jobs, for example).
A cluster’s exact needs will be determined by the jobs the cluster is meant to execute.
Note
It is good practice to begin with oversized resources and then pare back (or grow into) the resources as necessary, rather than run an undersized cluster that may alarm or become unavailable due to resource starvation.
E. Management Services#
Multiple services handle communication, life cycle, and interaction across the Omniverse Farm cluster. These services are memory intensive and should be treated as such. They include the agents, controller, dashboard, jobs, logs, metrics, retries, settings, tasks, and UI services.
F. Ingress#
Omniverse Farm does not deploy an Ingress. In order to reach the services from outside the Kubernetes cluster, an Ingress may be required. A variety of ingress controllers are available.
Within NVIDIA, Omniverse Farm has been deployed and tested with both the NGINX and Traefik ingress controllers. CSPs usually have their own ingress solutions as well. A minimal example is sketched below.
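As a sketch, an Ingress for the NGINX ingress controller could route the Farm management paths to the corresponding services. The hostname is hypothetical, and the service names and port are inferred from the in-cluster endpoints used later in this guide; confirm them against kubectl get svc -n ov-farm before applying. Additional paths (agents, jobs, logs, retries) follow the same pattern:
# farm-ingress.yaml (sketch)
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: omniverse-farm
  namespace: ov-farm
spec:
  ingressClassName: nginx
  rules:
    - host: farm.example.com            # hypothetical hostname
      http:
        paths:
          - path: /queue/management/tasks
            pathType: Prefix
            backend:
              service:
                name: tasks              # assumed service name
                port:
                  number: 80             # assumed service port
          - path: /queue/management/dashboard
            pathType: Prefix
            backend:
              service:
                name: dashboard
                port:
                  number: 80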
3. Deploying the Helm Chart#
A. Prerequisites#
Local:#
A valid NVIDIA NGC API key.
The NGC CLI installed and configured.
Cluster:#
A recent NVIDIA driver version that is certified for the Omniverse applications you will be using for Farm tasks (typically preinstalled with current GPU-accelerated node images/AMIs).
NVIDIA k8s-device-plugin (see section 1.A).
NVIDIA Container Toolkit (typically preinstalled with current GPU-accelerated node images/AMIs).
It is assumed that a method of targeting specific nodes is utilized (e.g. a nodeSelector; see the sketch below).
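For example, GPU worker nodes could be labeled (the label key and value below are purely illustrative):
kubectl label node <node-name> omniverse.farm/worker=true
The label can then be referenced from a per-service nodeSelector in the Helm values overrides, following the same pattern shown for the dashboard service in Step 2 below. That the agents service accepts a nodeSelector in this way is an assumption to verify against the chart's values:
agents:
  nodeSelector:
    omniverse.farm/worker: "true"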
B. Deploying Farm#
In this step, we will deploy the Omniverse Farm Helm chart. This document will provide a step-by-step guide and should be generic across Kubernetes flavours. For more advanced cases, feel free to examine the Helm chart itself and determine the best approach for your provider.
A full set of all resources (containers, Helm charts, job definitions) can be found in this collection.
All steps utilize the following values; however, you should feel free to change them at your discretion. For this guide, we will assume:
NAMESPACE=ov-farm
SECRET=registry-secret
NGC_API_TOKEN=<your_token>
Step 1:#
First, we will create a namespace for Omniverse Farm.
kubectl create namespace $NAMESPACE
As the container images referenced within the Helm chart are private, you will need to create a secret within your cluster namespace to provide your cluster with the NGC API token.
kubectl create secret docker-registry $SECRET \
--namespace $NAMESPACE \
--docker-server="nvcr.io" \
--docker-username='$oauthtoken' \
--docker-password=$NGC_API_TOKEN
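You can confirm that the secret was created:
kubectl get secret $SECRET --namespace $NAMESPACE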
Step 2:#
Create a farm-values.yaml
file. This file will be used for specifying overrides during the installation.
Note
Replace SECRET below with the name of the secret created in Step 1 (registry-secret in this guide).
global:
  imagePullSecrets:
    - name: SECRET

controller:
  serviceConfig:
    k8s:
      jobTemplateSpecOverrides:
        imagePullSecrets:
          - name: SECRET
Note
It may be necessary to add additional overrides in the farm-values.yaml file.
For example, to tell the dashboard service to use a LoadBalancer service type and to target t3.medium instance types (if these are available in your cluster), you may need to add something like the following:
dashboard:
  nodeSelector:
    node.kubernetes.io/instance-type: t3.medium
  service:
    type: LoadBalancer
Step 3:#
Install the Omniverse Farm Helm chart:
FARM_HELM_VERSION="0.3.0"
helm fetch \
https://helm.ngc.nvidia.com/nvstaging/omniverse-farm/charts/omniverse-farm-$FARM_HELM_VERSION.tgz \
--username='$oauthtoken' \
--password=$NGC_API_TOKEN
helm upgrade \
--install \
--create-namespace \
--namespace $NAMESPACE \
omniverse-farm \
omniverse-farm-$FARM_HELM_VERSION.tgz \
--values farm-values.yaml
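Before validating the individual services, you can confirm that the release and its pods were created:
helm list --namespace $NAMESPACE
kubectl get pods --namespace $NAMESPACE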
Step 4:#
Validate the installation.
Ensure that all pods are in the ready state before proceeding.
kubectl -n $NAMESPACE wait --timeout=300s --for condition=Ready pods --all
The following command creates a curl
pod in the namespace that will allow us to query the various service endpoints.
(For more details on this, refer to the Official Kubernetes service networking documentation):
kubectl run curl --namespace=$NAMESPACE --image=radial/busyboxplus:curl -i --tty -- sh
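If the shell session disconnects while the pod is still running, you can open a new shell in the same pod with:
kubectl exec -it curl --namespace=$NAMESPACE -- sh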
The following code block defines two functions that facilitate querying if the various services are up:
check_endpoint() {
    url=$1
    curl -s -o /dev/null "$url" && echo -e "[UP]\t${url}" || echo -e "[DOWN]\t${url}"
}

check_farm_status() {
    echo "======================================================================"
    echo "Farm status:"
    echo "----------------------------------------------------------------------"
    check_endpoint "http://agents.ov-farm/queue/management/agents/status"
    check_endpoint "http://dashboard.ov-farm/queue/management/dashboard/status"
    check_endpoint "http://jobs.ov-farm/queue/management/jobs/status"
    check_endpoint "http://jobs.ov-farm/queue/management/jobs/load"
    check_endpoint "http://logs.ov-farm/queue/management/logs/status"
    check_endpoint "http://retries.ov-farm/queue/management/retries/status"
    check_endpoint "http://tasks.ov-farm/queue/management/tasks/status"
    check_endpoint "http://tasks.ov-farm/queue/management/tasks/list?status=submitted"
    echo "======================================================================"
}
Once you have the functions available in your curl
pod, you can query the status of Omniverse Farm by running:
check_farm_status
Output should be similar to:
======================================================================
Farm status:
----------------------------------------------------------------------
[UP] http://agents.ov-farm/queue/management/agents/status
[UP] http://dashboard.ov-farm/queue/management/dashboard/status
[UP] http://jobs.ov-farm/queue/management/jobs/status
[UP] http://jobs.ov-farm/queue/management/jobs/load
[UP] http://logs.ov-farm/queue/management/logs/status
[UP] http://retries.ov-farm/queue/management/retries/status
[UP] http://tasks.ov-farm/queue/management/tasks/status
[UP] http://tasks.ov-farm/queue/management/tasks/list?status=submitted
======================================================================
This validates that all Farm services are running and accessible.
Step 5:#
Now that you have confirmed that Farm services are available, it is time to run a simple job. This job definition runs the df
command using the busybox
container image.
Use the following command from the NGC CPU Resource setup documents to download the example df.kit
job and sample upload script:
ngc registry resource download-version "nvidia/omniverse-farm/cpu_verification:1.0.0"
Next, retrieve a token from Omniverse Farm for use in uploading jobs:
kubectl get cm omniverse-farm-jobs -o yaml -n $NAMESPACE | grep api_key
The token is unique per Farm instance and must be kept secure.
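For convenience, the key can be captured into a shell variable for the upload step. This is a sketch reusing the extraction pipeline shown in Step 7 below:
API_KEY=$(kubectl get cm omniverse-farm-jobs -o yaml -n $NAMESPACE | grep api_key | head -n 1 | awk '{print $3}' | tr -d \")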
Two libraries are required by the Python script; install them with:
pip install requests
pip install toml
Finally, from the directory the files were downloaded into, execute the following script to upload the job definition to your cluster:
python ./job_definition_upload.py df.kit --farm-url=<URL to the instance of Omniverse Farm> --api-key=<API key as retrieved in the previous step>
The job definition may take up to a minute to propagate to the various services in the cluster.
Step 6:#
After a few moments, it should be safe to submit a job to Omniverse Farm for scheduling. Execute the following snippet (found in the NGC CPU Resource Quick Start Guide) to submit a job:
export FARM_URL=<REPLACE WITH URL OF OMNIVERSE FARM INSTANCE>
curl -X "POST" \
  "${FARM_URL}/queue/management/tasks/submit" \
  -H 'Accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "user": "testuser",
  "task_type": "df",
  "task_args": {},
  "metadata": {
    "_retry": {
      "is_retryable": false
    }
  },
  "status": "submitted"
}'
After submitting, you should be able to navigate to ${FARM_URL}/queue/management/dashboard, enter a username (this can be anything you want, as no authentication is present), and observe your task in the task list.
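You can also confirm that the task was accepted from the command line, using the same endpoint exercised during validation:
curl -s "${FARM_URL}/queue/management/tasks/list?status=submitted"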
Step 7:#
Now that you have confirmed that Farm services are available and that you can run a simple job, it is time to run a GPU workload. This job definition runs the gpu
command using the nvidia-cuda
container image.
Use the following command from the NGC GPU Resource setup documents to download the example gpu.kit
job and sample upload script:
ngc registry resource download-version "nvidia/omniverse-farm/gpu_verification:1.0.0"
Next, retrieve a token from Omniverse Farm for use in uploading jobs:
kubectl get cm omniverse-farm-jobs -o yaml -n $NAMESPACE | grep api_key | head -n 1 | awk '{print $3}' | tr -d \"
This token is unique per Farm instance and must be kept secure.
Two libraries are required by the Python script; install them with:
pip install requests
pip install toml
Finally, from the directory the files were downloaded into, execute the following script to upload the job definition to your cluster:
python ./job_definition_upload.py gpu.kit --farm-url=<URL to the instance of Omniverse Farm> --api-key=<API key as retrieved in the previous step>
The job definition may take up to a minute to propagate to the various services in the cluster.
Step 8:#
After a few moments, it should be safe to submit a job to Omniverse Farm for scheduling. Execute the following snippet (found in the NGC GPU resource quick start guide) to submit a job:
export FARM_URL=<REPLACE WITH URL OF OMNIVERSE FARM INSTANCE>
curl -X "POST" \
  "${FARM_URL}/queue/management/tasks/submit" \
  -H 'Accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "user": "testuser",
  "task_type": "gpu",
  "task_args": {},
  "metadata": {
    "_retry": {
      "is_retryable": false
    }
  },
  "status": "submitted"
}'
After submitting, you should be able to navigate to ${FARM_URL}/queue/management/dashboard, enter a username (this can be anything you want, as no authentication is present), and observe your task in the task list.
Conclusion#
At this point your cluster should have a working version of Omniverse Farm able to run basic jobs. It is worth having a closer look at the job definitions to see how workloads are structured and how you may be able to onboard your own workloads. Omniverse Farm on Kubernetes can run any containerized workload. We would recommend reading the Farm documentation on job definitions.
In the next section we will target onboarding a rendering workflow using Omniverse USD Composer.
4. Batch Rendering Workloads#
Omniverse Farm can be used as a powerful distributed rendering solution.
A. Configuring storage#
Before we dive into the workload itself, there are some considerations regarding data access:
I. Nucleus#
Omniverse Farm jobs can be configured to connect directly to a Nucleus instance. This assumes that the Nucleus instance is accessible from the cluster, either via a direct connection or by deploying an instance in the cloud (e.g. AWS) configured to be reachable from the Omniverse Farm cluster.
Step 1:#
Create a Nucleus account that can be used as a service account.
Step 2:#
Update the job definitions that need access to Nucleus by adding the OMNI_USER and OMNI_PASS environment variables. For example, the create-render job definition would look like the following if the user and password from Step 1 were foo and bar:
[job.create-render]
job_type = "kit-service"
name = "create-render"
command = "/startup.sh"
# There is inconsistency with how args are parsed within Kit.
# This is why --enable arguments have a space in them as they do not support `--enable=`
# They will however be split into individual args when submitting them
args = [
    "--enable omni.services.render",
    "--/app/file/ignoreUnsavedOnExit=true",
    "--/app/extensions/excluded/0=omni.kit.window.privacy",
    "--/app/hangDetector/enabled=0",
    "--/app/asyncRendering=false",
    "--/rtx/materialDb/syncLoads=true",
    "--/omni.kit.plugin/syncUsdLoads=true",
    "--/rtx/hydra/materialSyncLoads=true",
    "--/rtx-transient/resourcemanager/texturestreaming/async=false",
    "--/rtx-transient/resourcemanager/enableTextureStreaming=false",
    "--/exts/omni.kit.window.viewport/blockingGetViewportDrawable=true",
    "--ext-folder", "/opt/nvidia/omniverse/farm-jobs/farm-job-create-render/exts-job.omni.farm.render",
    "--/crashreporter/dumpDir=/tmp/renders",
    # Example code to set up pushing metrics to a Prometheus push gateway.
    #"--/exts/services.monitoring.metrics/push_metrics=true",
    #"--/exts/services.monitoring.metrics/job_name=create_render",
    #"--/exts/services.monitoring.metrics/push_gateway=http://localhost:9091"
]
task_function = "render.run"
headless = true
log_to_stdout = true
container = "nvcr.io/nvidia/omniverse/create-render:2022.2.1"

[job.create-render.env]
OMNI_USER = "foo"
OMNI_PASS = "bar"
Step 3:#
Upload the job definition to the Farm as explained previously and submit the jobs.
It should now be possible to read the files from Nucleus and upload back any results.
II. Kubernetes Persistent Volumes#
Farm jobs can be configured to write to Persistent Volumes, which provide a variety of storage solutions that can be exposed to jobs. We will not cover how to configure the backend storage or the PVs themselves, as these will vary per deployment. All CSPs have options available for configuring a variety of PV solutions.
Step 1:#
After configuring a PV, it can be mounted by configuring the capacity_requirements (see: resource limits) section in a job definition. For example, to mount a volume output-storage into a pod at the location /data/output, the job definition can be updated as follows:
[capacity_requirements]

[[capacity_requirements.resource_limits]]
cpu = 2
memory = "14Gi"
"nvidia.com/gpu" = 1

[[capacity_requirements.volume_mounts]]
mountPath = "/data/output"
name = "output-storage"

[[capacity_requirements.volumes]]
name = "output-storage"
[capacity_requirements.volumes.persistentVolumeClaim]
claimName = "aws-credentials"
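The claimName above must refer to an existing PersistentVolumeClaim in the Farm namespace. A minimal sketch of such a claim is shown below; the access mode, storage class, and size are assumptions that will vary with your cluster and storage backend:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: aws-credentials        # must match the claimName in the job definition
  namespace: ov-farm
spec:
  accessModes:
    - ReadWriteMany            # assumes a backend that supports shared access
  storageClassName: <your-storage-class>
  resources:
    requests:
      storage: 100Gi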
Step 2:#
Upload the job definition to the Farm as previously explained and then submit the jobs.
It should now be possible to read and write data from /data/output (depending on the permissions on the Persistent Volume).
B. Onboarding the create-render job definition#
Step 1:#
With storage now configured, the create-render
job can be onboarded.
Use the following command from the NGC create-render resource setup documents to download the example job.omni.farm.render.kit
job and sample upload script:
ngc registry resource download-version "nvidia/omniverse-farm/create-render:2022.2.1"
Step 2:#
Based on the selected storage solution, add the required job definition updates to the job.omni.farm.render.kit file.
Step 3:#
Next, retrieve a token from Omniverse Farm for use in uploading jobs:
kubectl get cm omniverse-farm-jobs -o yaml -n $NAMESPACE | grep api_key
The token is unique per Farm instance and must be kept secure.
Two libraries are required by the Python script; install them with:
pip install requests
pip install toml
Finally, from the directory the files were downloaded into, execute the following script to upload the job definition to your cluster:
python ./job_definition_upload.py job.omni.farm.render.kit --farm-url=<URL to the instance of Omniverse Farm> --api-key=<API key as retrieved in the previous step>
The job definition may take up to a minute to propagate to the various services in the cluster.
Step 4:#
With the job definition onboarded, it is possible to submit a render job by following the rendering with Farm guide.