Deploy to Cloud

The following document describes configuring and deploying Omniverse Farm with Amazon Elastic Kubernetes Service (EKS). The intended audience is experienced systems administrators familiar with AWS, Kubernetes, and the deployment of Helm charts.

Note

Deploy to Cloud is in its early stages and is expected to mature in both ease of deployment and accessibility throughout the Omniverse Platform. If you have trouble or concerns, please make your voice heard on the Omniverse Forums.

1. Prerequisites

A. AWS Configuration

If you are familiar with AWS but not EKS, we recommend starting with the user guide to get a high-level overview and then working through the Amazon EKS Workshop to gain familiarity with the topic.

General AWS EKS documentation can be found here and provides details on getting started, best practices, the API surface, and using the AWS EKS CLI.

Note: If upgrading a pre-existing EKS cluster from a version earlier than 1.24, it is recommended to familiarize yourself with the Dockershim deprecation. If starting from 1.24 or later, no intervention is required.

To deploy Omniverse Farm on AWS, an adequately sized cluster must be set up and configured for use. You are expected to have an AWS account with appropriate EC2 service quotas for the desired instance type(s) in your chosen region. The EC2 instances are expected to be part of a VPC with configured security groups and subnets, and the EKS cluster must be running a supported Kubernetes version.

Typically, at least two types of node configurations are needed, depending on the type of workload (an illustrative eksctl sketch follows this list):

  • One or more node(s) and/or node group(s) configured for Farm services.

  • One or more node(s) and/or node group(s) configured for Farm workers. This typically includes:

    • Non-GPU workloads.

    • GPU workloads (T4/A10/A100 GPU required) running on supported accelerated computing instance types (P4, G5, G4dn families) using a supported x86 accelerated EKS optimized Amazon Linux AMI.
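For reference, node groups like these can be created with eksctl. This is an illustrative sketch only; the cluster name, node group names, instance types, counts, and labels are placeholder values to be adapted to your environment:

# Illustrative only: node group for Farm services (non-GPU).
eksctl create nodegroup \
   --cluster my-farm-cluster \
   --name farm-services \
   --node-type m5.xlarge \
   --nodes 2 \
   --node-labels "role=farm-services"

# Illustrative only: node group for GPU Farm workers (G4dn family).
eksctl create nodegroup \
   --cluster my-farm-cluster \
   --name farm-workers-gpu \
   --node-type g4dn.xlarge \
   --nodes 1 --nodes-min 1 --nodes-max 4 \
   --node-labels "role=farm-workers"

Alternatively, node groups can be defined declaratively in an eksctl cluster config file or managed through your infrastructure-as-code tooling of choice.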

Additional considerations:

Note

This document aims to be unopinionated and will not describe how to set up and manage any of the additional resources.

It will assume that the various services can be reached from outside the cluster (e.g. an Ingress backed by an AWS Application Load Balancer) and that the application has been securely configured (e.g. through configured Security Groups and/or Web Application Firewall ACLs).

B. NVIDIA Device Plugin

The Kubernetes cluster must have the NVIDIA Device Plugin installed. This plugin provides a daemonset that automatically exposes the number of GPUs available on each node, keeps track of GPU health, and allows GPU-enabled containers to run.

By default, the NVIDIA Device Plugin daemonset runs on all nodes in the cluster. A nodeSelector can be used to restrict the daemonset to GPU nodes only.

On Helm install:

--set nodeSelector.<LABEL>=<VALUE>

Or via a values file:

# nvidia-device-plugin-values.yaml
nodeSelector:
   <LABEL>: <VALUE>
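The plugin itself is typically installed via its Helm chart. The following is a minimal sketch using the values file above; the release name and namespace are illustrative, and you should refer to the plugin's documentation for the current chart location and version:

helm upgrade --install nvidia-device-plugin nvidia-device-plugin \
   --repo https://nvidia.github.io/k8s-device-plugin \
   --namespace nvidia-device-plugin \
   --create-namespace \
   -f nvidia-device-plugin-values.yaml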

Once the NVIDIA Device Plugin has been installed, you can verify the number of GPUs available on the nodes with the following command:

kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"

2. Considerations

A. Security

It is strongly recommended not to expose Omniverse Farm to the public internet. Farm does not ship with authN/authZ and has only limited, token-based authentication for job submission. If public exposure is a requirement for your organization, be sure to restrict access to the public endpoints (e.g. via security groups, AWS WAF, etc.).

Consult with your organization’s security team to best determine how to properly secure AWS, EKS, and Omniverse Farm (see Security in Amazon EKS for more details).
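For example, if a Farm service is exposed through a Kubernetes LoadBalancer Service (as shown in section 3.B), access can be restricted to known CIDR ranges. The Service name, namespace, and CIDR below are placeholders:

# Restrict a LoadBalancer Service to an internal CIDR range (placeholder values).
kubectl patch svc <dashboard-service> --namespace <farm-namespace> \
   --patch '{"spec": {"loadBalancerSourceRanges": ["10.0.0.0/8"]}}'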

B. Capacity Tuning

The Omniverse Farm controller's maximum job capacity can be tuned by configuring farm-values.yaml. This limits the number of jobs that can run in parallel and may be useful in mixed environments where the Kubernetes cluster is shared with other workloads.

controller:
   serviceConfig:
      capacity:
         max_capacity: 32

Note

Cluster Autoscaling

Cluster autoscaling is tightly coupled with the configuration of worker node(s) and/or node group(s) within the cluster and is outside the scope of this document. It is typically achieved with the Kubernetes Cluster Autoscaler and/or the open-source project Karpenter.

Please refer to the Official AWS Autoscaling documentation for more details.

C. Number of GPUs

Omniverse Farm will parallelize work based on the number of available GPUs. Once work has been assigned to a GPU, it will occupy the GPU until it completes.

In a production environment, it will take some experimentation to determine the optimal number of GPUs for the work being performed.

D. Storage

Hard drive size selection must take into consideration both the containers being used and the types of jobs being executed.

Omniverse Create (used for running various jobs) executes inside a large container (multiple gigabytes in size) and must have sufficient storage for the container. Generally, an EBS Volume around 100GB is a good starting point, but this is highly coupled with the requirements and workflow of your project.

If writing data to S3, data must first temporarily be written to the running instance. As such, the instance must have sufficient storage for any temporary files (this can be fairly large for rendering related jobs).

A cluster’s exact needs will be determined by the jobs the cluster is meant to execute.

Note

It is good practice to begin with oversized resources and then pare back or grow into them as necessary, rather than have an undersized cluster that may raise alarms or become unavailable due to resource starvation.

E. Management Services

Multiple services handle communication, lifecycle management, and interaction across the Omniverse Farm cluster. These services are memory-intensive and should be sized accordingly. They include the agents, controller, dashboard, jobs, logs, metrics, retries, settings, tasks, and UI services.
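If the Kubernetes metrics-server is installed in the cluster, the actual memory consumption of these services can be inspected with kubectl top (using the ov-farm namespace from the deployment steps below):

kubectl top pods --namespace ov-farm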

3. Deploying the Helm Chart

A. Prerequisites

Local:

  • kubectl, Helm, and the NGC CLI installed and configured (used throughout the steps below).

  • Python 3 with pip (used to run the job definition upload scripts).

Cluster:

  • NVIDIA driver version 470.52.02 (should be preinstalled with current accelerated AMIs listed above).

  • NVIDIA k8s-device-plugin (see section 1.B).

  • NVIDIA Container Toolkit (should be preinstalled with current accelerated AMIs listed above).

  • It is assumed that a method of targeting specific nodes is utilized (e.g. nodeSelector).

B. Deploying Farm

In this step, we will deploy the Omniverse Farm Helm chart. This document will provide a step-by-step guide for AWS EKS. For more advanced cases or other cloud providers, feel free to examine the Helm chart itself and determine the best approach for your provider.

A full set of all resources (containers, Helm charts, job definitions) can be found in this collection.

All steps use the following values; feel free to change them at your discretion. For this guide, we will assume:

NAMESPACE=ov-farm
SECRET=registry-secret
NGC_API_TOKEN=<your_token>

Step 1:

First, we will create a namespace for Omniverse Farm.

kubectl create namespace $NAMESPACE

As the container images referenced within the Helm chart are private, you will need to create a secret within your cluster namespace to provide your cluster with the NGC API token.

kubectl create secret docker-registry $SECRET \
   --namespace $NAMESPACE \
   --docker-server="nvcr.io" \
   --docker-username='$oauthtoken' \
   --docker-password=$NGC_API_TOKEN
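Optionally, confirm that the secret exists before continuing:

kubectl get secret $SECRET --namespace $NAMESPACE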

Step 2:

Create a farm-values.yaml file. This file will be used for specifying overrides during the installation.

Note

Replace SECRET with the name of the secret created in Step 1 (registry-secret in this guide).

global:
   imagePullSecrets:
      - name: SECRET

controller:
   serviceConfig:
      k8s:
         jobTemplateSpecOverrides:
            imagePullSecrets:
               - name: SECRET
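As a convenience, and assuming the SECRET variable from Step 1 is still set in your shell, the same file can be generated with a heredoc so the secret name is substituted automatically:

cat > farm-values.yaml <<EOF
global:
   imagePullSecrets:
      - name: ${SECRET}

controller:
   serviceConfig:
      k8s:
         jobTemplateSpecOverrides:
            imagePullSecrets:
               - name: ${SECRET}
EOF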

Note

Additional overrides may be required in the farm-values.yaml file.

For example, to have the dashboard service use a LoadBalancer Service type and target t3.medium instance types (if available in your cluster), you could add something like the following:

dashboard:
  nodeSelector:
    node.kubernetes.io/instance-type: t3.medium

  service:
    type: LoadBalancer
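If the dashboard is exposed through a LoadBalancer as above, the external hostname assigned by AWS can be looked up once the Service has been provisioned. The exact Service name depends on the Helm release, so list the Services first; <dashboard-svc> is a placeholder:

kubectl get svc --namespace $NAMESPACE
kubectl get svc <dashboard-svc> --namespace $NAMESPACE \
   -o jsonpath='{.status.loadBalancer.ingress[0].hostname}'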

Step 3:

Install the Omniverse Farm Helm chart:

helm upgrade \
   --install \
   --create-namespace \
   --namespace $NAMESPACE \
   omniverse-farm \
   omniverse-farm \
   -f farm-values.yaml \
   --repo https://helm.ngc.nvidia.com/nvidia/omniverse-farm \
   --username='$oauthtoken' \
   --password=$NGC_API_TOKEN
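Before validating individual services, it can be useful to confirm that the release installed and that its pods were scheduled:

helm list --namespace $NAMESPACE
kubectl get pods --namespace $NAMESPACE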

Step 4:

Validate the installation.

Ensure that all pods are in the ready state before proceeding.

kubectl -n $NAMESPACE wait --timeout=300s --for condition=Ready pods --all

The following command creates a curl pod in the namespace that will allow us to query the various service endpoints.

(For more details on this, refer to the Official Kubernetes service networking documentation):

kubectl run curl --namespace=$NAMESPACE --image=radial/busyboxplus:curl -i --tty -- sh

The following code block defines two functions that facilitate querying if the various services are up:

check_endpoint() {
   url=$1
   curl -s -o /dev/null "$url" && echo -e "[UP]\t${url}" || echo -e "[DOWN]\t${url}"
}

check_farm_status() {
   echo "======================================================================"
   echo "Farm status:"
   echo "----------------------------------------------------------------------"
   check_endpoint "http://agents.ov-farm/queue/management/agents/status"
   check_endpoint "http://dashboard.ov-farm/queue/management/dashboard/status"
   check_endpoint "http://jobs.ov-farm/queue/management/jobs/status"
   check_endpoint "http://jobs.ov-farm/queue/management/jobs/load"
   check_endpoint "http://logs.ov-farm/queue/management/logs/status"
   check_endpoint "http://retries.ov-farm/queue/management/retries/status"
   check_endpoint "http://tasks.ov-farm/queue/management/tasks/status"
   check_endpoint "http://tasks.ov-farm/queue/management/tasks/list?status=submitted"
   echo "======================================================================"
}

Once you have the functions available in your curl pod, you can query the status of Omniverse Farm by running:

check_farm_status

Output should be similar to:

======================================================================
Farm status:
----------------------------------------------------------------------
[UP]     http://agents.ov-farm/queue/management/agents/status
[UP]     http://dashboard.ov-farm/queue/management/dashboard/status
[UP]     http://jobs.ov-farm/queue/management/jobs/status
[UP]     http://jobs.ov-farm/queue/management/jobs/load
[UP]     http://logs.ov-farm/queue/management/logs/status
[UP]     http://retries.ov-farm/queue/management/retries/status
[UP]     http://tasks.ov-farm/queue/management/tasks/status
[UP]     http://tasks.ov-farm/queue/management/tasks/list?status=submitted
======================================================================

This validates that all Farm services are running and accessible.

Step 5:

Now that you have confirmed that Farm services are available, it is time to run a simple job. This job definition runs the df command using the busybox container image.

Use the following command from the NGC CPU Resource setup documents to download the example df.kit job and sample upload script:

ngc registry resource download-version "nvidia/omniverse-farm/cpu_verification:1.0.0"

Next, retrieve a token from Omniverse Farm for use in uploading jobs:

kubectl get cm omniverse-farm-jobs -o yaml -n $NAMESPACE | grep api_key

The token is unique per Farm instance and must be kept secure.

Two libraries are required by the Python upload script; install them with:

pip install requests
pip install toml

Finally, from the directory the files were downloaded into, execute the following script to upload the job definition to your cluster:

python ./job_definition_upload.py df.kit --farm-url=<URL to the instance of Omniverse Farm> --api-key=<API Key as retrieved in previous step>

The job definition may take up to a minute to propagate to the various services in the cluster.

Step 6:

After a few moments, it should be safe to submit a job to Omniverse Farm for scheduling. Execute the following snippet (found in the NGC CPU Resource Quick Start Guide) to submit a job:

export FARM_URL=<REPLACE WITH URL OF OMNIVERSE FARM INSTANCE>
curl -X "POST" \
  "${FARM_URL}/queue/management/tasks/submit" \
  -H 'Accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "user": "testuser",
  "task_type": "df",
  "task_args": {},
  "metadata": {
    "_retry": {
      "is_retryable": false
    }
  },
  "status": "submitted"
}'

After submitting, you should be able to navigate to ${FARM_URL}/queue/management/dashboard, enter a username (this can be anything you want, as no authentication is present), and observe your task in the task list.
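The submission can also be verified without the dashboard by querying the tasks service directly, reusing the list endpoint from the validation step:

curl -s "${FARM_URL}/queue/management/tasks/list?status=submitted"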

Step 7:

Now that you have confirmed that Farm services are available and that you can run a simple job, it is time to run a GPU workload. This job definition runs the gpu command using the nvidia-cuda container image.

Use the following command from the NGC GPU Resource setup documents to download the example gpu.kit job and sample upload script:

ngc registry resource download-version "nvidia/omniverse-farm/gpu_verification:1.0.0"

Next, retrieve a token from Omniverse Farm for use in uploading jobs:

kubectl get cm omniverse-farm-jobs -o yaml -n $NAMESPACE | grep api_key

This token is unique per Farm instance and must be kept secure.

Two libraries are required by the Python upload script; install them with:

pip install requests
pip install toml

Finally, from the directory the files were downloaded into, execute the following script to upload the job definition to your cluster:

python ./job_definition_upload.py gpu.kit --farm-url=<URL to the instance of Omniverse Farm> --api-key=<API Key as retrieved in previous step>

The job definition may take up to a minute to propagate to the various services in the cluster.

Step 8:

After a few moments, it should be safe to submit a job to Omniverse Farm for scheduling. Execute the following snippet (found in the NGC GPU Resource Quick Start Guide) to submit a job:

export FARM_URL=<REPLACE WITH URL OF OMNIVERSE FARM INSTANCE>
curl -X "POST" \
  "${FARM_URL}/queue/management/tasks/submit" \
  -H 'Accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "user": "testuser",
  "task_type": "gpu",
  "task_args": {},
  "metadata": {
    "_retry": {
      "is_retryable": false
    }
  },
  "status": "submitted"
}'

After submitting, you should be able to navigate to ${FARM_URL}/queue/management/dashboard, enter a username (this can be anything you want, as no authentication is present), and observe your task in the task list.

Conclusion

At this point, your cluster should have a working installation of Omniverse Farm capable of running basic jobs. Omniverse Farm on EKS can run any containerized workload, so it is worth taking a closer look at the job definitions to see how workloads are structured and how you might onboard your own. We recommend reading the Farm documentation on job definitions.

In the next section, we will onboard a rendering workflow using Omniverse Create.

4. Batch Rendering Workloads

Omniverse Farm can be used as a powerful distributed rendering solution.

A. Configuring storage

Before we dive into the workload itself, there are some considerations regarding data access:

I. Nucleus

Omniverse Farm jobs can be configured to connect directly to a Nucleus instance. This assumes that the Nucleus instance is accessible from the cloud, either via AWS Direct Connect or by having an instance deployed in AWS and configured to be reachable from the Omniverse Farm cluster.

Step 1:

Create a Nucleus account that can be used as a service account.

Step 2:

Update the job definitions that need access to Nucleus by adding the OMNI_USER and OMNI_PASS environment variables. For example, the create-render job definition would look like the following if the user and password from Step 1 were foo and bar:

[job.create-render]
job_type = "kit-service"
name = "create-render"
command = "/startup.sh"
# There is inconsistency with how args are parsed within Kit.
# This is why --enable arguments have a space in them as they do not support `--enable=`
# They will however be split into individual args when submitting them
args = [
    "--enable omni.services.render",
    "--/app/file/ignoreUnsavedOnExit=true",
    "--/app/extensions/excluded/0=omni.kit.window.privacy",
    "--/app/hangDetector/enabled=0",
    "--/app/asyncRendering=false",
    "--/rtx/materialDb/syncLoads=true",
    "--/omni.kit.plugin/syncUsdLoads=true",
    "--/rtx/hydra/materialSyncLoads=true",
    "--/rtx-transient/resourcemanager/texturestreaming/async=false",
    "--/rtx-transient/resourcemanager/enableTextureStreaming=false",
    "--/exts/omni.kit.window.viewport/blockingGetViewportDrawable=true",
    "--ext-folder", "/opt/nvidia/omniverse/farm-jobs/farm-job-create-render/exts-job.omni.farm.render",
    "--/crashreporter/dumpDir=/tmp/renders",
    # Example code to set up pushing metrics to a Prometheus push gateway.
    #"--/exts/services.monitoring.metrics/push_metrics=true",
    #"--/exts/services.monitoring.metrics/job_name=create_render",
    #"--/exts/services.monitoring.metrics/push_gateway=http://localhost:9091"
]
task_function = "render.run"
headless = true
log_to_stdout = true
container = "nvcr.io/nvidia/omniverse/create-render:2022.2.1"

[job.create-render.env]
OMNI_USER = "foo"
OMNI_PASS = "bar"

Step 3:

Upload the job definition to the Farm as explained previously and submit the jobs.

It should now be possible to read the files from Nucleus and upload back any results.

II. Kubernetes Persistent Volumes

Farm jobs can be configured to write to Persistent Volumes, which can expose a variety of storage solutions to jobs. We will not cover how to configure the backend storage or the PVs themselves. To configure NFS-type storage in AWS, please use the following guide.
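For illustration only, a minimal PersistentVolumeClaim against an already-provisioned storage class might look like the following; the claim name, storage class, and size are placeholders, and the claim name must match the claimName referenced in the job definition:

cat <<EOF | kubectl apply --namespace $NAMESPACE -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: output-storage-claim
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: <your-storage-class>
  resources:
    requests:
      storage: 100Gi
EOF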

Step 1:

After configuring a PV, it can be mounted by configuring the capacity_requirements section (see: resource limits) in a job definition. For example, to mount a volume named output-storage into a pod at /data/output, the job definition can be updated as below:

[capacity_requirements]

[[capacity_requirements.resource_limits]]
cpu = 2
memory = "14Gi"
"nvidia.com/gpu" = 1

[[capacity_requirements.volume_mounts]]
mountPath = "/data/output"
name = "output-storage"

[[capacity_requirements.volumes]]
name = "output-storage"
[capacity_requirements.volumes.persistentVolumeClaim]
claimName = "aws-credentials"

Step 2:

Upload the job definition to the Farm as previously explained and then submit the jobs.

It should now be possible to read and write (depending on the permissions on the Persistent Volume) data from /data/output.

B. Onboarding create-render job definition

Step 1:

With storage now configured, the create-render job can be onboarded.

Use the following command from the NGC create-render resource setup documents to download the example job.omni.farm.render.kit job and sample upload script:

ngc registry resource download-version "nvidia/omniverse-farm/create-render:2022.2.1"

Step 2:

Based on the selected storage solution, add the required job definition updates to the job.omni.farm.render.kit file.

Step 3:

Next, retrieve a token from Omniverse Farm for use in uploading jobs:

kubectl get cm omniverse-farm-jobs -o yaml -n $NAMESPACE | grep api_key

The token is unique per Farm instance and must be kept secure.

Two libraries are required by the Python upload script; install them with:

pip install requests
pip install toml

Finally, from the directory the files were downloaded into, execute the following script to upload the job definition to your cluster:

python ./job_definition_upload.py job.omni.farm.render.kit --farm-url=<URL to the instance of Omniverse Farm> --api-key=<API Key as retrieved in previous step>

The job definition may take up to a minute to propagate to the various services in the cluster.

Step 4:

With the job definition onboarded, it is possible to submit a render job by following the Rendering with Farm guide.