Deploying Omniverse Farm on OCI#

1. Introduction#

Omniverse Farm can be deployed on Oracle Cloud Infrastructure (OCI) using Oracle Container Engine for Kubernetes (OKE).

OKE provides a managed Kubernetes cluster, which reduces the overhead of maintaining the cluster itself.

We recommend reading through this guide, as well as the deployment guide linked below, before starting the deployment to make sure all prerequisites are fulfilled.

2. Prerequisites#

A. OKE Configuration#

If you are familiar with Oracle Cloud, but not OKE, we recommend starting with the user guide to get a high-level overview and then working through the OKE tutorial to gain familiarity with the topic.

To deploy Omniverse Farm on OKE, an adequately sized cluster must be set up and configured for use. It is expected that the user has an Oracle Cloud account with appropriate quotas for the desired instance type(s) in the chosen region.

Typically, at least two types of node configurations are needed depending on the type of workload:

  • One or more node(s) and/or node group(s) configured for Farm services.

  • One or more node(s) and/or node group(s) configured for Farm workers. This typically includes:

Deploying the NVIDIA device plugin is covered later in the installation instructions. OCI, however, also offers options with pre-configured instances <https://blogs.oracle.com/cloud-infrastructure/post/announcing-oracle-container-engine-for-kubernetes-oke-support-for-gpu>.

Additional considerations:

Note

This document aims to be unopinionated and does not describe how to set up and manage any of the additional resources.

It assumes that the various services can be reached from outside the cluster and that the application has been securely configured.

B. OKE version#

Omniverse Farm has been tested on Kubernetes versions 1.22 and higher. We recommend using OKE 1.24 or higher where possible. Available versions:

3. Considerations#

A. Security#

It is strongly recommended not to expose Omniverse Farm to the public internet. Farm does not ship with authN/authZ and has only limited authentication for job submission via tokens. If public exposure is a technical requirement for your organization, be sure to restrict access to public endpoints (e.g. security groups, firewalls, etc.). Consult with your organization’s security team to determine how best to secure Oracle Cloud, OKE, and Omniverse Farm (see Security in OKE <https://docs.oracle.com/en-us/iaas/Content/Security/Reference/oke_security.htm> for more details).

B. Capacity Tuning#

The Omniverse Farm controller’s maximum job capacity can be tuned in farm-values.yaml. This setting limits the number of jobs that can run in parallel and may be useful in mixed environments where the Kubernetes cluster is shared with other workloads.

controller:
   serviceConfig:
      capacity:
         max_capacity: 32

Note

Cluster Autoscaling

Cluster autoscaling is highly coupled with the configuration of the worker node(s) and/or node group(s) within the cluster and is outside the scope of this document.

Please refer to the Official OKE Autoscaling documentation for more details.

C. Number of GPUs#

Omniverse Farm will parallelize work based on the number of available GPUs. Once work has been assigned to a GPU, it will occupy the GPU until it completes.

In a production environment, it will take some experimentation to determine the optimal number of GPUs for the work being performed.
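As a rough illustration of how a workload claims a GPU in Kubernetes, the sketch below is a generic Pod manifest, not a Farm job definition. It assumes the NVIDIA device plugin is running on the GPU nodes so that the `nvidia.com/gpu` resource is advertised. The Pod name and the container image tag are assumptions chosen for the example. A Pod like this can serve as a quick check that GPU nodes are schedulable before any Farm jobs are submitted.

```yaml
# Minimal GPU smoke test (illustrative only; not part of the Farm Helm chart).
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test            # hypothetical name
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      # Assumed image tag; any CUDA base image that ships nvidia-smi should work.
      image: nvidia/cuda:12.2.0-base-ubuntu22.04
      command: ["nvidia-smi"]
      resources:
        limits:
          # Requests one whole GPU; the Pod holds it until the container exits.
          nvidia.com/gpu: 1
```

If the Pod stays in Pending, the cluster most likely has no free GPU, or the device plugin is not yet deployed on the worker nodes.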

D. Storage#

Hard drive size selection must take into consideration both the containers being used and the types of jobs being executed.

Omniverse USD Composer (used for running various jobs) executes inside a large container (multiple gigabytes in size) and must have sufficient storage for the container. Generally, a volume of around 100GB is a good starting point, but this is highly coupled with the requirements and workflow of your project.

If jobs write data to OCI Object Storage, the data may first be written temporarily to the running instance. As such, the instance must have sufficient storage for any temporary files (these can be fairly large for rendering-related jobs). The amount needed depends on the workloads and how each of them manages its data.

A cluster’s exact needs will be determined by the jobs the cluster is meant to execute.
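One generic Kubernetes mechanism for making temporary disk usage explicit is the `ephemeral-storage` resource together with a size-limited `emptyDir` volume. The sketch below is a standalone Pod, not a Farm job definition. The names, image, and sizes are placeholders chosen for illustration, and whether Farm job definitions expose these settings is not covered here.

```yaml
# Illustrative Pod that reserves node-local scratch space (hypothetical values).
apiVersion: v1
kind: Pod
metadata:
  name: scratch-space-example     # hypothetical name
spec:
  restartPolicy: Never
  containers:
    - name: worker
      image: busybox:1.36         # placeholder image for the example
      command: ["sh", "-c", "df -h /scratch"]
      resources:
        requests:
          ephemeral-storage: "20Gi"   # scheduler only picks nodes with this much free
        limits:
          ephemeral-storage: "50Gi"   # Pod is evicted if it exceeds this
      volumeMounts:
        - name: scratch
          mountPath: /scratch
  volumes:
    - name: scratch
      emptyDir:
        sizeLimit: "50Gi"         # cap on the temporary volume itself
```

Setting a request makes undersized nodes visible at scheduling time instead of surfacing later as disk-pressure evictions.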

Note

It is good practice to begin with oversized resources and then pare back or grow into them as necessary, rather than start with an undersized cluster that may raise alarms or become unavailable due to resource starvation.

E. Management Services#

Multiple services handle communication, life cycle, and interaction across the Omniverse Farm cluster. These instances are considered memory intensive and should be treated as such. These services include the agents, controller, dashboard, jobs, logs, metrics, retries, settings, tasks, and UI services.

F. Ingress#

Omniverse Farm does not deploy an Ingress. To reach the services from outside the Kubernetes cluster, an Ingress may be required. On OKE, a load balancer is available.
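For orientation, a standard Kubernetes Ingress resource might look like the sketch below. It assumes an ingress controller (here an NGINX class) is already installed in the cluster. The namespace, host name, Service name, and port are hypothetical placeholders and should be replaced with the values from your own Farm deployment.

```yaml
# Illustrative Ingress (all names below are placeholders, not Farm defaults).
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: farm-ingress              # hypothetical name
  namespace: ov-farm              # hypothetical namespace
spec:
  ingressClassName: nginx         # assumes an NGINX ingress controller is installed
  rules:
    - host: farm.example.internal # placeholder host, ideally resolvable only internally
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: farm-dashboard   # hypothetical Service name
                port:
                  number: 80           # hypothetical Service port
```

In line with the security considerations above, prefer an internal host name and restrict the load balancer to private networks where possible.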

4. Deployment#

With the OKE cluster configured, the deployment steps are identical to the general Kubernetes deployment documentation. Please follow this guide to continue with the installation of Omniverse Farm: Guide