Deploying Omniverse Farm on AWS


1. Introduction

It is possible to deploy Omniverse Farm on AWS using Amazon Elastic Kubernetes Service (EKS). EKS gives you a managed Kubernetes cluster, reducing the overhead of maintaining the cluster itself.

This guide does not cover how to deploy an EKS cluster, but it does cover the prerequisites for running Omniverse Farm on an EKS cluster.

It is recommended to read through this guide, as well as the deployment guide linked below, before starting the deployment to make sure all prerequisites are fulfilled.

2. Prerequisites

A. AWS Configuration

If you are familiar with AWS, but not EKS, then we recommend starting with the user guide to get a high-level overview and then working through the Amazon EKS workshop to gain familiarity with the topic.

The general AWS EKS documentation provides details on getting started, best practices, the API surface, and using the AWS EKS CLI.

Note: if you are upgrading a pre-existing EKS cluster from a version earlier than 1.24, it is recommended to familiarize yourself with the Dockershim deprecation. If starting from 1.24, no intervention is required.

To deploy Omniverse Farm on AWS, an adequately sized cluster must be set up and configured for use. It is expected that a user has an AWS account with appropriate EC2 service quotas for the desired instance type(s) in a specified region. These EC2 instances are expected to be part of a VPC with configured security groups and subnets, and the EKS cluster must be running a supported version of Kubernetes.

Typically, at least two types of node configurations are needed depending on the type of workload:

  • One or more node(s) and/or node group(s) configured for Farm services.

  • One or more node(s) and/or node group(s) configured for Farm workers. This typically includes:

    • Non-GPU workloads.

    • GPU workloads (T4/A10/A100 GPU required) running on supported accelerated computing instance types (P4, G5, G4dn families) using a supported x86 accelerated EKS optimized Amazon Linux AMI.
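As a rough illustration of the GPU worker requirement above, the following Python sketch checks whether an EC2 instance type name belongs to one of the listed accelerated computing families. The helper name is hypothetical, and the mapping of the "P4" family to the `p4d` and `p4de` prefixes is an assumption; verify it against the instance types available in your region.

```python
# Instance-type prefixes assumed to correspond to the P4, G5, and G4dn
# families named in this guide. This mapping is an assumption, not an
# official list.
SUPPORTED_GPU_FAMILIES = {"p4d", "p4de", "g5", "g4dn"}


def is_supported_gpu_instance(instance_type: str) -> bool:
    """Return True if an instance type such as 'g5.2xlarge' is in a listed family.

    EC2 instance types follow the pattern '<family>.<size>', so the family
    is everything before the first dot. Exact matching keeps look-alikes
    such as 'g5g' (a different family) from passing.
    """
    family, sep, size = instance_type.strip().lower().partition(".")
    if not sep or not size:
        return False  # not in '<family>.<size>' form
    return family in SUPPORTED_GPU_FAMILIES


if __name__ == "__main__":
    for candidate in ("g5.2xlarge", "g4dn.xlarge", "m5.large"):
        print(candidate, is_supported_gpu_instance(candidate))
```

A check like this only validates the naming; the node group still needs a supported accelerated EKS-optimized AMI as described above.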

Additional considerations:

Note

This document aims to be unopinionated and will not describe how to set up and manage any of the additional resources.

It will assume that the various services can be reached from outside the cluster (e.g. Ingress <–> AWS Application Load Balancer) and that the application has been securely configured (e.g. through configured Security Groups and/or Web Application Firewall ACLs).

B. EKS version

Omniverse Farm has been tested on Kubernetes versions 1.22 and higher. We’d recommend using, where possible, EKS 1.24 or higher.
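The version guidance above can be turned into a small pre-flight check. The sketch below is a minimal Python example, assuming you already have the cluster's Kubernetes version as a string (for instance, the version reported for the cluster by the EKS console or API). The function names and the three labels it returns are hypothetical and are not part of Omniverse Farm or the AWS tooling.

```python
# Thresholds taken from this guide: tested on 1.22+, 1.24+ recommended.
MIN_TESTED = (1, 22)
RECOMMENDED = (1, 24)


def parse_k8s_version(version: str) -> tuple[int, int]:
    """Parse strings like '1.24', 'v1.24', or '1.27.3-eks-abc' into (major, minor)."""
    parts = version.strip().lstrip("v").split(".")
    major = int(parts[0])
    # Keep only the leading digits of the minor component (e.g. '24+' -> 24).
    minor_digits = ""
    for ch in parts[1]:
        if not ch.isdigit():
            break
        minor_digits += ch
    return major, int(minor_digits)


def classify_version(version: str) -> str:
    """Return 'untested', 'tested', or 'recommended' for a Kubernetes version."""
    parsed = parse_k8s_version(version)
    # Tuples compare numerically, so (1, 9) sorts below (1, 22); comparing
    # the raw strings would get this wrong.
    if parsed < MIN_TESTED:
        return "untested"
    if parsed < RECOMMENDED:
        return "tested"
    return "recommended"


if __name__ == "__main__":
    for v in ("1.21", "1.22", "1.24", "v1.27.3"):
        print(v, classify_version(v))
```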

2. Considerations

A. Security

It is strongly recommended not to expose Omniverse Farm to the public internet at this time. Farm does not ship with authentication or authorization (authN/authZ) and has only limited token-based authentication for job submission. If public exposure is a technical requirement for your organization, be sure to restrict access to public endpoints (e.g. with security groups, AWS WAF, etc.).

Consult with your organization’s security team to best determine how to properly secure AWS, EKS, and Omniverse Farm (see Security in Amazon EKS for more details).

B. Capacity Tuning

The Omniverse Farm controller’s maximum job capacity can be tuned in farm-values.yaml. This setting limits the number of jobs that can run in parallel and may be useful in mixed environments where Farm shares the Kubernetes cluster with other workloads.

controller:
   serviceConfig:
      capacity:
         max_capacity: 32

Note

Cluster Autoscaling

Cluster autoscaling is highly coupled with the configuration of worker node(s) and/or node group(s) within the cluster and is outside the scope of this document. It is typically achieved with the Kubernetes Cluster Autoscaler and/or the open-source project Karpenter.

Please refer to the official AWS autoscaling documentation for more details.

C. Number of GPUs

Omniverse Farm will parallelize work based on the number of available GPUs. Once work has been assigned to a GPU, it will occupy the GPU until it completes.

In a production environment, it will take some experimentation to determine the optimal number of GPUs for the work being performed.
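To get a first feel for how GPU count affects turnaround before experimenting on real hardware, a simplified model can help. The Python sketch below assumes one job per GPU, jobs dispatched in submission order to whichever GPU frees up first, and no scheduling or container start-up overhead. These are modeling assumptions, not a description of how Farm schedules work internally, and the function name is hypothetical.

```python
import heapq


def estimate_batch_minutes(job_minutes: list[float], gpu_count: int) -> float:
    """Estimate wall-clock minutes to finish a batch on `gpu_count` GPUs.

    Each job occupies one GPU until it completes, and the next job goes to
    the GPU that becomes free earliest.
    """
    if gpu_count < 1:
        raise ValueError("gpu_count must be at least 1")
    # Min-heap of the times at which each GPU next becomes free.
    free_at = [0.0] * gpu_count
    heapq.heapify(free_at)
    finish = 0.0
    for duration in job_minutes:
        start = heapq.heappop(free_at)  # earliest available GPU
        end = start + duration
        heapq.heappush(free_at, end)
        finish = max(finish, end)
    return finish


if __name__ == "__main__":
    batch = [10.0] * 8  # eight hypothetical 10-minute render jobs
    for gpus in (1, 4, 8):
        print(gpus, "GPU(s):", estimate_batch_minutes(batch, gpus), "minutes")
```

Comparing the estimate across a few GPU counts shows where adding GPUs stops shortening the batch, which is a reasonable starting point for the experimentation described above.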

D. Storage

Hard drive size selection must take into consideration both the containers being used and the types of jobs being executed.

Omniverse USD Composer (used for running various jobs) executes inside a large container (multiple gigabytes in size), and the instance must have sufficient storage for that container. Generally, an EBS volume of around 100 GB is a good starting point, but this is highly coupled with the requirements and workflow of your project.

If writing data to S3, data may first be written temporarily to the running instance. As such, the instance must have sufficient storage for any temporary files (these can be fairly large for rendering-related jobs). The exact amount depends on the workload and its data management implementation.

A cluster’s exact needs will be determined by the jobs the cluster is meant to execute.
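The storage considerations above can be combined into a back-of-the-envelope estimate. The Python sketch below adds the container image size, per-job temporary files, and a base allowance for the operating system, then applies headroom. Every default and example number is an illustrative assumption rather than a measured or recommended value, and the function name is hypothetical.

```python
import math


def estimate_volume_gb(
    image_gb: float,
    scratch_gb_per_job: float,
    concurrent_jobs: int,
    os_gb: float = 20.0,     # assumed allowance for OS and system files
    headroom: float = 0.25,  # assumed 25% safety margin
) -> int:
    """Return a rounded-up volume size in GB for one worker node."""
    if min(image_gb, scratch_gb_per_job, os_gb, headroom) < 0 or concurrent_jobs < 0:
        raise ValueError("inputs must be non-negative")
    # Temporary files scale with the number of jobs running at once.
    raw_gb = os_gb + image_gb + scratch_gb_per_job * concurrent_jobs
    return math.ceil(raw_gb * (1 + headroom))


if __name__ == "__main__":
    # Hypothetical node: 15 GB image, 10 GB of temp files per job, 4 jobs.
    print(estimate_volume_gb(image_gb=15, scratch_gb_per_job=10, concurrent_jobs=4))
```

With these example inputs the estimate lands close to the 100 GB starting point mentioned above, but real workloads should be measured before sizing volumes.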

Note

It is good practice to begin with oversized resources and then either pare back or grow into them as necessary, rather than run an undersized cluster that may trigger alarms or become unavailable due to resource starvation.

E. Management Services

Multiple services handle communication, life cycle, and interaction across the Omniverse Farm cluster. These services are considered memory intensive and should be treated as such. They include the agents, controller, dashboard, jobs, logs, metrics, retries, settings, tasks, and UI services.

F. Ingress

Omniverse Farm does not deploy an Ingress. To reach the services from outside the Kubernetes cluster, an Ingress may be required. On AWS, an Application Load Balancer ingress is available; see its documentation for details.

3. Deployment

With the EKS cluster configured, the deployment steps are identical to those in the general Kubernetes deployment documentation. Please follow that guide to continue with the installation of Omniverse Farm.