Requirements#
Kubernetes#
| General | |
|---|---|
| Kubernetes | version 1.29 |
| Network | IPv4 only |
Roles and Permissions#
Installation Permissions#
The user/operator installing the streaming microservices with Helm is expected to have the necessary privileges on the Kubernetes cluster. This includes permissions to create, update, and delete the following Kubernetes resources:
ClusterRole
ClusterRoleBinding
ConfigMap
DaemonSet
Deployment
Ingress
Namespace
Secret
Service
ServiceAccount
ServiceMonitor
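Before running the Helm installation, it can be useful to confirm that your current credentials allow creating these resource types. The following is a minimal sketch using kubectl's built-in authorization check; the namespace is a placeholder and the resource list is illustrative, not exhaustive:
# Cluster-scoped resources
kubectl auth can-i create clusterroles
kubectl auth can-i create clusterrolebindings
kubectl auth can-i create namespaces

# Namespaced resources (replace <namespace> with your target namespace)
for r in configmaps daemonsets deployments ingresses secrets services serviceaccounts; do
  kubectl auth can-i create "$r" -n <namespace>
done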
Runtime Permissions#
During the runtime of the Streaming API/Microservices, various extended permissions are required. Below is a breakdown of the necessary permissions for each microservice. Detailed configurations on the required permissions and security context can be found in the values.yaml and Helm chart template files of the respective service.
- Resource Management Control Plane
The RMCP service uses a Role to create, list, get, and delete HelmRelease Custom Resource Definitions (CRDs). The HelmRelease CRD, provided by Flux, is used to manage the creation of streaming instances.
- Session Management Service
The session management service uses a Role to list and get Service resources created within its namespace. This allows the service to extract the NLB DNS name that the client needs to connect to.
- Application and Profile Management Service
The application and profile management service uses a Role to create, list, get, and delete the following three CRDs within its namespace.
applications.omniverse.nvidia.com
applicationversions.omniverse.nvidia.com
applicationprofiles.omniverse.nvidia.com
These CRDs enable the management of Omniverse applications, versions, and profiles. Their creation does not trigger any events and should have no side effects. They are used solely to manage the availability of these applications in the cluster without requiring additional resources like a database.
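For reference, once the services are installed you can confirm that these CRDs are registered and inspect the custom resources they manage. A small sketch; the namespace is a placeholder:
# Confirm the CRDs are registered in the cluster
kubectl get crd | grep omniverse.nvidia.com

# List the custom resources managed by the application and profile management service
kubectl get applications.omniverse.nvidia.com -n <namespace>
kubectl get applicationversions.omniverse.nvidia.com -n <namespace>
kubectl get applicationprofiles.omniverse.nvidia.com -n <namespace>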
Kubernetes Worker Nodes#
Supported Operating System(s)#
Ensure your Kubernetes worker nodes are running Ubuntu 22.04. Currently, no other Linux operating systems are compatible.
System Nodes#
The Omniverse Kit App Streaming services do not require GPUs and are ideally suited for standard microservice resource types. For smaller-scale deployments, a single node may be sufficient. Otherwise, monitor and scale each service based upon your observed workloads.
| System nodes (CPU-only) | |
|---|---|
| vCPUs | 4+ |
| Memory | 8Gi+ |
| Number | 1-3 |

| System nodes (CPU-only) | |
|---|---|
| AWS | m6a.xlarge |
| Azure | D4s_v3 |
Label any worker node that should run these services with NodeGroup=system, using the following command:
kubectl label nodes <node_name> NodeGroup=system
Memcached Node#
Memcached is required to support shared shader caching. Depending on the number and complexity of the compiled shaders, this workload can be more constrained by memory and network bandwidth than by CPU.
| Memcached nodes (CPU-only) | |
|---|---|
| vCPUs | 4+ |
| Memory | 8-16Gi+ |
| Number | 1 |

| Memcached nodes (CPU-only) | |
|---|---|
| AWS | r7i.xlarge |
| Azure | E8s_v5 |
Label any worker node that should run memcached with NodeGroup=cache, using the following command:
kubectl label nodes <node_name> NodeGroup=cache
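The streaming Helm charts may already provide a memcached deployment for you. If you need to deploy one yourself, one option is the Bitnami memcached chart pinned to the labeled cache nodes; this is only a sketch, assuming that chart and its standard nodeSelector value:
# Install memcached from the Bitnami OCI registry, scheduling it onto the NodeGroup=cache nodes
helm install memcached oci://registry-1.docker.io/bitnamicharts/memcached \
  -n memcached --create-namespace \
  --set nodeSelector.NodeGroup=cache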
GPU Worker Nodes#
Ensure your GPU worker nodes are scaled suitably for your particular workloads. The number of CPU cores, the amount of memory, and the available GPU memory required depend heavily on the specifics of the data being loaded and streamed. The values below are suggestions only; you alone can determine what is appropriate, based on your knowledge of your workloads.
Ensure all GPU worker nodes are using a supported NVIDIA driver version.
| GPU worker nodes | |
|---|---|
| AWS | g5.4xlarge, g6.4xlarge, g6e.4xlarge |
| Azure | NV36ads_A10_v5 |
Label any worker node that should run GPU workloads with NodeGroup=gpu, using the following command:
kubectl label nodes <node_name> NodeGroup=gpu
Installing the NVIDIA GPU Operator with Helm#
The GPU operator can be installed with Helm. Below is an example of the installation with the desired driver version:
helm install --wait --generate-name \
-n gpu-operator --create-namespace \
--repo https://helm.ngc.nvidia.com/nvidia \
gpu-operator \
--set driver.version=535.104.05
Example output:
NAME: gpu-operator-1710472876
LAST DEPLOYED: Thu Mar 14 20:21:18 2024
NAMESPACE: gpu-operator
STATUS: deployed
REVISION: 1
TEST SUITE: None
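After the chart is deployed, it can be useful to confirm that the operator pods are healthy and that the driver is loaded on the GPU nodes. A quick check might look like the following; the driver pod name is a placeholder:
# Confirm the GPU Operator pods are Running or Completed
kubectl get pods -n gpu-operator

# Confirm the driver and GPUs are visible on a GPU node (pod name is a placeholder)
kubectl exec -n gpu-operator <nvidia-driver-daemonset-pod> -- nvidia-smi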
See the NVIDIA GPU Operator documentation for additional installation options and details.
Container Image Registry#
If your Kubernetes worker nodes cannot access the NVIDIA NGC Image registry (nvcr.io), or if you are using your own container image registry (e.g., ECR, Harbor), you need to manually pull and push the Omniverse containers to your preferred image registry.
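The exact commands depend on your registry and on the container names published on NGC. The following is a hedged sketch of the general pull/retag/push flow; the organization, image name, tag, and registry URL are all placeholders:
# Pull the container from NGC (organization, image name, and tag are placeholders)
docker pull nvcr.io/<org>/<image>:<tag>

# Retag the image for your private registry (for example, ECR or Harbor)
docker tag nvcr.io/<org>/<image>:<tag> <registry_url>/<image>:<tag>

# Push the image to your private registry
docker push <registry_url>/<image>:<tag>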
Network#
Network Segmentation#
Worker nodes require a private network/subnet. The Network Load Balancers will need access to the Kubernetes subnet (default 10.0.0.0/16) for UDP and TCP traffic.
Load Balancers - L4 Traffic#
For streaming the Kit application, L4 network load balancers are used. The setup of these load balancers varies depending on the cloud service provider (CSP) used. Below are guidelines for the supported CSPs.
Load Balancers - L7 Traffic#
For API requests, set up public load balancers, such as the Amazon Application Load Balancer. The example values files provided assume that the AWS Load Balancer Controller is installed and configured for your cluster, which will automatically create an ALB when Kubernetes Ingress resources are created.
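If the AWS Load Balancer Controller is not yet installed in your cluster, it can be installed with Helm. The sketch below follows the controller's standard chart and assumes an IRSA-backed service account already exists; the cluster name is a placeholder:
# Add the EKS charts repository that hosts the AWS Load Balancer Controller chart
helm repo add eks https://aws.github.io/eks-charts
helm repo update

# Install the controller; it watches Ingress resources and provisions ALBs
helm install aws-load-balancer-controller eks/aws-load-balancer-controller \
  -n kube-system \
  --set clusterName=<cluster_name> \
  --set serviceAccount.create=false \
  --set serviceAccount.name=aws-load-balancer-controller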
Cloud Service Provider Requirements#
General Requirements#
Availability Zones#
To ensure low latency and cost-optimized operation, the Kubernetes worker nodes involved in the streaming traffic should be located in the same Availability Zone; otherwise, cross-AZ traffic may introduce additional latency and cost.
GPU Resource Limits#
We recommend that each Omniverse Kit App be configured to request one GPU resource per stream. As the Kubernetes node groups for GPUs scale manually or automatically, the CSP account may reach the soft limits defined for these instance types. Ensure the account resource limits are configured to allow the required number of NVIDIA GPU instances.
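To sanity-check that the cluster sees the GPUs you expect, and therefore how many concurrent streams the labeled nodes can serve at one GPU per stream, you can inspect the allocatable GPU resources reported by the GPU nodes:
# Show the GPU capacity/allocatable reported by each labeled GPU node
kubectl describe nodes -l NodeGroup=gpu | grep "nvidia.com/gpu"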
AWS#
AWS Recommended Instance Types#
| Worker Type | AWS Instance |
|---|---|
| System | m6a.xlarge |
| Memcached | r7i.xlarge |
| GPU | g5.4xlarge, g6.4xlarge, g6e.4xlarge |
Supported NVIDIA GPU Driver Version: 535.104.05
Using the AWS NLB Manager (optional)#
Using the AWS NLB Manager (LBM) is optional and only required if you want to leverage NLB pooling. This allows you to provision a set number of NLBs and dynamically bind/unbind listeners and target groups when a Kit application stream begins or ends.
When using the NLB service for AWS NLB pooling, please pre-provision Network Load Balancers with the following configuration:
- Internet-facing load balancer
- NLB deployed in all public subnets and across all Availability Zones used by the Kubernetes cluster’s worker nodes
- Cross-zone load balancing enabled
- Security group configured (see below for details)
- A tag key/value pair added to identify the NLB as available for streaming; this key/value pair is given to the NLB service to filter NLBs (see the tagging example after this list)
- If TLS is required:
  - Attach a certificate to the NLB
  - Configure a Route53 entry for the NLB and add a tag to the NLB: Route53Alias: <FQDN>
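For reference, tags can be applied to an existing NLB with the AWS CLI. The NLB ARN, lookup tag key/value, and FQDN below are placeholders and must match the LBM lookup and TLS configuration described in the following sections:
# Tag an existing NLB so the LBM can discover it (key/value are examples)
aws elbv2 add-tags \
  --resource-arns <nlb_arn> \
  --tags Key=<lookup_tag_key>,Value=<lookup_tag_value>

# Optionally tag the NLB with the Route53 alias used for TLS
aws elbv2 add-tags \
  --resource-arns <nlb_arn> \
  --tags Key=Route53Alias,Value=<fqdn>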
AWS NLB Manager Tag Lookup Configuration#
The LBM supports dynamic configuration of NLBs at service startup and through the GET:/refresh API endpoint. These settings can be configured via the service’s application.toml and Helm chart values file using the following parameters:
nv.ov.svc.streaming.aws.nlb.resource.lookup.tag.key = ""
nv.ov.svc.streaming.aws.nlb.resource.lookup.tag.value = ""
Any NLBs with a matching tag key/value will be configured by the LBM.
AWS NLB Manager TLS Configuration#
The LBM supports configuring the signaling (TCP) port with TLS termination at the NLB listener. TLS requires both a valid ACM certificate and a DNS alias record pointing to the NLB’s public IP and/or default DNS name. The LBM will search for a configurable tag and use its value as the NLB’s alias for the streaming session.
This feature is enabled and configurable through the service’s application.toml and Helm chart values file using the following settings:
nv.ov.svc.streaming.aws.nlb.resource.dns.alias.tag.key = "Route53Alias"
nv.ov.svc.streaming.aws.nlb.ports.tcp.tls.enabled = true
nv.ov.svc.streaming.aws.nlb.ports.tcp.tls.ssl_policy = "ELBSecurityPolicy-TLS13-1-2-2021-06"
nv.ov.svc.streaming.aws.nlb.ports.tcp.tls.certificate_arn = "<arn>"
AWS NLB Manager Port and Security Group Configuration#
The LBM manages the creation of Listener(s) and Target Group(s) for the provided Network Load Balancers. By default, the following ports are used:
| NLB port allocation | Default value(s) |
|---|---|
| signaling | TCP/443 |
| media | UDP/80 |
Note
The LBM’s TCP and UDP starting port and stream limit (per NLB) are configurable through the service’s application.toml and Helm chart values file using the following settings:
nv.ov.svc.streaming.aws.nlb.stream.limit = 1
nv.ov.svc.streaming.aws.nlb.ports.tcp.port_start = 443
nv.ov.svc.streaming.aws.nlb.ports.udp.port_start = 80
At startup, the LBM attempts to pre-populate the necessary listeners and target groups based on the provided configuration.
In the above example (default configuration), two listeners (one TCP and one UDP) and target groups are created on ports 443 and 80; therefore, traffic to these ports must be allowed via security groups attached to the following resources:
- The security group(s) of the Network Load Balancer(s) themselves, allowing traffic from:
  - The client source address(es)
  - The NAT gateway egress IP address
- The security group(s) of the EKS (GPU) node(s) themselves, allowing traffic from:
  - The Network Load Balancer(s)
  - The client source address(es)
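As a reference, the corresponding ingress rules can be added with the AWS CLI. Ports 443/TCP and 80/UDP match the default configuration above; the security group ID and client CIDR are placeholders:
# Allow client signaling traffic (TCP/443) to the NLB security group
aws ec2 authorize-security-group-ingress \
  --group-id <nlb_security_group_id> \
  --protocol tcp --port 443 --cidr <client_cidr>

# Allow client media traffic (UDP/80) to the NLB security group
aws ec2 authorize-security-group-ingress \
  --group-id <nlb_security_group_id> \
  --protocol udp --port 80 --cidr <client_cidr>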
AWS NLB Manager IAM Role & Policy#
The LBM requires an IAM role, which it assumes using its own Kubernetes service account. The IAM role and policy closely follow the IAM setup of the AWS Load Balancer Controller using IRSA. However, unlike the role and policy used for the AWS Load Balancer Controller, the IAM policy for the LBM should be scoped to the minimal resources needed.
For detailed setup instructions, please refer to the official documentation above. The following details and links are provided as a quick reference only.
IAM Role Trust#
The configured IAM role must have a trust policy that allows the Kubernetes service account of the AWS NLB Manager to assume the role via the cluster’s IAM OIDC identity provider.
AWS’s Applying IRSA documentation has additional information on IAM Roles for Service Accounts (IRSA).
IAM Policy Permissions Example#
The IAM policy for the AWS NLB Manager role needs to include the following permissions:
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"elasticloadbalancing:ModifyListener",
"tag:GetResources",
"elasticloadbalancing:DescribeTags",
"elasticloadbalancing:CreateTargetGroup",
"elasticloadbalancing:RemoveListenerCertificates",
"elasticloadbalancing:DescribeLoadBalancerAttributes",
"elasticloadbalancing:DescribeLoadBalancers",
"elasticloadbalancing:CreateListener",
"elasticloadbalancing:DescribeTargetGroupAttributes",
"elasticloadbalancing:DescribeListeners",
"elasticloadbalancing:ModifyRule",
"elasticloadbalancing:AddTags",
"elasticloadbalancing:CreateRule",
"elasticloadbalancing:DescribeTargetHealth",
"elasticloadbalancing:DescribeTargetGroups",
"elasticloadbalancing:DescribeListenerCertificates",
"elasticloadbalancing:AddListenerCertificates",
"elasticloadbalancing:DescribeRules",
"elasticloadbalancing:ModifyTargetGroup"
],
"Resource": "*"
}
]
}
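One way to wire this up is to create the policy from the JSON above and then use eksctl to create the IRSA-backed service account and role in one step. This is only a sketch of the flow; the policy file, cluster name, namespace, service account name, and account ID are placeholders:
# Create the IAM policy from the JSON document above (saved locally as a file)
aws iam create-policy \
  --policy-name <nlb_manager_policy_name> \
  --policy-document file://nlb-manager-policy.json

# Create the Kubernetes service account and IAM role with the OIDC trust relationship (IRSA)
eksctl create iamserviceaccount \
  --cluster <cluster_name> \
  --namespace <namespace> \
  --name <nlb_manager_service_account> \
  --attach-policy-arn arn:aws:iam::<account_id>:policy/<nlb_manager_policy_name> \
  --approve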
Azure#
Azure Recommended Instance Types#
| Worker Type | Azure Instance |
|---|---|
| System | D4s_v3 |
| Memcached | E8s_v5 |
| GPU | NV36ads_A10_v5 |
Supported NVIDIA GPU Driver Version: 535.104.05
On-premises#
When deploying on-premises, there are many options for Network Load Balancers and API Gateways. For our internal testing, we have leveraged the following:
Network Load Balancer - MetalLB
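For reference, MetalLB can be installed with Helm using its upstream chart; address-pool configuration is environment-specific and is left to you. A sketch:
# Add the MetalLB chart repository and install it into its own namespace
helm repo add metallb https://metallb.github.io/metallb
helm repo update
helm install metallb metallb/metallb -n metallb-system --create-namespace

# Address pools and advertisements are configured afterwards via MetalLB's
# IPAddressPool and L2Advertisement custom resources, which depend on your network.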