UCC: Connection Saturation and High Response Times#

Overview#

USD Content Cache (UCC) serves cached USD assets to render workers over HTTP/HTTPS. When connection limits are exceeded or worker capacity is insufficient, UCC response times degrade, client requests time out, and simulations fail. UCC uses NGINX as its foundation, which has per-worker connection limits that must be sized appropriately for the workload.

Connection saturation occurs when:

  • Worker connection limits (worker_connections) are undersized for concurrent client requests

  • CPU cores allocated to UCC are insufficient for the connection handling workload

  • Replica count is too low to distribute connection load across the cluster

  • Client concurrency spikes exceed UCC’s configured capacity

  • HTTP/1.1 is used instead of HTTP/2, requiring one connection per concurrent request

When connection saturation occurs, UCC cannot accept new connections, client requests queue or time out, and P99 response times increase dramatically (from milliseconds to seconds or tens of seconds). This manifests as simulation failures with timeout errors or “connection refused” messages.

The recommended sizing formula for UCC connections is:

worker_connections = (GPU_count * client_concurrency) / replica_count / vCPU_count * safety_margin

For example, with 66 GPUs, 256 client concurrency, 5 replicas, and 32 vCPUs per replica:

worker_connections = (66 * 256) / 5 / 32 * 1.5 ~ 158
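As a quick sanity check, the formula can be evaluated with a shell one-liner. The variable names below are illustrative; substitute your own fleet numbers:

```shell
# Evaluate the UCC sizing formula with the example values above.
gpus=66 concurrency=256 replicas=5 vcpus=32 margin=1.5
worker_connections=$(awk -v g="$gpus" -v c="$concurrency" -v r="$replicas" \
  -v v="$vcpus" -v m="$margin" 'BEGIN { printf "%d", (g * c) / r / v * m }')
echo "worker_connections = $worker_connections"
```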

Symptoms and Detection Signals#

Visible Symptoms#

  • High P99 response times - Response times exceeding 5-10 seconds at the 99th percentile

  • Client timeout errors - Render workers reporting connection timeouts or “connection refused”

  • Simulation failures - Simulations failing with gRPC UNKNOWN errors or HTTP timeout errors

  • Connection queue buildup - Connections waiting for available worker slots

Log Messages#

Connection Refused#

Where to find these logs:

  • Location: Render Worker Pod

  • Application: OmniClient / Storage Service

  • Description: Logs indicating connection failures to UCC

# Look for timeout or connection errors in render worker logs
# Patterns may include:
# - *refused*
# - *timeout*
# - *connect*
# - *dial*

Timeout Errors#

Where to find these logs:

  • Location: Render Worker Pod

  • Application: Storage Service

  • Description: Logs indicating request timeouts from UCC

# Look for timeout errors in render worker logs
# Patterns may include:
# - *timeout*
# - *deadline*
# - *context*

Metric Signals#

The following Prometheus metrics can be used to detect connection saturation before it causes simulation failures. Monitor these metrics proactively to identify capacity issues early.

nginx_connections_active{pod=~"usd-content-cache-.*", namespace="ucc"}
nginx_connections_waiting{pod=~"usd-content-cache-.*", namespace="ucc"}
nginx_http_request_duration_seconds{quantile="0.99", pod=~"usd-content-cache-.*"}
rate(nginx_http_requests_total{pod=~"usd-content-cache-.*", status=~"5.."}[5m])
container_network_receive_bytes_total{pod=~"usd-content-cache-.*", namespace="ucc"}

Connection capacity is tracked through NGINX connection metrics that reveal worker utilization and queueing. The nginx_connections_active metric reports active connections per NGINX worker process; values approaching the configured worker_connections limit indicate saturation risk, so alert when they exceed 80% of configured capacity. When the limit is reached, NGINX begins queueing incoming connections, tracked by nginx_connections_waiting. Non-zero values indicate queueing has begun, and high values signal severe saturation; this metric should remain at or near zero under normal operation.
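As a sketch, the 80% threshold could be encoded directly in PromQL. NGINX does not export the worker_connections limit as a metric, so the configured value (1024 below, as a placeholder) must be substituted by hand or via a recording rule:

# Active connections above 80% of the configured per-worker limit (placeholder: 1024)
nginx_connections_active{pod=~"usd-content-cache-.*", namespace="ucc"} > 0.8 * 1024

# Any connection queueing
nginx_connections_waiting{pod=~"usd-content-cache-.*", namespace="ucc"} > 0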

Request performance degradation manifests in latency metrics tracked through nginx_http_request_duration_seconds. The P99 quantile (99th percentile) reveals tail latency, where values exceeding 5-10 seconds indicate severe performance degradation. Compare against baseline performance, typically under 500ms for cache hits, to identify connection saturation impacts. When saturation prevents requests from being processed, error rates increase, captured in nginx_http_requests_total with 5xx status codes. Sharp increases in this rate may indicate connection saturation causing request failures, as normal operation should have minimal 5xx errors.

Network capacity constraints can contribute to connection issues, tracked through container_network_receive_bytes_total, which measures total inbound traffic to UCC pods. High values approaching NIC capacity may indicate network saturation contributing to connection problems. Compare against VM SKU NIC limits to determine if network bandwidth is becoming a constraint alongside connection capacity.

Root Cause Analysis#

Known Causes#

Connection saturation in UCC is typically caused by undersized worker connection limits, insufficient CPU cores, or too few replicas to handle the workload.

Undersized Worker Connection Limits#

The worker_connections parameter in NGINX controls the maximum number of simultaneous connections each worker process can handle. The default value (often 1,024) can be insufficient for high-concurrency simulation workloads. Each NGINX worker process runs on one CPU core, so total connection capacity is worker_connections * vCPU_count * replica_count.

For example, with default worker_connections=1024, 32 vCPUs, and 5 replicas:

Total capacity = 1,024 * 32 * 5 = 163,840 connections

However, this includes both inbound (client→UCC) and outbound (UCC→S3) connections. A workload with 66 GPUs and 256 client concurrency requires:

Required connections ~ 66 * 256 = 16,896 inbound, plus outbound connections to S3

If worker_connections is too low, NGINX rejects new connections once the limit is reached, causing “connection refused” errors and queueing.
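The capacity comparison above reduces to simple integer arithmetic, sketched here with the example numbers (values are illustrative):

```shell
# Total connection capacity at the NGINX default, vs. the example workload's demand.
capacity=$(( 1024 * 32 * 5 ))   # worker_connections * vCPUs * replicas
required=$(( 66 * 256 ))        # GPUs * client concurrency (inbound only)
echo "capacity=$capacity required=$required"
```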

Check current worker connection configuration:

# Check Helm values for worker_connections
helm get values <ucc-release-name> -n ucc | grep worker_connections

# If not set, check default from ConfigMap
kubectl get configmap -n ucc <ucc-configmap> -o yaml | grep worker_connections
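If the value does need raising, the underlying NGINX directives look like the following. The directive names are standard NGINX; the values are illustrative, and in a Helm deployment they would normally be set through chart values rather than edited directly:

worker_processes auto;           # one worker process per available CPU core
events {
    worker_connections 4096;     # per-worker simultaneous connection cap (illustrative)
}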

Insufficient CPU Cores#

NGINX spawns one worker process per CPU core. If CPU allocation is too low, even with properly sized worker_connections, UCC cannot handle the connection load because there are not enough worker processes to distribute connections across.

Check CPU allocation:

# Check UCC pod CPU limits and requests
kubectl get pods -n ucc -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[0].resources}{"\n"}{end}'

# Check actual CPU usage
kubectl top pods -n ucc

Too Few Replicas#

UCC replica count may be too low to distribute connection load. The recommended sizing is based on network bandwidth: provision 3.3 Gbps of network bandwidth per GPU (recommended), with 1 Gbps per GPU as the absolute minimum.

For example, with 66 GPUs requiring 217.8 Gbps total bandwidth, and VMs with 10 Gbps NICs:

Required replicas ~ 217.8 Gbps / 10 Gbps per node ~ 22 pods

However, network bandwidth is typically the primary sizing factor; connection limits are secondary.
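The replica calculation is a ceiling division, which can be done in pure integer arithmetic by working in tenths of a Gbps (inputs are the example's assumptions):

```shell
gpus=66
req_tenths=$(( gpus * 33 ))      # 3.3 Gbps per GPU -> 217.8 Gbps total
nic_tenths=$(( 10 * 10 ))        # 10 Gbps NIC per node
replicas=$(( (req_tenths + nic_tenths - 1) / nic_tenths ))   # ceiling division
echo "required replicas = $replicas"
```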

Check current replica count:

# Check UCC StatefulSet replica count
kubectl get statefulset -n ucc

# Check Helm values
helm get values <ucc-release-name> -n ucc | grep replicas

Other Possible Causes#

  1. HTTP/1.1 Instead of HTTP/2

    • HTTP/1.1 requires one connection per concurrent request

    • HTTP/2 multiplexes multiple requests over a single connection

    • Using HTTP/1.1 exhausts connections faster than HTTP/2

    • Check client HTTP version support and UCC HTTP/2 configuration

  2. Load Balancer Session Affinity Disabled

    • Without session affinity, client retries hit different UCC pods

    • Retry storms amplify connection pressure across all pods

    • Each retry counts as a new connection without affinity

  3. Cloud Provider SNAT Port Exhaustion

    • Outbound connections to S3 consume SNAT ports

    • SNAT exhaustion prevents new upstream connections

    • More common in cloud environments with NAT gateways or load balancers

  4. Network Latency or Packet Loss

    • High network latency increases connection lifetime

    • Packet loss triggers retransmissions and connection delays

    • Connections remain open longer, consuming worker slots
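For cause 1, enabling HTTP/2 on the NGINX side is a one-directive change, sketched below. `http2 on;` requires NGINX 1.25.1 or later; older versions use the `http2` flag on the `listen` directive:

server {
    listen 443 ssl;
    http2 on;        # NGINX >= 1.25.1; older versions: "listen 443 ssl http2;"
    # server_name, certificates, and locations omitted
}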

Troubleshooting Steps#

Diagnostic Steps for Known Root Causes#

1. Monitor Connection Metrics and Identify Saturation#

Check active and waiting connection counts to determine if saturation is occurring.

# Access UCC metrics endpoint
kubectl port-forward -n ucc svc/<ucc-service-name> 9145:9145
curl http://localhost:9145/metrics | grep "nginx_connections"

# Query Prometheus metrics:
# - nginx_connections_active{pod=~"usd-content-cache-.*"}
# - nginx_connections_waiting{pod=~"usd-content-cache-.*"}

# Check if active connections approach worker_connections limit
# Alert threshold: active > 0.8 * worker_connections

Analysis:

  • Active connections consistently near worker_connections limit indicate saturation.

  • Non-zero nginx_connections_waiting indicates connection queueing.

  • P99 response times >5s correlate with high active connection counts.

  • Compare active connections across pods to identify load distribution issues.

Resolution:

  • If active connections approach limit, increase worker_connections (see step 2).

  • If queueing occurs, scale replicas or increase CPU allocation (see steps 3-4).

  • Monitor connection metrics after changes to verify improvements.

2. Increase Worker Connection Limits#

If connection saturation is detected, increase the worker_connections parameter in NGINX configuration.

# Get current Helm values
helm get values <ucc-release-name> -n ucc -o yaml > current-values.yaml

# Edit: Add or update nginx.workerConnections
# Recommended: (GPU_count * concurrency) / replicas / vCPU * 1.5
# Example: (66 * 256) / 5 / 32 * 1.5 ~ 158

# Apply updated values
helm upgrade <ucc-release-name> <chart-path> -n ucc -f current-values.yaml

# Verify configuration applied
kubectl get configmap -n ucc <ucc-nginx-config> -o yaml | grep worker_connections

# Monitor connection metrics after upgrade
kubectl port-forward -n ucc svc/<ucc-service-name> 9145:9145
curl http://localhost:9145/metrics | grep "nginx_connections_active"
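A minimal values-file fragment for this change might look like the following. The nginx.workerConnections key path follows the comment in the step above, but verify it against your chart's actual schema:

nginx:
  workerConnections: 2048   # illustrative; use the value from the sizing formula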

Analysis:

  • Current worker_connections compared to calculated requirement determines if increase is needed.

  • Post-upgrade, active connections should remain well below new limit (target <70%).

  • Verify no connection queueing (nginx_connections_waiting should be zero).

Resolution:

  • Set worker_connections to calculated value based on sizing formula.

  • Apply via Helm upgrade: helm upgrade <release> <chart> -n ucc -f values.yaml

  • Restart UCC pods if configuration hot-reload is not supported.

  • Monitor metrics for 24-48 hours to validate capacity improvements.

3. Scale UCC Replicas to Distribute Load#

If connection saturation persists after increasing worker limits, scale the number of UCC replicas to distribute load across more pods.

# Check current replica count
kubectl get statefulset -n ucc

# Calculate required replicas based on network bandwidth
# Required bandwidth = GPU_count * 3.3 Gbps (recommended) or 1 Gbps (minimum)
# Example: 66 GPUs * 3.3 Gbps = 217.8 Gbps
# VM NIC capacity: 10 Gbps → required replicas ~ 22

# Update Helm values: cluster.replicas
# Edit current-values.yaml

# Apply updated replica count
helm upgrade <ucc-release-name> <chart-path> -n ucc -f current-values.yaml

# Wait for new pods to become ready
kubectl get pods -n ucc -w

# Verify load distribution across replicas
kubectl top pods -n ucc

Analysis:

  • Current replica count compared to calculated requirement determines scaling needs.

  • Post-scaling, connection load should distribute evenly across pods.

  • Network bandwidth per pod should be well below NIC capacity (target <70%).

Resolution:

  • Scale replicas to match calculated requirement (network bandwidth-based sizing).

  • Apply via Helm upgrade: helm upgrade <release> <chart> -n ucc -f values.yaml

  • Monitor connection distribution and response times across all replicas.

  • Verify load balancer distributes traffic evenly across new pods.

4. Increase CPU Allocation for More Worker Processes#

If CPU utilization is high (>80%) and connection saturation persists, increase CPU allocation to spawn more NGINX worker processes (one per core).

# Check current CPU allocation
kubectl get pods -n ucc -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[0].resources}{"\n"}{end}'

# Check actual CPU usage
kubectl top pods -n ucc

# If CPU usage consistently >80%, increase CPU limits
# Edit Helm values: cluster.container.resources.limits.cpu
# Example: increase from 16 to 32 vCPUs

# Apply updated CPU allocation
helm upgrade <ucc-release-name> <chart-path> -n ucc -f current-values.yaml

# Verify NGINX spawned more workers
kubectl exec -n ucc <ucc-pod> -- ps aux | grep "nginx: worker process" | wc -l
# Should equal vCPU count

Analysis:

  • High CPU utilization (>80%) with connection queueing indicates CPU bottleneck.

  • Number of NGINX worker processes equals vCPU count (one per core).

  • More workers allow more concurrent connection handling.

Resolution:

  • Increase CPU allocation to match workload needs.

  • Verify worker process count equals new vCPU allocation.

  • Monitor CPU and connection metrics post-upgrade.

  • Consider upgrading VM SKU if CPU limits are reached.

Other Diagnostic Actions#

  • Check load balancer distribution: Verify traffic distributes evenly across UCC replicas:

    # Check request distribution across pods (from UCC metrics)
    kubectl port-forward -n ucc svc/<ucc-service-name> 9145:9145
    curl http://localhost:9145/metrics | grep "nginx_http_requests_total"
    
    # Compare request counts per pod
    # Uneven distribution may indicate load balancer issues
    
  • Review client session affinity: Check if load balancer has session affinity enabled:

    # Check Service configuration
    kubectl get svc -n ucc <ucc-service> -o yaml | grep -A 5 "sessionAffinity"
    
    # For cloud provider load balancers, check annotations
    kubectl get svc -n ucc <ucc-service> -o yaml | grep -i "affinity\|sticky"
    
  • Monitor cloud provider SNAT usage: Check if SNAT port exhaustion is contributing:

    # For Azure AKS:
    # az monitor metrics list \
    #   --resource <load-balancer-resource-id> \
    #   --metric "UsedSnatPorts,AllocatedSnatPorts" \
    #   --interval PT1M --aggregation Average
    
    # For AWS:
    # Check NAT Gateway or load balancer connection tracking metrics
    

Prevention#

Proactive Monitoring#

Set up alerts for:

  • Connection utilization thresholds: Alert when active connections exceed 80% of worker_connections limit

  • Connection queueing: Alert when nginx_connections_waiting is non-zero for >30 seconds

  • P99 response time degradation: Alert when P99 exceeds baseline by 3x (e.g., >1.5s if baseline is 500ms)

  • CPU utilization: Alert when CPU usage exceeds 80% for >5 minutes

  • 5xx error rate increases: Alert on sharp increases in 5xx response codes
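The latency and error-rate alerts above could be sketched in PromQL as follows; the thresholds are the examples given in this section and should be tuned to your own baseline:

# P99 latency above 3x a 500ms baseline
nginx_http_request_duration_seconds{quantile="0.99", pod=~"usd-content-cache-.*"} > 1.5

# Elevated 5xx rate over 5 minutes (threshold illustrative)
rate(nginx_http_requests_total{pod=~"usd-content-cache-.*", status=~"5.."}[5m]) > 1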

Configuration Best Practices#

  • Size worker connections appropriately: Use sizing formula: (GPU_count * concurrency) / replicas / vCPU * 1.5

  • Provision adequate CPU: Allocate vCPUs to match connection handling needs (one worker process per core)

  • Scale replicas for network bandwidth: Provision at least 3.3 Gbps per GPU (minimum 1 Gbps)

  • Enable HTTP/2: Configure HTTP/2 on both UCC and clients to multiplex requests over fewer connections

  • Enable session affinity: Configure load balancer session affinity (30-60s timeout) to improve retry efficiency

  • Monitor connection trends: Track connection usage over time to predict when scaling is needed

  • Plan for traffic spikes: Size capacity for peak concurrent simulations, not average load

Capacity Planning#

  • Calculate connection requirements: Use formula above to determine worker_connections based on GPU count and client concurrency

  • Account for multiple concurrent simulations: Multiply requirements by concurrent simulation count

  • Provision headroom: Add 50% safety margin to calculated values to handle traffic bursts

  • Plan VM SKU upgrades: Select VM SKUs with adequate vCPUs and network bandwidth for workload

  • Test under load: Validate configuration with representative workload before production deployment

  • Monitor during scale-out: Track metrics during GPU fleet expansion to predict UCC scaling needs