DDCS: Network Bandwidth or Latency Bottlenecks#
Overview#
DDCS (Derived Data Cache Service) serves derived data to render workers over the network and requires high bandwidth and low latency to support efficient data transfer. It is generally recommended to provision 3.3 Gbps of network bandwidth for each GPU in a cluster, with a minimum of 1 Gbps per GPU.
Network bandwidth or latency bottlenecks can occur when:
Insufficient DDCS pods scheduled to handle network load
Insufficient network bandwidth between DDCS and GPU nodes
Network congestion from noisy neighbors in shared environments
Misconfigured Kubernetes networking causing suboptimal routing
Cross-availability zone or cross-data center traffic
When network bottlenecks occur, DDCS cannot efficiently serve data to render workers, causing slow cache population, degraded throughput, and increased scene load times.
Symptoms and Detection Signals#
Visible Symptoms#
Slow cache population - Cache taking longer than expected to populate with derived data
Degraded throughput - Reduced data transfer rates between DDCS and GPU nodes
Increased scene load times - Scene loads taking significantly longer than expected
Metric Signals#
The following Prometheus metrics can be used to detect network bandwidth or latency bottlenecks. Review these metrics to determine whether networking is the bottleneck.
ddcs_m_adapter_bytes_returned_total{}
container_network_receive_bytes_total{pod=~"ddcs-.*"}
container_network_transmit_bytes_total{pod=~"ddcs-.*"}
Network saturation can be identified through several complementary
metrics. The ddcs_m_adapter_bytes_returned_total metric tracks total
bytes returned from all cache levels (in-memory and disk), providing
application-level visibility into data throughput. When this metric
shows high values relative to available network capacity, it suggests
DDCS may be saturating network bandwidth serving cached content to
clients.
Container-level network metrics provide direct visibility into network
interface utilization. The container_network_transmit_bytes_total
metric tracks bytes transmitted by DDCS pods, where high values
approaching network interface limits indicate outbound network
saturation as DDCS serves data to clients. Complementing this,
container_network_receive_bytes_total measures inbound traffic to
DDCS pods. By monitoring both transmit and receive metrics together, you
can identify whether DDCS pods are network-bound and approaching the
physical limits of their network interfaces. Compare these values
against the network interface capacity of your VM SKUs to determine if
network scaling is needed.
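The comparison described above can be sketched in code. The following is a minimal illustration, not part of DDCS: the helper name `nic_utilization` and all sample values are hypothetical, and it assumes you have already sampled a monotonic byte counter (such as container_network_transmit_bytes_total) at two points in time.

```python
# Sketch: estimate NIC utilization from two samples of a Prometheus byte
# counter (e.g. container_network_transmit_bytes_total for a DDCS pod).
# The helper name, sample values, and 10 Gbps capacity are illustrative.

def nic_utilization(bytes_t0: int, bytes_t1: int, interval_s: float,
                    nic_capacity_gbps: float) -> float:
    """Return fractional utilization (0.0-1.0) of a NIC over an interval."""
    throughput_bps = (bytes_t1 - bytes_t0) * 8 / interval_s  # bits per second
    return throughput_bps / (nic_capacity_gbps * 1e9)

# Example: a pod transmitted 37.5 GB over 60 s on a 10 Gbps interface.
util = nic_utilization(0, 37_500_000_000, 60.0, 10.0)
print(f"{util:.0%}")  # prints "50%"
```

Sustained values near 1.0 on DDCS pods indicate the interface is saturated and the service is network-bound.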
Root Cause Analysis#
Known Causes#
Network bandwidth or latency bottlenecks in DDCS are typically caused by insufficient DDCS pods scheduled, insufficient network bandwidth, or network congestion.
Insufficient DDCS Pods Scheduled#
DDCS must be configured with a replica count matching the number of compute nodes required to meet the cluster's network bandwidth demand. The rule of thumb is to provision at least 3.3 Gbps of bandwidth per GPU. If too few DDCS pods are scheduled, the available pods can become network-bound, causing bottlenecks. Refer to the DDCS Configuration Guide for scaling guidance.
Check DDCS pod count:
# Check current DDCS pod count
kubectl get pods -n ddcs -l app.kubernetes.io/instance=ddcs
# Check StatefulSet replica configuration
kubectl get statefulset -n ddcs -l app.kubernetes.io/instance=ddcs
kubectl describe statefulset -n ddcs <ddcs-statefulset-name>
# Calculate required DDCS pods
# Required pods = (GPU count * 3.3 Gbps) / compute_node_bandwidth
# Example: 25 GPUs * 3.3 Gbps = 82.5 Gbps; 82.5 / 10 Gbps per node = 8.25, round up to 9 pods
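The sizing rule above can be expressed as a short calculation. This is a sketch: the function name `required_ddcs_pods` and the example GPU count and per-node bandwidth are illustrative, not values from a real cluster.

```python
import math

# Sketch of the DDCS pod sizing rule: provision at least 3.3 Gbps per GPU,
# then divide the aggregate demand by each node's NIC bandwidth and round up.

def required_ddcs_pods(gpu_count: int, gbps_per_gpu: float = 3.3,
                       node_bandwidth_gbps: float = 10.0) -> int:
    """Number of DDCS pods needed to satisfy aggregate bandwidth demand."""
    total_gbps = gpu_count * gbps_per_gpu
    return math.ceil(total_gbps / node_bandwidth_gbps)

print(required_ddcs_pods(25))  # 25 GPUs * 3.3 Gbps = 82.5 Gbps -> prints 9
```

Rounding up matters: 8.25 nodes of demand cannot be served by 8 pods without saturating their interfaces.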
Insufficient Network Bandwidth#
Network bandwidth may be insufficient for the workload. DDCS requires high bandwidth to serve derived data efficiently. It is recommended to have 3.3 Gbps per GPU, with a minimum of 1 Gbps.
Network Congestion#
In shared network environments, other workloads may consume bandwidth, causing congestion and degraded performance for DDCS traffic.
Other Possible Causes#
Misconfigured Kubernetes Networking
Suboptimal routing causing increased latency
Network policies limiting throughput
Cross-availability zone traffic
Cross-Availability Zone or Cross-Region Traffic
Pods in different availability zones causing higher latency
Cross-region traffic increasing latency significantly
Public internet transit instead of private networking
Node-Level Network Issues
Node network interface problems
Network driver issues
Hardware network limitations
Troubleshooting Steps#
Diagnostic Steps for Known Root Causes#
1. Check DDCS Pod Count and Scaling#
Verify there are sufficient DDCS pods scheduled to handle network load. Refer to the DDCS Configuration Guide for scaling requirements.
# Check current DDCS pod count
kubectl get pods -n ddcs -l app.kubernetes.io/instance=ddcs
# Calculate required DDCS pods
# Required pods = (GPU count * 3.3 Gbps) / compute_node_bandwidth
# Example: 25 GPUs * 3.3 Gbps = 82.5 Gbps; 82.5 / 10 Gbps per node = 8.25, round up to 9 pods
# Check network metrics per pod
# Query: container_network_transmit_bytes_total{pod=~"ddcs-.*"}
# Query: container_network_receive_bytes_total{pod=~"ddcs-.*"}
Analysis:
Pod count below requirements indicates insufficient scaling.
High network utilization per pod suggests pods are network-bound.
Network metrics approaching interface limits indicate saturation.
Resolution:
Scale DDCS StatefulSet to match calculated pod requirements.
Refer to DDCS Configuration Guide for scaling guidance.
Update Helm values:
helm upgrade <ddcs-release-name> omniverse/ddcs -n ddcs -f values.yaml
Monitor network metrics after scaling to verify improvement.
2. Monitor Network Metrics#
Review network metrics to identify bandwidth saturation or bottlenecks.
# Application-level throughput: bytes returned from all cache levels
rate(ddcs_m_adapter_bytes_returned_total[5m])
# Per-pod network throughput (bytes/second)
rate(container_network_receive_bytes_total{pod=~"ddcs-.*"}[5m])
rate(container_network_transmit_bytes_total{pod=~"ddcs-.*"}[5m])
# Network utilization: convert to bits/second and compare against NIC capacity
sum by (pod) (rate(container_network_transmit_bytes_total{pod=~"ddcs-.*"}[5m])) * 8
Analysis:
High ddcs_m_adapter_bytes_returned_total relative to network capacity indicates potential bottlenecks.
Network transmit/receive bytes approaching interface limits indicate saturation.
High utilization across nodes suggests network congestion.
Resolution:
If network is saturated, scale DDCS pods to distribute load (see step 1).
Upgrade to VM SKUs with higher NIC speeds if bandwidth is insufficient.
Investigate network congestion from other workloads.
Other Diagnostic Actions#
Check pod placement: Verify DDCS and GPU pods are optimally placed:
kubectl get pods -n ddcs -o wide
kubectl get pods -n <workload-namespace> -o wide
# Ensure pods are in the same availability zone for optimal performance
Review network policies: Check if network policies are affecting performance:
kubectl get networkpolicies -n ddcs
kubectl describe networkpolicy <policy-name> -n ddcs
Monitor network trends: Track network performance over time:
# Use cloud provider network metrics
# Monitor throughput, latency, and error rates
# Identify patterns and trends
Prevention#
Proactive Monitoring#
Set up alerts for:
Network bandwidth thresholds: Alert when network utilization exceeds 80% of available bandwidth
DDCS pod count: Alert when DDCS pod count is below calculated requirements
Network saturation: Alert when container_network_transmit_bytes_total or container_network_receive_bytes_total approach interface limits
High bytes returned: Alert when ddcs_m_adapter_bytes_returned_total indicates potential network bottlenecks
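The saturation alert above could be expressed as a Prometheus alerting rule. The following is a sketch, not a shipped configuration: the group name, alert name, 80% threshold, and the assumed 10 Gbps NIC capacity are all examples to adjust for your environment.

```yaml
# Illustrative alerting rule; group name, alert name, threshold, and the
# 10 Gbps NIC assumption are examples, not DDCS-provided defaults.
groups:
  - name: ddcs-network
    rules:
      - alert: DDCSNetworkSaturation
        expr: |
          sum by (pod) (
            rate(container_network_transmit_bytes_total{pod=~"ddcs-.*"}[5m])
          ) * 8 > 0.8 * 10e9
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "DDCS pod {{ $labels.pod }} is above 80% of assumed 10 Gbps NIC capacity"
```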