DDCS: Network Bandwidth or Latency Bottlenecks#
Overview#
DDCS (Derived Data Cache Service) serves derived data to render workers over the network and requires high bandwidth and low latency to support efficient data transfer. It is generally recommended to provision 3.3 Gbps of network bandwidth for each GPU in a cluster, with a minimum of 1 Gbps per GPU.
Network bandwidth or latency bottlenecks can occur when:
Insufficient DDCS pods scheduled to handle network load
Insufficient network bandwidth between DDCS and GPU nodes
Network congestion from noisy neighbors in shared environments
Misconfigured Kubernetes networking causing suboptimal routing
Cross-availability zone or cross-data center traffic
When network bottlenecks occur, DDCS cannot efficiently serve data to render workers, causing slow cache population, degraded throughput, and increased scene load times.
Symptoms and Detection Signals#
Visible Symptoms#
Slow cache population - Cache taking longer than expected to populate with derived data
Degraded throughput - Reduced data transfer rates between DDCS and GPU nodes
Increased scene load times - Scene loads taking significantly longer than expected
Metric Signals#
The following Prometheus metrics can be used to detect network bandwidth or latency bottlenecks. Review these metrics to determine whether networking is the bottleneck.
ddcs_m_adapter_bytes_returned_total{}
container_network_receive_bytes_total{pod=~"ddcs-.*"}
container_network_transmit_bytes_total{pod=~"ddcs-.*"}
Network saturation can be identified through several complementary
metrics. The ddcs_m_adapter_bytes_returned_total metric tracks total
bytes returned from all cache levels (in-memory and disk), providing
application-level visibility into data throughput. When this metric
shows high values relative to available network capacity, it suggests
DDCS may be saturating network bandwidth serving cached content to
clients.
Container-level network metrics provide direct visibility into network
interface utilization. The container_network_transmit_bytes_total
metric tracks bytes transmitted by DDCS pods, where high values
approaching network interface limits indicate outbound network
saturation as DDCS serves data to clients. Complementing this,
container_network_receive_bytes_total measures inbound traffic to
DDCS pods. By monitoring both transmit and receive metrics together, you
can identify whether DDCS pods are network-bound and approaching the
physical limits of their network interfaces. Compare these values
against the network interface capacity of your VM SKUs to determine if
network scaling is needed.
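The comparison described above can be sketched in code. The following is a minimal illustration, not part of DDCS: the helper name `nic_utilization` and all sample values are hypothetical, and it assumes you have already sampled a monotonic byte counter (such as container_network_transmit_bytes_total) at two points in time.

```python
# Sketch: estimate NIC utilization from two samples of a Prometheus byte
# counter (e.g. container_network_transmit_bytes_total for a DDCS pod).
# The helper name, sample values, and 10 Gbps capacity are illustrative.

def nic_utilization(bytes_t0: int, bytes_t1: int, interval_s: float,
                    nic_capacity_gbps: float) -> float:
    """Return fractional utilization (0.0-1.0) of a NIC over an interval."""
    throughput_bps = (bytes_t1 - bytes_t0) * 8 / interval_s  # bits per second
    return throughput_bps / (nic_capacity_gbps * 1e9)

# Example: a pod transmitted 37.5 GB over 60 s on a 10 Gbps interface.
util = nic_utilization(0, 37_500_000_000, 60.0, 10.0)
print(f"{util:.0%}")  # prints "50%"
```

Sustained values near 1.0 on DDCS pods indicate the interface is saturated and the service is network-bound.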
Root Cause Analysis#
Known Causes#
Network bandwidth or latency bottlenecks in DDCS are typically caused by insufficient DDCS pods scheduled, insufficient network bandwidth, or network congestion.
Insufficient DDCS Pods Scheduled#
DDCS must be configured with a replica count matching the number of compute nodes required to meet the cluster's network bandwidth demand. The rule of thumb is to provision at least 3.3 Gbps of bandwidth per GPU. If too few DDCS pods are scheduled, the available pods can become network-bound, causing bottlenecks. Refer to the DDCS Configuration Guide for scaling guidance.
Check DDCS pod count:
# Check current DDCS pod count
kubectl get pods -n ddcs -l app.kubernetes.io/instance=ddcs
# Check StatefulSet replica configuration
kubectl get statefulset -n ddcs -l app.kubernetes.io/instance=ddcs
kubectl describe statefulset -n ddcs <ddcs-statefulset-name>
# Calculate required DDCS pods
# Required pods = (GPU count * 3.3 Gbps) / compute_node_bandwidth
# Example: 25 GPUs * 3.3 Gbps = 82.5 Gbps; 82.5 / 10 Gbps per node = 8.25, round up to 9 pods
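The sizing rule above can be expressed as a short calculation. This is a sketch: the function name `required_ddcs_pods` and the example GPU count and per-node bandwidth are illustrative, not values from a real cluster.

```python
import math

# Sketch of the DDCS pod sizing rule: provision at least 3.3 Gbps per GPU,
# then divide the aggregate demand by each node's NIC bandwidth and round up.

def required_ddcs_pods(gpu_count: int, gbps_per_gpu: float = 3.3,
                       node_bandwidth_gbps: float = 10.0) -> int:
    """Number of DDCS pods needed to satisfy aggregate bandwidth demand."""
    total_gbps = gpu_count * gbps_per_gpu
    return math.ceil(total_gbps / node_bandwidth_gbps)

print(required_ddcs_pods(25))  # 25 GPUs * 3.3 Gbps = 82.5 Gbps -> prints 9
```

Rounding up matters: 8.25 nodes of demand cannot be served by 8 pods without saturating their interfaces.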
Insufficient Network Bandwidth#
Network bandwidth may be insufficient for the workload. DDCS requires high bandwidth to serve derived data efficiently. It is recommended to have 3.3 Gbps per GPU, with a minimum of 1 Gbps.
Network Congestion#
In shared network environments, other workloads may consume bandwidth, causing congestion and degraded performance for DDCS traffic.
Other Possible Causes#
Misconfigured Kubernetes Networking
Suboptimal routing causing increased latency
Network policies limiting throughput
Cross-availability zone traffic
Cross-Availability Zone or Cross-Region Traffic
Pods in different availability zones causing higher latency
Cross-region traffic increasing latency significantly
Public internet transit instead of private networking
Node-Level Network Issues
Node network interface problems
Network driver issues
Hardware network limitations
Troubleshooting Steps#
Diagnostic Steps for Known Root Causes#
1. Check DDCS Pod Count and Scaling#
Verify there are sufficient DDCS pods scheduled to handle network load. Refer to the DDCS Configuration Guide for scaling requirements.
# Check current DDCS pod count
kubectl get pods -n ddcs -l app.kubernetes.io/instance=ddcs
# Calculate required DDCS pods
# Required pods = (GPU count * 3.3 Gbps) / compute_node_bandwidth
# Example: 25 GPUs * 3.3 Gbps = 82.5 Gbps; 82.5 / 10 Gbps per node = 8.25, round up to 9 pods
# Check network metrics per pod
# Query: container_network_transmit_bytes_total{pod=~"ddcs-.*"}
# Query: container_network_receive_bytes_total{pod=~"ddcs-.*"}
Analysis:
Pod count below requirements indicates insufficient scaling.
High network utilization per pod suggests pods are network-bound.
Network metrics approaching interface limits indicate saturation.
Resolution:
Scale DDCS StatefulSet to match calculated pod requirements.
Refer to DDCS Configuration Guide for scaling guidance.
Update Helm values:
helm upgrade <ddcs-release-name> omniverse/ddcs -n ddcs -f values.yaml
Monitor network metrics after scaling to verify improvement.
2. Monitor Network Metrics#
Review network metrics to identify bandwidth saturation or bottlenecks.
# Application-level throughput: bytes returned from all cache levels
rate(ddcs_m_adapter_bytes_returned_total[5m])
# Per-pod network throughput (bytes/second)
rate(container_network_receive_bytes_total{pod=~"ddcs-.*"}[5m])
rate(container_network_transmit_bytes_total{pod=~"ddcs-.*"}[5m])
# Network utilization: convert to bits/second and compare against NIC capacity
sum by (pod) (rate(container_network_transmit_bytes_total{pod=~"ddcs-.*"}[5m])) * 8
Analysis:
High ddcs_m_adapter_bytes_returned_total relative to network capacity indicates potential bottlenecks.
Network transmit/receive bytes approaching interface limits indicate saturation.
High utilization across nodes suggests network congestion.
Resolution:
If network is saturated, scale DDCS pods to distribute load (see step 1).
Upgrade to VM SKUs with higher NIC speeds if bandwidth is insufficient.
Investigate network congestion from other workloads.
Other Diagnostic Actions#
Check pod placement: Verify DDCS and GPU pods are optimally placed:
kubectl get pods -n ddcs -o wide
kubectl get pods -n <workload-namespace> -o wide
# Ensure pods are in the same availability zone for optimal performance
Review network policies: Check if network policies are affecting performance:
kubectl get networkpolicies -n ddcs
kubectl describe networkpolicy <policy-name> -n ddcs
Monitor network trends: Track network performance over time:
# Use cloud provider network metrics
# Monitor throughput, latency, and error rates
# Identify patterns and trends
Prevention#
Proactive Monitoring#
Set up alerts for:
Network bandwidth thresholds: Alert when network utilization exceeds 80% of available bandwidth
DDCS pod count: Alert when DDCS pod count is below calculated requirements
Network saturation: Alert when container_network_transmit_bytes_total or container_network_receive_bytes_total approach interface limits
High bytes returned: Alert when ddcs_m_adapter_bytes_returned_total indicates potential network bottlenecks
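The saturation alert above could be expressed as a Prometheus alerting rule. The following is a sketch, not a shipped configuration: the group name, alert name, 80% threshold, and the assumed 10 Gbps NIC capacity are all examples to adjust for your environment.

```yaml
# Illustrative alerting rule; group name, alert name, threshold, and the
# 10 Gbps NIC assumption are examples, not DDCS-provided defaults.
groups:
  - name: ddcs-network
    rules:
      - alert: DDCSNetworkSaturation
        expr: |
          sum by (pod) (
            rate(container_network_transmit_bytes_total{pod=~"ddcs-.*"}[5m])
          ) * 8 > 0.8 * 10e9
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "DDCS pod {{ $labels.pod }} is above 80% of assumed 10 Gbps NIC capacity"
```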