DDCS: No Active Shards#

Overview#

When DDCS is enabled, which is the case for all workflows, at least one healthy pod must be serving traffic. DDCS is the storage medium for derived data: when every pod is unhealthy, rendering stops.

DDCS (Derived Data Cache Service) uses client-side logic to distribute traffic across multiple pods. When no DDCS pod is healthy, write-render-worker pods cannot read or write derived data and begin to fail.

Clients discover DDCS peers via DNS (e.g., ddcs.ddcs.svc.cluster.local). Discovery can return IPs that are not reachable, or zero IPs when no DDCS pods are scheduled. Either case disrupts the rendering pipeline.
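As a quick sanity check, the discovery name can be resolved from inside the cluster and compared against the pods that are actually scheduled; a sketch assuming the default service name and namespace:

```shell
# Resolve the DDCS discovery name from a short-lived debug pod
kubectl run -it --rm dns-check --image=busybox --restart=Never -- \
  nslookup ddcs.ddcs.svc.cluster.local

# Compare the returned IPs against the scheduled DDCS pods
kubectl get pods -n ddcs -o wide
```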

Symptoms and Detection Signals#

Visible Symptoms#

  • Application crashes on startup with errors related to DDCS connection failures

  • Active sessions unable to open files: previously working sessions suddenly cannot load new assets

  • Error messages in application logs indicating DDCS is unavailable

Log Messages#

0 Active Shards Log Message#

Where to find these logs:

  • Pod: write-render-worker

  • Location: Cloud Function Pod

  • Application: KIT

  • Description: The client discovered peers but determined all peers were unhealthy

# LEVEL: Info
# SOURCE: omni.datastore.health
kubernetes.pod_name: write-render-worker* and
message: "*ShardHealthCheck: active shard list changed, have 0 active shards!*"

DNS Discovery Failed#

Where to find these logs:

  • Pod: write-render-worker

  • Location: Cloud Function Pod

  • Application: KIT

  • Description: The client was configured to discover DDCS pods but there is a problem finding peers

# LEVEL: Error
# SOURCE: omni.datastore
kubernetes.pod_name: write-render-worker* and
message: "*Unable to discover grpc datastore service uri*"

Metric Signals#

Monitor the following Prometheus metrics and Kubernetes states to detect DDCS availability issues:

kube_pod_status_ready
kube_pod_status_phase
kube_pod_container_status_restarts_total

Pod readiness is tracked through kube_pod_status_ready, where values showing false for extended periods indicate pods that are not ready to accept traffic. These unready pods may still appear in DNS SRV records, causing clients to attempt connections to unavailable endpoints. Cross-reference DNS entries against ready pods to identify stale DNS entries.
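One way to cross-reference the two, assuming the Service and namespace are both named ddcs:

```shell
# Addresses Kubernetes currently considers ready (only Ready pods appear here)
kubectl get endpoints ddcs -n ddcs -o wide

# Addresses DNS is handing out to clients
kubectl run -it --rm dns-check --image=busybox --restart=Never -- \
  nslookup ddcs.ddcs.svc.cluster.local
```

Any IP present in DNS but absent from the endpoints list is a stale entry.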

Pod lifecycle state is captured in kube_pod_status_phase, revealing pods in Pending, Failed, or Unknown phases that are not reachable (CrashLoopBackOff is reported per-container, via kube_pod_container_status_waiting_reason, rather than as a pod phase). Even though these pods are not functional, they may still appear in DNS discovery results, leading to connection failures when clients attempt to reach them.

Pod stability can be assessed through kube_pod_container_status_restarts_total, where high restart counts indicate pods that may be intermittently unavailable. During restart periods, these pods are temporarily unreachable, causing connection failures if clients discover them between restarts.

Beyond metrics, DDCS pod endpoint health should be verified by testing gRPC connectivity on port 3010. Pods responding to health checks but not accepting gRPC requests indicate a service-level issue rather than pod failure, suggesting application-specific problems within the DDCS service itself.

Root Cause Analysis#

Known Causes#

StatefulSet Scaled to 0#

For various reasons an operator may want to scale the DDCS StatefulSet to zero. This may be done through the values configuration, by setting replicas to 0.

In this scenario, the operator has chosen not to schedule any DDCS pods, but sessions continue to rely on the now-unavailable service. The symptoms described in this document will appear.
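For reference, the condition is typically produced by a values fragment of this shape (the exact key path depends on the chart version and is illustrative here):

```yaml
# Illustrative only -- consult the chart's values schema for the real key path
ddcs:
  replicas: 0   # no DDCS pods will be scheduled
```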

Not Authorized for Nvidia Image repository#

The namespace may not have the appropriate secret to access the Nvidia container repository.

Check the Kubernetes pods in the namespace. If all pods are stuck with ImagePullBackOff status, kubectl describe the pods. The log will clarify why Kubernetes is unable to obtain the image. If it appears that authorization is an issue, ensure that the pull secret configured for the installation has access to the Nvidia repository.
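If the secret is missing entirely, one can be created and referenced from the values configuration; a sketch assuming the nvcr.io registry and an NGC API key exported as NGC_API_KEY (the secret name is illustrative):

```shell
# Create an image pull secret for the NVIDIA registry.
# NGC uses the literal username $oauthtoken with an API key as the password.
kubectl create secret docker-registry ngc-secret \
  --namespace ddcs \
  --docker-server=nvcr.io \
  --docker-username='$oauthtoken' \
  --docker-password="$NGC_API_KEY"
```

The secret name must then match what the installation references in its pull-secret configuration.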

PVC Attachment Failures#

Pods may be stuck in Pending state. In this case kubectl describe the pods and PVCs in the namespace. Kubernetes may explain that PVCs are failing to attach or cannot be created. Depending on the CSP, there may be a billing or configuration issue. Reference the documentation for your Kubernetes platform.

If need be, delete the PVC(s) for DDCS and then delete the pods. Renders will be slower while the cache is rebuilt.

Misc Scheduling Issue#

Pods stuck in Pending state are often not scheduled due to insufficient resources.

Ensure DDCS is properly configured to be scheduled on the resources provisioned in your environment. Allocate more or larger CPU VM SKUs to the Kubernetes node pool, or modify the values configuration to reduce resource consumption; be mindful that reducing resources will affect overall performance.
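A hypothetical values fragment for lowering resource requests (the exact key paths depend on the chart version):

```yaml
# Illustrative only -- consult the chart's values schema for real key paths
ddcs:
  resources:
    requests:
      cpu: "2"      # lower requests make scheduling easier,
      memory: 8Gi   # at the cost of cache performance
```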

Other Possible Causes#

  1. DNS Propagation Delays

    • DNS not updated after pod termination

    • DNS cache TTL longer than pod lifecycle transitions

    • Stale DNS entries in client-side DNS cache

  2. Network Connectivity Issues

    • Network policies blocking access to DDCS pods

    • Confirm that network policies allow communication between functions and DDCS pods

  3. Service Endpoint Mismatch

    • Kubernetes Service endpoints not updated to reflect current pod states

    • Service selector not matching current pod labels

    • Endpoints controller not reconciling pod changes

  4. Resource Constraints

    • Pods evicted due to node resource pressure

    • Pods unable to start due to insufficient node resources

    • Storage issues preventing pod startup

Troubleshooting Steps#

Diagnostic Steps for Known Root Causes#

1. Check if DDCS StatefulSet is Scaled Down to 0#

If DDCS is enabled in the application configuration but the StatefulSet has 0 replicas, no pods will be available to serve requests.

# Check StatefulSet replica configuration
kubectl get statefulset -n ddcs -l app.kubernetes.io/instance=ddcs -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.spec.replicas}{" replicas configured (current: "}{.status.replicas}{" running, "}{.status.readyReplicas}{" ready)"}{"\n"}{end}'

# Alternative: Simple check of all StatefulSets in ddcs namespace
kubectl get statefulset -n ddcs

# Verify if any DDCS pods exist
kubectl get pods -n ddcs -l app.kubernetes.io/instance=ddcs

Analysis:

  • If StatefulSet shows 0 replicas configured, this is the root cause.

  • If no pods are returned from the pod query, DDCS is not running.

  • If DNS SRV records exist but no pods are running, DNS may be stale or StatefulSet was recently scaled down.

Resolution:

  • Scale the StatefulSet back up to the desired replica count (see Resolution Steps section).
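An imperative sketch for immediate recovery (the replica count is illustrative; also persist the change in the Helm values so the next upgrade does not scale it back down):

```shell
# Scale the DDCS StatefulSet back up (example: 3 replicas)
kubectl scale statefulset -n ddcs -l app.kubernetes.io/instance=ddcs --replicas=3

# Watch the pods become Ready
kubectl get pods -n ddcs -l app.kubernetes.io/instance=ddcs -w
```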

2. Check for Image Pull Authorization Issues#

If pods are stuck in ImagePullBackOff status, the namespace may not have the appropriate secret to access the NVIDIA container repository.

# Check pod status for ImagePullBackOff
kubectl get pods -n ddcs -l app.kubernetes.io/instance=ddcs

# Describe pods to see detailed error messages
kubectl describe pods -n ddcs -l app.kubernetes.io/instance=ddcs | grep -A 10 "Events:"

# Verify image pull secrets exist
kubectl get secrets -n ddcs | grep -E "docker-registry|regcred|ngc"

Analysis:

  • Pods showing ImagePullBackOff status indicate authorization issues.

  • Error messages in pod events will clarify why Kubernetes cannot obtain the image.

  • Missing or incorrect image pull secrets indicate the root cause.

Resolution:

  • Ensure the pull secret configured for the installation has access to the NVIDIA repository.

  • Verify the secret name matches what’s configured in Helm values under image.pullSecrets.

3. Check for PVC Attachment Failures#

Pods may be stuck in Pending state due to Persistent Volume Claim (PVC) attachment failures.

# Check pod status
kubectl get pods -n ddcs -l app.kubernetes.io/instance=ddcs

# Describe pods to see PVC-related errors
kubectl describe pods -n ddcs -l app.kubernetes.io/instance=ddcs | grep -A 10 "Events:"

# Check PVC status
kubectl get pvc -n ddcs

# Describe PVCs to see attachment issues
kubectl describe pvc -n ddcs

Analysis:

  • Pods stuck in Pending state with PVC-related errors in events.

  • PVCs showing Pending or Failed status.

  • Error messages indicating attachment failures or billing/configuration issues.

Resolution:

  • Reference documentation for your Kubernetes platform (AWS EKS, Azure AKS, etc.).

  • Check for billing or quota issues with storage.

  • If necessary, delete the PVC(s) for DDCS and then delete the pods (renders will be slower while cache is rebuilt).
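If deletion is required, a sketch using the same label selector as the diagnostic commands above (if the selector matches no PVCs, delete them by name from the list instead):

```shell
# List the PVCs that belong to DDCS before deleting anything
kubectl get pvc -n ddcs

# Delete the DDCS PVCs and pods. PVC protection keeps the deletion
# pending until the pod using the volume is gone, so delete both.
# The StatefulSet recreates pods and volumes; renders are slower
# while the cache is rebuilt.
kubectl delete pvc -n ddcs -l app.kubernetes.io/instance=ddcs
kubectl delete pods -n ddcs -l app.kubernetes.io/instance=ddcs
```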

4. Check for Resource Scheduling Issues#

Pods stuck in Pending state may be unscheduled due to insufficient node resources.

# Check pod status and scheduling events
kubectl get pods -n ddcs -l app.kubernetes.io/instance=ddcs -o wide

# Describe pods to see why they're not scheduled
kubectl describe pods -n ddcs -l app.kubernetes.io/instance=ddcs | grep -A 10 "Events:"

# Check node resources
kubectl top nodes

# Check if nodes have sufficient resources
kubectl describe nodes | grep -A 5 "Allocated resources"

Analysis:

  • Pods in Pending state with events indicating insufficient CPU/memory.

  • Nodes at or near capacity.

  • No available nodes matching pod resource requirements.

Resolution:

  • Allocate more or larger CPU VM SKUs to the Kubernetes node pool.

  • Modify the values configuration to reduce resource consumption (be mindful this affects performance).

Other Diagnostic Actions#

  • DNS SRV record count mismatch: Compare the number of SRV records returned by DNS resolution against the number of healthy DDCS pods:

    # Get DNS SRV records
    kubectl run -it --rm debug --image=busybox --restart=Never -- nslookup -type=SRV _grpc._tcp.ddcs.ddcs.svc.cluster.local
    
  • Stale DNS SRV entries: DNS SRV records may contain endpoints for pods that no longer exist or have been terminated

  • Pod endpoint mismatch: SRV records pointing to pod endpoints (hostname:port) that don’t match current pod endpoints

  • gRPC connection failures: Attempt to connect to each discovered DDCS endpoint from a client pod:

    # From within a workload pod or debug pod (the image's entrypoint is grpcurl)
    kubectl run -it --rm debug --image=fullstorydev/grpcurl --restart=Never -- -plaintext <ddcs-pod-ip>:3010 list
    
  • Port 3010 unreachable: Network policies or firewall rules blocking access to DDCS gRPC port

  • Service endpoint verification: Verify Kubernetes Service endpoints match actual pod IPs:

    kubectl get endpoints ddcs -n ddcs -o yaml
    

Prevention#

Proactive Monitoring#

Set up alerts for:

  • Pod count vs DNS SRV record count mismatch: Alert when number of DNS SRV records doesn’t match number of healthy DDCS pods

  • Pods in non-ready state: Alert when pods remain non-ready for more than 5 minutes

  • High pod restart rates: Alert on elevated restart counts indicating instability

  • Service endpoint mismatches: Alert when service endpoints don’t match SRV record targets

  • gRPC connection failures: Monitor application-level metrics for DDCS connection failures
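As a starting point, the pod-readiness alert might look like the following Prometheus rule; the namespace, pod name pattern, and threshold are illustrative and should be tuned for your environment:

```yaml
groups:
  - name: ddcs-availability
    rules:
      - alert: DDCSPodsNotReady
        # Fires when a DDCS pod has reported not-ready for more than 5 minutes
        expr: kube_pod_status_ready{condition="false", namespace="ddcs", pod=~"ddcs.*"} == 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "DDCS pod {{ $labels.pod }} has been non-ready for 5 minutes"
```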