DDCS: No Active Shards#
Overview#
DDCS (Derived Data Cache Service) is the storage medium for derived data. It uses client-side logic to distribute traffic across multiple pods, and when DDCS is enabled (all workflows) at least one healthy pod must be serving traffic. When all pods are unhealthy, rendering stops and errors surface in the write-render-worker pods.
Clients discover DDCS peers with DNS (e.g.,
ddcs.ddcs.svc.cluster.local). Discovery can return
IPs that are not reachable, or 0 IPs when no DDCS pods are scheduled.
Either case leaves clients without a usable shard and stalls the rendering pipeline.
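A quick way to see exactly what client-side discovery returns is to resolve the service name from inside the cluster. This is a sketch only; it reuses the busybox pattern and the ddcs.ddcs.svc.cluster.local name that appear later in this guide, which may differ in your deployment.
# Resolve the DDCS service from inside the cluster; zero answers, or
# addresses that do not match running pods, mirror what the client sees.
kubectl run -it --rm dns-check --image=busybox --restart=Never -- nslookup ddcs.ddcs.svc.cluster.local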
Symptoms and Detection Signals#
Visible Symptoms#
Application crashes on startup with errors related to DDCS connection failures
Active sessions unable to open files - previously working sessions suddenly cannot load new assets
Error messages in application logs indicating DDCS is unavailable
Log Messages#
0 Active Shards Log Message#
Where to find these logs:
Pod: write-render-worker
Location: Cloud Function Pod
Application: KIT
Description: The client discovered peers but determined all peers were unhealthy
# LEVEL: Info
# SOURCE: omni.datastore.health
kubernetes.pod_name: write-render-worker* and
message: "*ShardHealthCheck: active shard list changed, have 0 active shards!*"
DNS Discovery Failed#
Where to find these logs:
Pod: write-render-worker
Location: Cloud Function Pod
Application: KIT
Description: The client was configured to discover DDCS pods but there is a problem finding peers
# LEVEL: Error
# SOURCE: omni.datastore
kubernetes.pod_name: write-render-worker* and
message: "*Unable to discover grpc datastore service uri*"
Metric Signals#
Monitor the following Prometheus metrics and Kubernetes states to detect DDCS availability issues:
kube_pod_status_ready
kube_pod_status_phase
kube_pod_container_status_restarts_total
Pod readiness is tracked through kube_pod_status_ready, where values
showing false for extended periods indicate pods that are not ready
to accept traffic. These unready pods may still appear in DNS SRV
records, causing clients to attempt connections to unavailable
endpoints. Cross-reference DNS entries against ready pods to identify
stale DNS entries.
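One way to do that cross-reference, using the same namespace and label selector as the troubleshooting commands later in this guide (adjust both if your deployment differs), is to list each pod's IP next to its Ready condition and compare the IPs against the DNS answers:
# Name, pod IP, and Ready status for every DDCS pod; compare the IPs
# against the addresses returned by DNS discovery.
kubectl get pods -n ddcs -l app.kubernetes.io/instance=ddcs -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.podIP}{"\t"}{.status.conditions[?(@.type=="Ready")].status}{"\n"}{end}'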
Pod lifecycle state is captured in kube_pod_status_phase, revealing
pods in Pending, Failed, or CrashLoopBackOff states that are
not reachable. Even though these pods are not functional, they may still
appear in DNS discovery results, leading to connection failures when
clients attempt to reach them.
Pod stability can be assessed through
kube_pod_container_status_restarts_total, where high restart counts
indicate pods that may be intermittently unavailable. During restart
periods, these pods are temporarily unreachable, causing connection
failures if clients discover them between restarts.
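A simple way to spot flapping pods without querying Prometheus, assuming the same namespace and label selector as above, is to sort the pod list by restart count:
# Sort DDCS pods by container restart count; high counts at the bottom
# of the list indicate pods that are intermittently unavailable.
kubectl get pods -n ddcs -l app.kubernetes.io/instance=ddcs --sort-by='.status.containerStatuses[0].restartCount'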
Beyond metrics, DDCS pod endpoint health should be verified by testing gRPC connectivity on port 3010. Pods responding to health checks but not accepting gRPC requests indicate a service-level issue rather than pod failure, suggesting application-specific problems within the DDCS service itself.
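To separate a network-level problem from a service-level one, a raw TCP check against port 3010 can be run before the gRPC test shown later in this guide. This is a sketch; it assumes the busybox image in use provides nc with -z/-v support.
# Check raw TCP reachability of the gRPC port; replace <ddcs-pod-ip>
# with an address from the pod listing. A refused or timed-out
# connection points at networking; an open port with failing gRPC
# calls points at the DDCS service itself.
kubectl run -it --rm port-check --image=busybox --restart=Never -- nc -zv <ddcs-pod-ip> 3010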
Root Cause Analysis#
Known Causes#
StatefulSet Scaled to 0#
For various reasons an operator may want to scale the DDCS
StatefulSet “to zero”. This may be done through the values
configuration, by setting replicas to 0.
In this scenario, the operator has chosen to not schedule any DDCS pods. However, sessions continue to rely on the unavailable service. The symptoms described in this document will appear.
PVC Attachment Failures#
Pods may be stuck in Pending state. In this case
kubectl describe the pods and PVCs in the namespace. Kubernetes may
explain that PVCs are failing to attach or cannot be created. Depending
on the CSP, there may be a billing or configuration issue. Reference the
documentation for your Kubernetes platform.
If need be, delete the PVC(s) for DDCS and then delete the pods. Renders will be slower while the cache is rebuilt.
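A sketch of that last-resort cleanup, assuming the claims and pods carry the app.kubernetes.io/instance=ddcs label used by the other commands in this guide (verify the selector before deleting anything):
# Last resort: remove the DDCS cache volumes, then the pods, so fresh
# PVCs are created. The cache is rebuilt from scratch afterwards.
kubectl delete pvc -n ddcs -l app.kubernetes.io/instance=ddcs
kubectl delete pods -n ddcs -l app.kubernetes.io/instance=ddcs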
Misc Scheduling Issue#
Pods stuck in Pending state are often not scheduled due to
insufficient resources.
Ensure DDCS is properly configured to be scheduled on the resources
provisioned in your environment. Allocate more or larger CPU VM SKUs
to the Kubernetes node pool. Or, modify the values configuration to
reduce resource consumption. Be mindful that this will affect overall
performance.
Other Possible Causes#
DNS Propagation Delays
DNS not updated after pod termination
DNS cache TTL longer than pod lifecycle transitions
Stale DNS entries in client-side DNS cache
Network Connectivity Issues
Network policies blocking access to DDCS pods
Confirm that network policies allow communication between functions and DDCS pods (see the check after this list)
Service Endpoint Mismatch
Kubernetes Service endpoints not updated to reflect current pod states
Service selector not matching current pod labels
Endpoints controller not reconciling pod changes
Resource Constraints
Pods evicted due to node resource pressure
Pods unable to start due to insufficient node resources
Storage issues preventing pod startup
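For the network policy case above, a quick check is to list and inspect the policies that apply to the DDCS namespace and confirm that ingress on port 3010 is allowed from the workload namespaces (a sketch; adjust namespaces to your deployment):
# List and inspect network policies in the DDCS namespace; look for
# ingress rules that would block gRPC traffic on port 3010.
kubectl get networkpolicy -n ddcs
kubectl describe networkpolicy -n ddcs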
Troubleshooting Steps#
Diagnostic Steps for Known Root Causes#
1. Check if DDCS StatefulSet is Scaled Down to 0#
If DDCS is enabled in the application configuration but the StatefulSet has 0 replicas, no pods will be available to serve requests.
# Check StatefulSet replica configuration
kubectl get statefulset -n ddcs -l app.kubernetes.io/instance=ddcs -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.spec.replicas}{" replicas configured (current: "}{.status.replicas}{" running, "}{.status.readyReplicas}{" ready)"}{"\n"}{end}'
# Alternative: Simple check of all StatefulSets in ddcs namespace
kubectl get statefulset -n ddcs
# Verify if any DDCS pods exist
kubectl get pods -n ddcs -l app.kubernetes.io/instance=ddcs
Analysis:
If the StatefulSet shows 0 replicas configured, this is the root cause.
If no pods are returned from the pod query, DDCS is not running.
If DNS SRV records exist but no pods are running, DNS may be stale or the StatefulSet was recently scaled down.
Resolution:
Scale the StatefulSet back up to the desired replica count (see Resolution Steps section).
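A minimal sketch of that scale-up, assuming the StatefulSet is named ddcs (use the name and replica count reported by the commands in this step and defined in your values configuration):
# Restore the DDCS StatefulSet to the desired replica count and watch
# the pods return to Ready.
kubectl scale statefulset ddcs -n ddcs --replicas=2
kubectl get pods -n ddcs -l app.kubernetes.io/instance=ddcs -w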
2. Check for PVC Attachment Failures#
Pods may be stuck in Pending state due to Persistent Volume Claim
(PVC) attachment failures.
# Check pod status
kubectl get pods -n ddcs -l app.kubernetes.io/instance=ddcs
# Describe pods to see PVC-related errors
kubectl describe pods -n ddcs -l app.kubernetes.io/instance=ddcs | grep -A 10 "Events:"
# Check PVC status
kubectl get pvc -n ddcs
# Describe PVCs to see attachment issues
kubectl describe pvc -n ddcs
Analysis:
Pods stuck in Pending state with PVC-related errors in events.
PVCs showing Pending or Failed status.
Error messages indicating attachment failures or billing/configuration issues.
Resolution:
Reference documentation for your Kubernetes platform (AWS EKS, Azure AKS, etc.).
Check for billing or quota issues with storage.
If necessary, delete the PVC(s) for DDCS and then delete the pods (renders will be slower while cache is rebuilt).
3. Check for Resource Scheduling Issues#
Pods stuck in Pending state may be unscheduled due to insufficient
node resources.
# Check pod status and scheduling events
kubectl get pods -n ddcs -l app.kubernetes.io/instance=ddcs -o wide
# Describe pods to see why they're not scheduled
kubectl describe pods -n ddcs -l app.kubernetes.io/instance=ddcs | grep -A 10 "Events:"
# Check node resources
kubectl top nodes
# Check if nodes have sufficient resources
kubectl describe nodes | grep -A 5 "Allocated resources"
Analysis:
Pods in Pending state with events indicating insufficient CPU/memory.
Nodes at or near capacity.
No available nodes matching pod resource requirements.
Resolution:
Allocate more or larger CPU VM SKUs to the Kubernetes node pool.
Modify the values configuration to reduce resource consumption (be mindful this affects performance).
Other Diagnostic Actions#
DNS SRV record count mismatch: Compare the number of SRV records returned by DNS resolution against the number of healthy DDCS pods:
# Get DNS SRV records
kubectl run -it --rm debug --image=busybox --restart=Never -- nslookup -type=SRV _grpc._tcp.ddcs.ddcs.svc.cluster.local
Stale DNS SRV entries: DNS SRV records may contain endpoints for pods that no longer exist or have been terminated
Pod endpoint mismatch: SRV records pointing to pod endpoints (hostname:port) that don’t match current pod endpoints
gRPC connection failures: Attempt to connect to each discovered DDCS endpoint from a client pod:
# From within a workload pod or debug pod
kubectl run -it --rm debug --image=fullstorydev/grpcurl --restart=Never --command -- grpcurl -plaintext <ddcs-pod-ip>:3010 list
Port 3010 unreachable: Network policies or firewall rules blocking access to DDCS gRPC port
Service endpoint verification: Verify Kubernetes Service endpoints match actual pod IPs:
kubectl get endpoints ddcs -n ddcs -o yaml
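To complete the comparison, list the current pod IPs next to the endpoints output above (same namespace and label selector as the other steps):
# Pod IPs to compare against the addresses in the Endpoints object.
kubectl get pods -n ddcs -l app.kubernetes.io/instance=ddcs -o wide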
Prevention#
Proactive Monitoring#
Set up alerts for:
Pod count vs DNS SRV record count mismatch: Alert when number of DNS SRV records doesn’t match number of healthy DDCS pods
Pods in non-ready state: Alert when pods remain non-ready for more than 5 minutes (a query sketch follows this list)
High pod restart rates: Alert on elevated restart counts indicating instability
Service endpoint mismatches: Alert when service endpoints don’t match SRV record targets
gRPC connection failures: Monitor application-level metrics for DDCS connection failures
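As a sketch of the non-ready alert above, the condition can be expressed as a PromQL query against the Prometheus HTTP API; the Prometheus URL here is an assumption and should be replaced with your monitoring stack's endpoint.
# Pods in the ddcs namespace that have not been Ready at any point in
# the last 5 minutes (an empty result means no alert should fire).
curl -sG 'http://prometheus.monitoring.svc:9090/api/v1/query' --data-urlencode 'query=max_over_time(kube_pod_status_ready{condition="true", namespace="ddcs"}[5m]) == 0'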