DDCS: Disk Space Exhaustion#
Overview#
DDCS stores derived data on persistent volumes backed by Persistent Volume Claims (PVCs). When the volume attached to DDCS is too small or fills too quickly, pods may crash, enter a crash loop, or report errors writing to disk. DDCS requires persistent storage to maintain derived data, and disk space exhaustion prevents the service from operating correctly.
The storage limit is controlled by the storageLimit configuration
value, which should be set 7-10% below the allocated PVC size to leave
headroom for compaction and WAL (Write-Ahead Log) files. When disk
space is exhausted, RocksDB cannot write new data, flush write buffers,
or perform compaction, causing write stalls, cache eviction, and
service failures.
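As a back-of-envelope check, the relationship between source content size, PVC size, and storageLimit can be sketched in shell. The input numbers below are illustrative, drawn from the sizing guidance elsewhere in this document (3-10x derived data multiplier, storageLimit 7-10% below PVC size):
# Illustrative sizing sketch -- adjust the inputs to your workload
SOURCE_GB=20        # estimated source content size
MULTIPLIER=5        # derived data is often 3-10x source content
DERIVED_GB=$(( SOURCE_GB * MULTIPLIER ))
PVC_GB=$(( DERIVED_GB * 110 / 100 ))   # size the PVC with headroom above expected data
LIMIT_GB=$(( PVC_GB * 92 / 100 ))      # storageLimit: roughly 7-10% below the PVC size
echo "derived=${DERIVED_GB}GB pvc=${PVC_GB}GB storageLimit=${LIMIT_GB}GB"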
Disk space exhaustion can occur when:
Persistent volume allocation is insufficient for workload data growth
Data growth outpaces garbage collection and cleanup mechanisms
Retention policies are misconfigured or not functioning
Garbage collection is not running or is insufficient
Multiple DDCS pods compete for limited storage resources
RocksDB compaction backlog consumes excessive disk space
WAL files accumulate without cleanup
When disk space is exhausted, DDCS pods may fail to start, crash during operation, or report errors when attempting to write data, resulting in service unavailability and degraded rendering performance.
Symptoms and Detection Signals#
Visible Symptoms#
DDCS pods crash - Pods failing with disk-related errors
Crash loop backoff - Pods repeatedly restarting due to disk space issues
Errors writing to disk - Logs indicating write failures or “no space left” errors
Service unavailability - DDCS unable to serve requests due to disk issues
Write stalls - RocksDB stalling writes due to insufficient disk space
Cache eviction - Cache eviction occurring due to disk pressure
Log Messages#
Disk Space Errors#
Where to find these logs:
Pod: ddcs-*
Location: DDCS Pod
Application: DDCS/RocksDB
Description: Logs indicating disk space exhaustion or write failures
# LEVEL: Error
# SOURCE: DDCS/RocksDB
kubernetes.pod_name: ddcs-* and
message: "*no space left on device*" or
"*No space left*" or
"*ENOSPC*" or
"*disk full*" or
"*failed to write*" or
"*cannot allocate memory*" or
"*storage limit exceeded*"
Metric Signals#
The following Prometheus metrics can be used to detect disk space exhaustion before it causes service failures. Monitor these metrics proactively to identify capacity issues early.
Volume Capacity Metrics#
kubelet_volume_stats_available_bytes{persistentvolumeclaim="ddcs-*", namespace="ddcs"}
kubelet_volume_stats_capacity_bytes{persistentvolumeclaim="ddcs-*", namespace="ddcs"}
kubelet_volume_stats_used_bytes{persistentvolumeclaim="ddcs-*", namespace="ddcs"}
Persistent volume utilization is tracked through kubelet metrics that
monitor available, used, and total capacity. The
kubelet_volume_stats_available_bytes metric reports remaining disk
space, where low values indicate approaching exhaustion. Total volume
size is provided by kubelet_volume_stats_capacity_bytes, which
serves as the baseline for calculating usage percentage. Current disk
consumption is tracked in kubelet_volume_stats_used_bytes, where
high values relative to capacity indicate the volume is approaching its
limits. Alert when usage exceeds 80% of capacity to allow time for
remediation before exhaustion.
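As a sketch, the usage percentage can be computed with a single PromQL expression, queried here through the Prometheus HTTP API (the Prometheus host below is a placeholder):
# PVC usage percentage for DDCS volumes; alert when the result exceeds 80
curl -s 'http://<prometheus-host>:9090/api/v1/query' --data-urlencode \
  'query=100 * kubelet_volume_stats_used_bytes{namespace="ddcs", persistentvolumeclaim=~"ddcs-.*"} / kubelet_volume_stats_capacity_bytes{namespace="ddcs", persistentvolumeclaim=~"ddcs-.*"}'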
RocksDB Write Stalls from Disk Pressure#
ddcs_rocksdb_intrinsic_gauge{name="..."}
Disk space constraints can trigger RocksDB write stalls, tracked through
metrics with labels like __rocksdb_stalls_total_stops and
__rocksdb_stalls_total_delays. These stalls indicate RocksDB is
throttling writes, potentially due to insufficient disk space for
compaction or new writes. Stalls related to pending compaction bytes
(tracked in __rocksdb_stalls_stops_pending_compaction_bytes and
__rocksdb_stalls_delays_pending_compaction_bytes) are particularly
relevant, as they indicate compaction backlog due to insufficient disk
space preventing cleanup of stale data.
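A quick way to check for stall activity is to scrape the DDCS metrics endpoint directly, using the same port and service conventions shown later in this document:
# Expose the DDCS metrics endpoint and look for stall counters
kubectl port-forward -n ddcs svc/<ddcs-service-name> 3051:3051 &
sleep 2   # give the port-forward a moment to establish
curl -s http://localhost:3051/metrics | grep -i "rocksdb_stalls"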
Database and I/O Performance Metrics#
ddcs_m_adapter_error_total{kind="..."}
ddcs_m_adapter_io_histogram_bucket{io_kind="rocksdb_write"}
Database errors tracked in ddcs_m_adapter_error_total with kinds
like rocks_timed_out, rocks_busy, and rocks_try_again may
indicate the engine cannot write due to disk space constraints. Sharp
increases in these error types correlate with disk exhaustion scenarios.
Write I/O latency captured in ddcs_m_adapter_io_histogram_bucket
with io_kind="rocksdb_write" may show elevated values during disk
space pressure, as writes slow down or stall when the disk approaches
capacity.
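For example, a sketch of the 99th-percentile RocksDB write latency, assuming the bucket metric and labels shown above:
# p99 RocksDB write latency from the adapter I/O histogram
curl -s 'http://<prometheus-host>:9090/api/v1/query' --data-urlencode \
  'query=histogram_quantile(0.99, sum by (le) (rate(ddcs_m_adapter_io_histogram_bucket{io_kind="rocksdb_write"}[5m])))'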
Cache Performance Impact#
ddcs_m_adapter_entry_cache_total{operation="miss"}
ddcs_m_adapter_get_total{level="rocksdb_disk_seek"}
Disk space pressure can force cache eviction, manifesting in cache
performance metrics. The ddcs_m_adapter_entry_cache_total metric
with operation="miss" may show elevated cache misses as disk space
constraints force removal of cached data. Similarly,
ddcs_m_adapter_get_total with level="rocksdb_disk_seek" may
increase relative to in-memory gets, indicating that eviction under
disk pressure is forcing reads from disk rather than from cache.
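A sketch for watching both signals side by side, using the metric and label names documented above:
# Cache miss rate and disk-seek rate; rising trends suggest disk-pressure eviction
curl -s 'http://<prometheus-host>:9090/api/v1/query' --data-urlencode \
  'query=rate(ddcs_m_adapter_entry_cache_total{operation="miss"}[5m])'
curl -s 'http://<prometheus-host>:9090/api/v1/query' --data-urlencode \
  'query=rate(ddcs_m_adapter_get_total{level="rocksdb_disk_seek"}[5m])'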
Root Cause Analysis#
Known Causes#
Disk space exhaustion in DDCS is typically caused by insufficient persistent volume allocation, data growth outpacing cleanup mechanisms, or misconfigured retention policies.
Insufficient Persistent Volume Allocation#
PVCs may be undersized for the workload data growth. DDCS stores derived
data that can be many times the size of source content. If PVCs are not
sized appropriately, they will fill to capacity quickly. The
storageLimit configuration should be set to 7-10% less than the PVC
size to allow for overhead, but if the PVC itself is too small, even
proper configuration cannot prevent exhaustion.
Check PVC sizes and usage:
# Check PVC capacity and usage
kubectl get pvc -n ddcs
kubectl describe pvc -n ddcs
# Check actual disk usage within pods
kubectl exec -n ddcs <ddcs-pod-name> -- df -h
# Check storageLimit configuration
helm get values <ddcs-release-name> -n ddcs | grep storageLimit
# Compare PVC size to storageLimit
# storageLimit should be 7-10% less than PVC size
Data Growth Outpacing Cleanup#
Data growth may outpace garbage collection and cleanup mechanisms. If garbage collection is not running frequently enough, is misconfigured, or cannot free sufficient space, disk usage will continue to grow until exhaustion occurs. RocksDB compaction may also be unable to keep up with write volume, causing accumulation of stale SST files.
Check garbage collection and compaction:
# Check garbage collection configuration
helm get values <ddcs-release-name> -n ddcs -o yaml | grep -A 10 "garbageCollection"
# Check if GC is running
kubectl logs -n ddcs <ddcs-pod-name> | grep -i "garbage\|gc\|cleanup"
# Review RocksDB compaction settings
helm get values <ddcs-release-name> -n ddcs | grep -A 5 "periodic_compaction"
# Monitor disk usage trends
kubectl exec -n ddcs <ddcs-pod-name> -- df -h
Other Possible Causes#
Storage Limit Misconfiguration
storageLimit set too high relative to PVC size (should be 7-10% less)
storageLimit not accounting for WAL and compaction overhead
Multiple DDCS pods sharing storage resources without proper limits
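A minimal sketch for correcting the limit in place, assuming the values path shown later in this document (cluster.container.settings.storageLimit); verify the exact key and value format against your chart version:
# Lower storageLimit to 7-10% below the PVC size (value format is chart-dependent)
helm upgrade <ddcs-release-name> omniverse/ddcs -n ddcs --reuse-values \
  --set cluster.container.settings.storageLimit=<value>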
RocksDB Compaction Issues
Compaction not running efficiently due to disk space constraints
Stale SST files not being removed due to insufficient space
Compaction backlog causing storage growth faster than cleanup
Pending compaction bytes exceeding limits due to slow disk I/O
Garbage Collection Configuration Issues
garbageCollection.minFreeCapacity set too high, preventing GC from running when needed
garbageCollection.deleteKeyspaceQuantile set too low, not freeing enough space
garbageCollection.checkDbCapacityMs interval too long, delaying GC triggers
GC not running due to configuration errors or service issues
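If these thresholds need adjusting, a hedged sketch follows; the values path under the chart is an assumption (combining the settings path and garbageCollection key used elsewhere in this document), so confirm it with helm get values before applying:
# Inspect current GC settings, then adjust thresholds as needed
helm get values <ddcs-release-name> -n ddcs -o yaml | grep -A 10 "garbageCollection"
helm upgrade <ddcs-release-name> omniverse/ddcs -n ddcs --reuse-values \
  --set cluster.container.settings.garbageCollection.minFreeCapacity=<value>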
Node-Level Storage Issues
Node running out of disk space affecting all pods
Multiple pods on same node competing for storage
Storage backend performance issues preventing efficient cleanup
Ephemeral storage limits being exceeded
Troubleshooting Steps#
Diagnostic Steps for Known Root Causes#
1. Monitor Disk Usage Metrics and Set Up Alerts#
Monitor disk usage metrics to detect exhaustion before it causes service failures.
# Check PVC capacity and usage
kubectl get pvc -n ddcs -o wide
# Get detailed PVC information
kubectl describe pvc -n ddcs <pvc-name>
# Check disk usage from within pods
kubectl exec -n ddcs <ddcs-pod-name> -- df -h
# Query Prometheus metrics for disk usage
# kubelet_volume_stats_available_bytes{persistentvolumeclaim="<pvc-name>", namespace="ddcs"}
# kubelet_volume_stats_capacity_bytes{persistentvolumeclaim="<pvc-name>", namespace="ddcs"}
# kubelet_volume_stats_used_bytes{persistentvolumeclaim="<pvc-name>", namespace="ddcs"}
# Calculate usage percentage
# usage_percent = (used_bytes / capacity_bytes) * 100
# Check DDCS storage usage metrics
kubectl port-forward -n ddcs svc/<ddcs-service-name> 3051:3051
curl http://localhost:3051/metrics | grep -i "storage\|disk\|size"
Analysis:
PVCs showing high usage (>90%) indicate capacity exhaustion risk.
Compare kubelet_volume_stats_used_bytes against kubelet_volume_stats_capacity_bytes.
DDCS storage metrics approaching storageLimit indicate exhaustion.
Rapid increases in disk usage indicate data growth outpacing cleanup.
Resolution:
Set up alerts for PVC usage exceeding 80% of capacity.
Monitor DDCS storage metrics and alert when usage approaches storageLimit.
Track disk usage trends to predict when capacity increases are needed.
Expand PVCs if the storage class supports volume expansion (see step 2).
2. Expand Persistent Volume Claims (PVCs) as Needed#
If disk space is exhausted or approaching limits, expand PVCs to provide additional capacity.
# Check current PVC sizes
kubectl get pvc -n ddcs
# Check if storage class supports volume expansion
kubectl get storageclass <storage-class-name> -o yaml | grep allowVolumeExpansion
# If expansion is supported, edit PVC to request larger size
# Note: Some storage classes require manual PVC edit, others support dynamic expansion
kubectl edit pvc <pvc-name> -n ddcs
# Change spec.resources.requests.storage to larger value
# After expansion, update storageLimit in Helm values to match
# storageLimit should be 7-10% less than PVC size
helm get values <ddcs-release-name> -n ddcs -o yaml > current-values.yaml
# Edit: cluster.container.settings.storageLimit
# Example: If PVC is 100GB, storageLimit should be 90-93GB
# Apply updated values
helm upgrade <ddcs-release-name> omniverse/ddcs -n ddcs -f current-values.yaml
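As a non-interactive alternative to kubectl edit, the same expansion can be requested with a patch (the 200Gi figure below is an example target size):
# Request a larger volume without opening an editor
kubectl patch pvc <pvc-name> -n ddcs \
  -p '{"spec":{"resources":{"requests":{"storage":"200Gi"}}}}'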
Analysis:
PVCs at or near capacity require expansion.
Storage classes with allowVolumeExpansion: true support dynamic expansion.
After PVC expansion, storageLimit must be updated to match (7-10% less than PVC size).
Resolution:
Expand PVCs if storage class supports volume expansion.
For storage classes without expansion support, create new larger PVCs and migrate data.
Update storageLimit in Helm values after expansion (should be 7-10% less than PVC size).
Apply updated values: helm upgrade <ddcs-release-name> omniverse/ddcs -n ddcs -f values.yaml
Monitor disk usage after expansion to ensure adequate capacity.
Other Diagnostic Actions#
Check node-level storage: Verify nodes have sufficient storage capacity:
kubectl describe nodes | grep -A 5 "Allocated resources"
kubectl top nodes
# Check for node-level disk pressure
kubectl describe nodes | grep -i "disk\|storage\|pressure"
Review storage class configuration: Verify storage class provides adequate capacity:
kubectl get storageclass
kubectl describe storageclass <storage-class-name>
# Check volume expansion capabilities
kubectl get storageclass <storage-class-name> -o yaml | grep allowVolumeExpansion
Prevention#
Proactive Monitoring#
Set up alerts for:
PVC capacity thresholds: Alert when PVC usage exceeds 80% of capacity
Storage limit thresholds: Alert when DDCS storage usage approaches storageLimit
Garbage collection failures: Alert when GC is not running or failing
Rapid disk growth: Alert on rapid increases in disk usage indicating potential issues
Node storage pressure: Alert when node-level storage is approaching limits
RocksDB write stalls: Alert when write stall metrics indicate disk space pressure
Cache eviction rates: Alert when cache miss rates increase due to disk pressure
Configuration Best Practices#
Size PVCs appropriately: Estimate storage requirements based on workload patterns and size PVCs with headroom (account for derived data being many times source content size)
Configure storage limits: Set storageLimit to 7-10% less than PVC size to allow for overhead, WAL files, and compaction
Enable garbage collection: Ensure GC is configured and running with appropriate thresholds (minFreeCapacity, deleteKeyspaceQuantile, checkDbCapacityMs)
Configure periodic compaction: Enable RocksDB periodic compaction to clean up stale SST files
Monitor disk usage trends: Track disk usage over time to predict when capacity increases are needed
Plan for data growth: Account for data growth as workloads scale and scenes become more complex
Use volume expansion: Prefer storage classes that support volume expansion for easier capacity management
Capacity Planning#
Estimate storage requirements: Calculate storage needs based on scene sizes, derived data multipliers (often 3-10x source content), and retention requirements
Plan for storage growth: Account for storage growth as workloads scale and new assets are introduced
Monitor storage trends: Track storage usage trends over time to predict when capacity increases are needed
Test retention policies: Validate GC and retention policies under expected production load
Review workload patterns: Analyze workload patterns to understand data growth rates and adjust capacity planning accordingly
Account for multiple pods: If running multiple DDCS pods, ensure total storage capacity accounts for all pods and their data growth