DDCS: Disk Space Exhaustion#
Overview#
DDCS stores derived data on persistent volumes backed by Persistent Volume Claims (PVCs). When the volume attached to DDCS is too small or fills too quickly, pods may crash, enter a crash loop, or report errors writing to disk. DDCS requires persistent storage to maintain derived data, and disk space exhaustion prevents the service from operating correctly.
The storage limit is controlled by the storageLimit configuration
value, which should be set 7-10% below the allocated PVC size to leave
headroom for compaction and WAL (Write-Ahead Log) files. When disk
space is exhausted, RocksDB cannot write new data, flush write buffers,
or perform compaction, causing write stalls, cache eviction, and
service failures.
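As a back-of-envelope check, the relationship between source content size, PVC size, and storageLimit can be sketched in shell. The input numbers below are illustrative, drawn from the sizing guidance elsewhere in this document (3-10x derived data multiplier, storageLimit 7-10% below PVC size):
# Illustrative sizing sketch -- adjust the inputs to your workload
SOURCE_GB=20        # estimated source content size
MULTIPLIER=5        # derived data is often 3-10x source content
DERIVED_GB=$(( SOURCE_GB * MULTIPLIER ))
PVC_GB=$(( DERIVED_GB * 110 / 100 ))   # size the PVC with headroom above expected data
LIMIT_GB=$(( PVC_GB * 92 / 100 ))      # storageLimit: roughly 7-10% below the PVC size
echo "derived=${DERIVED_GB}GB pvc=${PVC_GB}GB storageLimit=${LIMIT_GB}GB"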
Disk space exhaustion can occur when:
Persistent volume allocation is insufficient for workload data growth
Data growth outpaces garbage collection and cleanup mechanisms
Retention policies are misconfigured or not functioning
Garbage collection is not running or is insufficient
Multiple DDCS pods compete for limited storage resources
RocksDB compaction backlog consumes excessive disk space
WAL files accumulate without cleanup
When disk space is exhausted, DDCS pods may fail to start, crash during operation, or report errors when attempting to write data, resulting in service unavailability and degraded rendering performance.
Symptoms and Detection Signals#
Visible Symptoms#
DDCS pods crash - Pods failing with disk-related errors
Crash loop backoff - Pods repeatedly restarting due to disk space issues
Errors writing to disk - Logs indicating write failures or “no space left” errors
Service unavailability - DDCS unable to serve requests due to disk issues
Write stalls - RocksDB stalling writes due to insufficient disk space
Cache eviction - Cache eviction occurring due to disk pressure
Log Messages#
Disk Space Errors#
Where to find these logs:
Pod: ddcs-*
Location: DDCS Pod
Application: DDCS/RocksDB
Description: Logs indicating disk space exhaustion or write failures
# LEVEL: Error
# SOURCE: DDCS/RocksDB
kubernetes.pod_name: ddcs-* and
message: "*no space left on device*" or
"*No space left*" or
"*ENOSPC*" or
"*disk full*" or
"*failed to write*" or
"*cannot allocate memory*" or
"*storage limit exceeded*"
Metric Signals#
The following Prometheus metrics can be used to detect disk space exhaustion before it causes service failures. Monitor these metrics proactively to identify capacity issues early.
Volume Capacity Metrics#
kubelet_volume_stats_available_bytes{persistentvolumeclaim="ddcs-*", namespace="ddcs"}
kubelet_volume_stats_capacity_bytes{persistentvolumeclaim="ddcs-*", namespace="ddcs"}
kubelet_volume_stats_used_bytes{persistentvolumeclaim="ddcs-*", namespace="ddcs"}
Persistent volume utilization is tracked through kubelet metrics that
monitor available, used, and total capacity. The
kubelet_volume_stats_available_bytes metric reports remaining disk
space, where low values indicate approaching exhaustion. Total volume
size is provided by kubelet_volume_stats_capacity_bytes, which
serves as the baseline for calculating usage percentage. Current disk
consumption is tracked in kubelet_volume_stats_used_bytes, where
high values relative to capacity indicate the volume is approaching its
limits. Alert when usage exceeds 80% of capacity to allow time for
remediation before exhaustion.
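As a sketch, the usage percentage can be computed with a single PromQL expression, queried here through the Prometheus HTTP API (the Prometheus host below is a placeholder):
# PVC usage percentage for DDCS volumes; alert when the result exceeds 80
curl -s 'http://<prometheus-host>:9090/api/v1/query' --data-urlencode \
  'query=100 * kubelet_volume_stats_used_bytes{namespace="ddcs", persistentvolumeclaim=~"ddcs-.*"} / kubelet_volume_stats_capacity_bytes{namespace="ddcs", persistentvolumeclaim=~"ddcs-.*"}'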
RocksDB Write Stalls from Disk Pressure#
ddcs_rocksdb_intrinsic_gauge{name="..."}
Disk space constraints can trigger RocksDB write stalls, tracked through
metrics with labels like __rocksdb_stalls_total_stops and
__rocksdb_stalls_total_delays. These stalls indicate RocksDB is
throttling writes, potentially due to insufficient disk space for
compaction or new writes. Stalls related to pending compaction bytes
(tracked in __rocksdb_stalls_stops_pending_compaction_bytes and
__rocksdb_stalls_delays_pending_compaction_bytes) are particularly
relevant, as they indicate compaction backlog due to insufficient disk
space preventing cleanup of stale data.
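A quick way to check for stall activity is to scrape the DDCS metrics endpoint directly, using the same port and service conventions shown later in this document:
# Expose the DDCS metrics endpoint and look for stall counters
kubectl port-forward -n ddcs svc/<ddcs-service-name> 3051:3051 &
sleep 2   # give the port-forward a moment to establish
curl -s http://localhost:3051/metrics | grep -i "rocksdb_stalls"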
Database and I/O Performance Metrics#
ddcs_m_adapter_error_total{kind="..."}
ddcs_m_adapter_io_histogram_bucket{io_kind="rocksdb_write"}
Database errors tracked in ddcs_m_adapter_error_total with kinds
like rocks_timed_out, rocks_busy, and rocks_try_again may
indicate the engine cannot write due to disk space constraints. Sharp
increases in these error types correlate with disk exhaustion scenarios.
Write I/O latency captured in ddcs_m_adapter_io_histogram_bucket
with io_kind="rocksdb_write" may show elevated values during disk
space pressure, as writes slow down or stall when the disk approaches
capacity.
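For example, a sketch of the 99th-percentile RocksDB write latency, assuming the bucket metric and labels shown above:
# p99 RocksDB write latency from the adapter I/O histogram
curl -s 'http://<prometheus-host>:9090/api/v1/query' --data-urlencode \
  'query=histogram_quantile(0.99, sum by (le) (rate(ddcs_m_adapter_io_histogram_bucket{io_kind="rocksdb_write"}[5m])))'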
Cache Performance Impact#
ddcs_m_adapter_entry_cache_total{operation="miss"}
ddcs_m_adapter_get_total{level="rocksdb_disk_seek"}
Disk space pressure can force cache eviction, manifesting in cache
performance metrics. The ddcs_m_adapter_entry_cache_total metric
with operation="miss" may show elevated cache misses as disk space
constraints force removal of cached data. Similarly,
ddcs_m_adapter_get_total with level="rocksdb_disk_seek" may
increase relative to in-memory gets, indicating that eviction under
disk pressure is forcing reads from disk rather than from cache.
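A sketch for watching both signals side by side, using the metric and label names documented above:
# Cache miss rate and disk-seek rate; rising trends suggest disk-pressure eviction
curl -s 'http://<prometheus-host>:9090/api/v1/query' --data-urlencode \
  'query=rate(ddcs_m_adapter_entry_cache_total{operation="miss"}[5m])'
curl -s 'http://<prometheus-host>:9090/api/v1/query' --data-urlencode \
  'query=rate(ddcs_m_adapter_get_total{level="rocksdb_disk_seek"}[5m])'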
Root Cause Analysis#
Known Causes#
Disk space exhaustion in DDCS is typically caused by insufficient persistent volume allocation, data growth outpacing cleanup mechanisms, or misconfigured retention policies.
Insufficient Persistent Volume Allocation#
PVCs may be undersized for the workload data growth. DDCS stores derived
data that can be many times the size of source content. If PVCs are not
sized appropriately, they will fill to capacity quickly. The
storageLimit configuration should be set to 7-10% less than the PVC
size to allow for overhead, but if the PVC itself is too small, even
proper configuration cannot prevent exhaustion.
Check PVC sizes and usage:
# Check PVC capacity and usage
kubectl get pvc -n ddcs
kubectl describe pvc -n ddcs
# Check actual disk usage within pods
kubectl exec -n ddcs <ddcs-pod-name> -- df -h
# Check storageLimit configuration
helm get values <ddcs-release-name> -n ddcs | grep storageLimit
# Compare PVC size to storageLimit
# storageLimit should be 7-10% less than PVC size
Data Growth Outpacing Cleanup#
Data growth may outpace garbage collection and cleanup mechanisms. If garbage collection is not running frequently enough, is misconfigured, or cannot free sufficient space, disk usage will continue to grow until exhaustion occurs. RocksDB compaction may also be unable to keep up with write volume, causing accumulation of stale SST files.
Check garbage collection and compaction:
# Check garbage collection configuration
helm get values <ddcs-release-name> -n ddcs -o yaml | grep -A 10 "garbageCollection"
# Check if GC is running
kubectl logs -n ddcs <ddcs-pod-name> | grep -i "garbage\|gc\|cleanup"
# Review RocksDB compaction settings
helm get values <ddcs-release-name> -n ddcs | grep -A 5 "periodic_compaction"
# Monitor disk usage trends
kubectl exec -n ddcs <ddcs-pod-name> -- df -h
Other Possible Causes#
Storage Limit Misconfiguration
storageLimit set too high relative to PVC size (should be 7-10% less)
storageLimit not accounting for WAL and compaction overhead
Multiple DDCS pods sharing storage resources without proper limits
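A minimal sketch for correcting the limit in place, assuming the values path shown later in this document (cluster.container.settings.storageLimit); verify the exact key and value format against your chart version:
# Lower storageLimit to 7-10% below the PVC size (value format is chart-dependent)
helm upgrade <ddcs-release-name> omniverse/ddcs -n ddcs --reuse-values \
  --set cluster.container.settings.storageLimit=<value>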
RocksDB Compaction Issues
Compaction not running efficiently due to disk space constraints
Stale SST files not being removed due to insufficient space
Compaction backlog causing storage growth faster than cleanup
Pending compaction bytes exceeding limits due to slow disk I/O
Garbage Collection Configuration Issues
garbageCollection.minFreeCapacity set too high, preventing GC from running when needed
garbageCollection.deleteKeyspaceQuantile set too low, not freeing enough space
garbageCollection.checkDbCapacityMs interval too long, delaying GC triggers
GC not running due to configuration errors or service issues
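If these thresholds need adjusting, a hedged sketch follows; the values path under the chart is an assumption (combining the settings path and garbageCollection key used elsewhere in this document), so confirm it with helm get values before applying:
# Inspect current GC settings, then adjust thresholds as needed
helm get values <ddcs-release-name> -n ddcs -o yaml | grep -A 10 "garbageCollection"
helm upgrade <ddcs-release-name> omniverse/ddcs -n ddcs --reuse-values \
  --set cluster.container.settings.garbageCollection.minFreeCapacity=<value>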
Node-Level Storage Issues
Node running out of disk space affecting all pods
Multiple pods on same node competing for storage
Storage backend performance issues preventing efficient cleanup
Ephemeral storage limits being exceeded
Troubleshooting Steps#
Diagnostic Steps for Known Root Causes#
1. Monitor Disk Usage Metrics and Set Up Alerts#
Monitor disk usage metrics to detect exhaustion before it causes service failures.
# Check PVC capacity and usage
kubectl get pvc -n ddcs -o wide
# Get detailed PVC information
kubectl describe pvc -n ddcs <pvc-name>
# Check disk usage from within pods
kubectl exec -n ddcs <ddcs-pod-name> -- df -h
# Query Prometheus metrics for disk usage
# kubelet_volume_stats_available_bytes{persistentvolumeclaim="<pvc-name>", namespace="ddcs"}
# kubelet_volume_stats_capacity_bytes{persistentvolumeclaim="<pvc-name>", namespace="ddcs"}
# kubelet_volume_stats_used_bytes{persistentvolumeclaim="<pvc-name>", namespace="ddcs"}
# Calculate usage percentage
# usage_percent = (used_bytes / capacity_bytes) * 100
# Check DDCS storage usage metrics
kubectl port-forward -n ddcs svc/<ddcs-service-name> 3051:3051
curl http://localhost:3051/metrics | grep -i "storage\|disk\|size"
Analysis:
PVCs showing high usage (>90%) indicate capacity exhaustion risk.
Compare kubelet_volume_stats_used_bytes against kubelet_volume_stats_capacity_bytes.
DDCS storage metrics approaching storageLimit indicate exhaustion.
Rapid increases in disk usage indicate data growth outpacing cleanup.
Resolution:
Set up alerts for PVC usage exceeding 80% of capacity.
Monitor DDCS storage metrics and alert when usage approaches storageLimit.
Track disk usage trends to predict when capacity increases are needed.
Expand PVCs if the storage class supports volume expansion (see step 2).
2. Expand Persistent Volume Claims (PVCs) as Needed#
If disk space is exhausted or approaching limits, expand PVCs to provide additional capacity.
# Check current PVC sizes
kubectl get pvc -n ddcs
# Check if storage class supports volume expansion
kubectl get storageclass <storage-class-name> -o yaml | grep allowVolumeExpansion
# If expansion is supported, edit PVC to request larger size
# Note: Some storage classes require manual PVC edit, others support dynamic expansion
kubectl edit pvc <pvc-name> -n ddcs
# Change spec.resources.requests.storage to larger value
# After expansion, update storageLimit in Helm values to match
# storageLimit should be 7-10% less than PVC size
helm get values <ddcs-release-name> -n ddcs -o yaml > current-values.yaml
# Edit: cluster.container.settings.storageLimit
# Example: If PVC is 100GB, storageLimit should be 90-93GB
# Apply updated values
helm upgrade <ddcs-release-name> omniverse/ddcs -n ddcs -f current-values.yaml
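As a non-interactive alternative to kubectl edit, the same expansion can be requested with a patch (the 200Gi figure below is an example target size):
# Request a larger volume without opening an editor
kubectl patch pvc <pvc-name> -n ddcs \
  -p '{"spec":{"resources":{"requests":{"storage":"200Gi"}}}}'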
Analysis:
PVCs at or near capacity require expansion.
Storage classes with allowVolumeExpansion: true support dynamic expansion.
After PVC expansion, storageLimit must be updated to match (7-10% less than PVC size).
Resolution:
Expand PVCs if storage class supports volume expansion.
For storage classes without expansion support, create new larger PVCs and migrate data.
Update storageLimit in Helm values after expansion (should be 7-10% less than PVC size).
Apply updated values: helm upgrade <ddcs-release-name> omniverse/ddcs -n ddcs -f values.yaml
Monitor disk usage after expansion to ensure adequate capacity.
Other Diagnostic Actions#
Check node-level storage: Verify nodes have sufficient storage capacity:
kubectl describe nodes | grep -A 5 "Allocated resources"
kubectl top nodes
# Check for node-level disk pressure
kubectl describe nodes | grep -i "disk\|storage\|pressure"
Review storage class configuration: Verify storage class provides adequate capacity:
kubectl get storageclass
kubectl describe storageclass <storage-class-name>
# Check volume expansion capabilities
kubectl get storageclass <storage-class-name> -o yaml | grep allowVolumeExpansion
Prevention#
Proactive Monitoring#
Set up alerts for:
PVC capacity thresholds: Alert when PVC usage exceeds 80% of capacity
Storage limit thresholds: Alert when DDCS storage usage approaches storageLimit
Garbage collection failures: Alert when GC is not running or failing
Rapid disk growth: Alert on rapid increases in disk usage indicating potential issues
Node storage pressure: Alert when node-level storage is approaching limits
RocksDB write stalls: Alert when write stall metrics indicate disk space pressure
Cache eviction rates: Alert when cache miss rates increase due to disk pressure
Configuration Best Practices#
Size PVCs appropriately: Estimate storage requirements based on workload patterns and size PVCs with headroom (account for derived data being many times source content size)
Configure storage limits: Set storageLimit to 7-10% less than PVC size to allow for overhead, WAL files, and compaction
Enable garbage collection: Ensure GC is configured and running with appropriate thresholds (minFreeCapacity, deleteKeyspaceQuantile, checkDbCapacityMs)
Configure periodic compaction: Enable RocksDB periodic compaction to clean up stale SST files
Monitor disk usage trends: Track disk usage over time to predict when capacity increases are needed
Plan for data growth: Account for data growth as workloads scale and scenes become more complex
Use volume expansion: Prefer storage classes that support volume expansion for easier capacity management
Capacity Planning#
Estimate storage requirements: Calculate storage needs based on scene sizes, derived data multipliers (often 3-10x source content), and retention requirements
Plan for storage growth: Account for storage growth as workloads scale and new assets are introduced
Monitor storage trends: Track storage usage trends over time to predict when capacity increases are needed
Test retention policies: Validate GC and retention policies under expected production load
Review workload patterns: Analyze workload patterns to understand data growth rates and adjust capacity planning accordingly
Account for multiple pods: If running multiple DDCS pods, ensure total storage capacity accounts for all pods and their data growth