DDCS: RocksDB Corruption or Failures#
Overview#
When DDCS (Derived Data Cache Service) experiences RocksDB corruption or failures, the service may fail to start, logs may report database corruption, or data loss may be observed.
DDCS uses RocksDB, an embedded persistent key-value store, as its storage backend for derived data. RocksDB provides durability and performance, but when corruption occurs it can no longer read or write data reliably, which prevents the service from operating correctly.
RocksDB corruption can be caused by:
Unclean shutdowns preventing proper database closure
Disk errors or hardware failures affecting persistent volumes
File system corruption on persistent volumes
Insufficient disk space during critical operations
The only remedy for corruption is to delete all PVCs and all pods to force a restart; this results in data loss and requires the cache to be rebuilt.
Symptoms and Detection Signals#
Visible Symptoms#
Service fails to start - DDCS pods unable to start due to database errors
Database corruption errors - Logs indicating RocksDB corruption or invalid data
Data loss - Missing or corrupted derived data in cache
Repeated pod restarts - Pods crashing repeatedly due to database errors
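A quick spot check can confirm these symptoms (a minimal sketch; the label selector matches the one used elsewhere in this runbook, and <ddcs-pod-name> is a placeholder):
# Look for CrashLoopBackOff / Error states and high restart counts
kubectl get pods -n ddcs -l app.kubernetes.io/instance=ddcs
# Scan the previous container instance's logs for corruption-related messages
kubectl logs -n ddcs <ddcs-pod-name> --previous | grep -i "corrupt"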
Log Messages#
Database Open Failures#
Where to find these logs:
Pod: ddcs-*
Location: DDCS Pod
Application: DDCS/RocksDB
Description: Errors indicating RocksDB cannot open the database
# LEVEL: Error/Fatal
# SOURCE: DDCS/RocksDB
kubernetes.pod_name: ddcs-* and
message: "*failed to open rocksdb database*"
Metric Signals#
ddcs_m_adapter_error_total{kind="rocks_io_error"}
RocksDB I/O errors are tracked through the ddcs_m_adapter_error_total metric with kind="rocks_io_error". This metric counts I/O errors reported by RocksDB during database operations. High values or sharp increases indicate disk I/O failures or data corruption that may prevent the database from reading or writing data correctly. Sustained I/O errors typically precede database corruption and service failures.
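To inspect the counter directly, you can scrape the pod's metrics endpoint. This is a sketch only: the metrics port (9090 here), the /metrics path, and the presence of curl in the container image are assumptions that may not match your deployment.
# Read the raw counter from the pod's metrics endpoint
# (port 9090, the /metrics path, and curl in the image are assumptions)
kubectl exec -n ddcs <ddcs-pod-name> -- curl -s localhost:9090/metrics | grep 'ddcs_m_adapter_error_total'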
Root Cause Analysis#
Known Causes#
RocksDB corruption in DDCS is typically caused by unclean shutdowns, disk errors, or software bugs.
Unclean Shutdowns#
Unclean shutdowns occur when DDCS pods are terminated abruptly without allowing RocksDB to flush data and close the database properly. This can happen during node failures, forced pod deletions, or OOM kills. When RocksDB cannot close cleanly, WAL files or SST files may be left in an inconsistent state, causing corruption.
Check for unclean shutdowns:
# Inspect recent events for a specific DDCS pod
kubectl describe pod -n ddcs <ddcs-pod-name> | grep -A 10 "Events:"
# Check pod status, restart counts, and node placement
kubectl get pods -n ddcs -l app.kubernetes.io/instance=ddcs -o wide
# Look for termination reasons: OOMKilled, Evicted, NodeShutdown
kubectl get events -n ddcs --sort-by='.lastTimestamp' | grep -i "evict\|kill\|terminate"
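A compact way to surface the last termination reason for every DDCS pod (a sketch using kubectl's jsonpath output; the field is empty for containers that have not restarted):
# Print each pod's name and its last container termination reason (e.g. OOMKilled)
kubectl get pods -n ddcs -l app.kubernetes.io/instance=ddcs \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[0].lastState.terminated.reason}{"\n"}{end}'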
Disk Errors or Hardware Failures#
Disk errors or hardware failures on persistent volumes can cause data corruption. This may manifest as I/O errors, read/write failures, or corrupted data blocks. Persistent volumes backed by unreliable storage or experiencing hardware issues are more susceptible to corruption.
Check for disk errors:
# Scan pod logs for disk and I/O related errors
kubectl logs -n ddcs <ddcs-pod-name> | grep -i "disk\|io\|error\|fail"
# Check for volume attachment issues or storage backend problems
kubectl describe pvc -n ddcs
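If pod-level checks are inconclusive, kernel logs on the node hosting the pod can reveal underlying disk failures. A sketch, assuming kubectl debug node is available in your cluster; reading the kernel ring buffer may require an elevated profile (e.g. --profile=sysadmin on recent kubectl versions):
# Find the node the affected pod is scheduled on
kubectl get pod -n ddcs <ddcs-pod-name> -o jsonpath='{.spec.nodeName}'
# Inspect kernel logs on that node for I/O errors
kubectl debug node/<node-name> -it --image=busybox -- dmesg | grep -i "i/o error"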
Other Possible Causes#
File System Corruption
File system corruption on persistent volumes
File system errors preventing proper writes
Mount issues causing data inconsistencies
Insufficient Disk Space (a quick check is sketched after this list)
Disk space exhaustion during critical operations
WAL files or SST files corrupted due to space constraints
Compaction failures due to insufficient space
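To rule space exhaustion in or out, check free space on the volume from inside a running pod. A sketch; the /data mount path is an assumption, so substitute the actual RocksDB data directory for your deployment:
# Check free space on the RocksDB volume
# (/data is an assumed mount path; adjust to your deployment)
kubectl exec -n ddcs <ddcs-pod-name> -- df -h /data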
Troubleshooting Steps#
Diagnostic Steps for Known Root Causes#
1. Delete All PVCs and Pods to Force Restart#
The only recovery path for RocksDB corruption is to delete all PVCs and all pods to force a restart. This results in data loss; the cache is rebuilt as workloads run.
# List all DDCS PVCs
kubectl get pvc -n ddcs
# Delete all DDCS PVCs
# WARNING: This will result in complete data loss
# Note: PVCs in active use stay in Terminating state until the pods using
# them are deleted in the next step (kubernetes.io/pvc-protection finalizer)
kubectl delete pvc --all -n ddcs
# Delete all DDCS pods
kubectl delete pods -n ddcs -l app.kubernetes.io/instance=ddcs
# Verify pods are running and healthy
kubectl get pods -n ddcs -l app.kubernetes.io/instance=ddcs
Analysis:
Deleting PVCs removes all corrupted data.
Deleting pods forces recreation with new PVCs.
Cache will be rebuilt as workloads access DDCS.
Resolution:
Delete all PVCs to remove corrupted data (data will be lost).
Delete all pods to force restart with clean storage.
Monitor pod startup to ensure service is healthy (a watch command is sketched below).
Cache will rebuild automatically as workloads run (renders will be slower during rebuild).
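To monitor the recovery (a sketch; the label selector matches the one used above):
# Confirm replacement PVCs were provisioned
kubectl get pvc -n ddcs
# Watch pods until all replicas are Running and Ready without restarts
kubectl get pods -n ddcs -l app.kubernetes.io/instance=ddcs -w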
Prevention#
Proactive Monitoring#
Set up alerts for:
RocksDB I/O errors: Alert when ddcs_m_adapter_error_total{kind=~"rocks_io_error"} increases
Pod termination events: Alert on OOM kills or forced pod terminations
Database corruption errors: Alert on corruption-related log messages
Pod startup failures: Alert when DDCS pods fail to start repeatedly
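To spot-check the I/O error alert condition manually, evaluate the expression against the Prometheus HTTP API. A sketch, assuming Prometheus has been made reachable at localhost:9090 (e.g. via kubectl port-forward):
# Evaluate the 1h increase in RocksDB I/O errors
# (localhost:9090 assumes an active port-forward to your Prometheus instance)
curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=increase(ddcs_m_adapter_error_total{kind=~"rocks_io_error"}[1h])'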