DDCS: Cache Misses and Performance Degradation#
Overview#
DDCS caches derived data that is computationally expensive to generate. When cache misses occur frequently, time is spent regenerating derived content before the desired workload can begin, significantly increasing scene load times and reducing overall system performance.
Cache misses and performance degradation can occur when:
Cache is cold (scene or assets not preloaded)
Cache eviction due to memory or disk pressure
Misconfigured cache size parameters
Insufficient cache capacity for workload patterns
Write buffer pressure causing cache eviction
Symptoms and Detection Signals#
Visible Symptoms#
User-facing slowdowns - Scene loads taking significantly longer than expected
Inconsistent performance - Performance varies dramatically between “warm” and “cold” scene loads
Log Messages#
Write Buffer Pressure#
Where to find these logs:
Pod: ddcs-*
Location: DDCS Pod
Application: DDCS
Description: Logs indicating writes are behind or are failing
# LEVEL: Error
# SOURCE: DDCS/RocksDB
kubernetes.pod_name: ddcs-* and
message: "*timeout while trying to write to DB*" or
"*write failed because the database is busy*" or
"*failed to write*" or
"*the engine is busy and cannot currently process the request*"
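If you have captured pod logs locally (e.g. via `kubectl logs`), the wildcard patterns above map directly onto shell-style globs. The following is a minimal sketch, not part of DDCS itself, that matches those patterns against log lines with Python's standard `fnmatch` module:

```python
from fnmatch import fnmatchcase

# Wildcard patterns from the log query above; the "*...*" syntax is a
# shell-style glob, so fnmatchcase can evaluate it directly.
WRITE_PRESSURE_PATTERNS = [
    "*timeout while trying to write to DB*",
    "*write failed because the database is busy*",
    "*failed to write*",
    "*the engine is busy and cannot currently process the request*",
]

def is_write_pressure_line(message: str) -> bool:
    """Return True if a log message matches any write-pressure pattern."""
    return any(fnmatchcase(message, p) for p in WRITE_PRESSURE_PATTERNS)

# Example: scan captured log lines (e.g. saved from `kubectl logs ddcs-0`).
lines = [
    "INFO scene load complete",
    "ERROR timeout while trying to write to DB after 5000ms",
]
hits = [line for line in lines if is_write_pressure_line(line)]
```

This is only a convenience for offline log files; when a log aggregator is available, the query shown above is the intended path.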
Metric Signals#
The following Prometheus metrics can be used to further determine the cause of the problem. During normal operation these counts may climb. These metrics relate to the write performance of the storage medium. When the engine is writing or wants to write data faster than the medium allows for, scene load times will suffer.
RocksDB Write Stall Metrics#
ddcs_rocksdb_intrinsic_gauge{name="..."}
RocksDB write stalls occur when the database throttles writes due to
resource pressure. These stalls come in two forms: delays (soft stalls
that slow down writes) and stops (hard stalls that completely block
write operations). The total counts are tracked in
__rocksdb_stalls_total_delays and __rocksdb_stalls_total_stops,
which provide an overall view of how frequently RocksDB is throttling
writes due to resource constraints.
Write stalls related to L0 file count limits indicate that compaction
cannot keep up with write volume. When the L0 file count approaches
limits, RocksDB triggers delays tracked by
__rocksdb_stalls_delays_l0_file_count_limit as an early warning. If
the problem worsens and the limit is exceeded, write operations are
completely blocked, counted in
__rocksdb_stalls_stops_l0_file_count_limit.
Pending compaction bytes stalls occur when the compaction backlog grows
faster than it can be processed, typically suggesting disk I/O
bottlenecks. As pending compaction bytes approach limits, delays are
recorded in __rocksdb_stalls_delays_pending_compaction_bytes. When
the limit is exceeded, writes are stopped entirely, tracked by
__rocksdb_stalls_stops_pending_compaction_bytes.
Memtable stalls happen when write buffers fill faster than they can be
flushed to disk. Initial pressure triggers delays captured in
__rocksdb_stalls_delays_memtable_limit, warning that write buffers
are filling up. If the memtable limit is reached, RocksDB blocks all
writes until space is available, counted in
__rocksdb_stalls_stops_memtable_limit.
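When Prometheus is not available, the stall gauges can be read straight off the `/metrics` endpoint. This is a minimal sketch that parses the text exposition format for the stall counters named above; the sample payload is illustrative, in practice you would fetch it from the port-forwarded endpoint:

```python
import re

# Matches exposition lines such as:
# ddcs_rocksdb_intrinsic_gauge{name="__rocksdb_stalls_total_delays"} 42
STALL_RE = re.compile(
    r'ddcs_rocksdb_intrinsic_gauge\{name="(__rocksdb_stalls_[^"]+)"\}\s+([0-9.eE+-]+)'
)

def parse_stalls(metrics_text: str) -> dict:
    """Extract RocksDB stall gauges from Prometheus text exposition."""
    return {name: float(val) for name, val in STALL_RE.findall(metrics_text)}

# Illustrative payload; fetch http://localhost:3051/metrics in practice.
sample = """\
ddcs_rocksdb_intrinsic_gauge{name="__rocksdb_stalls_total_delays"} 42
ddcs_rocksdb_intrinsic_gauge{name="__rocksdb_stalls_total_stops"} 3
ddcs_rocksdb_intrinsic_gauge{name="__rocksdb_stalls_stops_memtable_limit"} 1
"""
stalls = parse_stalls(sample)
# Stops block writes entirely, so treat any nonzero stop count (total
# or per-cause) as urgent.
urgent = {k: v for k, v in stalls.items() if "stops" in k and v > 0}
```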
Database Error Metrics#
ddcs_m_adapter_error_total{kind="..."}
The database engine reports errors through several error types that
indicate write capacity issues. During normal operation, these error
counts will gradually climb as the system handles occasional resource
contention. However, when these errors sharply increase together, it
signals that the database engine cannot keep up with the write load. The
rocks_timed_out error indicates write operations exceeded timeout
thresholds, while rocks_busy shows the engine rejected operations
due to resource saturation. The rocks_try_again error suggests
temporary unavailability requiring retry logic. Monitoring these error
types collectively provides early warning of database performance
degradation.
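Because these errors matter most when they rise together, one way to watch them is to compare counter values between two scrapes. A minimal sketch, assuming counter snapshots keyed by the `kind` label (the threshold is an illustrative placeholder to tune for your environment):

```python
def error_spike(prev: dict, curr: dict, threshold: float = 10.0) -> bool:
    """Flag a collective jump in DB write-capacity errors between scrapes.

    prev/curr map error kind -> value of
    ddcs_m_adapter_error_total{kind=...}. The kinds come from the text
    above; the threshold is an assumed placeholder, not a DDCS default.
    """
    kinds = ("rocks_timed_out", "rocks_busy", "rocks_try_again")
    delta = sum(curr.get(k, 0) - prev.get(k, 0) for k in kinds)
    return delta >= threshold

prev = {"rocks_timed_out": 5, "rocks_busy": 2, "rocks_try_again": 1}
curr = {"rocks_timed_out": 9, "rocks_busy": 8, "rocks_try_again": 3}
# delta = 4 + 6 + 2 = 12, which exceeds the threshold
```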
Cache Performance Metrics#
ddcs_m_adapter_entry_cache_total{operation="..."}
ddcs_m_adapter_get_total{level="..."}
Cache effectiveness is measured through hit and miss ratios tracked by
the ddcs_m_adapter_entry_cache_total metric. When
operation="hit", the metric counts successful cache retrievals,
while operation="miss" tracks requests that required data generation
or disk access. Comparing these values produces the cache hit ratio,
where ratios below 70-80% indicate cache effectiveness issues such as
cold cache conditions, undersizing, or eviction pressure.
Beyond simple hit/miss counts, the cache location matters significantly
for performance. The ddcs_m_adapter_get_total metric breaks down
retrievals by storage level. Requests served from level="in_memory"
or level="rocksdb_in_memory" represent the ideal case requiring no
disk I/O and providing the fastest response times. In contrast,
level="rocksdb_disk_seek" indicates cache misses that forced disk
reads, significantly degrading performance. High disk seek counts
relative to in-memory gets suggest cache eviction or cold cache
conditions forcing expensive disk operations.
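The level breakdown can be condensed into a single "disk seek fraction". A minimal sketch, assuming get counts keyed by the `level` label values named above:

```python
def disk_seek_fraction(gets_by_level: dict) -> float:
    """Fraction of gets served from disk, per ddcs_m_adapter_get_total.

    in_memory and rocksdb_in_memory are the fast paths;
    rocksdb_disk_seek is the slow path that signals eviction or a
    cold cache.
    """
    fast = (gets_by_level.get("in_memory", 0)
            + gets_by_level.get("rocksdb_in_memory", 0))
    slow = gets_by_level.get("rocksdb_disk_seek", 0)
    total = fast + slow
    return slow / total if total else 0.0

gets = {"in_memory": 700, "rocksdb_in_memory": 200, "rocksdb_disk_seek": 100}
# 100 of 1000 gets forced a disk read -> fraction 0.1
```

A rising fraction across warm loads is the quantitative form of the "high disk seek counts relative to in-memory gets" signal described above.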
I/O Performance Metrics#
ddcs_m_adapter_io_histogram_bucket{io_kind="..."}
ddcs_m_adapter_bytes_returned_total{}
Disk I/O performance is tracked through histograms that capture
operation latency. The ddcs_m_adapter_io_histogram_bucket metric
measures time spent on storage operations, with
io_kind="rocksdb_read" tracking read latency and
io_kind="rocksdb_write" measuring write latency. High read latency
values suggest disk I/O bottlenecks or cache misses forcing disk access,
while elevated write latencies indicate write buffer pressure or storage
performance issues causing overall performance degradation.
The volume of data movement provides additional insight into cache
effectiveness through ddcs_m_adapter_bytes_returned_total, which
tracks total bytes returned across all storage levels. When this metric
shows high values relative to in-memory cache hits, it indicates
increased disk I/O activity, suggesting cache eviction or cold cache
conditions are forcing the system to read more data from disk rather
than serving it from fast in-memory caches.
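In PromQL you would feed these buckets to `histogram_quantile`; offline, the same idea can be approximated from cumulative bucket counts. A minimal sketch, with illustrative bucket data standing in for a real scrape of `ddcs_m_adapter_io_histogram_bucket`:

```python
def approx_quantile(buckets: list, q: float) -> float:
    """Approximate a latency quantile from cumulative histogram buckets.

    buckets: (upper_bound_seconds, cumulative_count) pairs sorted by
    bound, as exported in Prometheus histogram form. Returns the first
    bucket bound at/above the target rank -- a coarse estimate of what
    PromQL's histogram_quantile would interpolate.
    """
    total = buckets[-1][1]
    rank = q * total
    for bound, cumulative in buckets:
        if cumulative >= rank:
            return bound
    return buckets[-1][0]

# Illustrative io_kind="rocksdb_read" buckets: 50 ops under 1ms,
# 90 under 10ms, 99 under 100ms, all 100 under 1s.
read_buckets = [(0.001, 50), (0.01, 90), (0.1, 99), (1.0, 100)]
# p99 lands in the <=0.1s bucket
```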
Root Cause Analysis#
Possible Causes#
Cold Cache (Scene or Assets Not Preloaded)#
A cold cache occurs when scenes or assets have not been preloaded into DDCS. During initial scene loads, GPUs encounter assets for the first time and must generate derived content synchronously, which is then written to DDCS. This synchronous generation and writing process significantly increases scene load times compared to warm cache scenarios where derived data is already available.
Cold cache conditions are expected during initial scene loads or when new assets are introduced. However, if cache miss ratios remain consistently high across multiple scene loads or after cache warm-up procedures, it may indicate that the cache is not sized correctly for the workload or that cache eviction is occurring too frequently.
Cache Eviction Due to Memory/Disk Pressure#
Cache eviction occurs when memory or disk pressure forces DDCS to remove cached data to make room for new writes. Write buffers evict cached key-value pairs when full, and RocksDB may evict data from caches when memory pressure occurs.
Cache size parameters (sys.cache_size for row cache,
sys.block_cache_size for block cache) may be misconfigured for the
workload. Caches that are too small cause frequent eviction and cache misses,
while oversized caches may cause memory pressure and OOM kills.
Storage Medium has Insufficient Performance#
Insufficient storage performance occurs when the underlying persistent volume cannot keep up with DDCS write and read operations. When storage IOPS or throughput is insufficient, RocksDB cannot flush write buffers or perform compaction fast enough, causing write stalls and cache eviction. This manifests as high stall metrics, slow I/O histograms, and increased disk seek operations.
Storage performance requirements depend on your installation environment and workload. The storage class and persistent volume configuration must provide sufficient IOPS and throughput for the attached volumes. Consult your cloud service provider (CSP) documentation for storage class performance characteristics and scaling options.
Other Possible Causes#
Insufficient Cache Capacity
Total cache size insufficient for workload data set
Cache not sized for peak scene sizes
Multiple concurrent scenes exceeding cache capacity
Write Buffer Configuration Issues
Too few write buffers (cf.max_write_buffer_number) causing premature eviction
Write buffer size too small for write bursts
Write buffer pressure causing cache eviction
Garbage Collection Aggressiveness
Garbage collection removing cached data too aggressively
garbageCollection.deleteKeyspaceQuantile set too high
garbageCollection.minFreeCapacity threshold too high
Workload Pattern Changes
New scenes or assets not fitting existing cache patterns
Increased scene complexity requiring more cache space
Concurrent workload increases exceeding cache capacity
Troubleshooting Steps#
Diagnostic Steps#
1. Check Cache Hit/Miss Ratios#
Monitor cache hit and miss metrics to determine if cache is cold or experiencing eviction.
# Query metrics:
# - ddcs_m_adapter_entry_cache_total{operation="hit"}
# - ddcs_m_adapter_entry_cache_total{operation="miss"}
Analysis:
High miss rates relative to hits indicate cold cache or cache eviction.
Hit ratios below 70-80% suggest cache effectiveness issues.
Compare hit rates between initial scene loads (cold) and subsequent loads (warm).
Consistently high miss ratios after warm-up indicate cache sizing or eviction problems.
Resolution:
If cache is cold, implement warm-up procedures for common scenes.
If miss ratios remain high after warm-up, investigate cache sizing (see Cache Eviction section).
Monitor ddcs_m_adapter_get_total{level="in_memory|rocksdb_in_memory"} vs ddcs_m_adapter_get_total{level="rocksdb_disk_seek"} to quantify cache effectiveness.
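The hit-ratio check above reduces to a small calculation. A minimal sketch, assuming hit/miss counter values taken from `ddcs_m_adapter_entry_cache_total` (the 70% floor mirrors the guidance above and should be tuned per workload):

```python
def cache_hit_ratio(hits: float, misses: float) -> float:
    """Hit ratio from ddcs_m_adapter_entry_cache_total hit/miss counters."""
    total = hits + misses
    return hits / total if total else 0.0

def cache_unhealthy(hits: float, misses: float, floor: float = 0.70) -> bool:
    """True when the hit ratio falls below the alerting floor."""
    return cache_hit_ratio(hits, misses) < floor

# Warm load: 900 hits / 100 misses -> ratio 0.9, healthy.
# Cold load: 200 hits / 800 misses -> ratio 0.2, investigate warm-up
# procedures or cache sizing.
```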
2. Check Cache Hit/Miss and Disk Seek Metrics#
Monitor cache effectiveness metrics to identify eviction patterns.
# Query metrics:
# - ddcs_m_adapter_entry_cache_total{operation="miss"} vs {operation="hit"}
# - ddcs_m_adapter_get_total{level="in_memory|rocksdb_in_memory"} vs {level="rocksdb_disk_seek"}
Analysis:
High miss rates relative to hits indicate cache eviction.
High rocksdb_disk_seek relative to in_memory|rocksdb_in_memory indicates eviction forcing disk reads.
Increasing miss rates over time suggest cache capacity issues.
Resolution:
If eviction is occurring, check cache size configuration (see step 4).
Monitor ddcs_m_adapter_bytes_returned_total to track data volume requiring disk access.
3. Monitor RocksDB Stall Metrics#
Check RocksDB stall metrics to identify storage performance bottlenecks.
# Access DDCS metrics endpoint
kubectl port-forward -n ddcs svc/<ddcs-service-name> 3051:3051
curl http://localhost:3051/metrics | grep "__rocksdb_stalls"
# Query stall metrics:
# - ddcs_rocksdb_intrinsic_gauge{name=~"__rocksdb_stalls_total_(stops|delays)"}
# - ddcs_rocksdb_intrinsic_gauge{name=~"__rocksdb_stalls_(stops|delays)_l0_file_count_limit"}
# - ddcs_rocksdb_intrinsic_gauge{name=~"__rocksdb_stalls_(stops|delays)_pending_compaction_bytes"}
# - ddcs_rocksdb_intrinsic_gauge{name=~"__rocksdb_stalls_(stops|delays)_memtable_limit"}
Analysis:
High stall counts indicate RocksDB throttling writes due to storage performance limits.
L0 file count limit stalls suggest compaction cannot keep up with write volume.
Pending compaction bytes stalls indicate compaction backlog due to slow disk I/O.
Memtable limit stalls indicate write buffers cannot flush fast enough.
Resolution:
If stalls are high, investigate storage class IOPS and throughput capabilities.
Review ddcs_m_adapter_io_histogram_bucket{io_kind="rocksdb_write"} for slow write operations.
Consult CSP documentation to upgrade storage class or increase volume performance.
4. Check I/O Performance Metrics#
Monitor I/O histogram metrics to quantify storage performance issues.
# Query I/O metrics:
# - ddcs_m_adapter_io_histogram_bucket{io_kind="rocksdb_read"}
# - ddcs_m_adapter_io_histogram_bucket{io_kind="rocksdb_write"}
Analysis:
High read I/O latency indicates slow disk reads, suggesting cache misses requiring disk access.
High write I/O latency indicates slow disk writes, suggesting write buffer pressure.
Compare I/O latencies against storage class performance specifications.
Resolution:
If I/O latencies are high, upgrade storage class or increase volume IOPS/throughput.
Monitor ddcs_m_adapter_error_total{kind=~"rocks_timed_out|rocks_busy|rocks_try_again"} for storage-related errors.
Review storage class configuration and consider higher performance tiers.
5. Review Storage Class and Volume Configuration#
Verify storage class provides sufficient IOPS and throughput for the workload.
# Check PVC storage class
kubectl get pvc -n ddcs
kubectl describe pvc -n ddcs <pvc-name>
# Check storage class configuration
kubectl get storageclass
kubectl describe storageclass <storage-class-name>
Analysis:
Storage class performance characteristics determine available IOPS and throughput.
Insufficient storage performance causes RocksDB stalls and cache eviction.
Compare storage class specs against workload requirements.
Resolution:
Upgrade to storage class with higher IOPS/throughput if performance is insufficient.
Consider provisioned IOPS volumes for consistent performance.
Monitor stall metrics after storage changes to validate improvements.
Other Diagnostic Actions#
Monitor write buffer utilization: Check __rocksdb_stalls_(stops|delays)_memtable_limit metrics for write buffer pressure
Review garbage collection settings: Check garbageCollection.deleteKeyspaceQuantile and garbageCollection.minFreeCapacity configuration
Analyze workload patterns: Review ddcs_m_adapter_entry_cache_total and ddcs_m_adapter_get_total metrics over time to identify capacity issues
Compare warm vs cold performance: Monitor latency metrics during warm and cold loads to quantify cache impact
Prevention#
Proactive Monitoring#
Set up alerts for:
Cache hit ratio thresholds: Alert when cache hit ratios drop below 70% for extended periods
Cache miss rate increases: Alert on significant increases in cache miss rates
Write buffer pressure: Alert when write buffer utilization approaches limits
Memory pressure: Alert when pod memory usage approaches limits to prevent eviction
Capacity Planning#
Estimate cache requirements: Calculate cache sizes based on average scene sizes and access patterns
Plan for cache growth: Account for cache growth as workloads scale
Monitor cache trends: Track cache usage trends to predict when capacity increases are needed
Test cache effectiveness: Validate cache configurations under expected production load
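A back-of-envelope sizing formula can make the capacity-planning steps concrete. This is a sketch under stated assumptions, not DDCS guidance: all inputs are workload estimates you must measure yourself, and the 1.5x headroom factor is an assumed placeholder for growth and eviction slack.

```python
def estimate_cache_bytes(avg_scene_bytes: int, concurrent_scenes: int,
                         headroom: float = 1.5) -> int:
    """Rough cache-size estimate: peak concurrent derived data times a
    headroom factor. avg_scene_bytes is the measured derived-data
    footprint per scene; headroom=1.5 is an assumption, not a default.
    """
    return int(avg_scene_bytes * concurrent_scenes * headroom)

# e.g. 4 GiB of derived data per scene, 3 concurrent scenes -> ~18 GiB
size = estimate_cache_bytes(4 * 2**30, 3)
```

Validate any estimate against the hit-ratio and eviction metrics above under expected production load before committing to a configuration.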