DDCS: Cache Misses and Performance Degradation#

Overview#

DDCS caches derived data that is computationally expensive to generate. When cache misses occur frequently, render time is spent regenerating derived content before the desired workload can run, which significantly increases scene load times and reduces overall system performance.

Cache misses and performance degradation can occur when:

  • Cache is cold (scene or assets not preloaded)

  • Cache eviction due to memory or disk pressure

  • Misconfigured cache size parameters

  • Insufficient cache capacity for workload patterns

  • Write buffer pressure causing cache eviction

Symptoms and Detection Signals#

Visible Symptoms#

  • User-facing slowdowns - Scene loads taking significantly longer than expected

  • Inconsistent performance - Performance varies dramatically between “warm” and “cold” scene loads

Log Messages#

Write Buffer Pressure#

Where to find these logs:

  • Pod: ddcs-*

  • Location: DDCS Pod

  • Application: DDCS

  • Description: Logs indicating writes are behind or are failing

# LEVEL: Error
# SOURCE: DDCS/RocksDB
kubernetes.pod_name: ddcs-* and
message: "*timeout while trying to write to DB*" or
         "*write failed because the database is busy*" or
         "*failed to write*" or
         "*the engine is busy and cannot currently process the request*"

Metric Signals#

The following Prometheus metrics can be used to further determine the cause of the problem. During normal operation these counts may climb. These metrics relate to the write performance of the storage medium. When the engine is writing, or attempting to write, data faster than the medium allows, scene load times will suffer.

RocksDB Write Stall Metrics#

ddcs_rocksdb_intrinsic_gauge{name="..."}

RocksDB write stalls occur when the database throttles writes due to resource pressure. These stalls come in two forms: delays (soft stalls that slow down writes) and stops (hard stalls that completely block write operations). The total counts are tracked in __rocksdb_stalls_total_delays and __rocksdb_stalls_total_stops, which provide an overall view of how frequently RocksDB is throttling writes due to resource constraints.

Write stalls related to L0 file count limits indicate that compaction cannot keep up with write volume. When the L0 file count approaches limits, RocksDB triggers delays tracked by __rocksdb_stalls_delays_l0_file_count_limit as an early warning. If the problem worsens and the limit is exceeded, write operations are completely blocked, counted in __rocksdb_stalls_stops_l0_file_count_limit.

Pending compaction bytes stalls occur when the compaction backlog grows faster than it can be processed, typically suggesting disk I/O bottlenecks. As pending compaction bytes approach limits, delays are recorded in __rocksdb_stalls_delays_pending_compaction_bytes. When the limit is exceeded, writes are stopped entirely, tracked by __rocksdb_stalls_stops_pending_compaction_bytes.

Memtable stalls happen when write buffers fill faster than they can be flushed to disk. Initial pressure triggers delays captured in __rocksdb_stalls_delays_memtable_limit, warning that write buffers are filling up. If the memtable limit is reached, RocksDB blocks all writes until space is available, counted in __rocksdb_stalls_stops_memtable_limit.
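
As a rough PromQL sketch (the 10-minute window is an assumption to tune for your scrape interval), the queries below surface how quickly the stall counts described above are growing. Because these values are exposed on a gauge metric, delta() is used rather than rate().

# Overall stall growth over the last 10 minutes (delays and stops)
delta(ddcs_rocksdb_intrinsic_gauge{name=~"__rocksdb_stalls_total_(delays|stops)"}[10m])

# Stall growth broken down by cause
delta(ddcs_rocksdb_intrinsic_gauge{name=~"__rocksdb_stalls_(delays|stops)_(l0_file_count_limit|pending_compaction_bytes|memtable_limit)"}[10m])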

Database Error Metrics#

ddcs_m_adapter_error_total{kind="..."}

The database engine reports errors through several error types that indicate write capacity issues. During normal operation, these error counts will gradually climb as the system handles occasional resource contention. However, when these errors sharply increase together, it signals that the database engine cannot keep up with the write load. The rocks_timed_out error indicates write operations exceeded timeout thresholds, while rocks_busy shows the engine rejected operations due to resource saturation. The rocks_try_again error suggests temporary unavailability requiring retry logic. Monitoring these error types collectively provides early warning of database performance degradation.
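
For example, a query along these lines (the 15-minute window is illustrative, and counter semantics are assumed for the _total metric) shows whether the error kinds are climbing together:

# Per-kind error growth over the last 15 minutes
sum by (kind) (
  increase(ddcs_m_adapter_error_total{kind=~"rocks_timed_out|rocks_busy|rocks_try_again"}[15m])
)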

Cache Performance Metrics#

ddcs_m_adapter_entry_cache_total{operation="..."}
ddcs_m_adapter_get_total{level="..."}

Cache effectiveness is measured through hit and miss ratios tracked by the ddcs_m_adapter_entry_cache_total metric. When operation="hit", the metric counts successful cache retrievals, while operation="miss" tracks requests that required data generation or disk access. Comparing these values produces the cache hit ratio, where ratios below 70-80% indicate cache effectiveness issues such as cold cache conditions, undersizing, or eviction pressure.
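
A minimal sketch of the hit-ratio calculation in PromQL, assuming both series are exported as counters and that a 30-minute window suits your traffic pattern:

# Cache hit ratio over the last 30 minutes (1.0 = every lookup was a hit)
sum(rate(ddcs_m_adapter_entry_cache_total{operation="hit"}[30m]))
/
sum(rate(ddcs_m_adapter_entry_cache_total{operation=~"hit|miss"}[30m]))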

Beyond simple hit/miss counts, the cache location matters significantly for performance. The ddcs_m_adapter_get_total metric breaks down retrievals by storage level. Requests served from level="in_memory" or level="rocksdb_in_memory" represent the ideal case requiring no disk I/O and providing the fastest response times. In contrast, level="rocksdb_disk_seek" indicates cache misses that forced disk reads, significantly degrading performance. High disk seek counts relative to in-memory gets suggest cache eviction or cold cache conditions forcing expensive disk operations.
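
The same idea applies per storage level. The query below (again assuming counter semantics and an illustrative window) computes the fraction of gets that fell through to a disk seek rather than being served from memory:

# Share of gets served by a disk seek rather than an in-memory level
sum(rate(ddcs_m_adapter_get_total{level="rocksdb_disk_seek"}[30m]))
/
sum(rate(ddcs_m_adapter_get_total{level=~"in_memory|rocksdb_in_memory|rocksdb_disk_seek"}[30m]))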

I/O Performance Metrics#

ddcs_m_adapter_io_histogram_bucket{io_kind="..."}
ddcs_m_adapter_bytes_returned_total{}

Disk I/O performance is tracked through histograms that capture operation latency. The ddcs_m_adapter_io_histogram_bucket metric measures time spent on storage operations, with io_kind="rocksdb_read" tracking read latency and io_kind="rocksdb_write" measuring write latency. High read latency values suggest disk I/O bottlenecks or cache misses forcing disk access, while elevated write latencies indicate write buffer pressure or storage performance issues causing overall performance degradation.

The volume of data movement provides additional insight into cache effectiveness through ddcs_m_adapter_bytes_returned_total, which tracks total bytes returned across all storage levels. When this metric shows high values relative to in-memory cache hits, it indicates increased disk I/O activity, suggesting cache eviction or cold cache conditions are forcing the system to read more data from disk rather than serving it from fast in-memory caches.
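
As an illustrative sketch (assuming the histogram follows standard Prometheus bucket conventions with a le label, and that a 15-minute window is appropriate), per-kind p99 latency and byte throughput can be derived as follows:

# p99 latency per I/O kind over the last 15 minutes
histogram_quantile(0.99,
  sum by (io_kind, le) (rate(ddcs_m_adapter_io_histogram_bucket{io_kind=~"rocksdb_read|rocksdb_write"}[15m]))
)

# Bytes returned per second across all storage levels
sum(rate(ddcs_m_adapter_bytes_returned_total[15m]))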

Root Cause Analysis#

Possible Causes#

Cold Cache (Scene or Assets Not Preloaded)#

A cold cache occurs when scenes or assets have not been preloaded into DDCS. During initial scene loads, GPUs encounter assets for the first time and must generate derived content synchronously, which is then written to DDCS. This synchronous generation and writing process significantly increases scene load times compared to warm cache scenarios where derived data is already available.

Cold cache conditions are expected during initial scene loads or when new assets are introduced. However, if cache miss ratios remain consistently high across multiple scene loads or after cache warm-up procedures, it may indicate that the cache is not sized correctly for the workload or that cache eviction is occurring too frequently.

Cache Eviction Due to Memory/Disk Pressure#

Cache eviction occurs when memory or disk pressure forces DDCS to remove cached data to make room for new writes. Write buffers evict cached key-value pairs when full, and RocksDB may evict data from caches when memory pressure occurs.

Cache size parameters (sys.cache_size for row cache, sys.block_cache_size for block cache) may be misconfigured for the workload. Caches that are too small result in frequent eviction and cache misses, while caches that are too large may cause memory pressure and OOM kills.

Storage Medium has Insufficient Performance#

Insufficient storage performance occurs when the underlying persistent volume cannot keep up with DDCS write and read operations. When storage IOPS or throughput is insufficient, RocksDB cannot flush write buffers or perform compaction fast enough, causing write stalls and cache eviction. This manifests as high stall metrics, slow I/O histograms, and increased disk seek operations.

Storage performance requirements depend on your installation environment and workload. The storage class and persistent volume configuration must provide sufficient IOPS and throughput for the attached volumes. Consult your cloud service provider (CSP) documentation for storage class performance characteristics and scaling options.

Other Possible Causes#

  1. Insufficient Cache Capacity

    • Total cache size insufficient for workload data set

    • Cache not sized for peak scene sizes

    • Multiple concurrent scenes exceeding cache capacity

  2. Write Buffer Configuration Issues

    • Too few write buffers (cf.max_write_buffer_number) causing premature eviction

    • Write buffer size too small for write bursts

    • Write buffer pressure causing cache eviction

  3. Garbage Collection Aggressiveness

    • Garbage collection removing cached data too aggressively

    • garbageCollection.deleteKeyspaceQuantile set too high

    • garbageCollection.minFreeCapacity threshold too high

  4. Workload Pattern Changes

    • New scenes or assets not fitting existing cache patterns

    • Increased scene complexity requiring more cache space

    • Concurrent workload increases exceeding cache capacity

Troubleshooting Steps#

Diagnostic Steps#

1. Check Cache Hit/Miss Ratios#

Monitor cache hit and miss metrics to determine if cache is cold or experiencing eviction.

# Query metrics:
# - ddcs_m_adapter_entry_cache_total{operation="hit"}
# - ddcs_m_adapter_entry_cache_total{operation="miss"}

Analysis:

  • High miss rates relative to hits indicate cold cache or cache eviction.

  • Hit ratios below 70-80% suggest cache effectiveness issues.

  • Compare hit rates between initial scene loads (cold) and subsequent loads (warm).

  • Consistently high miss ratios after warm-up indicate cache sizing or eviction problems.

Resolution:

  • If cache is cold, implement warm-up procedures for common scenes.

  • If miss ratios remain high after warm-up, investigate cache sizing (see Cache Eviction section).

  • Monitor ddcs_m_adapter_get_total{level="in_memory|rocksdb_in_memory"} vs ddcs_m_adapter_get_total{level="rocksdb_disk_seek"} to quantify cache effectiveness.

2. Check Cache Hit/Miss and Disk Seek Metrics#

Monitor cache effectiveness metrics to identify eviction patterns.

# Query metrics:
# - ddcs_m_adapter_entry_cache_total{operation="miss"} vs {operation="hit"}
# - ddcs_m_adapter_get_total{level="in_memory|rocksdb_in_memory"} vs {level="rocksdb_disk_seek"}

Analysis:

  • High miss rates relative to hits indicate cache eviction.

  • High rocksdb_disk_seek relative to in_memory|rocksdb_in_memory indicates eviction forcing disk reads.

  • Increasing miss rates over time suggest cache capacity issues.

Resolution:

  • If eviction is occurring, check cache size configuration (see step 4).

  • Monitor ddcs_m_adapter_bytes_returned_total to track data volume requiring disk access.

3. Monitor RocksDB Stall Metrics#

Check RocksDB stall metrics to identify storage performance bottlenecks.

# Access DDCS metrics endpoint
kubectl port-forward -n ddcs svc/<ddcs-service-name> 3051:3051
curl http://localhost:3051/metrics | grep "__rocksdb_stalls"

# Query stall metrics:
# - ddcs_rocksdb_intrinsic_gauge{name=~"__rocksdb_stalls_total_(stops|delays)"}
# - ddcs_rocksdb_intrinsic_gauge{name=~"__rocksdb_stalls_(stops|delays)_l0_file_count_limit"}
# - ddcs_rocksdb_intrinsic_gauge{name=~"__rocksdb_stalls_(stops|delays)_pending_compaction_bytes"}
# - ddcs_rocksdb_intrinsic_gauge{name=~"__rocksdb_stalls_(stops|stalls)_memtable_limit"}

Analysis:

  • High stall counts indicate RocksDB throttling writes due to storage performance limits.

  • L0 file count limit stalls suggest compaction cannot keep up with write volume.

  • Pending compaction bytes stalls indicate compaction backlog due to slow disk I/O.

  • Memtable limit stalls indicate write buffers cannot flush fast enough.

Resolution:

  • If stalls are high, investigate storage class IOPS and throughput capabilities.

  • Review ddcs_m_adapter_io_histogram_bucket{io_kind="rocksdb_write"} for slow write operations.

  • Consult CSP documentation to upgrade storage class or increase volume performance.

4. Check I/O Performance Metrics#

Monitor I/O histogram metrics to quantify storage performance issues.

# Query I/O metrics:
# - ddcs_m_adapter_io_histogram_bucket{io_kind="rocksdb_read"}
# - ddcs_m_adapter_io_histogram_bucket{io_kind="rocksdb_write"}

Analysis:

  • High read I/O latency indicates slow disk reads, suggesting cache misses requiring disk access.

  • High write I/O latency indicates slow disk writes, suggesting write buffer pressure.

  • Compare I/O latencies against storage class performance specifications.

Resolution:

  • If I/O latencies are high, upgrade storage class or increase volume IOPS/throughput.

  • Monitor ddcs_m_adapter_error_total{kind=~"rocks_timed_out|rocks_busy|rocks_try_again"} for storage-related errors.

  • Review storage class configuration and consider higher performance tiers.

5. Review Storage Class and Volume Configuration#

Verify storage class provides sufficient IOPS and throughput for the workload.

# Check PVC storage class
kubectl get pvc -n ddcs
kubectl describe pvc -n ddcs <pvc-name>

# Check storage class configuration
kubectl get storageclass
kubectl describe storageclass <storage-class-name>

Analysis:

  • Storage class performance characteristics determine available IOPS and throughput.

  • Insufficient storage performance causes RocksDB stalls and cache eviction.

  • Compare storage class specs against workload requirements.

Resolution:

  • Upgrade to storage class with higher IOPS/throughput if performance is insufficient.

  • Consider provisioned IOPS volumes for consistent performance.

  • Monitor stall metrics after storage changes to validate improvements.

Other Diagnostic Actions#

  • Monitor write buffer utilization: Check __rocksdb_stalls_(stops|delays)_memtable_limit metrics for write buffer pressure

  • Review garbage collection settings: Check garbageCollection.deleteKeyspaceQuantile and garbageCollection.minFreeCapacity configuration

  • Analyze workload patterns: Review ddcs_m_adapter_entry_cache_total and ddcs_m_adapter_get_total metrics over time to identify capacity issues

  • Compare warm vs cold performance: Monitor latency metrics during warm and cold loads to quantify cache impact

Prevention#

Proactive Monitoring#

Set up alerts for:

  • Cache hit ratio thresholds: Alert when cache hit ratios drop below 70% for extended periods (see the example expression after this list)

  • Cache miss rate increases: Alert on significant increases in cache miss rates

  • Write buffer pressure: Alert when write buffer utilization approaches limits

  • Memory pressure: Alert when pod memory usage approaches limits to prevent eviction
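
As one possible starting point for the hit-ratio alert above, the backing PromQL expression could look like the following; the 70% threshold and 30-minute window are assumptions to adjust for your environment.

# Fires when the cache hit ratio stays below 70% over the evaluation window
(
  sum(rate(ddcs_m_adapter_entry_cache_total{operation="hit"}[30m]))
  /
  sum(rate(ddcs_m_adapter_entry_cache_total{operation=~"hit|miss"}[30m]))
) < 0.70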

Capacity Planning#

  • Estimate cache requirements: Calculate cache sizes based on average scene sizes and access patterns (see the worked example after this list)

  • Plan for cache growth: Account for cache growth as workloads scale

  • Monitor cache trends: Track cache usage trends to predict when capacity increases are needed

  • Test cache effectiveness: Validate cache configurations under expected production load
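
As a purely hypothetical back-of-the-envelope sizing example (every number below is an assumption, not a measured value), the working-set estimate might look like this:

# Hypothetical sizing sketch -- replace all numbers with your own measurements
#   average derived data per scene:       ~40 GiB
#   concurrent scenes at peak:             3
#   working set:                           3 x 40 GiB = ~120 GiB
#   headroom for write buffers and growth: +25%       = ~150 GiB
# A cache smaller than the working set will evict hot data and show
# high miss ratios even after warm-up.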