UCC: Metadata Cache Undersizing#

Overview#

UCC uses NGINX’s proxy_cache_path keys_zone parameter to allocate shared memory for tracking cached items. This metadata cache (keys zone) stores information about which URLs are cached, their cache keys, expiration times, and storage locations. When the metadata cache is undersized, NGINX silently evicts the oldest cache metadata using LRU (Least Recently Used), causing cache misses even when content is present on disk.

The metadata cache is configured via the keys_zone parameter. For example:

proxy_cache_path /cache/data
                 levels=1:2
                 keys_zone=ucc_cache:256m
                 inactive=1d
                 max_size=500g;

Here, keys_zone=ucc_cache:256m allocates 256 MiB of shared memory. Each cache entry consumes approximately 200-300 bytes, so 256 MiB tracks roughly 900,000-1,300,000 items. The default 256 MiB is often insufficient for large simulation workloads with millions of USD assets.

Sizing formula:

keys_zone_size = (unique_urls * 250 bytes) * 1.5 headroom
# Example: 1.5M URLs * 250 bytes * 1.5 = 562 MB → round up to 1g (512m would fall just short of the 1.5x headroom)
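
As a quick sanity check, the formula can be evaluated in the shell. The URL count below is an illustrative figure, not a measured value:

```shell
# Hypothetical sizing helper: plug in your expected unique URL count.
# Assumes ~250 bytes per cache entry and 1.5x headroom, per the formula above.
urls=1500000
required_mib=$(awk -v n="$urls" 'BEGIN { printf "%d", n * 250 * 1.5 / (1024 * 1024) + 1 }')
echo "keys_zone needs at least ${required_mib} MiB"   # 537 MiB for 1.5M URLs
```

Round the result up to the next standard size (512m, 1g, 2g) when setting keys_zone.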

When undersized, NGINX evicts metadata for the oldest cached items (no error messages logged). Clients requesting evicted items experience cache MISSes despite content being on disk, triggering re-fetches from S3.

Symptoms and Detection Signals#

Visible Symptoms#

  • High cache miss rates despite warm cache - Cache showing MISS for content that was previously cached

  • Frequent S3 re-fetches - Content being fetched from S3 despite existing on disk

  • Silent cache metadata eviction - No error messages; NGINX quietly evicts oldest metadata via LRU

  • Inconsistent cache hit ratios - Hit ratios varying significantly between identical workload runs

Cache Miss Patterns#

Where to find these logs:

  • Location: UCC Access Logs

  • Application: NGINX

  • Description: HTTP access logs showing cache MISS for the same file repeatedly

# SOURCE: NGINX Access Logs
# Look for patterns like:
upstream_cache_status: "MISS"
# Same file URI appearing multiple times with MISS status
# despite file being previously cached

Metric Signals#

rate(nginx_http_requests_total{pod=~"usd-content-cache-.*", cache_status="MISS"}[5m])
rate(nginx_http_requests_total{pod=~"usd-content-cache-.*", upstream_addr!="-"}[5m])

Cache effectiveness is measured through request metrics that distinguish cache hits from misses. The rate of cache MISS responses is tracked by nginx_http_requests_total with cache_status="MISS"; a MISS rate above 30% during warm-cache periods indicates the cache is not working effectively. When the metadata cache (keys zone) is undersized, NGINX silently evicts the oldest cache metadata via LRU, so it can no longer track previously cached items. Clients then see MISSes for content that exists on disk but has lost its metadata entry.

Upstream proxy activity reveals when UCC must fetch from S3 rather than serving from cache. The metric nginx_http_requests_total filtered by upstream_addr!="-" tracks requests proxied to upstream S3. High rates during warm cache scenarios indicate the cache is not serving content effectively, forcing UCC to repeatedly fetch the same content from S3. This pattern, combined with high MISS rates, strongly suggests metadata cache undersizing is preventing proper cache operation.
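
One way to express the warm-cache hit ratio directly in PromQL — assuming the metric shown above also exposes a cache_status="HIT" label value, which this runbook does not confirm — is:

```promql
# Warm-cache hit ratio; values below 0.7 suggest investigation
sum(rate(nginx_http_requests_total{pod=~"usd-content-cache-.*", cache_status="HIT"}[5m]))
/
sum(rate(nginx_http_requests_total{pod=~"usd-content-cache-.*"}[5m]))
```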

Root Cause Analysis#

Known Causes#

Metadata cache undersizing is typically caused by using the default 256 MiB keys_zone size for workloads with >1 million cached items, or by applying configuration to the wrong Helm values section.

Default Metadata Cache Size Insufficient#

When asset count exceeds metadata tracking capacity (~1M items for 256 MiB), NGINX silently evicts older cache metadata. Clients requesting evicted items experience cache MISSes and re-fetch from S3.

Configuration Applied to Wrong Helm Values Section#

A common deployment mistake is setting metadataMemorySize in the wrong section of the Helm values file. Configuration must be under nginx.proxyCache.paths[] to take effect.
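
A sketch of the two layouts. The first matches the expected location; the second is a hypothetical example of the misplacement, which the chart would typically ignore without error:

```yaml
# Correct: metadataMemorySize under nginx.proxyCache.paths[]
nginx:
  proxyCache:
    paths:
      - name: s3
        metadataMemorySize: 512m
---
# Hypothetical misplacement (typically ignored): key set outside paths[]
nginx:
  proxyCache:
    metadataMemorySize: 512m
```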

Other Possible Causes#

  1. Highly Complex URL Structures - Very long URLs consuming more bytes per cache key

  2. Multiple Cache Zones Competing for Memory - Multiple cache paths with separate keys_zones

  3. Workload Access Patterns - Extremely diverse URLs preventing effective LRU

Troubleshooting Steps#

Diagnostic Steps#

1. Verify Configuration Was Applied Correctly#

Check if metadata cache size was applied and rendered correctly in NGINX config.

# Check Helm values
helm get values <ucc-release-name> -n ucc -o yaml | grep -B 5 -A 5 "metadataMemorySize"

# Expected location:
# nginx:
#   proxyCache:
#     paths:
#       - name: s3
#         metadataMemorySize: 512m  # or larger

# Check rendered NGINX configuration
kubectl get configmap -n ucc <ucc-nginx-config> -o yaml | grep "keys_zone"

# Should show: keys_zone=s3:512m (or configured size)
# If shows 256m, configuration was not applied or was overridden

Analysis:

  • Mismatch between Helm values and rendered config indicates configuration not applied.

  • Missing metadataMemorySize or wrong YAML section causes default 256m to be used.

Resolution:

  • If mismatch found, correct Helm values location (must be under nginx.proxyCache.paths).

  • Reapply: helm upgrade <release> <chart> -n ucc -f values.yaml

  • Verify rendered config post-upgrade.

2. Monitor Cache Miss Rates and Decide If Resize Is Needed#

Check cache hit ratios to determine if metadata cache is undersized.

# Check cache HIT/MISS distribution
kubectl logs -n ucc <ucc-pod> --tail=10000 | grep "upstream_cache_status" | \
  awk -F'"' '{print $26}' | sort | uniq -c
# NOTE: the awk field index ($26) depends on the configured log_format

# Calculate hit ratio: HIT / (HIT + MISS) * 100%
# Warm cache target: >80% HIT rate

# If warm cache shows <70% HIT rate, investigate
# Metadata cache undersizing is one possible cause

# Estimate unique URL count accessed
kubectl logs -n ucc <ucc-pod> --tail=50000 | \
  grep "request_uri" | awk -F'"' '{print $8}' | sort -u | wc -l

# Compare against metadata cache capacity
# 256m ~ 1M items; 512m ~ 2M items; 1g ~ 4M items

Analysis:

  • Warm cache with <70% HIT rate may indicate metadata cache issues.

  • If unique URL count exceeds keys_zone capacity, resize is needed.

Resolution:

  • If unique URLs > capacity, increase metadataMemorySize (step 3).

  • If HIT rate is low for other reasons, see Poor Cache Hit Ratios runbook.

3. Increase Metadata Cache Size#

Resize keys_zone based on workload asset count.

# Edit Helm values
# Under nginx.proxyCache.paths (for each backend):
# - name: s3
#   metadataMemorySize: 1g  # Increased from 256m
#   path: /cache/s3

# Apply via Helm upgrade
helm upgrade <ucc-release-name> <chart-path> -n ucc -f values.yaml

# Restart pods to pick up new configuration
kubectl rollout restart statefulset -n ucc <ucc-statefulset>

# Verify new configuration applied
kubectl get configmap -n ucc <ucc-nginx-config> -o yaml | grep "keys_zone"

Analysis:

  • Recommended sizes: 512m for 1M-2M URLs; 1-2g for >2M URLs.

Resolution:

  • Apply via Helm upgrade.

  • Monitor cache HIT rates post-resize to validate improvement (target >80%).

Prevention#

Proactive Monitoring#

Set up alerts for:

  • Cache miss rate thresholds: Alert when MISS rate exceeds 30% for warm cache

  • Cache hit ratio degradation: Alert when HIT ratio drops below 70%
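
These thresholds could be expressed in PromQL — assuming the same metric and labels as in the Metric Signals section — for example:

```promql
# Alert: MISS rate above 30% of total requests during warm cache
sum(rate(nginx_http_requests_total{pod=~"usd-content-cache-.*", cache_status="MISS"}[5m]))
/
sum(rate(nginx_http_requests_total{pod=~"usd-content-cache-.*"}[5m]))
> 0.30
```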

Configuration Best Practices#

  • Size keys_zone for asset count: Use formula from Overview section

  • Start with 512 MiB - 1 GiB: For workloads with 1M-4M unique URLs

  • Verify configuration location: Ensure metadataMemorySize is under nginx.proxyCache.paths[]

  • Validate post-deployment: Verify rendered NGINX config matches Helm values

Capacity Planning#

  • Estimate unique URL count: Analyze scenes to determine asset count

  • Calculate metadata requirement: Use sizing formula; add 50% headroom

  • Test under peak load: Validate keys_zone size with maximum concurrent simulation count
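
The capacity tiers used in this runbook (~250 bytes per entry with headroom) can be captured in a small helper; the function name and tier boundaries are a sketch based on the sizes quoted above:

```shell
# Hypothetical helper mapping a unique-URL estimate to a keys_zone size,
# following the tiers in this runbook (256m ~ 1M, 512m ~ 2M, 1g ~ 4M items).
recommend_keys_zone() {
  if   [ "$1" -le 1000000 ]; then echo "256m"
  elif [ "$1" -le 2000000 ]; then echo "512m"
  elif [ "$1" -le 4000000 ]; then echo "1g"
  else                            echo "2g"
  fi
}
recommend_keys_zone 1500000   # prints 512m
recommend_keys_zone 3000000   # prints 1g
```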