UCC: Upstream S3 Connection Spikes and High Connect Time#

Overview#

UCC fetches uncached content from upstream S3 buckets during cache MISS scenarios. When the number of concurrent connections to S3 becomes excessive (>50,000-60,000), connection establishment times increase dramatically (P99 connect time from milliseconds to 5-10 seconds), S3 may return 503 (Service Unavailable) errors, and cloud provider SNAT ports may become exhausted.

In the current release, NGINX opens a new TCP connection to S3 for every request. Each connection requires a TCP handshake plus a TLS handshake, consuming time and SNAT ports. With high cache MISS rates (>3,000 RPS to S3), this quickly exhausts available connections and SNAT capacity.

Additionally, S3 uses DNS-based load balancing with short TTLs (5-60 seconds) to distribute requests. High DNS cache TTL in NGINX prevents discovery of new S3 endpoints as S3 scales, concentrating traffic on a subset of endpoints and worsening connection bottlenecks.
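Where the deployment exposes NGINX resolver settings, a short DNS cache validity lets NGINX discover new S3 endpoints as S3's DNS rotates them. A minimal sketch, assuming resolver settings are configurable in your deployment (the resolver address and timings below are illustrative, not taken from this release's configuration):

```nginx
# Illustrative only: cap cached S3 DNS answers at 30s so endpoints newly
# published by S3's DNS-based load balancing are picked up promptly.
resolver 10.0.0.2 valid=30s ipv6=off;
resolver_timeout 5s;
```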

Note: Connection reuse to S3 is not available in the current release. Future versions will include a Lua-based connection pooling balancer. Operators experiencing this issue should focus on reducing cache MISS rates and monitoring SNAT capacity.

Symptoms and Detection Signals#

Visible Symptoms#

  • High upstream connection counts - Connections to S3 exceeding 50,000-60,000 (estimated from SNAT metrics)

  • Slow S3 connect times - P99 connect time exceeding 5-10 seconds (baseline should be <100ms)

  • S3 503 errors - S3 returning “Service Unavailable” due to request rate limits

  • SNAT port exhaustion warnings - Cloud provider reporting high SNAT port usage (>70% of allocated)

Log Messages#

S3 503 Service Unavailable#

Where to find these logs:

  • Location: UCC (NGINX) access logs

  • Description: Logs showing S3 returning 503 errors

# SOURCE: NGINX Access Logs
# Look for upstream 503 responses:
upstream_status: "503"
host: "*s3*.amazonaws.com*"
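As a concrete sketch, the same filter can be applied with standard tools if the access logs are in JSON-lines form (the `upstream_status` and `host` field names, and the sample records below, are assumptions about the log format):

```shell
# Count upstream 503 responses per S3 host from a JSON-lines access log.
# The sample data is illustrative; point the pipeline at real logs instead.
cat > /tmp/sample_access.log <<'EOF'
{"host": "bucket-a.s3.us-east-1.amazonaws.com", "upstream_status": "503"}
{"host": "bucket-a.s3.us-east-1.amazonaws.com", "upstream_status": "200"}
{"host": "bucket-b.s3.us-east-1.amazonaws.com", "upstream_status": "503"}
EOF
grep '"upstream_status": "503"' /tmp/sample_access.log |
  grep -o '"host": "[^"]*"' | sort | uniq -c | sort -rn
```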

Metric Signals#

histogram_quantile(0.99, sum by (le) (rate(upstream_connect_time_seconds_bucket[5m])))
rate(nginx_http_requests_total{pod=~"usd-content-cache-.*", upstream_status="503"}[5m])

Connection establishment performance to S3 is tracked through upstream connect time histograms. The P99 quantile calculated from upstream_connect_time_seconds_bucket reveals tail latency for establishing S3 connections. Values exceeding 1-2 seconds indicate connection establishment bottlenecks, with baseline performance typically under 100ms. When connection counts become excessive (>50,000-60,000), P99 connect times can spike to 5-10 seconds, indicating severe connection saturation where the system struggles to establish new TCP and TLS connections to S3.

S3 service availability issues manifest in error rate metrics tracked through nginx_http_requests_total with upstream_status="503". Sharp increases in 503 error rates indicate S3 is rate limiting requests due to excessive request volumes. When UCC opens too many concurrent connections or sends too many requests per second, S3 returns “Service Unavailable” responses to protect itself from overload.

Cloud provider SNAT metrics provide visibility into outbound connection capacity. Azure tracks UsedSnatPorts and AllocatedSnatPorts on load balancers, while AWS monitors ActiveConnectionCount on NAT Gateways. High values approaching the allocated limit indicate SNAT exhaustion risk, where the system runs out of available ports for new outbound connections. Alert when usage exceeds 70% of allocated ports to allow time for remediation before complete exhaustion occurs.

Root Cause Analysis#

Known Causes#

Upstream S3 connection spikes are caused by the lack of connection reuse in the current release. No operator-side configuration workaround is available.

No Connection Reuse to S3#

The current release cannot reuse connections to dynamically resolved S3 bucket hostnames. Each request opens a new TCP and TLS connection, causing connection count to scale linearly with request rate. High cache MISS rates drive S3 request rates, which drive connection count.

For example, 3,000 requests/sec to S3 with 10-second connection lifetime results in 30,000 concurrent connections. With no reuse, connection count grows until SNAT exhaustion or performance degradation occurs.
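The arithmetic above is Little's law (concurrency ≈ arrival rate × connection lifetime) and is easy to sanity-check:

```shell
# Concurrent connections ≈ request rate (req/s) × connection lifetime (s)
rps=3000
lifetime_s=10
echo $(( rps * lifetime_s ))   # prints 30000
```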

No operator workaround exists. Future versions will include connection pooling; operators should focus on reducing cache MISS rates to lower S3 request volume.

High Cache MISS Rate Driving S3 Requests#

Excessive cache MISSes cause UCC to fetch from S3 more frequently than necessary. Each MISS triggers an upstream request and consumes a new connection. Common causes of high MISS rates include an undersized metadata cache, a cold cache, and premature cache eviction.

Check cache HIT/MISS ratio:

# Query cache status from access logs
# (the awk field index depends on the access log format; adjust $26 as needed)
kubectl logs -n ucc <ucc-pod> --tail=10000 | grep "upstream_cache_status" | \
  awk -F'"' '{print $26}' | sort | uniq -c

# Expected for warm cache: >70% HIT rate
# High MISS rates trigger more S3 connections

Other Possible Causes#

  1. SNAT Port Allocation Insufficient - Allocated SNAT ports too low for outbound connection volume

  2. S3 Rate Limiting - S3 returning 503 when request rate exceeds bucket limits

  3. Network Latency to S3 - High RTT causing connections to remain open longer

Troubleshooting Steps#

Diagnostic Steps#

1. Monitor Upstream Connection Count and Connect Time#

Measure connection metrics to quantify the issue.

# Monitor SNAT usage (proxy for connection count)
# Azure:
# az monitor metrics list \
#   --resource <load-balancer-resource-id> \
#   --metric "UsedSnatPorts,AllocatedSnatPorts,SnatConnectionCount"

# AWS:
# CloudWatch metrics for NAT Gateway: ActiveConnectionCount

# Check upstream connect time from logs
# (the awk field index depends on the access log format; adjust $32 as needed)
kubectl logs -n ucc <ucc-pod> --tail=10000 | \
  grep "upstream_connect_time" | awk -F'"' '{print $32}' | \
  awk '{if($1>0) {sum+=$1; count++}} END {if(count>0) print "Avg:", sum/count}'

# Parse P99 connect time manually or via a log aggregation tool
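One way to approximate the P99 without a log aggregation tool is a sort-based percentile over the extracted connect times. A minimal sketch, assuming the times have already been extracted to one value per line (the sample values are illustrative):

```shell
# Approximate P99: take the value at the 99th-percentile rank of the sorted
# sample. With small samples this is coarse; prefer a real aggregation tool.
printf '%s\n' 0.05 0.08 0.04 0.07 6.2 0.06 0.05 0.09 0.05 0.06 > /tmp/connect_times.txt
sort -n /tmp/connect_times.txt | \
  awk '{a[NR]=$1} END {r=int(NR*0.99); if (r<1) r=1; print a[r]}'   # prints 0.09
```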

Analysis:

  • SNAT usage >70% indicates connection exhaustion risk.

  • P99 connect times >1-2s indicate saturation (baseline <100ms).

Resolution:

  • No configuration fix available in current release.

  • Focus on reducing cache MISS rate (step 2).

2. Reduce Cache MISS Rate to Lower S3 Request Volume#

Since connection reuse is not available, reduce the number of upstream S3 requests by improving cache effectiveness.

# Check current cache HIT/MISS ratio
kubectl logs -n ucc <ucc-pod> --tail=10000 | grep "upstream_cache_status" | \
  awk -F'"' '{print $26}' | sort | uniq -c

# If MISS rate >30%, investigate causes:
# - Cold cache: implement pre-warming (see Cache Expiration Stampede runbook)
# - Metadata cache undersized: increase keys_zone (see Metadata Cache runbook)
# - Disk issues: see Data Disk Bandwidth runbook

Resolution:

  • Improve cache HIT rates to reduce S3 request volume.

  • Each 10% improvement in HIT rate reduces S3 connections proportionally.

  • See related runbooks for cache effectiveness improvements.
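The proportional relationship can be sketched numerically: the S3 request rate is the total request rate times the MISS fraction (the numbers below are illustrative):

```shell
# S3 request rate ≈ total request rate × MISS rate (kept as integer percent)
total_rps=10000
echo "MISS 30%: $(( total_rps * 30 / 100 )) req/s to S3"   # prints 3000
echo "MISS 20%: $(( total_rps * 20 / 100 )) req/s to S3"   # prints 2000
```

Raising the HIT rate from 70% to 80% in this example cuts S3 request volume, and therefore new connections, by a third.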

3. Monitor Cloud Provider SNAT Allocation#

Verify SNAT port allocation is sufficient and monitor usage trends.

# Azure: Check load balancer SNAT metrics
# az monitor metrics list \
#   --resource <load-balancer-resource-id> \
#   --metric "AllocatedSnatPorts,UsedSnatPorts"

# Calculate headroom: (allocated - peak_used) / allocated
# If <30% headroom, request SNAT allocation increase

# AWS: Check NAT Gateway capacity
# Monitor ActiveConnectionCount and ConnectionAttemptCount
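The headroom formula in the comments above can be evaluated directly (the port counts below are illustrative):

```shell
# Headroom (%) = (allocated - peak_used) / allocated × 100; alert below ~30%
allocated=64000
peak_used=48000
echo "$(( (allocated - peak_used) * 100 / allocated ))% headroom"   # prints 25% headroom
```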

Resolution:

  • If SNAT usage consistently >70%, coordinate with cloud provider for allocation increase.

  • Plan for headroom as workload scales.

Prevention#

Proactive Monitoring#

Set up alerts for:

  • Upstream connect time degradation: Alert when P99 connect time exceeds 500ms

  • S3 503 error rate: Alert when S3 returns 503 errors (rate >1% of requests)

  • SNAT port utilization: Alert when usage exceeds 70% of allocated ports

  • Cache MISS rate increases: Alert when MISS rate exceeds 30% for warm cache
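As a sketch, the connect-time and 503-rate alerts above could be expressed as Prometheus alerting rules. The metric names follow the queries earlier in this runbook; the thresholds, labels, and rule names are illustrative:

```yaml
groups:
  - name: ucc-s3-upstream
    rules:
      - alert: UccS3ConnectTimeHigh
        expr: histogram_quantile(0.99, sum by (le) (rate(upstream_connect_time_seconds_bucket[5m]))) > 0.5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "P99 S3 connect time above 500ms"
      - alert: UccS3ErrorRateHigh
        expr: |
          sum(rate(nginx_http_requests_total{upstream_status="503"}[5m]))
            / sum(rate(nginx_http_requests_total[5m])) > 0.01
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "S3 503 rate above 1% of requests"
```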

Mitigation Strategies#

  • Pre-warm cache before production loads: Reduce initial MISS burst (see Cache Expiration Stampede runbook)

  • Improve cache hit ratios: Address metadata cache and disk sizing (see related runbooks)

  • Monitor SNAT capacity: Track SNAT usage trends; request increases before exhaustion

  • Coordinate with cloud provider: Work with support to increase SNAT allocation if needed

Future Improvements#

Connection pooling and reuse features are under development for future releases. These features will significantly reduce upstream connection count and improve performance. Operators experiencing this issue should monitor release notes for availability.