DDCS: Configure#
This guide provides detailed information on the Derived Data Cache Service (DDCS) component and its use in a self-hosted NVCF cluster.
DDCS reduces scene load time and improves performance when properly configured and sized for the workload. Derived data generation is computationally expensive and time-consuming. DDCS trades network bandwidth for compute time by caching derived data, allowing multiple GPUs to share pre-generated content.
Derived data is often many times the size of the source content. During initial scene loads, GPUs generate the most derived content as they encounter assets for the first time. While DDCS may add some time to cold scene loads (as content must be generated and written to DDCS synchronously), the generated data becomes immediately available to other GPUs. During “warm” scene loads, render workers read all derived content from DDCS instead of regenerating it, significantly reducing load times.
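The cold/warm behavior described above can be sketched as a toy read-through cache. This is an illustration only; the class and names are hypothetical and not the DDCS API:

```python
class DerivedDataCache:
    """Toy read-through cache: a miss 'generates' the derived data once
    and stores it; every subsequent reader gets the cached copy."""

    def __init__(self):
        self.store = {}
        self.generations = 0  # how many times expensive generation ran

    def get(self, asset: str) -> str:
        if asset not in self.store:          # cold load: generate and cache
            self.generations += 1
            self.store[asset] = f"derived({asset})"
        return self.store[asset]             # warm load: served from cache


cache = DerivedDataCache()
cache.get("scene/asset1")   # first GPU pays the generation cost
cache.get("scene/asset1")   # later GPUs read the cached result
print(cache.generations)    # 1
```

However many workers request the same asset, generation happens once; this is the compute-for-bandwidth trade the section describes.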
Base Configuration#
DDCS requires some configuration before it can be installed. Create a file
on your local machine called values.yaml. A base configuration is
provided in the following dropdown.
image:
  pullSecrets:
    - name: ngc-container-pull
cluster:
  replicas: 1
  selfAntiAffinity: false
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          preference:
            matchExpressions:
              - key: node-type
                operator: In
                values:
                  - compute
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 5
          podAffinityTerm:
            labelSelector:
              matchExpressions:
                - key: ddcs
                  operator: In
                  values:
                    - "kvnode"
            topologyKey: "kubernetes.io/hostname"
container:
  resources:
    #limits:
    #  memory: 32Gi
    requests:
      memory: 32Gi
storage:
  volume:
    size: 330Gi
    storageClassName: "gp3"
settings:
  storageLimit: 300G
  engine:
    sys.cache_size: "10G"
    sys.block_cache_size: "18G"
    cf.max_write_buffer_number: 128
    sys.increase_parallelism: 8
    db.max_background_jobs: 8
monitoring:
  enabled: false
Complete Configuration Reference#
The base configuration above covers the essential settings for most deployments. For advanced configuration options or to explore all available settings, refer to the complete values file below. This reference includes all configuration options available in the DDCS Helm chart, including advanced settings for TLS, OpenTelemetry, and resource management.
Full Configuration - View the full reference values.yaml file.
1. Provision and Scale#
Proper provisioning and scaling are critical for DDCS performance. When undersized or misconfigured for the workload, simulations will slow down or even fail. Ensure adequate network bandwidth and storage capacity based on your GPU count and scene complexity.
This guide assumes a “CPU/GPU” node split. That is, workloads that do not require a GPU are scheduled on Kubernetes nodes without GPUs (CPU nodes). Workloads that do require a GPU are scheduled on Kubernetes nodes with GPUs (GPU nodes).
As a baseline, plan for ~3.3 Gbps of cache bandwidth for each GPU in the cluster.
AWS EKS Example#
Assume the SKUs for the CPU and GPU nodes are m5.8xlarge and
g6e.4xlarge, respectively.
| SKU | vCPU | RAM | NIC | GPU |
|---|---|---|---|---|
| m5.8xlarge (compute) | 32 | 128 G | 10 Gbps | |
| g6e.4xlarge (gpu) | 16 | 128 G | 20 Gbps | 1 L40S |
Note
If a cluster is provisioned with 25 GPUs, a matching ~83 Gbps (3.3 Gbps x 25 GPUs) of bandwidth is necessary to facilitate caching.
Provision 9 compute nodes with 10 Gbps NICs and create 9 DDCS replicas.
Place a single DDCS pod on each compute node. Each additional pod expands network capacity and distributes cache load across nodes. If 9 compute nodes were provisioned for caching, schedule one DDCS pod on each node.
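The sizing above can be expressed as a quick calculation. This sketch assumes the ~3.3 Gbps-per-GPU baseline and a 10 Gbps compute-node NIC from the example; the function name and defaults are illustrative:

```python
import math


def ddcs_replicas(gpu_count: int,
                  gbps_per_gpu: float = 3.3,
                  node_nic_gbps: float = 10.0) -> int:
    """Number of DDCS replicas (one per compute node) needed to serve
    the aggregate cache bandwidth for a given GPU count."""
    required_gbps = gpu_count * gbps_per_gpu
    return math.ceil(required_gbps / node_nic_gbps)


# 25 GPUs x 3.3 Gbps = 82.5 Gbps -> 9 compute nodes at 10 Gbps each
print(ddcs_replicas(25))  # 9
```

Round up: an undersized cache network slows every warm load, while an extra node only adds headroom.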
Important
Set the replica count in values.yaml:

cluster:
  replicas: 9
2. Memory Configuration#
DDCS memory configuration includes Kubernetes resource limits and requests, along with engine-level cache settings. The base configuration allocates memory for the Point Cache, Block Cache, and Write Buffers.
Here are some common configuration modes that provide for various CPU and RAM allocations.
4 vCPU x 16GB RAM#
container:
  resources:
    #limits:
    #  memory: 16Gi
    requests:
      memory: 16Gi
settings:
  engine:
    sys.cache_size: "4G"
    sys.block_cache_size: "8G"
    cf.max_write_buffer_number: 64
    sys.increase_parallelism: 4
    db.max_background_jobs: 4
8 vCPU x 32GB RAM#
container:
  resources:
    #limits:
    #  memory: 32Gi
    requests:
      memory: 32Gi
settings:
  engine:
    sys.cache_size: "10G"
    sys.block_cache_size: "18G"
    cf.max_write_buffer_number: 128
    sys.increase_parallelism: 8
    db.max_background_jobs: 8
12 vCPU x 64GB RAM#
container:
  resources:
    #limits:
    #  memory: 64Gi
    requests:
      memory: 64Gi
settings:
  engine:
    sys.cache_size: "20G"
    sys.block_cache_size: "36G"
    cf.max_write_buffer_number: 128
    sys.increase_parallelism: 12
    db.max_background_jobs: 12
Tip
Use one of the configurations above. How the cache allocates memory is controlled by the following settings.
- Point Cache (sys.cache_size): The first cache level for serving cached content. Set to 10G in the base configuration.
- Block Cache (sys.block_cache_size): The second cache level used by RocksDB for metadata, filters, and file blocks. Set to 18G in the base configuration.
- Write Buffers (cf.max_write_buffer_number): In-memory buffers that absorb write bursts before flushing to disk. Each buffer consumes 64 MB. The base configuration sets this to 128 buffers (approximately 8 GB total capacity).
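The memory budget implied by these settings can be sanity-checked with a small calculation. The function name is hypothetical; note that cf.max_write_buffer_number is an upper bound, so the sum is a worst case rather than steady-state usage:

```python
def ddcs_memory_budget_gib(point_cache_gib: float,
                           block_cache_gib: float,
                           write_buffers: int,
                           buffer_mib: int = 64) -> float:
    """Worst-case engine memory: point cache + block cache + write
    buffers (each buffer is 64 MiB)."""
    return point_cache_gib + block_cache_gib + (write_buffers * buffer_mib) / 1024


# Base configuration: 10G point cache + 18G block cache + 128 x 64 MiB buffers
print(ddcs_memory_budget_gib(10, 18, 128))  # 36.0

# 4 vCPU x 16GB profile: 4G + 8G + 64 x 64 MiB buffers
print(ddcs_memory_budget_gib(4, 8, 64))     # 16.0
```

If you deviate from the documented profiles, keep this worst-case figure in view when choosing the container memory request.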
3. Volume Configuration#
Reading from persistent storage, even a network-attached volume, is often faster than regenerating derived data. Therefore, DDCS persists content to a Kubernetes volume.
storage:
  volume:
    size: 330Gi
    storageClassName: "gp3"
size determines the volume size that will be created and attached to
each pod. Depending on the cloud environment, volume size may also
control the IOPS and throughput of the volume. Refer to your cloud
provider’s documentation for specific performance characteristics.
storageClassName determines the performance characteristics of
persistent volumes. Select a storage class that provides high IOPS and
throughput suitable for database workloads: ideally at least 4000 IOPS
and 800 MB/s of sustained throughput.
storageLimit controls the maximum disk content before garbage
collection is triggered. This value should be 30-60GB smaller than the
volume size to provide headroom for filesystem operations and prevent
disk exhaustion. Filling a volume will result in IO errors and
instability. When disk usage reaches 60% of this limit, garbage
collection begins automatically. Smaller volumes will trigger garbage
collection more frequently, which can reduce performance.
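The headroom and garbage-collection guidance above can be checked with a small helper. The function name is illustrative, and Gi/G units are treated as approximately interchangeable for this sanity check:

```python
def check_storage_limit(volume_gib: int, storage_limit_gib: int) -> bool:
    """Validate the storageLimit guidance: 30-60 GB of headroom below
    the volume size. Also report the usage level at which garbage
    collection begins (60% of the limit)."""
    headroom = volume_gib - storage_limit_gib
    gc_trigger = 0.6 * storage_limit_gib
    print(f"headroom={headroom}G, GC starts near {gc_trigger:.0f}G used")
    return 30 <= headroom <= 60


# Base configuration: 330Gi volume, 300G limit -> 30G headroom, GC near 180G
print(check_storage_limit(330, 300))  # True
```

A limit far below the volume size wastes capacity and triggers garbage collection more often; a limit too close to it risks the IO errors described above.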
Example Kubernetes StorageClass for AWS#
If a StorageClass named gp3 does not already exist in your cluster,
one can be created using the following configuration:
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3
provisioner: ebs.csi.aws.com
volumeBindingMode: WaitForFirstConsumer
parameters:
  type: gp3
  iops: "5000"
  throughput: "1000"
Apply this StorageClass to your cluster:
kubectl apply -f storageclass-gp3.yaml
4. Telemetry#
DDCS exports Prometheus metrics for monitoring cache performance, hit rates, and storage utilization. Collection of these metrics is important for diagnosing potential problems with DDCS performance and optimizing cache configuration.
monitoring:
  enabled: false
Metrics Configuration:
monitoring.enabled: When true, enables Prometheus metrics collection. Metrics are exposed via a Kubernetes service in Prometheus format.
Important
The ServiceMonitor CRD must be installed in the cluster for
ServiceMonitor resources to work.
Configuration Recommendations#
The following configurations provide complete values files for different cluster sizes. Each configuration includes all base settings optimized for the specified GPU count and bandwidth requirements.
image:
  pullSecrets:
    - name: ngc-container-pull
cluster:
  replicas: 2 # 2 nodes x 12.5 Gbps = 25 Gbps capacity
  selfAntiAffinity: false
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          preference:
            matchExpressions:
              - key: node-type
                operator: In
                values:
                  - compute
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 5
          podAffinityTerm:
            labelSelector:
              matchExpressions:
                - key: ddcs
                  operator: In
                  values:
                    - "kvnode"
            topologyKey: "kubernetes.io/hostname"
container:
  resources:
    #limits:
    #  memory: 32Gi
    requests:
      memory: 32Gi
storage:
  volume:
    size: 330Gi
    storageClassName: "gp3"
settings:
  storageLimit: 300G
  engine:
    sys.cache_size: "10G"
    sys.block_cache_size: "18G"
    cf.max_write_buffer_number: 128
    sys.increase_parallelism: 8
    db.max_background_jobs: 8
monitoring:
  enabled: false
image:
  pullSecrets:
    - name: ngc-container-pull
cluster:
  replicas: 3 # 3 nodes x 12.5 Gbps = 37.5 Gbps capacity
  selfAntiAffinity: false
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          preference:
            matchExpressions:
              - key: node-type
                operator: In
                values:
                  - compute
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 5
          podAffinityTerm:
            labelSelector:
              matchExpressions:
                - key: ddcs
                  operator: In
                  values:
                    - "kvnode"
            topologyKey: "kubernetes.io/hostname"
container:
  resources:
    #limits:
    #  memory: 32Gi
    requests:
      memory: 32Gi
storage:
  volume:
    size: 330Gi
    storageClassName: "gp3"
settings:
  storageLimit: 300G
  engine:
    sys.cache_size: "10G"
    sys.block_cache_size: "18G"
    cf.max_write_buffer_number: 128
    sys.increase_parallelism: 8
    db.max_background_jobs: 8
monitoring:
  enabled: false
image:
  pullSecrets:
    - name: ngc-container-pull
cluster:
  replicas: 6 # 6 nodes x 12.5 Gbps = 75 Gbps capacity
  selfAntiAffinity: false
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          preference:
            matchExpressions:
              - key: node-type
                operator: In
                values:
                  - compute
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 5
          podAffinityTerm:
            labelSelector:
              matchExpressions:
                - key: ddcs
                  operator: In
                  values:
                    - "kvnode"
            topologyKey: "kubernetes.io/hostname"
container:
  resources:
    #limits:
    #  memory: 32Gi
    requests:
      memory: 32Gi
storage:
  volume:
    size: 330Gi
    storageClassName: "gp3"
settings:
  storageLimit: 300G
  engine:
    sys.cache_size: "10G"
    sys.block_cache_size: "18G"
    cf.max_write_buffer_number: 128
    sys.increase_parallelism: 8
    db.max_background_jobs: 8
monitoring:
  enabled: false
Summary#
This guide covered the configuration options for DDCS, including scaling considerations, memory allocation, storage sizing, and monitoring setup. Proper configuration of these settings is essential for optimal DDCS performance in your self-hosted NVCF cluster.
Once you have prepared your values.yaml file with the appropriate
configuration, proceed to the deployment guide to
deploy DDCS using Helm.