DDCS: Configure#

This guide provides detailed information on the Derived Data Cache Service (DDCS) component and its use in a self-hosted NVCF cluster.

DDCS reduces scene load time and improves performance when properly configured and sized for the workload. Derived data generation is computationally expensive and time-consuming. DDCS trades network bandwidth for compute time by caching derived data, allowing multiple GPUs to share pre-generated content.

Derived data is often many times the size of the source content. During initial scene loads, GPUs generate the most derived content as they encounter assets for the first time. While DDCS may add some time to cold scene loads (as content must be generated and written to DDCS synchronously), the generated data becomes immediately available to other GPUs. During “warm” scene loads, render workers read all derived content from DDCS instead of regenerating it, significantly reducing load times.

Base Configuration#

DDCS requires some configuration before it can be installed. Create a file on your local machine called values.yaml. A base configuration is provided below.

image:
  pullSecrets:
    - name: ngc-container-pull

cluster:
  replicas: 1
  selfAntiAffinity: false
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          preference:
            matchExpressions:
              - key: node-type
                operator: In
                values:
                  - compute
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 5
          podAffinityTerm:
            labelSelector:
              matchExpressions:
                - key: ddcs
                  operator: In
                  values:
                    - "kvnode"
            topologyKey: "kubernetes.io/hostname"

  container:
    resources:
      #limits:
      #  memory: 32Gi
      requests:
        memory: 32Gi

    storage:
      volume:
        size: 330Gi
        storageClassName: "gp3"

    settings:
      storageLimit: 300G
      engine:
        sys.cache_size: "10G"
        sys.block_cache_size: "18G"
        cf.max_write_buffer_number: 128
        sys.increase_parallelism: 8
        db.max_background_jobs: 8

    monitoring:
      enabled: false

Complete Configuration Reference#

The base configuration above covers the essential settings for most deployments. For advanced configuration options or to explore all available settings, refer to the complete values file below. This reference includes all configuration options available in the DDCS Helm chart, including advanced settings for TLS, OpenTelemetry, and resource management.

1. Provision and Scale#

Proper provisioning and scaling are critical for DDCS performance. If DDCS is undersized or misconfigured for the workload, simulations will slow down or even fail. Ensure adequate network bandwidth and storage capacity based on your GPU count and scene complexity.

This guide assumes a “CPU/GPU” node split: workloads that do not require a GPU are scheduled on Kubernetes nodes without GPUs (CPU nodes), and workloads that do require a GPU are scheduled on Kubernetes nodes with GPUs (GPU nodes). The base configuration’s nodeAffinity prefers nodes labeled node-type: compute, so ensure your CPU nodes carry that label.

As a baseline, plan for ~3.3 Gbps of cache bandwidth for each GPU in the cluster.

AWS EKS Example#

Assume the SKUs for the CPU and GPU nodes are m5.8xlarge and g6e.4xlarge, respectively.

| SKU | vCPU | RAM | NIC | GPU |
|-----|------|-----|-----|-----|
| m5.8xlarge (compute) | 32 | 128 G | 10 Gbps | |
| g6e.4xlarge (gpu) | 16 | 128 G | 20 Gbps | 1 L40S |

Note

If a cluster is provisioned with 25 GPUs, approximately 83 Gbps (3.3 Gbps x 25 GPUs) of matching bandwidth is necessary to facilitate caching.

Provision 9 compute nodes with 10 Gbps NICs and create 9 DDCS replicas (9 x 10 Gbps = 90 Gbps of aggregate cache bandwidth).

A single DDCS pod should be placed on each compute node. The goal is to add network bandwidth with each pod, distributing cache load across multiple nodes. If 9 compute nodes were provisioned for caching, schedule one DDCS pod on each node.

Important

cluster:
  replicas: 9
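
If one pod per node must be strictly enforced rather than merely preferred, the weighted pod anti-affinity rule from the base configuration can be replaced with a required rule. The following is a minimal sketch, assuming the chart passes cluster.affinity through to the pod spec unchanged and that DDCS pods carry the ddcs: kvnode label used in the base configuration:

cluster:
  affinity:
    podAntiAffinity:
      # Hard requirement: never co-schedule two DDCS pods on the same node
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchExpressions:
              - key: ddcs
                operator: In
                values:
                  - "kvnode"
          topologyKey: "kubernetes.io/hostname"

With a required rule, any replicas beyond the number of eligible compute nodes will remain unschedulable (Pending), so keep the replica count at or below the compute node count.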

2. Memory Configuration#

DDCS memory configuration includes Kubernetes resource limits and requests, along with engine-level cache settings. The base configuration allocates memory for the Point Cache, Block Cache, and Write Buffers.

Here are some common configurations for various CPU and RAM allocations.

4 vCPU x 16GB RAM#

container:
  resources:
    #limits:
    #  memory: 16Gi
    requests:
      memory: 16Gi

  settings:
    engine:
      sys.cache_size: "4G"
      sys.block_cache_size: "8G"
      cf.max_write_buffer_number: 64
      sys.increase_parallelism: 4
      db.max_background_jobs: 4

8 vCPU x 32GB RAM#

container:
  resources:
    #limits:
    #  memory: 32Gi
    requests:
      memory: 32Gi

  settings:
    engine:
      sys.cache_size: "10G"
      sys.block_cache_size: "18G"
      cf.max_write_buffer_number: 128
      sys.increase_parallelism: 8
      db.max_background_jobs: 8

12 vCPU x 64GB RAM#

container:
  resources:
    #limits:
    #  memory: 64Gi
    requests:
      memory: 64Gi

  settings:
    engine:
      sys.cache_size: "20G"
      sys.block_cache_size: "36G"
      cf.max_write_buffer_number: 128
      sys.increase_parallelism: 12
      db.max_background_jobs: 12

Tip

It is recommended to use one of the configurations above. They determine how memory is allocated by the cache, which is controlled with the following settings:

  • Point Cache (sys.cache_size): The first cache level for serving cached content. Set to 10G in the base configuration.

  • Block Cache (sys.block_cache_size): The second cache level used by RocksDB for metadata, filters, and file blocks. Set to 18G in the base configuration.

  • Write Buffers (cf.max_write_buffer_number): In-memory buffers that absorb write bursts before flushing to disk. Each buffer consumes 64MB. The base configuration sets this to 128 buffers (approximately 8GB total capacity).

3. Volume Configuration#

Reading from persistent storage, even a network-attached volume, is often faster than regenerating derived data. Therefore, DDCS persists content to a Kubernetes volume.

storage:
  volume:
    size: 330Gi
    storageClassName: "gp3"

size determines the volume size that will be created and attached to each pod. Depending on the cloud environment, volume size may also control the IOPS and throughput of the volume. Refer to your cloud provider’s documentation for specific performance characteristics.

storageClassName determines the performance characteristics of persistent volumes. Select a storage class that provides high IOPS and throughput suitable for database workloads. Ideal conditions are at least 4000 IOPS and greater than 800 MB/s of sustained throughput.

storageLimit controls the maximum amount of cached content on disk before garbage collection is triggered. This value should be 30-60GB smaller than the volume size to provide headroom for filesystem operations and prevent disk exhaustion. Filling a volume will result in IO errors and instability. When disk usage reaches 60% of this limit, garbage collection begins automatically. Smaller volumes will trigger garbage collection more frequently, which can reduce performance.
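
As an illustration of that headroom rule, a larger volume can be paired with a proportionally larger limit. The values below are a hypothetical example of the same settings keys used in the base configuration, not a recommendation for any particular workload:

container:
  storage:
    volume:
      size: 500Gi              # volume provisioned and attached to each DDCS pod
      storageClassName: "gp3"

  settings:
    storageLimit: 450G         # ~50G of headroom below the volume size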

Example Kubernetes StorageClass for AWS#

If a StorageClass named gp3 does not already exist in your cluster, one can be created using the following configuration:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3
provisioner: kubernetes.io/aws-ebs
volumeBindingMode: WaitForFirstConsumer
parameters:
  type: gp3
  iops: "5000"
  throughput: "1000"

Apply this StorageClass to your cluster:

kubectl apply -f storageclass-gp3.yaml

4. Telemetry#

DDCS exports Prometheus metrics for monitoring cache performance, hit rates, and storage utilization. Collection of these metrics is important for diagnosing potential problems with DDCS performance and optimizing cache configuration.

monitoring:
  enabled: false

Metrics Configuration:

  • monitoring.enabled: When true, enables Prometheus metrics collection. Metrics are exposed via a Kubernetes service in Prometheus format.

Important

The ServiceMonitor CRD (typically installed with the Prometheus Operator) must be present in the cluster for ServiceMonitor resources to work.
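
To turn metrics collection on, set the flag in values.yaml. This fragment uses the same container-level layout as the base configuration and should be merged with your full values file rather than applied on its own:

container:
  monitoring:
    enabled: true   # expose Prometheus metrics via a Kubernetes service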

Configuration Recommendations#

The following configurations provide complete values files for different cluster sizes. Each configuration includes all base settings optimized for the specified GPU count and bandwidth requirements.

2 Replicas (25 Gbps Cache Bandwidth)#

image:
  pullSecrets:
    - name: ngc-container-pull

cluster:
  replicas: 2  # 2 nodes x 12.5 Gbps = 25 Gbps capacity
  selfAntiAffinity: false
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          preference:
            matchExpressions:
              - key: node-type
                operator: In
                values:
                  - compute
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 5
          podAffinityTerm:
            labelSelector:
              matchExpressions:
                - key: ddcs
                  operator: In
                  values:
                    - "kvnode"
            topologyKey: "kubernetes.io/hostname"

  container:
    resources:
      #limits:
      #  memory: 32Gi
      requests:
        memory: 32Gi

    storage:
      volume:
        size: 330Gi
        storageClassName: "gp3"

    settings:
      storageLimit: 300G
      engine:
        sys.cache_size: "10G"
        sys.block_cache_size: "18G"
        cf.max_write_buffer_number: 128
        sys.increase_parallelism: 8
        db.max_background_jobs: 8

    monitoring:
      enabled: false

3 Replicas (37.5 Gbps Cache Bandwidth)#

image:
  pullSecrets:
    - name: ngc-container-pull

cluster:
  replicas: 3  # 3 nodes x 12.5 Gbps = 37.5 Gbps capacity
  selfAntiAffinity: false
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          preference:
            matchExpressions:
              - key: node-type
                operator: In
                values:
                  - compute
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 5
          podAffinityTerm:
            labelSelector:
              matchExpressions:
                - key: ddcs
                  operator: In
                  values:
                    - "kvnode"
            topologyKey: "kubernetes.io/hostname"

  container:
    resources:
      #limits:
      #  memory: 32Gi
      requests:
        memory: 32Gi

    storage:
      volume:
        size: 330Gi
        storageClassName: "gp3"

    settings:
      storageLimit: 300G
      engine:
        sys.cache_size: "10G"
        sys.block_cache_size: "18G"
        cf.max_write_buffer_number: 128
        sys.increase_parallelism: 8
        db.max_background_jobs: 8

    monitoring:
      enabled: false

6 Replicas (75 Gbps Cache Bandwidth)#

image:
  pullSecrets:
    - name: ngc-container-pull

cluster:
  replicas: 6  # 6 nodes x 12.5 Gbps = 75 Gbps capacity
  selfAntiAffinity: false
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          preference:
            matchExpressions:
              - key: node-type
                operator: In
                values:
                  - compute
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 5
          podAffinityTerm:
            labelSelector:
              matchExpressions:
                - key: ddcs
                  operator: In
                  values:
                    - "kvnode"
            topologyKey: "kubernetes.io/hostname"

  container:
    resources:
      #limits:
      #  memory: 32Gi
      requests:
        memory: 32Gi

    storage:
      volume:
        size: 330Gi
        storageClassName: "gp3"

    settings:
      storageLimit: 300G
      engine:
        sys.cache_size: "10G"
        sys.block_cache_size: "18G"
        cf.max_write_buffer_number: 128
        sys.increase_parallelism: 8
        db.max_background_jobs: 8

    monitoring:
      enabled: false

Summary#

This guide covered the configuration options for DDCS, including scaling considerations, memory allocation, storage sizing, and monitoring setup. Proper configuration of these settings is essential for optimal DDCS performance in your self-hosted NVCF cluster.

Once you have prepared your values.yaml file with the appropriate configuration, proceed to the deployment guide to deploy DDCS using Helm.