Tutorial 9 - Runtime CPU/GPU Decision

The CPU/GPU data node creates various attributes for use in a CUDA-based GPU compute or a CPU-based compute, where the decision of which to use is made at runtime rather than compile time. A few representative types are used, though the list of potential attribute types is not exhaustive. See omni.graph.docs.ogn_attribute_types for the full list.

OgnTutorialCpuGpuData.ogn

The ogn file shows the implementation of a node named “omni.graph.tutorials.CpuGpuData”, which has attributes whose memory type is determined at runtime by the input named isGPU. The algorithm of the node is implemented in CUDA, but in such a way that it can run on either the CPU or the GPU, depending on where the attribute data lives.

 1{
 2    "CpuGpuData" : {
 3        "version": 1,
 4        "categories": "tutorials",
 5        "memoryType": "any",
 6        "description": [
 7            "This is a tutorial node. It illustrates how to access data whose memory location, CPU or GPU, is ",
 8            "determined at runtime in the compute method. The data types are the same as for the purely CPU and ",
 9            "purely GPU tutorials, it is only the access method that changes. The input 'is_gpu' determines where ",
10            "the data of the other attributes can be accessed."
11        ],
12        "metadata":
13        {
14            "uiName": "Tutorial Node: Attributes With CPU/GPU Data"
15        },
16        "inputs": {
17            "is_gpu": {
18                "type": "bool",
19                "memoryType": "cpu",
20                "description": ["Runtime switch determining where the data for the other attributes lives."],
21                "default": false
22            },
23            "a": {
24                "type": "float",
25                "description": ["First value to be added in algorithm 1"],
26                "default": 0.0
27            },
28            "b": {
29                "type": "float",
30                "description": ["Second value to be added in algorithm 1"],
31                "default": 0.0
32            },
33            "points": {
34                "type": "float[3][]",
35                "description": ["Points to be moved by algorithm 2"],
36                "default": []
37            },
38            "multiplier": {
39                "type": "float[3]",
40                "description": ["Amplitude of the expansion for the input points in algorithm 2"],
41                "default": [1.0, 1.0, 1.0]
42            }
43        },
44        "outputs": {
45            "sum": {
46                "type": "float",
47                "description": ["Sum of the two inputs from algorithm 1"]
48            },
49            "points": {
50                "type": "float[3][]",
51                "description": ["Final positions of points from algorithm 2"]
52            }
53        },
54        "tests": [
55            { "inputs:is_gpu": false, "inputs:a": 1.0, "inputs:b": 2.0, "outputs:sum": 3.0 },
56            { "inputs:is_gpu": false, "inputs:a": 5.0, "inputs:b": 3.0, "outputs:sum": 8.0 },
57            {
58                "inputs:is_gpu": false,
59                "inputs:points": [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0]],
60                "inputs:multiplier": [2.0, 3.0, 4.0],
61                "outputs:points": [[2.0, 6.0, 12.0], [4.0, 9.0, 16.0]]
62            },
63            {
64                "inputs:is_gpu": true,
65                "inputs:a": 1.0, "inputs:b": 2.0,
66                "inputs:points": [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0]],
67                "inputs:multiplier": [2.0, 3.0, 4.0]
68            }
69        ],
70        "$why_disabled": "GPU extraction is not yet implemented so for those tests only inputs are provided",
71        "$tests_disabled_until_gpu_data_extraction_works": [
72            { "inputs:is_gpu": true, "inputs:a": 1.0, "inputs:b": 2.0, "outputs:sum": 3.0, "gpu": ["outputs:sum"] },
73            { "inputs:is_gpu": true, "inputs:a": 5.0, "inputs:b": 3.0, "outputs:sum": 8.0, "gpu": ["outputs:sum"] },
74            {
75                "inputs:is_gpu": true,
76                "inputs:points": [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0]],
77                "inputs:multiplier": [2.0, 3.0, 4.0],
78                "outputs:points": [[2.0, 6.0, 12.0], [4.0, 9.0, 16.0]],
79                "gpu": ["outputs:points"]
80            }
81        ]
82    }
83}

OgnTutorialCpuGpuData.cpp

The cpp file contains the implementation of the compute method, which checks the value of the isGPU attribute and then extracts the data of the specified type to pass to the algorithm in the .cu file.

 1// SPDX-FileCopyrightText: Copyright (c) 2020-2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 2// SPDX-License-Identifier: LicenseRef-NvidiaProprietary
 3//
 4// NVIDIA CORPORATION, its affiliates and licensors retain all intellectual
 5// property and proprietary rights in and to this material, related
 6// documentation and any modifications thereto. Any use, reproduction,
 7// disclosure or distribution of this material and related documentation
 8// without an express license agreement from NVIDIA CORPORATION or
 9// its affiliates is strictly prohibited.
10#include <OgnTutorialCpuGpuDataDatabase.h>
11
12// This helper file keeps the GPU and CPU functions organized for easy access.
13// It's only for illustration purposes; you can choose to organize your functions any way you wish.
14
15// The first algorithm is a simple add, to illustrate passing non-array types.
16extern "C" void cpuGpuAddCPU(outputs::sum_t_cpu, inputs::a_t_cpu, inputs::b_t_cpu, size_t);
17extern "C" void cpuGpuAddGPU(outputs::sum_t_gpu, inputs::a_t_gpu, inputs::b_t_gpu);
18
19// The second algorithm shows the more common case of accessing array data.
20extern "C" void cpuGpuMultiplierCPU(outputs::points_t_gpu, inputs::multiplier_t_gpu, inputs::points_t_gpu, size_t);
21extern "C" void cpuGpuMultiplierGPU(outputs::points_t_gpu, inputs::multiplier_t_gpu, inputs::points_t_gpu, size_t);
22
23class OgnTutorialCpuGpuData
24{
25public:
26    static bool compute(OgnTutorialCpuGpuDataDatabase& db)
27    {
28        // Computing the size of the output is independent of CPU/GPU as the size update is lazy and will happen
29        // when the data is requested from the fabric.
30        size_t numberOfPoints = db.inputs.points.size();
31        db.outputs.points.resize(numberOfPoints);
32
33        // Use the runtime input to determine where the calculation should take place.
34        //
35        // Another helpful use of this technique is to first measure the size of data being evaluated and
36        // if it meets a certain minimum threshold move the calculation to the GPU.
37        if (db.inputs.is_gpu())
38        {
39            cpuGpuAddGPU(db.outputs.sum.gpu(), db.inputs.a.gpu(), db.inputs.b.gpu());
40
41            // GPU data is just raw pointers so the size must be passed down to the algorithm
42            cpuGpuMultiplierGPU(
43                db.outputs.points.gpu(), db.inputs.multiplier.gpu(), db.inputs.points.gpu(), numberOfPoints);
44        }
45        else
46        {
47            // CPU version calls the shared CPU/GPU algorithm directly. This could also be implemented in the
48            // CUDA file as an extern "C" function but it is so simple that this is easier.
49            cpuGpuAddCPU(db.outputs.sum.cpu(), db.inputs.a.cpu(), db.inputs.b.cpu(), 0);
50
51            // Extract the CPU data and iterate over all of the points, calling the shared deformation algorithm.
52            // The data is passed through as raw CUDA-compatible bytes so that the deformation algorithm can live
53            // all in one file. If the code were all on the CPU side then the data types could be used as regular
54            // arrays, as showing in the tutorial on array data types.
55            auto rawOutputPoints = db.outputs.points.cpu().data();
56            auto rawInputPoints = db.inputs.points.cpu().data();
57            cpuGpuMultiplierCPU(&rawOutputPoints, &db.inputs.multiplier.cpu(), &rawInputPoints, numberOfPoints);
58        }
59
60        return true;
61    }
62};
63
64REGISTER_OGN_NODE()

OgnTutorialCpuGpuData_CUDA.cu

The cu file contains the implementation of the deformation on the CPU and GPU using CUDA.

  1// Copyright (c) 2020, NVIDIA CORPORATION.  All rights reserved.
  2//
  3// NVIDIA CORPORATION and its licensors retain all intellectual property
  4// and proprietary rights in and to this software, related documentation
  5// and any modifications thereto.  Any use, reproduction, disclosure or
  6// distribution of this software and related documentation without an express
  7// license agreement from NVIDIA CORPORATION is strictly prohibited.
  8//
  9#include <OgnTutorialCpuGpuDataDatabase.h>
 10
 11// ======================================================================
 12// This set of methods is one way of efficiently implementing an algorithm to run on either CPU or GPU:
 13//     xxxCPU() - Looping over all inputs to call the node algorithm
 14//     xxxGPU() - Host function to launch the kernel to run the node algorithm
 15//     xxxCUDA() - Run the algorithm on the block of data in the kernel
 16//     xxx() - Implementation of the actual node algorithm
 17//
 18// This minimizes the code duplication by keeping the node algorithm in a shared location. The tradeoff is that
 19// it is restricted to data types that CUDA understands, though for practical purposes since the algorithm must
 20// run on the GPU this isn't restrictive at all.
 21//
 22// One thing to be careful of is when you are passing arrays of CPU data to a GPU function. Data can only be passed
 23// from the CPU to the GPU by value, which for simple types is the default since the data type will be something
 24// like "float", not "float&" or "float*". However when the CPU data is an array it has to first be copied to the
 25// GPU using something like cudaMalloc/cudaMemcpy/cudaFree. While possible, this is very inefficient. It is better
 26// to specify that the attribute be always on the gpu, or that the decision is made at runtime, so that the
 27// GPU conversion can be handled automatically.
 28
 29// ======================================================================
 30// Algorithm 1 support - add two values together (something so trivial you would never use the GPU for it; here for
 31//                       illustrative purposes only)
 32//
 33namespace
 34{
 35__device__ __host__
 36void cpuGpuAdd(float* sum, float const* a, float const* b, size_t i)
 37{
 38    sum[i] = a[i] + b[i];
 39}
 40
 41// CUDA kernel that runs the addition algorithm on the block value
 42__global__
 43void cpuGpuAddCUDA(outputs::sum_t sum, inputs::a_t a, inputs::b_t b)
 44{
 45    // Make sure the current evaluation block is in range of the available points
 46    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
 47    if (1 < i)
 48        return;
 49
 50    cpuGpuAdd(sum, a, b, i);
 51}
 52}
 53
 54// GPU kernel launcher
 55extern "C"
 56void cpuGpuAddGPU(outputs::sum_t sum, inputs::a_t a, inputs::b_t b)
 57{
 58    // Launch the GPU deformation using the minimum number of threads and blocks
 59    cpuGpuAddCUDA<<<1, 1>>>(sum, a, b);
 60}
 61
 62// CPU version of the algorithm
 63extern "C"
 64void cpuGpuAddCPU(float& sum, float const& a, float const& b, size_t i)
 65{
 66    cpuGpuAdd(&sum, &a, &b, i);
 67}
 68
 69// ======================================================================
 70// Algorithm 2 support - multiply every point in an array by a constant vector
 71//
 72namespace
 73{
 74// The shared algorithm applies the multiplier to the current point
 75extern "C"
 76__device__ __host__
 77void cpuGpuMultiplier(
 78    outputs::points_t outputPoints,
 79    inputs::multiplier_t multiplier,
 80    inputs::points_t inputPoints,
 81    size_t currentPointIndex)
 82{
 83    (*outputPoints)[currentPointIndex].x = (*inputPoints)[currentPointIndex].x * multiplier->x;
 84    (*outputPoints)[currentPointIndex].y = (*inputPoints)[currentPointIndex].y * multiplier->y;
 85    (*outputPoints)[currentPointIndex].z = (*inputPoints)[currentPointIndex].z * multiplier->z;
 86}
 87
 88// The CUDA kernel applies the deformation algorithm to the point under its control
 89__global__
 90void cpuGpuMultiplierCUDA(
 91    outputs::points_t outputPoints,
 92    inputs::multiplier_t multiplier,
 93    inputs::points_t inputPoints,
 94    size_t numberOfPoints)
 95{
 96    // Make sure the current evaluation block is in range of the available points
 97    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
 98    if (numberOfPoints <= i)
 99        return;
100
101    cpuGpuMultiplier(outputPoints, multiplier, inputPoints, i);
102}
103}
104
105// The GPU algorithm launches a CUDA kernel on the set of points for deformation
106extern "C"
107void cpuGpuMultiplierGPU(
108    outputs::points_t outputPoints,
109    inputs::multiplier_t multiplier,
110    inputs::points_t inputPoints,
111    size_t numberOfPoints)
112{
113    // Split the work into 256 threads, an arbitrary number that could be more precisely tuned when necessary
114    const int numberOfThreads = 256;
115    // Block size is the number of points that fit into each of the threads
116    const int numberOfBlocks = (numberOfPoints + numberOfThreads - 1) / numberOfThreads;
117
118    // Launch the GPU deformation using the calculated number of threads and blocks
119    cpuGpuMultiplierCUDA<<<numberOfBlocks, numberOfThreads>>>(outputPoints, multiplier, inputPoints, numberOfPoints);
120}
121
122// The CPU algorithm just loops over the list of points, applying the shared deformation to each point
123extern "C"
124void cpuGpuMultiplierCPU(
125    outputs::points_t outputPoints,
126    inputs::multiplier_t multiplier,
127    inputs::points_t inputPoints,
128    size_t numberOfPoints)
129{
130    for (size_t i = 0; i < numberOfPoints; i++)
131    {
132        cpuGpuMultiplier(outputPoints, multiplier, inputPoints, i);
133    }
134}

CPU/GPU Attribute Access

Here is how the attribute values are returned from the database. Up until now the attribute name has sufficed as the database member that accesses the value through its operator(). The addition of the runtime switch of memory locations is facilitated by the addition of the gpu() and cpu() members.

CPU Function

CPU Type

GPU Function

GPU Type

CUDA Type

inputs.a.cpu()

const float&

inputs.a_t_gpu

const float*

const float*

inputs.b.cpu()

const float&

inputs.b.gpu()

const float*

const float*

outputs.sum.cpu()

float&

outputs.sum.gpu()

float*

float*

inputs.multiplier.cpu()

const GfVec3f&

inputs.multiplier.gpu()

const GfVec3f*

const float3

inputs.points.cpu()

const GfVec3f*

inputs.points.gpu()

const GfVec3f**

const float3**

outputs.points.cpu()

GfVec3f*

outputs.points.gpu()

const GfVec3f**

float3**

Type Information

As there are three different potential types for each attribute when it varies location at runtime (CPU, CPU being passed to GPU, and GPU) there are extra types introduced in order to handle each of them. The CUDA types are handled as before, but on the CPU side there are extra types for the data being passed from the CPU to the GPU.

CPU Type Method

CPU Data Type

GPU Type Method

GPU Data Type

inputs::a_t

const float&

inputs::a_t_gpu

const float*

inputs::b_t

const float&

inputs::b_t_gpu

const float*

outputs::sum_t

float&

outputs::sum_t_gpu

float*

inputs::multiplier_t

const GfVec3f&

inputs::multiplier_t_gpu

const GfVec3f*

inputs::points_t

const GfVec3f*

inputs::points_t_gpu

const GfVec3f**

outputs::points_t

GfVec3f*

outputs::points_t_gpu

const GfVec3f**

On the C++ side the functions defined in the CUDA file are declared as:

extern "C" void cpuGpuMultiplierCPU(outputs::points_t, inputs::multiplier_t, inputs::points_t, size_t);
extern "C" void cpuGpuMultiplierGPU(outputs::points_t_gpu, inputs::multiplier_t_gpu, inputs::points_t_gpu, size_t);

The addition of the _gpu suffix mostly adds an extra layer of indirection to the values, since they exist in the GPU memory namespace. Care must be taken to call the correct version with the correctly extracted data:

if (db.inputs.is_gpu())
{
    cpuGpuMultiplierGPU(
        db.outputs.points.gpu(),
        db.inputs.multiplier.gpu(),
        db.inputs.points.gpu(),
        numberOfPoints
    );
}
else
{
    // Note how array data is extracted in its raw form for passing to the function on the CUDA side.
    // This would be unnecessary if the implementation were entirely on the CPU side.
    cpuGpuMultiplierCPU(
        db.outputs.points.cpu().data(),
        db.inputs.multiplier.cpu(),
        db.inputs.points.cpu().data(),
        numberOfPoints
    );
}

On the CUDA side the function definitions use the existing CUDA types, so their signatures are:

extern "C" void cpuGpuMultiplierCPU(outputs::points_t, inputs::multiplier_t, inputs::points_t, size_t);
extern "C" void cpuGpuMultiplierGPU(outputs::points_t, inputs::multiplier_t, inputs::points_t, size_t);