# Inference API Extensions

## Overview

The Inference API extensions provide a common interface and set of APIs to run inference in machine learning frameworks. Having a shared interface to inference enables developers to rapidly iterate on extensions, easily exploring and changing among different inference frameworks. These extensions can be used on their own by any developer working in Omniverse. We use them across the AI Toybox to simplify development.

Core features of the API include:

- A common interface for running model inference and re-loading weights.
- Synchronous or asynchronous inference.
- Inference in the main kit thread or in a separate, isolated thread. Threaded inference is helpful when managing CUDA context conflicts among Omniverse extensions.

Frameworks we currently support:

- PyTorch (https://pytorch.org/)
- ONNX (https://onnxruntime.ai/)
- TensorRT (https://developer.nvidia.com/tensorrt)

## Installation

The inference API is intended to be used by developers authoring kit extensions.

We plan to regularly release support for new versions of these backends. To use one of the extensions, add a tagged version of one of the following kit extensions as a dependency (LINK to explain dependency):

- PyTorch (`omni.inference.torch`)
  - Current LTS (`1.8.2`): `"omni.inference.torch" = {tag="lts"}`
  - `1.11.0`: `"omni.inference.torch" = {tag="1_11_0"}`
  - `1.12.0`: `"omni.inference.torch" = {tag="1_12_0"}`
- ONNX (`omni.inference.onnx`)
  - `1.11.1`: `"omni.inference.onnx" = {tag="1_11_1"}`
- TensorRT (`omni.inference.tensorrt`)
  - `8.2.5`: `"omni.inference.tensorrt" = {tag="8_2_5"}`
  - `8.4.2`: `"omni.inference.tensorrt" = {tag="8_4_2"}`
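
For context, kit extensions usually declare their dependencies in the `[dependencies]` table of the extension's `extension.toml`. The fragment below is a minimal sketch under that assumption; the file location in the comment is assumed, and the two tags are simply picked from the list above.

```toml
# config/extension.toml of your own extension (location assumed)
[dependencies]
"omni.inference.torch" = {tag="lts"}
"omni.inference.tensorrt" = {tag="8_4_2"}
```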

## Using the Inference API

Setting up a model and running inference requires two calls, one to register the model and one to run inference with a given data payload. Many backends support loading a fresh set of model parameters into an existing model as well.
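
The two-call pattern can be pictured as a small shared interface. The sketch below is illustrative only: `Backend`, `Model`, `ScaleBackend`, `ScaleModel`, and `run` are hypothetical names, and this is not the extensions' actual class hierarchy. It shows why caller code written against `register_model` and `infer` need not change when the backend does.

```python
from abc import ABC, abstractmethod
from typing import Dict, List

# Hypothetical payload type; real backends exchange arrays or GPU buffers.
Payload = Dict[str, List[float]]


class Model(ABC):
    """Assumed minimal model interface: run inference, optionally reload weights."""

    @abstractmethod
    def infer(self, model_input: Payload) -> Payload:
        ...

    def load_weights(self, weights) -> None:
        # Not every backend supports reloading, so the default refuses.
        raise NotImplementedError("this backend does not support weight reloading")


class Backend(ABC):
    """Assumed minimal backend interface: turn a model source into a Model."""

    @abstractmethod
    def register_model(self, source, **options) -> Model:
        ...


class ScaleModel(Model):
    """Toy model that multiplies every input value by a single weight."""

    def __init__(self, scale: float) -> None:
        self.scale = scale

    def infer(self, model_input: Payload) -> Payload:
        return {"output": [self.scale * x for x in model_input["input"]]}

    def load_weights(self, weights) -> None:
        self.scale = float(weights["scale"])


class ScaleBackend(Backend):
    def register_model(self, source, **options) -> Model:
        # `source` stands in for a model file or class; here it is just the weight.
        return ScaleModel(float(source))


def run(backend: Backend, source, payload: Payload) -> Payload:
    # Caller code depends only on the shared interface, not on the backend type.
    model = backend.register_model(source)
    return model.infer(payload)
```

Swapping in a different `Backend` subclass leaves `run` untouched, which is the property a shared inference interface is meant to provide.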

All models may be instantiated in an isolated thread during model registration. Threads isolate the model from potential conflicts with other processes using CUDA devices during Omniverse execution.

The example below illustrates the basic inference setup and use.

```
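import numpy as np
# TensorRTBackend comes from the TensorRT inference extension; its import is omitted here
# 1) instantiate backend
trt_backend = TensorRTBackend()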
# 2) instantiate a model
model = trt_backend.register_model("example.onnx")
# run the model in an isolated thread (unused below)
model_threaded = trt_backend.register_model("example2.onnx", threaded=True)
# produce some 3-channel 256x256 images
batch_size = 16
input_data = np.random.uniform(size=(batch_size,3,256,256))
model_input = {"input": input_data}
# 3) request inference
model_output = model.infer(model_input)
```
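
The `threaded=True` option is described as isolating a model in its own thread. One way such isolation could be structured is a single dedicated worker thread that receives every inference request. The sketch below uses only the Python standard library; `ThreadedModel`, `infer_async`, and `toy_infer` are hypothetical names, and this is not the extensions' implementation.

```python
from concurrent.futures import Future, ThreadPoolExecutor
import threading


class ThreadedModel:
    """Hypothetical wrapper that runs every model call on one dedicated thread."""

    def __init__(self, infer_fn):
        self._infer_fn = infer_fn
        # A single worker means all inference calls share the same thread, so
        # thread-bound state (such as a GPU context) is never touched elsewhere.
        self._executor = ThreadPoolExecutor(max_workers=1)

    def infer_async(self, model_input) -> Future:
        # Asynchronous: returns immediately with a Future.
        return self._executor.submit(self._infer_fn, model_input)

    def infer(self, model_input):
        # Synchronous: block the caller until the worker thread finishes.
        return self.infer_async(model_input).result()

    def close(self) -> None:
        self._executor.shutdown(wait=True)


def toy_infer(model_input):
    # Stand-in for a real model; also reports which thread ran it.
    return {"output": sum(model_input["input"]), "thread": threading.get_ident()}
```

Wrapping a callable as `ThreadedModel(toy_infer)` gives both a blocking `infer` and a non-blocking `infer_async`, and every call runs on the same worker thread rather than on the caller's thread.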

### PyTorch

The PyTorch backend supports instantiating a torch model (`torch.nn.Module`) and loading weights from a serialized weight dictionary. Models can be used to perform inference and load new weights.

```
import numpy as np
from my_library import MyModel
# MyModel is a `torch.nn.Module`
# TorchBackend comes from the PyTorch inference extension; its import is omitted here
model_weights = "data/model/weights.pth"
model_weights_reload = "data/model/weights_better.pth"
model_device = "cuda:0"
# 1) instantiate backend
torch_backend = TorchBackend()
# 2) instantiate a model
model = torch_backend.register_model(MyModel, model_weights)
# specify a device to load the model on (unused below)
model_on_device = torch_backend.register_model(MyModel, model_weights, device=model_device)
# produce some 3-channel 256x256 images
batch_size = 16
input_data = np.random.uniform(size=(batch_size,3,256,256))
model_input = {"input": input_data}
# 3) request inference
model_output = model.infer(model_input)
# 4) load new weights
model.load_weights(model_weights_reload)
```

### ONNX

The ONNX backend supports creating an ONNX inference session from a serialized ONNX model file. Inference sessions may specify which devices to use, including CPUs and GPUs (CUDA devices), by passing a list of execution providers in order of preference. ONNX models do not currently support weight reloading.

```
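import numpy as np
# ONNXBackend comes from the ONNX inference extension; its import is omitted here
# 1) instantiate backend
onnx_backend = ONNXBackend()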
# 2) instantiate a model in a thread that prefers CUDA for execution
model = onnx_backend.register_model("example.onnx", ["CUDAExecutionProvider", "CPUExecutionProvider"], threaded=True)
# produce some 3-channel 256x256 images
batch_size = 16
input_data = np.random.uniform(size=(batch_size,3,256,256))
model_input = {"input": input_data}
# 3) request inference
model_output = model.infer(model_input)
```
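
The provider list in the example is ordered by preference: CUDA first, CPU as the fallback. The sketch below illustrates that fallback logic in plain Python. `resolve_providers` is a hypothetical helper written for this explanation, not part of the extension or of ONNX Runtime.

```python
from typing import List, Sequence


def resolve_providers(preferred: Sequence[str], available: Sequence[str]) -> List[str]:
    """Keep the preferred providers that are available, preserving preference order."""
    usable = [name for name in preferred if name in available]
    if not usable:
        raise RuntimeError(f"none of {list(preferred)} are available")
    return usable


preferred = ["CUDAExecutionProvider", "CPUExecutionProvider"]
# On a machine without a usable CUDA provider, only the CPU provider remains.
cpu_only = resolve_providers(preferred, ["CPUExecutionProvider"])
```

When working with ONNX Runtime directly, `onnxruntime.get_available_providers()` reports which providers the installed build offers.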

### TensorRT

The TensorRT backend supports building a TensorRT engine from a given ONNX serialized model file.

Key features of TensorRT models:

- Allow the model to reload weights using TensorRT weight refitting.
- Create a model with dynamic batch sizes.

Inference supports using data in existing GPU buffers or instantiating new GPU buffers at inference time. Key features of TensorRT inference:

- Keep data in existing provided GPU buffers.
- Instantiate new GPU buffers on request.
- Copy output data to CPU.

```
import numpy as np
import warp as wp
# TensorRTBackend comes from the TensorRT inference extension; its import is omitted here
# 1) instantiate backend
trt_backend = TensorRTBackend()
# 2) instantiate a model in a thread, allowing weight reloading
model = trt_backend.register_model("example.onnx", threaded=True, refittable=True)
# produce some 3-channel 256x256 images
batch_size = 16
input_data = np.random.uniform(size=(batch_size,3,256,256))
model_input = {"input": input_data}
# create GPU buffers using Warp
output_buffer = {"output": wp.empty(shape=(batch_size,5), dtype=wp.float32, device="cuda")}
# 3) request inference
# convert the CPU data (numpy) to GPU buffers, produce GPU buffers (not numpy)
model_output = model.infer(model_input, copy_input=True, output_numpy=False)
# write output into `output_buffer`
model.infer(model_input, output_buffer, copy_input=True)
# output_buffer now contains the model output
# 4) load new weights
better_weights = {"out_bias": 10 * np.ones([1, 1, 184560], dtype=np.float32)}
model.load_weights(better_weights)
```
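
On dynamic batch sizes: TensorRT engines with dynamic shapes are built against an optimization profile that gives minimum, optimum, and maximum dimensions for each input. The sketch below shows the kind of range check this implies for a batch of inputs. `BatchProfile` and its fields are hypothetical names used for illustration; how the extension exposes or configures profiles is not covered here.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class BatchProfile:
    """Hypothetical description of the batch sizes an engine was built for."""

    min_batch: int
    opt_batch: int
    max_batch: int
    sample_shape: Tuple[int, ...]  # per-sample shape, e.g. (3, 256, 256)

    def accepts(self, input_shape: Tuple[int, ...]) -> bool:
        # The leading dimension is the batch size; the rest must match exactly.
        batch, *rest = input_shape
        in_range = self.min_batch <= batch <= self.max_batch
        return in_range and tuple(rest) == self.sample_shape


profile = BatchProfile(min_batch=1, opt_batch=16, max_batch=32, sample_shape=(3, 256, 256))
fits = profile.accepts((16, 3, 256, 256))      # batch of 16 is within [1, 32]
too_large = profile.accepts((64, 3, 256, 256))  # batch of 64 exceeds the maximum
```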