10.7. Training Pose Estimation Model with Synthetic Data

10.7.1. Learning Objectives

50-60 min tutorial

10.7.1.1. Prerequisites

This tutorial requires a working knowledge of the Offline Pose Estimation tutorial.

This tutorial also requires a basic understanding of how to submit jobs on NGC using Base Command. Documentation for how to do so can be found here.

10.7.2. Generating Data on NGC

Generating data on NGC using the OVX clusters allows you to produce far more data than you could on your local machine. We use the OVX clusters for data generation because they are optimized for rendering jobs; for training, we use the DGX clusters, which are optimized for machine learning. Because generation and training run on different clusters, we automatically save the generated data to an S3 bucket, which we then load from during training.

10.7.2.1. Building Your Own Container for Data Generation

To build a container to run on NGC, we use a Dockerfile. Copy the contents below into a file called Dockerfile, and place it in standalone_examples/replicator/offline_pose_generation.

Dockerfile to create container for NGC
# See https://catalog.ngc.nvidia.com/orgs/nvidia/containers/isaac-sim
# for instructions on how to run this container
FROM nvcr.io/nvidia/isaac-sim:2023.1.1

RUN apt-get update && export DEBIAN_FRONTEND=noninteractive && apt-get install s3cmd -y

# Copies over latest changes to pose generation code when building the container
COPY ./ standalone_examples/replicator/offline_pose_generation

Any local changes you have made to offline_pose_generation.py and the other files in the standalone_examples/replicator/offline_pose_generation folder are copied into the container when you build it. This enables workflows that modify the existing files inside offline_pose_generation/ (e.g., generating data for a custom object by editing the config/ files). To build the container, run:

cd standalone_examples/replicator/offline_pose_generation
docker build -t NAME_OF_YOUR_CONTAINER:TAG .
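
Since NGC requires the naming format described in the next section, it can be convenient to build the image under that name from the start. A sketch with hypothetical org and team names:

# Hypothetical names: build the image already named for NGC's registry,
# so it can be pushed without re-tagging
docker build -t nvcr.io/my-org/my-team/offline-pose-generation:1.0 .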

10.7.2.2. Pushing Docker Container to NGC

To use this new container on NGC, we first have to push it. Pushing a container to NGC requires authentication; you can set that up by following this NGC guide.

When pushing a container to NGC, there is a specific naming format that must be followed. The name of the container must be nvcr.io/<ORGANIZATION_NAME>/<TEAM_NAME>/<CONTAINER_NAME>:<TAG>. For more details on pushing containers to NGC, see this guide.

docker push NAME_OF_YOUR_CONTAINER:TAG
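
If you built the image under a local name instead, re-tag it to the required NGC format before pushing. A sketch with hypothetical org and team names:

# Re-tag the locally built image to NGC's required naming format, then push
docker tag offline-pose-generation:1.0 nvcr.io/my-org/my-team/offline-pose-generation:1.0
docker push nvcr.io/my-org/my-team/offline-pose-generation:1.0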

10.7.2.3. Adding S3 Credentials to NGC Jobs

If you plan to read data from or write results to an S3 bucket, you need to add your credentials as part of the job definition. Unfortunately, NGC currently offers no good way to manage secrets, so this has to be done manually.

You can set up your credentials by prepending the commands below to the Run Command in your job definition. Make sure to fill in your credentials in the places marked.

# Credentials for boto3
mkdir -p ~/.aws
echo "[default]" >> ~/.aws/config
echo "aws_access_key_id = <YOUR_USER_NAME>" >> ~/.aws/config
echo "aws_secret_access_key = <YOUR_SECRET_KEY>" >> ~/.aws/config
# Credentials for s3cmd
echo "[default]" >> ~/.s3cfg
echo "use_https = True" >> ~/.s3cfg
echo "access_key = <YOUR_USER_NAME>" >> ~/.s3cfg
echo "secret_key = <YOUR_SECRET_KEY>" >> ~/.s3cfg
echo "bucket_location = us-east-1" >> ~/.s3cfg
echo "host_base = <YOUR_ENDPOINT>" >> ~/.s3cfg
echo "host_bucket = bucket-name" >> ~/.s3cfg

After pushing the container to NGC, select it when creating a job. You can use the following run command:

# ADD YOUR S3 CREDENTIALS HERE
# (see "Adding S3 Credentials to NGC Jobs" section above for more details)

# Run Pose Generation
./python.sh standalone_examples/replicator/offline_pose_generation/offline_pose_generation.py \
--use_s3 --endpoint https://YOUR_ENDPOINT --bucket OUTPUT_BUCKET --num_dome 1000 --num_mesh 1000 --writer DOPE \
--no-window
  • The --no-window flag runs Isaac Sim in headless mode and overrides any other setting that controls headless behavior. Without it, a config file containing "headless": false could cause an error, since an Isaac Sim window cannot be launched inside a Docker container.
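
Once the job finishes, you can confirm the dataset landed in the output bucket. This is an optional check, assuming the s3cmd credentials set up above:

# List the generated dataset in the output bucket
s3cmd ls s3://OUTPUT_BUCKET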

10.7.2.4. Things to Note

  1. In order to submit a job on the OVX clusters, it must be made preemptable. To do this, select Resumable under Preemption Options when creating the job (or submit via the NGC CLI, as sketched below).
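
If you prefer the NGC CLI to the web UI, a submission might look like the sketch below. The instance name is a placeholder and the exact flag set may vary with your CLI version; prepend the S3 credential commands to --commandline as described above.

# A sketch of submitting the generation job with the NGC CLI (names are placeholders)
ngc batch run \
  --name "offline-pose-generation" \
  --image "nvcr.io/<ORGANIZATION_NAME>/<TEAM_NAME>/<CONTAINER_NAME>:<TAG>" \
  --instance <OVX_INSTANCE_TYPE> \
  --result /results \
  --preempt RESUMABLE \
  --commandline "./python.sh standalone_examples/replicator/offline_pose_generation/offline_pose_generation.py --use_s3 --endpoint https://YOUR_ENDPOINT --bucket OUTPUT_BUCKET --num_dome 1000 --num_mesh 1000 --writer DOPE --no-window"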

10.7.3. Training, Inference, and Evaluation

10.7.3.1. Running Locally

To run the training, inference, and evaluation scripts locally, clone the Dope Training Repo and follow the instructions in the README.md file within the repo.
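
For reference, a minimal local setup might look like the following; the repository URL is a placeholder for the repo linked above, and the requirements file is an assumption, so treat the repo's README.md as authoritative:

# Placeholder URL: use the Dope Training Repo linked above
git clone <DOPE_TRAINING_REPO_URL> dope_training
cd dope_training
# Install dependencies (assumes the repo provides a requirements file)
pip install -r requirements.txt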

10.7.3.2. Running on NGC

NGC offers users the ability to scale their training jobs. Since DOPE needs to be trained separately for each class of object, NGC is extremely helpful in enabling multiple models to be trained at once. Furthermore, it reduces the time needed to train models by providing the option to run multi-GPU jobs.

When creating a job, simply copy the command below to use as your job’s Run Command on NGC. Be sure to change the parameters according to your needs.

If you would like to run the entire training, inference, and evaluation pipeline in one go, you can refer to the Running Entire Pipeline in One Command section below.

# ADD YOUR S3 CREDENTIALS HERE
# (see "Adding S3 Credentials to NGC Jobs" section for more details)

# Change values below:
export endpoint="https://YOUR_ENDPOINT"
export num_gpus=1
export train_buckets="BUCKET_1 BUCKET_2"
export batchsize=32
export epochs=60
export object="CLASS_OF_OBJECT"
export output_bucket="OUTPUT_BUCKET"
export inference_data="PATH_TO_INFERENCE_DATA"

# Run Training
python -m torch.distributed.launch --nproc_per_node=$num_gpus \
train.py --use_s3 \
--train_buckets $train_buckets \
--endpoint $endpoint \
--object $object \
--batchsize $batchsize \
--epochs $((epochs / num_gpus))

# Copy Inference Data Locally
mkdir -p sample_data/inference_data
s3cmd sync s3://$inference_data sample_data/inference_data

# Run Inference
cd inference/
python inference.py \
--weights ../output/weights \
--data ../sample_data/inference_data \
--object $object

# Run Evaluation
cd ../evaluate
python evaluate.py \
--data_prediction ../inference/output \
--data ../sample_data/inference_data \
--outf ../output/ \
--cuboid

# Store Training and Evaluation Results
cd ../
s3cmd mb s3://$output_bucket
s3cmd sync output/ s3://$output_bucket
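
As an optional final check, you can list what was uploaded, assuming the same s3cmd credentials configured at the top of the Run Command:

# Confirm the trained weights and evaluation results were uploaded
s3cmd ls --recursive s3://$output_bucket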

10.7.3.3. Running Entire Pipeline in One Command

To simplify running the entire pipeline on NGC, the script run_pipeline_on_ngc.py runs all of the stages above with a single command. Below is an example NGC run command that uses it:

# ADD YOUR S3 CREDENTIALS HERE
# (see "Adding S3 Credentials to NGC Jobs" section for more details)

python run_pipeline_on_ngc.py \
--num_gpus 1 \
--endpoint https://ENDPOINT \
--object YOUR_OBJECT \
--train_buckets YOUR_BUCKET \
--inference_bucket YOUR_INFERENCE_BUCKET \
--output_bucket YOUR_OUTPUT_BUCKET

10.7.3.4. Building Your Own Training Container with Dockerfile

The easiest way to run this pipeline is with the existing container on NGC that is linked above. Alternatively, there is a Dockerfile in the Dope Training Repo that you can use to build your own Docker image. Note that this Dockerfile uses the PyTorch Container from NGC as its base image.

Then, run the commands below from the root of the repository; they change into the docker/ directory that contains the Dockerfile. For additional information on building Docker images, refer to the official Docker guide.

cd docker
./get_nvidia_libs.sh

docker build -t nvcr.io/nvidian/onboarding/sample-image-dope-training:1.0 .

Here, nvcr.io/nvidian/onboarding/sample-image-dope-training is the name of the image we want to build and 1.0 is the tag; this naming convention lets us upload the image as a container to NGC. We need to run get_nvidia_libs.sh because the visii module used in evaluate.py requires driver libraries that are not present in the default PyTorch container we build from, so those files must be copied in manually.

Then, to push this container to NGC, we can follow the same steps listed in the Pushing Docker Container to NGC section above.
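
For the image name used above, that push would look like this:

docker push nvcr.io/nvidian/onboarding/sample-image-dope-training:1.0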

10.7.4. Summary

This tutorial covered the following topics:

  1. How to generate synthetic data with Isaac Sim on the OVX clusters on NGC. Using these clusters enables you to scale up your synthetic data generation.

  2. How to train and evaluate a DOPE model on NGC using data that has been uploaded to an S3 bucket. This enables you to scale up model training by training multiple models at once on clusters with multiple GPUs.