7. Training Pose Estimation Model with Synthetic Data¶
7.1. Learning Objectives¶
50-60 min tutorial
7.2. Generating Data on NGC¶
Generating data on NGC using the OVX clusters allows you to drastically increase the amount of data you can generate compared to your local machine.
We use the OVX clusters for data generation since they are optimized for rendering jobs.
For training, we will use the DGX clusters, which are optimized for machine learning.
Because we will be using two different clusters for generation and training, we will automatically save our generated data to an
s3 bucket, which we will then load data from during training.
7.2.1. Building Your Own Container for Data Generation¶
In order to build a container to run on NGC, we can use a
`Dockerfile`. To do so, copy the contents below into a file called
`Dockerfile`:
```dockerfile
# See https://catalog.ngc.nvidia.com/orgs/nvidia/containers/isaac-sim
# for instructions on how to run this container
FROM nvcr.io/nvidia/isaac-sim:2022.2.0

RUN apt-get update && export DEBIAN_FRONTEND=noninteractive && apt-get install s3cmd -y

# Copies over latest changes to pose generation code when building the container
COPY ./ standalone_examples/replicator/offline_pose_generation
```
Any updates you have made locally to
`offline_pose_generation.py` and the other files in the
`standalone_examples/replicator/offline_pose_generation` folder will be copied into the container when you build it.
This enables workflows where you need to modify the existing generation code for your generation job
(e.g. to generate data for a new object instead of the default
`003_cracker_box`).
To build the container, run:
```shell
cd standalone_examples/replicator/offline_pose_generation
docker build -t NAME_OF_YOUR_CONTAINER:TAG .
```
7.2.2. Pushing Docker Container to NGC¶
To use this new container on NGC, we have to push it first. Pushing a container to NGC requires authentication; you can set this up by following this NGC guide.
When pushing a container to NGC, there is a specific naming format that must be followed:
the name of the container must be of the form `nvcr.io/<org>/[<team>/]<image-name>:<tag>`.
For more details on pushing containers to NGC, see this guide.
```shell
docker push NAME_OF_YOUR_CONTAINER:TAG
```
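As an illustrative aside (not part of the official NGC tooling), the naming convention can be sketched as a small Python check; the regex and helper name below are assumptions for demonstration purposes:

```python
import re

# Hypothetical helper: checks that an image name follows the
# nvcr.io/<org>[/<team>]/<image>:<tag> pattern expected by NGC.
NGC_IMAGE_RE = re.compile(
    r"^nvcr\.io/[a-z0-9\-]+(/[a-z0-9\-]+)?/[a-z0-9\-_.]+:[A-Za-z0-9\-_.]+$"
)

def is_valid_ngc_image_name(name: str) -> bool:
    return NGC_IMAGE_RE.match(name) is not None

print(is_valid_ngc_image_name(
    "nvcr.io/nvidian/onboarding/sample-image-dope-training:1.0"))  # True
print(is_valid_ngc_image_name("my-container:latest"))              # False
```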
7.2.3. Adding S3 Credentials to NGC Jobs¶
If you are planning on reading data from an
s3 bucket or writing your results to one,
you need to add your credentials as part of the job definition. Unfortunately, there is currently no good way
to manage secrets on NGC, so this has to be done manually.
You can set up your credentials by prepending the commands below to your
Run Command in your job definition.
Make sure to fill in your credentials in the places marked.
```shell
# Credentials for boto3
mkdir ~/.aws
echo "[default]" >> ~/.aws/config
echo "aws_access_key_id = <YOUR_USER_NAME>" >> ~/.aws/config
echo "aws_secret_access_key = <YOUR_SECRET_KEY>" >> ~/.aws/config

# Credentials for s3cmd
echo "[default]" >> ~/.s3cfg
echo "use_https = True" >> ~/.s3cfg
echo "access_key = <YOUR_USER_NAME>" >> ~/.s3cfg
echo "secret_key = <YOUR_SECRET_KEY>" >> ~/.s3cfg
echo "bucket_location = us-east-1" >> ~/.s3cfg
echo "host_base = <YOUR_ENDPOINT>" >> ~/.s3cfg
echo "host_bucket = bucket-name" >> ~/.s3cfg
```
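The echo commands above simply build plain INI files. As a sanity-check sketch (not part of the tutorial's tooling), the resulting `.s3cfg` content can be parsed with Python's `configparser` to confirm its structure; the placeholder values are illustrative, not real credentials:

```python
import configparser

# Same content the echo commands above write to ~/.s3cfg.
# Placeholder values only; never commit real credentials.
s3cfg = """\
[default]
use_https = True
access_key = <YOUR_USER_NAME>
secret_key = <YOUR_SECRET_KEY>
bucket_location = us-east-1
host_base = <YOUR_ENDPOINT>
host_bucket = bucket-name
"""

parser = configparser.ConfigParser()
parser.read_string(s3cfg)
print(parser["default"]["access_key"])       # <YOUR_USER_NAME>
print(parser["default"]["bucket_location"])  # us-east-1
```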
After pushing the container to NGC, we select this container when creating a job. You can use the following run command:
```shell
# ADD YOUR S3 CREDENTIALS HERE
# (see "Adding S3 Credentials to NGC Jobs" section above for more details)

# Run Pose Generation
./python.sh standalone_examples/replicator/offline_pose_generation/offline_pose_generation.py \
    --use_s3 --endpoint https://YOUR_ENDPOINT --bucket OUTPUT_BUCKET --num_dome 1000 --num_mesh 1000 --writer DOPE \
    --no-window
```
The `--no-window` flag is passed in when running the script to run Isaac Sim in
headless mode. This overrides any other settings that determine whether the app will run in
headless mode or not. Without this flag, we could get an error if the config file we pass in has
`"headless": false`, since it is not possible to launch an Isaac Sim window when running in a Docker container.
7.2.4. Things to Note¶
In order to submit a job on the OVX clusters, it must be made preemptable. To do this, select
Resumable under Preemption Options when creating the job.
7.3. Train, Inference, and Evaluate¶
7.3.1. Running Locally¶
To run the training, inference, and evaluation scripts locally, clone the Dope Training Repo and follow the instructions in the README.md file within the repo.
7.3.2. Running on NGC¶
NGC offers users the ability to scale their training jobs. Since DOPE needs to be trained separately for each class of object, NGC is extremely helpful in enabling multiple models to be trained at once. Furthermore, it reduces the time needed to train models by providing the option to run multi-GPU jobs.
The easiest way to run a training job on NGC is to use the pre-built DOPE Training Container, which comes with all of the packages needed for DOPE training already installed.
When creating a job, simply copy the command below to be used as your job’s
Run Command on NGC.
Be sure to change the parameters according to your needs.
If you would like to run the entire training, inference, and evaluation pipeline in one go, you can refer to the Running Entire Pipeline in One Command section below.
```shell
# ADD YOUR S3 CREDENTIALS HERE
# (see "Adding S3 Credentials to NGC Jobs" section for more details)

# Change values below:
export endpoint="https://YOUR_ENDPOINT"
export num_gpus=1
export train_buckets="BUCKET_1 BUCKET_2"
export batchsize=32
export epochs=60
export object="CLASS_OF_OBJECT"
export output_bucket="OUTPUT_BUCKET"
export inference_data="PATH_TO_INFERENCE_DATA"

# Run Training
python -m torch.distributed.launch --nproc_per_node=$num_gpus \
    train.py --use_s3 \
    --train_buckets $train_buckets \
    --endpoint $endpoint \
    --object $object \
    --batchsize $batchsize \
    --epochs $((epochs / num_gpus))

# Copy Inference Data Locally
mkdir sample_data/inference_data
s3cmd sync s3://$inference_data sample_data/inference_data

# Run Inference
cd inference/
python inference.py \
    --weights ../output/weights \
    --data ../sample_data/inference_data \
    --object $object

# Run Evaluation
cd ../evaluate
python evaluate.py \
    --data_prediction ../inference/output \
    --data ../sample_data/inference_data \
    --outf ../output/ \
    --cuboid

# Store Training and Evaluation Results
cd ../
s3cmd mb s3://$output_bucket
s3cmd sync output/ s3://$output_bucket
```
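Note the training step passes `--epochs $((epochs / num_gpus))`, dividing the epoch budget by the number of GPU processes. The sketch below just mirrors that shell integer arithmetic in Python; the helper name is an assumption:

```python
# Mirrors the shell expression $((epochs / num_gpus)) in the run command:
# the total epoch budget is divided across the processes launched by
# torch.distributed.launch. Integer division, like shell $(( )).
def epochs_per_process(total_epochs: int, num_gpus: int) -> int:
    return total_epochs // num_gpus

print(epochs_per_process(60, 1))  # 60
print(epochs_per_process(60, 4))  # 15
```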
7.3.3. Running Entire Pipeline in One Command¶
To make running the entire pipeline easier on NGC, there is also a script
that can run the entire pipeline with one command. Below is an example of an NGC run command that
uses the script to run the entire pipeline:
```shell
# ADD YOUR S3 CREDENTIALS HERE
# (see "Adding S3 Credentials to NGC Jobs" section for more details)
python run_pipeline_on_ngc.py \
    --num_gpus 1 \
    --endpoint https://ENDPOINT \
    --object YOUR_OBJECT \
    --train_buckets YOUR_BUCKET \
    --inference_bucket YOUR_INFERENCE_BUCKET \
    --output_bucket YOUR_OUTPUT_BUCKET
```
<imports></imports>
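Conceptually, such a script chains the same stages shown in the previous section. The sketch below is a hypothetical outline of that orchestration, not the repo's actual implementation; the flags, stage list, and function name are assumptions:

```python
# Hypothetical sketch: build each stage's command line first, then run
# them in order (e.g. with subprocess.run). Flags are illustrative.
def build_pipeline_commands(num_gpus, endpoint, obj, train_buckets,
                            inference_bucket, output_bucket):
    train = [
        "python", "-m", "torch.distributed.launch",
        f"--nproc_per_node={num_gpus}", "train.py", "--use_s3",
        "--endpoint", endpoint, "--object", obj,
        "--train_buckets", train_buckets,
    ]
    inference = ["python", "inference.py", "--object", obj,
                 "--data", f"s3://{inference_bucket}"]
    evaluate = ["python", "evaluate.py", "--cuboid"]
    upload = ["s3cmd", "sync", "output/", f"s3://{output_bucket}"]
    return [train, inference, evaluate, upload]

commands = build_pipeline_commands(
    1, "https://ENDPOINT", "YOUR_OBJECT", "YOUR_BUCKET",
    "YOUR_INFERENCE_BUCKET", "YOUR_OUTPUT_BUCKET")
for cmd in commands:
    print(" ".join(cmd))
```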
7.3.4. Building Your Own Training Container with Dockerfile¶
The easiest way to run this pipeline is with the existing container on NGC that is linked above.
Alternatively, the Dope Training Repo contains a
`Dockerfile` that you can use to build your own Docker image.
Note that this
`Dockerfile` uses the PyTorch Container from NGC
as the base image.
Then, assuming that you are in the directory containing the
`Dockerfile`, you can run the command below.
For additional information on building Docker images, refer to the official Docker guide.
```shell
cd docker
./get_nvidia_libs.sh
docker build -t nvcr.io/nvidian/onboarding/sample-image-dope-training:1.0 .
```
- `nvcr.io/nvidian/onboarding/sample-image-dope-training` is the name of the image we want to build.
- `1.0` is the tag.

We use this naming convention in order to upload our image as a container to NGC.
The reason we need to run
`get_nvidia_libs.sh` is that the
`visii` module used by the pipeline
requires drivers that are not present in the default PyTorch container we build off of. Thus, we need to manually copy the files over.
Then, to push this container to NGC, we can follow the same steps listed in the Pushing Docker Container to NGC section above.
7.4. Summary¶
This tutorial covered the following topics:
How to generate synthetic data with Isaac Sim on the OVX clusters on NGC. Using these clusters enables you to scale up your synthetic data generation.
How to train and evaluate a DOPE model on NGC using data that has been uploaded to an
s3 bucket. This enables you to scale up your model training by training multiple models at once on clusters with multiple GPUs.