Reinforcement Training Samples

The Omniverse Isaac Sim Reinforcement Learning samples demonstrate how to use the synthetic data and domain randomization extensions to train a Reinforcement Learning (RL) agent entirely in simulation.


When running the samples below with headless set to False, it's recommended to use Ctrl-C to kill the script rather than closing the Omniverse Isaac Sim editor window.

JetBot Lane Following Sample

The JetBot is an educational robotics platform designed with an emphasis on AI. Our goal is to create an agent that consumes a state (images generated by the camera on the robot) and computes the optimal target velocities for each wheel of the JetBot such that it follows a road. To run the JetBot sample, after following the Setup, navigate to the JetBot sample directory source/python_samples/jetbot, and execute python

JetBot RL Training

Acquiring Road Tile Assets

This sample expects the following assets to exist on the Nucleus server specified in experiences/isaac-sim-python.json during Setup.

  • /Library/Props/Road_Tiles/Parts/p4336p01.usd

  • /Library/Props/Road_Tiles/Parts/p4342p01.usd

  • /Library/Props/Road_Tiles/Parts/p4341p01.usd

  • /Library/Props/Road_Tiles/Parts/p4343p01.usd

To acquire these assets, first download and install LeoCad.

Select All Parts on the right and search for and add the following parts to the model:

  • 4336p01 Baseplate 32x32 Road 6-Stud Straight

  • 4342p01 Baseplate 32x32 Road 6-Stud Curve

  • 4341p01 Baseplate 32x32 Road 6-Stud T-Junction

  • 4343p01 Baseplate 32x32 Road 6-Stud Crossroad

LeoCad Tiles

Click on each part and under properties set the color to Dark Bluish Grey (or any other desired color)

LeoCad Tiles

Export the model via File->Export->COLLADA and import it into Blender for further processing

Blender Import

Set the translation of each tile to (0,0,0) and set their names on the right so it's easier to keep track of things. The translation needs to be zero so that the tiles are at the origin when imported into Omniverse

Blender Translate

Select each tile and export as .fbx using the following settings

Blender Export

With the following names:

  • p4336p01.fbx

  • p4342p01.fbx

  • p4341p01.fbx

  • p4343p01.fbx

The p at the start is needed because USD names/paths cannot start with a number
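As a small illustration of this naming rule, the sketch below prefixes a part number with p when it starts with a digit and builds the asset path this section expects. The helper names usd_safe_name and tile_asset_path are hypothetical and are not part of the sample.

```python
# Hypothetical helpers (not part of the sample) illustrating the naming rule above.

TILE_FOLDER = "/Library/Props/Road_Tiles/Parts"  # folder named in this section


def usd_safe_name(part_number: str) -> str:
    """Prefix names that start with a digit, since USD names cannot."""
    return "p" + part_number if part_number[:1].isdigit() else part_number


def tile_asset_path(part_number: str) -> str:
    """Build the Nucleus-relative path for a road tile part."""
    return f"{TILE_FOLDER}/{usd_safe_name(part_number)}.usd"


for part in ("4336p01", "4342p01", "4341p01", "4343p01"):
    print(tile_asset_path(part))
```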

In Omniverse Isaac Sim create a folder on your Nucleus server at /Library/Props/Road_Tiles/Parts/ and in this folder import all of the .fbx files with the following settings:

Omniverse Isaac Sim Import

Once imported and placed in the correct location, the JetBot training sample should be able to load them.

If the tiles are offset by 90/180 degrees or you want a custom offset, set the "offset" value, in degrees, for the corresponding tile in the tile_usd dictionary:

self.tile_usd = {
    0: None,
    1: {"asset": nucleus_server + "/Library/Props/Road_Tiles/Parts/p4336p01.usd", "offset": 180},
    2: {"asset": nucleus_server + "/Library/Props/Road_Tiles/Parts/p4342p01.usd", "offset": 180},
    3: {"asset": nucleus_server + "/Library/Props/Road_Tiles/Parts/p4341p01.usd", "offset": 180},
    4: {"asset": nucleus_server + "/Library/Props/Road_Tiles/Parts/p4343p01.usd", "offset": 180},
}


The model created through this sample consumes a state consisting of two 224 x 224 RGB images scaled to [0,1], stacked channel-wise. In the simulation these images are generated by the camera after every JetbotEnv.step call made to the environment. During each step, JetbotEnv.updates_per_step physics steps are called, each advancing the simulation forward by 1/30 of a second. With updates_per_step=3, as used in the training script, this means that the images in the state are approximately 1/10th of a second apart in simulation time. The agent infers its kinematics using this time difference, so when we transfer to the real JetBot, we must sample an image from the camera once every tenth of a second.
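The timing above can be checked with a few lines of arithmetic. This is a minimal sketch using the values quoted in this section (3 physics updates per step, 1/30 s per update); the function name frame_interval is hypothetical.

```python
# Values taken from the text: 3 physics updates per step, each 1/30 s long.
UPDATES_PER_STEP = 3
PHYSICS_DT = 1.0 / 30.0


def frame_interval(updates_per_step: int = UPDATES_PER_STEP, physics_dt: float = PHYSICS_DT) -> float:
    """Simulated time between two consecutive camera images in the state."""
    return updates_per_step * physics_dt


interval = frame_interval()
print(f"interval between state images: {interval:.3f} s")
print(f"camera sampling rate on the real robot: {1.0 / interval:.1f} Hz")
```

If updates_per_step is changed for training, the sampling rate on the real robot would need to change with it so that the time difference between the two images stays the same.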

The state is then processed by a series of convolution layers, and the output of each layer is passed through a rectifier (ReLU) activation. These layers are: 32-8x8s4 (32 8x8 kernels with stride 4), 64-4x4s2, 128-3x3s1, 128-3x3s1. No padding is used. The features from the final convolution are flattened and fully connected to a layer of 512 rectifier units. These features then pass through a fully connected network that splits into the policy and value function outputs. The specifics of this network can be defined using the net_arch variable in the script.

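import os

import torch
from stable_baselines3 import PPO
from stable_baselines3.common.callbacks import CheckpointCallback

# OmniKitHelper, JetbotEnv, and CustomCNN come from the sample itself; their imports are not shown here.

CUSTOM_CONFIG = {
    "width": 224,
    "height": 224,
    "renderer": "RayTracedLighting",
    "headless": False,
    "experience": f'{os.environ["EXP_PATH"]}/isaac-sim-python.json',
}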

# use this to switch from training to evaluation
TRAINING_MODE = True

def train():
    omniverse_kit = OmniKitHelper(CUSTOM_CONFIG)

    # we disable all anti aliasing in the render because we want to train on the raw camera image.
    omniverse_kit.set_setting("/rtx/post/aa/op", 0)

    env = JetbotEnv(omniverse_kit, max_resets=10, updates_per_step=3)

    checkpoint_callback = CheckpointCallback(save_freq=1000, save_path="./params/", name_prefix="rl_model")

    net_arch = [512, 256, dict(pi=[128, 64, 32], vf=[128, 64, 32])]
    policy_kwargs = {"net_arch": net_arch, "features_extractor_class": CustomCNN, "activation_fn": torch.nn.ReLU}

    model = PPO("CnnPolicy", env, verbose=1, tensorboard_log="tensorboard", policy_kwargs=policy_kwargs, device="cuda")
    # model = PPO.load("",env)
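The feature sizes implied by the convolution stack described earlier (32-8x8s4, 64-4x4s2, 128-3x3s1, 128-3x3s1, no padding) can be worked out by hand. The sketch below does that arithmetic for a 224 x 224 input, assuming the floor rounding that PyTorch's Conv2d applies; it does not build the network, and the function names are hypothetical.

```python
# Layer list taken from the text: (output channels, kernel size, stride), no padding.
CONV_LAYERS = [(32, 8, 4), (64, 4, 2), (128, 3, 1), (128, 3, 1)]


def conv_output_size(size: int, kernel: int, stride: int) -> int:
    """Spatial size after a convolution without padding (floor rounding)."""
    return (size - kernel) // stride + 1


def feature_shapes(input_size: int = 224):
    """Return (channels, height, width) after each convolution layer."""
    shapes = []
    size = input_size
    for channels, kernel, stride in CONV_LAYERS:
        size = conv_output_size(size, kernel, stride)
        shapes.append((channels, size, size))
    return shapes


shapes = feature_shapes()
channels, height, width = shapes[-1]
flattened = channels * height * width  # number of inputs to the 512-unit layer
print(shapes)
print(flattened)
```

Under these assumptions the spatial size shrinks from 224 to 55, 26, 24 and finally 22, so the layer of 512 rectifier units receives 128 x 22 x 22 = 61952 flattened features.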

The policy output of the agent returns the mean values of a spherical Gaussian. During training, this Gaussian is sampled for the action to be executed; during evaluation, the mean values are used as the action. The value output of the agent returns the quantity of reward the agent expects to receive in total before reaching a terminal state. The value function is only used during training.
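The difference between the training and evaluation behavior can be sketched in a few lines. This is an illustration only, not the Stable Baselines 3 implementation: select_action is a hypothetical helper, and a single shared standard deviation is assumed because the text describes the Gaussian as spherical.

```python
import random


def select_action(mean, std, deterministic, rng=random):
    """Return one action from a spherical Gaussian policy.

    mean: per-dimension means produced by the policy output.
    std: one standard deviation shared by all dimensions (assumption).
    deterministic: True for evaluation (use the mean), False for training (sample).
    """
    if deterministic:
        return list(mean)
    return [rng.gauss(m, std) for m in mean]


policy_mean = [0.2, -0.1]  # made-up example values, one per wheel
print(select_action(policy_mean, std=0.5, deterministic=True))
print(select_action(policy_mean, std=0.5, deterministic=False, rng=random.Random(0)))
```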

Training and Environment

Training is executed using the Stable Baselines 3 implementation of Proximal Policy Optimization (PPO). During training, the policy is executed on the environment for 2048 time steps. Actions during these steps are sampled from the policy and the state, action, reward, and log likelihood of the action are saved to a buffer. When enough data has been collected, the buffer is sampled in batches, and on each batch a gradient step is executed to maximize the advantage the current iteration of the policy has over the initial iteration. This is repeated for n_epochs=10, which is an argument of the SB3 PPO class.
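For reference, PPO limits how far each update can move the policy by clipping the probability ratio between the current policy and the policy that collected the data. The sketch below shows the standard clipped surrogate objective for a single sample. It illustrates the published PPO objective rather than code from the sample; the function names are hypothetical, and clip_range=0.2 is the Stable Baselines 3 default.

```python
import math


def probability_ratio(new_log_prob: float, old_log_prob: float) -> float:
    """Ratio of the action's likelihood under the current and the data-collecting policy."""
    return math.exp(new_log_prob - old_log_prob)


def clipped_surrogate(ratio: float, advantage: float, clip_range: float = 0.2) -> float:
    """Per-sample PPO objective (to be maximized)."""
    clipped_ratio = max(1.0 - clip_range, min(ratio, 1.0 + clip_range))
    return min(ratio * advantage, clipped_ratio * advantage)


# A large ratio with a positive advantage is capped at (1 + clip_range) * advantage.
print(clipped_surrogate(ratio=1.5, advantage=1.0))
# A small ratio with a negative advantage is capped at (1 - clip_range) * advantage.
print(clipped_surrogate(ratio=0.5, advantage=-1.0))
```

The stored log likelihood of each action is what makes the ratio computable after the policy weights have changed.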

The environment is a physics simulation, rendered via RTX and run in Omniverse. It leverages numerous features of Omniverse Isaac Sim, but most importantly it uses the synthetic data, dynamic control, and domain randomization extensions. The synthetic data extension manages the camera and is responsible not only for image encoding, but also for creating things like semantic segmentations of the scene (we only need the RGB image for this sample). The dynamic control extension provides a clean interface to the articulated portions of a robot, allowing you to do things like set poses and target velocities while adhering to joint constraints. Finally, the domain randomization extension is used to provide visual distractors and to randomize textures and materials in the scene. Because we intend to transfer this to a real robot in an arbitrary real world environment, we want the agent to be able to generalize as much as possible. By randomizing the scene every few resets we increase the domain of states that the robot is exposed to during training, thus increasing the likelihood that the environment we transfer the agent into falls within the domain of the agent.

When the environment is initialized, the tile and JetBot USDs are loaded into memory, along with the textures and materials used for domain randomization. Tiles are selected to form a randomized loop track with a dimension defined by JetbotEnv.shape. When the track is finished, the JetBot is spawned above a random location along the center line of the track with a random orientation. We then simulate the stage until all objects have finished loading, allowing the JetBot to settle on the track. The simulation is updated via calls to JetbotEnv.step and JetbotEnv.reset. The step function accepts an action in the form of an ndarray, and steps the simulation forward in time by JetbotEnv.updates_per_step frames. At the end of the final frame, the reward is calculated, it is determined whether the robot is in a terminal state, and the current view is fetched from the camera. This image is scaled by 1/255 and uniform noise on [-0.05, 0.05] is applied to each pixel. The result is then stacked channel-wise with the result from the previous step call.
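The image handling at the end of each step can be sketched as follows. This is a minimal, hypothetical illustration in plain Python, with nested lists standing in for image arrays; the function names, the channel order (previous frame first) and the absence of clipping after the noise is added are assumptions, not details confirmed by the sample.

```python
import random

NOISE_HALF_WIDTH = 0.05  # uniform noise on [-0.05, 0.05], from the text


def preprocess(frame, rng=random):
    """Scale 8-bit pixel values by 1/255 and add uniform noise to each value.

    frame is a list of pixels, each pixel a list of [R, G, B] integers.
    """
    return [
        [value / 255.0 + rng.uniform(-NOISE_HALF_WIDTH, NOISE_HALF_WIDTH) for value in pixel]
        for pixel in frame
    ]


def stack_channelwise(previous, current):
    """Concatenate the channels of two frames pixel by pixel (3 + 3 = 6 channels)."""
    return [prev_pixel + cur_pixel for prev_pixel, cur_pixel in zip(previous, current)]


rng = random.Random(0)
raw_previous = [[0, 128, 255]] * 4  # tiny 4-pixel stand-in for a 224 x 224 image
raw_current = [[255, 128, 0]] * 4
state = stack_channelwise(preprocess(raw_previous, rng), preprocess(raw_current, rng))
print(len(state), len(state[0]))
```

In practice the same steps would be applied to full image arrays; the point here is only the order of operations: scale, add noise, then stack with the previous result.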

Reward Function and Domain Randomization

The reward returned for this environment is

\[R = S \textbf{e}^{\frac{-x^2}{\sigma^2}}\]

Where \(R\) is the reward, \(S\) is the current speed of the JetBot, \(x\) is the shortest distance from the robot to the center line of the track, and \(\sigma\) is a hyperparameter (the standard deviation) that determines the falloff of the reward as you move away from the center line (default is 0.05). This reward function is maximized when the JetBot is moving as fast as possible along the center line.
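The formula translates directly into code. The sketch below is a hypothetical helper, not the sample's own reward function; it uses the default sigma of 0.05 stated above.

```python
import math


def lane_reward(speed: float, distance: float, sigma: float = 0.05) -> float:
    """R = S * exp(-x^2 / sigma^2), with x the distance to the center line."""
    return speed * math.exp(-(distance ** 2) / (sigma ** 2))


# On the center line the reward equals the speed.
print(lane_reward(speed=1.0, distance=0.0))
# One sigma away from the center line the reward drops by a factor of e.
print(lane_reward(speed=1.0, distance=0.05))
```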

An example of the reward function for a random track

When training with default parameters, we usually start to see “intent” (following the road, attempting to turn, etc.) in the agent after it has been exposed to about 200k states, at which point its performance begins to plateau without further tuning. By default, the domain is randomized once every JetbotEnv.max_resets times the robot reaches a terminal state. Early in training the agent acts randomly and usually drives off the track, resulting in many short rollouts. As it improves, it stays on the track longer and longer, which increases the number of states the robot is exposed to for a single domain and hampers its ability to continue learning. We get around this by changing JetbotEnv.max_resets to 1 when the agent reaches 200k steps.
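The schedule described here amounts to a single threshold. A minimal sketch, assuming the initial value of 10 passed to JetbotEnv in the training script; the function name max_resets_for is hypothetical, and how the value is applied to the running environment is left out.

```python
def max_resets_for(total_steps: int, switch_step: int = 200_000, early: int = 10, late: int = 1) -> int:
    """Number of resets between domain randomizations at a given training step."""
    return early if total_steps < switch_step else late


for steps in (0, 150_000, 200_000, 350_000):
    print(steps, max_resets_for(steps))
```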

The robot reaches a terminal state if it stays on the road for JetbotEnv.numsteps steps, if it moves farther than a tenth of a tile width away from the center line, or if its velocity becomes too large in the reverse direction. The agent is capable of memorizing the track, at which point it doesn’t need to actually see in order to succeed, and it will sometimes learn to drive backwards as a viable solution. We get around this by randomizing the track layout with the domain and terminating the robot if it drives too fast in reverse (we still want it to be able to back up if it gets stuck). Regardless, when the robot reaches a terminal state, the environment is reset, teleporting the robot to a random location on the track and starting over. If JetbotEnv.max_resets resets have occurred, new distractors are spawned and all lighting and materials are randomized.
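The termination rules and the reset counting can be summarized in code. This is a hypothetical sketch, not the sample's implementation: the names is_terminal and DomainRandomizationCounter and the max_reverse_speed parameter are assumptions, since the text does not state the reverse speed limit.

```python
def is_terminal(step_count, max_steps, center_distance, tile_width, reverse_speed, max_reverse_speed):
    """Return True if any of the three termination conditions from the text holds."""
    if step_count >= max_steps:  # stayed on the road for the full episode
        return True
    if center_distance > tile_width / 10.0:  # drifted too far from the center line
        return True
    if reverse_speed > max_reverse_speed:  # driving backwards too fast
        return True
    return False


class DomainRandomizationCounter:
    """Counts resets and signals when the domain should be re-randomized."""

    def __init__(self, max_resets: int):
        self.max_resets = max_resets
        self.resets = 0

    def on_reset(self) -> bool:
        """Count one reset; return True once max_resets resets have occurred."""
        self.resets += 1
        if self.resets >= self.max_resets:
            self.resets = 0
            return True
        return False


counter = DomainRandomizationCounter(max_resets=3)
print([counter.on_reset() for _ in range(6)])
```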

JetBot training results

JetBot RL Trained

Many things below are applicable to JetBot training too.

JetRacer Lane Following Sample

The JetRacer sample is another RL sample, similar to the JetBot sample above, except that it trains the JetRacer vehicle shown below to follow the center of its racing track. To run the JetRacer sample, after following the Setup, navigate to the JetRacer sample directory source/python_samples/jetracer, and execute python

JetRacer RL Training

If you want to quickly inspect the scene in the Omniverse Isaac Sim editor, set headless to False in the sample's configuration:

    "width": 224,
    "height": 224,
    "renderer": "RayTracedLighting",
    "headless": False,
    "experience": f'{os.environ["EXP_PATH"]}/isaac-sim-python.json',

The width and height above set the resolution of the JetRacer camera that the RL network uses. The image below shows the JetRacer camera's point of view during training; it looks grainy because it is rendered at the training resolution. Take care not to move the camera (for example, by accidentally scrolling the middle mouse wheel), because that changes the camera pose and what the JetRacer sees. We also spawn different distractors (bowls, boxes, etc.) and randomize the lighting conditions to make the trained network more robust.

JetRacer Camera POV

Once you are satisfied with the initial visual inspection, you can start training with headless set to True, since headless training is faster.

The JetRacer training app will print out something similar to these while training (this is at 34143 steps):

Number of steps  17
| eval/                   |              |
|    mean_ep_length       | 164          |
|    mean_reward          | 4.57e+03     |
| time/                   |              |
|    fps                  | 32           |
|    iterations           | 3            |
|    time_elapsed         | 1036         |
|    total_timesteps      | 34143        |
| train/                  |              |
|    approx_kl            | 0.0066244313 |
|    clip_fraction        | 0.0312       |
|    clip_range           | 0.2          |
|    entropy_loss         | -2.68        |
|    explained_variance   | -0.433       |
|    learning_rate        | 0.0003       |
|    loss                 | 153          |
|    n_updates            | 150          |
|    policy_gradient_loss | 0.00368      |
|    std                  | 0.919        |
|    value_loss           | 1.05e+03     |
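The time/ section of this output can be used for a rough estimate of how long training will take. The sketch below uses only the numbers printed above; throughput varies with hardware and over the course of training, so treat the result as an approximation. The 200k target is the step count at which the samples on this page report the agent starting to follow the track.

```python
# Numbers copied from the log output above.
total_timesteps = 34143
time_elapsed = 1036  # seconds

steps_per_second = total_timesteps / time_elapsed
print(f"throughput: {steps_per_second:.1f} steps/s")  # the log truncates this to fps = 32

target_steps = 200_000
remaining_seconds = (target_steps - total_timesteps) / steps_per_second
remaining_hours = remaining_seconds / 3600.0
print(f"estimated time to reach {target_steps} steps: {remaining_hours:.1f} h")
```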

The tensorboard_log argument sets the name of the reward log folder.

model = PPO(
    "CnnPolicy", env, verbose=1, tensorboard_log="tensorboard_rewards", policy_kwargs=policy_kwargs, device="cuda"
)

After training for a while, you can visualize the rewards by running tensorboard --logdir=tensorboard_rewards in the isaac-sim conda environment (conda activate isaac-sim):

(isaac-sim) user@hostname:~/omni_isaac_sim/source/python_samples/jetracer$ tensorboard --logdir=tensorboard_rewards

Ctrl + Click on the localhost link that TensorBoard prints to open the reward graphs in your browser:

TensorBoard 2.3.0 at http://localhost:6006/ (Press CTRL+C to quit)
JetRacer rewards

Evaluate trained models

Training will periodically save checkpoints in the params folder.

checkpoint_callback = CheckpointCallback(save_freq=1000, save_path="./params/", name_prefix="rl_model")
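With save_freq=1000 the params folder fills up quickly, so it can help to locate the most recent checkpoint programmatically. The sketch below is a hypothetical helper, not part of the sample; it assumes the file naming used by the Stable Baselines 3 CheckpointCallback, <name_prefix>_<steps>_steps.zip.

```python
import re
from pathlib import Path

# Assumed naming scheme: <name_prefix>_<steps>_steps.zip
CHECKPOINT_PATTERN = re.compile(r"^(?P<prefix>.+)_(?P<steps>\d+)_steps\.zip$")


def latest_checkpoint(folder, name_prefix="rl_model"):
    """Return the checkpoint with the highest step count, or None if there is none."""
    folder = Path(folder)
    if not folder.is_dir():
        return None
    best_path, best_steps = None, -1
    for path in folder.glob(f"{name_prefix}_*_steps.zip"):
        match = CHECKPOINT_PATTERN.match(path.name)
        if match is None or match.group("prefix") != name_prefix:
            continue
        steps = int(match.group("steps"))
        if steps > best_steps:  # compare numerically, not alphabetically
            best_path, best_steps = path, steps
    return best_path


print(latest_checkpoint("./params/"))
```

Comparing the step counts as integers matters because an alphabetical sort would place rl_model_2000_steps.zip after rl_model_12000_steps.zip.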

If you want to evaluate how your trained model is doing, change TRAINING_MODE to False, specify a zip file from the params folder in runEval’s PPO.load() call, and run the sample again to see how the JetRacer behaves. Remember to set headless to False to see it in the editor.

# use this to switch from training to evaluation
TRAINING_MODE = False

def runEval():
    # load a zip file to evaluate here
    agent = PPO.load("params/", device="cuda")

The training also saves the best model so far under the eval_log folder. You can evaluate that with eval_log/

# load a zip file to evaluate here
agent = PPO.load("eval_log/", device="cuda")

Continue training

To continue training, switch TRAINING_MODE back to True and specify the zip file in train()’s PPO.load() call instead of creating a new PPO model.

def train():

    # create a new model
    # model = PPO("CnnPolicy", env, verbose=1, tensorboard_log="tensorboard", policy_kwargs=policy_kwargs, device="cuda")

    # load an existing model and continue training
    model = PPO.load("params/", env)

The Jetracer class controls where the JetRacer USD model is loaded from:

class Jetracer:
    def __init__(self, omni_kit):
        self.usd_path = nucleus_server + "/Isaac/Robots/Jetracer/jetracer.usd"

command() specifies how to take the network’s produced actions and apply them to the JetRacer vehicle’s accelerating, steering, etc.

This sets up the racing track.

This contains utility functions to calculate distances, whether the JetRacer is racing in the right direction, etc. This is specific for this racing track.

calculate_reward() is where you can tune the reward function.

reset() controls how you want to reset the JetRacer for every episode.

This defines the CNN architecture.

JetRacer training results

We see the JetRacer starting to follow the center lane and drive around the track at around 200k steps.

JetRacer trained