# Reinforcement Learning¶

The Omniverse Isaac Sim Reinforcement Learning samples demonstrate how the synthetic data and domain randomization extensions are used to train a Reinforcement Learning (RL) agent entirely in simulation and transfer it to the real world.

## JetBot Lane Following Sample¶

The JetBot is an educational robotics platform designed with an emphasis on AI. Our goal is to create an agent that consumes a state (images generated by the camera on the robot) and computes the optimal target velocities for each wheel of the JetBot so that it follows a road. We also wrote a blog post, Training Your JetBot in NVIDIA Isaac Sim, showing how to train the JetBot in Isaac Sim and transfer the result to the real JetBot.

To run the JetBot sample, after following the Default Python environment setup, navigate to the root folder containing python.sh and run ./python.sh python_samples/jetbot/jetbot_train.py.

Note

When running the samples below with the Isaac Sim editor (without --headless), it is recommended to use Ctrl-C to kill the script rather than closing the Omniverse Isaac Sim editor window.

This section explains how the road is set up. If you can already run the sample above, you can skip to the next section, Model, to learn how the reinforcement learning is set up.

This sample expects the following assets to exist on the Nucleus server specified in apps/omni.isaac.sim.python.kit during Setup.

• /Library/Props/Road_Tiles/Parts/p4336p01.usd

• /Library/Props/Road_Tiles/Parts/p4342p01.usd

• /Library/Props/Road_Tiles/Parts/p4341p01.usd

• /Library/Props/Road_Tiles/Parts/p4343p01.usd

Select All Parts on the right, then search for and add the following parts to the model:

• 4336p01 Baseplate 32x32 Road 6-Stud Straight

• 4342p01 Baseplate 32x32 Road 6-Stud Curve

• 4341p01 Baseplate 32x32 Road 6-Stud T-Junction

Click on each part and, under Properties, set the color to Dark Bluish Grey (or any other desired color).

Export the model via File > Export > COLLADA and import it into Blender for further processing.

Set the translation of each tile to (0, 0, 0) and rename the tiles on the right so it is easier to keep track of them. The translation must be zero so that the tiles are at the origin when imported into Omniverse.

Select each tile and export it as .fbx using the following settings

With the following names:

• p4336p01.fbx

• p4342p01.fbx

• p4341p01.fbx

• p4343p01.fbx

The p at the start is needed because USD names/paths cannot start with a number.

In Omniverse Isaac Sim create a folder on your Nucleus server at /Library/Props/Road_Tiles/Parts/ and in this folder import all of the .fbx files with the following settings:

Once imported and placed in the correct location, the JetBot training sample should be able to load them.

If the tiles are offset by 90/180 degrees, or you want a custom offset, you can apply it by setting the offset variable (in degrees) in road_environment.py:

self.tile_usd = {
    0: None,
    1: {"asset": nucleus_server + "/Library/Props/Road_Tiles/Parts/p4336p01.usd", "offset": 180},
    2: {"asset": nucleus_server + "/Library/Props/Road_Tiles/Parts/p4342p01.usd", "offset": 180},
    3: {"asset": nucleus_server + "/Library/Props/Road_Tiles/Parts/p4341p01.usd", "offset": 180},
    4: {"asset": nucleus_server + "/Library/Props/Road_Tiles/Parts/p4343p01.usd", "offset": 180},
}


### Model¶

The model created through this sample consumes a state consisting of two 224 x 224 RGB images, scaled to [0, 1] and stacked channel-wise. In the simulation these images are generated by the camera after every JetbotEnv.step call made to the environment. During each step, JetbotEnv.updates_per_step physics steps are executed, each advancing the simulation forward by 1/30 of a second. This means that the images in the state are approximately 1/10 of a second apart in simulation time. The agent infers its kinematics from this time difference, so when we transfer to the real JetBot, we must sample an image from the camera once every tenth of a second.
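This timing can be verified with a quick calculation (assuming the sample's default of updates_per_step=3, as used in jetbot_train.py below):

```python
# Each physics step advances the simulation by 1/30 s; JetbotEnv.step runs
# updates_per_step of them, so consecutive observations in the stacked state
# are roughly a tenth of a second apart in simulation time.
PHYSICS_DT = 1.0 / 30.0
UPDATES_PER_STEP = 3  # default used by the sample

frame_interval = UPDATES_PER_STEP * PHYSICS_DT
print(frame_interval)  # ~0.1 s
```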

The state is then processed by a series of convolution layers, each followed by a rectifier (ReLU) activation. These layers are: 32-8x8s4 (32 8x8 kernels with stride 4), 64-4x4s2, 128-3x3s1, 128-3x3s1. No padding is used. The features from the final convolution are flattened and fully connected to a layer of 512 rectifier units. These features then pass through a fully connected network that splits into the policy and value function outputs. The specifics of this network can be defined using the net_arch variable in the jetbot_train.py script.
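As a sanity check on the layer arithmetic, the spatial size after each of these valid (unpadded) convolutions follows the standard formula; this small sketch just verifies the numbers, it is not the network itself:

```python
def conv_out(size, kernel, stride):
    """Output size of a valid (no padding) convolution."""
    return (size - kernel) // stride + 1

size = 224
for kernel, stride in [(8, 4), (4, 2), (3, 1), (3, 1)]:
    size = conv_out(size, kernel, stride)
    print(size)  # 55, 26, 24, 22

# The final 128-channel 22x22 feature map is flattened before the
# fully connected layer of 512 rectifier units.
flattened = 128 * size * size
print(flattened)  # 61952
```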

import os

import torch
from stable_baselines3 import PPO
from stable_baselines3.common.callbacks import CheckpointCallback

from omni.isaac.python_app import OmniKitHelper  # import path may vary by Isaac Sim release


def train(args):
    CUSTOM_CONFIG = {
        "width": 224,
        "height": 224,
        "renderer": "RayTracedLighting",
        "experience": f'{os.environ["EXP_PATH"]}/omni.isaac.sim.python.kit',
    }
    omniverse_kit = OmniKitHelper(CUSTOM_CONFIG)

    # need to construct OmniKitHelper before importing physics, etc.
    from jetbot_env import JetbotEnv
    import omni.physx

    # disable all anti-aliasing in the render because we want to train on the raw camera image
    omniverse_kit.set_setting("/rtx/post/aa/op", 0)

    env = JetbotEnv(omniverse_kit, max_resets=args.rand_freq, updates_per_step=3, mirror_mode=args.mirror_mode)

    checkpoint_callback = CheckpointCallback(
        save_freq=args.save_freq, save_path="./params/", name_prefix=args.checkpoint_name
    )

    net_arch = [256, 256, dict(pi=[128, 64, 32], vf=[128, 64, 32])]
    policy_kwargs = {"net_arch": net_arch, "activation_fn": torch.nn.ReLU}

    if not args.loaded_checkpoint:
        model = PPO(
            "CnnPolicy",
            env,
            verbose=1,
            tensorboard_log=args.tensorboard_dir,
            policy_kwargs=policy_kwargs,
            device="cuda",
            n_steps=args.step_freq,
            batch_size=2048,
            n_epochs=50,
            learning_rate=0.0001,
        )
    else:
        # continue training from an existing checkpoint passed via --loaded_checkpoint
        model = PPO.load(args.loaded_checkpoint, env=env, device="cuda")

    model.learn(
        total_timesteps=args.total_steps,
        callback=checkpoint_callback,
        eval_env=env,
        eval_freq=args.eval_freq,
        eval_log_path=args.evaluation_dir,
        reset_num_timesteps=args.reset_num_timesteps,
    )
    model.save(args.checkpoint_name)


The policy output of the agent returns the mean values of a spherical Gaussian. During training, this Gaussian is sampled for the action to be executed; during evaluation, the mean values are used as the action. The value output of the agent returns the total reward the agent expects to receive before reaching a terminal state. The value function is only used during training.
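The sampling behavior can be sketched as follows. This is a minimal illustration with hypothetical mean and standard deviation values, not the SB3 internals:

```python
import random

def select_action(mean, std, deterministic):
    """Sample each target wheel velocity from an independent (spherical)
    Gaussian during training; use the mean directly during evaluation."""
    if deterministic:
        return list(mean)
    return [random.gauss(m, s) for m, s in zip(mean, std)]

# hypothetical policy output: mean target velocities for the two wheels
mean, std = [0.6, 0.4], [0.1, 0.1]
print(select_action(mean, std, deterministic=True))   # [0.6, 0.4]
print(select_action(mean, std, deterministic=False))  # noisy sample around the mean
```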

### Training and Environment¶

Training is executed using the Stable Baselines 3 implementation of Proximal Policy Optimization (PPO). During training, the policy is executed on the environment for 2048 time steps. Actions during these steps are sampled from the policy, and the state, action, reward, and log likelihood of the action are saved to a buffer. When enough data has been collected, the buffer is sampled in batches, and on each batch a gradient step is executed to maximize the advantage the current iteration of the policy has over the initial iteration. This is repeated for n_epochs passes over the buffer, an argument of the SB3 PPO class (set to 50 in jetbot_train.py).

The environment is a physics simulation, rendered via RTX and run in Omniverse. It leverages numerous features of Omniverse Isaac Sim, but most importantly it uses the synthetic data, dynamic control, and domain randomization extensions. The synthetic data extension manages the camera and is responsible not only for image encoding, but also for creating things like semantic segmentations of the scene (we only need the RGB image for this sample). The dynamic control extension provides a clean interface to the articulated portions of a robot, allowing you to do things like set poses and target velocities while adhering to joint constraints. Finally, the domain randomization extension is used to provide visual distractors and to randomize textures and materials in the scene. Because we intend to transfer this to a real robot in an arbitrary real-world environment, we want the agent to generalize as much as possible. By randomizing the scene every few resets we increase the domain of states that the robot is exposed to during training, thus increasing the likelihood that the environment we transfer the agent into falls within the domain of the agent.

When the environment is initialized, the tile and JetBot USDs are loaded into memory, along with the textures and materials used for domain randomization. Tiles are selected to form a randomized loop track with a dimension defined by JetbotEnv.shape. When the track is finished, the JetBot is spawned above a random location along the center line of the track with a random orientation. We then simulate the stage until all objects have finished loading, allowing the JetBot to settle on the track. The simulation is updated via calls to JetbotEnv.step and JetbotEnv.reset. The step function accepts an action in the form of an ndarray and steps the simulation forward in time by JetbotEnv.updates_per_step frames. At the end of the final frame, the reward is calculated, it is determined whether the robot is in a terminal state, and the current view is fetched from the camera. This image is scaled by 1/255 and uniform noise on [-0.05, 0.05] is applied to each pixel. The result is then stacked channel-wise with the result from the previous step call.
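The observation processing just described can be sketched as follows. This is a simplified NumPy version with illustrative names, not the sample's actual implementation:

```python
import numpy as np

def process_frame(raw_frame, prev_frame, rng):
    """Scale an 8-bit camera frame to [0, 1], add per-pixel uniform noise
    on [-0.05, 0.05], and stack it channel-wise with the previous frame."""
    frame = raw_frame.astype(np.float32) / 255.0
    frame += rng.uniform(-0.05, 0.05, size=frame.shape).astype(np.float32)
    # the state consists of the two most recent frames, stacked channel-wise
    return np.concatenate([prev_frame, frame], axis=-1), frame

rng = np.random.default_rng(0)
prev = np.zeros((224, 224, 3), dtype=np.float32)       # e.g. first frame after a reset
raw = rng.integers(0, 256, size=(224, 224, 3), dtype=np.uint8)  # stand-in camera image
state, prev = process_frame(raw, prev, rng)
print(state.shape)  # (224, 224, 6)
```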

If you’re running with the editor (without --headless), you can also create another viewport to see what the JetBot is doing by clicking Window > New Viewport Window and setting the new viewport’s camera to Perspective instead of jetbot_camera.

You can run the training faster in the headless mode by passing in --headless when launching:

./python.sh python_samples/jetbot/jetbot_train.py --headless


### Reward Function and Domain randomization¶

The reward returned for this environment is

$$R = S e^{-\frac{x^2}{\sigma^2}}$$

Where $$R$$ is the reward, $$S$$ is the current speed of the JetBot, $$x$$ is the shortest distance from the robot to the center line of the track, and $$\sigma$$ is a hyperparameter (the standard deviation) that determines the falloff of the reward as you move away from the center line (the default is 0.05). This reward function is maximized when the JetBot is moving as fast as possible along the center line.
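A direct translation of the reward formula, with hypothetical speed and distance values and the default sigma of 0.05:

```python
import math

def reward(speed, dist_to_center, sigma=0.05):
    """R = S * exp(-x^2 / sigma^2): full reward on the center line,
    falling off as the robot drifts away from it."""
    return speed * math.exp(-dist_to_center ** 2 / sigma ** 2)

print(reward(1.0, 0.0))   # 1.0 -> on the center line, reward equals speed
print(reward(1.0, 0.05))  # ~0.368 -> one sigma from the line, reward drops to S/e
```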

When training with default parameters, we usually start to see “intent” (following the road, attempting to turn, etc.) in the agent after it has been exposed to about 200k states, at which point the performance of the agent will begin to plateau without further tuning. By default, the domain is randomized once every JetbotEnv.max_resets times the robot reaches a terminal state. Early in training the agent acts randomly and usually drives off the track, resulting in many short rollouts. As it improves, it stays on the track longer and longer, increasing the number of states the robot is exposed to for a single domain and hampering its ability to continue learning. We get around this by changing JetbotEnv.max_resets to 1 when the agent reaches 200k steps.

The robot reaches a terminal state if it stays on the road for JetbotEnv.numsteps steps, if it moves farther than a tenth of a tile width away from the center line, or if its velocity becomes too large in the reverse direction. (The agent is capable of memorizing the track, at which point it doesn’t need to actually see in order to succeed, and will sometimes learn to drive backwards as a viable solution. We get around this by randomizing the track layout with the domain and terminating the episode if the robot drives too fast in reverse; we still want it to be able to back up if it gets stuck.) Regardless, when the robot reaches a terminal state, the environment is reset, teleporting the robot to a random location on the track and starting over. If JetbotEnv.max_resets resets have occurred, new distractors are spawned and all lighting and materials are randomized.
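The termination conditions above can be sketched as a single check. The parameter names here are illustrative, not the sample's actual variables:

```python
def is_terminal(step_count, dist_to_center, forward_velocity,
                numsteps, tile_width, max_reverse_speed):
    """An episode ends when the step budget is exhausted, the robot strays
    more than a tenth of a tile width from the center line, or it drives
    too fast in reverse (backing up slowly is still allowed)."""
    if step_count >= numsteps:                  # survived the full episode
        return True
    if dist_to_center > 0.1 * tile_width:       # left the road
        return True
    if forward_velocity < -max_reverse_speed:   # driving backwards too fast
        return True
    return False

# hypothetical values: mid-episode, near the center line, driving forward
print(is_terminal(10, 0.02, 0.5,
                  numsteps=500, tile_width=0.3, max_reverse_speed=0.2))  # False
```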

### JetBot training results¶

Many of the details in the JetRacer sections below apply to JetBot training as well.

## JetRacer Lane Following Sample¶

The JetRacer sample is another RL sample, similar to the JetBot sample above, except it trains the JetRacer vehicle to follow the center line of its racing track. To run the JetRacer sample, after following the Default Python environment setup, navigate to the root folder containing python.sh and run ./python.sh python_samples/jetracer/jetracer_train.py.

The width and height above set the JetRacer camera resolution the RL network uses. This is the JetRacer camera point of view during training. It looks grainy because this is the resolution used for training. Take care not to move the camera (for example, by accidentally scrolling the middle mouse wheel), because that will change the camera pose and what the JetRacer sees. We also spawn different distractors (bowls, boxes, etc.) and randomize different lighting conditions to make the trained network more robust.

Once you are satisfied with the initial visual inspection, you can run the training in headless mode by passing in --headless, since it is faster.

The JetRacer training app will print out something similar to these while training (this is at 34143 steps):

Number of steps  17
0.8193072689991853
0.8005960997912133
0.777922437068045
------------------------------------------
| eval/                   |              |
|    mean_ep_length       | 164          |
|    mean_reward          | 4.57e+03     |
| time/                   |              |
|    fps                  | 32           |
|    iterations           | 3            |
|    time_elapsed         | 1036         |
|    total_timesteps      | 34143        |
| train/                  |              |
|    approx_kl            | 0.0066244313 |
|    clip_fraction        | 0.0312       |
|    clip_range           | 0.2          |
|    entropy_loss         | -2.68        |
|    explained_variance   | -0.433       |
|    learning_rate        | 0.0003       |
|    loss                 | 153          |
|    std                  | 0.919        |
|    value_loss           | 1.05e+03     |
------------------------------------------


The tensorboard_log argument sets the name of the reward log folder.

model = PPO(
    "CnnPolicy", env, verbose=1, tensorboard_log="tensorboard_rewards", policy_kwargs=policy_kwargs, device="cuda"
)


After training for a while, you can visualize the rewards by running the following. The folder for --logdir is the value you passed as tensorboard_log above. The tensorboard_rewards folder will be at the root folder where python.sh is.

tensorboard --logdir=tensorboard_rewards


Ctrl + Clicking the localhost link will launch your browser and show the reward graphs.

TensorBoard 2.3.0 at http://localhost:6006/ (Press CTRL+C to quit)


### Evaluate trained models¶

Training will periodically save checkpoints in the params folder.

checkpoint_callback = CheckpointCallback(save_freq=args.save_freq, save_path="./params/", name_prefix="rl_model")


If you want to evaluate how your trained model is doing, you can run this command:

./python.sh python_samples/jetracer/jetracer_train.py --eval


It will load and evaluate the best model so far from the args.evaluation_dir folder (default: eval_log).

# load a zip file to evaluate here
agent = PPO.load(args.evaluation_dir + "/best_model.zip", device="cuda")


### Continue training¶

You can load an existing model and continue training with this command:

./python.sh python_samples/jetracer/jetracer_train.py --loaded_checkpoint params/<your_model.zip>


### jetracer.py¶

jetracer.py controls where the JetRacer USD model is loaded from:

class Jetracer:
    def __init__(self, omni_kit):
        ...
        self.usd_path = nucleus_server + "/Isaac/Robots/Jetracer/jetracer.usd"

command() specifies how the actions produced by the network are applied to the JetRacer vehicle’s acceleration, steering, etc.

### track_enviroment.py¶

This sets up the racing track.

### gtc2020_track_utils.py¶

This contains utility functions to calculate distances, whether the JetRacer is racing in the right direction, etc. This is specific for this racing track.

### jetracer_env.py¶

calculate_reward() is where you can tune the reward function.

reset() controls how you want to reset the JetRacer for every episode.

### jetracer_model.py¶

This defines the CNN architecture.

### Jetracer training results¶

We see the JetRacer starting to follow the center lane and drive around the track around 200k steps.