Performance#
We used NVIDIA's GenAI-Perf tool to benchmark USD Code API performance on four NVIDIA H100 GPUs.
NVIDIA GenAI-Perf is a client-side, LLM-focused benchmarking tool that reports key metrics such as time to first token (TTFT), inter-token latency (ITL), tokens per second (TPS), requests per second (RPS), and more. It supports any LLM inference service conforming to the OpenAI API specification, a widely accepted de facto standard in the industry.
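To make the reported metrics concrete, here is a minimal sketch of how TTFT, average ITL, and output token throughput can be computed from per-token arrival timestamps of a streamed response. The function names and the timing values are illustrative assumptions, not GenAI-Perf's internal implementation.

```python
# Illustrative metric definitions; names and numbers are assumptions
# for this sketch, not GenAI-Perf internals.

def ttft(request_start: float, token_times: list[float]) -> float:
    """Time to First Token: delay from request send to first token arrival."""
    return token_times[0] - request_start

def avg_itl(token_times: list[float]) -> float:
    """Average Inter-Token Latency: mean gap between consecutive tokens."""
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    return sum(gaps) / len(gaps)

def output_token_throughput(total_tokens: int, wall_clock_s: float) -> float:
    """Output token throughput over the whole run, in tokens/sec."""
    return total_tokens / wall_clock_s

# Example: request sent at t=0.0 s; the first token arrives after 1.2 s,
# then one token every 0.05 s.
times = [1.2 + 0.05 * i for i in range(10)]
print(ttft(0.0, times))   # TTFT in seconds
print(avg_itl(times))     # average ITL in seconds
```

Under concurrency, GenAI-Perf aggregates these per-request measurements across all in-flight requests, which is why throughput rises with concurrency while TTFT grows as requests queue.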
NVIDIA provides a step-by-step walkthrough of using GenAI-Perf to benchmark a Llama 3 model inference engine powered by NVIDIA NIM. The tables below report results from three separate benchmark runs.
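A typical GenAI-Perf invocation against an OpenAI-compatible endpoint looks like the following. This is an illustrative sketch: the model name, URL, and token counts are placeholders, not the exact configuration used to produce the tables below, and flag names may differ across GenAI-Perf versions.

```shell
# Illustrative GenAI-Perf run (placeholders, not the benchmarked config).
genai-perf profile \
  -m meta/llama3-70b-instruct \
  --endpoint-type chat \
  --streaming \
  --url localhost:8000 \
  --concurrency 25 \
  --output-tokens-mean 128
```

Sweeping `--concurrency` across the values in the tables below (1 through 250) produces the throughput and latency figures reported for each run.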
| Concurrency | Average Output Token Throughput (tokens/sec) | Average Request Throughput (requests/sec) | Average Time to First Token (s) | Average Inter-Token Latency (s) |
|---|---|---|---|---|
| 1 | 14.92 | 0.62 | 1.26 | 0.03 |
| 5 | 22.24 | 1.00 | 2.36 | 0.14 |
| 25 | 23.43 | 1.14 | 14.06 | 0.49 |
| 50 | 24.23 | 1.18 | 38.92 | 0.49 |
| 100 | 23.93 | 1.27 | 84.61 | 0.49 |
| 150 | 23.59 | 1.18 | 133.24 | 0.49 |
| 200 | 23.35 | 1.23 | 174.69 | 0.49 |
| 250 | 23.99 | 1.17 | 226.26 | 0.49 |
| Concurrency | Average Output Token Throughput (tokens/sec) | Average Request Throughput (requests/sec) | Average Time to First Token (s) | Average Inter-Token Latency (s) |
|---|---|---|---|---|
| 1 | 15.95 | 0.45 | 1.20 | 0.03 |
| 5 | 25.22 | 0.73 | 2.40 | 0.13 |
| 25 | 28.11 | 0.81 | 13.39 | 0.49 |
| 50 | 27.43 | 0.82 | 42.36 | 0.49 |
| 100 | 28.01 | 0.81 | 97.03 | 0.49 |
| 150 | 27.71 | 0.82 | 147.43 | 0.49 |
| 200 | 27.47 | 0.81 | 193.06 | 0.49 |
| 250 | 28.48 | 0.82 | 249.47 | 0.49 |
| Concurrency | Average Output Token Throughput (tokens/sec) | Average Request Throughput (requests/sec) | Average Time to First Token (s) | Average Inter-Token Latency (s) |
|---|---|---|---|---|
| 1 | 11.49 | 0.36 | 1.76 | 0.03 |
| 5 | 16.32 | 0.53 | 3.54 | 0.19 |
| 25 | 17.34 | 0.56 | 28.18 | 0.49 |
| 50 | 17.26 | 0.56 | 69.32 | 0.49 |
| 100 | 16.84 | 0.55 | 147.70 | 0.50 |
| 150 | 16.87 | 0.56 | 217.07 | 0.51 |
| 200 | 16.11 | 0.52 | 276.02 | 0.51 |
| 250 | 16.72 | 0.54 | 361.77 | 0.51 |