Performance#
We used NVIDIA's GenAI-Perf tool to benchmark USD Code API performance on four NVIDIA H100 GPUs.
NVIDIA GenAI-Perf is a client-side, LLM-focused benchmarking tool that reports key metrics such as time to first token (TTFT), inter-token latency (ITL), tokens per second (TPS), requests per second (RPS), and more. It supports any LLM inference service conforming to the OpenAI API specification, a widely accepted de facto standard in the industry.
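To make the reported metrics concrete, here is a minimal sketch of how TTFT, average ITL, and output token throughput can be computed from per-token arrival timestamps of a streamed response. The function names and the timing values are illustrative assumptions, not GenAI-Perf's internal implementation.

```python
# Illustrative metric definitions; names and numbers are assumptions
# for this sketch, not GenAI-Perf internals.

def ttft(request_start: float, token_times: list[float]) -> float:
    """Time to First Token: delay from request send to first token arrival."""
    return token_times[0] - request_start

def avg_itl(token_times: list[float]) -> float:
    """Average Inter-Token Latency: mean gap between consecutive tokens."""
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    return sum(gaps) / len(gaps)

def output_token_throughput(total_tokens: int, wall_clock_s: float) -> float:
    """Output token throughput over the whole run, in tokens/sec."""
    return total_tokens / wall_clock_s

# Example: request sent at t=0.0 s; the first token arrives after 1.2 s,
# then one token every 0.05 s.
times = [1.2 + 0.05 * i for i in range(10)]
print(ttft(0.0, times))   # TTFT in seconds
print(avg_itl(times))     # average ITL in seconds
```

Under concurrency, GenAI-Perf aggregates these per-request measurements across all in-flight requests, which is why throughput rises with concurrency while TTFT grows as requests queue.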
NVIDIA provides a step-by-step walkthrough of using GenAI-Perf to benchmark a Llama 3 model inference engine powered by NVIDIA NIM. The tables below report results from three separate benchmark runs.
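A typical GenAI-Perf invocation against an OpenAI-compatible endpoint looks like the following. This is an illustrative sketch: the model name, URL, and token counts are placeholders, not the exact configuration used to produce the tables below, and flag names may differ across GenAI-Perf versions.

```shell
# Illustrative GenAI-Perf run (placeholders, not the benchmarked config).
genai-perf profile \
  -m meta/llama3-70b-instruct \
  --endpoint-type chat \
  --streaming \
  --url localhost:8000 \
  --concurrency 25 \
  --output-tokens-mean 128
```

Sweeping `--concurrency` across the values in the tables below (1 through 250) produces the throughput and latency figures reported for each run.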
| Concurrency | Average Output Token Throughput (tokens/sec) | Average Request Throughput (requests/sec) | Average Time to First Token (s) | Average Inter-Token Latency (s) |
|---|---|---|---|---|
| 1 | 14.92 | 0.62 | 1.26 | 0.03 |
| 5 | 22.24 | 1.00 | 2.36 | 0.14 |
| 25 | 23.43 | 1.14 | 14.06 | 0.49 |
| 50 | 24.23 | 1.18 | 38.92 | 0.49 |
| 100 | 23.93 | 1.27 | 84.61 | 0.49 |
| 150 | 23.59 | 1.18 | 133.24 | 0.49 |
| 200 | 23.35 | 1.23 | 174.69 | 0.49 |
| 250 | 23.99 | 1.17 | 226.26 | 0.49 |
| Concurrency | Average Output Token Throughput (tokens/sec) | Average Request Throughput (requests/sec) | Average Time to First Token (s) | Average Inter-Token Latency (s) |
|---|---|---|---|---|
| 1 | 15.95 | 0.45 | 1.20 | 0.03 |
| 5 | 25.22 | 0.73 | 2.40 | 0.13 |
| 25 | 28.11 | 0.81 | 13.39 | 0.49 |
| 50 | 27.43 | 0.82 | 42.36 | 0.49 |
| 100 | 28.01 | 0.81 | 97.03 | 0.49 |
| 150 | 27.71 | 0.82 | 147.43 | 0.49 |
| 200 | 27.47 | 0.81 | 193.06 | 0.49 |
| 250 | 28.48 | 0.82 | 249.47 | 0.49 |
| Concurrency | Average Output Token Throughput (tokens/sec) | Average Request Throughput (requests/sec) | Average Time to First Token (s) | Average Inter-Token Latency (s) |
|---|---|---|---|---|
| 1 | 11.49 | 0.36 | 1.76 | 0.03 |
| 5 | 16.32 | 0.53 | 3.54 | 0.19 |
| 25 | 17.34 | 0.56 | 28.18 | 0.49 |
| 50 | 17.26 | 0.56 | 69.32 | 0.49 |
| 100 | 16.84 | 0.55 | 147.70 | 0.50 |
| 150 | 16.87 | 0.56 | 217.07 | 0.51 |
| 200 | 16.11 | 0.52 | 276.02 | 0.51 |
| 250 | 16.72 | 0.54 | 361.77 | 0.51 |