# Performance

We used NVIDIA GenAI-Perf to benchmark USD Code API performance on four NVIDIA H100 GPUs.

NVIDIA GenAI-Perf is a client-side, LLM-focused benchmarking tool that reports key metrics such as time to first token (TTFT), inter-token latency (ITL), tokens per second (TPS), and requests per second (RPS). It supports any LLM inference service that conforms to the OpenAI API specification, a widely accepted de facto standard in the industry.

This section walks through using GenAI-Perf to benchmark a Llama 3 model inference engine powered by NVIDIA NIM.
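A benchmark run of this kind can be sketched with a GenAI-Perf invocation like the one below. This is an illustrative example, not the exact command used for these results: the model name, URL, port, and token-count settings are assumptions, and flag names may differ across GenAI-Perf versions.

```shell
# Hypothetical GenAI-Perf run against a NIM endpoint serving an
# OpenAI-compatible chat API on localhost:8000. Adjust the model name,
# URL, concurrency, and token counts to match your deployment.
genai-perf profile \
  -m meta/llama3-8b-instruct \
  --service-kind openai \
  --endpoint-type chat \
  --url localhost:8000 \
  --streaming \
  --concurrency 25 \
  --synthetic-input-tokens-mean 200 \
  --output-tokens-mean 200
```

Sweeping `--concurrency` over the values in the tables below (1, 5, 25, 50, 100, 150, 200, 250) produces one row of metrics per run.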

| Concurrency | Average Output Token Throughput (tokens/sec) | Average Request Throughput (requests/sec) | Average Time to First Token (s) | Average Inter-Token Latency (s) |
|---|---|---|---|---|
| 1 | 14.92 | 0.62 | 1.26 | 0.03 |
| 5 | 22.24 | 1.00 | 2.36 | 0.14 |
| 25 | 23.43 | 1.14 | 14.06 | 0.49 |
| 50 | 24.23 | 1.18 | 38.92 | 0.49 |
| 100 | 23.93 | 1.27 | 84.61 | 0.49 |
| 150 | 23.59 | 1.18 | 133.24 | 0.49 |
| 200 | 23.35 | 1.23 | 174.69 | 0.49 |
| 250 | 23.99 | 1.17 | 226.26 | 0.49 |
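The four reported metrics are related to one another, which makes it possible to sanity-check a row. The sketch below uses the concurrency-25 figures from the first table above; the tokens-per-request and end-to-end latency values are derived estimates, not measurements, and the latency model (TTFT plus one ITL per subsequent token) is a simplifying assumption.

```python
# Consistency check on one benchmark row (concurrency = 25, first table).
concurrency = 25
output_tok_per_s = 23.43   # average output token throughput (tokens/sec)
req_per_s = 1.14           # average request throughput (requests/sec)
ttft_s = 14.06             # average time to first token (s)
itl_s = 0.49               # average inter-token latency (s)

# Average output tokens per request = token throughput / request throughput.
tokens_per_request = output_tok_per_s / req_per_s          # ~20.6 tokens

# End-to-end request latency ~= TTFT + (tokens - 1) * ITL.
latency_s = ttft_s + (tokens_per_request - 1) * itl_s      # ~23.6 s

# Little's law: in-flight requests ~= arrival rate * average latency.
estimated_concurrency = req_per_s * latency_s              # ~27

print(f"tokens/request ~ {tokens_per_request:.1f}")
print(f"request latency ~ {latency_s:.1f} s")
print(f"implied concurrency ~ {estimated_concurrency:.1f}")
```

The implied concurrency (~27) lands close to the configured value of 25, which suggests the row's throughput and latency figures are mutually consistent.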

| Concurrency | Average Output Token Throughput (tokens/sec) | Average Request Throughput (requests/sec) | Average Time to First Token (s) | Average Inter-Token Latency (s) |
|---|---|---|---|---|
| 1 | 15.95 | 0.45 | 1.20 | 0.03 |
| 5 | 25.22 | 0.73 | 2.40 | 0.13 |
| 25 | 28.11 | 0.81 | 13.39 | 0.49 |
| 50 | 27.43 | 0.82 | 42.36 | 0.49 |
| 100 | 28.01 | 0.81 | 97.03 | 0.49 |
| 150 | 27.71 | 0.82 | 147.43 | 0.49 |
| 200 | 27.47 | 0.81 | 193.06 | 0.49 |
| 250 | 28.48 | 0.82 | 249.47 | 0.49 |

| Concurrency | Average Output Token Throughput (tokens/sec) | Average Request Throughput (requests/sec) | Average Time to First Token (s) | Average Inter-Token Latency (s) |
|---|---|---|---|---|
| 1 | 11.49 | 0.36 | 1.76 | 0.03 |
| 5 | 16.32 | 0.53 | 3.54 | 0.19 |
| 25 | 17.34 | 0.56 | 28.18 | 0.49 |
| 50 | 17.26 | 0.56 | 69.32 | 0.49 |
| 100 | 16.84 | 0.55 | 147.70 | 0.50 |
| 150 | 16.87 | 0.56 | 217.07 | 0.51 |
| 200 | 16.11 | 0.52 | 276.02 | 0.51 |
| 250 | 16.72 | 0.54 | 361.77 | 0.51 |