NVIDIA H200 DeepSeek-V3 Benchmark Report
Overview
This NVIDIA H200 DeepSeek-V3 benchmark report evaluates the performance of the model running on an SGLang server using 8x NVIDIA H200 GPUs and 2x AMD EPYC 9654 CPUs (192 cores total). The goal was to analyze latency, throughput, and scalability under different request loads.
System Configuration
Hardware:
- 8x NVIDIA H200 GPUs (141 GB HBM3e each)
- 2x AMD EPYC 9654 CPUs (192 cores, 384 threads)
- 1.65 TB RAM
Software:
- OS: Ubuntu 20.04.6 LTS (Focal Fossa)
- Kernel: 5.15.0-131-generic #141-Ubuntu SMP
- NVIDIA Driver Version: 565.57.01, CUDA Version: 12.7
- SGLang Version: v0.4.2
Test Setup
SGLang server was launched with:
python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-V3 --tp 8 --trust-remote-code --enable-torch-compile --mem-fraction-static 0.8 --disable-cuda-graph --enable-dp-attention
Parameters:
- enable-torch-compile: compiles the model with torch.compile to reduce Python overhead and fuse operations, improving speed and efficiency
- mem-fraction-static: controls the fraction of GPU memory that is statically allocated (0.8 here), which helps avoid fragmentation
- disable-cuda-graph: disables CUDA Graphs for compatibility purposes
- enable-dp-attention: enables data-parallel attention, which can improve throughput in high-QPS scenarios for DeepSeek models
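Before starting the benchmark sweep, it is worth confirming that the server is up and serving requests. The snippet below is a minimal sanity-check sketch, assuming SGLang's default port 30000 and its OpenAI-compatible completions endpoint; adjust the host, port, or model name if the server was launched differently.

```python
# Minimal sanity check against the running SGLang server (assumed defaults:
# localhost, port 30000, OpenAI-compatible API). Not part of the benchmark itself.
import requests

BASE_URL = "http://localhost:30000"  # change if --port or --host was set at launch

# Confirm the server is reachable before kicking off long benchmark runs.
health = requests.get(f"{BASE_URL}/health", timeout=5)
print("health:", health.status_code)

# Send one short completion request to verify end-to-end generation works.
resp = requests.post(
    f"{BASE_URL}/v1/completions",
    json={
        "model": "deepseek-ai/DeepSeek-V3",
        "prompt": "Say hello in one short sentence.",
        "max_tokens": 32,
    },
    timeout=120,
)
print(resp.json()["choices"][0]["text"])
```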
Benchmark was launched with:
python3 -m sglang.bench_serving --backend sglang --dataset-name random --random-range-ratio 1 --num-prompt 300 --request-rate 1 --random-input 1024 --random-output 1024
python3 -m sglang.bench_serving --backend sglang --dataset-name random --random-range-ratio 1 --num-prompt 600 --request-rate 2 --random-input 1024 --random-output 1024
python3 -m sglang.bench_serving --backend sglang --dataset-name random --random-range-ratio 1 --num-prompt 1200 --request-rate 4 --random-input 1024 --random-output 1024
python3 -m sglang.bench_serving --backend sglang --dataset-name random --random-range-ratio 1 --num-prompt 2400 --request-rate 8 --random-input 1024 --random-output 1024
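The four runs differ only in the number of prompts and the request rate, so they can also be driven as a single sweep. Below is a small convenience sketch (an assumed workflow, not the exact procedure used for this report) that shells out to sglang.bench_serving with the same flags as above:

```python
# Run the four benchmark configurations back-to-back against the already-running server.
import subprocess

# (num_prompts, request_rate) pairs matching the four commands above
SWEEP = [(300, 1), (600, 2), (1200, 4), (2400, 8)]

for num_prompts, rate in SWEEP:
    cmd = [
        "python3", "-m", "sglang.bench_serving",
        "--backend", "sglang",
        "--dataset-name", "random",
        "--random-range-ratio", "1",
        "--num-prompt", str(num_prompts),
        "--request-rate", str(rate),
        "--random-input", "1024",
        "--random-output", "1024",
    ]
    print("Running:", " ".join(cmd))
    subprocess.run(cmd, check=True)
```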
Benchmark Results
Each benchmark run used the random dataset with input and output lengths fixed at 1,024 tokens (range ratio 1), simulating real-world workloads at varying request rates. The tests measured key performance metrics related to token-generation speed and efficiency (a short sketch after this list shows how they relate):
- TTFT (Time To First Token): The time elapsed from the moment an inference request is sent until the model produces its first output token. A lower TTFT indicates faster response times, critical for interactive applications.
- ITL (Inter-Token Latency): The time taken between the generation of consecutive tokens in an auto-regressive decoding sequence. ITL measures the step-by-step processing speed, tracking the interval from the completion of token i to the completion of token i+1.
- TPOT (Time Per Output Token): The average time required to generate each subsequent token after the first token is produced. This metric reflects the overall efficiency of the model’s token-generation process.
- Output Token Throughput: The number of tokens generated per second, indicating how efficiently the model processes requests. Higher throughput means the model can handle larger batch sizes and higher concurrency.
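The sketch below is an illustrative helper (not code from sglang.bench_serving) showing how these four metrics relate, given each request's send time and the arrival time of every generated token:

```python
from statistics import median

def request_metrics(send_time, token_times):
    """Per-request metrics from the send timestamp and each token's arrival time.

    Assumes the request produced at least two output tokens.
    """
    ttft = token_times[0] - send_time                             # Time To First Token
    itls = [b - a for a, b in zip(token_times, token_times[1:])]  # Inter-Token Latencies
    e2e = token_times[-1] - send_time                             # end-to-end latency
    tpot = (e2e - ttft) / (len(token_times) - 1)                  # Time Per Output Token
    return {"ttft": ttft, "median_itl": median(itls), "tpot": tpot}

def output_token_throughput(total_output_tokens, benchmark_duration_s):
    """Tokens generated per second across the whole benchmark run."""
    return total_output_tokens / benchmark_duration_s
```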
| Request Rate (req/s) | Requests Completed | Median TTFT (ms) | Median ITL (ms) | Median TPOT (ms) | Output Token Throughput (tokens/s) |
|---|---|---|---|---|---|
| 1 | 300 | 942 | 185 | 229 | 608 |
| 2 | 600 | 928 | 189 | 246 | 1,244 |
| 4 | 1,200 | 1,148 | 224 | 302 | 2,398 |
| 8 | 2,400 | 1,944 | 248 | 511 | 2,288 |
Analysis and Insights
Latency Analysis:
The Time To First Token (TTFT) decreases slightly between request rates of 1 and 2 but rises at higher loads (4 and 8 RPS), suggesting that the system handles moderate load comfortably before resource contention and request queuing begin to push latency up.
Inter-Token Latency (ITL) follows a similar trend, increasing as the request rate grows, indicating that the per-token processing time is affected by concurrency overhead.
Throughput Trends:
Output token throughput increases significantly with higher request rates, peaking at 2398 tokens/sec at 4 RPS before slightly dropping at 8 RPS (2288 tokens/sec). This suggests the system reaches peak efficiency around 4 RPS before memory bandwidth or compute limitations impact performance.
The drop at 8 RPS could be a result of increased scheduling overhead or memory contention across multiple GPUs.
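A rough back-of-the-envelope check using the table above makes the saturation point more concrete: with 1,024 output tokens per request, the offered output-token load at each request rate can be compared with the measured throughput. At low request rates the gap is dominated by the drain period after the last request arrives, so the comparison is most telling at 4 and 8 RPS, where doubling the offered load yields no additional throughput.

```python
# Offered output-token load vs. measured throughput (values from the results table).
OUTPUT_TOKENS_PER_REQUEST = 1024

measured_throughput = {1: 608, 2: 1244, 4: 2398, 8: 2288}  # tokens/s

for rate, measured in measured_throughput.items():
    offered = rate * OUTPUT_TOKENS_PER_REQUEST  # tokens/s requested by the load generator
    print(f"{rate} RPS: offered {offered:>5} tok/s, measured {measured:>5} tok/s")
```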
Scalability Observations:
While the system scales well from 1 to 4 RPS, performance gains diminish beyond 4 RPS, highlighting a potential optimization opportunity for parallelism or memory management.
Conclusion
The NVIDIA H200 GPUs demonstrate strong performance in running DeepSeek-V3 inference workloads with excellent token throughput and competitive latency metrics. The setup efficiently scales up to 4 RPS, beyond which diminishing returns are observed due to system constraints.
Further optimizations could enhance throughput at higher concurrency levels, particularly in reducing TTFT and ITL at peak loads.
This benchmark underscores the NVIDIA H200’s capabilities for large-scale AI workloads, inference tasks, and high-throughput applications in production environments.
Looking forward, further kernel tuning and continued optimization of inference engines such as SGLang should unlock additional scalability and throughput, improving the efficiency of high-performance LLM deployments.