Executive summary
Benchmarks show the NVIDIA RTX PRO™ 6000 Blackwell running on Akamai Cloud delivers up to 1.63× higher inference throughput than the H100, achieving 24,240 TPS per server at 100 concurrent requests.
Benchmarking Akamai Inference Cloud
This week, Akamai announced the launch of Akamai Inference Cloud. We’ve combined our expertise in globally distributed architectures with NVIDIA Blackwell AI infrastructure to radically rethink and extend the accelerated computing needed to unlock AI's true potential.
The Akamai Inference Cloud platform combines NVIDIA RTX PRO™ Servers — featuring NVIDIA RTX PRO 6000 Blackwell Server Edition GPUs, NVIDIA BlueField-3® DPUs, and NVIDIA AI Enterprise software — with Akamai's distributed cloud computing infrastructure and global edge network, which has more than 4,400 locations worldwide.
Efficient, versatile, and optimized GPUs
Distributed inference and next-generation agentic experiences require GPUs that are efficient, versatile, and optimized for concurrent, real-time workloads. The RTX PRO 6000 Blackwell checks all three boxes. Its FP4 precision mode delivers exceptional throughput at a fraction of the power and cost of datacenter-class GPUs, making it practical to deploy across hundreds of sites.
The architecture supports concurrent and multimodal workloads including text, vision, and speech on a single GPU, reducing the need for specialized accelerators and limiting unnecessary data movement across the network.
NVIDIA RTX Pro Servers are optimized for workloads such as agentic AI, industrial and physical AI, scientific computing, data analysis and simulation, visual computing, and enterprise applications.
NVIDIA highlights that these servers deliver up to 6x higher large language model (LLM) inference throughput, 4x faster synthetic data generation, 7x faster genome sequence alignment, 3x higher engineering simulation throughput, 4x greater real-time rendering performance, and 4x more concurrent multi-instance GPU workloads.
Performance validation
To validate performance, we tested NVIDIA RTX Pro 6000 Blackwell Server Edition GPUs running on Akamai Cloud and benchmarked them against NVIDIA H100 NVL 96GB GPUs in the NVIDIA LaunchPad environment.
Our goal was to understand how next-generation RTX Pro 6000 GPUs perform for real-world inference workloads compared with the industry’s current gold standard.
What the benchmarks show
The benchmark results confirm the design advantage of NVIDIA RTX Pro 6000 Blackwell on Akamai Cloud.
The 1.63x throughput uplift over H100 (FP8) shows that the RTX Pro 6000 Blackwell delivers data center–grade performance in a smaller, easier-to-deploy footprint ideal for distributed environments.
The 1.32x improvement moving from FP8 to FP4 demonstrates how NVIDIA’s precision efficiency directly translates to faster, more cost-efficient inference at the edge.
Sustained performance at 100+ concurrent requests validates the GPU’s ability to handle multi-tenant, latency-sensitive workloads across globally distributed inference.
Together, these results show that Blackwell’s efficiency and concurrency advantages make it the ideal foundation for Akamai’s distributed inference architecture, delivering high throughput, low latency, and scalable performance across our global network.
Benchmark overview
We followed NVIDIA’s benchmarking methodology to assess inference performance under consistent load conditions. In this post, we’ll walk through the setup, methodology, and key findings, and discuss what the results mean for running AI workloads on Akamai Cloud.
Setup
To assess NVIDIA RTX Pro 6000 GPUs on Akamai Cloud, we used Llama-3.3-Nemotron-Super-49B-v1.5, an LLM derived from Meta's Llama-3.3-70B-Instruct (the reference model). It is post-trained for reasoning, human chat preferences, and agentic tasks such as RAG and tool calling.
We used two NVIDIA inference microservices (NIM) profiles for the same model to compare precision modes and understand their impact on performance and efficiency. The profiles — tensorrt_llm-rtx6000_blackwell_sv-fp8-tp1-pp1-throughput-2bb5 and tensorrt_llm-rtx6000_blackwell_sv-nvfp4-tp1-pp1-throughput-2bb5 — are identical except for the precision setting.
The first uses FP8 (8-bit floating point) precision, while the second uses NVIDIA’s FP4 (4-bit floating point). NVIDIA’s FP4 version (NVFP4) is supported directly in NVIDIA Blackwell GPUs.
By running both, we aimed to observe how reducing numerical precision affects throughput and latency. NVFP4 delivers major performance and efficiency gains with less than 1% accuracy loss, enabling faster, lower-power inference at scale, while FP8 provides higher numerical accuracy. Comparing the two helps determine the best trade-off between speed, efficiency, and inference fidelity for real-world workloads.
We ran the tests on NVIDIA RTX Pro 6000 Blackwell Server Edition GPUs in Akamai Cloud's LAX data center. For comparison, we tested NVIDIA H100 GPUs in the NVIDIA LaunchPad environment.
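To make the setup concrete, here is a minimal sketch of how the two precision profiles could be pinned when launching the NIM container, using the Docker SDK for Python. The image tag, port mapping, and resource settings are illustrative assumptions rather than our exact deployment; NVIDIA NIM supports overriding the automatically selected profile via the NIM_MODEL_PROFILE environment variable.

```python
# Illustrative sketch (not our exact deployment): start the same NIM image twice,
# pinning each container to one of the two precision profiles compared above.
import os

import docker  # Docker SDK for Python

client = docker.from_env()

PROFILES = {
    "fp8": "tensorrt_llm-rtx6000_blackwell_sv-fp8-tp1-pp1-throughput-2bb5",
    "fp4": "tensorrt_llm-rtx6000_blackwell_sv-nvfp4-tp1-pp1-throughput-2bb5",
}

def launch_nim(precision: str, host_port: int):
    """Run a NIM container pinned to a single precision profile on one GPU."""
    return client.containers.run(
        # Assumed image name/tag; check NGC for the actual NIM image for this model.
        image="nvcr.io/nim/nvidia/llama-3.3-nemotron-super-49b-v1.5:latest",
        environment={
            "NGC_API_KEY": os.environ["NGC_API_KEY"],
            # NIM_MODEL_PROFILE overrides automatic profile selection.
            "NIM_MODEL_PROFILE": PROFILES[precision],
        },
        device_requests=[docker.types.DeviceRequest(count=1, capabilities=[["gpu"]])],
        ports={"8000/tcp": host_port},  # NIM serves an OpenAI-compatible API on 8000
        shm_size="16g",
        detach=True,
    )

fp8_server = launch_nim("fp8", 8000)
fp4_server = launch_nim("fp4", 8001)
```

Running the two containers side by side on identical GPUs keeps everything but the precision setting constant, which is the comparison reflected in the tables below.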
Methodology
For this benchmark, we ran a smoke test designed to measure baseline inference performance under realistic load conditions. Each request processed 200 input tokens and generated 200 output tokens, representing a typical short prompt-and-response interaction for an LLM.
To test scalability and consistency, we executed 100 concurrent runs, allowing us to observe throughput and latency behavior as the system handled a sustained volume of simultaneous inferences. This approach provided a controlled yet representative snapshot of how the model and hardware perform under production-like workloads.
We measured two key metrics: time to first token (TTFT) and tokens per second (TPS). TTFT, measured in milliseconds, captures how quickly the model begins generating a response after receiving a prompt — an important indicator of latency and user-perceived responsiveness. TPS measures overall throughput, showing how many tokens the system can generate per second once generation begins.
Together, these metrics provide a balanced view of real-world performance, reflecting both the speed of initial inference and the sustained output efficiency under load.
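For illustration, the following is a minimal sketch of how such a smoke test could be driven against a NIM's OpenAI-compatible endpoint: it issues 100 concurrent streaming requests and records TTFT and aggregate TPS. The endpoint URL, the prompt construction, and the one-token-per-chunk approximation are assumptions for illustration, not the exact harness behind the published numbers.

```python
# Illustrative 200_200 smoke test: 100 concurrent streaming requests, measuring
# time to first token (TTFT) and aggregate tokens per second (TPS).
import asyncio
import json
import time

import httpx

NIM_URL = "http://localhost:8000/v1/chat/completions"  # assumed local NIM endpoint
MODEL = "nvidia/llama-3.3-nemotron-super-49b-v1.5"
CONCURRENCY = 100
PROMPT = "word " * 200  # stand-in for a roughly 200-token input

async def one_request(client: httpx.AsyncClient):
    """Stream one completion; return (TTFT seconds, output tokens, total seconds)."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": PROMPT}],
        "max_tokens": 200,
        "stream": True,
    }
    start, ttft, tokens = time.perf_counter(), None, 0
    async with client.stream("POST", NIM_URL, json=payload, timeout=300) as resp:
        async for line in resp.aiter_lines():
            if not line.startswith("data: ") or line.endswith("[DONE]"):
                continue
            chunk = json.loads(line[len("data: "):])
            if chunk.get("choices") and chunk["choices"][0]["delta"].get("content"):
                if ttft is None:
                    ttft = time.perf_counter() - start
                tokens += 1  # approximation: one streamed chunk ~= one token
    return ttft, tokens, time.perf_counter() - start

async def main():
    async with httpx.AsyncClient() as client:
        wall_start = time.perf_counter()
        results = await asyncio.gather(*(one_request(client) for _ in range(CONCURRENCY)))
        wall = time.perf_counter() - wall_start
    ttfts = [r[0] for r in results if r[0] is not None]
    total_tokens = sum(r[1] for r in results)
    print(f"avg TTFT: {1000 * sum(ttfts) / len(ttfts):.2f} ms")
    print(f"aggregate TPS: {total_tokens / wall:.2f}")

asyncio.run(main())
```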
As part of our benchmarking methodology, we ran two sets of tests to evaluate the performance characteristics of the NVIDIA RTX Pro 6000 Blackwell Server Edition GPUs.
FP4 vs. FP8 precision comparison
We tested two NIM profiles on the same model, one using FP8 precision and another using FP4 precision, to measure the impact of NVIDIA's new FP4 (NVFP4) quantization on inference performance. NVIDIA has highlighted FP4 as a major advancement for efficiency and throughput.
RTX Pro 6000 vs. H100 GPU comparison
We then compared the RTX 6000 Blackwell results against H100 GPUs running in the NVIDIA LaunchPad environment to assess real-world inference advantages by looking at the two NIM profiles: FP8 and FP4. This allowed us to evaluate how the RTX 6000 performs not only across precision modes but also relative to NVIDIA’s current data center GPU standard.
Detailed results
We identified the optimal concurrency (C) level as 100, meaning that at 100 simultaneous inference requests we observed the most stable and representative performance results. At C = 100, moving from FP8 to FP4 precision on the RTX Pro 6000 resulted in a 1.32x performance improvement, showing the efficiency gains of NVIDIA's FP4 quantization.
When compared against the H100 running at FP8 precision, the RTX Pro 6000 Blackwell Server delivered a 1.63x performance improvement at NVFP4 precision. Even at FP8, the Blackwell server demonstrated a 1.21x advantage, indicating inference gains from the architecture itself, beyond the move to FP4.
Overall, at this concurrency level, the RTX Pro 6000 Blackwell Server achieved 3,030.01 tokens per second (TPS) per GPU, which scales to as much as 24,240.08 TPS per server with our infrastructure as a service (IaaS) VM offerings, highlighting the strong inference performance and scalability of the Blackwell architecture on Akamai Cloud.
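For reference, the per-server figure is a straightforward scale-up of the single-GPU (tensor parallelism of 1) result, assuming an eight-GPU RTX PRO Server configuration: 3,030.01 TPS per GPU × 8 GPUs = 24,240.08 TPS per server.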
Test 1: FP8 vs. FP4 precision comparison
Performance results on the RTX Pro 6000 Blackwell, comparing FP8 and FP4.
LAX: NVIDIA RTX Pro 6000 Blackwell Server FP8
| Model | NIM model profile | Use case | Concurrency | TTFT (ms) | TPS |
|---|---|---|---|---|---|
| nvidia/llama-3.3-nemotron-super-49b-v1.5 | tensorrt_llm-rtx6000_blackwell_sv-fp8-tp1-pp1-throughput-2bb5 | 200_200 | 1 | 44.82 | 27.42 |
| nvidia/llama-3.3-nemotron-super-49b-v1.5 | tensorrt_llm-rtx6000_blackwell_sv-fp8-tp1-pp1-throughput-2bb5 | 200_200 | 100 | 102.03 | 2256.3 |
| nvidia/llama-3.3-nemotron-super-49b-v1.5 | tensorrt_llm-rtx6000_blackwell_sv-fp8-tp1-pp1-throughput-2bb5 | 200_200 | 200 | 138.66 | 3606.04 |
LAX: NVIDIA RTX PRO 6000 Blackwell Server FP4
| Model | NIM model profile | Use case | Concurrency | TTFT (ms) | TPS | FP4 gain |
|---|---|---|---|---|---|---|
| nvidia/llama-3.3-nemotron-super-49b-v1.5 | tensorrt_llm-rtx6000_blackwell_sv-nvfp4-tp1-pp1-throughput-2bb5 | 200_200 | 1 | 47.92 | 29.68 | 1.08x |
| nvidia/llama-3.3-nemotron-super-49b-v1.5 | tensorrt_llm-rtx6000_blackwell_sv-nvfp4-tp1-pp1-throughput-2bb5 | 200_200 | 100 | 94.45 | 3030.01 | 1.32x |
| nvidia/llama-3.3-nemotron-super-49b-v1.5 | tensorrt_llm-rtx6000_blackwell_sv-nvfp4-tp1-pp1-throughput-2bb5 | 200_200 | 200 | 3663.26 | 3854.76 | 1.07x |
Test 2: RTX Pro 6000 Blackwell Server vs. H100 GPU comparison
Performance results comparing H100 NVL FP8 vs. RTX Pro 6000 Blackwell Server FP8 and FP4.
LaunchPad: H100 NVL FP8
| Model | NIM model profile | Use case | Concurrency | TTFT (ms) | TPS |
|---|---|---|---|---|---|
| nvidia/llama-3.3-nemotron-super-49b-v1.5 | tensorrt_llm-h100_nvl-fp8-tp1-pp1-throughput-2321 | 200_200 | 1 | 39.52 | 42.46 |
| nvidia/llama-3.3-nemotron-super-49b-v1.5 | tensorrt_llm-h100_nvl-fp8-tp1-pp1-throughput-2321 | 200_200 | 100 | 1612.03 | 1863.08 |
| nvidia/llama-3.3-nemotron-super-49b-v1.5 | tensorrt_llm-h100_nvl-fp8-tp1-pp1-throughput-2321 | 200_200 | 200 | 12587.3 | 1828.03 |
LaunchPad: NVIDIA RTX PRO 6000 Blackwell Server FP8
| Model | NIM model profile | Use case | Concurrency | TTFT (ms) | TPS |
|---|---|---|---|---|---|
| nvidia/llama-3.3-nemotron-super-49b-v1.5 | tensorrt_llm-rtx6000_blackwell_sv-fp8-tp1-pp1-throughput-2bb5 | 200_200 | 1 | 59.61 | 19.52 |
| nvidia/llama-3.3-nemotron-super-49b-v1.5 | tensorrt_llm-rtx6000_blackwell_sv-fp8-tp1-pp1-throughput-2bb5 | 200_200 | 100 | 243.68 | 1040.33 |
| nvidia/llama-3.3-nemotron-super-49b-v1.5 | tensorrt_llm-rtx6000_blackwell_sv-fp8-tp1-pp1-throughput-2bb5 | 200_200 | 200 | 415.9 | 1344.73 |
LaunchPad: NVIDIA RTX PRO 6000 Blackwell Server FP4
| Model | NIM model profile | Use case | Concurrency | TTFT (ms) | TPS | FP4 gain |
|---|---|---|---|---|---|---|
| nvidia/llama-3.3-nemotron-super-49b-v1.5 | tensorrt_llm-rtx6000_blackwell_sv-nvfp4-tp1-pp1-throughput-2bb5 | 200_200 | 1 | 81.98 | 23.65 | 1.21x |
| nvidia/llama-3.3-nemotron-super-49b-v1.5 | tensorrt_llm-rtx6000_blackwell_sv-nvfp4-tp1-pp1-throughput-2bb5 | 200_200 | 100 | 344.24 | 1848.96 | 1.78x |
| nvidia/llama-3.3-nemotron-super-49b-v1.5 | tensorrt_llm-rtx6000_blackwell_sv-nvfp4-tp1-pp1-throughput-2bb5 | 200_200 | 200 | 6660.54 | 1997.3 | 1.49x |
Conclusion
This benchmark set out to evaluate how NVIDIA RTX Pro 6000 Blackwell Server Edition GPUs perform for LLM inference on Akamai Cloud, and how they compare with NVIDIA H100 GPUs under comparable conditions. Using NVIDIA's recommended benchmarking methodology, we tested both FP8 and FP4 precision modes to understand performance, efficiency, and latency trade-offs.
The results clearly show that FP4 delivers measurable gains, with a 1.32x improvement in throughput over FP8 on the RTX Pro 6000. When compared with the H100 at FP8, the RTX Pro 6000 (FP4) achieved a 1.63x performance improvement, underscoring the potential of the Blackwell architecture for inference workloads.
These findings demonstrate that RTX Pro 6000 GPUs running on Akamai's distributed cloud can deliver high throughput and efficient scaling for real-world AI inference at lower cost and latency. For teams evaluating GPU options, this combination offers a compelling balance of speed, efficiency, and accessibility across a global infrastructure footprint.
Gain access
Register to gain access to RTX Pro 6000 Blackwell Server Edition on Akamai Inference Cloud.