Exxact HGX-2 TensorEX Server Smashes Deep Learning Benchmarks
In this post, we present deep learning benchmarks for TensorFlow on an Exxact TensorEX HGX-2 server. This behemoth of a deep learning server has 16 NVIDIA Tesla V100 GPUs.
We ran the standard “tf_cnn_benchmarks.py” benchmark script from TensorFlow’s GitHub repository. For comparison, we tested the following networks: ResNet-50, ResNet-152, Inception V3, and VGG-16. We also compared FP16 and FP32 performance, using a batch size of 256 per GPU (except for ResNet-152 in FP32, where the batch size was 64). The same tests were run on 1, 2, 4, 8, and 16 GPU configurations. All benchmarks used ‘vanilla’ TensorFlow settings for FP16 and FP32.
Notable HGX-2 Server Features
- 16x NVIDIA Tesla V100 SXM3
- 81,920 NVIDIA CUDA Cores
- 10,240 NVIDIA Tensor Cores
- 0.5 TB Total GPU Memory
- NVSwitch powered by NVLink, with 2.4 TB/sec aggregate bandwidth
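The aggregate figures in this list follow from per-GPU specifications. As a quick sanity check, the sketch below multiplies them out. The per-GPU core counts (5,120 CUDA cores and 640 Tensor Cores per Tesla V100) are taken from NVIDIA's published V100 specifications rather than from this post, so treat them as an outside assumption; the 32 GB of memory per GPU matches the system table at the end of the post.

```python
# Per-GPU figures. Core counts are NVIDIA's published Tesla V100 specs
# (an assumption brought in from outside this post); memory per GPU is
# taken from the system table.
NUM_GPUS = 16
CUDA_CORES_PER_GPU = 5120
TENSOR_CORES_PER_GPU = 640
MEMORY_GB_PER_GPU = 32

total_cuda_cores = NUM_GPUS * CUDA_CORES_PER_GPU
total_tensor_cores = NUM_GPUS * TENSOR_CORES_PER_GPU
total_memory_gb = NUM_GPUS * MEMORY_GB_PER_GPU
total_memory_tb = total_memory_gb / 1024

print(f"CUDA cores:   {total_cuda_cores:,}")
print(f"Tensor Cores: {total_tensor_cores:,}")
print(f"GPU memory:   {total_memory_gb} GB ({total_memory_tb} TB)")
```

The totals agree with the feature list: 81,920 CUDA cores, 10,240 Tensor Cores, and 512 GB (0.5 TB) of GPU memory.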
Exxact TensorEX HGX-2 Deep Learning Benchmarks: FP16
Run these FP16 benchmarks
Set num_gpus to the number of GPUs you want to test, and change model to the desired model architecture.
python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=256 --model=resnet50 --variable_update=parameter_server --use_fp16=True
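To repeat the FP16 run across every GPU configuration in this post, the command can be generated rather than edited by hand. The following is a minimal sketch, not the script we used: the helper name `fp16_command` is hypothetical, and it only reuses the flags shown in the command above.

```python
GPU_COUNTS = [1, 2, 4, 8, 16]  # GPU configurations tested in this post


def fp16_command(num_gpus, model="resnet50", batch_size=256):
    """Return the FP16 benchmark command as an argument list."""
    return [
        "python", "tf_cnn_benchmarks.py",
        f"--num_gpus={num_gpus}",
        f"--batch_size={batch_size}",
        f"--model={model}",
        "--variable_update=parameter_server",
        "--use_fp16=True",
    ]


fp16_commands = [fp16_command(n) for n in GPU_COUNTS]

for cmd in fp16_commands:
    # Print only. To launch a run, pass the list to
    # subprocess.run(cmd, check=True) from the script's directory.
    print(" ".join(cmd))
```

Keeping each command as an argument list means it can be handed to `subprocess.run` without any shell quoting.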
Exxact TensorEX HGX-2 Deep Learning Benchmarks: FP32
Run these FP32 benchmarks
To run FP32, remove the use_fp16 flag and set num_gpus to the number of GPUs you want to test. Change model to the desired architecture. For ResNet-152, we reduced batch_size to 64.
python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=256 --model=resnet50 --variable_update=parameter_server
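The FP32 runs differ from the FP16 runs in two ways: the use_fp16 flag is dropped, and ResNet-152 uses a smaller batch size. A rough sketch of that rule follows. The function names are hypothetical, and the model identifiers (`resnet152`, `inception3`, `vgg16`) are our assumption about how tf_cnn_benchmarks spells them, so check them against your checkout of the script.

```python
# Model names as tf_cnn_benchmarks is assumed to spell them; verify locally.
MODELS = ["resnet50", "resnet152", "inception3", "vgg16"]


def fp32_batch_size(model):
    """Per-GPU FP32 batch size: 64 for ResNet-152, 256 otherwise."""
    return 64 if model == "resnet152" else 256


def fp32_command(model, num_gpus=1):
    """Build the FP32 command: same flags as FP16, minus --use_fp16."""
    return (
        "python tf_cnn_benchmarks.py"
        f" --num_gpus={num_gpus}"
        f" --batch_size={fp32_batch_size(model)}"
        f" --model={model}"
        " --variable_update=parameter_server"
    )


fp32_commands = {model: fp32_command(model) for model in MODELS}

for command in fp32_commands.values():
    print(command)
```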
Other Notes and Future Plans for HGX-2
The HGX-2 GPU server is an absolute monster for deep learning or any GPU-powered HPC task. In the future, we would like to run further benchmarks on more models, as well as with other acceleration methods such as XLA for TensorFlow, where we would expect significant performance gains. Training models with even larger batch sizes is another area we will consider exploring.
| System | Exxact TensorEX HGX-2 |
| GPU | 16x NVIDIA Tesla V100 32 GB SXM3 |
| CPU | 2x Intel Xeon Platinum 8168 |
| RAM | 1.5 TB DDR4 |
| SSD (OS) | 2x 1 TB NVMe (RAID 1) |
| SSD (Data) | 32 TB NVMe Storage |
| Batch Size | 256 per device* |
*Except ResNet-152 in FP32, which used a batch size of 64 per device.
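Because the batch size is set per device, the number of images processed in each training step grows with the GPU count. The sketch below works through that arithmetic, assuming the batch_size flag applies to each GPU individually, as the table indicates.

```python
PER_DEVICE_BATCH = 256
GPU_COUNTS = [1, 2, 4, 8, 16]

# Global (per-step) batch size = per-device batch size x number of GPUs.
global_batch = {n: PER_DEVICE_BATCH * n for n in GPU_COUNTS}

for n, size in global_batch.items():
    print(f"{n:>2} GPUs -> {size} images per step")
```

At 16 GPUs this works out to 4,096 images per step, or 1,024 for the ResNet-152 FP32 runs at 64 per device.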
Interested in More Deep Learning Benchmarks?
- RTX 2080 Ti Deep Learning Benchmarks for TensorFlow
- TITAN RTX Deep Learning Benchmarks for Tensorflow
- NVIDIA Quadro RTX 6000 GPU Benchmarks for TensorFlow
- Quadro RTX 8000 Deep Learning Benchmarks for TensorFlow