For this post, we conducted deep learning performance benchmarks for TensorFlow using the new NVIDIA Quadro RTX 8000 GPUs. Our Exxact Valence Workstation was equipped with 4x Quadro RTX 8000s, giving us an awesome 192 GB of GPU memory for our system. We ran the standard tf_cnn_benchmarks.py benchmark script (found in the official TensorFlow GitHub repository) with basic ‘vanilla’ settings on the following networks: ResNet-50, ResNet-152, Inception v3, Inception v4, VGG-16, AlexNet, and NASNet. We also compared FP16 to FP32 performance using ‘typical’ batch sizes (64 in most cases), then incrementally doubled the batch size until we hit a memory error. We ran the same tests using 1, 2, and 4 GPU configurations.
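The batch-size search described above is just a doubling loop: keep doubling until the run fails with an out-of-memory error, and report the last batch size that succeeded. A minimal sketch, where `run_benchmark` is a hypothetical stand-in for launching tf_cnn_benchmarks.py and `MemoryError` stands in for the GPU out-of-memory failure:

```python
def find_max_batch_size(run_benchmark, start=64, limit=65536):
    """Double the batch size until the benchmark fails with an
    out-of-memory error; return the largest batch size that ran.

    run_benchmark: callable taking a batch size, raising MemoryError
    (a stand-in for a GPU OOM) when the batch does not fit.
    """
    best = None
    batch = start
    while batch <= limit:
        try:
            run_benchmark(batch)
        except MemoryError:
            break
        best = batch
        batch *= 2
    return best
```

This is how the "Large Batch Size" columns in the tables below were found: the reported batch size is the largest power-of-two multiple of 64 that fit in the 48 GB of a single Quadro RTX 8000.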

Key Points and Observations

  • In most scenarios, large-batch training showed impressive gains in images/sec compared to smaller batch sizes. This is especially true when scaling to the 4 GPU configuration.
    • AlexNet and VGG-16 performed better with smaller batch sizes on a single GPU, but larger batch sizes won out on these models when scaling up to 4 GPUs.
  • ResNet-50 and ResNet-152 showed massive scaling when going from 1 to 2 to 4 GPUs: a mind-blowing 2,338.84 images/sec for ResNet-50 and 1,062.13 images/sec for ResNet-152 at FP16!
  • FP16 showed impressive gains in images/sec across most models when using 4 GPUs (AlexNet being the exception).
  • With 48 GB of memory, the Quadro RTX 8000 is ideal for training networks that require large batch sizes that would otherwise be limited on lower-end GPUs.
  • The Quadro RTX 8000 is an ideal choice for deep learning if you’re restricted to a workstation or single-server form factor and want maximum GPU memory.
  • Our workstations with the Quadro RTX 8000 can also train state-of-the-art NLP Transformer networks that require large batch sizes for best performance, a popular application in the fast-growing data science market.

Quadro RTX 8000 Deep Learning Benchmark: Images/sec at FP32, Batch Size 64

Model 1 GPU 2 GPU 4 GPU Batch Size
ResNet50 314.87 590.3 952.8 64
ResNet152 127.71 232.42 418.44 64
InceptionV3 207.53 386.86 655.45 64
InceptionV4 102.41 191.4 337.44 64
VGG16 188.91 337.38 536.95 64
NASNET 160.42 280.07 510.15 64
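The multi-GPU scaling claimed in the key points can be quantified as scaling efficiency: throughput on N GPUs divided by N times the single-GPU throughput. A short sketch using the ResNet-50 FP32, batch-size-64 numbers from the table above:

```python
def scaling_efficiency(single_gpu_ips, multi_gpu_ips, num_gpus):
    """Fraction of ideal linear scaling achieved by a multi-GPU run
    (1.0 means perfectly linear scaling)."""
    return multi_gpu_ips / (num_gpus * single_gpu_ips)

# ResNet-50, FP32, batch size 64 (values from the table above)
eff_2 = scaling_efficiency(314.87, 590.30, 2)   # ~0.94
eff_4 = scaling_efficiency(314.87, 952.80, 4)   # ~0.76
```

At batch size 64, ResNet-50 holds about 94% of linear scaling on 2 GPUs and about 76% on 4; the large-batch tables below recover much of that gap, which is why large batches matter for the 4 GPU configuration.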


Run these benchmarks 

Set num_gpus to the number of GPUs you want to test, and model to the desired architecture.

python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=64 --model=resnet50 --variable_update=parameter_server

Quadro RTX 8000 Deep Learning Benchmark: Images/sec at FP32, Large Batch Size

Model 1 GPU 2 GPU 4 GPU Batch Size
ResNet50 322.66 622.41 1213.3 512
ResNet152 137.12 249.58 452.77 256
InceptionV3 216.27 412.75 716.47 256
InceptionV4 105.2 201.49 345.79 256
VGG16 166.55 316.46 617 512
NASNET 187.69 348.71 614 512


Run these benchmarks 

Set num_gpus to the number of GPUs you want to test, model to the desired architecture, and batch_size to the desired mini-batch size.

python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=512 --model=resnet50 --variable_update=parameter_server

Quadro RTX 8000 Deep Learning Benchmark: Images/sec at FP16, Batch Size 64

Model 1 GPU 2 GPU 4 GPU Batch Size
ResNet50 544.16 972.89 1565.18 64
ResNet152 246.56 412.25 672.87 64
InceptionV3 334.28 596.65 1029.24 64
InceptionV4 178.41 327.89 540.52 64
VGG16 347.01 570.53 637.97 64
NASNET 155.44 282.78 517.06 64


Run these benchmarks 

Set num_gpus to the number of GPUs you want to test, and model to the desired architecture.

python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=64 --model=resnet50 --variable_update=parameter_server --use_fp16=True

Quadro RTX 8000 Deep Learning Benchmark: Images/sec at FP16, Large Batch Size

Model 1 GPU 2 GPU 4 GPU Batch Size
ResNet50 604.76 1184.52 2338.84 1024
ResNet152 285.85 529.05 1062.13 512
InceptionV3 391.3 754.94 1471.66 512
InceptionV4 203.67 384.29 762.32 512
VGG16 276.16 528.88 983.85 512
NASNET 196.52 367.6 726.85 512
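The FP16 gains called out in the key points are simply the ratio of FP16 to FP32 throughput at matched configurations. A quick check using the 4-GPU, large-batch values from the FP32 and FP16 tables above:

```python
def fp16_speedup(fp32_ips, fp16_ips):
    """Throughput ratio of FP16 (mixed-precision) over FP32 training."""
    return fp16_ips / fp32_ips

# 4-GPU, large-batch images/sec from the tables above
resnet50_speedup = fp16_speedup(1213.30, 2338.84)   # ~1.93x
resnet152_speedup = fp16_speedup(452.77, 1062.13)   # ~2.35x
```

Roughly a 2x throughput boost from FP16 on the ResNet models, on top of the memory savings that allow the larger batch sizes in the first place.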


Run these benchmarks 

Set num_gpus to the number of GPUs you want to test, model to the desired architecture, and batch_size to the desired mini-batch size.

python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=1024 --model=resnet50 --variable_update=parameter_server --use_fp16=True

Quadro RTX 8000 Deep Learning Benchmarks: Images/sec on AlexNet (FP32, FP16)

Configuration 1 GPU 2 GPU 4 GPU Batch Size
AlexNet FP16 (Large Batch) 5911.6 11456.11 21828.99 8192
AlexNet FP16 (Normal Batch) 6013.64 11275.54 14960.97 512
AlexNet FP32 (Large Batch) 2825.61 4421.97 8482.39 8192
AlexNet FP32 (Normal Batch) 4103.27 7814.04 10491.22 512


Run these deep learning benchmarks 

Set num_gpus to the number of GPUs you want to test, set batch_size to the desired mini-batch size, and omit the use_fp16 flag to run in FP32.

python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=8192 --model=alexnet --variable_update=parameter_server --use_fp16=True
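If you want to sweep all of the configurations reported in this post, the command lines can be generated programmatically instead of edited by hand. A minimal sketch; the flag names match the invocations shown above, and the helper function name is our own:

```python
def build_benchmark_cmd(num_gpus, batch_size, model, fp16=False):
    """Assemble a tf_cnn_benchmarks.py command line matching the
    invocations used throughout this post."""
    cmd = [
        "python", "tf_cnn_benchmarks.py",
        "--num_gpus=%d" % num_gpus,
        "--batch_size=%d" % batch_size,
        "--model=%s" % model,
        "--variable_update=parameter_server",
    ]
    if fp16:
        cmd.append("--use_fp16=True")
    return cmd

# Example: the 4-GPU AlexNet FP16 large-batch run
cmd = build_benchmark_cmd(4, 8192, "alexnet", fp16=True)
```

The resulting list can be passed to subprocess.call (or joined into a shell line) for each model, GPU count, and batch size you want to benchmark.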

System Specifications

System Exxact Valence Workstation
GPU 4 x NVIDIA Quadro RTX 8000
CPU Intel Core i7-7820X 3.6 GHz
RAM 32GB DDR4
SSD 480 GB SSD
HDD (data) 10 TB HDD
OS Ubuntu 18.04
NVIDIA Driver 410.79
CUDA Version 10
Python 2.7
TensorFlow 1.14
Docker Image tensorflow/tensorflow:nightly-gpu

Other Training Parameters

Dataset ImageNet (synthetic)
Mode: training
SingleSess: False
Batch Size: Varied
Num Batches: 100
Num Epochs: 0.08
Devices: ['/gpu:0']...(varied)
NUMA bind: False
Data format: NCHW
Optimizer: sgd
Variables: parameter_server

Benchmarks Coming Soon

  • Quadro RTX 8000 with XLA
  • Quadro RTX 6000 (XLA, non XLA)
  • RTX 2080 Ti (more extensive benchmarks)
  • TITAN RTX

That’s it for now! Have any questions? Let us know on social media.

https://www.facebook.com/exxactcorp/