Updated 6/11/2019 with XLA FP32 and XLA FP16 metrics.

For this post, we conducted deep learning performance benchmarks for TensorFlow using the new NVIDIA Quadro RTX 8000 GPUs. Our Exxact Valence Workstation was equipped with 4x Quadro RTX 8000s, giving us an impressive 192 GB of GPU memory for our system. To demonstrate, we ran the standard tf_cnn_benchmarks.py benchmark script (found in the official TensorFlow GitHub repository) on the following networks: ResNet-50, ResNet-152, Inception v3, Inception v4, VGG-16, AlexNet, and NASNet. For good measure, we compared FP16 to FP32 performance using ‘typical’ batch sizes (64 in most cases), then incrementally doubled the batch size until we hit a memory error. All tests ran on 1, 2, and 4 GPU configurations.
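The batch-size-doubling procedure described above can be sketched as a small shell loop. This is a dry-run sketch, not the exact script we used: `echo` stands in for the real `tf_cnn_benchmarks.py` invocation, and 8192 is an arbitrary upper bound for illustration.

```shell
#!/bin/sh
# Double the batch size until the benchmark fails (e.g. throws a memory error).
# Dry-run sketch: 'echo' stands in for the real python invocation.
batch=64
max_batch=0
while [ "$batch" -le 8192 ]; do
    # Replace 'echo' with the actual 'python tf_cnn_benchmarks.py ...' call to run it.
    if echo "python tf_cnn_benchmarks.py --num_gpus=4 --model=resnet50 --batch_size=$batch"
    then
        max_batch=$batch   # this batch size fit in GPU memory
    else
        break              # memory error: the previous batch size was the largest that fit
    fi
    batch=$((batch * 2))
done
echo "largest successful batch size: $max_batch"
```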

Key Points and Observations

  • In most scenarios, large batch size training showed impressive gains in images/sec compared to smaller batch sizes. This is especially true when scaling to the 4 GPU configuration.
    • AlexNet and VGG16 performed better with smaller batch sizes on a single GPU, but larger batch sizes performed better on these models when scaling up to 4 GPUs.
  • ResNet-50 and ResNet-152 showed massive scaling when going from 1 to 2 to 4 GPUs: a mind-blowing 4193.48 images/sec for ResNet-50 and 1621.96 images/sec for ResNet-152 at FP16 with XLA!
  • Using FP16 showed impressive gains in images/sec across most models when using 4 GPUs (the exception being AlexNet).
  • The Quadro RTX 8000 with 48 GB of RAM is ideal for training networks that require large batch sizes that would otherwise be limited on lower-end GPUs.
  • The Quadro RTX 8000 is an ideal choice for deep learning if you’re restricted to a workstation or single-server form factor and want maximum GPU memory.
  • Our workstations with Quadro RTX 8000s can also train state-of-the-art NLP Transformer networks that require large batch sizes for best performance, a popular application in the fast-growing data science market.
  • XLA significantly increases images/sec throughput across most models, with the most dramatic gains seen at FP16.
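The 1/2/4-GPU and FP16-vs-FP32 comparisons summarized above can be generated with a short loop. This sketch only prints the commands (using the batch-size-64 invocation from this post) rather than running them:

```shell
#!/bin/sh
# Print one benchmark command per (GPU count, precision) combination
# compared in this post; pipe the output to 'sh' to actually run them.
gen_commands() {
    for gpus in 1 2 4; do
        for fp16 in False True; do
            echo "python tf_cnn_benchmarks.py --num_gpus=$gpus --batch_size=64" \
                 "--model=resnet50 --variable_update=parameter_server --use_fp16=$fp16"
        done
    done
}
gen_commands
```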

Quadro RTX 8000 Deep Learning Benchmark Snapshot (FP16, FP32, XLA on/off)

Quadro RTX 8000 Deep Learning Benchmarks: FP16, XLA

Run these benchmarks 

Set --num_gpus to the number of GPUs you want to test, and --model to the desired architecture.

python tf_cnn_benchmarks.py --data_format=NCHW --batch_size=512 --num_batches=100 --model=inception4 --optimizer=momentum --variable_update=replicated --all_reduce_spec=nccl --use_fp16=True --nodistortions --gradient_repacking=2 --datasets_use_prefetch=True --per_gpu_thread_count=2 --loss_type_to_report=base_loss --compute_lr_on_cpu=True --single_l2_loss_op=True --xla_compile=True --local_parameter_device=gpu --num_gpus=1 --display_every=10

 

Quadro RTX 8000 Deep Learning Benchmarks: FP32, XLA

Run these benchmarks 

Set --num_gpus to the number of GPUs you want to test, and --model to the desired architecture.

python tf_cnn_benchmarks.py --data_format=NCHW --batch_size=256 --num_batches=100 --model=resnet50 --optimizer=momentum --variable_update=replicated --all_reduce_spec=nccl --nodistortions --gradient_repacking=2 --datasets_use_prefetch=True --per_gpu_thread_count=2 --loss_type_to_report=base_loss --compute_lr_on_cpu=True --single_l2_loss_op=True --xla_compile=True --local_parameter_device=gpu --num_gpus=1 --display_every=10

Quadro RTX 8000 Deep Learning Benchmarks: FP32, Batch Size 64

 

Run these benchmarks 

Set --num_gpus to the number of GPUs you want to test, and --model to the desired architecture.

python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=64 --model=resnet50 --variable_update=parameter_server

Quadro RTX 8000 Deep Learning Benchmarks: FP32, Large Batch Size

 

Run these benchmarks 

Set --num_gpus to the number of GPUs you want to test, --model to the desired architecture, and --batch_size to the desired mini-batch size.

python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=512 --model=resnet50 --variable_update=parameter_server

Quadro RTX 8000 Deep Learning Benchmarks: FP16, Batch Size 64

 

Run these benchmarks 

Set --num_gpus to the number of GPUs you want to test, and --model to the desired architecture.

python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=64 --model=resnet50 --variable_update=parameter_server --use_fp16=True

Quadro RTX 8000 Deep Learning Benchmarks: FP16, Large Batch Size

 

Run these benchmarks 

Set --num_gpus to the number of GPUs you want to test, --model to the desired architecture, and --batch_size to the desired mini-batch size.

python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=1024 --model=resnet50 --variable_update=parameter_server --use_fp16=True

Quadro RTX 8000 Deep Learning Benchmarks: AlexNet (FP32, FP16, XLA on/off)

 

Run these deep learning benchmarks 

Set --num_gpus to the number of GPUs you want to test, and --batch_size to the desired mini-batch size. Omit the --use_fp16 flag to run in FP32.

python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=8192 --model=alexnet --variable_update=parameter_server --use_fp16=True


System Specifications

Training Parameters (non XLA)

Training Parameters (XLA)

More Deep Learning Benchmarks

That’s it for now! Have any questions? Let us know on social media.

https://www.facebook.com/exxactcorp/