For this post, we conducted deep learning performance benchmarks for TensorFlow using the new NVIDIA Quadro RTX 6000 GPUs. Our Exxact Valence Workstation was fitted with 4x Quadro RTX 6000s, giving us 96 GB of total GPU memory.

We ran the standard “tf_cnn_benchmarks.py” benchmark script (found in the official TensorFlow GitHub repository) on the following networks: ResNet-50, ResNet-152, Inception v3, Inception v4, VGG-16, AlexNet, and NASNet.

We also compared FP16 to FP32 performance, starting with ‘typical’ batch sizes (64 in most cases) and incrementally doubling the batch size until we hit an out-of-memory error. We ran the same tests using 1-, 2-, and 4-GPU configurations. In addition, we ran benchmarks with XLA enabled and saw substantial improvements.
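
The batch-size sweep can be scripted. Below is a minimal bash sketch (not the script we actually used) that assumes tf_cnn_benchmarks.py is in the current directory and treats a non-zero exit code from the benchmark as the out-of-memory signal; it doubles the batch size for each GPU count until a run fails.

# Illustrative sweep: for each GPU count, keep doubling the batch size
# until the benchmark exits with an error (e.g., out of memory).
for GPUS in 1 2 4; do
  BATCH=64
  while python tf_cnn_benchmarks.py --num_gpus=$GPUS --batch_size=$BATCH \
      --model=resnet50 --variable_update=parameter_server --use_fp16=True; do
    BATCH=$((BATCH * 2))
  done
done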

Key Points and Observations

  • In terms of raw images/sec, the RTX 6000 is on par with the RTX 8000. The two cards use the same Turing GPU but have different memory capacities.
  • However, the RTX 6000 cannot fit the very large batch sizes that the RTX 8000 can.
  • XLA significantly increases throughput (images/sec) across most models. This holds for both FP16 and FP32, though the most dramatic gains were seen in FP16.

Quadro RTX 6000 Benchmark Snapshot, XLA on/off, FP32, FP16

[Chart: RTX 6000 TensorFlow benchmark snapshot]

Quadro RTX 6000 Deep Learning Benchmarks: FP16, Large Batch Size (XLA on)

[Chart: RTX 6000 FP16, large batch, XLA on]

Run these benchmarks 

Set num_gpus to the number of GPUs you want to test, and set model to the desired architecture.

python tf_cnn_benchmarks.py --data_format=NCHW --batch_size=256 --num_batches=100 --model=inception4 --optimizer=momentum --variable_update=replicated --all_reduce_spec=nccl --use_fp16=True --nodistortions --gradient_repacking=2 --datasets_use_prefetch=True --per_gpu_thread_count=2 --loss_type_to_report=base_loss --compute_lr_on_cpu=True --single_l2_loss_op=True --xla_compile=True --local_parameter_device=gpu --num_gpus=1 --display_every=10
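
For example, to run the same FP16 XLA benchmark on all four GPUs, only the --num_gpus flag changes (all other flags are kept exactly as above):

python tf_cnn_benchmarks.py --data_format=NCHW --batch_size=256 --num_batches=100 --model=inception4 --optimizer=momentum --variable_update=replicated --all_reduce_spec=nccl --use_fp16=True --nodistortions --gradient_repacking=2 --datasets_use_prefetch=True --per_gpu_thread_count=2 --loss_type_to_report=base_loss --compute_lr_on_cpu=True --single_l2_loss_op=True --xla_compile=True --local_parameter_device=gpu --num_gpus=4 --display_every=10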

Quadro RTX 6000 Deep Learning Benchmarks: FP16 Batch Size 64 (XLA off)

[Chart: RTX 6000 FP16, batch size 64, XLA off]

Run these benchmarks 

Set num_gpus to the number of GPUs you want to test, and set model to the desired architecture.

python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=64 --model=resnet50 --variable_update=parameter_server --use_fp16=True
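
For example, to benchmark Inception v3 instead of ResNet-50, swap the --model flag (tf_cnn_benchmarks accepts model names such as resnet50, resnet152, inception3, inception4, vgg16, alexnet, and nasnet):

python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=64 --model=inception3 --variable_update=parameter_server --use_fp16=True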

Quadro RTX 6000 Deep Learning Benchmarks: FP16 Large Batch Size (XLA off)

Run these benchmarks 

Set num_gpus to the number of GPUs you want to test, and set model to the desired architecture.

python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=512 --model=resnet50 --variable_update=parameter_server --use_fp16=True

Quadro RTX 6000 Deep Learning Benchmarks: FP32, Large Batch Size (XLA on)

[Chart: RTX 6000 FP32, large batch, XLA on]

Run these benchmarks 

Set num_gpus to the number of GPUs you want to test, and set model to the desired architecture.

python tf_cnn_benchmarks.py --data_format=NCHW --batch_size=256 --num_batches=100 --model=resnet50 --optimizer=momentum --variable_update=replicated --all_reduce_spec=nccl --nodistortions --gradient_repacking=2 --datasets_use_prefetch=True --per_gpu_thread_count=2 --loss_type_to_report=base_loss --compute_lr_on_cpu=True --single_l2_loss_op=True --xla_compile=True --local_parameter_device=gpu --num_gpus=1 --display_every=10

Quadro RTX 6000 Deep Learning Benchmarks: FP32, Batch Size 64

[Chart: RTX 6000 FP32, batch size 64]

Run these benchmarks 

Set num_gpus to the number of GPUs you want to test, and set model to the desired architecture.

python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=64 --model=resnet50 --variable_update=parameter_server
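
For the two-GPU configuration of the same FP32 test, only the --num_gpus flag changes:

python tf_cnn_benchmarks.py --num_gpus=2 --batch_size=64 --model=resnet50 --variable_update=parameter_server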

Quadro RTX 6000 Deep Learning Benchmarks: FP32 Large Batch Size

[Chart: RTX 6000 FP32, large batch]

Run these benchmarks 

Set num_gpus to the number of GPUs you want to test, and set model to the desired architecture.

python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=256 --model=resnet50 --variable_update=parameter_server

Quadro RTX 6000 Deep Learning Benchmarks: AlexNet (FP32, FP16, FP16 XLA on, FP32 XLA off)

[Chart: AlexNet on RTX 6000]

Run these deep learning benchmarks 

Set num_gpus to the number of GPUs you want to test, and omit the use_fp16 flag to run in FP32.

python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=8192 --model=alexnet --variable_update=parameter_server --use_fp16=True
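
For example, the corresponding FP32 run simply drops the --use_fp16 flag (if the 8192 batch no longer fits in memory in FP32, halve it until it does):

python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=8192 --model=alexnet --variable_update=parameter_server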

System Specifications:

Training Parameters (non-XLA)

Training Parameters (XLA)