For this blog article, we conducted deep learning performance benchmarks for TensorFlow using NVIDIA TITAN RTX GPUs. Tests were conducted using an Exxact TITAN Workstation outfitted with 2x TITAN RTXs with an NVLink bridge. We ran the standard “tf_cnn_benchmarks.py” benchmark script found in the official TensorFlow GitHub repository.

We ran tests on the following networks: ResNet-50, ResNet-152, Inception v3, Inception v4, VGG-16, AlexNet, and NASNet. We compared FP16 to FP32 performance and measured the impact of XLA compilation. The same tests were conducted using 1 and 2 GPU configurations, and the batch size used for each model was the largest power of two that fit in GPU memory.
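To reproduce the full sweep rather than a single run, a small driver script along the following lines can loop over the model, precision, and GPU-count combinations and invoke tf_cnn_benchmarks.py for each one. This is an illustrative sketch, not part of the benchmark repository; it uses the simpler non-XLA command form shown later in this post, and the model names follow the tf_cnn_benchmarks conventions.

# Illustrative sweep driver (not from the benchmark repo); flags mirror the
# non-XLA commands shown later in this post.
import itertools
import subprocess

MODELS = ["resnet50", "resnet152", "inception3", "inception4",
          "vgg16", "alexnet", "nasnet"]
BATCH_SIZE = 256  # placeholder; the post used the largest power of two that fit per model

for model, num_gpus, fp16 in itertools.product(MODELS, (1, 2), (False, True)):
    cmd = [
        "python", "tf_cnn_benchmarks.py",
        f"--num_gpus={num_gpus}",
        f"--batch_size={BATCH_SIZE}",
        f"--model={model}",
        "--variable_update=parameter_server",
    ]
    if fp16:
        cmd.append("--use_fp16=True")
    print("Running:", " ".join(cmd))
    subprocess.run(cmd, check=True)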

Key Points and Observations

  • The TITAN RTX is an excellent choice if you need large batch sizes for training while keeping costs at a reasonable price point.
  • Performance (img/sec) is comparable to Quadro RTX 6000 benchmark performance in most instances.
  • In this dual-GPU configuration, the workstation ran silently and stayed very cool during training workloads (note: the chassis offers a lot of airflow).
  • Significant gains were made using XLA in most cases, especially in FP16; see the note on enabling XLA after this list.
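For reference, the benchmark script exposes XLA through command-line flags (--xla_compile=True in the commands below). If you want to experiment with XLA in your own TensorFlow 1.x code, the global XLA JIT can be enabled through the session config; the following is a minimal sketch, related to but not identical to what the benchmark's flag does.

import tensorflow as tf

# Enable the global XLA JIT for a TF 1.x session.
config = tf.ConfigProto()
config.graph_options.optimizer_options.global_jit_level = tf.OptimizerOptions.ON_1

with tf.Session(config=config) as sess:
    # ... build and run your training graph here ...
    pass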

TITAN RTX Benchmark Snapshot, All Models, XLA on/off, FP32, FP16

[Chart: TITAN RTX benchmark snapshot across all models, XLA on/off, FP32 and FP16]

TITAN RTX Deep Learning Benchmarks: FP16 (XLA on)

[Chart: TITAN RTX FP16 results with XLA on]

Run these benchmarks 

Set --num_gpus to the number of GPUs you want to test and --model to the desired architecture.

python tf_cnn_benchmarks.py --data_format=NCHW --batch_size=256 --num_batches=100 --model=inception4 --optimizer=momentum --variable_update=replicated --all_reduce_spec=nccl --use_fp16=True --nodistortions --gradient_repacking=2 --datasets_use_prefetch=True --per_gpu_thread_count=2 --loss_type_to_report=base_loss --compute_lr_on_cpu=True --single_l2_loss_op=True --xla_compile=True --local_parameter_device=gpu --num_gpus=1 --display_every=10
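If you want to capture the reported throughput programmatically rather than reading it off the console, a small wrapper such as the sketch below can run a command and print the summary line. The "images/sec" string match reflects how the script reported results in our runs; treat it as an assumption and adjust it if your version prints the summary differently.

# Hypothetical wrapper: run one benchmark command and print the throughput summary.
import shlex
import subprocess

CMD = ("python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=256 "
       "--model=inception4 --variable_update=parameter_server --use_fp16=True")

proc = subprocess.run(shlex.split(CMD), capture_output=True, text=True)
for line in proc.stdout.splitlines():
    if "images/sec" in line:  # summary line printed near the end of the run
        print(line.strip())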

TITAN RTX Deep Learning Benchmarks: FP16 (XLA off)

[Chart: TITAN RTX FP16 results with XLA off]

Run these benchmarks 

Set --num_gpus to the number of GPUs you want to test and --model to the desired architecture.

python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=512 --model=resnet50 --variable_update=parameter_server --use_fp16=True

TITAN RTX Deep Learning Benchmarks: FP32 (XLA on)

[Chart: TITAN RTX FP32 results with XLA on]

Run these benchmarks 

Set --num_gpus to the number of GPUs you want to test and --model to the desired architecture.

python tf_cnn_benchmarks.py --data_format=NCHW --batch_size=256 --num_batches=100 --model=resnet50 --optimizer=momentum --variable_update=replicated --all_reduce_spec=nccl --nodistortions --gradient_repacking=2 --datasets_use_prefetch=True --per_gpu_thread_count=2 --loss_type_to_report=base_loss --compute_lr_on_cpu=True --single_l2_loss_op=True --xla_compile=True --local_parameter_device=gpu --num_gpus=1 --display_every=10

TITAN RTX Deep Learning Benchmarks: FP32 (XLA off)

[Chart: TITAN RTX FP32 results with XLA off]

Run these benchmarks 

Set --num_gpus to the number of GPUs you want to test and --model to the desired architecture.

python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=256 --model=resnet50 --variable_update=parameter_server
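The batch sizes in these commands were chosen as the largest power of two that fit in the TITAN RTX's 24 GB of memory for each model. A simple way to find that value for a new model or GPU is to probe downward from a large starting point, as in this illustrative helper (the function name and defaults are our own, not part of the benchmark repository):

# Illustrative helper: find the largest power-of-two batch size that
# completes a short run without failing (typically an out-of-memory error).
import subprocess

def largest_pow2_batch(model, start=8192, num_gpus=1):
    batch = start
    while batch >= 1:
        cmd = [
            "python", "tf_cnn_benchmarks.py",
            f"--num_gpus={num_gpus}",
            f"--batch_size={batch}",
            f"--model={model}",
            "--num_batches=10",
            "--variable_update=parameter_server",
        ]
        if subprocess.run(cmd).returncode == 0:
            return batch
        batch //= 2  # halve and retry when the run fails
    return None

print(largest_pow2_batch("resnet50"))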

TITAN RTX Deep Learning Benchmarks: Alexnet (FP32, FP16, XLA FP16, XLA FP32)

[Chart: TITAN RTX AlexNet results, FP32 and FP16, with and without XLA]

Run these benchmarks 

Set --num_gpus to the number of GPUs you want to test and --model to the desired architecture.

python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=4096 --model=alexnet --variable_update=parameter_server --use_fp16=True

 

Run these benchmarks with XLA

To run with XLA, set --num_gpus to the number of GPUs you want to test and --model to the desired architecture.

python tf_cnn_benchmarks.py --data_format=NCHW --batch_size=8192 --num_batches=100 --model=alexnet --optimizer=momentum --variable_update=replicated --all_reduce_spec=nccl --use_fp16=True --nodistortions --gradient_repacking=2 --datasets_use_prefetch=True --per_gpu_thread_count=2 --loss_type_to_report=base_loss --compute_lr_on_cpu=True --single_l2_loss_op=True --xla_compile=True --local_parameter_device=gpu --num_gpus=1 --display_every=10
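When comparing the single- and dual-GPU results from runs like the ones above, a quick sanity check on multi-GPU scaling is to compute speedup and efficiency from the reported img/sec values. The numbers below are placeholders to show the arithmetic, not measurements from this post:

# Placeholder throughputs; substitute the img/sec values from your own runs.
single_gpu_ips = 300.0  # one TITAN RTX
dual_gpu_ips = 570.0    # two TITAN RTXs (--num_gpus=2)

speedup = dual_gpu_ips / single_gpu_ips
efficiency = speedup / 2  # divide by the number of GPUs
print(f"Speedup: {speedup:.2f}x, scaling efficiency: {efficiency:.0%}")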

System Specifications:

Training Parameters (non-XLA)

Non-XLA runs used the simpler command form shown above: --variable_update=parameter_server, the largest power-of-two batch size that fit in memory (e.g. 256 in FP32 and 512 in FP16 for ResNet-50, 4096 for AlexNet), and --use_fp16=True for the FP16 results.

Training Parameters (XLA)

XLA runs used --xla_compile=True together with --optimizer=momentum, --variable_update=replicated, --all_reduce_spec=nccl, --num_batches=100, --data_format=NCHW, and the remaining flags shown in the XLA commands above.