For this blog article, we ran a more extensive set of deep learning performance benchmarks for TensorFlow on NVIDIA GeForce RTX 2080 Ti GPUs. We recently discovered that XLA (Accelerated Linear Algebra) delivers significant performance gains, and felt it was worth running the numbers again. Our Exxact Valence Workstation was fitted with 4x RTX 2080 Ti GPUs and ran the standard “tf_cnn_benchmarks.py” benchmark script from the official TensorFlow GitHub repository. We tested the following networks: ResNet50, ResNet152, Inception v3, Inception v4, VGG-16, AlexNet, and NASNet. We compared FP16 to FP32 performance, with and without the XLA flag, and ran the same tests in 1, 2, and 4 GPU configurations. For each model, the batch size was the largest power of two that fit into available GPU memory.
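To make the size of the test matrix concrete, here is a minimal sketch that enumerates every combination described above. The model names are display labels taken from the text, not necessarily the exact --model strings the script expects, and the function name is our own.

```python
from itertools import product

# Values taken from the test description above. These are display labels,
# not necessarily the exact --model strings that tf_cnn_benchmarks.py expects.
MODELS = ["ResNet50", "ResNet152", "Inception v3", "Inception v4",
          "VGG-16", "AlexNet", "NASNet"]
PRECISIONS = ["FP32", "FP16"]
XLA_SETTINGS = [False, True]
GPU_COUNTS = [1, 2, 4]


def benchmark_matrix():
    """Return every (model, precision, xla, num_gpus) combination as a list."""
    return list(product(MODELS, PRECISIONS, XLA_SETTINGS, GPU_COUNTS))


runs = benchmark_matrix()
print(len(runs), "configurations")
for model, precision, xla, gpus in runs[:3]:
    print(model, precision, "XLA on" if xla else "XLA off", f"{gpus} GPU(s)")
```

Not every combination necessarily produces a result, so a real driver script would likely need to tolerate failed runs.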

Key Points and Observations

  • XLA significantly increases throughput (img/sec) across most models. This is true for both FP16 and FP32, but the most dramatic gains were seen in FP16: up to 32% (ResNet50, 4 GPU configuration, FP16).
  • On certain models we ran into errors when running the benchmarks with XLA (VGG and AlexNet models at FP32).
  • The ResNet models (ResNet50, ResNet152) showed massive improvements using XLA + FP16.
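For readers who want to reproduce a percentage like the one quoted above from their own runs, the calculation is a simple relative change in img/sec. This is a minimal sketch; the function name and the sample numbers are hypothetical and are not our measured results.

```python
def percent_gain(baseline_img_per_sec, xla_img_per_sec):
    """Relative throughput gain of the XLA run over the baseline, in percent."""
    if baseline_img_per_sec <= 0:
        raise ValueError("baseline throughput must be positive")
    return (xla_img_per_sec - baseline_img_per_sec) / baseline_img_per_sec * 100.0


# Hypothetical throughputs chosen only to illustrate the arithmetic.
print(round(percent_gain(1000.0, 1320.0), 1))
```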

GeForce RTX 2080 Ti Benchmark Snapshot, All Models, XLA on/off, FP32, FP16

2080 Ti benchmark snapshot

GeForce RTX 2080 Ti Deep Learning Benchmarks: FP16 (XLA on)

XLA FP16 2080 Ti benchmarks Alexnet

 

Run these benchmarks 

Set --num_gpus to the number of GPUs you want to test, and change --model to the desired architecture.

python tf_cnn_benchmarks.py --data_format=NCHW --batch_size=64 --num_batches=100 --model=inception4 --optimizer=momentum --variable_update=replicated --all_reduce_spec=nccl --use_fp16=True --nodistortions --gradient_repacking=2 --datasets_use_prefetch=True --per_gpu_thread_count=2 --loss_type_to_report=base_loss --compute_lr_on_cpu=True --single_l2_loss_op=True --xla_compile=True --local_parameter_device=gpu --num_gpus=1 --display_every=10
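If you plan to repeat this run for several models or GPU counts, it can help to generate the command instead of editing it by hand. The sketch below reuses only the flags from the command above; the helper name build_xla_command is our own invention and not part of the benchmark script.

```python
# Flags copied verbatim from the XLA command above. Only the model, batch size,
# GPU count, and precision are exposed as parameters.
XLA_FLAGS = [
    "--data_format=NCHW", "--num_batches=100", "--optimizer=momentum",
    "--variable_update=replicated", "--all_reduce_spec=nccl", "--nodistortions",
    "--gradient_repacking=2", "--datasets_use_prefetch=True",
    "--per_gpu_thread_count=2", "--loss_type_to_report=base_loss",
    "--compute_lr_on_cpu=True", "--single_l2_loss_op=True",
    "--xla_compile=True", "--local_parameter_device=gpu", "--display_every=10",
]


def build_xla_command(model, batch_size, num_gpus, fp16=True):
    """Return the benchmark command as an argument list (hypothetical helper)."""
    cmd = ["python", "tf_cnn_benchmarks.py",
           f"--model={model}", f"--batch_size={batch_size}",
           f"--num_gpus={num_gpus}"]
    if fp16:
        cmd.append("--use_fp16=True")
    return cmd + XLA_FLAGS


# Print the command for each GPU configuration we tested.
for n in (1, 2, 4):
    print(" ".join(build_xla_command("inception4", 64, n)))
```

The argument list can be passed directly to subprocess.run if you want to launch the runs from Python.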

GeForce RTX 2080 Ti Deep Learning Benchmarks: FP16 (XLA off)

FP16 2080 Ti benchmarks

 

Run these benchmarks 

To run these, set --num_gpus to the number of GPUs you want to test, and change --model to the desired architecture.

python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=128 --model=resnet50 --variable_update=parameter_server --use_fp16=True
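To collect numbers from many runs, you can capture the script's output and pull out the final throughput line. The sketch below assumes the script ends its report with a line of the form "total images/sec: <number>"; check this against the output of your version before relying on it. The function name and the sample text are hypothetical.

```python
import re

# Assumed format of the summary line; verify against your script's real output.
_TOTAL_RE = re.compile(r"total images/sec:\s*([0-9]+(?:\.[0-9]+)?)")


def parse_total_images_per_sec(output):
    """Return the last reported total img/sec as a float, or None if absent."""
    values = _TOTAL_RE.findall(output)
    return float(values[-1]) if values else None


# Made-up text standing in for captured benchmark output.
sample_output = "some log lines\ntotal images/sec: 123.45\n"
print(parse_total_images_per_sec(sample_output))
```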

GeForce RTX 2080 Ti Deep Learning Benchmarks: FP32 (XLA on)

2080 Ti benchmarks FP32 XLA

 

Run these benchmarks 

To run XLA FP32, set --num_gpus to the number of GPUs you want to test, and change --model to the desired architecture. The command omits --use_fp16=True.

python tf_cnn_benchmarks.py --data_format=NCHW --batch_size=64 --num_batches=100 --model=resnet50 --optimizer=momentum --variable_update=replicated --all_reduce_spec=nccl --nodistortions --gradient_repacking=2 --datasets_use_prefetch=True --per_gpu_thread_count=2 --loss_type_to_report=base_loss --compute_lr_on_cpu=True --single_l2_loss_op=True --xla_compile=True --local_parameter_device=gpu --num_gpus=1 --display_every=10
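As a quick sanity check, the sketch below compares the flags of the XLA FP16 command from earlier in this article with the XLA FP32 command above. The helper flag_difference is our own; the two command strings are copied from this post.

```python
import shlex

# XLA FP16 command from earlier in this article.
FP16_XLA_CMD = (
    "python tf_cnn_benchmarks.py --data_format=NCHW --batch_size=64 "
    "--num_batches=100 --model=inception4 --optimizer=momentum "
    "--variable_update=replicated --all_reduce_spec=nccl --use_fp16=True "
    "--nodistortions --gradient_repacking=2 --datasets_use_prefetch=True "
    "--per_gpu_thread_count=2 --loss_type_to_report=base_loss "
    "--compute_lr_on_cpu=True --single_l2_loss_op=True --xla_compile=True "
    "--local_parameter_device=gpu --num_gpus=1 --display_every=10"
)

# XLA FP32 command shown above.
FP32_XLA_CMD = (
    "python tf_cnn_benchmarks.py --data_format=NCHW --batch_size=64 "
    "--num_batches=100 --model=resnet50 --optimizer=momentum "
    "--variable_update=replicated --all_reduce_spec=nccl "
    "--nodistortions --gradient_repacking=2 --datasets_use_prefetch=True "
    "--per_gpu_thread_count=2 --loss_type_to_report=base_loss "
    "--compute_lr_on_cpu=True --single_l2_loss_op=True --xla_compile=True "
    "--local_parameter_device=gpu --num_gpus=1 --display_every=10"
)


def flag_difference(cmd_a, cmd_b):
    """Return (flags only in cmd_a, flags only in cmd_b), each sorted."""
    a, b = set(shlex.split(cmd_a)), set(shlex.split(cmd_b))
    return sorted(a - b), sorted(b - a)


only_fp16, only_fp32 = flag_difference(FP16_XLA_CMD, FP32_XLA_CMD)
print("only in FP16 command:", only_fp16)
print("only in FP32 command:", only_fp32)
```

Apart from the example model, the only difference is --use_fp16=True, so switching precision is a one-flag change.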

GeForce RTX 2080 Ti Deep Learning Benchmarks: FP32 (XLA off)

2080 Ti benchmarks FP32

 

Run these benchmarks 

Set --num_gpus to the number of GPUs you want to test, and change --model to the desired architecture.

python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=64 --model=resnet50 --variable_update=parameter_server
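Once you have img/sec figures for 1, 2, and 4 GPUs, one common way to summarize them is scaling efficiency relative to perfect linear scaling from a single GPU. This is a minimal sketch of that arithmetic; the function name and the throughput values are hypothetical and are not our measurements.

```python
def scaling_efficiency(throughput_by_gpus):
    """Map GPU count to efficiency versus ideal linear scaling from 1 GPU.

    throughput_by_gpus: dict of {num_gpus: img_per_sec}; must include key 1.
    """
    base = throughput_by_gpus[1]
    return {n: t / (n * base) for n, t in sorted(throughput_by_gpus.items())}


# Hypothetical throughputs chosen only to illustrate the calculation.
example = {1: 100.0, 2: 190.0, 4: 360.0}
for gpus, eff in scaling_efficiency(example).items():
    print(f"{gpus} GPU(s): {eff:.0%} of linear scaling")
```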

RTX 2080 Ti Deep Learning Benchmarks: AlexNet (FP32, FP16, XLA FP16, XLA FP32)

2080 Ti benchmarks Alexnet

 

How to run these benchmarks 

Set --num_gpus to the number of GPUs you want to test, and change --model to the desired architecture. For FP32, omit --use_fp16=True.

python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=2048 --model=alexnet --variable_update=parameter_server --use_fp16=True

To run these benchmarks with XLA

Set --num_gpus to the number of GPUs you want to test, and change --model to the desired architecture.

python tf_cnn_benchmarks.py --data_format=NCHW --batch_size=2048 --num_batches=100 --model=alexnet --optimizer=momentum --variable_update=replicated --all_reduce_spec=nccl --use_fp16=True --nodistortions --gradient_repacking=2 --datasets_use_prefetch=True --per_gpu_thread_count=2 --loss_type_to_report=base_loss --compute_lr_on_cpu=True --single_l2_loss_op=True --xla_compile=True --local_parameter_device=gpu --num_gpus=1 --display_every=10
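The batch sizes in these commands were the largest powers of two that fit in GPU memory. One way to find such a value is to keep doubling until a run no longer fits. The sketch below shows that search with a stand-in check. In practice the fits callback would launch a short benchmark run and report whether it completed without an out-of-memory error. Both the helper name and the callback are hypothetical.

```python
def largest_power_of_two_batch(fits, start=1, limit=65536):
    """Return the largest batch size (start * 2^k, up to limit) that fits.

    fits: callable taking a batch size and returning True if the run succeeds.
    start: first size to try; assumed to be a power of two.
    Returns None if even the starting size does not fit.
    """
    best = None
    batch = start
    while batch <= limit and fits(batch):
        best = batch
        batch *= 2
    return best


# Stand-in check: pretend any batch of up to 3000 images fits in memory.
print(largest_power_of_two_batch(lambda b: b <= 3000))
```

This simple version assumes that if a batch size fails, every larger one fails too, which is a reasonable expectation for memory limits.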

System Specifications:

 

Training Parameters (non XLA)

 

Training Parameters (XLA)

