Monitoring the right GPU performance metrics can go a long way in helping you train and deploy deep learning applications. Here are the top 5 metrics you should monitor:

1. GPU Utilization

GPU utilization is one of the primary metrics to observe during a deep learning training session. It is readily accessible through GPU monitoring interfaces such as NVIDIA’s nvidia-smi. Utilization is defined as the percentage of time over the past sample period (between 1/6 second and 1 second, depending on the product) during which one or more GPU kernels were running. In practice, it tells you how much of the time your deep learning program is keeping the GPU busy.

Monitoring GPU utilization during your deep learning training sessions is one of the best ways to determine whether your GPU is actually being used. Moreover, watching the real-time utilization trend can help identify bottlenecks in your pre-processing and feature engineering pipelines that might be slowing down your training process: if utilization repeatedly drops while the GPU waits for the next batch, the input pipeline is the likely culprit.
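As a rough illustration of watching the utilization trend, the Python sketch below polls nvidia-smi and reports how often the GPU looked underfed. The helper names (parse_utilization, read_gpu_utilization, starved_fraction) and the 50% threshold are assumptions made for this example, not part of any NVIDIA tooling; only the nvidia-smi query flags are real.

```python
import subprocess


def parse_utilization(csv_text):
    """Parse `nvidia-smi --format=csv,noheader,nounits` output, one value per GPU.

    Values that are not numeric (for example "[N/A]" on unsupported devices)
    are returned as None.
    """
    values = []
    for line in csv_text.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            values.append(float(line))
        except ValueError:
            values.append(None)
    return values


def read_gpu_utilization():
    """Query the current utilization (percent) of every visible GPU."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=utilization.gpu",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_utilization(result.stdout)


def starved_fraction(samples, threshold=50.0):
    """Fraction of numeric samples below `threshold` percent (hypothetical cutoff)."""
    numeric = [s for s in samples if s is not None]
    if not numeric:
        return 0.0
    return sum(1 for s in numeric if s < threshold) / len(numeric)
```

Calling read_gpu_utilization() once per second during training and passing the collected values for one GPU to starved_fraction gives a quick sense of how much of the run the GPU spent waiting. A high fraction suggests looking at data loading before tuning the model itself.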


2. GPU Memory Access and Utilization

Much like GPU utilization, the state of your GPU’s memory is also a great indicator of how well your GPU is being used in your deep learning process. nvidia-smi exposes a comprehensive list of memory metrics that can help you accelerate your model training.

Like GPU utilization, GPU memory utilization is one of the key metrics to monitor over the training process. It represents the percentage of time over the past sample period during which the GPU’s memory was being read or written. Other metrics such as total memory, used memory and free memory can also prove important, as they provide insights into the efficiency of your deep learning program. Additionally, these metrics can be used to fine-tune the batch size for your training samples: if used memory stays well below the total, a larger batch may fit.
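The memory metrics above can be read in one nvidia-smi query. The sketch below assumes the query `--query-gpu=utilization.memory,memory.total,memory.used,memory.free --format=csv,noheader,nounits`, which prints one comma-separated line per GPU with memory sizes in MiB. The function names and the 10% reserve are hypothetical choices for illustration only.

```python
def parse_memory(csv_line):
    """Parse one line of the nvidia-smi memory query described above."""
    util, total, used, free = (float(field) for field in csv_line.split(","))
    return {
        "mem_util_pct": util,   # percent of time memory was read or written
        "total_mib": total,
        "used_mib": used,
        "free_mib": free,
    }


def used_fraction(stats):
    """Share of the GPU's total memory currently in use."""
    return stats["used_mib"] / stats["total_mib"]


def has_batch_headroom(stats, reserve=0.10):
    """True if used memory leaves more than `reserve` of the total unused.

    `reserve` is an assumed safety margin, not a recommended value.
    """
    return used_fraction(stats) < 1.0 - reserve
```

For example, a line such as "35, 24576, 12288, 12288" parses to a used fraction of 0.5, which hints that the batch size could be raised. Memory use does not always scale linearly with batch size, so it is worth increasing it in steps and re-checking.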

3. Power Usage and Temperatures

Power usage is an important aspect of GPU performance. The power draw on one of your GPUs gives you an indication of how hard it is working, as well as how power-intensive your application would be. This can be especially important for testing deep learning applications for mobile devices, where power consumption is a significant concern.

Power usage is closely tied to the ambient temperature in which the GPU operates. The power draw reported by tools such as nvidia-smi is usually measured for the entire board, so it includes the power consumed by active cooling elements, memory and compute units.

As your GPU’s temperature rises, the ohmic resistance of electronic components increases and fans spin faster, increasing the power draw. For deep learning, a GPU’s power consumption is also important because thermal throttling at high temperatures can slow down the training process.
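To turn logged power and temperature readings into something comparable across runs, one option is to integrate power over time and count how often the card ran hot. The sketch below assumes evenly spaced samples, such as those collected with `nvidia-smi --query-gpu=power.draw,temperature.gpu --format=csv,noheader,nounits`. The function names and the temperature limit are placeholders for this example; the real throttling threshold depends on the specific GPU.

```python
def energy_wh(power_watts, interval_s):
    """Approximate energy in watt-hours from evenly spaced power samples.

    Uses the trapezoidal rule: each pair of neighbouring samples contributes
    their average power multiplied by the sampling interval.
    """
    if len(power_watts) < 2:
        return 0.0
    joules = sum(
        (a + b) / 2.0 * interval_s
        for a, b in zip(power_watts, power_watts[1:])
    )
    return joules / 3600.0


def hot_fraction(temps_c, limit_c):
    """Fraction of temperature samples at or above `limit_c` (user-chosen)."""
    if not temps_c:
        return 0.0
    return sum(1 for t in temps_c if t >= limit_c) / len(temps_c)
```

Three samples of 100 W taken 1800 seconds apart cover one hour and come out to 100 Wh. A hot_fraction that climbs during a long run is a hint to check cooling before blaming the model for slower epochs.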

4. Time to Solution (Training Time)

The time to solution, also known as the training time, is one of the primary metrics used to gauge and benchmark the performance of GPUs on deep learning models. It is important to keep the definition of the solution consistent across all the GPUs being compared. For classification problems such as image classification using convolutional neural networks and NLP applications using recurrent neural networks, this could be a predefined accuracy that the model has to meet. GPU features such as mixed-precision training and model optimizations such as tuning the input batch size play an important role in training time.
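A minimal way to measure time to solution is to time the training loop until a fixed accuracy target is reached. In the sketch below, train_one_epoch and evaluate are hypothetical callables that you would supply from your own training code; the function itself is framework-agnostic and only illustrates the bookkeeping.

```python
import time


def time_to_solution(train_one_epoch, evaluate, target_accuracy, max_epochs=100):
    """Return (seconds, epochs) needed to reach `target_accuracy`.

    `train_one_epoch()` runs one pass over the training data and `evaluate()`
    returns the current validation accuracy as a float. Both are assumed to be
    provided by the caller. Returns (None, max_epochs) if the target is never met.
    """
    start = time.perf_counter()
    for epoch in range(1, max_epochs + 1):
        train_one_epoch()
        if evaluate() >= target_accuracy:
            return time.perf_counter() - start, epoch
    return None, max_epochs
```

Running the same function with the same target on each GPU keeps the comparison fair. Because GPU work is often launched asynchronously, evaluate should return a plain Python number so that pending GPU work has finished before the clock is read.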

5. Throughput

While training time is important during the learning process, the time for inference is important for a deployed model in production. In neural networks, the time for inference is the time needed to make a forward pass through the neural network to come up with a result. The throughput is typically used to measure a GPU’s performance in making fast inferences.

The general metric for throughput is given by the number of samples processed per second by the model on a GPU. However, the exact metric can vary depending on the model architecture and the deep learning application.

For example, the throughput for a convolutional neural network for image classification would be measured in images/second. In contrast, the throughput for a recurrent neural network used in an NLP application could be measured in tokens/second.
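Throughput can be estimated by timing a fixed number of forward passes and dividing the number of samples by the elapsed time. The sketch below assumes a caller-supplied run_batch callable that performs one inference pass over a batch and blocks until the result is ready; the name, the warm-up count and the other parameters are illustrative.

```python
import time


def measure_throughput(run_batch, batch_size, num_batches, warmup=2):
    """Return samples processed per second over `num_batches` timed batches.

    `run_batch()` is a hypothetical callable that runs one forward pass on a
    batch of `batch_size` samples. A few untimed warm-up calls come first so
    that one-time setup costs do not distort the result.
    """
    for _ in range(warmup):
        run_batch()
    start = time.perf_counter()
    for _ in range(num_batches):
        run_batch()
    elapsed = time.perf_counter() - start
    return batch_size * num_batches / elapsed
```

For an image classifier, batch_size is the number of images per batch and the result is in images/second. For an NLP model, passing the number of tokens per batch instead gives tokens/second.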


Monitoring the right GPU performance metrics can save you a lot of drudgery and time, so you can focus on training or deploying your deep learning applications.

What do you consider your top metrics for evaluating deep learning GPU performance? Are they the same as the ones we discussed? Let us know in the comments below!
