NVIDIA CUDA 11 Now Available

June 22, 2020

Introducing NVIDIA® CUDA® 11

CUDA is a parallel computing platform and programming model developed by NVIDIA for general computing on graphical processing units (GPUs). With CUDA, developers are able to dramatically speed up computing applications by harnessing the power of GPUs.

CUDA 11 enables you to leverage the new hardware capabilities to accelerate HPC, genomics, 5G, rendering, deep learning, data analytics, data science, robotics, and many more diverse workloads.

CUDA 11 is packed full of features, from platform system software to everything that you need to get started and develop GPU-accelerated applications. Highlights of this major release include:

Support for the NVIDIA Ampere GPU architecture, including the new NVIDIA A100 GPU for accelerated scale-up and scale-out of AI and HPC data centers; multi-GPU systems with the NVSwitch fabric such as the DGX A100.
Multi-Instance GPU (MIG) partitioning capability that is particularly beneficial to cloud service providers (CSPs) for improved GPU utilization.
New third-generation Tensor Cores to accelerate mixed-precision, matrix operations on different data types, including TF32 and Bfloat16.
Programming and APIs for task graphs, asynchronous data movement, fine-grained synchronization, and L2 cache residency control.
Performance optimizations in CUDA libraries for linear algebra, FFTs, and matrix multiplication.
Updates to the Nsight product family of tools for tracing, profiling, and debugging of CUDA applications.
Full support on all major CPU architectures, across x86_64, Arm64 server and POWER architectures.

You can find more details on each of these highlights here. The CUDA 11 Release Notes continue below.

NVIDIA CUDA Toolkit Release Notes

The Release Notes for the CUDA Toolkit.

1. CUDA Toolkit Major Components

This section provides an overview of the major components of the NVIDIA® CUDA® Toolkit and points to their locations after installation.

Compiler

The CUDA-C and CUDA-C++ compiler, nvcc, is found in the bin/ directory. It is built on top of the NVVM optimizer, which is itself built on top of the LLVM compiler infrastructure. Developers who want to target NVVM directly can do so using the Compiler SDK, which is available in the nvvm/ directory.

Please note that the following files are compiler-internal and subject to change without any prior notice:

any file in include/crt and bin/crt
include/common_functions.h, include/device_double_functions.h, include/device_functions.h, include/host_config.h, include/host_defines.h, and include/math_functions.h
nvvm/bin/cicc
bin/cudafe++, bin/bin2c, and bin/fatbinary

Tools

The following development tools are available in the bin/ directory, with two exceptions: Nsight Visual Studio Edition (VSE) is installed as a plug-in to Microsoft Visual Studio, and Nsight Compute and Nsight Systems are available in a separate directory.

IDEs: nsight (Linux, Mac), Nsight VSE (Windows)
Debuggers: cuda-memcheck, cuda-gdb (Linux), Nsight VSE (Windows)
Profilers: Nsight Systems, Nsight Compute, nvprof, nvvp, ncu, Nsight VSE (Windows)
Utilities: cuobjdump, nvdisasm

Libraries

The scientific and utility libraries listed below are available in the lib64/ directory (DLLs on Windows are in bin/), and their interfaces are available in the include/ directory.

cub (High performance primitives for CUDA)
cublas (BLAS)
cublas_device (BLAS Kernel Interface)
cuda_occupancy (Kernel Occupancy Calculation [header file implementation])
cudadevrt (CUDA Device Runtime)
cudart (CUDA Runtime)
cufft (Fast Fourier Transform [FFT])
cupti (CUDA Profiling Tools Interface)
curand (Random Number Generation)
cusolver (Dense and Sparse Direct Linear Solvers and Eigen Solvers)
cusparse (Sparse Matrix)
libcu++ (CUDA Standard C++ Library)
nvJPEG (JPEG encoding/decoding)
npp (NVIDIA Performance Primitives [image and signal processing])
nvblas (“Drop-in” BLAS)
nvcuvid (CUDA Video Decoder [Windows, Linux])
nvml (NVIDIA Management Library)
nvrtc (CUDA Runtime Compilation)
nvtx (NVIDIA Tools Extension)
thrust (Parallel Algorithm Library [header file implementation])

CUDA Samples

Code samples that illustrate how to use various CUDA and library APIs are available in the samples/ directory on Linux and Mac, and are installed to C:\ProgramData\NVIDIA Corporation\CUDA Samples on Windows. On Linux and Mac, the samples/ directory is read-only and the samples must be copied to another location if they are to be modified. Further instructions can be found in the Getting Started Guides for Linux and Mac.

Documentation

The most current version of these release notes can be found online at http://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html. Also, the version.txt file in the root directory of the toolkit will contain the version and build number of the installed toolkit.

Documentation can be found in PDF form in the doc/pdf/ directory, or in HTML form at doc/html/index.html and online at http://docs.nvidia.com/cuda/index.html.

CUDA-GDB Sources

CUDA-GDB sources are available as follows:

For CUDA Toolkit 7.0 and newer, in the installation directory extras/. The directory is created by default during the toolkit installation unless the .rpm or .deb package installer is used. In this case, the cuda-gdb-src package must be manually installed.
For CUDA Toolkit 6.5, 6.0, and 5.5, at https://github.com/NVIDIA/cuda-gdb.
For CUDA Toolkit 5.0 and earlier, at ftp://download.nvidia.com/CUDAOpen64/.
Upon request by sending an e-mail to mailto:oss-requests@nvidia.com.

2. CUDA 11.0 RC Release Notes

The release notes for the CUDA® Toolkit can be found online at http://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html. This release of CUDA 11 is designated as a Release Candidate (RC).

2.1. CUDA Toolkit Major Component Versions

CUDA Components

Starting with CUDA 11, the various components in the toolkit are versioned independently.

For CUDA 11, the table below indicates the versions:

Table 1. CUDA 11 RC Component Versions

Component Name	Version Information	Supported Architectures
CUDA Runtime (cudart)	11.0.171	x86_64, POWER, Arm64
cuobjdump	11.0.167	x86_64, POWER, Arm64
CUPTI	11.0.167	x86_64, POWER, Arm64
CUDA Demo Suite	11.0.167	x86_64
CUDA GDB	11.0.172	x86_64, POWER, Arm64
CUDA Memcheck	11.0.167	x86_64, POWER
CUDA NVCC	11.0.167	x86_64, POWER, Arm64
CUDA nvdisasm	11.0.167	x86_64, POWER, Arm64
CUDA NVML Headers	11.0.167	x86_64, POWER, Arm64
CUDA nvprof	11.0.167	x86_64, POWER, Arm64
CUDA nvprune	11.0.167	x86_64, POWER, Arm64
CUDA NVRTC	11.0.167	x86_64, POWER, Arm64
CUDA NVTX	11.0.167	x86_64, POWER, Arm64
CUDA NVVP	11.0.167	x86_64, POWER
CUDA Samples	11.0.167	x86_64, POWER, Arm64
CUDA Compute Sanitizer API	11.0.167	x86_64, POWER, Arm64
CUDA cuBLAS	11.0.0.191	x86_64, POWER, Arm64
CUDA cuFFT	10.1.3.191	x86_64, POWER, Arm64
CUDA cuRAND	10.2.0.191	x86_64, POWER, Arm64
CUDA cuSOLVER	10.4.0.191	x86_64, POWER, Arm64
CUDA cuSPARSE	11.0.0.191	x86_64, POWER, Arm64
CUDA NPP	11.0.0.191	x86_64, POWER, Arm64
CUDA nvJPEG	11.0.0.191	x86_64, POWER, Arm64
Nsight Eclipse Plugins	11.0.167	x86_64, POWER
Nsight Compute	2020.1.0.33	x86_64, POWER, Arm64
Nsight Windows NVTX	1.21018621	x86_64, POWER, Arm64
Nsight Systems	2020.2.5.8	x86_64, POWER, Arm64
Nsight Visual Studio Edition (VSE)	2020.1.0.20115	x86_64 (Windows)
NVIDIA Linux Driver	450.36.06	x86_64, POWER, Arm64
NVIDIA Windows Driver	451.22	x86_64 (Windows)

CUDA Driver

Running a CUDA application requires a system with at least one CUDA-capable GPU and a driver that is compatible with the CUDA Toolkit. See Table 2. For more information on the various GPU products that are CUDA capable, visit https://developer.nvidia.com/cuda-gpus.

Each release of the CUDA Toolkit requires a minimum version of the CUDA driver. The CUDA driver is backward compatible, meaning that applications compiled against a particular version of CUDA will continue to work on subsequent (later) driver releases.

More information on compatibility can be found at https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#cuda-runtime-and-driver-api-version.

Table 2. CUDA Toolkit and Compatible Driver Versions

CUDA Toolkit	Linux x86_64 Driver Version	Windows x86_64 Driver Version
CUDA 11.0.189 RC	>= 450.36.06	>= 451.22
CUDA 10.2.89	>= 440.33	>= 441.22
CUDA 10.1 (10.1.105 general release, and updates)	>= 418.39	>= 418.96
CUDA 10.0.130	>= 410.48	>= 411.31
CUDA 9.2 (9.2.148 Update 1)	>= 396.37	>= 398.26
CUDA 9.2 (9.2.88)	>= 396.26	>= 397.44
CUDA 9.1 (9.1.85)	>= 390.46	>= 391.29
CUDA 9.0 (9.0.76)	>= 384.81	>= 385.54
CUDA 8.0 (8.0.61 GA2)	>= 375.26	>= 376.51
CUDA 8.0 (8.0.44)	>= 367.48	>= 369.30
CUDA 7.5 (7.5.16)	>= 352.31	>= 353.66
CUDA 7.0 (7.0.28)	>= 346.46	>= 347.62
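
The ">=" rule in Table 2 amounts to a field-by-field comparison of dotted version strings. The following is a minimal sketch of that comparison in C++, assuming driver versions consist only of dot-separated non-negative integers. The helper names `parse_version` and `driver_meets_minimum` are hypothetical and are not part of any NVIDIA API.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: split a string such as "450.36.06" into {450, 36, 6}.
std::vector<int> parse_version(const std::string& text) {
    std::vector<int> parts;
    std::istringstream stream(text);
    std::string field;
    while (std::getline(stream, field, '.')) {
        parts.push_back(std::stoi(field));
    }
    return parts;
}

// Returns true when `installed` is greater than or equal to `required`.
// Fields are compared left to right; missing trailing fields count as 0.
bool driver_meets_minimum(const std::string& installed,
                          const std::string& required) {
    const std::vector<int> a = parse_version(installed);
    const std::vector<int> b = parse_version(required);
    const std::size_t n = a.size() > b.size() ? a.size() : b.size();
    for (std::size_t i = 0; i < n; ++i) {
        const int x = i < a.size() ? a[i] : 0;
        const int y = i < b.size() ? b[i] : 0;
        if (x != y) {
            return x > y;
        }
    }
    return true;  // every field is equal
}
```

For example, `driver_meets_minimum("440.33", "450.36.06")` evaluates to false, which matches the table: a driver that satisfies CUDA 10.2.89 does not satisfy the CUDA 11.0 RC minimum on Linux.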

For convenience, the NVIDIA driver is installed as part of the CUDA Toolkit installation. Note that this driver is for development purposes and is not recommended for use in production with Tesla GPUs.

For running CUDA applications in production with Tesla GPUs, it is recommended to download the latest driver for Tesla GPUs from the NVIDIA driver downloads site at http://www.nvidia.com/drivers.

During the installation of the CUDA Toolkit, the installation of the NVIDIA driver may be skipped on Windows (when using the interactive or silent installation) or on Linux (by using meta packages).

For more information on customizing the install process on Windows, see http://docs.nvidia.com/cuda/cuda-installation-guide-microsoft-windows/index.html#install-cuda-software.

For meta packages on Linux, see https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#package-manager-metas

2.2. General CUDA

CUDA 11.0 adds support for the NVIDIA Ampere GPU microarchitecture (compute_80 and sm_80).
CUDA 11.0 adds support for NVIDIA A100 GPUs and systems that are based on A100. The A100 GPU adds the following capabilities for compute via CUDA:
- Alternate floating point data format Bfloat16 (__nv_bfloat16) and compute type TF32 (tf32)
- Double precision matrix multiply accumulate through the DMMA instruction (see note on WMMA in CUDA C++ and mma in PTX)
- Support for asynchronous copy instructions that allow copying of data asynchronously (LDGSTS instruction and the corresponding cp.async.* PTX instructions)
- Cooperative groups improvements, which allow reduction operation across threads in a warp (using the redux.sync instruction)
- Support for hardware partitioning via Multi-Instance GPU (MIG). See the driver release notes for more information on the corresponding NVML APIs and nvidia-smi CLI tools for configuring MIG instances
Added the 7.0 version of the Parallel Thread Execution instruction set architecture (ISA). For more details on new (sm_80 target, new instructions, new floating point data types in .bf16, .tf32, and new mma shapes) and deprecated instructions, see this section in the PTX documentation.
CUDA 11.0 adds support for the Arm server platform (arm64 SBSA). Note that with this release, only the following platforms are supported with Tesla V100 GPU:
- HPE Apollo 70 (using Marvell ThunderX2™ CN99XX)
- Gigabyte R2851 (using Marvell ThunderX2™ CN99XX)
- Huawei TaiShan 2280 V2 (using Huawei Kunpeng 920)
CUDA supports a wide range of Linux and Windows distributions. For a full list of supported operating systems, see the system requirements. The following new Linux distributions are supported in CUDA 11.0:

For x86 (x86_64):
- Red Hat Enterprise Linux (RHEL) 8.1
- Ubuntu 18.04.4 LTS
For Arm (arm64):
- SUSE SLES 15.1
For POWER (ppc64le):
- Red Hat Enterprise Linux (RHEL) 8.1
CUDA C++ includes support for new data types to support new 16-bit floating point data (with 1-sign bit, 8-bit exponent and 7-bit mantissa): __nv_bfloat16 and __nv_bfloat162. See include/cuda_bf16.hpp and the CUDA Math API for more information on the datatype definition and supported arithmetic operations.
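
To make the 1/8/7 bit layout concrete, the sketch below emulates it on the host by keeping the upper 16 bits of an IEEE-754 binary32 value. It is an illustration only and is not the cuda_bf16 implementation: the function names are hypothetical, round-to-nearest-even is an assumed rounding mode, and NaN inputs are not handled.

```cpp
#include <cstdint>
#include <cstring>

// Host-side emulation of the bfloat16 layout: 1 sign bit, 8 exponent bits,
// 7 mantissa bits. These are the upper 16 bits of a binary32 value.
// Assumption: the input is finite (NaN is not treated specially).
std::uint16_t float_to_bf16_bits(float value) {
    std::uint32_t bits;
    std::memcpy(&bits, &value, sizeof(bits));
    // Round to nearest, ties to even, before dropping the low 16 bits.
    const std::uint32_t rounding_bias = 0x7FFFu + ((bits >> 16) & 1u);
    return static_cast<std::uint16_t>((bits + rounding_bias) >> 16);
}

// Widen the 16-bit pattern back to binary32 by zero-filling the low bits.
float bf16_bits_to_float(std::uint16_t pattern) {
    const std::uint32_t bits = static_cast<std::uint32_t>(pattern) << 16;
    float value;
    std::memcpy(&value, &bits, sizeof(value));
    return value;
}
```

Because the exponent field is the same width as in binary32, the conversion keeps the dynamic range of float and only gives up mantissa precision. In device code, the `__nv_bfloat16` type and the conversions from the CUDA Math API would be used instead of this emulation.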
CUDA 11.0 adds the following support for WMMA:
- Added support for double (FP64) to the list of available input/output types for 8x8x4 shapes (DMMA.884)
- AND bitwise operation supported for BMMA
- Added support for __nv_bfloat16 and tf32 precision formats for the HMMA 16x16x8 shape
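
As a point of reference for the 8x8x4 double-precision shape, the sketch below computes D = A * B + C for a single tile on the CPU. It is a plain reference loop and not the WMMA API. The row-major layouts, the type aliases, and the function name are assumptions chosen for illustration.

```cpp
#include <array>

// Tile dimensions for the 8x8x4 shape: A is 8x4, B is 4x8, C and D are 8x8.
constexpr int kTileM = 8;
constexpr int kTileN = 8;
constexpr int kTileK = 4;

using TileA = std::array<double, kTileM * kTileK>;  // row-major (assumed)
using TileB = std::array<double, kTileK * kTileN>;  // row-major (assumed)
using TileC = std::array<double, kTileM * kTileN>;  // row-major (assumed)

// CPU reference for one matrix multiply-accumulate tile: D = A * B + C.
TileC mma_8x8x4_reference(const TileA& a, const TileB& b, const TileC& c) {
    TileC d = c;  // start from the accumulator
    for (int i = 0; i < kTileM; ++i) {
        for (int j = 0; j < kTileN; ++j) {
            for (int k = 0; k < kTileK; ++k) {
                d[i * kTileN + j] += a[i * kTileK + k] * b[k * kTileN + j];
            }
        }
    }
    return d;
}
```

A reference like this can be useful for checking the output of a tensor core kernel on small inputs, keeping in mind that the accumulation order on hardware may differ.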
Added support for cooperative kernels in CUDA graphs, including stream capture for cuLaunchCooperativeKernel.
The CUDA_VISIBLE_DEVICES variable has been extended to add support for enumerating Multi-Instance GPU (MIG) instances in NVIDIA A100/GA100 GPUs.
Added support for PCIe Relaxed Ordering for GPU initiated writes. This is not enabled by default but can be enabled by setting the following module parameter on Linux x86_64: NVreg_EnablePCIERelaxedOrderingMode.
CUDA 11.0 adds a specification for inter-task memory ordering in the “API Synchronization” subsection of the PTX memory model and allows CUDA’s implementation to be optimized consistent with this addition. In rare cases, code may have assumed a stronger ordering than required by the added specification and may notice a functional regression. The environment variable CUDA_FORCE_INTERTASK_SYSTEM_FENCE may be set to a value of “0” to disable post-10.2 inter-task fence optimizations, or “1” to enable them for 445 and newer drivers. If the variable is not set, code compiled entirely against CUDA 10.2 or older will disable the optimizations and code compiled against 11.0 or newer will enable them. Code with mixed versions may see a combination.
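
The rules for CUDA_FORCE_INTERTASK_SYSTEM_FENCE can be restated as a small decision function. The sketch below is a hypothetical model of the paragraph above, not driver code: the function name and parameters are invented for illustration, values other than “0” and “1” are assumed to behave like an unset variable, and neither mixed-version binaries nor the driver-version condition is modeled.

```cpp
#include <cstring>

// Hypothetical model of the release-note rules; the real decision is made
// inside the CUDA driver.
//
// env_value: contents of CUDA_FORCE_INTERTASK_SYSTEM_FENCE, or nullptr if unset.
// built_against_cuda11_or_newer: true when all code was compiled against
//   CUDA 11.0 or newer, false when all code was compiled against 10.2 or older.
bool intertask_fence_optimizations_enabled(const char* env_value,
                                           bool built_against_cuda11_or_newer) {
    if (env_value != nullptr) {
        if (std::strcmp(env_value, "0") == 0) {
            return false;  // explicitly disabled
        }
        if (std::strcmp(env_value, "1") == 0) {
            return true;   // explicitly enabled
        }
        // Assumption: any other value is treated like an unset variable.
    }
    return built_against_cuda11_or_newer;
}
```

A caller could pass `std::getenv("CUDA_FORCE_INTERTASK_SYSTEM_FENCE")` as the first argument to see which behavior the notes describe for the current environment.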

2.3. CUDA Tools

2.3.1. CUDA Compilers

The following new compilers are supported as host compilers for the CUDA compiler (nvcc):
- Clang 9
- GCC 9
- PGI 20.1
- ICC 19.1
- Arm C/C++ 19.2
The default compilation target for nvcc is now sm_52. Other older targets are either deprecated or no longer supported. See the Deprecated Features section for more details.
Added support for Link-Time Optimization (LTO). LTO enables cross-file inlining and optimization when doing separate compilation. To use LTO, add -dlto to both the compile and link commands, for example “nvcc -arch=sm_70 -dlto a.cu b.cu”. LTO is currently in technical preview. See the section titled “Optimization of Separate Compilation” in the nvcc manual for more information.
nvcc adds two new flags, -Wdefault-stream-launch and -Werror=default-stream-launch, which generate a warning and an error, respectively, when a stream argument is not explicitly specified in the <<<…>>> kernel launch syntax. For example:

$ cat j1.cu
__global__ void foo() { }
int main() { foo<<<1,1>>>(); }

$ nvcc -Wdefault-stream-launch j1.cu -ptx
j1.cu(2): warning: explicit stream argument not provided in kernel launch

$ nvcc -Werror=default-stream-launch j1.cu -c
j1.cu(2): error: explicit stream argument not provided in kernel launch
The compiler optimizer now implements more aggressive dead code elimination for __shared__ variables whose value is not used. For example:

__device__ void foo() {
    __shared__ int xxx;
    xxx = 1;
}

In previous CUDA toolkits, the variable “xxx” is still present in the generated PTX. With CUDA 11 or later, the variable may be removed in the generated PTX, because its value is not used. Marking the variable as “volatile” will inhibit this compiler optimization.
In previous CUDA toolkits, NVRTC on Linux incorrectly added “/usr/include” to the default header file search path. This issue has been fixed; NVRTC in CUDA 11.0 and later will not implicitly add “/usr/include” to the header file search path.

If some included files are present inside /usr/include, the NVRTC nvrtcCompileProgram() API call must now be explicitly passed the “/usr/include” path with the “-I” flag.
nvcc now allows options that take a single argument to be redefined. If the redefinition is incompatible with the earlier instance, a warning is issued. For example:

// the following command line is now accepted; previously nvcc gave an error
```
$ nvcc -rdc=true -rdc=true -c j1.cu
```
// the following command line is now accepted with a warning (due to incompatible redefinition of the ‘-rdc’ argument); previously nvcc gave an error
```
$ nvcc -rdc=true -rdc=false -c j1.cu
nvcc warning : incompatible redefinition for option 'relocatable-device-code'
```
nvcc implements a new flag, -extra-device-vectorization, which enables more aggressive vectorization of device code.
Added support for C++17.
Added support for __attribute__((visibility("default"))).

2.3.2. CUDA Developer Tools

The following developer tools are supported for remote (target) debugging/profiling of applications on macOS hosts:
- Nsight Compute
- Nsight Systems
- cuda-gdb
- NVVP
For new features, improvements, and bug fixes in CUPTI, see the changelog.
For new features, improvements, and bug fixes in Nsight Compute, see the changelog.
cuda-gdb has been upgraded to support GDB 8.2.
A new tool called Compute Sanitizer, for memory and race condition checking, is now included as part of CUDA 11.0.

2.4. CUDA Libraries

This release of the toolkit includes the following updates:

The CUDA Math libraries toolchain uses C++11 features, and a C++11-compatible standard library is required on the host.
cuBLAS 11.0.0
cuFFT 10.1.3
cuRAND 10.2.0
cuSPARSE 11.0.0
cuSOLVER 10.4.0
NPP 11.0.0
nvJPEG 11.0.0

2.4.1. cuBLAS Library

Many performance improvements have been implemented for the NVIDIA Ampere, Volta, and Turing Architecture based GPUs.
With this release, on Linux systems, the cuBLAS libraries listed below are now installed in the /usr/local/cuda-11.0 (./lib64/ for lib and ./include/ for headers) directories as shared and static libraries.
The cuBLASLt logging mechanism can be enabled by setting the following environment variables before launching the target application:
- CUBLASLT_LOG_LEVEL=<level> – where level is one of the following:
  - “0” – Off – logging is disabled (default)
  - “1” – Error – only errors will be logged
  - “2” – Trace – API calls will be logged with their parameters and important information
- CUBLASLT_LOG_FILE=<value> – where value is a file name in the format “<file_name>.%i”; %i will be replaced with the process ID. If CUBLASLT_LOG_FILE is not defined, the log messages are printed to stdout.
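
The “%i” substitution can be pictured with a few lines of C++. This is a hypothetical illustration of the naming rule only, not cuBLASLt code; the function name is invented, and replacing only the first occurrence of the placeholder is an assumption.

```cpp
#include <cstddef>
#include <string>

// Hypothetical illustration: replace the first "%i" in `pattern` with the
// process ID. A pattern without the placeholder is returned unchanged.
std::string expand_log_file_name(const std::string& pattern, long process_id) {
    std::string result = pattern;
    const std::string placeholder = "%i";
    const std::size_t position = result.find(placeholder);
    if (position != std::string::npos) {
        result.replace(position, placeholder.size(),
                       std::to_string(process_id));
    }
    return result;
}
```

With a pattern of “cublaslt.log.%i”, a process whose ID is 4242 would write to “cublaslt.log.4242” under this reading, so concurrent processes do not share one log file.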
For matrix multiplication APIs:
- cublasGemmEx, cublasGemmBatchedEx, cublasGemmStridedBatchedEx and cublasLtMatmul have new data type support for BFLOAT16 (CUDA_R_16BF).
- The newly introduced computeType_t changes function prototypes on the API: cublasGemmEx, cublasGemmBatchedEx, and cublasGemmStridedBatchedEx have a new signature that uses cublasComputeType_t for the computeType parameter. Backward compatibility is ensured with internal mapping for C users and with added overload for C++ users.
- cublasLtMatmulDescCreate, cublasLtMatmulAlgoGetIds, and cublasLtMatmulAlgoInit have new signatures that use cublasComputeType_t.
- A new compute type TensorFloat32 (TF32) has been added to provide tensor core acceleration for FP32 matrix multiplication routines with full dynamic range and increased precision compared to BFLOAT16.
- New compute modes Default, Pedantic, and Fast have been introduced to offer more control over compute precision used.
- *Init versions of *Create functions are introduced in cublasLt to allow for simple wrappers that hold all descriptors on stack.
- Experimental feature of cuBLASLt API logging is introduced.
- Tensor cores are now enabled by default for half- and mixed-precision matrix multiplications.
- Double precision tensor cores (DMMA) are used automatically.
- Tensor cores can now be used for all sizes and data alignments and for all GPU architectures:
  - Selection of these kernels through cuBLAS heuristics is automatic and will depend on factors such as math mode setting as well as whether it will run faster than the non-tensor core kernels.
  - Users should note that these new kernels, which use tensor cores for all unaligned cases, are expected to perform faster than non-tensor core based kernels but slower than kernels that can be run when all buffers are well aligned.
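
The precision difference between TF32 and BFLOAT16 can be shown on the host by discarding low mantissa bits of a binary32 value. The sketch below assumes the commonly documented layouts (10 mantissa bits for TF32 and 7 for BFLOAT16, both with the 8-bit exponent of FP32) and uses plain truncation as a simplification; the hardware conversion may round differently. All function names are hypothetical.

```cpp
#include <cstdint>
#include <cstring>

// Keep only the top `mantissa_bits` (0..23) of a binary32 mantissa.
// Truncation is a simplification; it is not the hardware rounding mode.
float truncate_mantissa(float value, int mantissa_bits) {
    std::uint32_t bits;
    std::memcpy(&bits, &value, sizeof(bits));
    const std::uint32_t mask = ~((1u << (23 - mantissa_bits)) - 1u);
    bits &= mask;
    std::memcpy(&value, &bits, sizeof(value));
    return value;
}

// Assumed layouts: TF32 keeps 10 mantissa bits, BFLOAT16 keeps 7.
float to_tf32_precision(float value) { return truncate_mantissa(value, 10); }
float to_bf16_precision(float value) { return truncate_mantissa(value, 7); }
```

For instance, 1 + 2^-10 survives `to_tf32_precision` unchanged but collapses to 1.0 under `to_bf16_precision`, while the exponent range is identical in both cases.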

2.4.2. cuFFT Library

Reoptimized power of 2 FFT kernels on Volta and Turing architectures.

2.4.3. cuSPARSE Library

Added new generic APIs and improved performance for sparse matrix-sparse matrix multiplication (SpGEMM): cusparseSpGEMM_workEstimation, cusparseSpGEMM_compute, and cusparseSpGEMM_copy.
SpVV: added support for __nv_bfloat16.

2.4.4. cuSOLVER Library

Added 64-bit APIs for getrf, getrs, potrf, potrs, geqrf, syevd and syevdx.
This release adds more control and helpful functionality for the Tensor Cores Accelerated Iterative Refinement Solver (TCAIRS).
- In addition to the previously released TCAIRS-LU based solver, a new TCAIRS-QR based solver for real and complex systems with one or multiple right-hand sides is introduced.
- In addition to the FP64, FP32, and FP16 computational precisions, two new computational precision types are supported: BFLOAT16 and TensorFloat32 (TF32). Both TCAIRS-LU and TCAIRS-QR come with all five computational precision options. TensorFloat32 (TF32), introduced with NVIDIA Ampere Architecture GPUs, is the most robust tensor core accelerated compute mode for the iterative refinement solver. It is able to solve the widest range of problems in HPC arising from different applications and provides up to 4X and 5X speedup for real and complex systems, respectively. On Volta and Turing architecture GPUs, half precision tensor core acceleration is recommended. In cases where the iterative refinement solver fails to converge to the desired accuracy (double precision in most cases), it is recommended to use full double precision factorization and solve (such as [D,Z]GETRF and [D,Z]GETRS or cusolverDn[DD,ZZ]gesv).
- TCAIRS (LU and QR) are released with easy LAPACK-style APIs (drop-in replacement) as well as expert generic APIs that give users a lot of control over the internals of the solver. These support all five computational precisions.
- Simple and Expert APIs now support all five computational precisions.
- Expert TCAIRS solvers APIs allow users to choose between 4 methods of refinement.
- Expert TCAIRS solvers APIs now support a no-refinement option which means they behave as standard Xgesv/Xgels solvers without refinement.
Performance improvements of the TCAIRS solver for NVIDIA Ampere, Volta, and Turing Architecture based GPUs.

2.4.5. NVIDIA Performance Primitives (NPP)

Added batching support for nppiLabelMarkersUF functions.
Added the nppiCompressMarkerLabelsUF_32u_C1IR function.
Added nppiSegmentWatershed functions.
Added sample apps on GitHub demonstrating the use of NPP application managed stream contexts along with watershed segmentation and batched and compressed UF image label markers functions.
Added support for non-blocking streams.

2.4.6. nvJPEG

Hardware accelerated decode is now supported on NVIDIA A100.
The nvJPEG decode API (nvjpegDecodeJpeg()) now has the flexibility to select the backend when creating nvjpegJpegDecoder_t object. The user has the option to call this API instead of making three separate calls to nvjpegDecodeJpegHost(), nvjpegDecodeJpegTransferToDevice(), and nvjpegDecodeJpegDevice().

2.4.7. CUDA Math API

Added arithmetic support for the __nv_bfloat16 floating-point data type with 8 bits of exponent and 7 explicit bits of mantissa.
Performance and accuracy improvements in single precision math functions: fmodf, expf, exp10f, sinhf, and coshf.

2.5. Deprecated and Dropped Features

The following features are deprecated or dropped in the current release of the CUDA software. Deprecated features still work in the current release, but their documentation may have been removed, and they will become officially unsupported in a future release. We recommend that developers employ alternative solutions to these features in their software.

General CUDA

Support for Red Hat Enterprise Linux (RHEL) and CentOS 6.x is dropped.
Support for Kepler sm_30 and sm_32 architecture based products is dropped.
Support for the following compute capabilities is deprecated in the CUDA Toolkit:
- sm_35 (Kepler)
- sm_37 (Kepler)
- sm_50 (Maxwell)
For more information on GPU products and compute capability, see https://developer.nvidia.com/cuda-gpus.
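
A build script that needs to react to these lists could encode them as a simple lookup. The sketch below is hypothetical: the enum, the function name, and the convention of passing the compute capability as an integer (35 for sm_35) are assumptions, and any value not named in the lists above is reported as not listed rather than guessed.

```cpp
// Status of a compute capability according to the General CUDA lists above.
enum class SupportStatus { Dropped, Deprecated, NotListed };

// Hypothetical lookup: pass 35 for sm_35, 50 for sm_50, and so on.
SupportStatus cuda11_support_status(int compute_capability) {
    switch (compute_capability) {
        case 30:
        case 32:
            return SupportStatus::Dropped;      // Kepler sm_30 and sm_32
        case 35:
        case 37:
        case 50:
            return SupportStatus::Deprecated;   // Kepler sm_35/sm_37, Maxwell sm_50
        default:
            return SupportStatus::NotListed;    // not mentioned in these lists
    }
}
```

A script could, for example, emit a warning for `Deprecated` targets and fail early for `Dropped` ones.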
Support for Linux cluster packages is dropped.
CUDA 11.0 does not support macOS for developing and running CUDA applications. Note that some of the CUDA developer tools are still supported on macOS hosts for remote (target) debugging and profiling. See the CUDA Tools section for more information.
CUDA 11.0 no longer supports development of CUDA applications on the following Windows distributions:
- Windows 7
- Windows 8
- Windows Server 2012 R2
nvGraph is no longer included as part of the CUDA Toolkit installers. See the cuGraph project as part of RAPIDS; the project includes algorithms from nvGraph and more.
The context creation flag CU_CTX_MAP_HOST (to support mapped pinned allocations) is deprecated and will be removed in a future release of CUDA.

CUDA Developer Tools

Nsight Eclipse Edition standalone is dropped in CUDA 11.0.
Nsight Compute does not support profiling on Pascal architectures.
Nsight VSE, Nsight EE Plugin, cuda-gdb, nvprof, Visual Profiler, and memcheck are reducing support for the following architectures:
- Support for Kepler sm_30 and sm_32 architecture based products (deprecated since CUDA 10.2) has been dropped.
- Support for the following compute capabilities (deprecated since CUDA 10.2) will be dropped in an upcoming CUDA release:
  - sm_35 (Kepler)
  - sm_37 (Kepler)
  - sm_50 (Maxwell)

CUDA Libraries – cuBLAS

Algorithm selection in cublasGemmEx APIs (including batched variants) is non-functional for NVIDIA Ampere Architecture GPUs. Regardless of selection it will default to a heuristics selection. Users are encouraged to use the cublasLt APIs for algorithm selection functionality.
The matrix multiply math mode CUBLAS_TENSOR_OP_MATH is being deprecated and will be removed in a future release. Users are encouraged to use the new cublasComputeType_t enumeration to define compute precision.

CUDA Libraries — cuSOLVER

TCAIRS-LU expert cusolverDnIRSXgesv() and some of its configuration functions undergo a minor API change.

CUDA Libraries – cuSPARSE

The following functions have been removed:

Hybrid format enums and helper functions: cusparseHybPartition_t, cusparseCreateHybMat, cusparseDestroyHybMat
Triangular solver enums and helper functions: cusparseSolveAnalysisInfo_t, cusparseCreateSolveAnalysisInfo, cusparseDestroySolveAnalysisInfo
Sparse dot product: cusparseXdoti, cusparseXdotci
Sparse matrix-vector multiplication: cusparseXcsrmv, cusparseXcsrmv_mp
Sparse matrix-matrix multiplication: cusparseXcsrmm, cusparseXcsrmm2
Sparse triangular-single vector solver: cusparseXcsrsv_analysis, cusparseCsrsv_analysisEx, cusparseXcsrsv_solve, cusparseCsrsv_solveEx
Sparse triangular-multiple vectors solver: cusparseXcsrsm_analysis, cusparseXcsrsm_solve
Sparse hybrid format solver: cusparseXhybsv_analysis, cusparseShybsv_solve
Extra functions: cusparseXcsrgeamNnz, cusparseScsrgeam, cusparseXcsrgemmNnz, cusparseXcsrgemm
Incomplete Cholesky Factorization, level 0: cusparseXcsric0
Incomplete LU Factorization, level 0: cusparseXcsrilu0, cusparseCsrilu0Ex
Tridiagonal Solver: cusparseXgtsv, cusparseXgtsv_nopivot
Batched Tridiagonal Solver: cusparseXgtsvStridedBatch
Reordering: cusparseXcsc2hyb, cusparseXcsr2hyb, cusparseXdense2hyb, cusparseXhyb2csc, cusparseXhyb2csr, cusparseXhyb2dense

The following functions have been deprecated:

SpGEMM: cusparseXcsrgemm2_bufferSizeExt, cusparseXcsrgemm2Nnz, cusparseXcsrgemm2

CUDA Libraries — nvJPEG

The following multiphase APIs have been removed:
- nvjpegStatus_t NVJPEGAPI nvjpegDecodePhaseOne
- nvjpegStatus_t NVJPEGAPI nvjpegDecodePhaseTwo
- nvjpegStatus_t NVJPEGAPI nvjpegDecodePhaseThree
- nvjpegStatus_t NVJPEGAPI nvjpegDecodeBatchedPhaseOne
- nvjpegStatus_t NVJPEGAPI nvjpegDecodeBatchedPhaseTwo

2.6. Resolved Issues

2.6.1. General CUDA

Fixed an issue where GPU passthrough on arm64 systems was not functional. GPU passthrough is now supported on arm64, but there may be a small performance impact to workloads (compared to bare-metal) on some system configurations.
Fixed an issue where starting X on systems with arm64 CPUs and NVIDIA GPUs would result in a crash.

2.6.2. CUDA Tools

Fixed an issue where NVCC threw a compilation error when a value greater than 32768 was used in __attribute__((aligned(value))).
Fixed an issue in PTXAS where a 64-bit integer modulo operation resulted in illegal memory access.
Fixed an issue with nvcc where code using the __is_implicitly_default_constructible type trait would result in an access violation.
Fixed an issue where NVRTC (nvrtcCompileProgram()) would enter into infinite loops triggered by some code patterns.
Fixed a compilation time issue in nvcc to improve handling of large numbers of explicit specialization of function templates.

2.6.3. cuFFT Library

Reduced R2C/C2R plan memory usage to previous levels.
Resolved bug introduced in 10.1 update 1 that caused incorrect results when using custom strides, batched 2D plans and certain sizes on Volta and later.

2.6.4. cuRAND Library

Introduced CURAND_ORDERING_PSEUDO_LEGACY ordering. Starting with CUDA 10.0, the ordering of random numbers returned by the MTGP32 and MRG32k3a generators is no longer the same as in previous releases, despite being guaranteed by the documentation for the CURAND_ORDERING_PSEUDO_DEFAULT setting. CURAND_ORDERING_PSEUDO_LEGACY provides the pre-CUDA 10.0 ordering for the MTGP32 and MRG32k3a generators.
Starting with CUDA 11.0, CURAND_ORDERING_PSEUDO_DEFAULT is the same as CURAND_ORDERING_PSEUDO_BEST for all generators except MT19937. Only CURAND_ORDERING_PSEUDO_LEGACY is guaranteed to provide the same ordering in all future cuRAND releases.

2.6.5. cuSOLVER Library

Fixed a race condition of GETRF when running with other kernels concurrently.
Fixed the pivoting strategy of [c|z]getrf to be compliant with LAPACK.
Fixed NAN and INF values that might result in the TCAIRS-LU solver when FP16 was used and matrix entries are outside FP16 range.
Previously, cusolverSpDcsrlsvchol could overflow 32-bit signed integer when zero fill-in is huge. Such overflow causes memory corruption. cusolverSpDcsrlsvchol now returns CUSOLVER_STATUS_ALLOC_FAILED when integer overflow happens.

2.6.6. CUDA Math API

Corrected documented maximum ulp error thresholds in erfcinvf and powf.
Improved cuda_fp16.h interoperability with Visual Studio C++ compiler.
Updated libdevice user guide and CUDA math API definitions for j1, j1f, fmod, fmodf, ilogb, and ilogbf math functions.

2.6.7. NVIDIA Performance Primitives (NPP)

Improved quality of nppiLabelMarkersUF functions.
nppiCompressMarkerLabelsUF_32u_C1IR can now handle a huge number of labels generated by the nppiLabelMarkersUF function.

2.6.8. CUDA Profiling Tools Interface (CUPTI)

The cuptiFinalize() API now allows on-demand detachability of the profiling tool.

2.7. Known Issues

2.7.1. General CUDA

The nanosleep PTX instruction for Volta and Turing is not supported in this release of CUDA. It may be fully supported in a future release of CUDA. There may be references to nanosleep in the compiler headers (such as include/crt/sm_70_rt*). Developers are encouraged not to use this instruction in their CUDA applications on Volta and Turing until it is fully supported.
Read-only memory mappings (via CU_MEM_ACCESS_FLAGS_PROT_READ in CUmemAccess_flags) with cuMemSetAccess() API will result in an error. Read-only memory mappings are currently not supported and may be added in a future release of CUDA.
Note that the R450 driver bundled with this release of CUDA 11 does not officially support Windows 10 May 2020 Update and may have issues.
GPU workloads are executed on GPU hardware engines. On Windows, these engines are represented by “nodes”. With Hardware Scheduling disabled for Windows 10 May 2020 Update, some NVIDIA GPU engines are represented by virtual nodes, and multiple virtual nodes may represent more than one GPU hardware engine. This is done to achieve better parallel execution of workloads. Examples of these virtual nodes are “Cuda”, “Compute_0”, “Compute_1”, and “Graphics_1” as shown in Windows Task Manager. These correspond to the same underlying hardware engines as the “3D” node in Windows Task Manager. With Hardware Scheduling enabled, the virtual nodes are no longer needed, and Task Manager shows only a single “3D” node that combines the previous “3D” node and the multiple virtual nodes shown before. CUDA is still supported in this scenario.

2.7.2. CUDA Tools

The legacy profiling tools nvprof and NVVP do not support the NVIDIA Ampere architecture.
Arithmetic is not supported on __nv_bfloat16 floating point variables in the Nsight debugger watch window.
In some cases, cuda-gdb has a dependency on Python that can be resolved by installing the libpython-dev packages on Linux. For example, on Ubuntu use: sudo apt install libpython-dev.
For remote debugging on macOS with cuda-gdb, disassembly of code is not supported and may return an error. This issue will be addressed in the production release of CUDA 11.0.

2.7.3. CUDA Compiler

Sample 0_Simple/simpleSeparateCompilation fails to build with the error “cc: unknown target ‘gcc_ntox86’”. As a workaround, additionally pass EXTRA_NVCCFLAGS="-arbin $QNX_HOST/usr/bin/aarch64-unknown-nto-qnx7.0.0-ar" to the build.

2.7.4. NVIDIA Performance Primitives (NPP)

The nppiCopy API is limited by CUDA thread limits for large image sizes. The maximum supported image size is at least 16 * 65,535 = 1,048,560 horizontal pixels (for any data type and number of channels) by 8 * 65,535 = 524,280 vertical pixels, for a maximum total of 549,739,036,800 pixels.
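
The arithmetic behind these limits can be checked at compile time. The sketch below restates the figures quoted above in C++; the constant and function names are hypothetical, and treating the quoted bounds as a conservative acceptance check is an assumption, since the note describes them as a minimum guaranteed size.

```cpp
#include <cstdint>

// Figures quoted in the known issue above. The factor 65,535 is taken from
// the note as-is; its origin is not stated there.
constexpr std::int64_t kLimitFactor = 65535;
constexpr std::int64_t kMaxWidthPixels = 16 * kLimitFactor;
constexpr std::int64_t kMaxHeightPixels = 8 * kLimitFactor;
constexpr std::int64_t kMaxTotalPixels = kMaxWidthPixels * kMaxHeightPixels;

static_assert(kMaxWidthPixels == 1048560, "horizontal limit");
static_assert(kMaxHeightPixels == 524280, "vertical limit");
static_assert(kMaxTotalPixels == 549739036800LL, "total pixel limit");

// Conservative check: true when an image stays within the documented bounds.
bool within_documented_nppi_copy_limit(std::int64_t width, std::int64_t height) {
    return width > 0 && height > 0 &&
           width <= kMaxWidthPixels && height <= kMaxHeightPixels;
}
```

The total exceeds the range of a 32-bit integer, which is why the sketch uses 64-bit arithmetic throughout.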

2.7.5. nvJPEG

NVJPEG_BACKEND_GPU_HYBRID has an issue when handling bit-streams which have corruption in the scan.

More About CUDA

In GPU-accelerated applications, the sequential part of the workload runs on the CPU – which is optimized for single-threaded performance – while the compute intensive portion of the application runs on thousands of GPU cores in parallel. When using CUDA, developers program in popular languages such as C, C++, Fortran, Python and MATLAB and express parallelism through extensions in the form of a few basic keywords.

The CUDA Toolkit from NVIDIA provides everything you need to develop GPU-accelerated applications. The CUDA Toolkit includes GPU-accelerated libraries, a compiler, development tools and the CUDA runtime.
