At the 2016 International Supercomputing Conference, NVIDIA introduced the NVIDIA® Tesla® P100 GPU accelerator for PCIe servers. The PCIe variant was announced to meet the unprecedented computational demands placed on modern data centers. The Tesla P100 is anticipated to deliver massive leaps in performance and value compared with CPU-based systems.
HPC data centers need to support the ever-growing demands of scientists and researchers while staying within a tight budget. The old approach of deploying lots of commodity compute nodes with vast interconnect overhead has shown that substantial costs don’t necessarily translate into a large increase in data center performance. The NVIDIA Tesla P100 accelerators should help remedy this situation, as they are designed to boost throughput and save money for HPC and hyperscale data centers. Powered by the brand new NVIDIA Pascal™ architecture, Tesla P100 for PCIe-based servers enables a single node to replace up to half a rack of commodity CPU nodes by delivering lightning-fast performance in a broad range of HPC applications. Handling the same workload with far fewer nodes means customers can save up to 70% in overall data center costs.
Earlier this year at the 2016 GPU Technology Conference, NVIDIA showcased the Tesla P100 along with the new DGX-1® supercomputer. It was the first time we saw a Pascal-powered Tesla GPU and NVIDIA’s new mezzanine connector, also known as the SXM2 interface. While previous Tesla series GPUs came in a PCIe form factor, the SXM2 connector was a necessary upgrade to take full advantage of NVIDIA’s high-speed NVLink bus. However, NVIDIA recognized that not all users will want to build their systems around mezzanine connections, so naturally a PCIe version of the P100 was also created.
NVIDIA Tesla Family Specification Comparison
|Specification||Tesla P100 (SXM2)||Tesla P100 (16GB)||Tesla P100 (12GB)||Tesla M40|
|Memory Clock||1.4Gbps HBM2||1.4Gbps HBM2||1.4Gbps HBM2||6Gbps GDDR5|
|Memory Bus Width||4096-bit||4096-bit||3072-bit||384-bit|
|Half Precision||21.2 TFLOPS||18.7 TFLOPS||18.7 TFLOPS||6.8 TFLOPS|
|Single Precision||10.6 TFLOPS||9.3 TFLOPS||9.3 TFLOPS||6.8 TFLOPS|
|Double Precision||5.3 TFLOPS||4.7 TFLOPS||4.7 TFLOPS||213 GFLOPS|
|Max Power Consumption||300W||250W||250W||250W|
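The memory rows above imply a peak bandwidth figure that the table does not list directly. As a rough sketch, assuming the usual approximation that theoretical peak bandwidth equals bus width times per-pin data rate divided by 8 bits per byte, the numbers can be worked out as follows. The function and variable names are illustrative only, and the results are theoretical peaks rather than measured throughput.

```python
# Theoretical peak memory bandwidth from the table's bus width and data rate.
# Assumption: bandwidth (GB/s) = bus_width_bits * data_rate_gbps / 8 bits per byte.


def peak_bandwidth_gbs(bus_width_bits: int, data_rate_gbps: float) -> float:
    """Return theoretical peak memory bandwidth in GB/s."""
    return bus_width_bits * data_rate_gbps / 8


# (bus width in bits, per-pin data rate in Gbps), copied from the comparison table
cards = {
    "Tesla P100 (SXM2)": (4096, 1.4),
    "Tesla P100 (16GB)": (4096, 1.4),
    "Tesla P100 (12GB)": (3072, 1.4),
    "Tesla M40": (384, 6.0),
}

bandwidth = {name: peak_bandwidth_gbs(*spec) for name, spec in cards.items()}

for name, gbs in bandwidth.items():
    print(f"{name}: {gbs:.1f} GB/s")
```

Under this approximation the 4096-bit P100 variants come out at roughly 717 GB/s, the 12 GB variant at roughly 538 GB/s, and the Tesla M40 at 288 GB/s, so the wide HBM2 bus gives about 2.5 times the M40's peak bandwidth despite the much lower per-pin data rate.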
The PCIe-based Tesla P100 will come in two versions: one with a 4096-bit memory bus, 16 GB of VRAM, and a 4 MB L2 cache, and a second with a 3072-bit memory bus, 12 GB of VRAM, and a 3 MB L2 cache. The mezzanine version has a boost clock of 1.48 GHz and the PCIe version 1.3 GHz, so the latter is essentially a downclocked version of the former. Though the “underdog,” the P100 for PCIe-based servers is definitely not “underpowered,” as it still delivers 18.7 TFLOPS of half-precision performance, more than capable of handling massive server banks.
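The "downclocked" description can be sanity-checked with a little arithmetic. The sketch below assumes that both variants enable the same number of cores, so peak throughput scales linearly with boost clock; that assumption is ours, not something NVIDIA states here, and the helper name is illustrative.

```python
# Rough check: scale the SXM2 peak figures by the boost-clock ratio and compare
# with the PCIe figures in the comparison table.
# Assumption: both variants enable the same number of cores, so peak throughput
# scales linearly with boost clock.

SXM2_BOOST_GHZ = 1.48  # boost clock quoted for the mezzanine (SXM2) version
PCIE_BOOST_GHZ = 1.3   # boost clock quoted for the PCIe version

# Peak TFLOPS taken from the comparison table
sxm2_table = {"half": 21.2, "single": 10.6, "double": 5.3}
pcie_table = {"half": 18.7, "single": 9.3, "double": 4.7}


def scale_by_clock(tflops: float, from_ghz: float, to_ghz: float) -> float:
    """Scale a peak throughput figure linearly with clock frequency."""
    return tflops * to_ghz / from_ghz


pcie_estimate = {
    precision: scale_by_clock(tflops, SXM2_BOOST_GHZ, PCIE_BOOST_GHZ)
    for precision, tflops in sxm2_table.items()
}

for precision, estimate in pcie_estimate.items():
    print(f"{precision}: estimated {estimate:.2f} TFLOPS, "
          f"table {pcie_table[precision]} TFLOPS")
```

Under this assumption each estimate lands within about 0.1 TFLOPS of the published PCIe figure, which is consistent with the PCIe card being the same GPU running at a lower clock.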
Tesla P100 is said to be “reimagined from silicon to software,” crafted with innovation at every level. It features four groundbreaking technologies that deliver a dramatic jump in performance:
• New Pascal Architecture: Delivers 5.3 and 10.6 TeraFLOPS of double- and single-precision performance for HPC, and 21.2 TeraFLOPS of FP16 performance for deep learning (figures for the SXM2 version).
• NVLink: The world’s first high-speed interconnect for multi-GPU scalability, with a 5X boost in performance (NVLink is not featured on the PCIe version).
• CoWoS® with HBM2: Unifies data and compute in a single package for up to 3X the memory bandwidth of prior-generation solutions.
• Page Migration Engine: Simplifies parallel programming by enabling applications to work with datasets larger than the physical GPU memory.
The PCIe-based NVIDIA Tesla P100 GPU accelerator is expected to be available beginning in Q4 2016 from Exxact Corporation. Exxact Tensor Series servers featuring the Tesla P100 will also be available in Q4 2016. Users who are not planning to go the data center route, or who want to test the P100 before the PCIe variant ships, should consider NVIDIA’s P100-powered DGX-1® supercomputer. It features eight Tesla P100 accelerators delivering 170 teraflops of peak half-precision performance, equivalent to 250 CPU-based servers, and can be ordered through Exxact.