The Intel Xeon Phi Knights Landing (KNL) processor supports the new 512-bit Advanced Vector Extensions (AVX-512) instruction set, along with several AVX-512 extensions. Only AVX-512F (Foundation) is a base requirement of the AVX-512 ISA.

Intel KNL Supported AVX-512 Extensions

Extension     Description
AVX-512F      Foundation
AVX-512CD     Conflict Detection
AVX-512PF     Prefetch
AVX-512ER     Exponents and Reciprocals
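
As a rough sketch of how code might test for these extensions at run time, the C functions below decode the AVX-512 feature bits that CPUID leaf 7 (sub-leaf 0) reports in the EBX register. The struct and function names are hypothetical. The bit positions (16 for F, 26 for PF, 27 for ER, 28 for CD) follow Intel's published CPUID layout. The sketch assumes GCC or Clang on x86 for `<cpuid.h>`, and it leaves out the OS-support check (XGETBV) that a production probe would also need.

```c
#include <stdbool.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>   /* GCC/Clang wrapper around the CPUID instruction */
#endif

/* Hypothetical container for the four extensions listed in the table. */
struct avx512_features {
    bool f;    /* AVX-512F:  Foundation */
    bool cd;   /* AVX-512CD: Conflict Detection */
    bool pf;   /* AVX-512PF: Prefetch */
    bool er;   /* AVX-512ER: Exponents and Reciprocals */
};

/* Decode the EBX value returned by CPUID leaf 7, sub-leaf 0. */
struct avx512_features decode_avx512_ebx(uint32_t ebx)
{
    struct avx512_features feat;
    feat.f  = ((ebx >> 16) & 1u) != 0;
    feat.pf = ((ebx >> 26) & 1u) != 0;
    feat.er = ((ebx >> 27) & 1u) != 0;
    feat.cd = ((ebx >> 28) & 1u) != 0;
    return feat;
}

/* Query the running CPU; reports "nothing supported" on non-x86 builds
 * or when leaf 7 is unavailable. */
struct avx512_features query_avx512(void)
{
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
#if defined(__x86_64__) || defined(__i386__)
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        ebx = 0;
#endif
    (void)eax; (void)ecx; (void)edx;
    return decode_avx512_ebx(ebx);
}
```

On a KNL system, `query_avx512()` would be expected to report all four flags; on other AVX-512 processors the PF and ER flags are typically clear.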

Writing the most performant code for the Intel Xeon Phi Knights Landing processor requires a detailed understanding of the latency and bandwidth of the different instructions and operations.

 

Instruction Latency Tables for Vector Vs. Scalar Instructions

Vector Instructions

Instruction                   Latency (cycles)   Bandwidth (instr/cycle)
Simple Int                    2                  2
FMA Vectorizations            6                  2
Mask Ops                      2                  2
X87 / MMX                     6                  1
EMU (AVX-512ER)               7                  0.5
Shuffle / Permutes (1 src)    2                  1
Shuffle / Permutes (2 src)    3                  0.5
Convert – Same Width          2                  1
Convert – Different Width     6                  0.2
Vector Loads                  5                  2
Store to load forwarding      2                  2
Gather (8 elems)              15                 0.2
Gather (16 elems)             19                 0.1
Float to Int Move             2                  1
Int to Float Move             4                  1
DIVSS or SQRTSS               25                 0.05
DIVSD or SQRTSD               40                 0.03
Packed DIV or SQRT            38                 0.1

Scalar Instructions

Instruction                   Latency (cycles)   Bandwidth (instr/cycle)
Math                          1                  2
Int Multiply                  3 or 5             1
Store to load forwarding      2                  1
Integer Loads                 4                  1
Integer Division              Varies             0.05
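
To illustrate why both columns matter, the sketch below contrasts a chain of dependent operations with a stream of independent ones, and applies the usual rule of thumb that roughly latency times throughput independent operations must be in flight to keep a unit busy. It assumes the Latency column is in cycles and the Bandwidth column is instructions issued per cycle; the function names are hypothetical and the model ignores every other pipeline effect.

```c
/* Round a non-negative value up to a whole number without needing libm. */
static int ceil_to_int(double x)
{
    int n = (int)x;
    return (n < x) ? n + 1 : n;
}

/* Independent operations needed in flight to cover the latency. */
int ops_in_flight(double latency_cycles, double per_cycle)
{
    return ceil_to_int(latency_cycles * per_cycle);
}

/* Cycles for n operations where each one waits on the previous result. */
double dependent_chain_cycles(int n, double latency_cycles)
{
    return n * latency_cycles;
}

/* Cycles for n independent operations: the first result arrives after the
 * full latency, and the remaining ones complete at the issue rate. */
double pipelined_cycles(int n, double latency_cycles, double per_cycle)
{
    if (n <= 0)
        return 0.0;
    return latency_cycles + (n - 1) / per_cycle;
}
```

With the FMA row (latency 6, bandwidth 2), `ops_in_flight(6.0, 2.0)` gives 12. Under this simple model, 12 dependent FMAs take 72 cycles while 12 independent ones take 11.5, which suggests why a reduction loop benefits from several independent accumulators.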

Scalar Versus Vector Code Performance – Kernel Sizes

Note that vectorized code is not always faster than scalar code: vector and scalar instructions differ in latency and bandwidth, so the outcome depends on the size of the kernel being vectorized. As a general guideline, loops with more than 16 iterations run faster as vector code than as scalar code.
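
One way to act on this guideline is to pick the code path by trip count. The sketch below is an illustration under stated assumptions rather than a tuned implementation: the threshold of 16 is taken from the guideline, the function names are hypothetical, and the "vector-style" path is plain C that keeps 8 independent partial sums (one per double-precision lane of a 512-bit register) so that a compiler's auto-vectorizer has an easy target.

```c
#include <stddef.h>

#define VECTOR_TRIP_THRESHOLD 16  /* from the guideline in the text */
#define LANES 8                   /* doubles per 512-bit vector */

/* Plain scalar reduction, intended for short loops. */
double sum_scalar(const double *a, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += a[i];
    return sum;
}

/* Vector-style reduction: LANES independent partial sums, followed by a
 * horizontal reduction and a scalar loop for the leftover elements. */
double sum_blocked(const double *a, size_t n)
{
    double lane[LANES] = {0.0};
    size_t i = 0;

    for (; i + LANES <= n; i += LANES)
        for (size_t l = 0; l < LANES; l++)
            lane[l] += a[i + l];

    double sum = 0.0;
    for (size_t l = 0; l < LANES; l++)   /* horizontal reduction */
        sum += lane[l];
    for (; i < n; i++)                   /* remainder */
        sum += a[i];
    return sum;
}

/* Choose the path by trip count. */
double sum_dispatch(const double *a, size_t n)
{
    return (n > VECTOR_TRIP_THRESHOLD) ? sum_blocked(a, n)
                                       : sum_scalar(a, n);
}
```

Because the blocked path adds elements in a different order, its floating-point result can differ from the scalar one in the last bits for general inputs.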

Example Operations – Operation Costs and Comparisons

for (i=0; i<N; i++) { sum += a[ind[i]]*K + b[i]; }

Instructions Present: Gather and Horizontal Reduction Operations (2x Load, Gather/Load, Horizontal Reduction and Sum)

Vector Cost: 19*ceiling(N/8)+30

Scalar Cost: 5*N

Analysis: Scalar is better for N < 13

for (i=0; i<N; i++) { c[ind[i]] = a[i] / b[i]; }

Instructions Present: Scatter and Division Operations (3x Load, Scatter/Store, Division)

Vector Cost: 38*ceiling(N/8)

Scalar Cost: 19*N

Analysis: Scalar is better only for N = 1; the two costs tie at N = 2, and vector is better for N > 2

for (i=0; i<N; i++) { b[ind[i]] = a[ind[i]]; }

Instructions Present: Gather and Scatter Operations (1x Load, Scatter/Store, Gather/Load)

Vector Cost: 36*ceiling(N/8)

Scalar Cost: 3*N

Analysis: Scalar code is always better, regardless of the iteration count N
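
The three comparisons can be checked mechanically. The sketch below encodes each estimate as a per-iteration cost times N for scalar code, and a per-block cost times ceiling(N/8) plus a fixed overhead for vector code, then reports which is cheaper. For the first loop it treats 5*N as the scalar estimate and 19*ceiling(N/8)+30 as the vector estimate, since that is the reading consistent with its analysis line. The struct, constant, and function names are hypothetical.

```c
#include <assert.h>

#define LANES 8   /* elements handled per vector block in the estimates */

/* Coefficients of one cost estimate (hypothetical layout). */
struct cost_model {
    int scalar_per_iter;   /* scalar cost = scalar_per_iter * N */
    int vector_per_block;  /* vector cost = vector_per_block * ceil(N/8) */
    int vector_overhead;   /*               + vector_overhead */
};

static const struct cost_model GATHER_REDUCE  = { 5, 19, 30 };  /* loop 1 */
static const struct cost_model SCATTER_DIVIDE = { 19, 38, 0 };  /* loop 2 */
static const struct cost_model GATHER_SCATTER = { 3, 36, 0 };   /* loop 3 */

int scalar_cost(const struct cost_model *m, int n)
{
    assert(n >= 0);
    return m->scalar_per_iter * n;
}

int vector_cost(const struct cost_model *m, int n)
{
    assert(n >= 0);
    int blocks = (n + LANES - 1) / LANES;   /* integer ceiling(N/8) */
    return m->vector_per_block * blocks + m->vector_overhead;
}

/* Nonzero when the scalar estimate is strictly lower. */
int scalar_is_cheaper(const struct cost_model *m, int n)
{
    return scalar_cost(m, n) < vector_cost(m, n);
}
```

Under these formulas the first loop costs 65 (scalar) versus 68 (vector) at N = 13 and 70 versus 68 at N = 14, so the crossover sits close to the stated threshold. The ceiling term makes the comparison jump at multiples of 8, which is why such thresholds are best read as approximate.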