The Intel Xeon Phi Knights Landing (KNL) processor supports the 512-bit Advanced Vector Extensions (AVX-512) instruction set, along with several AVX-512 extensions. Of these, only AVX-512F (Foundation) is a base requirement of the AVX-512 ISA.

## Intel KNL Supported AVX-512 Extensions

Extension | Description |
---|---|
AVX-512F | Foundation |
AVX-512CD | Conflict Detection |
AVX-512PF | Prefetch |
AVX-512ER | Exponents and Reciprocals |
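To make the list above concrete, here is a minimal sketch of how a program might check for these four extensions at run time. It assumes a GCC or Clang toolchain (for `<cpuid.h>` and `__get_cpuid_count`) and relies on the feature bits that CPUID leaf 7, sub-leaf 0 reports in EBX. The type `avx512_features` and the functions `decode_avx512` and `query_avx512` are hypothetical names used only for illustration. A complete check would also confirm that the operating system has enabled the 512-bit register state, which this sketch does not do.

```c
#include <stdbool.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>   /* GCC/Clang wrapper around the CPUID instruction */
#endif

/* Feature bits reported in EBX by CPUID leaf 7, sub-leaf 0. */
#define AVX512F_BIT  (1u << 16)
#define AVX512PF_BIT (1u << 26)
#define AVX512ER_BIT (1u << 27)
#define AVX512CD_BIT (1u << 28)

/* Hypothetical helper type: one flag per extension in the table. */
typedef struct {
    bool f;   /* Foundation */
    bool cd;  /* Conflict Detection */
    bool pf;  /* Prefetch */
    bool er;  /* Exponents and Reciprocals */
} avx512_features;

/* Decode a raw EBX value into the four extension flags. */
static avx512_features decode_avx512(unsigned int ebx)
{
    avx512_features out;
    out.f  = (ebx & AVX512F_BIT)  != 0;
    out.cd = (ebx & AVX512CD_BIT) != 0;
    out.pf = (ebx & AVX512PF_BIT) != 0;
    out.er = (ebx & AVX512ER_BIT) != 0;
    return out;
}

/* Query the running CPU. Reports every flag as false on non-x86
 * builds or when CPUID leaf 7 is not available. */
static avx512_features query_avx512(void)
{
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
#if defined(__x86_64__) || defined(__i386__)
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        ebx = 0;
#endif
    (void)eax; (void)ecx; (void)edx;
    return decode_avx512(ebx);
}
```

Keeping the bit decoding separate from the CPUID query means the decoding logic can be exercised on any machine, not only on KNL hardware.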

A detailed understanding of the latency and bandwidth of different instructions and operations is crucial to writing performant code for the Intel Xeon Phi Knights Landing processor.

## Instruction Latency Tables for Vector vs. Scalar Instructions

### Vector Instructions

Instruction | Latency (cycles) | Bandwidth (instructions/cycle) |
---|---|---|
Simple Int | 2 | 2 |
FMA | 6 | 2 |
Mask Ops | 2 | 2 |
X87 / MMX | 6 | 1 |
EMU (AVX-512ER) | 7 | 0.5 |
Shuffle / Permutes (1 src) | 2 | 1 |
Shuffle / Permutes (2 src) | 3 | 0.5 |
Convert – Same Width | 2 | 1 |
Convert – Different Width | 6 | 0.2 |
Vector Loads | 5 | 2 |
Store-to-load forwarding | 2 | 2 |
Gather (8 elems) | 15 | 0.2 |
Gather (16 elems) | 19 | 0.1 |
Float to Int Move | 2 | 1 |
Int to Float Move | 4 | 1 |
DIVSS or SQRTSS | 25 | 0.05 |
DIVSD or SQRTSD | 40 | 0.03 |
Packed DIV or SQRT | 38 | 0.1 |
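One common way to read a table like this (an interpretation, not something the table itself states) is to multiply latency by bandwidth: the product estimates how many independent operations must be in flight to keep the execution units busy. The sketch below assumes Latency is in cycles and Bandwidth is in instructions per cycle; `ops_in_flight` is a hypothetical helper name.

```c
/* Estimate how many independent operations are needed to hide an
 * instruction's latency: ceil(latency * throughput).
 * Assumes latency in cycles and throughput in instructions/cycle. */
static unsigned ops_in_flight(unsigned latency_cycles, double ops_per_cycle)
{
    double product = latency_cycles * ops_per_cycle;
    unsigned whole = (unsigned)product;            /* truncate toward zero */
    return (product > (double)whole) ? whole + 1u : whole;
}
```

With the FMA row (latency 6, bandwidth 2), `ops_in_flight(6, 2.0)` gives 12, which suggests that a loop with a single dependent accumulator would leave most of the FMA capacity idle, while 12 or more independent accumulators could in principle keep it busy.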

### Scalar Instructions

Instruction | Latency (cycles) | Bandwidth (instructions/cycle) |
---|---|---|
Math | 1 | 2 |
Int Multiply | 3 or 5 | 1 |
Store-to-load forwarding | 2 | 1 |
Integer Loads | 4 | 1 |
Integer Division | Varies | 0.05 |

## Scalar Versus Vector Code Performance – Kernel Sizes

Note that vectorized code is not always faster than scalar code: depending on the size of the kernel being vectorized, the higher latency and lower bandwidth of some vector instructions can outweigh the benefit of processing several elements at once. A general guideline is that loops with more than 16 iterations will be faster with vector code than with scalar code.
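The guideline can be illustrated with a small dispatch sketch: short loops take a plain scalar path, and longer loops take a path written so that a vectorizing compiler can use 512-bit registers. Everything here is an assumption for illustration: the function names are hypothetical, the threshold of 16 is simply the guideline's figure and should be tuned by measurement, and the code only makes vectorization possible rather than guaranteeing it.

```c
#include <stddef.h>

/* Trip-count threshold taken from the guideline; tune by measurement. */
#define VECTOR_MIN_TRIP_COUNT 16u

/* Short loops: plain scalar accumulation. */
static float sum_scalar(const float *a, size_t n)
{
    float total = 0.0f;
    for (size_t i = 0; i < n; i++)
        total += a[i];
    return total;
}

/* Long loops: 16 independent partial sums, one per 32-bit lane of a
 * 512-bit register, which a vectorizing compiler may map to vector adds. */
static float sum_lanes(const float *a, size_t n)
{
    float lane[16] = {0.0f};
    size_t i = 0;
    for (; i + 16 <= n; i += 16)
        for (size_t k = 0; k < 16; k++)
            lane[k] += a[i + k];

    float total = 0.0f;
    for (size_t k = 0; k < 16; k++)   /* horizontal reduction */
        total += lane[k];
    for (; i < n; i++)                /* remainder elements */
        total += a[i];
    return total;
}

/* Pick a path based on the trip count. */
static float sum_dispatch(const float *a, size_t n)
{
    return (n > VECTOR_MIN_TRIP_COUNT) ? sum_lanes(a, n) : sum_scalar(a, n);
}
```

Because the lane-wise version adds elements in a different order, its result can differ from the scalar version in the last bits for general floating-point data.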

## Example Operations – Operation Costs and Comparisons

`for (i = 0; i < N; i++) { sum += a[ind[i]*K] + b[i]; }`

Instructions Present: Gather, Horizontal Reduction Operations — 2x Load, Gather/Load, Horizontal Reduction & Sum

Vector Cost: 19*ceiling(N/8)+30

Scalar Cost: 5*N

Analysis: Scalar is better for N < 13

`for (i = 0; i < N; i++) { c[ind[i]] = a[i] / b[i]; }`

Instructions Present: Scatter and Division Operations — 3x Load, Scatter/Store, Division

Vector Cost: 38*ceiling(N/8)

Scalar Cost: 19*N

Analysis: Scalar is better only at N = 1 (19 versus 38); the two costs tie at N = 2, and vector is better for N ≥ 3

`for (i = 0; i < N; i++) { b[ind[i]] = a[ind[i]]; }`

Instructions Present: Gather and Scatter Operations – 1x load, Scatter/Store, Gather/Load

Vector Cost: 36*ceiling(N/8)

Scalar Cost: 3*N

Analysis: Scalar code is always better, regardless of the iteration count N
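The three comparisons can be reproduced with a small cost model. The sketch below is only an illustration: it assumes that the term containing ceiling(N/8) is the vector cost in each example (one vector instruction per 8 elements) and that the term linear in N is the scalar cost. The names `loop_cost`, `ceil_div`, `cost_*`, and `first_vector_win` are hypothetical.

```c
#include <stddef.h>

/* Scalar and vector cost estimates (in cycles) for one loop. */
typedef struct {
    size_t scalar;
    size_t vector;
} loop_cost;

/* Integer ceiling division: ceiling(n / d) for d > 0. */
static size_t ceil_div(size_t n, size_t d)
{
    return (n + d - 1) / d;
}

/* Example 1: gather plus horizontal reduction. */
static loop_cost cost_gather_reduce(size_t n)
{
    loop_cost c = { 5 * n, 19 * ceil_div(n, 8) + 30 };
    return c;
}

/* Example 2: scatter plus division. */
static loop_cost cost_scatter_divide(size_t n)
{
    loop_cost c = { 19 * n, 38 * ceil_div(n, 8) };
    return c;
}

/* Example 3: gather plus scatter. */
static loop_cost cost_gather_scatter(size_t n)
{
    loop_cost c = { 3 * n, 36 * ceil_div(n, 8) };
    return c;
}

/* Smallest N in 1..max_n where the vector cost is strictly lower than
 * the scalar cost, or 0 if the vector version never wins in that range. */
static size_t first_vector_win(loop_cost (*model)(size_t), size_t max_n)
{
    for (size_t n = 1; n <= max_n; n++) {
        loop_cost c = model(n);
        if (c.vector < c.scalar)
            return n;
    }
    return 0;
}
```

Under these assumptions, searching N from 1 to 64 gives a first vector win at N = 14 for the gather and reduction loop and at N = 3 for the scatter and division loop, and no vector win at all for the gather and scatter loop.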