AI Accelerators Vector/Matrix Processing Capability especially for LLM's and SLM's

Before diving into vector processing capabilities of accelerators, just want to discuss most important use case of it and best example of this would be transformer model used in LLM’s. Although most people by now know this backwards and forwards, but I still need to touch upon it.

Most important vector extension is to facilitate matrix multiplication, and it can be best understood by most generic architecture of Transformer used in majority of large language models which utilizes self-attention mechanisms and feed-forward neural networks to process input sequences. The key components of the architecture include:

Embedding Layer: Converts tokens into dense vectors.
Multi-Head Self-Attention Mechanism Computes relationships between all pairs of tokens, enabling the model to capture long-range dependencies.
Feed-Forward Networks (FFNs) Contains a series of matrix multiplication (Matmul) operations and non-linear functions to provide the non-Linearity and feature Transformation.
Layer Normalization Normalizes activations for stability and faster convergence.
Residual Connections adjustments to avoid gradient vanishment during backpropagation.

Key point here is that Self-Attention and Feed-Forward layers are implemented through a series of matrix multiplication operations. In transformer models there is also optimization done using KV cache for inference and that is being used here in this post to illustrate effect of matrix multiplication on inference (training case is possibly obvious), every time a new token is generated, the model’s attention mechanism needs to consider all previous tokens in the sequence. Without KV caching, the model would recompute the key and value representations for all previous tokens at every step, leading to significant computational overhead and latency.

KV cache has two operations which are Prefill and Decode, Prefill part is Matrix-matrix multiplication and highly Compute-bound whereas the Decode phase where Matrix-vector multiplication and it is primarily Memory-bound. prefill phase is dominated by highly parallel matrix-matrix operations to process the entire prompt and populate the KV cache. And here is the use of vector processing capabilities of the accelerator comes into play. The decode phase is dominated by sequential matrix-vector operations, using and updating the KV cache to efficiently generate output tokens one by one which is primarily memory bound and this can also be used to sped up through vector processing capabilities of an accelerator (vector gather and scatter).

To give color to KV cache, during inference after Input text is converted into tokens and Tokens mapped to embeddings, during Prefill phase, model processes the entire input prompt (all tokens in the initial context) in a single, parallelized forward pass. Just to highlight the difference between prefill and decode, note Matrix-Matrix Multiplications needed where embeddings are passed through the transformer’s attention layers (attention score is calculated) and model computes the key (K) and value (V) matrices for every token in the input sequence. These computed K and V matrices for all input tokens are stored in the KV cache during prefill for use in subsequent decoding steps. During Decode phase, model generates output tokens one at a time (autoregressively), using the previously constructed KV cache to avoid redundant calculations. Here Matrix-Vector Multiplications happens for each new output token, the model computes the query (Q), key (K), and value (V) vectors for just that token, and attends over all cached keys and values from previous tokens. This is a matrix-vector operation (as opposed to matrix-matrix in prefill), which is less efficient for parallel hardware but can also be accelerated using vector mem operations. Hope, I have convinced that accelerating matrix-matrix and matrix-vector multiplication is very important even for inference using KV cache example (which is mostly the case).

Matrix multiplication (matmul) operations follow the General Matrix Multiply (GEMM) pattern which is an important benchmark. In contrast, during the decode phase, the Matmul operations follow the General Matrix-Vector Multiply (GEMV) pattern. The shape for the Matmul operation is represented as [BxMxNxK], where: B[optional] is batch size, M is rows in output and finally N are columns in output; K is reduction dimension (intermediate used in mat mul).

Now I will switch to crux of matter which are the vector and matrix extension capabilities present in most processing elements (or accelerators) and for illustration purpose using RISC-V as an example here and focus on matrix multiplication mainly below. Simplistically general matrix multiplication kernel computes an m x n matrix C, which is the product of an m x k for matrix A and k × n for matrix B (C=AB). Most libraries use Tiled version of this matrix multiplication shown below:

for (i = 0; i < m; i += tile_m) {
 for (j=0; j<n; j += tile_n) 
 {
    for (s=0; s<k; s += tile_k) 
    {
        c_tile [i,j] += a_tile [i,s] * b_tile [s, j];
    }
 }
                                }

Using Risc-V matrix extensions, above code would like as follows:

void matmul_float16(c, a, b, m, k, n) 
{
 //set element type to 16b
    msettype(e16);
 for (i=0; i<m; i+=tile_m)
      {
        //msettile_n is Configuration-Setting Instructions
         tile_m = msettile_m(m-i);
           for (j=0; j<n; j+=tile_n) 
              {
        //for tile matrix multiplication for M, K, N tiles
               tile_n = msettile_n(n-j);
        //clear the acc register using mfemul
               acc = mfemul_mf(acc, 0.f)
               for (s=0; s<k; s+=tile_k)
                 {
                   tile_k = msettile_k(k-s);
                // loop at dim m with tiling
                // loop at dim n with tiling
                // clear acc reg
                // loop at dim k with tiling

               // load left matrix a 
                   tr1 = mlae16_m(&a[i][s], k2); 
                // load right matrix b
                   tr2 = mlbe16_m(&b[s][j], n2); 

               //finally tiled matrix multiply tr1 and tr2
               // double widen output acc
                   acc = mfwma_mm(tr1, tr2)
                 }//end for loop w/index s

      // convert widen result
       acc = mfncvt_f_fw_m(acc);
     // store to matrix c
       msce16_m(acc, &c[i][j], n*2);
              }//end second for loop w/j index
    }//end last for loop w/i index
}

Above code snippet is tiled version and then mapped into cache. What is important in above snippet the number of cycles spent to complete and that involves lot of pre-planning in code before above macro can be called which is not necessarily discussed and it requires sophisticated prefetching.

Two key techniques apart from other used are Register Tiling and Cache Tiling to speed up these matrix-matrix and matrix-vector operations.

Register Tiling: This focuses on maximizing the utilization of the accelerators vector registers during matrix multiplication. It is required to fit data into vector registers, register tiling helps minimize memory access and maximize computational throughput. Note register access is the fastest memory hierarchy in any processing element thus finding elements to multiply in vector registers is a win for performance.

Cache Tiling: Optimizes matrix multiplication by improving data locality in caches, ensuring that data remains in the cache hierarchy for as long as possible for above tiled multiplication to complete. Efficient cache tiling reduces memory latency and improves the performance of the Matmul operation by minimizing cache misses. Preventing cache misses requires Pre-fetching and that is where latency hiding code is needed.

It is possible to not have Matrix Extensions and only vector extensions, in such case it is to be noted that most architectures define VLEN (vector length configurable or architecture agnostic in other words), in such cases the vector utilization can have effect on Matmul operation in case it is implemented using an outer product approach.

In such case, output register tile size is [m, n] like before, where m and n represent the number of elements in the output rows and columns, respectively. Implementing without matrix extension the outer product with RV-V requires m vector grouped registers for output accumulators and 1 additional vector register to load data from the RHS matrix, thus total (m+1) vectors and each grouped element has length of n, thus size = (m+1) * n bits.

In Vector Register Grouping, Multiple vector registers can be grouped together to form a vector register group for RISC-V, so that a single vector instruction can operate on multiple vector registers. Vector register groups allow execution efficiency for longer application vectors.

The number of vector registers in a group here called LMUL is an integer power of two set by another configurable field vlmul field in vtype for RISC-V. For example, (LMUL = 2 ^(vlmul[1:0])), VLMAX = LMUL*VLEN/SEW (std. elem. Width) represents the maximum number of elements that can be operated on with a single vector instruction given the current SEW and LMUL settings. Now let us see the effect on vector utilization with choice of VLEN, elem_size and LMUL. Number of elements which can be loaded in grouped registers are VLEN/elem_size (e.g: 1024b/32b = 32). For example group size is 8 registers and VLEN= 1024 for elem_size=32b, this means for num_elements = (VLEN=1024)/(Elem_Size=32b) = 32 And number of elements accessed in the this operation= 9*(LMUL=1)*32b/(Elem_Size=32b) = 9, it gives utilization merely for this VLEN = 9/32(max elem possible)=28% of grouped vector registers in this specific example of Matmul operation given VLEN, Elem_Size and number of vector register grouped together using LMUL.

Above example illustrates the sub-optimal selections for MATMUL operation using vector extension, one may try LMUL=4, and m=7 that makes (m+1) = 8 and number of columns=LMUL* elem_size=4*32b then m+1 * n this implies number of elements accessed = 8(note m=7)*128b/32b = 32 and we know as before VLEN=1024 with elem_size=32b can support 32 elements, thus this make vector utilization reaches=100%. These examples just illustrate the flexibility in improving performance developing using configurable parameters as noted above and that improves the utilization of grouped vectors registers immensely.