Frameworks and kernels for performance and reliability aspects of AI servers in Training and Inference workloads

From a server perspective, the prime system-level performance primitives are compute (MAC/FMAC) throughput; cache and main memory bandwidth and latency; storage I/O performance such as IOPS; and network scalability, bandwidth, and hop latency. These need to be evaluated with dedicated benchmarks to judge a server's suitability for AI workloads. Each primitive's requirements are further categorized and prioritized by use case. For AI training and inference, LLMs generally need data sharding, model sharding, and tensor-level compute pipelining, and key neural network operations such as softmax are accelerated using all or some combination of the four system-level primitives highlighted above. More broadly, most machine learning algorithms, including LLMs, need acceleration of low-level parallel patterns such as matrix multiplication, convolution, scatter, gather, reduction, and masking/filter operations, plus a few more depending on the use case. These are realized as a combination of hardware and/or software components behind the AI framework library primitives that the server accesses. Security and reliability needs are not necessarily covered by these primitives. That is the motivation for Securebits: to deliver on these two important attributes (which could be broken down further) without compromising performance.

In this context it is important to outline the frameworks available to evaluate the performance primitives of AI workloads. This post compares five of them: JAX (uses XLA, Accelerated Linear Algebra, along with JIT compilation and composable transformations, and leans toward functional programming), TensorFlow (static and dynamic computational graphs with the Keras APIs, XLA compilation, cuBLAS and cuDNN backends, production grade), PyTorch (popular, dynamic computation graph, Pythonic, with GPU and CPU support), the Nvidia native stack (CUDA, cuBLAS, NCCL, cuDNN, TensorRT, and DALI, with Python and C++ interfaces and hardware-tuned libraries), and finally ROCm (primarily for AMD GPUs, with HIP, which is CUDA-like, and Python and C++ interface support). These five frameworks are summarized in the following table:

Framework	Nvidia Native Stack	TensorFlow (Google)	PyTorch (Meta)	JAX (Google)	ROCm (AMD)
Kernel/Lib	cuBLAS, cuDNN, NCCL, fused kernels; hardware centric	XLA; can call Nvidia GPU libraries such as cuDNN/cuBLAS	Custom CUDA kernels, cuDNN, cuBLAS; Nvidia centric	XLA; supports cuBLAS/cuDNN via XLA	rocBLAS, MIOpen; still evolving
Inference SDK	TensorRT, Triton	TensorFlow Serving/Lite/JS	TorchServe, ONNX, TensorRT	N/A	MIOpen, ONNX, Triton
Acceleration Support	Nvidia GPU (native and all other frameworks)	TPU, Nvidia GPU, CPU	CPU, Nvidia GPU, AMD ROCm GPU	TPU, Nvidia GPU, CPU	AMD GPU
Programming Model	CUDA, cuBLAS, NCCL, cuDNN, TensorRT, DALI	Static and dynamic graphs, Keras	Dynamic computational graph	Functional, JIT w/XLA	HIP (CUDA-like) and other open source
Main Language	C++/Python	Python (w/C++ backend)	Python (w/C++ backend)	Python	Python (C++ via HIP)
Known Performance/Scaling	Depends on hardware-tuned kernels and libraries	Highly optimized for TPU and supported GPUs	Good on GPUs	Scales on TPU/GPU; good w/XLA; Python centric	AMD GPU centric
Known Deployments	Cloud and edge ready, TensorRT, NGC Containers	Full stack, TFX, TF Lite/Serving	ONNX, TorchServe, TensorRT	Flexible but not mature	ONNX, Triton, ROCm

Communication primitives deserve further attention, since communication can be overlapped with computation and the primitives are topology sensitive. NCCL, for example, supports the following collectives:

AllReduce: aggregates values from all GPUs with a reduction such as SUM and distributes the result back to all GPUs.

Broadcast: sends data from one GPU to all others.

AllGather: gathers data from all GPUs and concatenates the results on every GPU.

ReduceScatter: performs a reduction like AllReduce, but each GPU receives only its own chunk of the reduced result.

AlltoAll: each GPU sends a distinct chunk of data to every other GPU.
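
To make the data movement of these collectives concrete, here is a minimal single-process sketch in plain Python. It is not NCCL code: each "GPU" is just a Python list, the function names are hypothetical, and the reduction is assumed to be SUM. It also assumes every buffer has the same length and that the length is divisible by the number of ranks.

```python
# Single-process simulation of collective semantics (not NCCL itself).
# bufs[r] is the buffer held by rank r; each function returns the buffers
# every rank would hold after the collective completes.

def all_reduce(bufs):
    """Element-wise SUM across ranks; every rank gets the full result."""
    total = [sum(vals) for vals in zip(*bufs)]
    return [list(total) for _ in bufs]

def broadcast(bufs, root=0):
    """Every rank receives a copy of the root rank's buffer."""
    return [list(bufs[root]) for _ in bufs]

def all_gather(bufs):
    """Concatenate all buffers in rank order; every rank gets the result."""
    gathered = [x for buf in bufs for x in buf]
    return [list(gathered) for _ in bufs]

def reduce_scatter(bufs):
    """Element-wise SUM, then rank r keeps only chunk r of the result."""
    n = len(bufs)
    total = [sum(vals) for vals in zip(*bufs)]
    chunk = len(total) // n  # assumes length divisible by rank count
    return [total[r * chunk:(r + 1) * chunk] for r in range(n)]

def all_to_all(bufs):
    """Rank dst receives chunk dst from every source rank, in rank order."""
    n = len(bufs)
    chunk = len(bufs[0]) // n  # assumes length divisible by rank count
    return [
        [x for src in range(n) for x in bufs[src][dst * chunk:(dst + 1) * chunk]]
        for dst in range(n)
    ]
```

One property this sketch makes easy to see: a ReduceScatter followed by an AllGather yields the same buffers as a single AllReduce.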

As mentioned earlier, these primitives play an important role in sharding data sets, models, and tensors, and they also help provide expert parallelism across GPUs or other acceleration hardware.
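
As one illustration of how a collective supports tensor sharding, the sketch below splits a vector-matrix product along the inner dimension across hypothetical devices. Each device computes a partial product from its shard, and a SUM reduction (the AllReduce step) recovers the full result. The function names and the even split are assumptions made for the example; a real framework would run the shards on separate accelerators.

```python
# Sketch: inner-dimension sharding of y = x @ w, combined with a SUM reduction.
# Devices are simulated by a loop; nothing here runs on real accelerators.

def matvec(x, w):
    """Return x @ w for a row vector x (length k) and a k-by-n matrix w."""
    n = len(w[0])
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(n)]

def sharded_matvec(x, w, num_devices):
    """Split x and the rows of w across devices, then sum the partial results."""
    k = len(x)
    shard = k // num_devices  # assumes k is divisible by num_devices
    partials = []
    for d in range(num_devices):
        lo, hi = d * shard, (d + 1) * shard
        # Each device only needs its slice of x and the matching rows of w.
        partials.append(matvec(x[lo:hi], w[lo:hi]))
    # The AllReduce (SUM) step: combine partial outputs element-wise.
    return [sum(vals) for vals in zip(*partials)]
```

The sharded result matches the unsharded product, which is what lets a model's weight matrix be spread over several devices at the cost of one reduction per layer.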

These primitives are supported for multi-threaded and multi-process AI workloads using CUDA, and the library also encompasses profiling and observability tools for debugging and tuning. Uber's Horovod library is heavily influenced by NCCL in this regard. A key question that needs to be addressed is how to map these primitives onto the underlying hardware: either through dedicated fixed-function acceleration units or through vector ISAs (which are generally not very performance friendly for communication primitives, since the data has to traverse a logical hierarchy within a system-on-chip type ASIC). Other similar libraries besides NCCL include Gloo (Meta); BytePS (ByteDance); UCX (Unified Communication X), which can support CPUs and GPUs using CUDA and InfiniBand; and finally MPI (Open MPI, MVAPICH), which is very generic and mature and can support heterogeneous hardware, but may lead to higher latency.

Besides NCCL, which is meant for communication, cuDNN is Nvidia's CUDA-based deep neural network library. It is highly tuned on Nvidia hardware for convolution, activation functions, normalization, and RNNs in general, and its primitives can provide gains in dot-product-heavy operations and in the forward and backward passes of neural networks. cuBLAS is another important library; it supports basic linear algebra operations on dense vectors and matrices. The library has three levels. Level 1 covers vector operations: dot products, vector additions, norm calculations, and vector swaps. Level 2 covers matrix-vector operations: GEMV (general matrix-vector multiplication) and SYMV/HEMV (symmetric/Hermitian matrix-vector products), which also contribute to ML/LLM workloads. Level 3 is the most important: it covers GEMM (matrix-matrix multiplication, C = alpha*A*B + beta*C), the SYMM/HEMM matrix multiplies, and the Hermitian rank-k update. Besides these, cuBLAS contains data movement functions as well. Nvidia also has the cuTENSOR library, which supports high-performance tensor operations, enabling the efficient tensor contractions that multi-dimensional tensor computations in LLMs depend on.
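
The three BLAS levels can be written out as reference semantics in a few lines of plain Python. This is only a sketch of what the operations compute, assuming row-major lists of lists and no transposition flags; it does not call cuBLAS, and the function signatures are simplified for illustration.

```python
# Reference semantics for one operation from each BLAS level (pure Python).

def dot(x, y):
    """Level 1: dot product of two vectors."""
    return sum(a * b for a, b in zip(x, y))

def gemv(alpha, A, x, beta, y):
    """Level 2: returns alpha * (A @ x) + beta * y."""
    return [alpha * dot(row, x) + beta * yi for row, yi in zip(A, y)]

def gemm(alpha, A, B, beta, C):
    """Level 3: returns alpha * (A @ B) + beta * C."""
    cols_b = list(zip(*B))  # columns of B, so each entry is a row-by-column dot
    return [
        [alpha * dot(row, col) + beta * C[i][j] for j, col in enumerate(cols_b)]
        for i, row in enumerate(A)
    ]
```

For example, with A = [[1, 2], [3, 4]], B = [[5, 6], [7, 8]], and C filled with ones, gemm(2, A, B, 3, C) gives [[41, 47], [89, 103]], since A*B is [[19, 22], [43, 50]].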

Alternatives to cuBLAS include rocBLAS (AMD GPU centric), OpenBLAS (open source, CPU centric), and oneMKL (Intel). Besides these, MAGMA (Matrix Algebra on GPU and Multicore Architectures) provides hybrid CPU and GPU routines. Most of these libraries support data types like Int, Float, and Double, and a few support Complex data types. CUTLASS, Nvidia's CUDA C++ template library for GEMM, is functionally closest to cuBLAS and is GPU native.

Nvidia has also released cuSPARSE, which can accelerate models that exploit sparsity in weights and gradients.
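
To show why sparsity helps, the sketch below stores a matrix in compressed sparse row (CSR) form, a common sparse layout, and multiplies it by a vector while touching only the nonzero entries. It is plain Python with hypothetical helper names, not the cuSPARSE API.

```python
# Sketch: CSR storage and sparse matrix-vector multiply (pure Python).

def dense_to_csr(matrix):
    """Convert a dense row-major matrix to (values, col_idx, row_ptr)."""
    values, col_idx, row_ptr = [], [], [0]
    for row in matrix:
        for j, v in enumerate(row):
            if v != 0:
                values.append(v)
                col_idx.append(j)
        # row_ptr[r + 1] marks where row r's nonzeros end in `values`.
        row_ptr.append(len(values))
    return values, col_idx, row_ptr

def csr_matvec(values, col_idx, row_ptr, x):
    """Compute y = M @ x, visiting only the stored nonzeros of M."""
    y = []
    for r in range(len(row_ptr) - 1):
        start, end = row_ptr[r], row_ptr[r + 1]
        y.append(sum(values[k] * x[col_idx[k]] for k in range(start, end)))
    return y
```

The work done by csr_matvec is proportional to the number of nonzeros rather than to the full matrix size, which is the saving that sparse weights and gradients make available.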

On a closing note, the TensorRT and TensorRT-LLM frameworks from Nvidia are optimized inference engines, notably making enhanced use of the KV cache when calculating attention during inference.
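
The idea behind a KV cache can be sketched in plain Python for a single attention head. This is a conceptual illustration only, not how TensorRT or TensorRT-LLM implement it; the class and function names are hypothetical. At each decoding step the new token's key and value are appended to the cache, so earlier keys and values are reused rather than recomputed, and attention is a softmax-weighted sum over the cached values.

```python
import math

class KVCache:
    """Holds the key and value vectors of all tokens seen so far."""
    def __init__(self):
        self.keys = []
        self.values = []

    def append(self, key, value):
        self.keys.append(key)
        self.values.append(value)

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, cache):
    """Scaled dot-product attention of one query over all cached tokens."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in cache.keys]
    weights = softmax(scores)
    dim = len(cache.values[0])
    return [sum(w * v[j] for w, v in zip(weights, cache.values)) for j in range(dim)]

def decode_step(query, key, value, cache):
    """Append the new token's key/value, then attend over the whole cache."""
    cache.append(key, value)
    return attend(query, cache)
```

Each call to decode_step adds one key/value pair and computes one attention output, so per-token work grows with the cache length instead of requiring the keys and values of the whole sequence to be recomputed.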