Frameworks and kernels for performance and reliability aspects of AI servers in Training and Inference workloads

In this context it is useful to outline the frameworks available for evaluating the performance primitives of AI workloads. This post compares five of them. JAX uses XLA (Accelerated Linear Algebra) together with JIT compilation and composable transformations, and is geared toward functional programming. TensorFlow is production grade and supports static and dynamic computational graphs, with Keras APIs and compilation through XLA, cuBLAS and cuDNN. PyTorch is popular and Pythonic, builds its computation graph dynamically, and supports both GPU and CPU. The Nvidia native stack consists of hardware-tuned libraries (CUDA, cuBLAS, NCCL, cuDNN, TensorRT and DALI) with Python and C++ interfaces. Finally, ROCm primarily targets AMD GPUs through HIP, which is CUDA-like, and also offers Python and C++ interfaces. These five frameworks can be summarized in the following table:

| Framework | Nvidia Native Stack | TensorFlow (Google) | PyTorch (Meta) | JAX (Google) | ROCm (AMD) |
| --- | --- | --- | --- | --- | --- |
| Kernel/Lib | cuBLAS/cuDNN, NCCL, fused kernels; hardware centric | XLA; can call cuDNN/cuBLAS and other Nvidia GPU libs | Custom CUDA, cuDNN, cuBLAS; Nvidia centric | XLA; supports cuBLAS/cuDNN via XLA | rocBLAS, MIOpen; still evolving |
| Inference SDK | TensorRT, Triton | TensorFlow Serving/Lite/JS | TorchServe, ONNX, TensorRT | N/A | MIOpen, ONNX, Triton |
| Acceleration Support | NVIDIA GPU (native and all other frameworks) | TPU, NVIDIA GPU, CPU | CPU, NVIDIA GPU, AMD ROCm GPU | TPU, NVIDIA GPU and CPU | AMD GPU |
| Programming Model | CUDA, cuBLAS, NCCL, cuDNN, TensorRT, DALI | Static and dynamic graphs, Keras | Dynamic computational graph | Functional, JIT w/XLA | HIP (CUDA-like) and other open source |
| Main Language | C++/Python | Python (w/C++ backend) | Python (w/C++ backend) | Python | Python (C++ via HIP) |
| Known Performance/Scaling | Depends on hardware-tuned kernels and libraries | Highly optimized for TPU and supported GPUs | Good on GPUs | Scales on TPU/GPU; good w/XLA and Python centric | AMD GPU centric |
| Known Deployments | Cloud and edge ready, TensorRT, NGC containers | Full stack, TFX, TF Lite/Serving | ONNX, TorchServe, TensorRT | Flexible but not mature | ONNX, Triton and ROCm |

Communication primitives deserve further attention, because communication can be overlapped with computation and the primitives are sensitive to topology. NCCL, for example, supports the following collectives:

AllReduce: aggregates values from all GPUs by performing a reduction such as SUM and distributes the result back to all GPUs.

Broadcast: sends data from one GPU to all other GPUs.

AllGather: gathers data from all GPUs and concatenates the results on all GPUs.

ReduceScatter: performs a reduction like AllReduce, but scatters the result so that each GPU receives only its own chunk of it.

AlltoAll: each GPU sends a distinct chunk of data to every other GPU.
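
The semantics of the collectives listed above can be illustrated with a small pure-Python model. This is only a sketch of what each collective computes: every "GPU" is simulated as an entry in a Python list, the function names are hypothetical, and nothing here reflects NCCL's actual API, its algorithms, or its performance.

```python
# Pure-Python model of the five collectives. Each rank (simulated GPU) is one
# entry in the outer list. Function names are illustrative, not NCCL calls.

def all_reduce(buffers):
    """Every rank ends with the element-wise sum of all ranks' buffers."""
    total = [sum(vals) for vals in zip(*buffers)]
    return [list(total) for _ in buffers]

def broadcast(buffers, root=0):
    """Every rank ends with a copy of the root rank's buffer."""
    return [list(buffers[root]) for _ in buffers]

def all_gather(buffers):
    """Every rank ends with the concatenation of all ranks' buffers."""
    gathered = [x for buf in buffers for x in buf]
    return [list(gathered) for _ in buffers]

def reduce_scatter(buffers):
    """Sum element-wise, then rank r keeps only chunk r of the result."""
    n = len(buffers)
    total = [sum(vals) for vals in zip(*buffers)]
    chunk = len(total) // n  # assumes the length is divisible by the rank count
    return [total[r * chunk:(r + 1) * chunk] for r in range(n)]

def all_to_all(buffers):
    """Rank `src` sends its chunk `dst` to rank `dst`."""
    n = len(buffers)
    chunk = len(buffers[0]) // n  # assumes the length is divisible by the rank count
    return [
        [x for src in range(n) for x in buffers[src][dst * chunk:(dst + 1) * chunk]]
        for dst in range(n)
    ]

ranks = [[1, 2], [10, 20]]          # two simulated GPUs, two elements each
print(all_reduce(ranks))            # [[11, 22], [11, 22]]
print(broadcast(ranks, root=1))     # [[10, 20], [10, 20]]
print(all_gather(ranks))            # [[1, 2, 10, 20], [1, 2, 10, 20]]
print(reduce_scatter(ranks))        # [[11], [22]]
print(all_to_all(ranks))            # [[1, 10], [2, 20]]
```

In this model, an AllReduce gives the same result as a ReduceScatter followed by an AllGather, which is one way to see how the primitives relate to each other.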

As mentioned earlier, these primitives play an important role in sharding data sets, models and tensors, and they also help provide expert-level parallelism across GPUs or other acceleration hardware.
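
As one illustration of sharding a data set, the sketch below simulates data parallelism: each rank computes a gradient on its own shard and an AllReduce averages the gradients so that every rank applies the same update. The toy model (y ≈ w·x with a mean-squared-error loss), the helper names and the learning rate are all assumptions chosen for brevity. The ranks are simulated in plain Python rather than run on real GPUs.

```python
# Simulated data parallelism: one gradient per shard, then an averaging
# AllReduce. All names and the toy model are illustrative assumptions.

def local_gradient(w, shard):
    """Gradient of the mean squared error for y ~ w * x on one rank's shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def all_reduce_mean(values):
    """Simulated AllReduce(SUM) followed by division by the rank count."""
    mean = sum(values) / len(values)
    return [mean for _ in values]

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]  # samples of y = 2x
shards = [data[0:2], data[2:4]]   # equal-sized shards, one per simulated GPU
w = 0.0

grads = [local_gradient(w, shard) for shard in shards]   # computed independently
synced = all_reduce_mean(grads)                          # identical on every rank

# With equal-sized shards, the averaged gradient matches the full-batch one.
full_batch = local_gradient(w, data)
print(grads, synced, full_batch)

learning_rate = 0.01              # hypothetical value
w_next = w - learning_rate * synced[0]
print(w_next)
```

The equality with the full-batch gradient relies on the shards having equal size; with uneven shards, the average would need to be weighted by shard length.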

NCCL supports these primitives for multi-threaded and multi-process AI workloads built on CUDA, and it also encompasses profiling and observability tools for debugging and tuning. Uber's Horovod library is very much influenced by NCCL in this regard. A key question that needs to be addressed is how to map these primitives onto the underlying hardware: either through dedicated fixed-function accelerator units or through vector ISAs. The latter are generally not very performance friendly for communication primitives, since the data has to traverse a logical hierarchy within a system-on-chip type ASIC. Similar libraries besides NCCL include Gloo (Meta), BytePS (ByteDance) and UCX (Unified Communication X), which can support CPU and GPU using CUDA and InfiniBand. Finally, MPI (Open MPI, MVAPICH) is generic and very mature and can support heterogeneous hardware, but may lead to higher latency.

While NCCL is meant for communication, cuDNN is Nvidia's CUDA-based deep neural network library. It is highly tuned on Nvidia hardware for convolution, activation functions, normalization and RNNs in general. cuDNN primitives can speed up dot products and the forward and backward passes of neural networks.

cuBLAS is another important library; it supports basic linear algebra operations, including dense matrix multiplication, and is organized in three levels. Level 1 covers vector operations and accelerates dot products, vector additions, norm calculations and vector swaps. Level 2 covers matrix-vector operations that contribute to ML/LLM workloads: GEMV (matrix-vector multiplication) and SYMV/HEMV (symmetric/Hermitian matrix-vector product). Level 3 is the most important: it provides GEMM (matrix-matrix multiplication, C = alpha*A*B + beta*C), the SYMM/HEMM matrix multiplies and the Hermitian rank-k update. cuBLAS also contains data movement functions.

Nvidia also offers the cuTENSOR library for high-performance tensor operations, enabling the efficient tensor contractions that multi-dimensional tensor computations in LLMs depend on.
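
To make the three BLAS levels concrete, here is a pure-Python reference of what one routine from each level computes. It follows the standard BLAS naming (dot, axpy, gemv, gemm) but is only a sketch of the mathematical definitions. It is not the cuBLAS API and says nothing about how the GPU kernels are implemented.

```python
# Reference semantics for one routine per BLAS level (plain Python lists).
# These mirror the mathematical definitions only, not cuBLAS call signatures.

def dot(x, y):
    """Level 1: inner product of two vectors."""
    return sum(a * b for a, b in zip(x, y))

def axpy(alpha, x, y):
    """Level 1: scaled vector addition, alpha*x + y."""
    return [alpha * a + b for a, b in zip(x, y)]

def gemv(alpha, A, x, beta, y):
    """Level 2: matrix-vector product, alpha*A*x + beta*y."""
    return [alpha * dot(row, x) + beta * yi for row, yi in zip(A, y)]

def gemm(alpha, A, B, beta, C):
    """Level 3: matrix-matrix product, alpha*A*B + beta*C."""
    cols_B = list(zip(*B))  # columns of B, so each entry is a row-column dot
    return [
        [alpha * dot(row, col) + beta * c for col, c in zip(cols_B, c_row)]
        for row, c_row in zip(A, C)
    ]

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C = [[1, 1], [1, 1]]

print(dot([1, 2, 3], [4, 5, 6]))        # 32
print(axpy(2, [1, 2], [10, 20]))        # [12, 24]
print(gemv(1, A, [1, 1], 0, [0, 0]))    # [3, 7]
print(gemm(2, A, B, 1, C))              # [[39, 45], [87, 101]]
```

Each level does more arithmetic per element loaded than the one before it (vector-vector, matrix-vector, matrix-matrix), which is one reason Level 3 routines such as GEMM tend to use the hardware most efficiently.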

Alternatives to cuBLAS include rocBLAS (AMD GPU centric), OpenBLAS (CPU centric, open source) and oneMKL (Intel). In addition, MAGMA (Matrix Algebra on GPU and Multicore Architectures) provides hybrid CPU and GPU routines. Most of these libraries support data types such as Int, Float and Double, and a few support complex data types. CUTLASS, Nvidia's CUDA C++ template library, is functionally the closest to cuBLAS and is GPU-native.

Nvidia has also released cuSPARSE, which can accelerate models that exploit sparsity in weights and gradients.
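
To show why sparsity helps, the sketch below stores a mostly-zero weight matrix in compressed sparse row (CSR) form and multiplies it by a vector while touching only the nonzero entries. CSR is one common sparse storage format; the helper names are hypothetical, and the code illustrates the idea in plain Python rather than calling cuSPARSE.

```python
# Sparse matrix-vector product in CSR form (plain Python, illustrative only).

def dense_to_csr(matrix):
    """Convert a dense row-major matrix to (values, col_idx, row_ptr)."""
    values, col_idx, row_ptr = [], [], [0]
    for row in matrix:
        for j, v in enumerate(row):
            if v != 0:
                values.append(v)      # nonzero value
                col_idx.append(j)     # its column index
        row_ptr.append(len(values))   # where the next row starts in `values`
    return values, col_idx, row_ptr

def csr_matvec(values, col_idx, row_ptr, x):
    """Compute y = W @ x using only the stored nonzeros of W."""
    y = []
    for r in range(len(row_ptr) - 1):
        start, end = row_ptr[r], row_ptr[r + 1]
        y.append(sum(values[k] * x[col_idx[k]] for k in range(start, end)))
    return y

W = [[0, 2, 0],
     [0, 0, 0],
     [3, 0, 4]]

values, col_idx, row_ptr = dense_to_csr(W)
print(values, col_idx, row_ptr)   # [2, 3, 4] [1, 0, 2] [0, 1, 1, 3]
print(csr_matvec(values, col_idx, row_ptr, [1, 1, 1]))   # [2, 0, 7]
```

Here the product needs three multiplications instead of the nine a dense routine would perform, and the saving grows with the fraction of zeros.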

As a closing note, TensorRT and TensorRT-LLM from Nvidia are optimized inference engines that, in particular, make enhanced use of the KV cache when calculating attention during inference.
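
The general idea behind a KV cache can be sketched in a few lines: during token-by-token decoding, the keys and values of earlier tokens are stored once and reused, so each step only adds one new key/value pair. The example below is a generic single-head illustration in plain Python with made-up toy vectors and hypothetical names. It is not how TensorRT or TensorRT-LLM implement their caches.

```python
import math

def attend(q, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    scale = math.sqrt(len(q))
    scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
    peak = max(scores)                           # subtract the max for stability
    weights = [math.exp(s - peak) for s in scores]
    norm = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / norm for i in range(dim)]

class KVCache:
    """Stores keys/values of past tokens so each decode step adds one pair."""

    def __init__(self):
        self.keys, self.values = [], []

    def step(self, q, k, v):
        self.keys.append(k)
        self.values.append(v)
        return attend(q, self.keys, self.values)

# Toy per-token (query, key, value) triples; real models derive these from
# learned projections of the token embeddings.
tokens = [
    ([1.0, 0.0], [1.0, 0.0], [1.0, 2.0]),
    ([0.0, 1.0], [0.0, 1.0], [3.0, 4.0]),
    ([1.0, 1.0], [1.0, 1.0], [5.0, 6.0]),
]

cache = KVCache()
cached_out = [cache.step(q, k, v) for q, k, v in tokens]

# Without a cache, step t would rebuild the keys/values for tokens 0..t.
recomputed = [
    attend(tokens[t][0],
           [k for _, k, _ in tokens[:t + 1]],
           [v for _, _, v in tokens[:t + 1]])
    for t in range(len(tokens))
]
print(cached_out == recomputed)   # True
```

The cached and recomputed outputs match, which is the point: the cache changes how much work each step repeats, not the result.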