From a server perspective, the prime system-level performance primitives are compute (MAC/FMAC) throughput, cache and main-memory bandwidth and latency, storage I/O performance such as IOPS, and network scalability, bandwidth and hop latency. Each needs to be evaluated with dedicated benchmarks to determine whether a server is suitable for AI workloads, and each primitive's requirements are further categorized and prioritized by use case. For AI training and inference, LLMs generally need data sharding, model sharding and tensor-level compute pipelining, among other techniques, and key network operations such as softmax are accelerated using all or some combination of the four system-level primitives above. More broadly, most machine learning algorithms, including LLMs, need acceleration of low-level parallel patterns: matrix multiplication, convolution, scatter, gather, reduction, masking/filtering and a few more depending on the use case. These are realized as a combination of hardware and software components, exposed to the server as primitives of the AI framework libraries. Security and reliability needs are not necessarily covered by these primitives. That is the motivation for Securebits: to deliver on these two important attributes (each of which could be broken down further) without compromising performance.
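The low-level parallel patterns named above can be illustrated with a minimal sketch in plain Python. This only shows the semantics of each pattern on small lists; the function names are hypothetical, chosen for illustration, and a real framework would run these as tuned hardware kernels rather than Python loops.

```python
import math

def gather(src, idx):
    """Gather: read src at the given indices."""
    return [src[i] for i in idx]

def scatter(vals, idx, size, fill=0):
    """Scatter: write each value to its target index in a new buffer."""
    out = [fill] * size
    for v, i in zip(vals, idx):
        out[i] = v
    return out

def reduce_sum(xs):
    """Reduction: combine all elements into a single value."""
    total = 0
    for x in xs:
        total += x
    return total

def mask_filter(xs, mask):
    """Masking/filter: keep only elements whose mask entry is true."""
    return [x for x, keep in zip(xs, mask) if keep]

def softmax(xs):
    """Softmax as a composition: max reduction, element-wise map, sum reduction."""
    m = max(xs)                          # reduction (max), for numerical stability
    exps = [math.exp(x - m) for x in xs]  # element-wise map
    s = reduce_sum(exps)                 # reduction (sum)
    return [e / s for e in exps]

print(gather([10, 20, 30, 40], [3, 0]))
print(scatter([7, 8], [2, 0], size=4))
print(mask_filter([1, 2, 3, 4], [True, False, True, False]))
print(softmax([1.0, 2.0, 3.0]))
```

Softmax is a useful example because it is built entirely from the other patterns, which is why its speed depends on how well reductions are accelerated.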
In this context it is important to outline the frameworks available for evaluating the performance primitives of AI workloads. This post compares five of them: JAX (built on XLA, accelerated linear algebra, with JIT compilation and composable transformations, and geared toward functional programming); TensorFlow (static and dynamic computational graphs with the Keras APIs, XLA compilation and cuBLAS/cuDNN back ends, production grade); PyTorch (popular, dynamic computational graph, Pythonic, with GPU and CPU support); the Nvidia native stack (CUDA, cuBLAS, NCCL, cuDNN, TensorRT and DALI, with Python and C++ interfaces and hardware-tuned libraries); and ROCm (primarily for AMD GPUs, with HIP, which is CUDA-like, and Python and C++ interface support). The five frameworks are summarized in the following table:

| Framework | Nvidia Native Stack | TensorFlow (Google) | PyTorch (Meta) | JAX (Google) | ROCm (AMD) |
| --- | --- | --- | --- | --- | --- |
| Kernel/Lib | cuBLAS, cuDNN, NCCL, fused kernels; hardware centric | XLA; can call cuDNN/cuBLAS and other Nvidia GPU libraries | Custom CUDA kernels, cuDNN, cuBLAS; Nvidia centric | XLA; supports cuBLAS/cuDNN via XLA | rocBLAS, MIOpen; still evolving |
| Inference SDK | TensorRT, Triton | TensorFlow Serving/Lite/JS | TorchServe, ONNX, TensorRT | N/A | MIOpen, ONNX, Triton |
| Acceleration Support | Nvidia GPU (native and via the other frameworks) | TPU, Nvidia GPU, CPU | CPU, Nvidia GPU, AMD ROCm GPU | TPU, Nvidia GPU, CPU | AMD GPU |
| Programming Model | CUDA, cuBLAS, NCCL, cuDNN, TensorRT, DALI | Static and dynamic graphs, Keras | Dynamic computational graph | Functional, JIT with XLA | HIP (CUDA-like) and other open source |
| Main Language | C++/Python | Python (with C++ backend) | Python (with C++ backend) | Python | Python (C++ via HIP) |
| Known Performance/Scaling | Depends on hardware-tuned kernels and libraries | Highly optimized for TPU and supported GPUs | Good on GPUs | Scales on TPU/GPU; good with XLA; Python centric | AMD GPU centric |
| Known Deployments | Cloud and edge ready, TensorRT, NGC containers | Full stack, TFX, TF Lite/Serving | ONNX, TorchServe, TensorRT | Flexible but not mature | ONNX, Triton and ROCm |

Communication primitives deserve further attention, because communication can be overlapped with computation and the primitives are topology sensitive. NCCL, for example, supports the following collectives:
AllReduce: aggregates values from all GPUs with a reduction such as SUM and distributes the result back to all GPUs.
Broadcast: sends data from one GPU to all the others.
AllGather: gathers data from all GPUs and concatenates the results on every GPU.
ReduceScatter: performs the same reduction as AllReduce but scatters the result, so each GPU receives only its own chunk of it.
AlltoAll: each GPU sends a distinct chunk of data to every other GPU.
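
To make the data movement of each collective concrete, here is a minimal single-process sketch in plain Python. It is not NCCL code: each "GPU" is just a list, the function names are hypothetical and chosen for illustration, and buffer lengths are assumed to divide evenly by the number of ranks.

```python
from functools import reduce
import operator

def all_reduce(bufs, op=operator.add):
    """Every rank ends with the element-wise reduction over all ranks."""
    reduced = [reduce(op, column) for column in zip(*bufs)]
    return [list(reduced) for _ in bufs]

def broadcast(bufs, root=0):
    """Every rank ends with a copy of the root rank's buffer."""
    return [list(bufs[root]) for _ in bufs]

def all_gather(bufs):
    """Every rank ends with the concatenation of all ranks' buffers."""
    gathered = [x for buf in bufs for x in buf]
    return [list(gathered) for _ in bufs]

def reduce_scatter(bufs, op=operator.add):
    """Reduce element-wise, then give rank r only the r-th chunk of the result."""
    n = len(bufs)
    reduced = [reduce(op, column) for column in zip(*bufs)]
    chunk = len(reduced) // n  # assumes the length divides evenly
    return [reduced[r * chunk:(r + 1) * chunk] for r in range(n)]

def all_to_all(bufs):
    """Rank dst receives the dst-th chunk from every source rank, in rank order."""
    n = len(bufs)
    chunk = len(bufs[0]) // n  # assumes the length divides evenly
    return [
        [x for src in range(n) for x in bufs[src][dst * chunk:(dst + 1) * chunk]]
        for dst in range(n)
    ]

ranks = [[1, 2], [10, 20]]    # two simulated GPUs, two elements each
print(all_reduce(ranks))      # [[11, 22], [11, 22]]
print(broadcast(ranks))       # [[1, 2], [1, 2]]
print(all_gather(ranks))      # [[1, 2, 10, 20], [1, 2, 10, 20]]
print(reduce_scatter(ranks))  # [[11], [22]]
print(all_to_all(ranks))      # [[1, 10], [2, 20]]
print(all_gather(reduce_scatter(ranks)) == all_reduce(ranks))  # True
```

The last line shows a useful identity: an AllReduce gives the same result as a ReduceScatter followed by an AllGather.
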
As mentioned earlier, these primitives play an important role in sharding data sets, models and tensors, and even help provide expert-level parallelism across GPUs or other acceleration hardware.
These collectives are supported for multi-threaded and multi-process AI workloads using CUDA, and NCCL also encompasses profiling and observability tools for debugging and tuning. Uber's Horovod library is heavily influenced by NCCL in this regard. The key question is how to map these primitives onto the underlying hardware, either through dedicated fixed-function accelerator units or through vector ISAs (which are generally not performance friendly for communication primitives, since the data must traverse a logical hierarchy within a system-on-chip type ASIC). Similar libraries besides NCCL include Gloo (Meta), BytePS (ByteDance) and UCX (Unified Communication X), which can support CPUs and GPUs using CUDA and InfiniBand, and finally MPI (Open MPI, MVAPICH), which is generic, very mature and supports heterogeneous hardware but may incur higher latency.
Besides NCCL, which handles communication, cuDNN is Nvidia's CUDA deep neural network library, highly tuned on Nvidia hardware for convolution, activation functions, normalization and RNNs. cuDNN primitives can speed up dot products and the forward and backward passes of neural networks. cuBLAS is another important library; it supports basic linear algebra operations on dense vectors and matrices. It has three levels. Level 1 is primarily vector operations: dot products, vector additions, norm calculations and vector swaps. Level 2 covers matrix-vector operations, namely GEMV (general matrix-vector multiplication) and SYMV/HEMV (symmetric/Hermitian matrix-vector products), and is a major contributor to ML/LLM workloads. Level 3 is the most important: it provides GEMM (matrix-matrix multiplication, C = alpha*A*B + beta*C) as well as the SYMM/HEMM matrix multiplies and the Hermitian rank-k update. cuBLAS also contains data movement functions. Nvidia additionally offers the cuTENSOR library for high-performance tensor operations, enabling the efficient tensor contractions that multi-dimensional tensor computations in LLMs depend on.
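
As a reference for what the three BLAS levels compute, here is a minimal sketch in plain Python. It mirrors the mathematical definitions only; it does not use the cuBLAS API, and the function names and argument order are simplified for illustration.

```python
def dot(x, y):
    """Level 1: vector-vector dot product."""
    return sum(a * b for a, b in zip(x, y))

def gemv(alpha, A, x, beta, y):
    """Level 2: y = alpha*A*x + beta*y for a dense row-major matrix A."""
    return [alpha * dot(row, x) + beta * yi for row, yi in zip(A, y)]

def gemm(alpha, A, B, beta, C):
    """Level 3: C = alpha*A*B + beta*C for dense row-major matrices."""
    cols_B = list(zip(*B))  # columns of B, so each entry is a row-by-column dot
    return [
        [alpha * dot(row, col) + beta * c for col, c in zip(cols_B, c_row)]
        for row, c_row in zip(A, C)
    ]

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C = [[1, 1], [1, 1]]

print(dot([1, 2, 3], [4, 5, 6]))       # 32
print(gemv(1, A, [1, 1], 0, [0, 0]))   # [3, 7]
print(gemm(2, A, B, 3, C))             # [[41, 47], [89, 103]]
```

Level 3 matters most because it does the most arithmetic per byte moved: for n-by-n matrices GEMM performs on the order of n^3 multiply-adds on n^2 data, which is what lets tuned libraries keep the compute units busy.
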
Alternatives to cuBLAS include rocBLAS (AMD GPU centric), OpenBLAS (open source, CPU centric) and oneMKL (Intel); in addition, MAGMA (Matrix Algebra on GPU and Multicore Architectures) provides hybrid CPU and GPU routines. Most of these libraries support data types such as Int, Float and Double, and a few support Complex data types. CUTLASS, Nvidia's CUDA C++ template library, is functionally closest to cuBLAS and is GPU native.
Nvidia has also released cuSPARSE, which can accelerate models that exploit sparsity in weights and gradients.
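
To show why sparsity saves work, here is a minimal sketch of a sparse matrix-vector product in plain Python. It assumes the compressed sparse row (CSR) layout, one of the storage formats cuSPARSE supports; the helper names are hypothetical and this is not the cuSPARSE API.

```python
def dense_to_csr(M):
    """Convert a dense row-major matrix to CSR arrays (values, col_idx, row_ptr)."""
    values, col_idx, row_ptr = [], [], [0]
    for row in M:
        for j, v in enumerate(row):
            if v != 0:
                values.append(v)
                col_idx.append(j)
        row_ptr.append(len(values))  # end offset of this row's nonzeros
    return values, col_idx, row_ptr

def csr_matvec(values, col_idx, row_ptr, x):
    """Compute y = M*x, touching only the stored nonzeros of M."""
    y = []
    for i in range(len(row_ptr) - 1):
        start, end = row_ptr[i], row_ptr[i + 1]
        y.append(sum(values[k] * x[col_idx[k]] for k in range(start, end)))
    return y

M = [[0, 2, 0],
     [0, 0, 0],
     [1, 0, 3]]
values, col_idx, row_ptr = dense_to_csr(M)
print(values, col_idx, row_ptr)                          # [2, 1, 3] [1, 0, 2] [0, 1, 1, 3]
print(csr_matvec(values, col_idx, row_ptr, [1, 2, 3]))   # [4, 0, 10]
```

The product above performs three multiplications instead of the nine a dense routine would need, and the saving grows with the fraction of zero weights.
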
On a closing note, TensorRT and TensorRT-LLM from Nvidia are optimized inference engines, notably employing enhanced KV cache usage when calculating attention during inference.
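
The idea behind a KV cache can be sketched in a few lines of plain Python. This is a toy single-head attention under stated assumptions: the `KVCache` class and `attend` function are hypothetical, the query, key and value vectors are supplied directly instead of being projected from hidden states, and nothing here reflects how TensorRT-LLM actually manages its cache.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(q, keys, values):
    """Scaled dot-product attention of one query over all given keys/values."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]

class KVCache:
    """Stores keys and values of past tokens so they are never recomputed."""
    def __init__(self):
        self.keys = []
        self.values = []

    def step(self, q, k, v):
        # Append only the new token's key/value, then attend over the whole cache.
        self.keys.append(k)
        self.values.append(v)
        return attend(q, self.keys, self.values)

# Three decoding steps, each with a (query, key, value) triple.
tokens = [
    ([1.0, 0.0], [1.0, 0.0], [1.0, 2.0]),
    ([0.0, 1.0], [0.0, 1.0], [3.0, 4.0]),
    ([1.0, 1.0], [1.0, 1.0], [5.0, 6.0]),
]

cache = KVCache()
cached_outputs = [cache.step(q, k, v) for q, k, v in tokens]

# Baseline without a cache: rebuild the key/value prefix at every step.
all_keys = [k for _, k, _ in tokens]
all_values = [v for _, _, v in tokens]
recomputed = [
    attend(q, all_keys[:t + 1], all_values[:t + 1])
    for t, (q, _, _) in enumerate(tokens)
]
print(cached_outputs)
```

Both paths produce the same attention outputs; the cached path simply avoids regenerating keys and values for earlier tokens at each step, trading memory for compute.
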
