Recent AI SOC (System on Chip) Custom Solutions, Features and Architecture for tasks like Ranking/Recommendation Engines, LLM (Text and Multi-Modal) Training and Inference

Recently there has been a lot of activity around custom chips, called accelerators, built for ranking/recommendation engines and for training and inference of large language models (text and multi-modal). They are intended for hyperscaler data centers, alongside the established Nvidia and AMD GPU-based solutions. Here I discuss the features and architecture of Meta Training Inference Accelerator-Gen5 Olympus and Amazon's Trainium2, both created primarily for cloud servers running text and multi-modal AI workloads, and finally Ironwood, Google's flagship TPU. Together they give an idea of how the compute, memory, IO and networking subsystems of the underlying platforms are evolving.

Meta Training and Inference Solution called Olympus (based on the specifications, it looks primarily like an inference part):

Source: ai.meta.com	Claims 3x performance improvement over first generation
Primary Use Case: Ranking and recommendation engines (for all practical purposes it looks more like an inference engine)	4 models used for evaluation: low-complexity and high-complexity ranking and recommendation models and an advertisement ranking model
Compute Capabilities	GEMM dense compute: 3.5x improvement; GEMM sparse compute: 7x improvement (INT8 708 / BF16 354 TFLOPS); dense compute (INT8 354 / BF16 177 TFLOPS); SIMD 5.53 TFLOPS (INT8, BF16)
Memory Capacity	256 MB on-chip; off-chip LPDDR5 planned up to 128 GB
Memory Bandwidth	204.8 GB/s off-chip, 2.7 TB/s on-chip, 1 TB/s per PE
Connectivity between accelerators and between host and accelerators	PCIe Gen5
Scale-Out Connectivity	RDMA with NIC (claims 6x model-serving throughput over first generation with 2x the number of devices supported)
Customized Software Toolchain	PyTorch 2.0; runtime includes Triton compiler/graph compiler
Technology/Die Area	5 nm @ 1.35 GHz / 421 sq mm
Topology of Engines/Part	Grid of 8×8 processing elements, each with 384 KB of SRAM
Thermal Design Power	90 W
Network-on-Chip Protocol	Unknown/not disclosed, but inspired by systolic arrays and connected in a mesh topology, with a routing heuristic similar to dimension-order routing to minimize hop latency
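
To make the routing row concrete, here is a minimal Python sketch of dimension-order (XY) routing on an 8×8 mesh. Meta has not disclosed the actual NoC protocol, so this only illustrates the general technique and is not MTIA's implementation; the function names `xy_route` and `mesh_hops` and the (row, column) coordinate scheme are my own assumptions.

```python
GRID = 8  # 8x8 grid of processing elements, per the table above


def xy_route(src, dst):
    """Hypothetical XY route: travel along the column axis first, then the row axis."""
    row, col = src
    path = [(row, col)]
    while col != dst[1]:
        col += 1 if dst[1] > col else -1
        path.append((row, col))
    while row != dst[0]:
        row += 1 if dst[0] > row else -1
        path.append((row, col))
    return path


def mesh_hops(src, dst):
    """Hop count of any minimal route on a mesh (Manhattan distance)."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])


# Worst case: opposite corners of the grid.
corner_to_corner = xy_route((0, 0), (GRID - 1, GRID - 1))
worst_case_hops = len(corner_to_corner) - 1

# Average over all ordered PE pairs (including a PE talking to itself).
pes = [(r, c) for r in range(GRID) for c in range(GRID)]
average_hops = sum(mesh_hops(a, b) for a in pes for b in pes) / len(pes) ** 2

print(f"worst-case hops: {worst_case_hops}")
print(f"average hops over all PE pairs: {average_hops:.2f}")
```

XY routing is deterministic and always minimal on a mesh, which is why it is a common baseline when the goal is to keep hop latency low.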

Based on the published data, there appears to be no hardware coherency between the engines on the 8×8 grid, so the 384 KB of SRAM attached to each processing element is presumably kept coherent with help from software. Since RDMA support has been added along with NIC connectivity, the solution seems to support datacenter clusters built from racks of these parts. Triton backend compiler support in the software toolchain seems intended to make GPU-like programming hardware agnostic. There is also extensive telemetry support, borrowed from Nvidia GPU cluster experience; it is not discussed here, but it is sufficient to give visibility into utilization and stalls at various levels of the functional hierarchy. A rack-based system holds up to 72 accelerators: three chassis, each containing 12 boards that house two accelerators each. The chip is reportedly clocked at 1.35 GHz and runs at 90 watts, compared to 25 watts for the first-generation design.
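
The rack and power figures above can be cross-checked with a little arithmetic. The sketch below uses only numbers quoted in this post; the variable names are mine, and the rack power is accelerator TDP only (host CPUs, NICs and cooling are not included because they are not published here).

```python
# Figures quoted in the text and table above
CHASSIS_PER_RACK = 3
BOARDS_PER_CHASSIS = 12
ACCELERATORS_PER_BOARD = 2
TDP_WATTS = 90
FIRST_GEN_TDP_WATTS = 25
LPDDR5_GB_PER_ACCELERATOR = 128  # "planned up to" capacity
PROCESSING_ELEMENTS = 8 * 8
SRAM_KB_PER_PE = 384

accelerators_per_rack = CHASSIS_PER_RACK * BOARDS_PER_CHASSIS * ACCELERATORS_PER_BOARD
rack_accelerator_tdp_watts = accelerators_per_rack * TDP_WATTS
rack_lpddr5_gb = accelerators_per_rack * LPDDR5_GB_PER_ACCELERATOR
tdp_ratio = TDP_WATTS / FIRST_GEN_TDP_WATTS
pe_local_sram_mb = PROCESSING_ELEMENTS * SRAM_KB_PER_PE / 1024

print(f"accelerators per rack: {accelerators_per_rack}")
print(f"accelerator TDP per rack: {rack_accelerator_tdp_watts} W")
print(f"LPDDR5 per rack (at 128 GB each): {rack_lpddr5_gb} GB")
print(f"TDP vs first generation: {tdp_ratio:.1f}x")
print(f"total PE-local SRAM per chip: {pe_local_sram_mb:.0f} MB")
```

The TDP grows by 3.6x while the claimed performance gain is 3x, so performance per watt at TDP does not obviously improve; measured power under real workloads may of course differ from TDP.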

On the training side, Amazon's Trainium2 and Trainium3 are also interesting, since AWS offers them as an alternative to Nvidia GPUs. The features and architecture below were captured from the AWS site for Trainium2. Public information on Trainium3 is incomplete as of now, beyond it being a 3nm part with a 40% performance improvement over Trainium2:

Source: https://awsdocs-neuron.readthedocs-hosted.com/	Claims 6.7x improvement in FLOPS over Trainium1 with FP8 and 3.7x with BF16 data types
Primary Use Case	Trainium2 is optimized for LLM/diffusion models, whereas Trainium3 is optimized for multi-model distributed training
Technology	7nm for Trainium2 and 3nm for Trainium3
Processing Elements	8 NeuronCore-v3 engines per part. Trainium2 supports dynamic shapes and control flow via NeuronCore-v3 ISA extensions, and has 20 special cores that support the collective communication library
Compute	1299 TFLOPS (FP8), 667 TFLOPS (BF16)
Memory	224 MB internal SRAM, 96 GiB HBM3; DDR support not published
Connectivity with NeuronLink-v3	Supports 3.5 TB/s DMA bandwidth with inline compression/decompression and DMA barriers. Supports 1.28 TB/s scale-up/scale-out bandwidth per chip.
Memory Bandwidth to HBM3	2.9 TB/s
Scale-Up Topology	NeuronLink-v3 2D/3D torus: 2D (4×4) and 3D (4×4×4)
TDP/Die Area/Operating Voltage	Not available
Data Movement	Handled by 20 special cores, called CC cores, which support DMA barriers between external memory and internal SRAM
Software Toolchain (seems to be designed for NeuronCore)	PyTorch Neuron, JAX Neuron and TensorFlow Neuron are supported, with special Neuron libraries for transformers and training/inference.
NoC Topology	Not disclosed
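
To give a feel for the 2D (4×4) and 3D (4×4×4) torus options in the table, here is a small Python sketch that counts chips, direct neighbors and worst-case hop distance. It assumes a plain torus with wraparound links in every dimension, as the name suggests; the actual NeuronLink-v3 wiring and routing are not described in this post, and the helper names are hypothetical.

```python
from itertools import product


def torus_neighbors(coord, dims):
    """Direct neighbors of a node in a torus with wraparound in each dimension."""
    neighbors = set()
    for axis, size in enumerate(dims):
        for step in (-1, 1):
            n = list(coord)
            n[axis] = (n[axis] + step) % size
            neighbors.add(tuple(n))
    neighbors.discard(tuple(coord))
    return neighbors


def torus_hops(a, b, dims):
    """Minimal hop count between two nodes, taking the shorter way around each ring."""
    return sum(min(abs(x - y), size - abs(x - y)) for x, y, size in zip(a, b, dims))


def torus_summary(dims):
    nodes = list(product(*(range(size) for size in dims)))
    origin = nodes[0]
    return {
        "chips": len(nodes),
        "neighbors_per_chip": len(torus_neighbors(origin, dims)),
        "max_hops": max(torus_hops(origin, n, dims) for n in nodes),
    }


summary_2d = torus_summary((4, 4))
summary_3d = torus_summary((4, 4, 4))
print("2D 4x4:  ", summary_2d)
print("3D 4x4x4:", summary_3d)
```

Going from 2D to 3D quadruples the chip count (16 to 64) while the worst-case distance only grows from 4 to 6 hops, which is the usual argument for adding a torus dimension rather than enlarging the rings.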

Connectivity in EC2 instance

The power profile is missing, which is important to know, although a claim has been made that Trainium3 is 40% more energy efficient than Trainium2. AWS hosts many different LLM vendors and thus has access to their workload profiles, which would be useful for future optimizations from Trainium3 onwards; that is a clear advantage.

Google's Ironwood Tensor Processing Unit is the last solution in this post and is interesting to look at from a capabilities perspective. Ironwood promises to handle the computational and memory demands of models such as Large Language Models (LLMs) and Mixture-of-Experts (MoEs) and of advanced reasoning tasks, supporting both training and serving workloads within the Google Cloud AI Hypercomputer architecture. Based on published metrics it looks very powerful, and most of the comparisons are with respect to the H100:

Source: https://cloud.google.com/	2x of Trillium (sixth-generation TPU); pod configurations offer performance in exaflops (e.g. 42.5)
Primary Use Case	As per the documentation, not meant for vision and scientific simulations but good for LLM (both dense and sparse) inference and (limited) training, and for recommendation engines. Applications supported: Gemini, AlphaFold, Translate, Search and Advertisement
Software Toolchain	TensorFlow, JAX, XLA
Compute	Mainly a systolic array design; published exaflop-level (~42.5) performance for a TPU pod. Compute per chip is ~4614 TFLOPS, optimized for matrix multiplication and tensor operations.
Memory	192 GB HBMx memory
Memory Bandwidth	7.2 TB/s from HBM
Interconnect – Scale-Up Connectivity within TPU Pod	Connects up to 9216 TPUs over ICI links with 1.2Tbps link bandwidth
Thermal	Needs liquid cooling
Energy Efficiency	2x of previous generation
Programmability (Low)	Primarily used within Google Cloud
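
The pod-level figure in the table follows directly from the per-chip numbers. The short Python sketch below redoes that multiplication using only values from the table; it assumes peak per-chip throughput scales linearly across the pod, which is how headline pod numbers are normally quoted, and the variable names are mine.

```python
# Per-chip and pod figures from the table above
CHIPS_PER_POD = 9216
TFLOPS_PER_CHIP = 4614
HBM_GB_PER_CHIP = 192
HBM_TB_PER_S_PER_CHIP = 7.2

pod_exaflops = CHIPS_PER_POD * TFLOPS_PER_CHIP / 1e6  # 1 EFLOPS = 1e6 TFLOPS
pod_hbm_gb = CHIPS_PER_POD * HBM_GB_PER_CHIP
pod_hbm_pb = pod_hbm_gb / 1e6                         # decimal petabytes
pod_hbm_bandwidth_tb_per_s = CHIPS_PER_POD * HBM_TB_PER_S_PER_CHIP

print(f"peak pod compute: {pod_exaflops:.1f} EFLOPS")
print(f"total pod HBM: {pod_hbm_pb:.2f} PB")
print(f"aggregate HBM bandwidth: {pod_hbm_bandwidth_tb_per_s:.0f} TB/s")
```

This reproduces the ~42.5 exaflops quoted for a full pod, so that number is a peak aggregate rather than a measured workload result.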

Although this last part seems powerful, from a usability point of view Google Cloud is currently the only way to use it. Also, internal details are scant, and thermals might add cost since it needs advanced cooling solutions.

I will go into workload mapping and an architectural deep dive of these parts in future blogs, but the above gives an idea of where infrastructure development at the hyperscalers is moving. I will leave the judgement of these parts to the reader, but clearly a part needs to be cost effective and thermally manageable.
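
As a rough starting point for that judgement, the Python sketch below computes a roofline-style "ridge point" for each part: peak FLOPS divided by off-chip memory bandwidth, i.e. how many operations a kernel must perform per byte fetched before compute rather than memory becomes the limit. It uses only the peak numbers listed in the tables above. The comparison is not apples to apples: the MTIA and Trainium2 entries are BF16 peaks, the precision of the Ironwood per-chip figure is not stated in the table, and on-chip SRAM bandwidth is ignored.

```python
# (peak FLOPS, off-chip memory bandwidth in bytes/s), from the tables above
parts = {
    "MTIA (dense BF16, LPDDR5)": (177e12, 204.8e9),
    "Trainium2 (BF16, HBM3)": (667e12, 2.9e12),
    "Ironwood (per-chip peak, HBM)": (4614e12, 7.2e12),
}

# Ridge point: FLOPs needed per byte of off-chip traffic to become compute bound.
ridge_points = {name: flops / bandwidth for name, (flops, bandwidth) in parts.items()}

for name, ridge in ridge_points.items():
    print(f"{name}: ~{ridge:.0f} FLOPs per byte")
```

A higher ridge point means the part leans more heavily on data reuse from on-chip memory, which is consistent with MTIA pairing modest LPDDR5 bandwidth with a large on-chip SRAM.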