Telemetry of AI Servers for Emerging Connectivity Solutions

Emerging and existing connectivity standards play a significant role in the deployment of AI solutions, especially LLM deployments for training or inference. Telemetry in turn plays a significant role in optimization and configuration management of these servers for their unique needs. Connectivity solutions can be categorized as rack-level (within a rack) or cross-rack. AI solutions may be deployed as pods configured within a single rack or as pods spanning different racks in a data center. One rack may host one or many pods, so any configuration management or performance/power optimization needs to take into account these topologies and the metrics inherent to each pod.

My aim here is not to dive into the topologies used within racks or across the data center, but to focus on the telemetry needed for optimal connectivity. In this post I focus on some key metrics of such connectivity solutions and their impact on performance as well as power profiles. Examples for rack-scale deployments are PCIe 6.0/7.0 and Ultra Accelerator Link (UALink 1.0); across racks, examples are NVLink and Ethernet solutions. These metrics are inherently hierarchical: socket-level metrics need to be considered before moving on to rack level and beyond. Telemetry hooks provided on the System on Chip (SoC) can give significant insight to upper-layer observability and management frameworks in software, for performance/power and configuration optimization aimed at scalability and cost reduction. This post covers PCIe 6.0/7.0 and UALink; NVLink may be discussed in future posts.

Snapshot for PCIe 6.0/7.0:

PCIe 6.0:
- Point-to-point, PAM4 signaling
- 64 GT/s per lane (PAM4 allows a 16 GHz clock at the driver)
- Flit mode transactions (new: fixed 256 B flit): 236 B TLP + 6 B DLP + 8 B CRC + 6 B FEC
- x16 bandwidth: 64 GT/s x 16 lanes per direction = 128 GB/s

PCIe 7.0:
- Point-to-point, PAM4 signaling
- 128 GT/s per lane (PAM4, 32 GHz clock at the driver)
- Flit mode transactions
- x16 bandwidth: 128 GT/s x 16 lanes per direction = 256 GB/s
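
The x16 figures above follow from simple arithmetic, and the 256 B flit layout also bounds how much of each flit can carry TLP bytes. Below is a minimal Python sketch that uses only the numbers from the snapshot. The function names are illustrative and not taken from any PCIe tool, and applying the same flit layout to PCIe 7.0 is an assumption made here for comparison only.

```python
# Flit layout from the PCIe 6.0 snapshot above (bytes per 256 B flit).
FLIT_BYTES = {"tlp": 236, "dlp": 6, "crc": 8, "fec": 6}


def raw_bandwidth_gbytes(gt_per_s: float, lanes: int) -> float:
    """Raw one-direction bandwidth in GB/s, treating 1 GT/s as 1 Gb/s per lane."""
    return gt_per_s * lanes / 8


def flit_tlp_fraction(layout=FLIT_BYTES) -> float:
    """Fraction of each flit available for TLP bytes."""
    return layout["tlp"] / sum(layout.values())


if __name__ == "__main__":
    for name, rate in (("PCIe 6.0", 64), ("PCIe 7.0", 128)):
        raw = raw_bandwidth_gbytes(rate, 16)
        # Assumption: the same 236/256 layout is applied to both generations.
        bound = raw * flit_tlp_fraction()
        print(f"{name} x16: raw {raw:.0f} GB/s, TLP upper bound {bound:.1f} GB/s")
```

The TLP fraction (236/256, roughly 92%) is an upper bound before any replay or flow-control stalls, so it is a useful baseline against which measured throughput from telemetry can be compared.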

Any telemetry design needs to account for the above snapshot when connecting sockets within each pod. Note that PCIe 6.0 silicon is on the market, with switches and integrated ISA cores, whereas 7.0 is still in the pipeline. Silicon availability exposes many system integration issues, so telemetry can be designed for optimization much more efficiently once silicon exists. One example is round-trip latency for PCIe 6.0 and 7.0, a useful parameter for inference performance use cases, which can be determined using provisioned telemetry when these interconnects are used.
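
As a rough illustration of the round-trip latency example, the sketch below summarizes latency from pairs of issue and completion timestamps. The input format (nanosecond timestamp pairs exported by some telemetry hook) is hypothetical; real silicon may expose histograms or counters instead.

```python
import math


def round_trip_stats(samples):
    """Summarize round-trip latency from (issue_ns, completion_ns) pairs.

    The pair format is a hypothetical telemetry export, not a real API.
    """
    latencies = sorted(done - issued for issued, done in samples)
    if not latencies:
        raise ValueError("no samples")

    def percentile(p):
        # Nearest-rank percentile on the sorted latencies.
        rank = max(1, math.ceil(p / 100 * len(latencies)))
        return latencies[rank - 1]

    return {
        "min_ns": latencies[0],
        "mean_ns": sum(latencies) / len(latencies),
        "p50_ns": percentile(50),
        "p99_ns": percentile(99),
        "max_ns": latencies[-1],
    }
```

For inference workloads the tail (p99) is often more informative than the mean, since a few slow completions can stall a whole batch.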

One may be tempted to compare the above solutions with UALink 1.0, but validated silicon is still awaited (a few companies such as Marvell, Synopsys, and Astera Labs have development in the works). It promises much higher data rates per lane than PCIe 7.0 (200 GT/s per lane vs. 128 GT/s per lane), and promises to scale up to 2 racks, with a stretch goal of up to 4 racks. With four lanes this gives 200 GT/s x 4 = 800 GT/s in one direction. It also promises to support 1024 accelerators per pod. The PHY layer of the Ethernet standard has mostly been kept unchanged, whereas the data link and transaction layers have been modified from the original IEEE Ethernet standard to reduce latency overhead and improve throughput.

For telemetry purposes, one may treat performance and power separately for the PHY sub-layer, the data link layer, and the transaction layer. PHY-layer telemetry might offer insight into the signaling eye opening and the loss of signal power at the receiver on the channel, as well as error handling: Forward Error Correction (FEC) decoding and correction, CRC error checks, and replay of requests. Note that these telemetry metrics not only indicate channel health (e.g., errors corrected/detected, channel eye width closing) but also affect the bandwidth achieved: errors cause requests to be replayed, which decreases effective bandwidth and increases request latency. That in turn affects the flow control mechanisms of upper layers such as DLP and TLP, because credits are not released for the producer to generate new traffic. Replays also increase the latency of request completion seen by the producer, and thus cause indirect stalls that show up in otherwise unrelated telemetry for the accelerator cores and memory subsystems.
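
To make the link between replays and effective bandwidth concrete, here is a deliberately simplified sketch. It assumes two hypothetical counters (flits sent and flits replayed) and that each replay costs exactly one flit slot; real replay protocols may retransmit more than the failing flit, so this is an optimistic estimate.

```python
def effective_bandwidth_gbytes(raw_gbytes, tlp_fraction, flits_sent, flits_replayed):
    """Estimate useful TLP bandwidth after replay overhead.

    Simplifying assumption: each replayed flit wastes exactly one flit slot.
    The counter names are hypothetical, not taken from any specification.
    """
    if flits_sent <= 0:
        raise ValueError("flits_sent must be positive")
    if not 0 <= flits_replayed <= flits_sent:
        raise ValueError("flits_replayed must be between 0 and flits_sent")
    useful_fraction = (flits_sent - flits_replayed) / flits_sent
    return raw_gbytes * tlp_fraction * useful_fraction
```

With the PCIe 6.0 x16 numbers (128 GB/s raw, 236/256 TLP fraction), a clean link tops out at 118 GB/s of TLP bytes, and a 1% replay rate already removes a little over 1 GB/s before any credit stalls are counted.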

Any telemetry framework that discovers load across AI cores/accelerators within a socket, across sockets within a pod, or across pods needs to correlate these networking metrics with the other read/write metrics obtained through these telemetry hooks when making sharding decisions for data or model parallelism of AI workloads. One point to note concerns scalability: since UALink promises large scalability through its switches, telemetry can provide a comparison of practical vs. theoretical limits and inform configuration management.
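
One way to picture the practical vs. theoretical comparison is a per-link utilization ratio. The sketch below assumes a hypothetical telemetry export of bytes moved per link over a sampling interval; the link names and the idea of ranking links by headroom for sharding are illustrative only.

```python
def link_utilization(bytes_moved, interval_s, theoretical_gbytes):
    """Achieved throughput as a fraction of the theoretical one-direction limit."""
    achieved_gbytes = bytes_moved / interval_s / 1e9
    return achieved_gbytes / theoretical_gbytes


def rank_links_by_headroom(samples, theoretical_gbytes):
    """Return (link, utilization) pairs, least utilized first.

    samples maps a hypothetical link name to (bytes_moved, interval_s).
    """
    utilization = {
        name: link_utilization(moved, interval, theoretical_gbytes)
        for name, (moved, interval) in samples.items()
    }
    return sorted(utilization.items(), key=lambda item: item[1])
```

A sharding policy could then prefer placing new traffic on the links at the front of the list, although a real framework would also weigh the memory and compute metrics mentioned above.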

Reliability and security aspects can also be handled with proper telemetry hooks. A few examples: a large number of FEC/CRC correction events detected by telemetry indicates marginal link stability or sources of EMI near these links. Denial-of-service attacks can be detected using telemetry at the TLP and DLP levels of the connectivity stack, using the IDs and address slices related to requests. Burstiness captured by telemetry on network traffic might indicate the AI workload profile, or a potential need for optimization through sharding/interleaving. Some aspects related to network telemetry can be uncovered using memory and compute infrastructure telemetry, but the key concept is to correlate these events across subsystems, at socket level, at pod level, or across pods, to derive a coherent plan of action for optimization, security, or reliability, as the case may be for the AI workload running on the given server.
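
Two of the signals above, correction-event rate and traffic burstiness, can be sketched with very little code. The counter names are hypothetical, and the default threshold is an arbitrary placeholder rather than a value from any specification; a real deployment would calibrate it per link.

```python
def correction_rate(corrected_events, flits_received):
    """Corrected FEC/CRC events per received flit (hypothetical counters)."""
    if flits_received <= 0:
        raise ValueError("flits_received must be positive")
    return corrected_events / flits_received


def is_link_marginal(corrected_events, flits_received, threshold=1e-6):
    """Flag a link whose correction rate exceeds a placeholder threshold."""
    return correction_rate(corrected_events, flits_received) > threshold


def peak_to_mean(byte_counts):
    """Simple burstiness metric: peak bytes per window divided by the mean."""
    if not byte_counts:
        raise ValueError("no samples")
    mean = sum(byte_counts) / len(byte_counts)
    if mean == 0:
        return 0.0
    return max(byte_counts) / mean
```

A peak-to-mean ratio near 1 suggests steady traffic, while a high ratio points to bursty phases such as collective operations, which is where sharding or interleaving changes are most likely to pay off.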