High-Radix Data Center Switches: Architecture and Feature Comparison of Commercial Scale-Up and Scale-Out Offerings

Current commercial offerings use high-radix switches for rack-scale computing, predominantly in AI servers in the data center. The radix of a switch is the number of external ports it provides. High-radix switches generally provide 128 to 256 external ports per switch, which reduces cabling, flattens hierarchical topologies, and lowers hop count and latency, thereby reducing power within the rack. A good reference on the effect of radix in switches is the Google Tech Talk "High Radix Interconnection Networks".
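As a rough illustration of why radix matters, the sketch below estimates how many hosts a non-blocking folded Clos (fat-tree) network can attach for a given switch radix. It assumes the textbook construction in which every switch below the top tier splits its ports evenly between downlinks and uplinks. The function names and the radix values are illustrative and are not taken from any vendor documentation.

```python
def max_hosts(radix: int, tiers: int) -> int:
    """Hosts attachable to a non-blocking folded Clos of identical switches.

    Assumption: switches below the top tier use half their ports as
    downlinks and half as uplinks, which gives 2 * (radix / 2) ** tiers.
    """
    if radix < 2 or radix % 2 != 0 or tiers < 1:
        raise ValueError("radix must be an even number >= 2 and tiers >= 1")
    return 2 * (radix // 2) ** tiers


def worst_case_switch_hops(tiers: int) -> int:
    """Switches traversed on the longest path (up to the top tier and back)."""
    return 2 * tiers - 1


for radix in (64, 128, 256, 512):
    print(f"radix {radix:3d}: "
          f"2-tier hosts = {max_hosts(radix, 2):7d}, "
          f"3-tier hosts = {max_hosts(radix, 3):10d}")
```

Under these assumptions, doubling the radix from 64 to 128 quadruples the hosts reachable in two tiers (2,048 to 8,192), which is how a higher radix can remove a tier and the extra hops that come with it.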

Examples of high-radix switches available on the market include Tomahawk 6 (TSMC 3nm) from Broadcom, which promises to deliver 102.4 Tbps. As per its documentation, it can be flexibly configured as 64 × 1.6TbE, 128 × 800GbE, 256 × 400GbE, 512 × 200GbE, 512 × 100GbE, or 512 × 50GbE ports. Another Ethernet offering is Spectrum-X from Nvidia, which delivers 51.2 Tbps over 64 ports at 800 Gbps or 128 ports at 400 Gbps, paired with BlueField-3 SuperNICs (DPUs, data processing units). Besides these two front runners, the Cisco Nexus 9000 series can support 51.2 Tbps using 512 ports at 100 Gbps each or 64 ports at 800 Gbps each in its optical offering. Lastly, Marvell's Teralynx offering promises 51.2 Tbps with 64 ports of 800 Gbps or 128 ports of 400 Gbps for scale-out Ethernet.

Tomahawk 6 looks powerful and also has the bigger market share. It delivers high bandwidth using co-packaged optics (co-packaging with optics eliminates the need to source and manage high-speed pluggable transceivers in deployment). It offers 102.4 Tbps of switching capacity through 16 × 6.4 Tbps optical engines, with all-optical I/O over 512 duplex single-mode optical fibers at 200G PAM4 each. As per the BCM78919 specification, it can be used for data center spine, leaf, ToR (top-of-rack), blade, and aggregation switching. It supports a scale-up cluster size of 512 XPUs at 200 Gbps per link and is IEEE 802.3 compliant.
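The Tomahawk 6 port configurations quoted above can be sanity-checked with simple arithmetic. The sketch below multiplies port count by port speed for each listed configuration. The names `CONFIGS` and `aggregate_tbps` are illustrative. The arithmetic shows that the 100GbE and 50GbE modes stay at 512 ports and so do not reach 102.4 Tbps; reading this as a cap set by the number of 200G lanes is an assumption, not a statement from the specification.

```python
# (port count, port speed in Gbps) for each configuration listed in the text.
CONFIGS = [
    (64, 1600),
    (128, 800),
    (256, 400),
    (512, 200),
    (512, 100),
    (512, 50),
]


def aggregate_tbps(ports: int, port_gbps: int) -> float:
    """Aggregate one-direction bandwidth in Tbps for a port configuration."""
    return ports * port_gbps / 1000


for ports, gbps in CONFIGS:
    print(f"{ports:3d} x {gbps:4d} Gbps = {aggregate_tbps(ports, gbps):6.1f} Tbps")
```

The first four configurations all multiply out to 102.4 Tbps, while 512 × 100GbE gives 51.2 Tbps and 512 × 50GbE gives 25.6 Tbps.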

Before diving deeper into individual offerings, it is worth mentioning that all of these offerings internally use a multi-stage Clos fabric that allows high-radix scaling. In general, Clos network topologies are characterized using the following metrics:

  • Bisection bandwidth: the bandwidth sustained across a cut of the switch fabric when its ports are divided into two halves and each half sends full line-rate traffic to the other half. Thus, if a switch supports N ports at speed R each (N × R total, full duplex), splitting the ports into two groups of N/2 gives a bisection bandwidth of (N/2) × R in each direction for non-blocking operation without oversubscription. For example, a 51.2 Tbps switch with 64 ports needs at least 25.6 Tbps in each direction (64/2 ports × 800 Gbps/port). To avoid drops under worst-case traffic, you may overprovision.
  • Diameter: the maximum number of hops between any two supported nodes in the implemented Clos network (typically 3 to 5 in a spine-leaf Clos). Tomahawk 6, however, promises single-hop port-to-port latency in its chiplet-based 3nm design, possibly because of VOQ (virtual output queuing) and dynamic routing.
  • Throughput under traffic matrices: the worst-case flow completion time through the switch, generally measured using all-to-all, incast, or longest-matching traffic matrices. Traffic matrices can reveal whether the advertised bisection bandwidth holds under imbalances in packet distribution (which can be further classified as ASIC level or network level).
  • Oversubscription: the ratio of edge (server) bandwidth to aggregate core (spine or uplink) bandwidth. Note that ASIC-level (internal Clos network) oversubscription and network-level oversubscription are two different metrics.
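The bisection bandwidth and oversubscription definitions above reduce to short formulas. The sketch below is a minimal illustration. The function names are hypothetical, and the 48-down/16-up leaf split is an assumed example, not a configuration taken from any of the products discussed.

```python
def bisection_bandwidth_gbps(ports: int, port_gbps: float) -> float:
    """Per-direction bisection bandwidth of a non-blocking switch.

    Half of the ports send at line rate to the other half: (N / 2) * R.
    """
    return (ports // 2) * port_gbps


def oversubscription_ratio(down_ports: int, down_gbps: float,
                           up_ports: int, up_gbps: float) -> float:
    """Edge (server-facing) bandwidth divided by uplink (core-facing) bandwidth."""
    return (down_ports * down_gbps) / (up_ports * up_gbps)


# 64 ports at 800 Gbps: 32 ports x 800 Gbps across the cut in each direction.
print(bisection_bandwidth_gbps(64, 800) / 1000, "Tbps per direction")

# Hypothetical leaf switch: 48 server-facing and 16 uplink ports, all 800 Gbps.
print(oversubscription_ratio(48, 800, 16, 800), ": 1 oversubscription")
```

A ratio of 1.0 corresponds to a non-blocking design; the assumed 48/16 split gives 3.0, that is, a 3:1 oversubscribed leaf.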

Another important aspect is the load-balancing method used at the network level with high-radix switches like those mentioned above in a multi-stage Clos fabric. ECMP (equal-cost multipath) routing is commonly employed: it identifies a flow from packet header fields and hashes those fields to select the next hop. The hash function plays a pivotal role in distributing traffic uniformly. Tomahawk advertises Cognitive Routing 2.0, which is essentially dynamic rerouting of packets, based on sideband telemetry, through underutilized paths in its internal Clos fabric. This might help with elephant flows, which are common when using all-reduce collectives such as those in the NCCL library; it might prevent queue buildup or perform selective dropping during GPU-related bursts.
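To make the ECMP description concrete, here is a minimal sketch of hash-based next-hop selection. It uses CRC32 over a 5-tuple as a stand-in hash; real switch ASICs use their own hardware hash functions and field selections, so this is only an illustration. The function name, addresses, and source ports are made up for the example; the destination port 4791 is the UDP port used by RoCEv2.

```python
import zlib
from collections import Counter


def ecmp_next_hop(src_ip: str, dst_ip: str, src_port: int,
                  dst_port: int, proto: int, num_paths: int) -> int:
    """Pick an uplink index by hashing the flow's 5-tuple (illustrative only)."""
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    return zlib.crc32(key) % num_paths


# Eight hypothetical long-lived (elephant) flows spread over four uplinks.
flows = [
    (f"10.0.0.{i}", f"10.0.1.{i}", 49152 + i, 4791, 17)  # 17 = UDP
    for i in range(1, 9)
]
load = Counter(ecmp_next_hop(*flow, num_paths=4) for flow in flows)

for path in range(4):
    print(f"uplink {path}: {load[path]} flow(s)")
```

Because every packet of a flow hashes to the same uplink, a handful of elephant flows can land unevenly even when the hash is uniform on average. This is the kind of imbalance that telemetry-driven rerouting aims to correct.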

It is worth mentioning that workloads built on collective libraries such as NCCL and MPI are the most likely to stress these switches, through operations like All-to-All, All-Reduce, All-Gather, and Broadcast, which are pivotal primitives in AI workloads.

As per some published benchmarks (disclaimer: not my measurements), Tomahawk 6 seems to shine on the All-Reduce primitive thanks to its advertised Cognitive Routing 2.0, while Spectrum-4 performs well on this primitive using RDMA over Converged Ethernet. Tomahawk 6 also promises sub-1-usec latency on the Broadcast primitive while supporting 102.4 Tbps through the switch, whereas Spectrum-4 promises good performance on this primitive with adaptive flows at its 51.2 Tbps capacity limit. Another point of comparison is All-Gather, where Tomahawk 6 promises good utilization on elephant flows whereas Spectrum-4 promises ~90% utilization for inference. The takeaway, at least, is that the primitives of these collective libraries are a good basis for comparing these switches.

Similarly, for MPI-based workloads, the All-to-All primitive shines on Tomahawk 6 (2x as per some estimates), with sub-5-usec latency using its cognitive routing feature. Spectrum-4 matches this performance at smaller cluster sizes.

Some notes on benchmarking methodology: NCCL tests launch one process per GPU, sweep message sizes from 1KB to 1GB, and run 1000+ iterations for each size, especially for primitives like all-gather and all-reduce. The end goal is to measure average bandwidth (GB/s) and tail latency in usec. The tests use ring or tree algorithms over RDMA over Converged Ethernet; MLBENCH might be a suitable benchmark for a deep dive on this. For MPI, the Ohio State University benchmark suite, called OSU, has similar primitives such as all-to-all and all-reduce. It can also sweep message sizes from 4B to 1MB (smaller message sizes), can report bidirectional bandwidth and loaded latency, and validates ECMP (equal-cost multipath) and congestion control behavior.
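The measurement loop described above can be sketched as post-processing over per-iteration timings. The code below generates a power-of-two message-size sweep and reduces a list of iteration times to average bandwidth and a nearest-rank p99 tail latency. The bus-bandwidth correction factors (2(n-1)/n for all-reduce, (n-1)/n for all-gather) follow the convention described in the nccl-tests performance notes as I understand it and should be checked against that documentation. The function names and the timing samples are synthetic and only for illustration.

```python
import statistics

# Bus-bandwidth correction factors as a function of rank count n.
# Assumption: taken from the nccl-tests performance notes; verify before use.
BUS_FACTOR = {
    "all_reduce": lambda n: 2 * (n - 1) / n,
    "all_gather": lambda n: (n - 1) / n,
}


def sweep_sizes(min_bytes=1024, max_bytes=1024 ** 3, factor=2):
    """Message sizes from min_bytes to max_bytes, multiplying by factor."""
    sizes = []
    size = min_bytes
    while size <= max_bytes:
        sizes.append(size)
        size *= factor
    return sizes


def summarize(size_bytes, times_us, num_ranks, collective="all_reduce"):
    """Average algorithm/bus bandwidth (GB/s) and p99 latency (usec)."""
    ordered = sorted(times_us)
    rank = (99 * len(ordered) + 99) // 100  # ceil(0.99 * n), nearest-rank p99
    avg_us = statistics.fmean(ordered)
    alg = size_bytes / (avg_us * 1e-6) / 1e9
    return {
        "avg_us": avg_us,
        "p99_us": ordered[rank - 1],
        "alg_gb_per_s": alg,
        "bus_gb_per_s": alg * BUS_FACTOR[collective](num_ranks),
    }


# Synthetic timings: 980 fast iterations and 20 slow ones (congestion tail).
samples = [1000.0] * 980 + [5000.0] * 20
print(len(sweep_sizes()), "message sizes in the sweep")
print(summarize(1024 ** 3, samples, num_ranks=8))
```

Reporting the tail alongside the average matters here: in the synthetic sample the average moves only slightly, while the p99 latency is five times the typical iteration time.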

It might be instructive to break down all of the above metrics as we scale the number of racks per pod and from pod to pod in the data center, and to ask how much these switches contribute to performance for traffic within a rack as compared to traffic among racks within a pod. As per some preliminary information, rack-to-rack traffic distribution has a larger performance impact than within-rack traffic distribution, though that is highly skewed by the AI training and inference patterns of the model being used.