FIT (Failure in Time) Calculation: Essential for AI System Reliability Metrics and Health Evaluation

In this post I want to walk through FIT calculations and their role in determining the reliability of the different sub-systems of AI systems (servers or mission-critical IoT devices). The FIT number reflects the rate of errors that remain undetected even after all error-mitigation techniques, such as ECC, CRC, parity, and redundancy, have been implemented in the different parts of the system. More precisely, it measures the number of failures expected in 10^9 operating hours, so the lower the FIT, the more reliable the system.
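As a minimal sketch of the definition above, the snippet below converts an observed failure count into a FIT value. The function name and the sample numbers are hypothetical and chosen only to illustrate the "failures per 10^9 hours" scaling; they are not measurements from any real system.

```python
HOURS_PER_FIT_WINDOW = 1e9  # FIT counts failures per 10^9 operating hours


def fit_from_observations(failures: float, device_hours: float) -> float:
    """Scale an observed failure count to failures per 10^9 device-hours."""
    if device_hours <= 0:
        raise ValueError("device_hours must be positive")
    return failures * HOURS_PER_FIT_WINDOW / device_hours


if __name__ == "__main__":
    # Hypothetical example: 5 failures seen across 2 million device-hours.
    print(fit_from_observations(5, 2_000_000))
```

Device-hours here means the sum of operating hours over all units observed, so 1,000 units running for 2,000 hours each contribute 2 million device-hours.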

In general, FIT is evaluated for each sub-system of a server and then aggregated to arrive at a FIT number for the entire system. One may divide an AI server into the following parts.

Interconnects: I/O such as PCIe 6.0/7.0 or NVMe for secondary storage (SSD or disk); connections to PCIe switches or HCAs for I/O virtualization or bridging to cluster links; links for scalability and coherency (scaling beyond one socket, either for SMP configurations or for scaling across domains to create clusters); and networking links such as Ethernet to switches for server-scale or rack-scale interconnects.

Power feeds: regulators, AC/DC converter sub-systems, on-die regulators, and power delivery networks.

Memory sub-system: the primary, secondary, and tertiary caches within the socket; DDRx memory, its controller, and the associated DDRx I/O; and the HBMx memory sub-system with the associated SerDes links used to access it.

Processing elements: ALUs for integer and floating-point calculations, data buses that move data within a processing element or communicate with the rest of the SoC, and register files such as the integer, vector, and floating-point architectural registers that support the ISA.
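The aggregation step can be sketched as a simple sum, assuming the sub-systems fail independently with constant failure rates and that any single sub-system failure counts as a system failure (a series model). That model is an assumption made for this example, and the per-sub-system values below are placeholders, not data for any real server.

```python
def aggregate_fit(subsystem_fits: dict[str, float]) -> float:
    """Sum per-sub-system FIT values under a series (no redundancy) model."""
    for name, fit in subsystem_fits.items():
        if fit < 0:
            raise ValueError(f"FIT for {name!r} must be non-negative")
    return sum(subsystem_fits.values())


# Placeholder FIT values, purely illustrative.
example_server = {
    "interconnects": 20.0,
    "power_feeds": 150.0,
    "memory_subsystem": 300.0,
    "processing_elements": 80.0,
}

if __name__ == "__main__":
    print(aggregate_fit(example_server))
```

Redundant sub-systems (for example, N+1 power feeds) would not add this way; they need a separate model before their contribution is summed with the rest.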

These FIT numbers directly affect the MTBF of the system and therefore drive architecture-level decisions: the SoC design within the socket, board power solutions, memory vendor selection and memory interconnect reliability, and how far the SoC can be scaled through I/O and networking switches for data/model sharding or virtualization. They also give insight into scalability limits and the yield impact on processing elements within the SoC.
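The link between FIT and MTBF can be made concrete with a short sketch. It assumes a constant failure rate, in which case MTBF in hours is 10^9 divided by the FIT value. The function names and the 500 FIT input are hypothetical.

```python
HOURS_PER_YEAR = 8760  # 365 days x 24 hours


def mtbf_hours(fit: float) -> float:
    """MTBF in hours for a constant failure rate expressed in FIT."""
    if fit <= 0:
        raise ValueError("fit must be positive")
    return 1e9 / fit


def mtbf_years(fit: float) -> float:
    """Same MTBF expressed in years of continuous operation."""
    return mtbf_hours(fit) / HOURS_PER_YEAR


if __name__ == "__main__":
    system_fit = 500.0  # hypothetical aggregated system FIT
    print(f"MTBF: {mtbf_hours(system_fit):.0f} hours "
          f"({mtbf_years(system_fit):.1f} years)")
```

Because a fleet of N identical systems sees N times the failure rate, a per-system MTBF that looks comfortable can still mean frequent failures at cluster scale.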

As a simple illustration, consider a FIT calculation for PCIe 6.0, an integral part of modern SoCs. The standard specifies that the link must maintain a raw (pre-FEC) BER of 10^-6 with PAM4 signaling, and it specifies FEC and CRC as the error-correction and error-detection mechanisms for the link. Here is a sample calculation:

The post-FEC (forward error correction) BER in this case should fall below 10^-12. The 64-bit CRC applied after FEC gives an error-detection probability of 1 - 2^-64, so the probability that an error escapes the CRC is 2^-64 ≈ 5.4 x 10^-20. The failure rate can then be calculated as (post-FEC BER) x (probability of a CRC miss) x (bits transmitted per hour) = 10^-12 x 5.4 x 10^-20 x 2.3 x 10^14 ≈ 1.2 x 10^-17 failures/hour, where 2.3 x 10^14 is 64 x 10^9 transfers per second x 3600 seconds at the 64 GT/s rate. Over 10^9 hours this comes to about 1.2 x 10^-8 failures, i.e. a FIT of roughly 1.2 x 10^-8, which is much smaller than 1 FIT, as the standard requires. Note that this calculation is meant to account for errors across multiple lanes, marginal PAM4 signal integrity, and retries triggered by FEC/CRC errors.
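The arithmetic for this link can be checked with a few lines of Python. This is only a sketch of the three-factor product used in the text: the function and parameter names are hypothetical, and it treats 64 GT/s as 64 x 10^9 bits per second on the link, which is the assumption behind the 2.3 x 10^14 figure.

```python
SECONDS_PER_HOUR = 3600


def pcie_link_fit(post_fec_ber: float, crc_bits: int,
                  transfers_per_second: float) -> float:
    """FIT estimate: post-FEC BER x CRC miss probability x bits per hour,
    scaled to 10^9 hours."""
    p_crc_miss = 2.0 ** -crc_bits  # chance a corrupted flit passes the CRC
    bits_per_hour = transfers_per_second * SECONDS_PER_HOUR
    failures_per_hour = post_fec_ber * p_crc_miss * bits_per_hour
    return failures_per_hour * 1e9


if __name__ == "__main__":
    fit = pcie_link_fit(post_fec_ber=1e-12, crc_bits=64,
                        transfers_per_second=64e9)
    print(f"Estimated link FIT: {fit:.2e}")
```

With these inputs the product works out to roughly 1.2 x 10^-17 failures per hour, or about 1.2 x 10^-8 FIT. Changing `post_fec_ber` or `crc_bits` shows how sensitive the result is to each protection layer.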

The above is an example of a FIT calculation for a PCIe 6.0 link at 64 GT/s, but FIT needs to be calculated and aggregated for every integral part and sub-system of an AI system (server or IoT device), based on its configuration, design choices, and architecture, to indicate its reliability for the intended use case. This is best done at a very early stage of development, as soon as those choices have been made.