Understanding CPU and GPU Server Thermals, Evaluation Frameworks, Power/Thermal Stress Generation Through Virus Programs, and Common Thermal Management Algorithms

There are strong cost imperatives for managing and evaluating thermals at the system-on-chip (SoC) level, the server level, and the data center level. In this post, I touch only on the SoC and server levels. I discuss power-feed and thermal interactions, the metrics used in thermal monitoring, the classes of GPU and CPU workloads (or “virus” programs) used to cause thermal events, some tools, and finally the algorithms used in thermal management.

In general, the thermals of a single server are closely tied to the operation and safety of its voltage regulator modules (VRMs) and power supply units (PSUs). Proper thermal control ensures VRMs and PSUs maintain efficiency, stability, and reliability under varying loads. VRMs convert the server PSU output (e.g., 12V) to CPU/GPU core voltages and handle high currents, so their power transistors (such as MOSFETs) generate significant heat. Thermal management systems monitor these components with integrated sensors to prevent excessive junction or case temperatures, activating fans or airflow strategies as needed.

Both VRMs and PSUs are rated for specific safe operating temperature limits (commonly below 70°C for PSU components and up to 100°C for VRM MOSFETs). Exceeding these thresholds because of poor system cooling or excessive power draw reduces efficiency, increases thermal stress, shortens component lifespan, and can trigger hardware throttling or emergency shutdowns. Frequent thermal cycling, such as that caused by rapid load changes, also accelerates aging of dielectric materials and increases the risk of electrical/mechanical faults. In modern servers, temperature data from VRMs and PSUs feeds into mainboard firmware or system management controllers. If overheating is detected, the system may:

Reduce CPU/GPU frequency/voltage via firmware DTM (Dynamic Thermal Management)
Speed up cooling fans
Reallocate or limit power delivery to specific zones/components
Initiate throttling or shut down to avoid electrical failure
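The escalation described above can be sketched as a simple tiered policy. This is only an illustration: the threshold values and the names `Action` and `choose_action` are assumptions made for this example, and real firmware takes its limits from component datasheets and platform configuration.

```python
from enum import Enum


class Action(Enum):
    NONE = "no action"
    FAN_UP = "speed up cooling fans"
    THROTTLE = "reduce frequency/voltage"
    SHUTDOWN = "emergency shutdown"


# Hypothetical thresholds in degrees C, chosen only for illustration.
FAN_UP_C = 80.0
THROTTLE_C = 95.0
SHUTDOWN_C = 110.0


def choose_action(vrm_temp_c: float) -> Action:
    """Map one VRM temperature reading to the most severe applicable response."""
    if vrm_temp_c >= SHUTDOWN_C:
        return Action.SHUTDOWN
    if vrm_temp_c >= THROTTLE_C:
        return Action.THROTTLE
    if vrm_temp_c >= FAN_UP_C:
        return Action.FAN_UP
    return Action.NONE


if __name__ == "__main__":
    for reading in (65.0, 85.0, 98.0, 115.0):
        print(reading, "->", choose_action(reading).value)
```

A real controller would usually combine several sensors and apply the cheaper responses (fans, power reallocation) before throttling.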

Proper thermal design involves low-thermal-resistance paths through heatsinks and thermal pads, optimal airflow (fan placement for VRMs/PSUs), and layout decisions (placing high-current VRMs near active airflow). VRMs often use multiphase buck designs to distribute heat across many components, mitigating local hot spots and improving transient response. Efficient server operation requires coordinated thermal management not just for the compute hardware but also for the external regulators and power supplies.

Common metrics used to validate thermal monitoring:

Peak Junction Temperature	Maximum per-core or package temperature reached
Mean/Variance of Temperature	Average and dispersion of core temperatures; the spread indicates thermal uniformity
Power Consumption	Aggregate and per-zone (if the server is divided into multiple zones for management) energy required under thermal management
Performance Throughput	IPC, FLOPS, or task completion rate under thermal management, ideally normalized to a baseline
Migration/Throttle Overhead	Count/cost of task migrations, frequency changes, and induced latencies
Recovery Time	Time required for system to return to safe temperature post-stress
Robustness	Policy stability and effectiveness in perturbed or unseen workload/cooling scenarios
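A few of the metrics above can be computed directly from a logged temperature trace. The sketch below assumes a hypothetical trace format, a list of (time in seconds, per-core temperatures in °C) samples, and the function name `thermal_metrics` is made up for this example.

```python
from statistics import mean, pvariance


def thermal_metrics(trace, safe_temp_c, stress_end_s):
    """Compute peak, spread, and recovery time from a temperature trace.

    trace: list of (time_s, [core temperatures in C]) sorted by time.
    safe_temp_c: temperature at or below which the system counts as recovered.
    stress_end_s: time at which the stress workload stopped.
    """
    # Sample in which the hottest core reading occurs.
    _, hottest_cores = max(trace, key=lambda sample: max(sample[1]))

    # Recovery time: first sample after the stress ends where every core is safe.
    recovery_s = None
    for time_s, cores in trace:
        if time_s >= stress_end_s and max(cores) <= safe_temp_c:
            recovery_s = time_s - stress_end_s
            break

    return {
        "peak_c": max(hottest_cores),
        "mean_at_peak_c": mean(hottest_cores),
        "variance_at_peak": pvariance(hottest_cores),
        "recovery_s": recovery_s,  # None if the trace never recovers
    }


if __name__ == "__main__":
    example = [
        (0, [40, 41]),
        (10, [80, 90]),
        (20, [85, 95]),   # stress ends here
        (30, [70, 75]),
        (40, [55, 60]),
    ]
    print(thermal_metrics(example, safe_temp_c=60, stress_end_s=20))
```

The variance here is taken across cores at the hottest sample, which is one way to express thermal uniformity. Averaging the spread over the whole run is an equally reasonable choice.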

Programs commonly called “thermal viruses” are used to stress CPU, GPU, or full-system thermals. These are synthetic, multithreaded workloads specifically engineered to generate maximum heat and power draw. In general, each company writes its own low-level programs for its architecture that maximally exercise the integer and floating-point compute pipelines, stress the primary caches (instruction and data) and the MMU subsystem, and maximize request throughput to DDR and HBM memory. These are usually separate tight-loop programs for each subsystem, written with a deep understanding of the micro-architecture. Here, however, I will mention open-source and publicly available programs that produce a thermal-virus-like effect.

Thermal virus programs written for GPU-intensive servers have the following traits:

They stress many cores/threads, SIMD vector units, and dedicated memory controllers. They are highly parallel and use graphics APIs (e.g., OpenGL, DirectX, Vulkan), massively multithreaded kernel dispatch, and heavy floating-point and texture operations.
They primarily stress the GPU core, VRAM, and often the VRMs on the GPU board, pushing active cooling solutions like fans and heat pipes. They cause constant, high, sometimes spiking power loads because rendering/compute threads run continuously across all GPU modules.
They frequently test the GPU core, Tensor Cores, and video/local memory bandwidth together, and temperatures may escalate rapidly if fan curves are weak.
Heat is distributed across the GPU die (shader/compute units), the VRAM chips, and the VRMs on the graphics card; newer GPUs also have tensor cores for ML/AI tasks.
GPU tests heat multiple points on a discrete PCB (core, memory, VRMs), sometimes exposing cooling deficiencies in fans or heatsinks.

Common Thermal virus programs for GPU intensive servers:

FurMark: Creates extreme, continuous graphical workloads (such as heavy OpenGL donut rendering) to push GPUs to their thermal and power limits. It can force even overclocked GPUs to throttle.
MSI Kombustor: Tailored for stress-testing and benchmarking GPUs and compatible with most vendors. It delivers varied graphical and compute stress patterns and is fully multithreaded.
OCCT (OverClock Checking Tool): Its GPU test modules, including memory stress tests, put sustained, randomized loads across all GPU cores and memory, which makes it well suited to validating thermal throttling mechanisms and RAM/VRAM stability.
3DMark-xx: Includes specialized stress and stability tests (like Time Spy Stress, Port Royal DLSS/DLSS 3.5) that use multithreaded, visually demanding routines to evaluate GPU temperature management, sustained clock speeds, and cooling performance under worst-case loads. 3DMark’s DLSS and Frame Generation benchmarks exercise the tensor core blocks for sustained periods.
Unigine Superposition: An advanced stress test that supports DirectX and OpenGL and is considered among the most intense for modern GPUs.
PassMark PerformanceTest: Includes multithreaded 2D/3D GPU benchmarks for thermal and power evaluation.

Some well-known tools for GPU thermal stress testing and monitoring are:

gpu-burn, which runs on Linux using CUDA
glmark2, which also runs on Linux, using OpenGL
nvidia-smi, a good tool for real-time monitoring of NVIDIA GPUs
gpustat, also used to monitor GPU parameters on NVIDIA GPUs
nvtop, another tool used to monitor GPU usage on NVIDIA GPUs
radeontop, which monitors internal GPU parameters on AMD GPUs
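As a small example of using one of these monitors from a script, the sketch below polls nvidia-smi through its CSV query interface and parses temperature, power draw, and SM clock. The helper names (`parse_gpu_line`, `read_gpus`) are made up for this example, and it assumes nvidia-smi is on the PATH. Some GPUs report `[N/A]` for power draw, so non-numeric fields are mapped to `None`.

```python
import subprocess

# Fields requested from nvidia-smi, in the order they are parsed below.
QUERY_FIELDS = "temperature.gpu,power.draw,clocks.sm"


def _to_float(text):
    """Return a float, or None for values such as '[N/A]'."""
    try:
        return float(text)
    except ValueError:
        return None


def parse_gpu_line(line):
    """Parse one CSV line such as '67, 215.43, 1695' into a dict."""
    temp_c, power_w, sm_mhz = (field.strip() for field in line.split(","))
    return {
        "temp_c": _to_float(temp_c),
        "power_w": _to_float(power_w),
        "sm_clock_mhz": _to_float(sm_mhz),
    }


def read_gpus():
    """Query every visible NVIDIA GPU once; requires nvidia-smi to be installed."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY_FIELDS}", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    )
    return [parse_gpu_line(line) for line in result.stdout.splitlines() if line.strip()]
```

Calling `read_gpus()` in a loop while gpu-burn or FurMark runs gives a simple temperature and clock log from which throttling events can be spotted.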

In addition, at the silicon development level (e.g., pre-silicon), a further level of effort is generally needed to model thermals and power. A full discussion is beyond the scope of this post, but I will cover it in the future. In such cases, dynamic capacitance can be modeled sub-unit by sub-unit, and the sub-unit estimates are combined into an estimate of the dynamic capacitance of the complete system on chip, such as a GPU or CPU. The sub-unit dynamic capacitance is derived from traces of known power virus programs, like those discussed above, run at the pre-silicon level to cause activity. These programs generate vectors that are injected into every sub-unit of the SoC using commercial simulators, producing thermal/power simulation results on each sub-unit's design code, frame by frame for applications that are divided into frames (such as games). This helps derive estimates for the entire silicon, whether a GPU or another complex system on chip. Dynamic capacitance modeling is a very detailed, low-level task and cannot be covered here, but every major GPU and CPU design has an nF (nanofarad) number associated with it, derived as described, and it is very representative.
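To make the role of the nF number concrete, here is a minimal sketch using the standard relation for dynamic power, P = Cdyn x V^2 x f, where Cdyn is the effective (activity-weighted) switched capacitance. The sub-unit names and values are invented for illustration, and the sketch assumes all sub-units share one voltage and clock domain, which real SoCs generally do not.

```python
# Hypothetical per-sub-unit effective dynamic capacitance, in nanofarads.
SUBUNIT_CDYN_NF = {
    "shader_array": 20.0,
    "l2_cache": 8.0,
    "memory_controller": 5.0,
}


def total_cdyn_nf(subunits):
    """Combine sub-unit estimates into one chip-level Cdyn (simple sum)."""
    return sum(subunits.values())


def dynamic_power_w(cdyn_nf, volts, freq_ghz):
    """P = Cdyn * V^2 * f. With Cdyn in nF and f in GHz the result is in watts."""
    return cdyn_nf * volts ** 2 * freq_ghz


if __name__ == "__main__":
    cdyn = total_cdyn_nf(SUBUNIT_CDYN_NF)
    power = dynamic_power_w(cdyn, volts=0.8, freq_ghz=1.5)
    print(f"Cdyn = {cdyn:.1f} nF, dynamic power = {power:.2f} W")
```

Because power scales with the square of voltage, a chip-level Cdyn measured under a power virus gives a quick upper-bound estimate of dynamic power at any voltage/frequency operating point.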

Along the same lines, thermal virus programs written for CPU-intensive operations have the following traits:

They target the CPU silicon, on-die caches and translation lookaside buffers, the memory controller, and the system VRMs on the motherboard.
They stress central processing units, which have fewer, wider out-of-order scalar and vector cores with shared caches and DRAM buses.
They are intensive arithmetic workloads (integer, floating point) that use AVX/AVX2, Small FFTs, or Linpack, and are maximally parallel via multithreading.
They occasionally target the L1/L2/L3 caches and DRAM, but less often than the raw compute units inside the CPU.
The load is sustained but can have sharp transients if AVX/FMA or similarly heavy SIMD workloads are used.
Heat is concentrated on the CPU die and core clusters, with conduction to the motherboard VRMs.

Common Thermal virus programs for CPU intensive processing:

Prime95 (with AVX/Small FFTs): A classic “power virus” for CPUs. Its Small FFT test is fully multithreaded and uses modern instruction sets (AVX/AVX2/AVX512), generating extremely high heat and sustained power consumption on all cores simultaneously.
AIDA64 Stress FPU: Provides a multithreaded, FPU-only stress test mode that heats all available CPU cores to their limits and fully uses the multithreading capacity of modern CPUs.
stress-ng: A Linux utility capable of stressing the CPU, virtual memory, cache, and more in a highly multithreaded manner, allowing per-core and cross-core workload generation.
OCCT: Offers “Power Supply” and “CPU Linpack” tests, both of which use as many threads as are available and are specifically tuned to produce maximum thermal stress.

Key Properties of Power Virus/Stress Tools used in CPU intensive workloads:

Fully utilize all available logical cores and threads.
Often leverage modern SIMD and vector instruction sets (SSE, AVX, FMA).
Are called “power viruses” because their power and thermal load far exceed normal application patterns.
Widely used in system manufacturing, benchmarking, thermal validation, and overclocking communities.
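The “use every logical core” property can be illustrated with a toy load generator. This is only a stand-in: a pure-Python loop does not use SIMD and will not approach the heat of the hand-tuned AVX/FMA kernels in the tools above. The function names are invented for this sketch.

```python
import math
import multiprocessing
import os
import time


def burn_worker(duration_s):
    """Spin on floating-point math for duration_s seconds; return iterations done."""
    deadline = time.perf_counter() + duration_s
    x = 1.0001
    iterations = 0
    while time.perf_counter() < deadline:
        for _ in range(1000):
            x = math.sqrt(x * x + 1.0)  # keeps the floating-point unit busy
        iterations += 1000
    return iterations


def burn_all_cores(duration_s):
    """Run one worker process per logical core, all for the same duration."""
    workers = os.cpu_count() or 1
    with multiprocessing.Pool(workers) as pool:
        return pool.map(burn_worker, [duration_s] * workers)
```

To try it, call `burn_all_cores(30.0)` from under an `if __name__ == "__main__":` guard (needed on platforms that spawn worker processes) and watch per-core temperatures with a monitoring tool while it runs.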

Finally, before leaving this topic, it is worth discussing device-level thermals, specifically the relationship between thermal resistance, case temperature, and junction temperature for system-on-chip devices:

T-junction = T-case + R-theta (junction to case) x P, where R-theta (junction to case) is the junction-to-case thermal resistance in °C/W and P is the power dissipated in watts. These three parameters define how effectively heat flows from the silicon die to the environment, determining when the system must throttle performance to prevent overheating. Junction temperature is the temperature of the active region where semiconductor activity occurs. Exceeding the rated T-jmax leads to performance degradation or permanent damage from phenomena such as electromigration or thermal runaway. Devices are designed with throttling thresholds below T-jmax to protect silicon reliability. Case temperature is the measurable surface temperature of the package on a server CPU or GPU, such as at the IHS (Integrated Heat Spreader). Because T-junction usually cannot be measured directly, T-case is used with R-theta (junction to case) to estimate T-junction.
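As a worked example of the relation above, the sketch below estimates junction temperature from a measured case temperature and then inverts the formula to find how much power fits under a throttling limit. The numeric values and function names are illustrative assumptions, not figures from any datasheet.

```python
def junction_temp_c(case_temp_c, r_jc_c_per_w, power_w):
    """T-junction = T-case + R-theta(JC) * P."""
    return case_temp_c + r_jc_c_per_w * power_w


def max_power_w(tj_limit_c, case_temp_c, r_jc_c_per_w):
    """Largest power that keeps T-junction at or below tj_limit_c."""
    return (tj_limit_c - case_temp_c) / r_jc_c_per_w


if __name__ == "__main__":
    # Hypothetical part: 70 C at the heat spreader, 0.2 C/W junction to case, 150 W.
    tj = junction_temp_c(case_temp_c=70.0, r_jc_c_per_w=0.2, power_w=150.0)
    print(f"Estimated T-junction: {tj:.1f} C")

    # Power headroom if throttling begins at a 95 C junction temperature.
    budget = max_power_w(tj_limit_c=95.0, case_temp_c=70.0, r_jc_c_per_w=0.2)
    print(f"Power budget before throttling: {budget:.1f} W")
```

With these numbers the estimated junction temperature is about 100 °C, which already exceeds a 95 °C throttle point, so the part would have to drop to roughly 125 W.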

Lastly, the thermal gradient between T-junction and T-case reflects the internal conduction efficiency of a system on chip such as a CPU or GPU, and a low R-theta (junction to case) is preferable for efficient packaging. Thermal resistance quantifies how easily heat can move through each layer of the thermal path. In general, for a given part, total thermal resistance = R-theta (junction to case: a property of the die and package) + R-theta (case to sink: measures the effectiveness of the thermal grease) + R-theta (sink to ambient: measures the effectiveness of the cooling system).

Finally, there are many thermal management frameworks in use today. Modern thermal throttling algorithms in servers and system-on-chip (SoC) platforms integrate AI-driven dynamic thermal management (DTM), real-time sensor feedback, and reinforcement learning-based power scheduling to prevent overheating while maximizing performance. Thermal throttling governs the balance between power, temperature, and performance, and it is closely tied to device-level thermal resistance and material properties.
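Returning to the series thermal-resistance sum described above, the sketch below walks the heat path from ambient air back to the junction and reports the temperature at each interface, which shows which layer contributes most of the rise. All resistance values are hypothetical.

```python
def layer_temperatures(ambient_c, power_w, r_jc, r_cs, r_sa):
    """Temperatures along a series thermal path, resistances in C/W.

    r_jc: junction to case, r_cs: case to sink, r_sa: sink to ambient.
    """
    sink_c = ambient_c + power_w * r_sa
    case_c = sink_c + power_w * r_cs
    junction_c = case_c + power_w * r_jc
    return {"sink_c": sink_c, "case_c": case_c, "junction_c": junction_c}


if __name__ == "__main__":
    temps = layer_temperatures(ambient_c=35.0, power_w=200.0,
                               r_jc=0.10, r_cs=0.05, r_sa=0.15)
    for name, value in temps.items():
        print(f"{name}: {value:.1f}")
```

In this example the sink-to-ambient layer accounts for half of the 60 °C rise, so a better cooler (or liquid cooling) would help more than a better thermal interface material.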

Servers today employ AI-augmented thermal optimization that dynamically modulates CPU/GPU frequencies and cooling fan speeds based on predictive models. Reinforcement learning algorithms learn optimal trade-offs between temperature thresholds and workload scheduling, leading to energy efficiency improvements of up to ~40% over classical fixed-threshold methods. In advanced data centers, these models are coupled with sensor fusion systems that measure per-core temperatures and environmental gradients to trigger throttling or reallocation of workloads dynamically. In SoCs and 3D multi-processor systems (3D MPSoCs), DTM frameworks integrate with embedded AI microcontrollers to monitor temperature maps at sub-unit resolution (cores, interconnects, SRAM blocks). Such systems combine proactive DTM (predicting imminent hotspots and redistributing workloads) with reactive DTM (reducing voltage/frequency immediately after a threshold breach). This approach maintains reliability while preventing thermal runaway.

The common thermal management algorithms listed below each involve far more detail than I can give here, and they apply to both system-on-chip and server system design. Since a full treatment is beyond the scope of this post, I have only named the mechanisms to highlight their importance:

Dynamic Voltage and Frequency Scaling (DVFS)
Advanced Sensor Feedback Loop
Thermal-Aware Scheduling
Clock Gating and Fetch Gating with varying granularity (coarse and fine)
Fan and Cooling Control Algorithms
Machine Learning/Reinforcement Learning Dynamic Thermal Management
Thermal Prediction and Power Budgeting
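To give a feel for the first mechanism in the list, here is a minimal sketch of reactive DVFS with hysteresis, run against a first-order (single RC) thermal model. Everything numeric here is an assumption chosen for illustration: the operating points, the 80 nF Cdyn, the thermal resistance and capacitance, and the thresholds. Real governors also account for leakage power, multiple sensors, and fan control.

```python
# Hypothetical operating points: (frequency in GHz, voltage in V), slowest first.
LEVELS = [(1.2, 0.70), (1.8, 0.85), (2.4, 1.00)]


def dynamic_power_w(freq_ghz, volts, cdyn_nf=80.0):
    """P = Cdyn * V^2 * f, with Cdyn in nF and f in GHz giving watts."""
    return cdyn_nf * volts ** 2 * freq_ghz


def simulate(steps=300, dt_s=1.0, ambient_c=35.0, r_th=0.3, c_th=50.0,
             throttle_c=85.0, hysteresis_c=10.0):
    """Return a list of (temperature in C, DVFS level index) per time step."""
    level = len(LEVELS) - 1      # start at the fastest operating point
    temp_c = ambient_c
    history = []
    for _ in range(steps):
        # Reactive control: step down above the threshold, and step back up
        # only after cooling well below it, to avoid rapid toggling.
        if temp_c > throttle_c and level > 0:
            level -= 1
        elif temp_c < throttle_c - hysteresis_c and level < len(LEVELS) - 1:
            level += 1

        freq_ghz, volts = LEVELS[level]
        power_w = dynamic_power_w(freq_ghz, volts)

        # First-order thermal model: heat in minus heat removed to ambient.
        temp_c += dt_s / c_th * (power_w - (temp_c - ambient_c) / r_th)
        history.append((temp_c, level))
    return history


if __name__ == "__main__":
    run = simulate()
    print(f"peak temperature: {max(t for t, _ in run):.1f} C")
    print(f"lowest level used: {min(l for _, l in run)}")
```

Without the controller, the top operating point would settle near 93 °C in this model. With it, the temperature cycles between roughly 75 °C and 85 °C as the governor alternates between the top two levels. A proactive scheme would act on a predicted temperature instead of the measured one.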

I will close by mentioning some recent methods for reducing thermal throttling:

Liquid and immersion cooling: Direct-to-chip and two-phase immersion systems significantly lower the effective R-theta (thermal resistance), which reduces throttling.
Digital twins and AI cooling orchestration: Real-time simulation predicts local hotspot evolution, reducing unnecessary throttling.
Reinforcement learning-based DTM: Adaptive schedulers jointly optimize workload placement and cooling efficiency, as demonstrated in multi-agent control frameworks for data centers.
3D-stacked SoC design: Increased layer density and power packaging make R-theta more nonlinear, so predictive DTM now integrates the effects of package-embedded microfluidic cooling.

In subsequent posts, I will discuss thermal and power management algorithms in detail, along with their effect on device and system thermals.