Undervolting Your GPU for Local Inference: Lower Heat, Same Tokens/sec

TL;DR

Thorsten Meyer AI published a guide arguing that GPU power limits and undervolting can reduce heat and fan noise in local AI workstations while preserving most of the tokens-per-second rate. The guide cites sustained RTX 4090 data in which a 70% power limit (300 watts) kept 93.4% of baseline speed while drawing 90 watts less than stock, though results vary by card, model and workload.

Thorsten Meyer AI has published a GPU undervolting and power-limiting guide for local AI inference, arguing that users can cut workstation heat and fan noise with little loss in tokens per second because many LLM workloads are constrained by memory bandwidth rather than GPU core compute.

The confirmed development is the publication of a guide and interactive infographic on undervolting GPUs for local inference. It presents power limiting as the first recommended step before buying a cooler, changing a case or rearranging fans. The guide says the method costs nothing and can be tested with common tools such as MSI Afterburner on Windows or nvidia-smi and LACT on Linux.

The source states that a sustained RTX 4090 workload at stock settings drew 390 watts, ran at 72 degrees Celsius and delivered 100% of the measured baseline speed. At a 70% power limit, the same table reports 300 watts, 67 degrees Celsius and 93.4% of the baseline speed, or 90 watts less heat output for a 6.6% speed drop. At 80%, it reports 330 watts and 98.6% of speed; at 55%, it reports 240 watts and 89.2% of speed.
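
The trade-off in those figures can be checked with a few lines of arithmetic. The sketch below uses only the four operating points quoted above; the function name and the "speed per watt relative to stock" metric are illustrative choices, not something the guide defines.

```python
# Operating points as quoted from the guide's RTX 4090 table:
# label -> (board power in watts, speed as % of the stock baseline)
POINTS = {
    "stock": (390, 100.0),
    "80%": (330, 98.6),
    "70%": (300, 93.4),
    "55%": (240, 89.2),
}


def summarize(points, baseline="stock"):
    """Return watts saved, speed lost and relative speed per watt for each point."""
    base_watts, base_speed = points[baseline]
    rows = {}
    for label, (watts, speed) in points.items():
        rows[label] = {
            "watts_saved": base_watts - watts,
            "speed_lost_pct": round(base_speed - speed, 1),
            # Speed per watt, normalised so the baseline equals 1.0.
            "efficiency_vs_stock": (speed / watts) / (base_speed / base_watts),
        }
    return rows


if __name__ == "__main__":
    for label, row in summarize(POINTS).items():
        print(label, row)
```

On these numbers, the 70% point delivers roughly 1.21 times the stock speed per watt, which is the sense in which the cap trades a small speed loss for a larger cut in heat.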

The guide distinguishes between power limiting and direct undervolting. Power limiting is described as the safer starting point because users reduce the GPU’s allowed power draw and let the card manage voltage and clocks. Direct undervolting is presented as a more advanced method that edits the voltage-frequency curve, with the guide naming about 0.9 to 0.95 volts as a starting range for testing.
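
On Linux, a power cap of this kind is typically set with nvidia-smi's `-pl` option, which takes a limit in watts. The sketch below builds that command and defaults to a dry run. The 300-watt value mirrors the guide's 70% example; the function names and the dry-run flag are hypothetical, and applying the limit for real generally needs root privileges and a value inside the range the card's driver allows.

```python
import shlex
import subprocess


def power_limit_command(watts, gpu_index=0):
    """Build the nvidia-smi invocation that caps one GPU's power draw."""
    if watts <= 0:
        raise ValueError("power limit must be a positive number of watts")
    # -i selects the GPU by index, -pl sets the power limit in watts.
    return ["nvidia-smi", "-i", str(gpu_index), "-pl", str(int(watts))]


def apply_power_limit(watts, gpu_index=0, dry_run=True):
    """Print the command; run it only when dry_run is False (usually needs root)."""
    cmd = power_limit_command(watts, gpu_index)
    print(shlex.join(cmd))
    if not dry_run:
        subprocess.run(cmd, check=True)
    return cmd


if __name__ == "__main__":
    apply_power_limit(300)  # dry run: prints the command without changing anything
```

Keeping the dry run as the default makes it easy to review the exact command before changing the card's settings.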

Why It Matters

The development matters for readers running local LLMs because heat, noise and power draw are practical limits for home and office AI workstations. A high-power GPU can make a system louder, warmer and more expensive to run, even when additional core clock speed adds little to inference throughput.

If the guide’s reported pattern holds for a reader’s workload, a power cap could delay or avoid spending on cooling upgrades. It also gives builders a measurable first step: change one setting, run the real model workload, then compare power draw, temperature and tokens per second.

Background

The guide is part of a broader Thorsten Meyer AI series on reducing heat and noise in high-power AI workstations. It frames undervolting and power limiting as the first of five levers because they can be applied in software before hardware changes.

The technical argument rests on a distinction between gaming and local inference. The guide says gaming is often more sensitive to GPU core speed, while many local LLM inference workloads spend much of their time waiting on VRAM bandwidth. That is the guide’s claim rather than an independently verified result, and the guide itself says results vary by model, quantization, GPU and workload.
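
One rough way to see why a bandwidth-bound workload is insensitive to core clocks: if generating each token requires streaming the model's weights from VRAM once, throughput cannot exceed memory bandwidth divided by model size. The sketch below encodes that rule of thumb. It is a simplification that does not come from the guide, and the bandwidth and model-size figures are assumed values chosen for illustration.

```python
def decode_ceiling_tokens_per_sec(bandwidth_gb_s, weights_gb):
    """Upper bound on single-stream decode speed if each token reads all weights once."""
    if bandwidth_gb_s <= 0 or weights_gb <= 0:
        raise ValueError("bandwidth and model size must be positive")
    return bandwidth_gb_s / weights_gb


if __name__ == "__main__":
    # Assumed, illustrative inputs: about 1000 GB/s of VRAM bandwidth
    # and a quantized model that occupies 20 GB.
    print(decode_ceiling_tokens_per_sec(1000, 20))
```

Core clock speed does not appear in the bound, which is the intuition behind the guide's argument. Real throughput sits below this ceiling and depends on batch size, context length and software efficiency.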

The source includes a disclosure that the page contains affiliate links and tells readers to confirm prices and specifications before buying gear. It also states that undervolting and power limits are reversible and widely used, but that users make changes at their own risk.

“Local inference is memory-bound.”

— Thorsten Meyer AI guide

“Power limiting moves one slider and can’t damage anything.”

— Thorsten Meyer AI guide

“Test under your real workload.”

— Thorsten Meyer AI guide

What Remains Unclear

It is not yet clear how closely the reported RTX 4090 results will carry over to other local inference setups. The guide itself says figures vary by card, model and workload, and that an undervolt that is stable in a short test can fail later in a longer run. The guide also references the RTX 5090 and other cards, but its most detailed table is the sustained RTX 4090 example.

What’s Next

Readers who use the guide are told to start with a power cap around the 60% to 80% band, run their actual inference workload and measure temperature, held clocks, power draw and tokens per second. On Linux, the guide says the cap may need a systemd service or other startup method because settings can reset after reboot.
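
A oneshot systemd unit is one common way to reapply a cap at boot. The sketch below renders such a unit as text so it can be reviewed before being installed. The binary path, the 300-watt value and the suggested file name are assumptions for illustration; the guide does not specify them.

```python
UNIT_TEMPLATE = """[Unit]
Description=Apply GPU power limit at boot

[Service]
Type=oneshot
ExecStart={nvidia_smi} -i {gpu_index} -pl {watts}

[Install]
WantedBy=multi-user.target
"""


def render_unit(watts, gpu_index=0, nvidia_smi="/usr/bin/nvidia-smi"):
    """Return the text of a oneshot unit that reapplies the power cap."""
    # The default nvidia-smi path is an assumption; check it on the target system.
    return UNIT_TEMPLATE.format(
        nvidia_smi=nvidia_smi, gpu_index=gpu_index, watts=int(watts)
    )


if __name__ == "__main__":
    # Hypothetical destination: /etc/systemd/system/gpu-power-limit.service
    print(render_unit(300))
```

Saved under a name such as the hypothetical gpu-power-limit.service, the unit would then be enabled with systemctl so the cap is set on every boot.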

Key Questions

What happened?

Thorsten Meyer AI published a guide and interactive infographic recommending GPU power limiting and undervolting as an early step for reducing heat and noise in local AI inference workstations.

Does the guide prove tokens per second will stay the same?

No. It reports measured and published examples, including an RTX 4090 table, but says results vary by GPU, model, quantization and workload.

Where should readers start?

The guide recommends starting with a simple power limit, such as 70%, before editing the voltage-frequency curve directly.

What is the main risk?

Power limiting is framed as low-risk because it only lowers the power the GPU is allowed to draw and leaves voltage and clock management to the card. Direct undervolting can cause instability, so the guide says users should test with real, long-running workloads.

What should readers measure?

The guide says to measure actual tokens per second, GPU temperature, power draw and held clock speed rather than relying on a short benchmark.
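
Those measurements can be collected with nvidia-smi's CSV query mode while the real workload runs. The sketch below parses one sample and summarizes a run. The chosen query fields, the function names and the use of the minimum sampled clock as a stand-in for "held clock" are assumptions, not the guide's method; some cards report "[N/A]" for power draw, which this parser does not handle.

```python
import statistics
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-gpu=temperature.gpu,power.draw,clocks.sm",
    "--format=csv,noheader,nounits",
]


def parse_sample(line):
    """Parse one CSV line into (temperature in C, power in W, SM clock in MHz)."""
    temp, power, clock = (float(field) for field in line.strip().split(","))
    return temp, power, clock


def read_sample(gpu_index=0):
    """Take one live reading; requires an NVIDIA driver with nvidia-smi on PATH."""
    out = subprocess.run(
        QUERY + ["-i", str(gpu_index)], capture_output=True, text=True, check=True
    )
    return parse_sample(out.stdout.splitlines()[0])


def summarize_run(samples, tokens_generated, elapsed_seconds):
    """Combine telemetry samples with the workload's own token count and duration."""
    temps, powers, clocks = zip(*samples)
    return {
        "tokens_per_sec": tokens_generated / elapsed_seconds,
        "mean_power_w": statistics.mean(powers),
        "max_temp_c": max(temps),
        "min_clock_mhz": min(clocks),  # lowest sampled clock, a proxy for held clock
    }
```

Calling read_sample every few seconds during a long generation, then passing the collected samples to summarize_run for both the stock and the capped configuration, gives a like-for-like comparison.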

Source: Thorsten Meyer AI

Author

The Idea Magazine Team
