TL;DR
Thorsten Meyer AI published a guide arguing that GPU power limits and undervolting can reduce heat and fan noise in local AI workstations while preserving much of the tokens-per-second rate. The article cites sustained RTX 4090 data in which a 70% power limit drew 300 watts and kept 93.4% of stock speed, 90 watts less than the 390-watt stock draw, though results vary by card, model and workload.
Thorsten Meyer AI has published a GPU undervolting and power-limiting guide for local AI inference, arguing that users can cut workstation heat and fan noise with little loss in tokens per second because many LLM workloads are constrained by memory bandwidth rather than GPU core compute.
The confirmed development is the publication of a guide and interactive infographic on undervolting GPUs for local inference. It presents power limiting as the first step to try before buying a cooler, changing a case or rearranging fans. The guide says the method costs nothing and can be tested with common tools such as MSI Afterburner on Windows or nvidia-smi and LACT on Linux.
The source states that a sustained RTX 4090 workload at stock settings drew 390 watts, ran at 72 degrees Celsius and delivered 100% of the measured baseline speed. At a 70% power limit, the same table reports 300 watts, 67 degrees Celsius and 93.4% of the baseline speed, or 90 watts less heat output for a 6.6% speed drop. At 80%, it reports 330 watts and 98.6% of speed; at 55%, it reports 240 watts and 89.2% of speed.
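The arithmetic behind those figures can be checked with a short sketch. The rows below are copied from the table as quoted in this article; the `summarize` helper and the "speed per watt relative to stock" metric are illustrative additions, not something the guide itself defines.

```python
# Rows as quoted above: (power limit label, watts drawn, % of baseline speed).
ROWS = [
    ("stock", 390, 100.0),
    ("80%", 330, 98.6),
    ("70%", 300, 93.4),
    ("55%", 240, 89.2),
]


def summarize(rows):
    """Return (label, watts saved, % speed lost, relative speed-per-watt) per row.

    The first row is treated as the stock baseline.
    """
    _, base_watts, base_speed = rows[0]
    summary = []
    for label, watts, speed in rows:
        watts_saved = base_watts - watts
        speed_lost = round(base_speed - speed, 1)
        # Speed per watt, normalized so that the stock row equals 1.0.
        relative_efficiency = (speed / watts) / (base_speed / base_watts)
        summary.append((label, watts_saved, speed_lost, round(relative_efficiency, 2)))
    return summary


if __name__ == "__main__":
    for label, saved, lost, eff in summarize(ROWS):
        print(f"{label:>5}: {saved:3d} W saved, {lost:4.1f}% slower, {eff:.2f}x speed per watt")
```

For the 70% row this reproduces the 90-watt saving and 6.6% speed drop stated above. The same helper can be fed a reader's own measurements in place of the quoted rows.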
The guide distinguishes between power limiting and direct undervolting. Power limiting is described as the safer starting point because users reduce the GPU’s allowed power draw and let the card manage voltage and clocks. Direct undervolting is presented as a more advanced method that edits the voltage-frequency curve, with the guide naming about 0.9 to 0.95 volts as a starting range for testing.
Why It Matters
The development matters for readers running local LLMs because heat, noise and power draw are practical limits for home and office AI workstations. A high-power GPU can make a system louder, warmer and more expensive to run, even when additional core clock speed adds little to inference throughput.
If the guide’s reported pattern holds for a reader’s workload, a power cap could delay or avoid spending on cooling upgrades. It also gives builders a measurable first step: change one setting, run the real model workload, then compare power draw, temperature and tokens per second.

Background
The guide is part of a broader Thorsten Meyer AI series on reducing heat and noise in high-power AI workstations. It frames undervolting and power limiting as the first of five levers because they can be applied in software before hardware changes.
The technical argument rests on a distinction between gaming and local inference. The guide says gaming is often more sensitive to GPU core speed, while many local LLM inference workloads spend much of their time waiting on VRAM bandwidth. That claim is the guide’s own, and the guide also says results vary by model, quantization, GPU and workload.
The source includes a disclosure that the page contains affiliate links and tells readers to confirm prices and specifications before buying gear. It also states that undervolting and power limits are reversible and widely used, but that users make changes at their own risk.
“Local inference is memory-bound.”
— Thorsten Meyer AI guide
“Power limiting moves one slider and can’t damage anything.”
— Thorsten Meyer AI guide
“Test under your real workload.”
— Thorsten Meyer AI guide

What Remains Unclear
It is not yet clear how closely the reported RTX 4090 results will match any given local inference setup. The guide itself says figures vary by card, model and workload, and that an undervolt that is stable in a short test can fail later in a longer run. The guide also references the RTX 5090 and other cards, but its most specific table is the sustained RTX 4090 example.

What’s Next
Readers who use the guide are told to start with a power cap around the 60% to 80% band, run their actual inference workload and measure temperature, held clocks, power draw and tokens per second. On Linux, the guide says the cap may need a systemd service or other startup method because settings can reset after reboot.
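As a rough illustration of the Linux path, the sketch below builds the nvidia-smi commands for setting a cap and reading back power, temperature and SM clock. It assumes a single NVIDIA GPU at index 0 and that nvidia-smi reports plain numbers for every queried field; the helper names and the 450-watt default limit in the usage note are hypothetical, and this is not the guide's own script.

```python
import subprocess

# nvidia-smi query fields: power draw (W), temperature (C), SM clock (MHz), current cap (W).
QUERY_FIELDS = ["power.draw", "temperature.gpu", "clocks.sm", "power.limit"]


def watts_from_percent(percent, default_limit_watts):
    """Convert a percentage cap into watts; nvidia-smi takes watts, not percent."""
    return round(default_limit_watts * percent / 100)


def power_limit_command(watts, gpu_index=0):
    """Build the nvidia-smi command that sets the power cap (usually needs root)."""
    return ["nvidia-smi", "-i", str(gpu_index), "-pl", str(int(watts))]


def query_command(gpu_index=0):
    """Build the nvidia-smi command that prints QUERY_FIELDS as one CSV line."""
    return [
        "nvidia-smi", "-i", str(gpu_index),
        "--query-gpu=" + ",".join(QUERY_FIELDS),
        "--format=csv,noheader,nounits",
    ]


def parse_query(line):
    """Parse one CSV line of query output into a dict of floats.

    Assumes every field is numeric; a card that reports [N/A] would raise here.
    """
    values = [float(v.strip()) for v in line.strip().split(",")]
    return dict(zip(QUERY_FIELDS, values))


def apply_and_read(watts, gpu_index=0):
    """Set the cap, then return the current readings. Requires a real GPU."""
    subprocess.run(power_limit_command(watts, gpu_index), check=True)
    result = subprocess.run(
        query_command(gpu_index), check=True, capture_output=True, text=True
    )
    return parse_query(result.stdout.splitlines()[0])
```

With a hypothetical 450-watt default limit, `apply_and_read(watts_from_percent(70, 450))` would request a 315-watt cap. Because the setting can reset after reboot, a call like this is the kind of thing a startup unit would need to run again at boot.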

Key Questions
What happened?
Thorsten Meyer AI published a guide and interactive infographic recommending GPU power limiting and undervolting as an early step for reducing heat and noise in local AI inference workstations.
Does the guide prove tokens per second will stay the same?
No. It reports measured and published examples, including an RTX 4090 table, but says results vary by GPU, model, quantization and workload.
What setting does the guide recommend first?
The guide recommends starting with a simple power limit, such as 70%, before editing the voltage-frequency curve directly.
What is the main risk?
Power limiting is framed as low-risk because it only lowers the GPU’s allowed power draw and leaves voltage and clock management to the card, but direct undervolting can cause instability. The guide says users should test with real long-running workloads.
What should readers measure?
The guide says to measure actual tokens per second, GPU temperature, power draw and held clock speed rather than relying on a short benchmark.
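One way to measure sustained throughput rather than a short burst is sketched below. The `generate` callable is hypothetical: it stands in for whatever inference call a reader's stack exposes and is assumed to return the number of tokens it produced. The ten-minute default duration is also an assumption, not a figure from the guide.

```python
import time


def sustained_tokens_per_second(generate, prompt, min_seconds=600.0,
                                clock=time.perf_counter):
    """Call generate(prompt) repeatedly for at least min_seconds.

    Returns total tokens divided by total elapsed time, so early bursts
    before the GPU warms up do not dominate the result.
    """
    total_tokens = 0
    start = clock()
    while True:
        total_tokens += generate(prompt)  # hypothetical: returns a token count
        elapsed = clock() - start
        if elapsed >= min_seconds and elapsed > 0:
            return total_tokens / elapsed
```

Running this once at stock settings and once under a power cap, with the same model and prompt, gives the two tokens-per-second figures to compare alongside temperature, power draw and held clock speed.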
Source: Thorsten Meyer AI