TL;DR
A developer is working to accelerate matrix multiplication in Swift for training large language models on Apple Silicon, aiming to push performance from gigaflops to teraflops through low-level optimization and fuller hardware utilization. Testing and refinement are ongoing, and early results suggest a viable path toward teraflop-level throughput.
The developer rewrote Andrej Karpathy’s llm.c in Swift, focusing on low-level performance improvements to accelerate training iterations. Initial implementations were slow, but through targeted optimization—such as leveraging SIMD, AMX, and GPU capabilities—the developer aims to increase the computational throughput from gigaflops to teraflops. The core challenge involves maximizing hardware utilization for matrix multiplication, which is the most computationally intensive component of neural network training. The developer is testing different approaches, including multi-threading and hardware-specific features, to achieve these performance goals.
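To make the scale of the challenge concrete, here is a minimal naive matrix multiply in Swift. This is an illustrative sketch, not the developer's actual code: the triple loop performs 2·m·n·k floating-point operations, and it is the baseline that SIMD, AMX, and GPU paths have to beat.

```swift
// Naive triple-loop matrix multiply: C = A * B, row-major Float arrays.
// A is m x k, B is k x n, the result C is m x n.
// Illustrative baseline only; real kernels reorder loops, tile for cache,
// and vectorize to approach the hardware's peak throughput.
func matmulNaive(_ a: [Float], _ b: [Float], m: Int, n: Int, k: Int) -> [Float] {
    var c = [Float](repeating: 0, count: m * n)
    for i in 0..<m {
        for j in 0..<n {
            var sum: Float = 0
            for p in 0..<k {
                sum += a[i * k + p] * b[p * n + j]  // one multiply + one add per term
            }
            c[i * n + j] = sum
        }
    }
    return c
}
```

The inner loop here strides through `b` by `n` elements at a time, which defeats the cache and blocks auto-vectorization; that memory-access pattern is a large part of why naive implementations land in the low gigaflops.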
Current benchmarks show the initial Swift implementation reaching roughly 0.92 GFLOP/s, with ongoing efforts to push throughput into the TFLOP/s range. The work deliberately avoids existing machine learning frameworks, relying instead on custom handwritten kernels to better understand what Apple Silicon hardware can deliver for ML workloads.
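Throughput figures like these follow from the flop count of a matmul: multiplying an (m × k) matrix by a (k × n) matrix costs 2·m·n·k operations, divided by wall-clock time. A hypothetical measurement harness (the name `measureGflops` is assumed here, not from the article) might look like:

```swift
import Dispatch

// Hypothetical helper: estimates GFLOP/s achieved by one matmul call.
// An (m x k) by (k x n) multiply costs 2*m*n*k floating-point ops
// (one multiply and one add per accumulated term).
func measureGflops(m: Int, n: Int, k: Int, _ body: () -> Void) -> Double {
    let start = DispatchTime.now().uptimeNanoseconds
    body()
    let seconds = Double(DispatchTime.now().uptimeNanoseconds - start) / 1e9
    let flops = 2.0 * Double(m) * Double(n) * Double(k)
    return flops / seconds / 1e9  // convert flop/s to GFLOP/s
}
```

Dividing the same fixed flop count by ever-shorter kernel times is what turns 0.92 GFLOP/s into a teraflop target: the arithmetic is constant, only the time changes.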
Why It Matters
This development is significant because it demonstrates the potential for high-performance neural network training directly in Swift on Apple Silicon hardware, which could enable more developers to build and experiment with large language models without relying on external libraries or cloud-based solutions. Achieving teraflop-level performance in a native environment could also influence future hardware and software optimization efforts for machine learning on Macs and iPads.
Background
Two years ago, the developer revisited an old neural network project, motivated by the lack of native ML training tools on Mac and the desire for more control over calculations. Inspired by Andrej Karpathy’s llm.c, a simple yet representative implementation of GPT-2 architecture, the developer rewrote the code in Swift to explore performance limits. Initial tests revealed the code was far slower than expected, prompting a series of low-level optimizations. The focus has been on the core matrix multiplication routines, which dominate training time. The effort aligns with broader trends of pushing hardware to its limits for ML tasks, especially as Apple Silicon continues to grow in computational power.
“The goal is to push Swift matrix multiplication from gigaflops to teraflops, leveraging all hardware features of Apple Silicon.”
— Developer
“Maximizing hardware utilization in custom kernels can unlock significant performance gains for ML workloads on Apple Silicon.”
— Hardware expert

What Remains Unclear
It remains unclear how close the developer is to achieving true teraflop performance in practice, as benchmarks are still in progress and hardware utilization efficiency is being optimized. The impact of future software updates or hardware revisions is also unknown.

What’s Next
The developer plans to continue refining the matrix multiplication kernels, incorporating GPU acceleration with Metal, and benchmarking performance improvements. Future milestones include reaching and verifying teraflop-level throughput and integrating these kernels into full LLM training routines.
Key Questions
What are the main challenges in optimizing matrix multiplication in Swift?
The main challenges include effectively utilizing hardware features like SIMD, AMX, and GPU acceleration, managing multi-threading, and minimizing memory bottlenecks to maximize FLOP throughput.
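As an illustration of the multi-threading and memory-access angle, the sketch below (an assumption for this summary, not the developer's kernel) parallelizes over rows of the result with `DispatchQueue.concurrentPerform` and reorders the loops so the innermost loop runs at unit stride, a pattern compilers can auto-vectorize with SIMD:

```swift
import Dispatch

// Sketch of a row-parallel, cache-friendly matmul: C = A * B, row-major.
// Rows of C are independent, so concurrentPerform spreads them across
// cores; the i-p-j loop order keeps the inner loop at unit stride over
// B and C, helping auto-vectorization. (Swift 5 concurrency rules assumed.)
func matmulParallel(_ a: [Float], _ b: [Float], m: Int, n: Int, k: Int) -> [Float] {
    var c = [Float](repeating: 0, count: m * n)
    c.withUnsafeMutableBufferPointer { cp in
        a.withUnsafeBufferPointer { ap in
            b.withUnsafeBufferPointer { bp in
                DispatchQueue.concurrentPerform(iterations: m) { i in
                    // Each iteration owns row i of C, so no two threads
                    // ever write the same element.
                    for p in 0..<k {
                        let aip = ap[i * k + p]
                        for j in 0..<n {
                            cp[i * n + j] += aip * bp[p * n + j]
                        }
                    }
                }
            }
        }
    }
    return c
}
```

Partitioning by output row is one simple way to avoid write conflicts between threads; the remaining bottlenecks the article points to (cache tiling, AMX, GPU offload) sit below this level.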
Can this approach replace existing machine learning frameworks?
While promising for experimentation and for understanding hardware limits, handwritten kernels are unlikely to replace mature frameworks such as TensorFlow or PyTorch in production, given the breadth of operations and years of optimization those frameworks embody.
How does this work compare to using dedicated ML hardware like GPUs or TPUs?
This approach aims to maximize the potential of Apple Silicon’s integrated hardware, but dedicated ML accelerators like TPUs typically offer higher performance for large-scale training. Still, native optimization in Swift offers more control and potential for integration into Mac applications.
Will this work benefit other ML tasks beyond LLM training?
Yes, optimized matrix multiplication routines can accelerate various neural network operations, potentially improving performance across a range of ML models on Apple Silicon devices.