Show HN: NanoEuler – GPT-2 scale model in pure C/CUDA from scratch

TL;DR

NanoEuler introduces a GPT-2 scale language model developed entirely from scratch in C and CUDA. It features a full training pipeline, verified gradients, and runs on a single consumer GPU. The project aims to demonstrate engineering and educational value rather than production readiness.

NanoEuler, a GPT-2 scale language model built entirely from scratch in C and CUDA, has been publicly released. Developed as a research and educational artifact, it features a complete training pipeline, hand-written tokenizer, and verified backpropagation, trained on a single consumer GPU. This project emphasizes engineering transparency and understanding of core ML components.

The project, shared on Hacker News, includes a hand-written byte-level BPE tokenizer, a full training pipeline, a small showcase model (~1 million parameters) that runs on CPU, and a larger model (~116 million parameters) trained on a single RTX 4070 GPU. Training uses custom CUDA kernels, including FlashAttention, and backpropagation is validated with double-precision gradient checks.
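To make the tokenizer idea concrete, here is a minimal sketch of one byte-level BPE training step: count adjacent token pairs, pick the most frequent, and replace it with a new token id. The function names and the quadratic pair count are illustrative assumptions, not NanoEuler's actual code.

```c
#include <assert.h>
#include <stddef.h>

/* Find the most frequent adjacent pair in ids[0..n).
   Hypothetical helper; O(n^2) for clarity, not speed.
   Returns the pair's count and writes the pair to *a and *b.
   If n < 2 it returns 0 and leaves *a and *b untouched. */
static size_t most_frequent_pair(const int *ids, size_t n, int *a, int *b)
{
    assert(ids && a && b);
    size_t best = 0;
    for (size_t i = 0; i + 1 < n; i++) {
        size_t count = 0;
        for (size_t j = 0; j + 1 < n; j++)
            if (ids[j] == ids[i] && ids[j + 1] == ids[i + 1])
                count++;
        if (count > best) {   /* strict '>' keeps the earliest pair on ties */
            best = count;
            *a = ids[i];
            *b = ids[i + 1];
        }
    }
    return best;
}

/* Replace each non-overlapping occurrence of (a, b) with new_id, in place,
   scanning left to right. Returns the new sequence length. */
static size_t merge_pair(int *ids, size_t n, int a, int b, int new_id)
{
    assert(ids);
    size_t out = 0;
    for (size_t i = 0; i < n; ) {
        if (i + 1 < n && ids[i] == a && ids[i + 1] == b) {
            ids[out++] = new_id;
            i += 2;
        } else {
            ids[out++] = ids[i++];
        }
    }
    return out;
}
```

In a byte-level scheme the initial ids are the raw byte values 0..255, so the first learned merge would typically receive id 256. Training repeats these two calls until the target vocabulary size is reached.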

According to the developer, the model demonstrates fluent English generation but lacks real-world knowledge, consistent with its size and training data. The architecture follows a decoder-only transformer design, with features like RMSNorm, rotary position embeddings, SwiGLU feed-forward, and grouped-query attention. The project aims to own every piece of the pipeline, from tokenization to training kernels, emphasizing transparency and educational value.
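Of the architectural features listed, grouped-query attention is the easiest to show in a few lines: several query heads share one key/value head. The sketch below assumes a contiguous grouping and hypothetical head counts; the article does not state NanoEuler's actual configuration. The usual 1/sqrt(head_dim) scaling and the softmax are omitted for brevity.

```c
#include <assert.h>
#include <stddef.h>

/* Map a query head to the key/value head it shares.
   Assumes query heads are grouped contiguously (an illustrative choice). */
static size_t kv_head_for_query(size_t q_head, size_t n_q_heads, size_t n_kv_heads)
{
    assert(n_kv_heads > 0 && n_q_heads % n_kv_heads == 0);
    assert(q_head < n_q_heads);
    size_t group_size = n_q_heads / n_kv_heads;
    return q_head / group_size;
}

/* Unscaled dot-product score between one query head and its shared key head
   at a single position.
   q has layout [n_q_heads * head_dim], k has layout [n_kv_heads * head_dim]. */
static float gqa_score(const float *q, const float *k, size_t q_head,
                       size_t n_q_heads, size_t n_kv_heads, size_t head_dim)
{
    size_t kv = kv_head_for_query(q_head, n_q_heads, n_kv_heads);
    const float *qh = q + q_head * head_dim;
    const float *kh = k + kv * head_dim;
    float dot = 0.0f;
    for (size_t i = 0; i < head_dim; i++)
        dot += qh[i] * kh[i];
    return dot;
}
```

With 8 query heads and 2 key/value heads, for example, heads 0 to 3 read the first key/value head and heads 4 to 7 read the second, which shrinks the key/value tensors by a factor of four relative to standard multi-head attention.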

At a glance
When: announced on Hacker News, current statu…
The development: the developer released NanoEuler, a GPT-2 class language model built from scratch in C/CUDA, with a complete training pipeline and verified gradients, trained on a single GPU.

Implications for AI Development and Education

This project underscores the feasibility of building complex language models entirely from scratch, without reliance on existing ML libraries or frameworks. It provides a detailed, transparent example of the inner workings of transformer models, which can serve as an educational resource for researchers and students. Additionally, demonstrating a full training pipeline on a single consumer GPU highlights the accessibility of AI experimentation, albeit at modest model sizes, fostering a deeper understanding of model mechanics and gradient verification.

Background on From-Scratch ML Model Development

Most large language models are built with high-level frameworks like PyTorch or TensorFlow, which bring extensive dependencies and hardware requirements. Building models from scratch in C/CUDA is rare and typically limited to research prototypes or educational projects. NanoEuler reflects ongoing efforts to demystify the core algorithms, such as backpropagation, attention, and tokenization, by implementing them directly in low-level languages. Prior work on neural ODEs and residual networks influenced the naming and conceptual approach, which frames the model as a discretized ODE solved with Euler steps.
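The Euler framing can be sketched in a few lines: a residual update x + f(x) has the same form as a forward-Euler step x + h * f(x) with step size h = 1. The block function type and the toy decay block below are hypothetical, chosen only to illustrate the correspondence; in a transformer, f would be an attention or feed-forward sublayer.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical signature for a block function f: reads x, writes f(x) to out. */
typedef void (*block_fn)(const double *x, double *out, size_t d);

/* One forward-Euler step: x <- x + h * f(x).
   With h = 1 this is exactly a standard residual connection. */
static void euler_residual_step(double *x, double *scratch, size_t d,
                                double h, block_fn f)
{
    assert(x && scratch && f);
    f(x, scratch, d);
    for (size_t i = 0; i < d; i++)
        x[i] += h * scratch[i];
}

/* Toy block for illustration: f(x) = -x, i.e. the ODE dx/dt = -x. */
static void decay_block(const double *x, double *out, size_t d)
{
    for (size_t i = 0; i < d; i++)
        out[i] = -x[i];
}

/* Stacking layers corresponds to taking successive Euler steps. */
static void run_layers(double *x, double *scratch, size_t d, double h,
                       size_t n_layers, block_fn f)
{
    for (size_t l = 0; l < n_layers; l++)
        euler_residual_step(x, scratch, d, h, f);
}
```

With the toy block and h = 0.5, each layer halves every component of x, so two layers take {1.0, 2.0} to {0.25, 0.5}.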

“This is a research/educational artifact, built in public, demonstrating every piece of the training pipeline from scratch.”

— The developer

Limitations and Unverified Claims

While the gradient verification process is thorough, the model’s capabilities remain limited to small-scale text generation, with no real-world knowledge or sophisticated reasoning. The project does not include extensive training data or large-scale fine-tuning, and its performance as an assistant is shallow. It is also uncertain how well the custom kernels and training pipeline will scale or generalize beyond the current experiments.

Future Developments and Community Engagement

The developer plans to continue refining the training pipeline, potentially scaling the model size, and experimenting with supervised fine-tuning and reinforcement learning from human feedback (RLHF). Open-sourcing the entire codebase invites community contributions, which may lead to improved models, optimized kernels, and broader educational resources. Monitoring updates and community feedback will be key to assessing the project’s evolution.

Key Questions

Can this model be used for practical applications?

Currently, the model is a research and educational prototype with limited capabilities. It is not designed for production use but demonstrates core ML principles and the feasibility of from-scratch implementations.

How does NanoEuler verify the correctness of its gradients?

The project performs double-precision gradient checks against finite difference estimates, ensuring that backpropagation is correctly implemented.
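A gradient check of this kind can be sketched as follows. The toy loss (a single linear unit with squared error), the use of central differences, and the relative-error metric are assumptions made for illustration; the article does not describe NanoEuler's exact procedure.

```c
#include <assert.h>
#include <stddef.h>

/* Small helper so the sketch needs no math library. */
static double abs_d(double v) { return v < 0.0 ? -v : v; }

/* Toy loss: L(w) = 0.5 * (w . x - t)^2 for a single linear unit. */
static double loss(const double *w, const double *x, size_t n, double t)
{
    double y = 0.0;
    for (size_t i = 0; i < n; i++)
        y += w[i] * x[i];
    return 0.5 * (y - t) * (y - t);
}

/* Analytic gradient (the "backprop" result): dL/dw[i] = (w . x - t) * x[i]. */
static void loss_grad(const double *w, const double *x, size_t n, double t,
                      double *grad)
{
    double y = 0.0;
    for (size_t i = 0; i < n; i++)
        y += w[i] * x[i];
    for (size_t i = 0; i < n; i++)
        grad[i] = (y - t) * x[i];
}

/* Compare the analytic gradient with central finite differences,
   (L(w + h e_i) - L(w - h e_i)) / (2h), and return the largest relative error.
   Perturbs w in place and restores each entry afterwards. */
static double grad_check(double *w, const double *x, size_t n, double t, double h)
{
    double analytic[16];
    assert(n <= 16 && h > 0.0);
    loss_grad(w, x, n, t, analytic);

    double max_rel = 0.0;
    for (size_t i = 0; i < n; i++) {
        double saved = w[i];
        w[i] = saved + h;
        double lp = loss(w, x, n, t);
        w[i] = saved - h;
        double lm = loss(w, x, n, t);
        w[i] = saved;

        double numeric = (lp - lm) / (2.0 * h);
        double scale = abs_d(numeric) + abs_d(analytic[i]) + 1e-12;
        double rel = abs_d(numeric - analytic[i]) / scale;
        if (rel > max_rel)
            max_rel = rel;
    }
    return max_rel;
}
```

Double precision matters here because the finite-difference estimate subtracts two nearly equal loss values; in single precision the rounding error of that subtraction can be as large as the discrepancy the check is meant to detect.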

Will the project scale to larger models?

The developer aims to experiment with larger models and more data, but scaling beyond current sizes will require significant additional work and resources.

Is the codebase available for public review?

Yes. The project was announced openly on Hacker News, and the code is likely available in an associated repository, which encourages community inspection and contributions.

What inspired the name ‘Euler’ in the project?

The name references Leonhard Euler, who developed the forward-Euler method of numerical integration, which the model’s residual blocks conceptually emulate.

Source: Hacker News
