# WebLLM

## Overview

WebLLM is a web application that runs large language models entirely in the browser using WebGPU. No server is required: the model runs locally on your GPU, keeping conversations private and eliminating API costs.
## The challenge

Running LLMs typically requires server infrastructure, API keys, and sending user data over the network. For privacy-sensitive use cases or offline scenarios, a fully client-side solution is more appropriate, but browser-based ML inference has historically been too slow to be practical.
## The approach

WebGPU (the successor to WebGL) exposes GPU compute capabilities to JavaScript, enabling the same kind of parallel processing that powers native ML inference. This project uses WebGPU to run transformer models directly in the browser:
- Model loading — Downloads and caches the model weights in the browser
- GPU inference — Runs the transformer forward pass on the user’s GPU via WebGPU shaders
- Token streaming — Generates tokens one at a time, streaming the response to the UI in real-time
The result is a ChatGPT-like experience running entirely on the client — no API calls, no data leaving the device.
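As a sketch of the model-loading step above, downloaded weights can be cached so repeat visits don't re-fetch them. The `WeightStore` interface and the names below are illustrative, not the project's actual API; in a browser the store would typically be backed by the Cache Storage API or IndexedDB.

```typescript
// Illustrative storage interface; a browser implementation would wrap
// Cache Storage or IndexedDB behind these two methods.
interface WeightStore {
  get(key: string): Promise<ArrayBuffer | undefined>;
  put(key: string, data: ArrayBuffer): Promise<void>;
}

// Return cached weights if present; otherwise download once and persist
// them so future page loads skip the (large) network transfer.
async function loadWeights(
  url: string,
  store: WeightStore,
  download: (url: string) => Promise<ArrayBuffer>,
): Promise<ArrayBuffer> {
  const cached = await store.get(url);
  if (cached) return cached; // cache hit: no network traffic
  const data = await download(url);
  await store.put(url, data); // persist for future page loads
  return data;
}
```

Injecting the store and downloader keeps the caching logic testable outside a browser.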
## Features

- Browser-Native — Runs entirely client-side using WebGPU acceleration
- Model Selection — Choose from multiple available models of different sizes
- Streaming Responses — See tokens generated in real-time as the model produces output
- Resizable Interface — Adjustable response area for comfortable reading
- Privacy by Default — All processing happens locally, no data leaves your browser
- Zero Infrastructure — No server, no API keys, no ongoing costs
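The token-by-token streaming listed above can be sketched with an async generator. The `Model` interface here is a hypothetical stand-in for the project's actual inference API:

```typescript
// Hypothetical model interface: returns the next token given the
// context so far, or null when generation is finished.
interface Model {
  nextToken(context: string): string | null;
}

// Yield tokens one at a time; the UI can append each token as it
// arrives instead of waiting for the complete response.
async function* streamTokens(
  model: Model,
  prompt: string,
): AsyncGenerator<string> {
  let context = prompt;
  for (;;) {
    const token = model.nextToken(context);
    if (token === null) return; // end of generation
    context += token; // feed the growing context back into the model
    yield token;
  }
}
```

A consumer would `for await` over `streamTokens(...)` and append each yielded token to the response area as it arrives.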
## Requirements

WebLLM requires a browser with WebGPU support (Chrome 113+, Edge 113+). A GPU with sufficient VRAM is recommended for reasonable inference speeds.
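A minimal feature check, assuming only the standard `navigator.gpu` entry point, might look like:

```typescript
// Returns true if the browser exposes WebGPU and a GPU adapter is
// available; false in unsupported browsers or non-browser environments.
async function webgpuSupported(): Promise<boolean> {
  const gpu = (globalThis as any).navigator?.gpu;
  if (!gpu) return false; // no WebGPU entry point at all
  try {
    const adapter = await gpu.requestAdapter();
    return adapter !== null; // requestAdapter resolves null on blocklisted GPUs
  } catch {
    return false;
  }
}
```

Checking `navigator.gpu` alone is not enough: `requestAdapter()` can still resolve to `null` (for example on blocklisted hardware), so both checks are needed before attempting to load a model.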
## Why this matters

This project demonstrates building cutting-edge ML applications for the browser. The same WebGPU and streaming techniques apply to on-device audio ML models, real-time audio enhancement, and privacy-preserving audio processing tools.
## Tech Stack

| Component | Details |
| --- | --- |
| Language | TypeScript (89.5%) |
| API | WebGPU |
| ML | Transformer-based LLM inference |
| Runtime | Bun |
## Source Code

The source code is available on the project’s GitHub repository.