# WebLLM

## Overview

WebLLM is a web application that runs large language models entirely in the browser using WebGPU. No server is required: the model runs locally on your GPU, keeping conversations private and eliminating API costs.
## The challenge

Running LLMs typically requires server infrastructure, API keys, and sending user data over the network. For privacy-sensitive use cases or offline scenarios, a fully client-side solution is more appropriate, but browser-based ML inference has historically been too slow to be practical.
## The approach

WebGPU (the successor to WebGL) exposes GPU compute capabilities to JavaScript, enabling the same kind of parallel processing that powers native ML inference. This project uses WebGPU to run transformer models directly in the browser:
- Model loading — Downloads and caches the model weights in the browser
- GPU inference — Runs the transformer forward pass on the user’s GPU via WebGPU shaders
- Token streaming — Generates tokens one at a time, streaming the response to the UI in real-time
The result is a ChatGPT-like experience running entirely on the client — no API calls, no data leaving the device.
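As a sketch of the model-loading step above, downloaded weights can be cached so repeat visits don't re-fetch them. The `WeightStore` interface and the names below are illustrative, not the project's actual API; in a browser the store would typically be backed by the Cache Storage API or IndexedDB.

```typescript
// Illustrative storage interface; a browser implementation would wrap
// Cache Storage or IndexedDB behind these two methods.
interface WeightStore {
  get(key: string): Promise<ArrayBuffer | undefined>;
  put(key: string, data: ArrayBuffer): Promise<void>;
}

// Return cached weights if present; otherwise download once and persist
// them so future page loads skip the (large) network transfer.
async function loadWeights(
  url: string,
  store: WeightStore,
  download: (url: string) => Promise<ArrayBuffer>,
): Promise<ArrayBuffer> {
  const cached = await store.get(url);
  if (cached) return cached; // cache hit: no network traffic
  const data = await download(url);
  await store.put(url, data); // persist for future page loads
  return data;
}
```

Injecting the store and downloader keeps the caching logic testable outside a browser.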
## Features

- Browser-Native — Runs entirely client-side using WebGPU acceleration
- Model Selection — Choose from multiple available models of different sizes
- Streaming Responses — See tokens generated in real-time as the model produces output
- Resizable Interface — Adjustable response area for comfortable reading
- Privacy by Default — All processing happens locally, no data leaves your browser
- Zero Infrastructure — No server, no API keys, no ongoing costs
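The token-by-token streaming listed above can be sketched with an async generator. The `Model` interface here is a hypothetical stand-in for the project's actual inference API:

```typescript
// Hypothetical model interface: returns the next token given the
// context so far, or null when generation is finished.
interface Model {
  nextToken(context: string): string | null;
}

// Yield tokens one at a time; the UI can append each token as it
// arrives instead of waiting for the complete response.
async function* streamTokens(
  model: Model,
  prompt: string,
): AsyncGenerator<string> {
  let context = prompt;
  for (;;) {
    const token = model.nextToken(context);
    if (token === null) return; // end of generation
    context += token; // feed the growing context back into the model
    yield token;
  }
}
```

A consumer would `for await` over `streamTokens(...)` and append each yielded token to the response area as it arrives.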
## Requirements

WebLLM requires a browser with WebGPU support (Chrome 113+, Edge 113+). A GPU with sufficient VRAM is recommended for reasonable inference speeds.
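A minimal feature check, assuming only the standard `navigator.gpu` entry point, might look like:

```typescript
// Returns true if the browser exposes WebGPU and a GPU adapter is
// available; false in unsupported browsers or non-browser environments.
async function webgpuSupported(): Promise<boolean> {
  const gpu = (globalThis as any).navigator?.gpu;
  if (!gpu) return false; // no WebGPU entry point at all
  try {
    const adapter = await gpu.requestAdapter();
    return adapter !== null; // requestAdapter resolves null on blocklisted GPUs
  } catch {
    return false;
  }
}
```

Checking `navigator.gpu` alone is not enough: `requestAdapter()` can still resolve to `null` (for example on blocklisted hardware), so both checks are needed before attempting to load a model.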
## Why this matters

This project demonstrates building cutting-edge ML applications for the browser. The same WebGPU and streaming techniques apply to on-device audio ML models, real-time audio enhancement, and privacy-preserving audio processing tools.
## Tech Stack

| Component | Details |
| --- | --- |
| Language | TypeScript (89.5%) |
| API | WebGPU |
| ML | Transformer-based LLM inference |
| Runtime | Bun |
## Source Code

The source code is available on the project’s GitHub repository.