WebLLM

WebLLM is a web application that runs large language models entirely in the browser using WebGPU. No server required — the model runs locally on your GPU, keeping conversations private and eliminating API costs.

Running LLMs typically requires server infrastructure, API keys, and sending user data over the network. For privacy-sensitive use cases or offline scenarios, a fully client-side solution is more appropriate — but browser-based ML inference has historically been too slow to be practical.

WebGPU (the successor to WebGL) exposes GPU compute capabilities to JavaScript, enabling the same kind of parallel processing that powers native ML inference. This project uses WebGPU to run transformer models directly in the browser:

  1. Model loading — Downloads and caches the model weights in the browser
  2. GPU inference — Runs the transformer forward pass on the user’s GPU via WebGPU shaders
  3. Token streaming — Generates tokens one at a time, streaming the response to the UI in real time

The result is a ChatGPT-like experience running entirely on the client — no API calls, no data leaving the device.
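The streaming step above can be sketched in TypeScript. This is illustrative only: `nextToken` is a hypothetical stand-in for the real WebGPU forward pass, and the function names are not from the project's actual code — the point is the async-generator structure that lets the UI render tokens as they arrive.

```typescript
// Stand-in for the model: in the real app, a WebGPU compute dispatch
// would pick the next token from the current context.
type NextToken = (context: string[]) => string | null;

// Step 3 in miniature: yield tokens one at a time as they are generated.
async function* streamTokens(
  prompt: string,
  nextToken: NextToken,
): AsyncGenerator<string> {
  const context = prompt.split(" ");
  for (;;) {
    const token = nextToken(context); // stand-in for GPU inference
    if (token === null) break;        // end-of-sequence
    context.push(token);
    yield token;
  }
}

// Example consumer: append each token to the response as it arrives,
// the way a chat UI updates its output area incrementally.
async function collectResponse(
  prompt: string,
  nextToken: NextToken,
): Promise<string> {
  let response = "";
  for await (const token of streamTokens(prompt, nextToken)) {
    response += (response ? " " : "") + token;
  }
  return response;
}
```

In the browser, the consumer would write each token into the DOM instead of a string, but the producer/consumer shape is the same.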

  • Browser-Native — Runs entirely client-side using WebGPU acceleration
  • Model Selection — Choose from multiple available models of different sizes
  • Streaming Responses — See tokens generated in real time as the model produces output
  • Resizable Interface — Adjustable response area for comfortable reading
  • Privacy by Default — All processing happens locally, no data leaves your browser
  • Zero Infrastructure — No server, no API keys, no ongoing costs

WebLLM requires a browser with WebGPU support (Chrome 113+, Edge 113+). A GPU with sufficient VRAM is recommended for reasonable inference speeds.
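Support can be checked at runtime before attempting to load a model. The `navigator.gpu` presence check and `requestAdapter()` call below are the standard WebGPU API; the helper's shape (taking the GPU object as a parameter) is just an illustrative sketch, not the project's actual code.

```typescript
// Minimal WebGPU feature detection. `gpu` mirrors the shape of
// `navigator.gpu`; passing it in keeps the helper testable outside a browser.
interface GPULike {
  requestAdapter(): Promise<object | null>;
}

async function hasWebGPU(gpu: GPULike | undefined): Promise<boolean> {
  if (!gpu) return false; // browser does not expose the WebGPU API at all
  const adapter = await gpu.requestAdapter();
  return adapter !== null; // null means API present but no usable GPU
}
```

In the app itself this would be called as `await hasWebGPU(navigator.gpu)`, with a fallback message shown when it returns false.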

This project demonstrates building cutting-edge ML applications for the browser — the same WebGPU and streaming techniques apply to on-device audio ML models, real-time audio enhancement, and privacy-preserving audio processing tools.

  • Language: TypeScript (89.5%)
  • API: WebGPU
  • ML: Transformer-based LLM inference
  • Runtime: Bun

The source code is available on the project’s GitHub repository.