Machine Learning for Audio: Getting Started with Source Separation and Classification

Machine learning has opened up audio capabilities that were impossible with traditional DSP: separating vocals from a mix, identifying sounds in a recording, enhancing speech quality in real-time. But knowing when to use ML — and when traditional DSP is better — is just as important as knowing how.

Here’s what I’ve learned from building ML audio systems.

Traditional DSP is deterministic and efficient. If you can describe the problem mathematically (filter design, dynamics processing, spectral analysis), DSP is usually the better choice — faster, more predictable, and easier to debug.

ML wins when the problem involves pattern recognition that’s hard to express as equations:

| Problem | DSP approach | ML approach | Winner |
| --- | --- | --- | --- |
| Lowpass filter | Butterworth/Chebyshev | Overkill | DSP |
| Compression | Envelope follower + gain | Overkill | DSP |
| Vocal separation | Spectral subtraction (poor) | UNet/transformer (good) | ML |
| Speaker identification | MFCCs + GMM (okay) | Deep embedding (better) | ML |
| Noise reduction | Spectral gating (okay) | RNNoise/DeepFilterNet (better) | ML |
| Audio classification | Hand-crafted features | CNN on spectrograms | ML |

The rule of thumb: if a human can hear the difference but you can’t write a formula for it, ML is probably the right tool.

Source separation takes a mixed audio signal and extracts individual components — vocals, drums, bass, other instruments. The most common approach uses a spectrogram masking model:

The mixed audio is transformed to a time-frequency representation using an STFT (Short-Time Fourier Transform). This gives you a 2D image where the x-axis is time, y-axis is frequency, and pixel intensity is magnitude.

import librosa
import numpy as np
# Load mixed audio
mix, sr = librosa.load('song.wav', sr=44100)
# STFT → spectrogram
stft = librosa.stft(mix, n_fft=2048, hop_length=512)
magnitude = np.abs(stft)
phase = np.angle(stft)

A UNet (encoder-decoder with skip connections) takes the magnitude spectrogram as input and predicts a mask for each source. The mask is a matrix of values between 0 and 1, where 1 means “this time-frequency bin belongs to this source.”

Input: magnitude spectrogram (1025 × T)
Encoder: Conv → Pool → Conv → Pool → ...
Bottleneck
Decoder: UpConv → Concat(skip) → Conv → ...
Output: mask per source (1025 × T × N_sources)
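The masking step itself needs no trained network to demonstrate. This NumPy sketch builds an oracle ratio mask from ground-truth source magnitudes — which a real UNet would have to predict from the mixture alone — with toy array sizes:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy magnitude spectrograms for two sources (1025 bins x 100 frames);
# in practice these are unknown, and the network predicts the mask.
vocals_mag = rng.random((1025, 100))
accomp_mag = rng.random((1025, 100))
mix_mag = vocals_mag + accomp_mag

# Ideal ratio mask: each source's share of energy per time-frequency bin
eps = 1e-8
vocal_mask = vocals_mag / (mix_mag + eps)

# Masks stay in [0, 1], and applying the mask to the mixture
# approximately recovers the source magnitude
est_vocals = vocal_mask * mix_mag
print(vocal_mask.min() >= 0 and vocal_mask.max() <= 1)  # True
print(np.allclose(est_vocals, vocals_mag, atol=1e-5))   # True
```

This is why mask values between 0 and 1 are the natural output: they express, per bin, what fraction of the mixture's energy belongs to the source.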

Multiply each mask by the original magnitude, combine with the original phase, and inverse-STFT back to audio:

vocal_magnitude = magnitude * vocal_mask
vocal_stft = vocal_magnitude * np.exp(1j * phase)
vocals = librosa.istft(vocal_stft, hop_length=512)

The quality of the separation depends on the model architecture, training data, and loss function. State-of-the-art models (Demucs, Open-Unmix) achieve impressive results, but there’s always some bleed between sources.

Audio classification: the spectrogram-as-image approach

Audio classification (identifying what sounds are in a recording) leverages a key insight: spectrograms look like images, and CNNs are great at images.

  1. Preprocess — Convert audio to mel-spectrograms (log-scaled frequency, which matches human perception)
  2. Train — Feed spectrograms through a CNN or transformer as if they were images
  3. Predict — Pass a new audio clip through the same pipeline → get class probabilities
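Step 1 can be sketched with plain NumPy to show what the model actually sees. This is a simplified log-mel pipeline (in practice you would likely reach for librosa.feature.melspectrogram); the frame sizes and mel-band count are arbitrary example values:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def log_mel_spectrogram(audio, sr=16000, n_fft=512, hop=256, n_mels=40):
    # Frame the signal and take the magnitude STFT
    window = np.hanning(n_fft)
    n_frames = 1 + (len(audio) - n_fft) // hop
    frames = np.stack([audio[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    mag = np.abs(np.fft.rfft(frames, axis=1))  # (n_frames, n_fft//2 + 1)

    # Triangular mel filterbank: evenly spaced in mel, warped back to Hz
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fb[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fb[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)

    # Log-compress to roughly match perceived loudness
    return np.log(mag @ fb.T + 1e-10)  # (n_frames, n_mels)

# One second of a 440 Hz tone as a stand-in for real audio
sr = 16000
t = np.arange(sr) / sr
mel = log_mel_spectrogram(np.sin(2 * np.pi * 440 * t), sr=sr)
print(mel.shape)  # (61, 40)
```

The resulting (frames × mel-bands) array is exactly the "image" the CNN below consumes, with one channel added.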

import tensorflow as tf
from tensorflow import keras
# Simple 2D CNN that treats the log-mel spectrogram as a one-channel image
n_mels, n_frames, num_classes = 128, 128, 10  # example dimensions
model = keras.Sequential([
    keras.layers.Input(shape=(n_mels, n_frames, 1)),
    keras.layers.Conv2D(64, 3, activation='relu'),
    keras.layers.MaxPooling2D(2),
    keras.layers.Conv2D(128, 3, activation='relu'),
    keras.layers.MaxPooling2D(2),
    keras.layers.Flatten(),
    keras.layers.Dense(128, activation='relu'),
    keras.layers.Dense(num_classes, activation='softmax')
])

The model is only as good as your training data. For audio classification:

  • Balanced classes — Equal numbers of examples per class, or use class weights
  • Augmentation — Time stretching, pitch shifting, adding background noise, random cropping
  • Real-world noise — Train on data that matches your deployment environment
  • Sufficient duration — Most audio events need 0.5–2 seconds of context
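A minimal augmentation sketch using only NumPy: it covers random cropping, gain, and additive background noise (time stretching and pitch shifting need a resampling library, so they are omitted here). The SNR and gain ranges are illustrative values, not recommendations:

```python
import numpy as np

rng = np.random.default_rng(42)

def augment(clip, target_len):
    """Randomly crop, apply gain, and add background noise to one clip."""
    # Random crop to a fixed training length
    start = rng.integers(0, max(len(clip) - target_len, 1))
    out = clip[start:start + target_len].copy()

    # Random gain of +/- 6 dB
    gain_db = rng.uniform(-6.0, 6.0)
    out *= 10.0 ** (gain_db / 20.0)

    # Add white noise at a random SNR between 10 and 30 dB
    snr_db = rng.uniform(10.0, 30.0)
    signal_power = np.mean(out ** 2) + 1e-12
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    out += rng.normal(0.0, np.sqrt(noise_power), size=out.shape)
    return out

sr = 16000
clip = np.sin(2 * np.pi * 220 * np.arange(2 * sr) / sr)  # 2 s tone
batch = np.stack([augment(clip, sr) for _ in range(8)])   # 8 one-second crops
print(batch.shape)  # (8, 16000)
```

Each epoch then sees a slightly different version of every clip, which is usually worth more than extra model capacity.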

For offline processing (batch source separation, catalogue classification), run the model on a GPU server. This is the simplest approach — no size constraints, no latency requirements.

For real-time applications, you have two options:

  1. TensorFlow.js / ONNX Runtime Web — Run the model in the browser using WebGL or WebGPU for acceleration. Works for smaller models (~10–50MB).

  2. Custom WASM inference — For maximum control, compile a minimal inference engine to WebAssembly. This gives you predictable performance but requires more engineering.

The key constraint is model size. Users won’t wait to download a 500MB model. Techniques like quantization (float32 → int8), pruning, and knowledge distillation can reduce model size by 4–10x with minimal quality loss.
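Quantization is easy to demonstrate in isolation. This NumPy sketch applies symmetric linear int8 quantization to a toy weight matrix; real toolchains (TensorFlow Lite, ONNX Runtime) do this per-tensor or per-channel, often with calibration data:

```python
import numpy as np

rng = np.random.default_rng(0)
weights = rng.normal(0.0, 0.1, size=(1024, 1024)).astype(np.float32)

# Symmetric linear quantization: map [-max|w|, +max|w|] onto int8
scale = np.abs(weights).max() / 127.0
q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
dequant = q.astype(np.float32) * scale

# 4x smaller (1 byte per weight instead of 4), and the worst-case
# round-trip error is bounded by half a quantization step
print(weights.nbytes // q.nbytes)                           # 4
print(np.abs(weights - dequant).max() <= scale / 2 + 1e-6)  # True
```

The 4x figure comes straight from the storage format; the extra savings quoted above come from combining quantization with pruning and distillation.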

Core ML (iOS) and TensorFlow Lite (Android) both support real-time audio inference. The pattern:

  1. Capture audio in real-time (AudioEngine / AudioRecord)
  2. Buffer to the model’s expected input size
  3. Run inference on each buffer
  4. Act on the results (display classification, apply enhancement)
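Step 2 is the same on either platform, so here is a framework-agnostic Python sketch of the buffering logic; the 15,600-sample frame size and 1,024-sample callback size are hypothetical examples, not values from any particular model:

```python
import numpy as np

class FrameBuffer:
    """Accumulates arbitrary-sized capture callbacks into fixed model inputs."""
    def __init__(self, frame_size):
        self.frame_size = frame_size
        self._pending = np.zeros(0, dtype=np.float32)

    def push(self, samples):
        """Append one callback's samples; return any complete frames."""
        self._pending = np.concatenate([self._pending, samples])
        frames = []
        while len(self._pending) >= self.frame_size:
            frames.append(self._pending[:self.frame_size])
            self._pending = self._pending[self.frame_size:]
        return frames

# Suppose the model expects 15600-sample windows (~1 s at 16 kHz)
# and the capture callback delivers 1024-sample buffers
buf = FrameBuffer(15600)
ready = []
for _ in range(32):
    ready.extend(buf.push(np.zeros(1024, dtype=np.float32)))
print(len(ready))  # 32 * 1024 = 32768 samples -> 2 full frames
```

Leftover samples stay pending for the next callback, so no audio is dropped at frame boundaries. A production version would run inference on a separate thread so the audio callback never blocks.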

Start with existing models. Don’t train from scratch unless you have a unique dataset or specific requirements. Open-source models for source separation (Demucs), speech enhancement (RNNoise), and audio classification (PANNs) are excellent starting points.

Measure latency, not just accuracy. A model with 95% accuracy at 500ms latency is useless for real-time applications. Profile on your target hardware early.

Consider the cold start. ML models need to load weights into memory. On mobile, this can take 1–3 seconds. On the web, model download time dominates. Design your UX around this.

Hybrid approaches work well. Use ML for the hard part (separation, classification) and traditional DSP for everything else (filtering, dynamics, mixing). Don’t put an ML model where a simple filter would do.


Need ML-powered audio features in your product? From source separation to real-time classification, I build production ML audio systems. View services → or let’s talk about your project →.