Machine Learning for Audio: Getting Started with Source Separation and Classification

Machine learning has opened up audio capabilities that were impossible with traditional DSP: separating vocals from a mix, identifying sounds in a recording, enhancing speech quality in real-time. But knowing when to use ML — and when traditional DSP is better — is just as important as knowing how.

Here’s what I’ve learned from building ML audio systems.

Traditional DSP is deterministic and efficient. If you can describe the problem mathematically (filter design, dynamics processing, spectral analysis), DSP is usually the better choice — faster, more predictable, and easier to debug.

ML wins when the problem involves pattern recognition that’s hard to express as equations:

| Problem | DSP approach | ML approach | Winner |
| --- | --- | --- | --- |
| Lowpass filter | Butterworth/Chebyshev | Overkill | DSP |
| Compression | Envelope follower + gain | Overkill | DSP |
| Vocal separation | Spectral subtraction (poor) | UNet/transformer (good) | ML |
| Speaker identification | MFCCs + GMM (okay) | Deep embedding (better) | ML |
| Noise reduction | Spectral gating (okay) | RNNoise/DeepFilterNet (better) | ML |
| Audio classification | Hand-crafted features | CNN on spectrograms | ML |

The rule of thumb: if a human can hear the difference but you can’t write a formula for it, ML is probably the right tool.

Source separation takes a mixed audio signal and extracts individual components — vocals, drums, bass, other instruments. The most common approach uses a spectrogram masking model:

The mixed audio is transformed to a time-frequency representation using an STFT (Short-Time Fourier Transform). This gives you a 2D image where the x-axis is time, y-axis is frequency, and pixel intensity is magnitude.

import librosa
import numpy as np
# Load mixed audio
mix, sr = librosa.load('song.wav', sr=44100)
# STFT → spectrogram
stft = librosa.stft(mix, n_fft=2048, hop_length=512)
magnitude = np.abs(stft)
phase = np.angle(stft)

A UNet (encoder-decoder with skip connections) takes the magnitude spectrogram as input and predicts a mask for each source. The mask is a matrix of values between 0 and 1, where 1 means “this time-frequency bin belongs to this source.”

Input: magnitude spectrogram (1025 × T)
Encoder: Conv → Pool → Conv → Pool → ...
Bottleneck
Decoder: UpConv → Concat(skip) → Conv → ...
Output: mask per source (1025 × T × N_sources)
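The masking step itself needs no trained network to demonstrate. This NumPy sketch builds an oracle ratio mask from ground-truth source magnitudes — which a real UNet would have to predict from the mixture alone — with toy array sizes:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy magnitude spectrograms for two sources (1025 bins x 100 frames);
# in practice these are unknown, and the network predicts the mask.
vocals_mag = rng.random((1025, 100))
accomp_mag = rng.random((1025, 100))
mix_mag = vocals_mag + accomp_mag

# Ideal ratio mask: each source's share of energy per time-frequency bin
eps = 1e-8
vocal_mask = vocals_mag / (mix_mag + eps)

# Masks stay in [0, 1], and applying the mask to the mixture
# approximately recovers the source magnitude
est_vocals = vocal_mask * mix_mag
print(vocal_mask.min() >= 0 and vocal_mask.max() <= 1)  # True
print(np.allclose(est_vocals, vocals_mag, atol=1e-5))   # True
```

This is why mask values between 0 and 1 are the natural output: they express, per bin, what fraction of the mixture's energy belongs to the source.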

Multiply each mask by the original magnitude, combine with the original phase, and inverse-STFT back to audio:

vocal_magnitude = magnitude * vocal_mask
vocal_stft = vocal_magnitude * np.exp(1j * phase)
vocals = librosa.istft(vocal_stft, hop_length=512)

The quality of the separation depends on the model architecture, training data, and loss function. State-of-the-art models (Demucs, Open-Unmix) achieve impressive results, but there’s always some bleed between sources.

Audio classification: the spectrogram-as-image approach

Audio classification (identifying what sounds are in a recording) leverages a key insight: spectrograms look like images, and CNNs are great at images.

  1. Preprocess — Convert audio to mel-spectrograms (log-scaled frequency, which matches human perception)
  2. Train — Feed spectrograms through a CNN or transformer as if they were images
  3. Predict — Pass a new audio clip through the same pipeline → get class probabilities
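Step 1 can be sketched with plain NumPy to show what the model actually sees. This is a simplified log-mel pipeline (in practice you would likely reach for librosa.feature.melspectrogram); the frame sizes and mel-band count are arbitrary example values:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def log_mel_spectrogram(audio, sr=16000, n_fft=512, hop=256, n_mels=40):
    # Frame the signal and take the magnitude STFT
    window = np.hanning(n_fft)
    n_frames = 1 + (len(audio) - n_fft) // hop
    frames = np.stack([audio[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    mag = np.abs(np.fft.rfft(frames, axis=1))  # (n_frames, n_fft//2 + 1)

    # Triangular mel filterbank: evenly spaced in mel, warped back to Hz
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fb[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fb[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)

    # Log-compress to roughly match perceived loudness
    return np.log(mag @ fb.T + 1e-10)  # (n_frames, n_mels)

# One second of a 440 Hz tone as a stand-in for real audio
sr = 16000
t = np.arange(sr) / sr
mel = log_mel_spectrogram(np.sin(2 * np.pi * 440 * t), sr=sr)
print(mel.shape)  # (61, 40)
```

The resulting (frames × mel-bands) array is exactly the "image" the CNN below consumes, with one channel added.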

import tensorflow as tf
from tensorflow import keras
# Simple 2D CNN that treats the log-mel spectrogram as a one-channel image
n_mels, n_frames, num_classes = 128, 128, 10  # example dimensions
model = keras.Sequential([
    keras.layers.Input(shape=(n_mels, n_frames, 1)),
    keras.layers.Conv2D(64, 3, activation='relu'),
    keras.layers.MaxPooling2D(2),
    keras.layers.Conv2D(128, 3, activation='relu'),
    keras.layers.MaxPooling2D(2),
    keras.layers.Flatten(),
    keras.layers.Dense(128, activation='relu'),
    keras.layers.Dense(num_classes, activation='softmax')
])

The model is only as good as your training data. For audio classification:

  • Balanced classes — Equal numbers of examples per class, or use class weights
  • Augmentation — Time stretching, pitch shifting, adding background noise, random cropping
  • Real-world noise — Train on data that matches your deployment environment
  • Sufficient duration — Most audio events need 0.5–2 seconds of context
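A minimal augmentation sketch using only NumPy: it covers random cropping, gain, and additive background noise (time stretching and pitch shifting need a resampling library, so they are omitted here). The SNR and gain ranges are illustrative values, not recommendations:

```python
import numpy as np

rng = np.random.default_rng(42)

def augment(clip, target_len):
    """Randomly crop, apply gain, and add background noise to one clip."""
    # Random crop to a fixed training length
    start = rng.integers(0, max(len(clip) - target_len, 1))
    out = clip[start:start + target_len].copy()

    # Random gain of +/- 6 dB
    gain_db = rng.uniform(-6.0, 6.0)
    out *= 10.0 ** (gain_db / 20.0)

    # Add white noise at a random SNR between 10 and 30 dB
    snr_db = rng.uniform(10.0, 30.0)
    signal_power = np.mean(out ** 2) + 1e-12
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    out += rng.normal(0.0, np.sqrt(noise_power), size=out.shape)
    return out

sr = 16000
clip = np.sin(2 * np.pi * 220 * np.arange(2 * sr) / sr)  # 2 s tone
batch = np.stack([augment(clip, sr) for _ in range(8)])   # 8 one-second crops
print(batch.shape)  # (8, 16000)
```

Each epoch then sees a slightly different version of every clip, which is usually worth more than extra model capacity.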

For offline processing (batch source separation, catalogue classification), run the model on a GPU server. This is the simplest approach — no size constraints, no latency requirements.

For real-time applications, you have two options:

  1. TensorFlow.js / ONNX Runtime Web — Run the model in the browser using WebGL or WebGPU for acceleration. Works for smaller models (~10–50MB).

  2. Custom WASM inference — For maximum control, compile a minimal inference engine to WebAssembly. This gives you predictable performance but requires more engineering.

The key constraint is model size. Users won’t wait to download a 500MB model. Techniques like quantization (float32 → int8), pruning, and knowledge distillation can reduce model size by 4–10x with minimal quality loss.
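Quantization is easy to demonstrate in isolation. This NumPy sketch applies symmetric linear int8 quantization to a toy weight matrix; real toolchains (TensorFlow Lite, ONNX Runtime) do this per-tensor or per-channel, often with calibration data:

```python
import numpy as np

rng = np.random.default_rng(0)
weights = rng.normal(0.0, 0.1, size=(1024, 1024)).astype(np.float32)

# Symmetric linear quantization: map [-max|w|, +max|w|] onto int8
scale = np.abs(weights).max() / 127.0
q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
dequant = q.astype(np.float32) * scale

# 4x smaller (1 byte per weight instead of 4), and the worst-case
# round-trip error is bounded by half a quantization step
print(weights.nbytes // q.nbytes)                           # 4
print(np.abs(weights - dequant).max() <= scale / 2 + 1e-6)  # True
```

The 4x figure comes straight from the storage format; the extra savings quoted above come from combining quantization with pruning and distillation.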

Core ML (iOS) and TensorFlow Lite (Android) both support real-time audio inference. The pattern:

  1. Capture audio in real-time (AudioEngine / AudioRecord)
  2. Buffer to the model’s expected input size
  3. Run inference on each buffer
  4. Act on the results (display classification, apply enhancement)
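Step 2 is the same on either platform, so here is a framework-agnostic Python sketch of the buffering logic; the 15,600-sample frame size and 1,024-sample callback size are hypothetical examples, not values from any particular model:

```python
import numpy as np

class FrameBuffer:
    """Accumulates arbitrary-sized capture callbacks into fixed model inputs."""
    def __init__(self, frame_size):
        self.frame_size = frame_size
        self._pending = np.zeros(0, dtype=np.float32)

    def push(self, samples):
        """Append one callback's samples; return any complete frames."""
        self._pending = np.concatenate([self._pending, samples])
        frames = []
        while len(self._pending) >= self.frame_size:
            frames.append(self._pending[:self.frame_size])
            self._pending = self._pending[self.frame_size:]
        return frames

# Suppose the model expects 15600-sample windows (~1 s at 16 kHz)
# and the capture callback delivers 1024-sample buffers
buf = FrameBuffer(15600)
ready = []
for _ in range(32):
    ready.extend(buf.push(np.zeros(1024, dtype=np.float32)))
print(len(ready))  # 32 * 1024 = 32768 samples -> 2 full frames
```

Leftover samples stay pending for the next callback, so no audio is dropped at frame boundaries. A production version would run inference on a separate thread so the audio callback never blocks.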

Start with existing models. Don’t train from scratch unless you have a unique dataset or specific requirements. Open-source models for source separation (Demucs), speech enhancement (RNNoise), and audio classification (PANNs) are excellent starting points.

Measure latency, not just accuracy. A model with 95% accuracy at 500ms latency is useless for real-time applications. Profile on your target hardware early.

Consider the cold start. ML models need to load weights into memory. On mobile, this can take 1–3 seconds. On the web, model download time dominates. Design your UX around this.

Hybrid approaches work well. Use ML for the hard part (separation, classification) and traditional DSP for everything else (filtering, dynamics, mixing). Don’t put an ML model where a simple filter would do.


Need ML-powered audio features in your product? From source separation to real-time classification, I build production ML audio systems. View services → or let’s talk about your project →.