Sangam
Technical notes

Inside the analysis pipeline.

Sangam analyses each uploaded song with a signal-processing pipeline; a chat agent then proposes a mashup plan from that analysis. The agent reads only a short text summary of each song, and the raw audio stays on the server. The renderer applies the plan with time-stretching, pitch-shifting, and EQ-crossfading to produce the final mp3.

22,050 samples per second, mono
~4M floats in a 3-minute song
6 features extracted per song
7 stages in the pipeline
01 / DSP

Decoding the audio

mp3 → numpy float32
01
decode · pydub + ffmpeg

The audio file becomes a 1D array of floats.

Pydub (via ffmpeg) decodes the mp3 and resamples it to 22,050 Hz mono. The samples come in as int16, then get divided by 2¹⁵ to produce a float32 array in the range [−1, 1]. All later stages operate on this array. Sangam doesn't separate the song into stems; it works on the summed waveform.

seg = seg.set_channels(1).set_frame_rate(22050)
samples = np.array(seg.get_array_of_samples(), dtype=np.float32)
samples /= 2**(8 * seg.sample_width - 1)  # int16 full scale is 2**15
▶ Waveform, first 0.6 s (amplitude axis from −1.0 to +1.0)
samples[:10] = [0.000, 0.124, 0.318, 0.441, 0.502, 0.418, 0.221, −0.097, −0.348, −0.501]  ·  dtype float32  ·  shape (3969000,)
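The normalisation step can be checked in isolation. The sketch below uses plain numpy on a hand-made int16 array instead of a decoded file; the helper name `pcm_to_float32` is illustrative and is not taken from Sangam's source.

```python
import numpy as np


def pcm_to_float32(pcm: np.ndarray, sample_width: int = 2) -> np.ndarray:
    """Scale signed integer PCM samples to float32 in [-1, 1)."""
    # Full scale for signed PCM: 2**15 = 32768 when sample_width is 2 bytes.
    full_scale = 2 ** (8 * sample_width - 1)
    return pcm.astype(np.float32) / full_scale


# Hand-made int16 samples covering zero, half scale and both extremes.
pcm = np.array([0, 16384, -16384, 32767, -32768], dtype=np.int16)
samples = pcm_to_float32(pcm)
print(samples, samples.dtype)
```

Dividing by 2¹⁵ maps the most negative int16 value to exactly −1.0, while the most positive value lands just below +1.0.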
02 / DSP

Estimating tempo

librosa.beat.beat_track
▶ Onset envelope + beat grid (lag ≈ 0.50 s)
The autocorrelation peaks at a lag of 0.500 s, so the tempo is 60 ÷ 0.5 = 120 BPM. Dynamic programming then places evenly spaced beats on local onset maxima.
02
BPM · librosa

Tempo is the period of repetition in the onset envelope.

Librosa first computes an onset envelope — a signal whose value rises whenever something new starts in the audio. It comes from the STFT: take the time difference of each frequency bin, keep only positive changes, sum across frequencies. Autocorrelation of this envelope then reveals the beat period as the lag at which the signal matches itself most strongly. A final dynamic-programming step snaps beats to nearby onset peaks while keeping them evenly spaced.

tempo, beats = librosa.beat.beat_track(y=samples, sr=22050)
# confidence proxy: 1 minus the coefficient of variation of the beat gaps
gaps = np.diff(beats)
bpm_conf = 1 - np.std(gaps) / np.mean(gaps)
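To make the autocorrelation idea concrete, here is a minimal numpy sketch that recovers tempo from a synthetic onset envelope and computes the same regularity proxy. It is not librosa's implementation: the function names, the 100 frames-per-second envelope rate, and the 60–200 BPM search range are all assumptions chosen for illustration.

```python
import numpy as np


def tempo_from_envelope(env, frames_per_second, min_bpm=60.0, max_bpm=200.0):
    """Estimate BPM as the lag where the envelope best matches itself."""
    env = env - env.mean()
    # Keep the non-negative lags of the full autocorrelation (lag 0 .. N-1).
    ac = np.correlate(env, env, mode="full")[len(env) - 1:]
    min_lag = int(np.ceil(frames_per_second * 60.0 / max_bpm))
    max_lag = int(np.floor(frames_per_second * 60.0 / min_bpm))
    lag = min_lag + int(np.argmax(ac[min_lag:max_lag + 1]))
    return 60.0 * frames_per_second / lag


def beat_regularity(beat_times):
    """1 minus the coefficient of variation of the gaps between beats."""
    gaps = np.diff(beat_times)
    return 1.0 - gaps.std() / gaps.mean()


# Synthetic envelope: one onset every 0.5 s at 100 envelope frames per second.
frames_per_second = 100.0
env = np.zeros(1000)
env[::50] = 1.0

bpm = tempo_from_envelope(env, frames_per_second)
conf = beat_regularity(np.arange(0.0, 10.0, 0.5))
print(bpm, conf)
```

Restricting the lag search to a plausible BPM range is one simple way to avoid picking lag 0 or a multiple of the true beat period; a perfectly even beat grid gives a regularity of 1.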
03 / DSP

Estimating key

chroma cqt · Krumhansl & Kessler 1982
03
key detection · librosa + numpy

Twelve pitch classes, compared against 24 reference profiles.

The constant-Q transform produces log-spaced frequency bins, one per semitone. Folding the octaves on top of each other gives a 12-element chroma vector — the share of energy at each pitch class. Sangam then compares this vector against 24 reference profiles (12 major and 12 minor keys) measured by Krumhansl and Kessler in 1982 by asking listeners how well each note fit a given key context. The profile with the highest Pearson correlation is reported as the song's key.

chroma = librosa.feature.chroma_cqt(y=samples, sr=sr).mean(axis=1)
# 24 templates → highest Pearson r wins (major shown; the 12 minor templates are scored the same way)
best_score = max(np.corrcoef(np.roll(_MAJOR, i), chroma)[0, 1] for i in range(12))
▶ Chroma profile vs. F♯ minor template
Pitch-class axis: C, C#, D, D#, E, F, F#, G, G#, A, A#, B
Series: observed chroma, F♯ minor template  ·  winning key: F♯:min
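The template comparison can be sketched in a few lines of numpy. The profile numbers below are the commonly published Krumhansl-Kessler values; whether Sangam's `_MAJOR` array holds exactly these numbers is an assumption, and `detect_key` is an illustrative name.

```python
import numpy as np

PITCH_CLASSES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

# Krumhansl-Kessler probe-tone profiles, tonic at index 0 (assumed values).
KK_MAJOR = np.array([6.35, 2.23, 3.48, 2.33, 4.38, 4.09,
                     2.52, 5.19, 2.39, 3.66, 2.29, 2.88])
KK_MINOR = np.array([6.33, 2.68, 3.52, 5.38, 2.60, 3.53,
                     2.54, 4.75, 3.98, 2.69, 3.34, 3.17])


def detect_key(chroma):
    """Return (key label, Pearson r) for the best of the 24 rotated templates."""
    best_key, best_r = None, -np.inf
    for mode, profile in (("maj", KK_MAJOR), ("min", KK_MINOR)):
        for tonic in range(12):
            # np.roll moves the tonic weight from index 0 to index `tonic`.
            r = np.corrcoef(np.roll(profile, tonic), chroma)[0, 1]
            if r > best_r:
                best_key, best_r = f"{PITCH_CLASSES[tonic]}:{mode}", r
    return best_key, float(best_r)


# A chroma vector shaped exactly like the minor profile rooted on F# (index 6).
key, score = detect_key(np.roll(KK_MINOR, 6))
print(key, round(score, 3))
```

Pearson correlation ignores overall level and offset, so the chroma vector does not need to be normalised before the comparison.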
04 / DSP

Measuring energy

root mean square · per-second
▶ rms_curve_db over a 3:38 song (axis from −6 dB to −25 dB; sections: intro, build, hook, drop, breakdown, outro)
Section averages of the per-second RMS, in dB, from intro to outro: −22.4 → −14.1 → −9.7 → −7.2 → −15.9 → −18.6. The drop is about 15 dB louder than the intro.
04
energy · pure numpy

RMS energy, summarised second by second.

The samples are split into one-second windows. For each window the code computes √(mean(x²)) and converts the result to dB. The resulting curve is overlaid on the detected section boundaries, so each section gets an average loudness. The segment picker uses these averages to tell loud bars (likely hooks or drops) apart from quieter ones.

for w in windowed(samples, sr):  # consecutive one-second windows of sr samples
  rms = np.sqrt(np.mean(w**2))
  rms_db = 20 * np.log10(rms + 1e-9)  # 1e-9 avoids log10(0) on silence
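A vectorised version of the loop above, plus the per-section averaging the text describes, might look like this. It is a sketch under stated assumptions: the trailing partial second is dropped, section loudness is the plain mean of the per-second dB values, and both function names are illustrative.

```python
import numpy as np


def rms_curve_db(samples, sr):
    """Per-second RMS level in dB; any trailing partial second is dropped."""
    n_windows = len(samples) // sr
    windows = samples[: n_windows * sr].reshape(n_windows, sr)
    rms = np.sqrt(np.mean(windows.astype(np.float64) ** 2, axis=1))
    return 20.0 * np.log10(rms + 1e-9)  # 1e-9 guards against log10(0)


def section_avg_db(curve_db, start_s, end_s):
    """Mean of the per-second dB values inside [start_s, end_s)."""
    return float(np.mean(curve_db[start_s:end_s]))


# Test signal: 2 s of a full-scale 441 Hz sine, 2 s at one tenth the
# amplitude, then 100 leftover samples that do not fill a whole second.
sr = 22050
t = np.arange(4 * sr + 100) / sr
tone = np.sin(2 * np.pi * 441 * t)
tone[2 * sr:] *= 0.1
curve = rms_curve_db(tone.astype(np.float32), sr)
loud_avg = section_avg_db(curve, 0, 2)
print(np.round(curve, 2), round(loud_avg, 2))
```

A full-scale sine has an RMS of 1/√2, or about −3 dB, and scaling the amplitude by 0.1 lowers the level by 20 dB, which gives a quick sanity check on the curve.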
05 / DSP

Scoring the downbeats

onset strength · ±100 ms window
05
onset · librosa

Each downbeat gets a punch score.

Every fourth beat from the beat tracker is treated as a downbeat. The code reads the onset envelope in a ±100 ms window around each downbeat and stores the maximum value. Bars where the downbeat lands on a kick drum produce higher scores than bars where it lands on a softer instrument. The segment picker uses these scores when choosing cut points, so transitions tend to land on bars with a clear attack.

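onset_env = librosa.onset.onset_strength(y=samples, sr=sr)
times = librosa.times_like(onset_env, sr=sr)  # time of each envelope frame, in seconds
for d in downbeats:  # assumes downbeat times in seconds
  in_window = (times >= d - 0.1) & (times <= d + 0.1)  # ±100 ms
  score = onset_env[in_window].max()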
▶ Onset strength near 4 downbeats (±100 ms windows around downbeats)
Scores for bars 1 to 4: 0.92, 0.61, 0.18, 0.88
Bars 1 and 4 score above 0.85 and become cut candidates; bar 3 (0.18) is rejected. The downbeat grid is the same for all four bars — the onset score is what differs.
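The windowed maximum can be tried on a synthetic envelope without any audio. In this sketch the envelope frame rate (22,050 ÷ 512, librosa's default hop), the helper name `punch_scores`, and the 0.85 threshold read off the figure are assumptions, and downbeats are assumed to fall inside the envelope.

```python
import numpy as np


def punch_scores(onset_env, downbeat_times_s, frames_per_second, half_window_s=0.1):
    """Maximum onset strength within +/- half_window_s of each downbeat."""
    half = int(round(half_window_s * frames_per_second))
    scores = []
    for t in downbeat_times_s:
        centre = int(round(t * frames_per_second))
        lo = max(0, centre - half)
        hi = min(len(onset_env), centre + half + 1)
        scores.append(float(onset_env[lo:hi].max()))
    return scores


frames_per_second = 22050 / 512          # about 43 envelope frames per second
env = np.zeros(400)
env[87], env[170], env[258], env[344] = 0.92, 0.61, 0.18, 0.88
env[100] = 1.0                           # strong onset far from any downbeat

scores = punch_scores(env, [2.0, 4.0, 6.0, 8.0], frames_per_second)
candidates = [i for i, s in enumerate(scores) if s > 0.85]
print(scores, candidates)
```

The strong onset at frame 100 is ignored because it falls outside every ±100 ms window, which is the point of scoring only around the downbeat grid.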
06 / DSP

Rendering the mashup

pyrubberband · scipy.signal
▶ EQ crossover at 250 Hz · Butterworth 4th-order
Frequency axis: 20 Hz, 250 Hz (crossover), 2 kHz, 20 kHz  ·  low band ▸ A's bass  ·  high band ▸ B's treble
sosfiltfilt applies the filter forward and then backward, which cancels the phase shift either pass would introduce on its own. The filtered low and high halves are then summed to produce the crossfade.
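The forward-backward crossover described above could be sketched as follows. This is an illustration with scipy, not Sangam's renderer: `eq_crossover` is a hypothetical name, and the two test tones stand in for the outgoing and incoming clips.

```python
import numpy as np
from scipy import signal


def eq_crossover(outgoing, incoming, sr, cutoff_hz=250.0, order=4):
    """Keep the lows of the outgoing clip and the highs of the incoming clip."""
    wn = cutoff_hz / (sr / 2)  # cutoff as a fraction of the Nyquist frequency
    low_sos = signal.butter(order, wn, "low", output="sos")
    high_sos = signal.butter(order, wn, "high", output="sos")
    # sosfiltfilt runs each filter forward and backward (zero phase).
    low = signal.sosfiltfilt(low_sos, outgoing)
    high = signal.sosfiltfilt(high_sos, incoming)
    return low + high


sr = 22050
t = np.arange(sr) / sr
bass = np.sin(2 * np.pi * 60 * t)            # stands in for A's bass
treble = 0.5 * np.sin(2 * np.pi * 4000 * t)  # stands in for B's treble

mixed = eq_crossover(bass, treble, sr)       # both tones pass through
swapped = eq_crossover(treble, bass, sr)     # both tones are filtered out
print(np.abs(mixed).max(), np.abs(swapped).max())
```

Swapping the arguments shows the filters at work: the treble tone does not survive the low-pass and the bass tone does not survive the high-pass, so the result is close to silence.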
06
render · rubberband + scipy

Time-stretch, pitch-shift, then EQ-crossfade.

Pyrubberband wraps the rubberband CLI, which uses a phase vocoder to change duration without changing pitch. Sangam calls time_stretch with a rate equal to the target BPM divided by the clip's source BPM (a rate below 1 slows the clip down), then pitch_shift by a number of semitones to align the keys. The crossfade is built from two 4th-order Butterworth filters at 250 Hz: one keeps the low frequencies of the outgoing clip, the other keeps the high frequencies of the incoming clip. The two halves are summed to produce the join.

stretched = pyrb.time_stretch(b, sr, rate=128/142)  # rate < 1 slows the clip, e.g. 142 → 128 BPM
shifted = pyrb.pitch_shift(stretched, sr, n_steps=2)
sos = sps.butter(4, 250/(sr/2), 'low', output='sos')
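The two numbers passed to pyrubberband are plain arithmetic, sketched below in pure Python. The helper names are hypothetical, the key strings follow the "F#:min" format used in the prompt, and choosing the shortest direction around the circle of twelve semitones (with a tritone resolved upward) is an assumption about how the shift is picked.

```python
PITCH_CLASSES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def stretch_rate(source_bpm: float, target_bpm: float) -> float:
    """Playback rate that brings a clip from source_bpm to target_bpm."""
    return target_bpm / source_bpm


def semitone_shift(source_key: str, target_key: str) -> int:
    """Smallest signed semitone move from the source tonic to the target tonic."""
    src = PITCH_CLASSES.index(source_key.split(":")[0])
    dst = PITCH_CLASSES.index(target_key.split(":")[0])
    diff = (dst - src) % 12
    # Moves larger than a tritone are shorter in the downward direction.
    return diff - 12 if diff > 6 else diff


rate = stretch_rate(source_bpm=142.0, target_bpm=128.0)
steps = semitone_shift("E:min", "F#:min")
print(round(rate, 4), steps)
```

This sketch only aligns tonics; it does not try to reconcile a major key with a minor one.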
Python libraries in the pipeline

The stack.

pydub + ffmpeg · Decode mp3 → AudioSegment, set channels/sample rate, measure dBFS. · sangam.py:349
numpy · The sample array itself, plus every arithmetic op: RMS, dB, normalisation. · everywhere
librosa · Beat tracking via onset envelope + autocorrelation. Chroma CQT. Onset strength. · :353, :383, :434
scipy.signal · Butterworth filter design, zero-phase forward-backward filtering. · :1376–1381
pyrubberband · Phase-vocoder time-stretch and pitch-shift (CLI subprocess). · :1046
soundfile · Write the final mixdown to disk as wav / mp3. · :1315
07 / AI

What the chat agent reads

backend/app/chat.py:412

The agent receives one short paragraph per song.

Each paragraph lists the song's BPM, key, duration, detected sections (label, start and end time, bar count, and average loudness in dB), and any pre-fetched web metadata. The audio array and the curves derived from it (RMS, chroma, onset envelope, downbeats) are kept on the server and used only by the segment picker and the renderer.

YES · What reaches the LLM
bpm: single scalar
key: "F#:min"
duration_s: scalar
fine_sections_s: [start, end, label, bars]
sections_s: hard-constraint list
section avg dB: computed server-side
web_metadata: composer, year, film
NO · What stays server-side
samples: 4M floats
rms_curve_db: ~180 floats / song
downbeats_s: used by segment picker
onset_strength_at_downbeats: cut-point weighting
chroma vector: used only for key
FFT / STFT: intermediate, not stored
audio bytes: kept in the worker
PROMPT · The exact text the model receives
## Uploaded files (in order)
Each file is shown with its `upload_index` ...

[upload_index=0] dilbar.mp3 — 218s, 116 BPM, key F#:min
   sections (6 detected):
     • intro     0:00-0:16  (-22.4 dB avg, 8 bars)
     • build     0:16-0:48  (-14.1 dB avg, 16 bars)
     • hook      0:48-1:20  ( -9.7 dB avg, 16 bars)
     • drop      1:20-1:52  ( -7.2 dB avg, 16 bars)
     • breakdown 1:52-2:24  (-15.9 dB avg, 16 bars)
     • outro     2:24-3:38  (-18.6 dB avg, 32 bars)
   web metadata: composer: Tanishk Bagchi; singer: Neha Kakkar;
                 album/film: Satyameva Jayate; year: 2018

## ONLY THESE SECTIONS EXIST
[upload_index=0] dilbar.mp3 — 6 sections bounded at
  0s, 16s, 48s, 80s, 112s, 144s, 218s.
These are the only timestamps you may cite as section boundaries.
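For illustration, the section lines and the boundary list in a prompt like the one above could be generated from a list of section tuples. The tuple layout `(label, start_s, end_s, avg_db, bars)` and both helper names are assumptions; the real renderer lives in backend/app/chat.py and may be structured differently.

```python
def fmt_time(seconds: int) -> str:
    """Format whole seconds as m:ss."""
    return f"{seconds // 60}:{seconds % 60:02d}"


def render_sections(sections):
    """Render one bullet line per section, under a count header."""
    lines = [f"   sections ({len(sections)} detected):"]
    for label, start_s, end_s, avg_db, bars in sections:
        lines.append(
            f"     • {label:<9} {fmt_time(start_s)}-{fmt_time(end_s)}"
            f"  ({avg_db:5.1f} dB avg, {bars} bars)"
        )
    return "\n".join(lines)


def section_boundaries(sections):
    """Sorted, de-duplicated start and end times of all sections."""
    return sorted({t for _, start_s, end_s, _, _ in sections for t in (start_s, end_s)})


sections = [
    ("intro", 0, 16, -22.4, 8),
    ("build", 16, 48, -14.1, 16),
    ("hook", 48, 80, -9.7, 16),
    ("drop", 80, 112, -7.2, 16),
    ("breakdown", 112, 144, -15.9, 16),
    ("outro", 144, 218, -18.6, 32),
]
text = render_sections(sections)
boundaries = section_boundaries(sections)
print(text)
print(boundaries)
```

Deriving the boundary list from the same tuples that produce the bullet lines keeps the two parts of the prompt from drifting apart.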
WHY · Constrained timestamps

The prompt ends with a list of the only section boundaries the agent is allowed to cite. If the agent proposes a cut at a timestamp that isn't in that list, the segment picker rejects the plan during validation. The prompt also avoids including raw per-second curves, so there are fewer numerical details for the agent to misread.
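The validation rule can be sketched as a small check over the plan. The plan schema used here (a list of dicts with `upload_index`, `start_s` and `end_s`) and the function name are hypothetical; the source only says that cuts at unlisted timestamps are rejected.

```python
def validate_plan(segments, allowed_boundaries):
    """Return a list of problems; an empty list means the plan passes."""
    problems = []
    for i, seg in enumerate(segments):
        allowed = allowed_boundaries.get(seg["upload_index"])
        if allowed is None:
            problems.append(f"segment {i}: unknown upload_index {seg['upload_index']}")
            continue
        for field in ("start_s", "end_s"):
            if seg[field] not in allowed:
                problems.append(f"segment {i}: {field}={seg[field]} is not a section boundary")
        if seg["start_s"] >= seg["end_s"]:
            problems.append(f"segment {i}: start_s must be before end_s")
    return problems


# Boundaries for upload 0, copied from the prompt's constraint list.
allowed = {0: {0, 16, 48, 80, 112, 144, 218}}
good_plan = [{"upload_index": 0, "start_s": 48, "end_s": 80}]
bad_plan = [{"upload_index": 0, "start_s": 50, "end_s": 80}]

print(validate_plan(good_plan, allowed))
print(validate_plan(bad_plan, allowed))
```

Returning every problem at once, instead of failing on the first, would let the agent be told exactly which timestamps to fix in a single retry.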

FLOW · Agent decides, renderer executes

The agent's job is to choose which sections to splice and what target BPM and key to aim for. The renderer then reads the resulting plan and does the time-stretching, pitch-shifting, EQ-crossfading, and writing. Given the same input songs and the same plan, the renderer produces the same mp3.

The whole pipeline

audio in (.mp3 × 2–6)
→ DSP pipeline (pydub · librosa · numpy; stages 1–6)
→ SongSummary (bpm · key · sections · rms · downbeats · onset · meta)
→ text render (only 4 fields survive, + avg dB summarised)
→ LLM (Agno; reads prompt; chats with the user; reads text, not audio)
→ plan (segments[] · target bpm · target key · transitions)
→ renderer (pyrubberband · scipy; stretch · shift · EQ-cross)
→ mashup.mp3 out (final output)
UPLOAD → ANALYSE → PLAN → RENDER