Technical notes

Inside the analysis pipeline.

Sangam analyses each uploaded song with a signal-processing pipeline; a chat agent then proposes a mashup plan from that analysis. The agent reads only a short text summary of each song, and the raw audio stays on the server. The renderer applies the plan with time-stretching, pitch-shifting, and EQ-crossfading to produce the final mp3.

22,050 samples per second, mono

~4M floats in a 3-minute song

features extracted per song

stages in the pipeline

01 / DSP

Decoding the audio

mp3 → numpy float32

decode · pydub + ffmpeg

The audio file becomes a 1D array of floats.

Pydub (via ffmpeg) decodes the mp3 and resamples it to 22,050 Hz mono. The samples come in as int16, then get divided by 2¹⁵ to produce a float32 array in the range [−1, 1]. All later stages operate on this array. Sangam doesn't separate the song into stems; it works on the summed waveform.

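seg = seg.set_channels(1).set_frame_rate(22050)  # pydub AudioSegment: mono, 22,050 Hz
samples = np.array(seg.get_array_of_samples(), dtype=np.float32)
samples /= 2 ** (8 * seg.sample_width - 1)  # int16 → [-1, 1]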
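The scaling step can be checked without an mp3 at hand. The sketch below is a minimal illustration, assuming the decoded PCM is already available as an integer NumPy array; `pcm_to_float32` is a hypothetical helper name, not a function from sangam.py.

```python
import numpy as np

def pcm_to_float32(pcm: np.ndarray, sample_width: int) -> np.ndarray:
    """Scale signed integer PCM to float32 in [-1, 1).

    sample_width is bytes per sample (2 for int16), the same meaning as
    pydub's AudioSegment.sample_width.
    """
    full_scale = 2 ** (8 * sample_width - 1)  # 32768 for 16-bit audio
    return pcm.astype(np.float32) / np.float32(full_scale)

pcm = np.array([0, 16384, -32768, 32767], dtype=np.int16)
samples = pcm_to_float32(pcm, sample_width=2)
```

The most negative int16 value maps to exactly −1, while the most positive one lands just below +1, because the int16 range is asymmetric.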

▶ Waveform — first 0.6 s

samples[:10] = [0.000, 0.124, 0.318, 0.441, 0.502, 0.418, 0.221, −0.097, −0.348, −0.501] · dtype float32 · shape (3969000,)

02 / DSP

Estimating tempo

librosa.beat.beat_track

▶ Onset envelope + beat grid

Autocorrelation peak at a lag of 0.500 s → 60 ÷ 0.5 = 120 BPM. Dynamic programming then places beats on local onset maxima while keeping them close to evenly spaced.

BPM · librosa

Tempo is the period of repetition in the onset envelope.

Librosa first computes an onset envelope: a signal whose value rises whenever something new starts in the audio. It comes from a spectrogram (by default librosa uses a log-power mel spectrogram computed from the STFT): take the frame-to-frame difference in each frequency band, keep only the increases, and average across bands. Autocorrelation of this envelope then reveals the beat period as the lag at which the signal matches itself most strongly. A final dynamic-programming step snaps beats to nearby onset peaks while keeping them evenly spaced.
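The autocorrelation step can be illustrated on a synthetic envelope. This is a minimal NumPy sketch of the idea only: librosa's own estimator adds refinements such as a tempo prior, and the frame rate, BPM search range, and function name here are assumptions chosen for the example.

```python
import numpy as np

def bpm_from_envelope(env: np.ndarray, fps: float,
                      min_bpm: float = 60.0, max_bpm: float = 200.0) -> float:
    """Estimate tempo as the strongest autocorrelation lag of an onset envelope."""
    n = len(env)
    # Autocorrelation for non-negative lags: ac[k] = sum_t env[t] * env[t + k].
    ac = np.correlate(env, env, mode="full")[n - 1:]
    # Only search lags that correspond to plausible tempi.
    lo = int(np.ceil(fps * 60.0 / max_bpm))
    hi = int(np.floor(fps * 60.0 / min_bpm))
    best_lag = lo + int(np.argmax(ac[lo:hi + 1]))
    return 60.0 * fps / best_lag

fps = 100.0            # hypothetical envelope frame rate (frames per second)
env = np.zeros(1000)   # ten seconds of envelope
env[::50] = 1.0        # one onset every 0.5 s
bpm = bpm_from_envelope(env, fps)
```

Restricting the lag search to a tempo range matters: a pulse train also matches itself at twice the beat period, and the range keeps the estimate from drifting to half or double tempo.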

tempo, beats = librosa.beat.beat_track(y=samples, sr=22050)  # beats are frame indices
# confidence proxy: 1 − (std ÷ mean) of the inter-beat intervals
ibi = np.diff(beats)
bpm_conf = 1 - np.std(ibi) / np.mean(ibi)
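To see how the confidence proxy behaves, it helps to feed it two hand-made beat grids. The helper name below is hypothetical, and the beat times are invented for illustration rather than taken from a real track.

```python
import numpy as np

def beat_regularity(beat_times) -> float:
    """1 minus the coefficient of variation of the inter-beat intervals."""
    intervals = np.diff(np.asarray(beat_times, dtype=float))
    return float(1.0 - np.std(intervals) / np.mean(intervals))

steady = beat_regularity([0.0, 0.5, 1.0, 1.5, 2.0])  # perfectly even grid
wobbly = beat_regularity([0.0, 0.4, 1.0, 1.4, 2.0])  # alternating 0.4 s / 0.6 s
```

An even grid scores 1.0, and the score falls as the intervals scatter. Because the measure is a ratio, it gives the same value whether beats are expressed in frames or in seconds.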

03 / DSP

Estimating key

chroma cqt · Krumhansl & Kessler 1982

key detection · librosa + numpy

Twelve pitch classes, compared against 24 reference profiles.

The constant-Q transform produces log-spaced frequency bins, one per semitone. Folding the octaves on top of each other gives a 12-element chroma vector — the share of energy at each pitch class. Sangam then compares this vector against 24 reference profiles (12 major and 12 minor keys) measured by Krumhansl and Kessler in 1982 by asking listeners how well each note fit a given key context. The profile with the highest Pearson correlation is reported as the song's key.

chroma = librosa.feature.chroma_cqt(y=samples, sr=sr).mean(axis=1)
# 24 templates → highest Pearson r wins (assumes a _MINOR profile alongside _MAJOR)
best_score = max(np.corrcoef(np.roll(p, i), chroma)[0, 1]
                 for p in (_MAJOR, _MINOR) for i in range(12))
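A fuller version of the template match, returning the key label as well as the score, might look like the sketch below. The profile values are the commonly published Krumhansl-Kessler ratings; the function name and the `"F#:min"` label format (borrowed from the prompt example) are assumptions, not code from sangam.py.

```python
import numpy as np

# Krumhansl-Kessler probe-tone profiles, tonic first.
KK_MAJOR = np.array([6.35, 2.23, 3.48, 2.33, 4.38, 4.09,
                     2.52, 5.19, 2.39, 3.66, 2.29, 2.88])
KK_MINOR = np.array([6.33, 2.68, 3.52, 5.38, 2.60, 3.53,
                     2.54, 4.75, 3.98, 2.69, 3.34, 3.17])
PITCH_CLASSES = ["C", "C#", "D", "D#", "E", "F",
                 "F#", "G", "G#", "A", "A#", "B"]

def estimate_key(chroma: np.ndarray) -> tuple[str, float]:
    """Return (key label, Pearson r) for the best of the 24 rotated templates."""
    best_label, best_r = "", -2.0
    for mode, profile in (("maj", KK_MAJOR), ("min", KK_MINOR)):
        for tonic in range(12):
            # np.roll moves the tonic weight from index 0 to index `tonic`.
            r = np.corrcoef(np.roll(profile, tonic), chroma)[0, 1]
            if r > best_r:
                best_label, best_r = f"{PITCH_CLASSES[tonic]}:{mode}", r
    return best_label, float(best_r)

# A chroma vector shaped exactly like the F# minor template.
label, r = estimate_key(np.roll(KK_MINOR, 6))
```

Pearson correlation ignores overall scale and offset, so the chroma vector does not need to be normalised before the comparison.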

▶ Chroma profile vs. F♯ minor template

Legend: observed chroma · F♯ minor template · winning key: F♯:min

04 / DSP

Measuring energy

root mean square · per-second

▶ rms_curve_db over a 3:38 song

Section-average RMS, in dB, from intro to outro: −22.4 → −14.1 → −9.7 → −7.2 → −15.9 → −18.6. The drop (−7.2 dB) is about 15 dB louder than the intro (−22.4 dB).

energy · pure numpy

RMS energy, summarised second by second.

The samples are split into one-second windows. For each window the code computes √(mean(x²)) and converts the result to dB. The resulting curve is overlaid on the detected section boundaries, so each section gets an average loudness. The segment picker uses these averages to tell loud bars (likely hooks or drops) apart from quieter ones.

for w in windowed(samples, sr):  # one-second windows of sr samples each
    rms = np.sqrt(np.mean(w ** 2))
    rms_db = 20 * np.log10(rms + 1e-9)  # epsilon avoids log10(0) on silence
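The windowing and the per-section averaging can be sketched together in NumPy. Two assumptions are made here that the text does not confirm: a trailing partial second is dropped, and a section's loudness is the plain mean of its per-second dB values. Both function names are hypothetical.

```python
import numpy as np

def rms_curve_db(samples: np.ndarray, sr: int) -> np.ndarray:
    """RMS level in dB for each full one-second window."""
    n_windows = len(samples) // sr          # a trailing partial second is dropped
    windows = samples[:n_windows * sr].reshape(n_windows, sr)
    rms = np.sqrt(np.mean(windows ** 2, axis=1))
    return 20.0 * np.log10(rms + 1e-9)      # epsilon guards against log10(0)

def section_avg_db(curve_db: np.ndarray, start_s: int, end_s: int) -> float:
    """Assumption: section loudness is the mean of its per-second dB values."""
    return float(np.mean(curve_db[start_s:end_s]))

sr = 100                                    # tiny rate keeps the example fast
quiet = np.full(2 * sr, 0.1)                # 2 s at amplitude 0.1, about -20 dB
loud = np.full(2 * sr, 0.5)                 # 2 s at amplitude 0.5, about -6 dB
curve = rms_curve_db(np.concatenate([quiet, loud]), sr)
loud_avg = section_avg_db(curve, 2, 4)
```

A 3:38 song yields 218 values this way, which is small enough to overlay on section boundaries cheaply.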

05 / DSP

Scoring the downbeats

onset strength · ±100 ms window

onset · librosa

Each downbeat gets a punch score.

Every fourth beat from the beat tracker is treated as a downbeat. The code reads the onset envelope in a ±100 ms window around each downbeat and stores the maximum value. Bars where the downbeat lands on a kick drum produce higher scores than bars where it lands on a softer instrument. The segment picker uses these scores when choosing cut points, so transitions tend to land on bars with a clear attack.

onset_env = librosa.onset.onset_strength(y=samples, sr=sr)
for d in downbeats:  # downbeat times, in seconds
    lo, hi = librosa.time_to_frames([d - 0.1, d + 0.1], sr=sr)  # ±100 ms
    score = onset_env[max(lo, 0):hi + 1].max()
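The scoring loop can be tried on a synthetic envelope with no audio at all. This sketch assumes the first tracked beat is a downbeat and uses a made-up envelope frame rate; `punch_scores` is a hypothetical name.

```python
import numpy as np

def punch_scores(onset_env, fps, beat_times, beats_per_bar=4, window_s=0.1):
    """Peak onset strength within +/- window_s of every beats_per_bar-th beat."""
    downbeats = beat_times[::beats_per_bar]  # assumes the first beat is a downbeat
    scores = []
    for t in downbeats:
        lo = max(int(round((t - window_s) * fps)), 0)
        hi = min(int(round((t + window_s) * fps)) + 1, len(onset_env))
        scores.append(float(np.max(onset_env[lo:hi])))
    return scores

fps = 100.0                       # hypothetical envelope frame rate
beat_times = np.arange(8) * 0.5   # 120 BPM, two bars of four beats
onset_env = np.full(400, 0.1)     # quiet background level
onset_env[3] = 0.9                # strong hit 30 ms after the first downbeat
onset_env[215] = 0.8              # hit 150 ms after the second: outside the window
scores = punch_scores(onset_env, fps, beat_times)
```

The second bar scores low even though a strong hit is nearby, which shows how the ±100 ms window keeps late or early attacks from counting toward a downbeat.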

▶ Onset strength near 4 downbeats

Bars 1 and 4 score above 0.85 and become cut candidates; bar 3 (0.18) is rejected. The downbeat grid is the same for all four bars — the onset score is what differs.

06 / DSP

Rendering the mashup

pyrubberband · scipy.signal

▶ EQ crossover at 250 Hz · Butterworth 4th-order

sosfiltfilt applies the filter forward and then backward, which cancels the phase shift either pass would introduce on its own. The filtered low and high halves are then summed to produce the crossfade.

render · rubberband + scipy

Time-stretch, pitch-shift, then EQ-crossfade.

Pyrubberband wraps the rubberband CLI, which uses a phase vocoder to change duration without changing pitch. Sangam calls time_stretch with a playback rate equal to the target BPM divided by the source BPM (a rate below 1 slows the clip down), then pitch_shift with a signed number of semitones to align the keys. The crossfade is built from two 4th-order Butterworth filters at 250 Hz: one keeps the low frequencies of the outgoing clip, the other keeps the high frequencies of the incoming clip. The two halves are summed to produce the join.
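The two numbers handed to rubberband can be derived with a few lines of plain Python. This is a sketch under stated assumptions: the rate is treated as a playback-speed multiplier (above 1 is faster), only the tonic is aligned and the major/minor mode is ignored, and the helper names and the G♯ source key are invented for illustration.

```python
PITCH_CLASSES = ["C", "C#", "D", "D#", "E", "F",
                 "F#", "G", "G#", "A", "A#", "B"]

def stretch_rate(source_bpm: float, target_bpm: float) -> float:
    """Playback-rate multiplier: above 1 speeds the clip up, below 1 slows it."""
    return target_bpm / source_bpm

def semitone_shift(source_tonic: str, target_tonic: str) -> int:
    """Smallest signed shift, in semitones, that moves one tonic onto the other."""
    diff = (PITCH_CLASSES.index(target_tonic)
            - PITCH_CLASSES.index(source_tonic)) % 12
    return diff - 12 if diff > 6 else diff  # prefer down 5 over up 7

rate = stretch_rate(source_bpm=142, target_bpm=128)  # a slowdown of about 10%
steps = semitone_shift("G#", "F#")                   # two semitones down
```

Choosing the smaller of the two possible shifts keeps the pitch change as mild as possible, which tends to limit audible artefacts.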

stretched = pyrb.time_stretch(b, sr, rate=128/142)  # rate < 1 slows the clip down
shifted = pyrb.pitch_shift(stretched, sr, n_steps=-2)
sos = sps.butter(4, 250/(sr/2), 'low', output='sos')
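The filter design line covers only the low-pass half. A minimal sketch of the whole join, assuming both clips are already the same length and sample rate, could look like this; `eq_crossfade` is a hypothetical name and the test tones stand in for real clips.

```python
import numpy as np
from scipy import signal as sps

def eq_crossfade(outgoing: np.ndarray, incoming: np.ndarray,
                 sr: int, cutoff_hz: float = 250.0) -> np.ndarray:
    """Sum the lows of the outgoing clip with the highs of the incoming clip."""
    low = sps.butter(4, cutoff_hz, btype="low", fs=sr, output="sos")
    high = sps.butter(4, cutoff_hz, btype="high", fs=sr, output="sos")
    # sosfiltfilt runs each filter forward and backward: zero net phase shift.
    return sps.sosfiltfilt(low, outgoing) + sps.sosfiltfilt(high, incoming)

sr = 22050
t = np.arange(sr) / sr
bass = np.sin(2 * np.pi * 60 * t)       # well below the 250 Hz crossover
hats = np.sin(2 * np.pi * 2000 * t)     # well above it
joined = eq_crossfade(bass, hats, sr)   # both tones pass almost unchanged
swapped = eq_crossfade(hats, bass, sr)  # both tones are strongly attenuated
```

Passing `fs=sr` lets the cutoff be given directly in Hz, which is equivalent to normalising by the Nyquist frequency as `250/(sr/2)` does.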

Python libraries in the pipeline

The stack.

pydub + ffmpeg

Decode mp3 → AudioSegment, set channels/sample rate, measure dBFS.

sangam.py:349

numpy

The sample array itself. Every arithmetic op — RMS, dB, normalisation.

everywhere

librosa

Beat tracking via onset envelope + autocorrelation. Chroma CQT. Onset strength.

:353, :383, :434

scipy.signal

Butterworth filter design, zero-phase forward-backward filtering.

:1376–1381

pyrubberband

Phase-vocoder time-stretch and pitch-shift (CLI subprocess).

:1046

soundfile

Write the final mixdown to disk as wav / mp3.

:1315

07 / AI

What the chat agent reads

backend/app/chat.py:412

The agent receives one short paragraph per song.

Each paragraph lists the song's BPM, key, duration, detected section labels, and any pre-fetched web metadata. The audio array and the curves derived from it (RMS, chroma, onset envelope, downbeats) stay on the server, where only the segment picker and the renderer use them.

YES · What reaches the LLM

bpm: single scalar
key: "F#:min"
duration_s: scalar
fine_sections_s: [start, end, label, bars]
sections_s: hard-constraint list
section avg dB: computed server-side
web_metadata: composer, year, film

NO · What stays server-side

samples: 4M floats
rms_curve_db: ~180 floats / song
downbeats_s: used by segment picker
onset_strength_at_downbeats: cut-point weighting
chroma vector: used only for key
FFT / STFT: intermediate, not stored
audio bytes: kept in the worker

PROMPT · The exact text the model receives

## Uploaded files (in order)
Each file is shown with its `upload_index` ...

[upload_index=0] dilbar.mp3 — 218s, 116 BPM, key F#:min
   sections (6 detected):
     • intro     0:00-0:16  (-22.4 dB avg, 8 bars)
     • build     0:16-0:48  (-14.1 dB avg, 16 bars)
     • hook      0:48-1:20  ( -9.7 dB avg, 16 bars)
     • drop      1:20-1:52  ( -7.2 dB avg, 16 bars)
     • breakdown 1:52-2:24  (-15.9 dB avg, 16 bars)
     • outro     2:24-3:38  (-18.6 dB avg, 32 bars)
   web metadata: composer: Tanishk Bagchi; singer: Neha Kakkar;
                 album/film: Satyameva Jayate; year: 2018

## ONLY THESE SECTIONS EXIST
[upload_index=0] dilbar.mp3 — 6 sections bounded at
  0s, 16s, 48s, 80s, 112s, 144s, 218s.
These are the only timestamps you may cite as section boundaries.
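The section bullets in the prompt above follow a fixed column layout. A formatter that reproduces it could look like the sketch below; the function names are hypothetical and the padding widths are inferred from the example text, not taken from chat.py.

```python
def mmss(seconds: int) -> str:
    """Format whole seconds as m:ss, e.g. 80 -> '1:20'."""
    return f"{seconds // 60}:{seconds % 60:02d}"

def format_section(label: str, start_s: int, end_s: int,
                   avg_db: float, bars: int) -> str:
    """One bullet of the per-song summary, padded so the columns line up."""
    return (f"• {label:<9} {mmss(start_s)}-{mmss(end_s)}  "
            f"({avg_db:5.1f} dB avg, {bars} bars)")

sections = [("intro", 0, 16, -22.4, 8), ("hook", 48, 80, -9.7, 16)]
lines = [format_section(*s) for s in sections]
```

Fixed-width fields keep the timestamps and loudness values in the same columns for every section, which makes the summary easier to scan for a reader and for the model.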

WHY · Constrained timestamps

The prompt ends with a list of the only section boundaries the agent is allowed to cite. If the agent proposes a cut at a timestamp that isn't in that list, the segment picker rejects the plan during validation. The prompt also avoids including raw per-second curves, so there are fewer numerical details for the agent to misread.
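The validation rule amounts to a set-membership check. The sketch below assumes a plan can be reduced to a flat list of cited timestamps in seconds, which is a simplification; the real plan format and the function name are not given in the text.

```python
def invalid_cuts(cut_times_s, allowed_boundaries_s):
    """Return the cited timestamps that are not detected section boundaries."""
    allowed = set(allowed_boundaries_s)
    return [t for t in cut_times_s if t not in allowed]

boundaries = [0, 16, 48, 80, 112, 144, 218]  # from the example prompt
good_plan = [48, 80]                         # the hook, boundary to boundary
bad_plan = [48, 75]                          # 75 s is not a boundary

accepted = invalid_cuts(good_plan, boundaries)
rejected = invalid_cuts(bad_plan, boundaries)
```

An empty result means the plan cites only known boundaries; any non-empty result is grounds for rejecting the plan.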

FLOW · Agent decides, renderer executes

The agent's job is to choose which sections to splice and what target BPM and key to aim for. The renderer then reads the resulting plan and does the time-stretching, pitch-shifting, EQ-crossfading, and writing. Given the same input songs and the same plan, the renderer produces the same mp3.
