Inside the analysis pipeline.
Sangam analyses each uploaded song with a signal-processing pipeline, then a chat agent proposes a mashup plan from that analysis. The agent reads only a short text summary of each song; the raw audio stays on the server. The renderer applies the plan with time-stretching, pitch-shifting, and EQ-crossfading to produce the final mp3.
Decoding the audio
mp3 → numpy float32
The audio file becomes a 1D array of floats.
Pydub (via ffmpeg) decodes the mp3 and resamples it to 22,050 Hz mono. The samples come in as int16, then get divided by 2¹⁵ to produce a float32 array in the range [−1, 1]. All later stages operate on this array. Sangam doesn't separate the song into stems; it works on the summed waveform.
samples /= 2**(8 * sample_width - 1)
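The scaling step can be sketched in NumPy alone. This is a minimal illustration, not Sangam's decoder: the function name `pcm_to_float32` and the sample values are hypothetical, and the integers stand in for what pydub would hand back after decoding and resampling.

```python
import numpy as np

def pcm_to_float32(raw, sample_width):
    """Scale signed integer PCM samples to float32 in [-1, 1).

    sample_width is the number of bytes per sample (2 for int16 audio).
    """
    scale = float(2 ** (8 * sample_width - 1))  # 32768.0 for 16-bit audio
    # Convert first, then divide: dividing an integer array in place
    # would not give fractional values.
    return (np.asarray(raw, dtype=np.float32) / scale).astype(np.float32)

# Hypothetical stand-in for decoded int16 samples (mono, already resampled).
raw = np.array([-32768, -16384, 0, 16384, 32767], dtype=np.int16)
samples = pcm_to_float32(raw, sample_width=2)
```

The largest positive int16 value maps to just under 1.0, so the result sits in [−1, 1) rather than touching +1 exactly.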
Estimating tempo
librosa.beat.beat_track
Tempo is the period of repetition in the onset envelope.
Librosa first computes an onset envelope — a signal whose value rises whenever something new starts in the audio. It comes from the STFT: take the time difference of each frequency bin, keep only positive changes, sum across frequencies. Autocorrelation of this envelope then reveals the beat period as the lag at which the signal matches itself most strongly. A final dynamic-programming step snaps beats to nearby onset peaks while keeping them evenly spaced.
# confidence proxy:
bpm_conf = 1 - std(diff(beats)) / mean(diff(beats))
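The ideas above can be tried out on a synthetic click track. The sketch below is a toy version in plain NumPy, not librosa's implementation: it uses a simple spectral-flux envelope, picks the tempo from raw autocorrelation, and leaves out the dynamic-programming beat placement entirely. The function names, the 70 to 180 BPM search range, and the frame sizes are all assumptions made for the example.

```python
import numpy as np

def onset_envelope(samples, n_fft=1024, hop=441):
    """Spectral flux: positive frame-to-frame magnitude change, summed over bins."""
    x = np.asarray(samples, dtype=float)
    n_frames = 1 + (len(x) - n_fft) // hop
    idx = np.arange(n_fft)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = x[idx] * np.hanning(n_fft)          # windowed, overlapping frames
    mag = np.abs(np.fft.rfft(frames, axis=1))    # magnitude spectrogram
    flux = np.diff(mag, axis=0)                  # change over time, per bin
    return np.maximum(flux, 0.0).sum(axis=1)     # keep increases only

def estimate_bpm(env, fps, bpm_min=70.0, bpm_max=180.0):
    """Return the tempo whose beat period is the strongest autocorrelation lag."""
    env = np.asarray(env, dtype=float)
    ac = np.correlate(env, env, mode="full")[len(env) - 1:]  # lags 0..N-1
    lo = int(np.ceil(fps * 60.0 / bpm_max))    # shortest period, in frames
    hi = int(np.floor(fps * 60.0 / bpm_min))   # longest period, in frames
    lag = lo + int(np.argmax(ac[lo:hi + 1]))
    return 60.0 * fps / lag

def bpm_confidence(beat_times):
    """1 minus the coefficient of variation of the inter-beat intervals."""
    intervals = np.diff(beat_times)
    return 1.0 - float(np.std(intervals) / np.mean(intervals))

# Ten seconds of clicks every 0.5 s (120 BPM) at 22,050 Hz.
sr, hop = 22050, 441
clicks = np.zeros(10 * sr)
clicks[:: sr // 2] = 1.0
bpm = estimate_bpm(onset_envelope(clicks, hop=hop), fps=sr / hop)
```

Perfectly even beats give a confidence of 1.0; the more the intervals wander, the lower the score. Restricting the lag range matters, because a periodic envelope also matches itself at double and triple the beat period.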
Estimating key
chroma cqt · Krumhansl & Kessler 1982
Twelve pitch classes, compared against 24 reference profiles.
The constant-Q transform produces log-spaced frequency bins, one per semitone. Folding the octaves on top of each other gives a 12-element chroma vector — the share of energy at each pitch class. Sangam then compares this vector against 24 reference profiles (12 major and 12 minor keys) measured by Krumhansl and Kessler in 1982 by asking listeners how well each note fit a given key context. The profile with the highest Pearson correlation is reported as the song's key.
# 24 templates → highest Pearson r wins (_MINOR is the assumed name of the minor profile)
best_score = max(corrcoef(roll(p, i), chroma)[0, 1] for p in (_MAJOR, _MINOR) for i in range(12))
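A fuller sketch of the template match, assuming the chroma vector has already been computed and averaged over the song. The profile values are the commonly quoted Krumhansl and Kessler ratings; the names `KK_MAJOR`, `KK_MINOR`, and `estimate_key`, and the `F#:min` label format borrowed from the prompt excerpt, are choices made for this example rather than Sangam's code.

```python
import numpy as np

# Krumhansl-Kessler probe-tone ratings, tonic first (C major / C minor).
KK_MAJOR = np.array([6.35, 2.23, 3.48, 2.33, 4.38, 4.09,
                     2.52, 5.19, 2.39, 3.66, 2.29, 2.88])
KK_MINOR = np.array([6.33, 2.68, 3.52, 5.38, 2.60, 3.53,
                     2.54, 4.75, 3.98, 2.69, 3.34, 3.17])
PITCHES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def estimate_key(chroma):
    """Return (label, r) for the best of the 24 rotated profiles."""
    best_label, best_r = None, -np.inf
    for mode, profile in (("maj", KK_MAJOR), ("min", KK_MINOR)):
        for tonic in range(12):
            # Rolling by `tonic` moves the tonic weight onto that pitch class.
            template = np.roll(profile, tonic)
            r = np.corrcoef(template, chroma)[0, 1]  # Pearson correlation
            if r > best_r:
                best_label, best_r = f"{PITCHES[tonic]}:{mode}", r
    return best_label, float(best_r)

# A chroma vector shaped exactly like the minor profile rooted on F#.
label, r = estimate_key(np.roll(KK_MINOR, 6))
```

Because Pearson correlation ignores overall scale and offset, the chroma vector does not need to be normalised before the comparison.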
Measuring energy
root mean square · per-second
RMS energy, summarised second by second.
The samples are split into one-second windows. For each window the code computes √(mean(x²)) and converts the result to dB. The resulting curve is overlaid on the detected section boundaries, so each section gets an average loudness. The segment picker uses these averages to tell loud sections (likely hooks or drops) apart from quieter ones.
rms = sqrt(mean(w**2))
rms_db = 20 * log10(rms + 1e-9)
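As a sketch of the whole step, the per-second curve and a section average might look like the following. Two details are assumptions not stated above: a trailing partial second is dropped, and the section average is a plain mean of the dB values. The function names are hypothetical.

```python
import numpy as np

def rms_db_per_second(samples, sr):
    """RMS level in dB for each full one-second window."""
    n_windows = len(samples) // sr               # drop a trailing partial second
    x = np.asarray(samples[: n_windows * sr], dtype=np.float64)
    windows = x.reshape(n_windows, sr)
    rms = np.sqrt(np.mean(windows ** 2, axis=1))
    return 20.0 * np.log10(rms + 1e-9)           # epsilon avoids log10(0)

def section_average_db(rms_db, start_s, end_s):
    """Mean of the per-second dB values inside [start_s, end_s)."""
    return float(np.mean(rms_db[start_s:end_s]))

# One second of a full-scale 440 Hz sine followed by one second of silence.
sr = 22050
t = np.arange(sr) / sr
samples = np.concatenate([np.sin(2 * np.pi * 440 * t), np.zeros(sr)])
curve = rms_db_per_second(samples, sr)
```

A full-scale sine has an RMS of 1/√2, so it reads about −3 dB; digital silence bottoms out at −180 dB because of the 1e-9 floor.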
Scoring the downbeats
onset strength · ±100 ms window
Each downbeat gets a punch score.
Every fourth beat from the beat tracker is treated as a downbeat. The code reads the onset envelope in a ±100 ms window around each downbeat and stores the maximum value. Bars where the downbeat lands on a kick drum produce higher scores than bars where it lands on a softer instrument. The segment picker uses these scores when choosing cut points, so transitions tend to land on bars with a clear attack.
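# assumes downbeats are onset-envelope frame indices; sr and hop_length match the envelope
half = round(0.100 * sr / hop_length)  # 100 ms expressed in frames
for d in downbeats:
    lo, hi = max(0, d - half), d + half + 1
    score = max(onset_env[lo:hi])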
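The same scoring, written out as a self-contained sketch. It assumes the beat positions are frame indices into the onset envelope, that the first tracked beat is a downbeat, and that every beat frame lies inside the envelope; `score_downbeats` and its parameters are hypothetical names.

```python
import numpy as np

def score_downbeats(onset_env, beat_frames, fps, window_s=0.100, beats_per_bar=4):
    """Return (downbeat_frames, scores): peak onset strength near each downbeat."""
    env = np.asarray(onset_env, dtype=float)
    half = int(round(window_s * fps))                     # 100 ms in frames
    downbeats = np.asarray(beat_frames)[::beats_per_bar]  # every fourth beat
    scores = []
    for d in downbeats:
        lo = max(0, int(d) - half)                 # clamp at the start
        hi = min(len(env), int(d) + half + 1)      # clamp at the end
        scores.append(float(env[lo:hi].max()))
    return downbeats, np.array(scores)

# Toy envelope at 100 frames per second with beats every 0.5 s.
fps = 100.0
env = np.zeros(1000)
beats = np.arange(50, 1000, 50)
env[52] = 3.0    # strong attack just after the first downbeat (frame 50)
env[250] = 1.0   # softer attack on the second downbeat
env[120] = 9.0   # loud hit away from any downbeat: ignored
downbeats, scores = score_downbeats(env, beats, fps)
```

Taking the maximum rather than the value at the exact downbeat frame gives the score some tolerance for beats that the tracker placed a few frames early or late.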
Rendering the mashup
pyrubberband · scipy.signal
Time-stretch, pitch-shift, then EQ-crossfade.
Pyrubberband wraps the rubberband CLI, which uses a phase vocoder to change duration without changing pitch. Sangam calls time_stretch with the playback rate that moves the source BPM onto the target (target BPM ÷ source BPM, since a rate above 1 speeds the audio up), then pitch_shift with a semitone offset to align keys. The crossfade is built from two 4th-order Butterworth filters at 250 Hz: a low-pass keeps the low frequencies of the outgoing clip, and a high-pass keeps the high frequencies of the incoming clip. The two filtered signals are summed to produce the join.
shifted = pyrb.pitch_shift(stretched, sr, n_steps=-2)
sos = sps.butter(4, 250/(sr/2), 'low', output='sos')
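The arithmetic and the filter pair can be sketched without touching pyrubberband. Everything named here (`stretch_rate`, `semitone_shift`, `eq_crossfade`) is hypothetical, and two choices are assumptions: the pitch shift takes the shortest route between pitch classes, and the filters run causally with `sosfilt` over the whole overlap, with no gain ramp.

```python
import numpy as np
from scipy import signal as sps

def stretch_rate(source_bpm, target_bpm):
    """Playback rate that moves source_bpm to target_bpm (above 1 speeds up)."""
    return target_bpm / source_bpm

def semitone_shift(source_pc, target_pc):
    """Smallest signed semitone move between pitch classes (0=C ... 11=B)."""
    return (target_pc - source_pc + 6) % 12 - 6

def eq_crossfade(outgoing, incoming, sr, cutoff_hz=250.0, order=4):
    """Lows of the outgoing clip plus highs of the incoming clip."""
    n = min(len(outgoing), len(incoming))
    wn = cutoff_hz / (sr / 2)                       # cutoff as a fraction of Nyquist
    low_sos = sps.butter(order, wn, "low", output="sos")
    high_sos = sps.butter(order, wn, "high", output="sos")
    lows = sps.sosfilt(low_sos, np.asarray(outgoing[:n], dtype=float))
    highs = sps.sosfilt(high_sos, np.asarray(incoming[:n], dtype=float))
    return (lows + highs).astype(np.float32)

# 116 BPM source to a 128 BPM target; F# (6) down to E (4).
rate = stretch_rate(116, 128)      # would be passed as the rate to time_stretch
n_steps = semitone_shift(6, 4)     # would be passed as n_steps to pitch_shift
```

Keeping the outgoing bass while the incoming highs arrive avoids two kick drums or bass lines fighting each other during the overlap, which is the usual reason for splitting a transition by frequency.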
The stack.
What the chat agent reads
backend/app/chat.py:412
The agent receives one short paragraph per song.
Each paragraph lists the song's BPM, key, duration, detected section labels, and any pre-fetched web metadata. The audio array and the curves derived from it (RMS, chroma, onset envelope, downbeats) are kept on the server and used only by the renderer.
## Uploaded files (in order)
Each file is shown with its `upload_index` ...
[upload_index=0] dilbar.mp3 — 218s, 116 BPM, key F#:min
sections (6 detected):
• intro 0:00-0:16 (-22.4 dB avg, 8 bars)
• build 0:16-0:48 (-14.1 dB avg, 16 bars)
• hook 0:48-1:20 ( -9.7 dB avg, 16 bars)
• drop 1:20-1:52 ( -7.2 dB avg, 16 bars)
• breakdown 1:52-2:24 (-15.9 dB avg, 16 bars)
• outro 2:24-3:38 (-18.6 dB avg, 32 bars)
web metadata: composer: Tanishk Bagchi; singer: Neha Kakkar; album/film: Satyameva Jayate; year: 2018
## ONLY THESE SECTIONS EXIST
[upload_index=0] dilbar.mp3 — 6 sections bounded at 0s, 16s, 48s, 80s, 112s, 144s, 218s.
These are the only timestamps you may cite as section boundaries.
The prompt ends with a list of the only section boundaries the agent is allowed to cite. If the agent proposes a cut at a timestamp that isn't in that list, the segment picker rejects the plan during validation. The prompt also avoids including raw per-second curves, so there are fewer numerical details for the agent to misread.
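The boundary check could be as small as the sketch below. The plan format is not shown in this document, so the `Segment` dataclass, its field names, and `validate_plan` are all hypothetical; only the boundary list is taken from the prompt excerpt.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Segment:
    """One proposed cut: a span of one uploaded song, in whole seconds."""
    upload_index: int
    start_s: int
    end_s: int

def validate_plan(segments, boundaries_by_upload):
    """Return a list of error strings; an empty list means the plan passes."""
    errors = []
    for seg in segments:
        allowed = set(boundaries_by_upload.get(seg.upload_index, ()))
        for t in (seg.start_s, seg.end_s):
            if t not in allowed:
                errors.append(
                    f"upload {seg.upload_index}: {t}s is not a section boundary"
                )
        if seg.end_s <= seg.start_s:
            errors.append(f"upload {seg.upload_index}: empty or reversed span")
    return errors

# Boundaries for dilbar.mp3 as listed at the end of the prompt.
boundaries = {0: [0, 16, 48, 80, 112, 144, 218]}
ok = validate_plan([Segment(0, 48, 80)], boundaries)    # hook, on the grid
bad = validate_plan([Segment(0, 50, 80)], boundaries)   # 50s is not listed
```

Returning every problem at once, rather than stopping at the first, would let the rejection message tell the agent exactly which timestamps to replace.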
The agent's job is to choose which sections to splice and what target BPM and key to aim for. The renderer then reads the resulting plan and does the time-stretching, pitch-shifting, EQ-crossfading, and writing. Given the same input songs and the same plan, the renderer produces the same mp3.