Build notes
Building a browser VJ deck with AI: 97% prompt cache rate and $1.2k in API spend
slerp.audio is a browser-based VJ deck: local audio, shader controls, AudioWorklet DSP, worker-owned WebGL. The code came out of Cursor from specs, failing tests, bugs, and profiler screenshots. I did not write any code myself. This project took 13 days from start to finish.
Token reuse
Cursor separates On-Demand (you pay per use) from Included (counted against subscription). Across this window the two together add up to about 4.49 billion tokens in Cursor’s totals. The table below breaks that out; charts and dollar figures further down use on-demand only, because that is what hits the invoice.
| Kind | Usage lines | Total tokens (B) | Prompt cache reads (B) |
|---|---|---|---|
Most billed input was not fresh typing. It was cached project context while long agent threads kept touching the same paths and decisions.
| Bucket | Tokens (B) | Share |
|---|---|---|
What that means in dollars
2.58B cached input tokens landed on the billed slice. At $0.50 per million cached tokens, that prices out at about $1.29k. The full $1,214.84 on-demand line also includes output, cache writes, and the small fresh slice that missed cache.
| Measure | USD |
|---|---|
Where the $1,215 went
Same 13-day window by model and Max Mode. Bigger implementation passes skewed toward Max; day-to-day edits skewed toward normal mode.
| Model | Max | Usage lines | Cost | Share | Avg $/line |
|---|---|---|---|---|---|
thinking-xhigh is most of the money (~92.6%). Max Mode is ~30.4% of spend on ~20.9% of lines (roughly 65% more $/line than the same model without Max). Those lines tend to be bigger pushes (radix-4 pass, SAB wiring, worker migration, long perf sessions).
What kept the cache hot
Long-lived context on one tree. Boring hygiene beat clever prompts.
- Long threads tied to one project. A transcript pass counted roughly 685 user messages and 8,300 assistant messages across 12 top-level project threads. The build stayed in long-running agent threads instead of disposable one-off chats. Same context, same files, repeats hit the cached prefix instead of rewriting it.
- Stable file paths. Renaming or relocating a file mid-thread invalidates context about that file, so rename refactors waited until the active thread was winding down.
- Short status checkpoints over giant re-prompts. "What shipped, what broke, what's next" in a few lines instead of pasting the whole tree each time.
- One repro at a time when debugging. Mixing two unrelated bugs in one transcript inflates the prompt and rarely converges either bug faster.
- Incremental commits. Smaller diffs keep summaries anchored in the same tree snapshot the assistant was already reasoning about.
The audio engine
The shader cannot use raw music directly. It needs a few stable control signals it can read every frame: how hard the bass is hitting, how much midrange is present, how bright the top end is. Same bands as a DJ EQ, but read for animation instead of cut for sound.
The browser gives the worklet PCM samples: amplitude over time. That is useful for playback, but not for “how much bass is happening right now.” The missing step is frequency analysis. The FFT turns a short slice of audio into frequency energy; the engine then groups that energy into bass, mid, and high bands.
Concretely: a 2,048-point FFT runs every 256-sample hop (~5.8 ms at 44.1 kHz) inside an AudioWorkletProcessor. The band values are smoothed, written into a SharedArrayBuffer, and read by the renderer once per frame — no postMessage on the audio hot path. The main thread then forwards the three band floats to the render worker over postMessage; the worker uploads them as uniforms into the OffscreenCanvas’s WebGL2 context.
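The hop-rate side of that can be sketched in a few lines. This is a minimal illustration, not the project's actual code: the `[bass, mid, high]` layout, the `SMOOTHING` constant, and the `publishBands` name are all assumptions.

```javascript
// Sketch: smooth each band and publish it lock-free to the renderer.
// `bandBus` is a Float32Array view over a SharedArrayBuffer; the
// [bass, mid, high] layout is an assumption, not the real wire format.
const SMOOTHING = 0.8; // closer to 1 = slower, steadier animation

function publishBands(bandBus, smoothed, raw) {
  for (let i = 0; i < 3; i += 1) {
    // one-pole low-pass: the old value dominates, the new value nudges it
    smoothed[i] = SMOOTHING * smoothed[i] + (1 - SMOOTHING) * raw[i];
    bandBus[i] = smoothed[i]; // single f32 store; no postMessage
  }
}
```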
What an FFT actually does, and why we need one
Audio in a computer is just a long list of numbers: at 44.1 kHz, 44,100 amplitude readings per second, one per channel. That list tells you when the speaker cone moves, but it does not directly tell you what is bass, what is mid, or what is treble in any given moment. To split a song into "how loud is each frequency right now," you need to convert a chunk of those samples from a timeline of amplitudes into a histogram of frequencies.
The conversion we want is the discrete Fourier transform (DFT). Feed in a block of N samples (here, 2,048 samples — about 46 ms of audio). Get back a frequency report: a row of N "buckets" (the textbook calls them frequency bins or DFT coefficients). Each bucket is a single complex number whose magnitude says how much of one frequency was present in the block, and whose phase says where that wave was in its cycle.

At a 44.1 kHz sample rate, a 2,048-point DFT spaces buckets about 21.5 Hz apart: bucket 0 is 0 Hz / DC, bucket 1 is about 21.5 Hz, bucket 2 is about 43 Hz, and so on. For real-valued audio the upper half of the buckets is the conjugate mirror of the lower half (X[N−k] = X[k]*), so the useful display side is 0 Hz → Nyquist (about 22.05 kHz). The deck does not show every bucket directly — it sums ranges of buckets into musical bands: bass defaults to 40–140 Hz, mid to 150–2,000 Hz, and high to 2,000–10,000 Hz.
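Mapping those band edges to bucket indices is one division and a round. A small sketch (the `binRange` helper is illustrative, not from the project):

```javascript
// Map the default band edges to FFT bin indices.
// Bin spacing = sampleRate / fftSize ≈ 21.53 Hz for 44.1 kHz / 2048.
const SAMPLE_RATE = 44100;
const FFT_SIZE = 2048;
const binHz = SAMPLE_RATE / FFT_SIZE;

function binRange(loHz, hiHz) {
  // first bin fully at/above the low edge, last bin at/below the high edge
  return [Math.ceil(loHz / binHz), Math.floor(hiHz / binHz)];
}

const bassBins = binRange(40, 140);    // → [2, 6]
const midBins = binRange(150, 2000);   // → [7, 92]
const highBins = binRange(2000, 10000);
```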
Computed straight from the DFT formula X[k] = Σₘ x[m] · e^(−2πikm/N), this is two nested loops: for every output bucket k, walk every input sample x[m] and add up its contribution. In plain English: test "how much 86 Hz is in this block?", then scan all 2,048 samples; test "how much 108 Hz?", then scan the same 2,048 samples again. Doing that for every bucket is O(N²) work, which at N = 2,048 is roughly 4 million complex multiplies per block. We have about 5.8 ms of CPU between blocks. The O(N²) version will not fit that budget.
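The two nested loops look like this — a deliberately slow, direct sketch for illustration only:

```javascript
// Direct O(N²) DFT. This is the version that cannot meet the
// 5.8 ms budget at N = 2048; it exists here only to show the shape.
function naiveDft(samples) {
  const N = samples.length;
  const re = new Float64Array(N);
  const im = new Float64Array(N);
  for (let k = 0; k < N; k += 1) {       // one pass per output bucket…
    for (let m = 0; m < N; m += 1) {     // …over every input sample
      const angle = (-2 * Math.PI * k * m) / N;
      re[k] += samples[m] * Math.cos(angle);
      im[k] += samples[m] * Math.sin(angle);
    }
  }
  return { re, im };
}
```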
A Fast Fourier Transform is any algorithm that computes the same DFT — same numbers in, same numbers out — in O(N log N) work instead of O(N²). The FFT does not invent a new transform; it factors the DFT's matrix of complex exponentials into sparse pieces that share most of their arithmetic. At N = 2,048, (N/2) · log₂(N) ≈ 11,000 complex multiplies are enough to produce the same result — roughly several hundred times cheaper, easily inside budget.
The speedup comes from avoiding repeated work. In the direct DFT, every bucket starts over and rediscovers little sums that other buckets already touched. The FFT keeps those smaller results and reuses them.
Think of the final result as the full frequency report for the whole 2,048-sample block. The FFT builds that report out of smaller reports. A small report might cover only a tiny slice of the block. The next round combines neighboring small reports into bigger reports. The next round combines those again. By the time the FFT reaches the full 2,048-sample report, most of the work has already been reused.
The classic version of this trick is the Cooley–Tukey algorithm (1965). It is divide-and-conquer in the same shape as merge sort — split, solve, combine — except the combine step is a complex- number "butterfly" instead of a merge of sorted lists, and what gets reused are partial sums of complex exponentials, not sorted sub-sequences.
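The split/solve/combine shape can be written as a short recursive sketch. The production worklet uses an iterative, in-place, mixed-radix version; this one trades speed for readability and is purely illustrative.

```javascript
// Minimal recursive radix-2 Cooley–Tukey FFT (N must be a power of 2).
// Shows the merge-sort-like shape: split even/odd, solve, combine.
function fft(re, im) {
  const N = re.length;
  if (N === 1) return { re, im };
  const evenRe = [], evenIm = [], oddRe = [], oddIm = [];
  for (let i = 0; i < N; i += 2) {
    evenRe.push(re[i]); evenIm.push(im[i]);
    oddRe.push(re[i + 1]); oddIm.push(im[i + 1]);
  }
  const e = fft(evenRe, evenIm);  // small report: even-indexed samples
  const o = fft(oddRe, oddIm);    // small report: odd-indexed samples
  const outRe = new Array(N), outIm = new Array(N);
  for (let k = 0; k < N / 2; k += 1) {
    const ang = (-2 * Math.PI * k) / N;       // twiddle W_N^k
    const wr = Math.cos(ang), wi = Math.sin(ang);
    const tr = wr * o.re[k] - wi * o.im[k];   // rotate the odd report
    const ti = wr * o.im[k] + wi * o.re[k];
    outRe[k] = e.re[k] + tr;                  // butterfly: sum…
    outIm[k] = e.im[k] + ti;
    outRe[k + N / 2] = e.re[k] - tr;          // …and difference
    outIm[k + N / 2] = e.im[k] - ti;
  }
  return { re: outRe, im: outIm };
}
```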
The "log" in O(N log N) is the number of combine rounds, not a logarithm applied to the audio. "Chunk size" here means how many original samples one intermediate report covers. The smallest reports cover one sample each. After one combine round, each report covers 2 samples. Then 4, then 8, then 16, and so on. Reaching the final 2,048-sample report takes eleven doublings because 2¹¹ = 2,048. The audio values stay normal sample amplitudes; the algorithm is just grouping them more efficiently.
Four implementation ideas show up directly in the worklet code below:
- Complex frequency buckets. Each output bucket needs to say two things: how much of that frequency is present, and where that wave sits in its cycle. The first is magnitude. The second is phase. One real number only carries one measurement, so the FFT uses a complex number: one part for cosine alignment, one part for sine alignment.
- Bit-reversed input order. Repeatedly splitting even-indexed vs odd-indexed samples ends with the sample at index `i` sitting at the position whose binary digits are `i`'s binary digits read backwards (`bitReverse(i)`). The iterative FFT shuffles the input that way once at the start, then never moves data again — the rest of the algorithm is all in-place arithmetic.
- Twiddle factors. When two intermediate reports come from different positions in the input window, their phases do not line up automatically. A twiddle factor is the fixed complex number that rotates one report by the exact angle needed before it is mixed with the others. Mathematically each one is an Nth root of unity, W_N^k = e^(−2πik/N) — basically `k/N` of the way around the unit circle (in the negative direction, because the DFT formula uses e^(−i…)). They depend only on N, so we compute all of them once at module load and read them as plain table lookups in the hot loop. The name comes from Gentleman and Sande, 1966.
- Butterflies. The combine step itself. Each one takes a pair of intermediate complex values (a, b), computes (a + W·b, a − W·b), and writes both back over the originals. One multiply, two add/subtract pairs, no extra memory. If you draw the data flow on paper you get an X / butterfly shape, hence the name.
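The four ideas above fit together in an iterative, in-place, table-driven radix-2 loop. This is a sketch, not the project's worklet code — `fftInPlace` and its parameter names are illustrative, and the tables are assumed precomputed for size N:

```javascript
// Sketch: bit-reversed shuffle once, then in-place radix-2 butterflies
// with twiddle rotations read from precomputed tables of length N/2.
function fftInPlace(re, im, N, bitReverseTable, twiddleReal, twiddleImag) {
  for (let i = 0; i < N; i += 1) {          // 1. shuffle the input once
    const j = bitReverseTable[i];
    if (j > i) {
      [re[i], re[j]] = [re[j], re[i]];
      [im[i], im[j]] = [im[j], im[i]];
    }
  }
  for (let len = 2; len <= N; len <<= 1) {  // 2. combine rounds: 2, 4, 8…
    const half = len >> 1;
    const stride = N / len;                 // table step for this round
    for (let start = 0; start < N; start += len) {
      for (let k = 0; k < half; k += 1) {
        const wIdx = k * stride;            // 3. twiddle by table lookup
        const wr = twiddleReal[wIdx], wi = twiddleImag[wIdx];
        const a = start + k, b = a + half;
        const tr = wr * re[b] - wi * im[b]; // rotate b by W
        const ti = wr * im[b] + wi * re[b];
        re[b] = re[a] - tr; im[b] = im[a] - ti; // 4. butterfly writes
        re[a] += tr;        im[a] += ti;        //    back in place
      }
    }
  }
}
```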
Radix-4 is a Cooley–Tukey variant that groups two radix-2 combine rounds into one larger four-way round wherever the sample count allows it. Each combined round handles four values at once. The transform it computes is identical to the radix-2 result; the factorisation just produces fewer stages and lets the code share some twiddle arithmetic.
Compute once, never again
Two values precomputed at module load that the audio thread re-uses on every hop:
- The Hann window. A smooth bell-shaped taper that fades the 2,048-sample chunk to zero at its edges before the FFT sees it. Why: the FFT mathematically pretends each chunk repeats forever; if the chunk does not start and end near zero, that fake repetition has a hard edge, and the FFT honestly reports that hard edge as energy spread across every frequency bucket — a smeared spectrum instead of a clean one. Multiplying the chunk by a Hann window kills the edges and gives a much sharper spectrum. The taper itself is the same 2,048 numbers forever, so build it once at startup and reuse it on every chunk.
- The bit-reversal table. A 4 KB `Uint16Array` mapping each index `i` to `bitReverse(i)`. Replaces an inner-loop bit-twiddle on every FFT call with one array load.
slerp-band-processor.worklet.js
```js
// Built once at module load; not recomputed at audio rate.
const hannWindow = new Float32Array(FFT_SIZE);
for (let i = 0; i < FFT_SIZE; i += 1) {
  hannWindow[i] = 0.5 * (1 - Math.cos((2 * Math.PI * i) / (FFT_SIZE - 1)));
}

// 11-bit bit-reversal table for N=2048 (~4 KB).
const bitReverseTable = new Uint16Array(FFT_SIZE);
let j = 0;
for (let i = 1; i < FFT_SIZE; i += 1) {
  let bit = FFT_SIZE >> 1;
  for (; j & bit; bit >>= 1) j ^= bit;
  j ^= bit;
  bitReverseTable[i] = j;
}
```
Radix-2 → radix-4: the optimisation that didn't matter
The pipeline was hot while this block was tuned, so work started in chrome://tracing/, not the normal DevTools console. The page/main-thread view is good for layout and UI work, but the worklet runs on its own audio thread. To see where the FFT actually lands, you need a trace that includes the AudioWorkletProcessor.process() calls.

Under that lens, the FFT was not the bottleneck; the cost sat elsewhere. The radix-4 change still landed (small diff, fewer multiplies on paper), but measured hop time on the worklet barely moved afterwards.
The exact sample count matters here. A 2,048-point FFT has eleven radix-2 stages because 2¹¹ = 2,048. Every radix-2 stage doubles the size of the intermediate reports: 2 samples, then 4, then 8, and so on until the full 2,048-sample frequency report is assembled. Radix-4 groups those stages in pairs. Since one four-way combine covers the same growth as two two-way combines, the code can replace most of the eleven radix-2 stages with radix-4 stages. The useful factorization is 2,048 = 2 · 4 · 4 · 4 · 4 · 4: one leftover radix-2 stage, then five radix-4 stages.

That factorization is not the only true one. Writing 2,048 as eleven factors of two is also correct, and that is the plain radix-2 FFT. The reason 2 · 4⁵ is useful is that it packs in as many four-way stages as possible without changing the window size. It is the same transform, just grouped to do less repeated work.
The repeated unit of work is still a butterfly. A butterfly is not a new signal operation; it is the small "combine these reports into a bigger report" block inside the FFT. The radix tells you how many values go into that block: two values for radix-2, four values for radix-4.
The four values entering a radix-4 butterfly are already complex frequency measurements from smaller reports. Complex numbers are useful because a frequency measurement needs both magnitude and phase. Magnitude is "how much of this frequency is present." Phase is "where in the wave cycle is it?"
Phase matters because two matching waves can add very differently depending on where they are in the cycle. If both are at a peak, they reinforce. If one is at a peak while the other is halfway around its cycle, they cancel. The FFT combines frequency measurements that came from different positions in the input window, so it has to correct those phase offsets before adding them together.
That correction is the twiddle rotation. Picture a complex number as a point on an x/y plane: real is x, imaginary is y. Multiplying by a unit-length complex number turns that point around the origin without changing its length. In FFT terms, the amount stays the same, but the phase is moved to the right place.
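The "turn without stretching" property is easy to check numerically. A tiny sketch, with a hand-picked unit twiddle:

```javascript
// Rotating a complex value by a unit-length twiddle keeps its magnitude
// and shifts its phase. Here W = e^(-iπ/2) = (0, -1): a quarter turn.
const wr = 0, wi = -1;            // unit-length, 90° in the negative direction
const ar = 3, ai = 4;             // magnitude 5, some phase
const rr = wr * ar - wi * ai;     // real part of a·W  → 4
const ri = wr * ai + wi * ar;     // imag part of a·W  → -3
const magBefore = Math.hypot(ar, ai); // 5
const magAfter = Math.hypot(rr, ri);  // still 5: only the phase moved
```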
A radix-4 butterfly needs three of those rotations: W, W², and W³. They are not computed from scratch in the audio loop; they come from the precomputed twiddle tables. If the table index for W is k, then W² is at 2k and W³ is at 3k. The code gets those indices with one doubling and one add, then reads the table entries.
That is the whole optimisation in this block: load four complex values, rotate three of them, then combine all four into four outputs. Compared with doing the same work as separate radix-2 stages, the grouped version saves roughly 25% of the real-number multiplies on paper.
There is one special case in every stage: k = 0. At that index, all three rotations equal (1, 0), which means "do not rotate." The code handles that case separately so it can skip the complex multiplies entirely. The loop below is only the general case: k > 0.
Reading the loop is easier with four small translations in mind:
- `real[i]` and `imag[i]` are the two halves of one complex value.
- The `u1`, `u2`, and `u3` variables are the three twiddle rotations.
- A complex multiply is the rotation step. The formula (a + bi)(c + di) = (ac − bd) + (ad + bc)i is why the code keeps pairing one line for the new real part with one line for the new imaginary part.
- After the rotations, the butterfly uses adds and subtracts to write four new complex values back into the same four array slots.
inner radix-4 butterfly
```js
for (let k = 1; k < L; k += 1) {
  const i0 = i + k;
  const i1 = i0 + L;
  const i2 = i1 + L;
  const i3 = i2 + L;

  const u1Idx = k * tableStride;
  const u2Idx = u1Idx << 1;     // W²: index doubles, no multiply
  const u3Idx = u1Idx + u2Idx;  // W³: still just an add

  const u1r = twiddleReal[u1Idx], u1i = twiddleImag[u1Idx];
  const u2r = twiddleReal[u2Idx], u2i = twiddleImag[u2Idx];
  const u3r = twiddleReal[u3Idx], u3i = twiddleImag[u3Idx];

  const p0r = real[i0], p0i = imag[i0];
  const p1r = real[i1], p1i = imag[i1];
  const p2r = real[i2], p2i = imag[i2];
  const p3r = real[i3], p3i = imag[i3];

  const br = u2r * p1r - u2i * p1i;  // p1 · W²
  const bi = u2r * p1i + u2i * p1r;
  const cr = u1r * p2r - u1i * p2i;  // p2 · W
  const ci = u1r * p2i + u1i * p2r;
  const dr = u3r * p3r - u3i * p3i;  // p3 · W³
  const di = u3r * p3i + u3i * p3r;

  const t0r = p0r + br, t0i = p0i + bi;
  const t1r = p0r - br, t1i = p0i - bi;
  const t2r = cr + dr,  t2i = ci + di;
  const t3r = di - ci,  t3i = cr - dr;

  real[i0] = t0r + t2r; imag[i0] = t0i + t2i;
  real[i1] = t1r - t3r; imag[i1] = t1i - t3i;
  real[i2] = t0r - t2r; imag[i2] = t0i - t2i;
  real[i3] = t1r + t3r; imag[i3] = t1i + t3i;
}
```
Further reading
Wikipedia has solid introductory articles for each of the pieces above, in roughly the order they appear in the post:
- Discrete Fourier transform — the math the FFT is computing.
- Fast Fourier transform — the algorithm family, complexity bounds, and history.
- Cooley–Tukey FFT algorithm — the divide-and-conquer scheme used here, including radix-2 and mixed-radix variants.
- Twiddle factor — the precomputed roots of unity used in the butterflies.
- Butterfly diagram — where the "butterfly" name and the X-shaped data flow come from.
- Bit-reversal permutation — the reorder applied once at the start of an iterative FFT.
- Hann window — the bell-shaped taper applied before the FFT.
- Nyquist frequency — why a 44.1 kHz sample rate caps useful buckets at about 22.05 kHz.
- Roots of unity — the unit-circle complex numbers that act as twiddle rotations.
Where the time actually went
Two bottlenecks showed up in Chrome’s Performance panel; neither was the FFT.
Layout thrash on panel-open
Compact mode stuttered for ~500 ms when the side panel mounted. The flame chart showed a long task with a familiar shape: read geometry, write a style, read geometry again. Each read forced a synchronous layout that the previous write had invalidated.
the pattern (synthesised, real shape)
```js
// Before: read → write → read forces a layout per section.
for (const section of sections) {
  const w = panel.offsetWidth;                     // read
  section.style.padding = `${w / 16}px`;           // write (invalidates layout)
  const h = section.offsetHeight;                  // read forces layout *again*
  section.style.transform = `translateY(${h}px)`;
}

// After: batch reads, then writes. One layout pass total.
const widths = sections.map(() => panel.offsetWidth);
const heights = sections.map((s) => s.offsetHeight);
for (const [i, section] of sections.entries()) {
  section.style.padding = `${widths[i] / 16}px`;
  section.style.transform = `translateY(${heights[i]}px)`;
}
```
Faster to spot in the flame chart than in code review. Batching reads removed the cliff.
Texture thrash on resize
On mobile, every rotation reallocated framebuffer textures at the new viewport size, so the GPU paid for each allocation in the frame that woke the resize. Fix: snap render scale to tiers, debounce resize, reuse the texture handle inside a tier when possible.
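The shape of that fix can be sketched in a few lines. The tier values, the 150 ms delay, and the helper names are illustrative, not the project's actual constants:

```javascript
// Snap render scale to coarse tiers and debounce resize, so a rotation
// triggers at most one texture reallocation instead of one per event.
const TIERS = [0.5, 0.75, 1.0];

function snapToTier(rawScale) {
  // pick the smallest tier that covers the requested scale
  for (const t of TIERS) if (rawScale <= t) return t;
  return TIERS[TIERS.length - 1]; // clamp anything above the top tier
}

function debounce(fn, ms) {
  let timer = null;
  return (...args) => {
    clearTimeout(timer);
    timer = setTimeout(() => fn(...args), ms);
  };
}

// Framebuffer textures reallocate only when the snapped tier changes:
// const onResize = debounce(() => applyTier(snapToTier(viewportScale())), 150);
```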
Same failure mode as panel-open: read/write churn and allocation, not the inner loop.
Worker-owned rendering
Main thread: panel, file input, scroll, audio bridge, feature checks. Worker: transferred canvas, WebGL2, shaders, resize tiers, its own per-frame paint loop. The in-process renderer stays available as a fallback behind the same API as the worker proxy.
facade branch: worker path first, fallback second
```js
const canOffscreen =
  typeof OffscreenCanvas !== "undefined" &&
  typeof Worker !== "undefined" &&
  typeof canvas.transferControlToOffscreen === "function";

renderer = canOffscreen ? createWorkerRenderer(rendererOpts) : null;
if (!renderer) {
  renderer = createMainThreadRenderer(rendererOpts);
}
```
If detection fails or transferControlToOffscreen() / new Worker() throws, createWorkerRenderer() returns null and the facade drops back to an in-process renderer. After a canvas transfers successfully, GL stays inside the worker; the main thread only sends setter-shaped messages.
main-thread proxy: transfer, serialize, dispatch
```js
const off = canvas.transferControlToOffscreen();
const worker = createRenderWorker();
worker.postMessage(
  { type: "init", canvas: off, opts: buildWireInit(opts) },
  [off],
);
```
The adapter maps non-serializable pieces to wire-safe data: qualityUniforms becomes small tables by quality tier; UI-only shapes like format never cross the boundary. Structured clone rejects functions, so the protocol cannot be "whatever the editor had on hand."
worker entry: same renderer factory, message pump in front
```js
self.addEventListener("message", (ev) => {
  const msg = ev.data;
  if (msg.type === "init") return handleInit(msg.canvas, msg.opts);
  switch (msg.type) {
    case "setAudioLevels":
      renderer.setAudioLevels(msg.bass, msg.mid, msg.high);
      break;
    case "setViewport":
      renderer.setViewport(msg.cssWidth, msg.cssHeight, msg.dpr);
      break;
    case "setScroll":
      renderer.setScroll(msg.value);
      break;
    // …setShader, rebindShaderSpec, setTuneSnapshot, pause, resume, destroy…
  }
});
```
The worker entry reconstructs wire-safe shader specs, calls the same renderer factory the main thread can fall back to, then handles commands straight through: audio levels, viewport, scroll, shader swaps, tune snapshots, pulse, pause, resume, destroy.
v1 still sends three floats over postMessage per frame; the cost is negligible on the main thread, and there is no SharedArrayBuffer reader in the worker yet. Wiring the worker to read the band bus on its own timer can wait until a trace proves it is worth it.
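If that wiring ever lands, the reader side could be as small as this sketch. The `[bass, mid, high]` layout and the `makeBandReader` name are assumptions, not the project's API:

```javascript
// Sketch of the not-yet-built path: the render worker reads the band
// bus directly each frame instead of waiting on per-frame postMessage.
function makeBandReader(sab) {
  const view = new Float32Array(sab); // view over the SharedArrayBuffer
  return () => ({ bass: view[0], mid: view[1], high: view[2] });
}

// In the worker's paint loop:
// const readBands = makeBandReader(sharedBandBuffer);
// const { bass, mid, high } = readBands(); // plain loads, no messages
```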