
Build notes

Building a browser VJ deck with AI: 97% prompt cache rate and $1.2k in API spend

slerp.audio is a browser-based VJ deck: local audio, shader controls, AudioWorklet DSP, worker-owned WebGL. The code came out of Cursor from specs, failing tests, bugs, and profiler screenshots. I did not write any code myself. This project took 13 days from start to finish.

97.2% on-demand input-side tokens served from prompt cache
$1,215 on-demand spend, 13-day window
2.58B on-demand tokens read from prompt cache
4.49B all usage (billed + subscription-included)

Token reuse

Cursor separates On-Demand (you pay per use) from Included (counted against subscription). Across this window the two together add up to about 4.49 billion tokens in Cursor’s totals. The table below breaks that out; charts and dollar figures further down use on-demand only, because that is what hits the invoice.

[Table: token volume by billing kind (billed vs subscription-included) — columns: Kind, Usage lines, Total tokens (B), Prompt cache reads (B). The combined row ignores two failed runs that carried no charge and no tokens.]

Most billed input was not fresh typing. It was cached project context while long agent threads kept touching the same paths and decisions.

[Table: on-demand input-token mix (paid rows only) — columns: Bucket, Tokens (B), Share.]

What that means in dollars

2.58B cached input tokens sat on the billed slice. At $0.50 per million cached tokens, those reads alone price out at about $1.29k. The full $1,214.84 on-demand line also covers output, cache writes, and the small fresh slice that missed cache.

[Table: cached-input cost from billed (on-demand) usage — columns: Measure, USD.]

Where the $1,215 went

Same 13-day window by model and Max Mode. Bigger implementation passes skewed toward Max; day-to-day edits skewed toward normal mode.

[Table: on-demand spend by model and Max Mode — columns: Model, Max, Usage lines, Cost, Share, Avg $/line. "Usage lines" counts how many billed requests fell into each bucket, not rows of this page.]

thinking-xhigh is most of the money (~92.6%). Max Mode is ~30.4% of spend on ~20.9% of lines (roughly 65% more $/line than the same model without Max). Those lines tend to be bigger pushes (radix-4 pass, SAB wiring, worker migration, long perf sessions).

What kept the cache hot

Long-lived context on one tree. Boring hygiene beat clever prompts.

  1. Long threads tied to one project. A transcript pass counted roughly 685 user messages and 8,300 assistant messages across 12 top-level project threads. The build stayed in long-running agent threads instead of disposable one-off chats. Same context, same files, repeats hit the cached prefix instead of rewriting it.
  2. Stable file paths. Renaming or relocating a file mid-thread invalidates context about that file, so rename refactors waited until the active thread was winding down.
  3. Short status checkpoints over giant re-prompts. "What shipped, what broke, what's next" in a few lines instead of pasting the whole tree each time (example after this list).
  4. One repro at a time when debugging. Mixing two unrelated bugs in one transcript inflates the prompt and rarely converges either bug faster.
  5. Incremental commits. Smaller diffs keep summaries anchored in the same tree snapshot the assistant was already reasoning about.
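
For the checkpoint style in item 3, a typical message was a few plain lines rather than a re-paste of the tree. Illustrative, not a real transcript line:

example status checkpoint

Shipped: radix-4 stages landed behind the same process() entry; tests green.
Broke: bass band reads zero when the tab is backgrounded.
Next: trace the worklet thread before touching the EMA attack constant.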

The audio engine

The shader cannot use raw music directly. It needs a few stable control signals it can read every frame: how hard the bass is hitting, how much midrange is present, how bright the top end is. Same bands as a DJ EQ, but read for animation instead of cut for sound.

The browser gives the worklet PCM samples: amplitude over time. That is useful for playback, but not for “how much bass is happening right now.” The missing step is frequency analysis. The FFT turns a short slice of audio into frequency energy; the engine then groups that energy into bass, mid, and high bands.

Concretely: a 2,048-point FFT runs every 256-sample hop (~5.8 ms at 44.1 kHz) inside an AudioWorkletProcessor. The band values are smoothed, written into a SharedArrayBuffer, and read by the renderer once per frame. No postMessage on the hot path.
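
A minimal sketch of that handoff, with the smoothing and processor plumbing omitted (variable names assumed):

band bus sketch

// Three floats shared across threads: [bass, mid, high].
// SharedArrayBuffer is only available once the COOP + COEP headers are set.
const sab = new SharedArrayBuffer(3 * Float32Array.BYTES_PER_ELEMENT);
const bands = new Float32Array(sab);

// Audio thread, once per 256-sample hop (smoothed band values assumed computed):
bands[0] = bassSmoothed;
bands[1] = midSmoothed;
bands[2] = highSmoothed;

// Main thread, once per render frame — plain reads, no postMessage:
const bass = bands[0], mid = bands[1], high = bands[2];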

[Diagram: audio engine architecture and DSP signal flow. Audio thread: PCM samples on a 256-sample hop → Hann window (precomputed once at module load) → 2,048-point radix-4 FFT (one r2 stage + five merged r4 stages) → per-bin |X|² in dB → bass/mid/high integration + EMA smoother (attack/release) → SharedArrayBuffer (zero-copy, lock-free, behind COOP + COEP). Main thread: per-frame read, then postMessage(bass, mid, high) — three floats per render frame — across the transferControlToOffscreen boundary to a render worker that owns the GL context and runs its own paint loop, uploading uBass, uMid, and uHigh into a WebGL2 fragment shader on an OffscreenCanvas.]
Signal flow: every 256 samples, the audio thread fills the shared buffer; every render frame, the main thread forwards three floats to the render worker via postMessage; the worker uploads them as uniforms into the OffscreenCanvas’s WebGL2 context.

What an FFT actually does, and why we need one

Audio in a computer is just a long list of numbers: at 44.1 kHz, 44,100 amplitude readings per second, one per channel. That list tells you when the speaker cone moves, but it does not directly tell you what is bass, what is mid, or what is treble in any given moment. To split a song into "how loud is each frequency right now," you need to convert a chunk of those samples from a timeline of amplitudes into a histogram of frequencies.

The conversion we want is the discrete Fourier transform (DFT). Feed in a block of N samples (here, 2,048 samples — about 46 ms of audio). Get back a frequency report: a row of N "buckets" (the textbook calls them frequency bins or DFT coefficients). Each bucket is a single complex number whose magnitude says how much of one frequency was present in the block, and whose phase says where that wave was in its cycle.

At a 44.1 kHz sample rate, a 2,048-point DFT spaces buckets about 21.5 Hz apart: bucket 0 is 0 Hz / DC, bucket 1 is about 21.5 Hz, bucket 2 is about 43 Hz, and so on. For real-valued audio the upper half of the buckets is the conjugate mirror of the lower half (X[N−k] = X[k]*), so the useful display side is 0 Hz → Nyquist (about 22.05 kHz). The deck does not show every bucket directly — it sums ranges of buckets into musical bands: bass defaults to 40–140 Hz, mid to 150–2,000 Hz, and high to 2,000–10,000 Hz.

[Diagram: bucket map from DC to Nyquist — 0 Hz (DC), ~43 Hz, 2 kHz, 10 kHz, 22.05 kHz (Nyquist), with bass 40–140 Hz, mid 150 Hz–2 kHz, and high 2–10 kHz marked as ranges of evenly spaced buckets.]
Bucket map for a 44.1 kHz / 2,048-point FFT. The raw buckets are evenly spaced by about 21.5 Hz; bass, mid, and high are ranges of those buckets, not separate transforms.
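
In code the band split is plain index arithmetic over the magnitude array. A sketch — the band defaults come from the post; averaging over the range is an assumption about how the deck normalises:

band split sketch

const SAMPLE_RATE = 44100;
const FFT_SIZE = 2048;
const HZ_PER_BIN = SAMPLE_RATE / FFT_SIZE; // ≈ 21.5 Hz per bucket

// Average the magnitudes of the buckets that fall inside [loHz, hiHz].
function bandLevel(magnitudes, loHz, hiHz) {
  const lo = Math.max(1, Math.round(loHz / HZ_PER_BIN));              // skip bucket 0 (DC)
  const hi = Math.min(FFT_SIZE / 2 - 1, Math.round(hiHz / HZ_PER_BIN)); // stay in the half-spectrum
  let sum = 0;
  for (let k = lo; k <= hi; k += 1) sum += magnitudes[k];
  return sum / (hi - lo + 1);
}

const bass = bandLevel(mag, 40, 140);
const mid  = bandLevel(mag, 150, 2000);
const high = bandLevel(mag, 2000, 10000);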

Computed straight from the DFT formula X[k] = Σ x[m] · e^(−2πikm/N), this is two nested loops: for every output bucket k, walk every input sample x[m] and add up its contribution. In plain English: test "how much 86 Hz is in this block?", then scan all 2,048 samples; test "how much 108 Hz?", then scan the same 2,048 samples again. Doing that for every bucket is O(N²) work, which at N = 2,048 is roughly 4 million complex multiplies per block. We have about 5.8 ms of CPU between blocks; the direct approach will not fit that budget.
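
For reference, the naive version is short to write and only expensive to run — a direct transcription of the formula:

naive DFT, O(N²)

// Every output bucket k scans every input sample m: N² complex multiplies.
function dftNaive(x) {
  const N = x.length;
  const re = new Float32Array(N);
  const im = new Float32Array(N);
  for (let k = 0; k < N; k += 1) {
    for (let m = 0; m < N; m += 1) {
      const ang = (-2 * Math.PI * k * m) / N;
      re[k] += x[m] * Math.cos(ang);
      im[k] += x[m] * Math.sin(ang);
    }
  }
  return { re, im };
}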

A Fast Fourier Transform is any algorithm that computes the same DFT — same numbers in, same numbers out — in O(N log N) work instead of O(N²). The FFT does not invent a new transform; it factors the DFT's matrix of complex exponentials into sparse pieces that share most of their arithmetic. At N = 2,048, (N/2) · log₂(N) ≈ 11,000 complex multiplies produce the same result — roughly 370 times cheaper, easily inside budget.

The speedup comes from avoiding repeated work. In the direct DFT, every bucket starts over and rediscovers little sums that other buckets already touched. The FFT keeps those smaller results and reuses them.

Think of the final result as the full frequency report for the whole 2,048-sample block. The FFT builds that report out of smaller reports. A small report might cover only a tiny slice of the block. The next round combines neighboring small reports into bigger reports. The next round combines those again. By the time the FFT reaches the full 2,048-sample report, most of the work has already been reused.

The classic version of this trick is the Cooley–Tukey algorithm (1965). It is divide-and-conquer in the same shape as merge sort — split, solve, combine — except the combine step is a complex-number "butterfly" instead of a merge of sorted lists, and what gets reused are partial sums of complex exponentials, not sorted sub-sequences.

The "log" in O(N log N) is the number of combine rounds, not a logarithm applied to the audio. "Chunk size" here means how many original samples one intermediate report covers. The smallest reports cover one sample each. After one combine round, each report covers 2 samples. Then 4, then 8, then 16, and so on. Reaching the final 2,048-sample report takes eleven doublings because 2¹¹ = 2,048. The audio values stay normal sample amplitudes; the algorithm is just grouping them more efficiently.

FFT combine rounds double the covered samples: round 0 → 1 sample; round 1 → 2 samples; round 2 → 4 samples; … ; round 11 → 2,048 samples.
The "log" is the number of combine rounds. Each round doubles how many original samples one intermediate report covers.

Four implementation ideas show up directly in the worklet code below:

  1. Complex frequency buckets. Each output bucket needs to say two things: how much of that frequency is present, and where that wave sits in its cycle. The first is magnitude. The second is phase. One real number only carries one measurement, so the FFT uses a complex number: one part for cosine alignment, one part for sine alignment.
  2. Bit-reversed input order. Repeatedly splitting even-indexed vs odd-indexed samples ends with the sample at index i sitting at the position whose binary digits are i's binary digits read backwards (bitReverse(i)). The iterative FFT shuffles the input that way once at the start, then never moves data again — the rest of the algorithm is all in-place arithmetic.
  3. Twiddle factors. When two intermediate reports come from different positions in the input window, their phases do not line up automatically. A twiddle factor is the fixed complex number that rotates one report by the exact angle needed before it is mixed with the others. Mathematically each one is an Nth root of unity, W_N^k = e^(−2πik/N) — basically k/N of the way around the unit circle (in the negative direction, because the DFT formula uses e^(−i…)). They depend only on N, so we compute all of them once at module load and read them as plain table lookups in the hot loop. The name comes from Gentleman and Sande, 1966.
  4. Butterflies. The combine step itself. Each one takes a pair of intermediate complex values (a, b), computes (a + W·b, a − W·b), and writes both back over the originals. One multiply, two add/subtract pairs, no extra memory. If you draw the data flow on paper you get an X / butterfly shape, hence the name.
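
All four ideas fit in a compact iterative radix-2 FFT. This sketch is illustrative, not the project's code — in particular it computes twiddles inline where the worklet reads precomputed tables:

iterative radix-2 sketch

// In-place forward FFT; real/imag are Float32Arrays, N a power of two.
function fft(real, imag, N) {
  // Idea 2: bit-reversed input order, shuffled once up front.
  for (let i = 1, j = 0; i < N; i += 1) {
    let bit = N >> 1;
    for (; j & bit; bit >>= 1) j ^= bit;
    j ^= bit;
    if (i < j) {
      [real[i], real[j]] = [real[j], real[i]];
      [imag[i], imag[j]] = [imag[j], imag[i]];
    }
  }
  // Combine rounds: report size doubles each pass — the log in O(N log N).
  for (let len = 2; len <= N; len <<= 1) {
    const step = (-2 * Math.PI) / len; // idea 3: twiddle angle (tables in the worklet)
    for (let base = 0; base < N; base += len) {
      for (let k = 0; k < len / 2; k += 1) {
        const wr = Math.cos(step * k), wi = Math.sin(step * k);
        const a = base + k, b = a + len / 2;
        // Idea 1: the complex multiply carries magnitude and phase together.
        const qr = wr * real[b] - wi * imag[b];
        const qi = wr * imag[b] + wi * real[b];
        // Idea 4: the butterfly — (p, q) → (p + W·q, p − W·q), in place.
        real[b] = real[a] - qr; imag[b] = imag[a] - qi;
        real[a] += qr;          imag[a] += qi;
      }
    }
  }
}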

Radix-4 is a Cooley–Tukey variant that groups two radix-2 combine rounds into one larger four-way round wherever the sample count allows it. Each combined round handles four values at once. The transform it computes is identical to the radix-2 result; the factorisation just produces fewer stages and lets the code share some twiddle arithmetic.

Compute once, never again

Two values are precomputed at module load and reused by the audio thread on every hop:

slerp-band-processor.worklet.js

// Built once at module load; not recomputed at audio rate.
const FFT_SIZE = 2048; // window size; defined once for the whole worklet
const hannWindow = new Float32Array(FFT_SIZE);
for (let i = 0; i < FFT_SIZE; i += 1) {
  hannWindow[i] = 0.5 * (1 - Math.cos((2 * Math.PI * i) / (FFT_SIZE - 1)));
}

// 11-bit bit-reversal table for N=2048 (~4 KB).
const bitReverseTable = new Uint16Array(FFT_SIZE);
let j = 0;
for (let i = 1; i < FFT_SIZE; i += 1) {
  let bit = FFT_SIZE >> 1;
  for (; j & bit; bit >>= 1) j ^= bit;
  j ^= bit;
  bitReverseTable[i] = j;
}
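
The twiddle tables read by the butterfly loop further down follow the same compute-once pattern. A sketch — the table names match that loop, the full-circle sizing is an assumption:

twiddle tables (sketch)

// W_N^k = e^(−2πik/N), split into real/imag planes for flat lookups.
// A full-circle table of N entries covers the 3k index used for W³.
const twiddleReal = new Float32Array(FFT_SIZE);
const twiddleImag = new Float32Array(FFT_SIZE);
for (let k = 0; k < FFT_SIZE; k += 1) {
  const angle = (-2 * Math.PI * k) / FFT_SIZE;
  twiddleReal[k] = Math.cos(angle);
  twiddleImag[k] = Math.sin(angle);
}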

Radix-2 → radix-4: the optimisation that didn't matter

The pipeline was hot while this block was tuned, so the work started in chrome://tracing, not the normal DevTools panels. The main-thread view is good for layout and UI work, but the worklet runs on its own audio thread; to see where the FFT actually lands, you need a trace that includes the AudioWorkletProcessor.process() calls.

Under that lens, the FFT was not the bottleneck; the cost sat elsewhere. The radix-4 change still landed (small diff, fewer multiplies on paper), but measured hop time on the worklet barely moved afterwards.

The exact sample count matters here. A 2,048-point FFT has eleven radix-2 stages because 2¹¹ = 2,048. Every radix-2 stage doubles the size of the intermediate reports: 2 samples, then 4, then 8, and so on until the full 2,048-sample frequency report is assembled.

Radix-4 groups those stages in pairs. Since one four-way combine covers the same growth as two two-way combines, the code can replace most of the eleven radix-2 stages with radix-4 stages. The useful factorisation is 2,048 = 2 · 4 · 4 · 4 · 4 · 4: one leftover radix-2 stage, then five radix-4 stages.

[Diagram: 2,048 = 2¹¹ = 2 · 4⁵ — eleven radix-2 stages regrouped into one radix-2 stage plus five radix-4 stages, six stages in total.]
Radix-4 does not change the transform. It groups pairs of radix-2 stages into larger four-way stages, leaving one two-way stage because 2,048 has an odd number of factors of two.

That factorisation is not the only true one. Writing 2,048 as eleven factors of two is also correct, and that is the plain radix-2 FFT. The reason 2 · 4⁵ is useful is that it packs in as many four-way stages as possible without changing the window size. It is the same transform, just grouped to do less repeated work.

The repeated unit of work is still a butterfly. A butterfly is not a new signal operation; it is the small "combine these reports into a bigger report" block inside the FFT. The radix tells you how many values go into that block: two values for radix-2, four values for radix-4.

The four values entering a radix-4 butterfly are already complex frequency measurements from smaller reports. Complex numbers are useful because a frequency measurement needs both magnitude and phase. Magnitude is "how much of this frequency is present." Phase is "where in the wave cycle is it?"

Phase matters because two matching waves can add very differently depending on where they are in the cycle. If both are at a peak, they reinforce. If one is at a peak while the other is halfway around its cycle, they cancel. The FFT combines frequency measurements that came from different positions in the input window, so it has to correct those phase offsets before adding them together.

That correction is the twiddle rotation. Picture a complex number as a point on an x/y plane: real is x, imaginary is y. Multiplying by a unit-length complex number turns that point around the origin without changing its length. In FFT terms, the amount stays the same, but the phase is moved to the right place.

[Diagram: twiddle rotation on the complex plane — p, p·W, and p·W² sit on the same circle around the origin; multiplying by W keeps the length and turns the angle, so magnitude stays while phase changes.]
A twiddle multiply turns a complex value around the origin. The point stays on the same circle, so its magnitude is unchanged; only the phase angle moves.

A radix-4 butterfly needs three of those rotations: W, W², and W³. They are not computed from scratch in the audio loop; they come from the precomputed twiddle tables. If the table index for W is k, then W² is at 2k and W³ is at 3k. The code gets those indices with one double and one add, then reads the table entries.

That is the whole optimisation in this block: load four complex values, rotate three of them, then combine all four into four outputs. Compared with doing the same work as separate radix-2 stages, the grouped version saves roughly 25% of the real-number multiplies on paper.

There is one special case in every stage: k = 0. At that index, all three rotations equal (1, 0), which means "do not rotate." The code handles that case separately so it can skip the complex multiplies entirely. The loop below is only the general case: k > 0.

Reading the loop is easier with four small translations in mind: L is the span of one quarter-report, so i0…i3 are the four lanes of a single butterfly; u1/u2/u3 are W, W², and W³ read from the twiddle tables; p0…p3 are the four complex inputs; and t0…t3 are the intermediate sums just before the four final writes.

inner radix-4 butterfly

for (let k = 1; k < L; k += 1) {
  const i0 = i + k;
  const i1 = i0 + L;
  const i2 = i1 + L;
  const i3 = i2 + L;

  const u1Idx = k * tableStride;
  const u2Idx = u1Idx << 1;        // W²: index doubles, no multiply
  const u3Idx = u1Idx + u2Idx;     // W³: still just an add

  const u1r = twiddleReal[u1Idx], u1i = twiddleImag[u1Idx];
  const u2r = twiddleReal[u2Idx], u2i = twiddleImag[u2Idx];
  const u3r = twiddleReal[u3Idx], u3i = twiddleImag[u3Idx];

  const p0r = real[i0], p0i = imag[i0];
  const p1r = real[i1], p1i = imag[i1];
  const p2r = real[i2], p2i = imag[i2];
  const p3r = real[i3], p3i = imag[i3];

  const br = u2r * p1r - u2i * p1i;  // p1 · W²
  const bi = u2r * p1i + u2i * p1r;
  const cr = u1r * p2r - u1i * p2i;  // p2 · W
  const ci = u1r * p2i + u1i * p2r;
  const dr = u3r * p3r - u3i * p3i;  // p3 · W³
  const di = u3r * p3i + u3i * p3r;

  const t0r = p0r + br, t0i = p0i + bi;
  const t1r = p0r - br, t1i = p0i - bi;
  const t2r = cr + dr,  t2i = ci + di;
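  // t3 = i·(c − d): a quarter-turn rotation done with swaps and sign flips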
  const t3r = di - ci,  t3i = cr - dr;

  real[i0] = t0r + t2r; imag[i0] = t0i + t2i;
  real[i1] = t1r - t3r; imag[i1] = t1i - t3i;
  real[i2] = t0r - t2r; imag[i2] = t0i - t2i;
  real[i3] = t1r + t3r; imag[i3] = t1i + t3i;
}

Further reading

Wikipedia has solid introductory articles for each of the pieces above, in roughly the order they appear in the post: Discrete Fourier transform, Nyquist frequency, Fast Fourier transform, Cooley–Tukey FFT algorithm, Bit-reversal permutation, Twiddle factor, Butterfly diagram, and Hann function.

Where the time actually went

Two bottlenecks showed up in Chrome’s Performance panel; neither was the FFT.

Layout thrash on panel-open

Compact mode stuttered for ~500 ms when the side panel mounted. The flame chart showed a long task with a familiar shape: read geometry, write a style, read geometry again. Each read forced a synchronous layout that the previous write had invalidated.

the pattern (synthesised, real shape)

// Before: read → write → read forces a layout per section.
for (const section of sections) {
  const w = panel.offsetWidth;            // read
  section.style.padding = `${w / 16}px`;     // write (invalidates layout)
  const h = section.offsetHeight;         // read forces layout *again*
  section.style.transform = `translateY(${h}px)`;
}

// After: batch reads, then writes. One layout pass total.
const widths  = sections.map(() => panel.offsetWidth);
const heights = sections.map((s) => s.offsetHeight);
for (const [i, section] of sections.entries()) {
  section.style.padding   = `${widths[i] / 16}px`;
  section.style.transform = `translateY(${heights[i]}px)`;
}

Faster to spot in the flame chart than in code review. Batching reads removed the cliff.

Texture thrash on resize

On mobile, every rotation reallocated framebuffer textures at the new viewport size, so the GPU paid for each allocation inside the frame that handled the resize. Fix: snap render scale to tiers, debounce resize, and reuse the texture handle within a tier when possible.
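
A sketch of that policy — the tier values, delay, and helper names are assumptions:

resize tier sketch

// Snap render scale to coarse tiers so small viewport changes never
// force a texture reallocation; only a tier change reallocates.
const TIERS = [0.5, 0.75, 1.0]; // fractions of full resolution (assumed)
let currentTier = 1.0;
let resizeTimer = 0;

function pickTier(want) {
  for (const t of TIERS) if (t >= want) return t; // smallest tier that covers
  return 1.0;
}

function onViewportResize(cssWidth, dpr) {
  clearTimeout(resizeTimer);
  resizeTimer = setTimeout(() => {
    const want = Math.min(1, (cssWidth * dpr) / FULL_RENDER_WIDTH); // assumed cap
    const tier = pickTier(want);
    if (tier === currentTier) return; // same tier: reuse the existing textures
    currentTier = tier;
    reallocateFramebuffers(tier);     // hypothetical helper — the only alloc path
  }, 150);
}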

Same failure mode as panel-open: read/write churn and allocation, not the inner loop.

Worker-owned rendering

Main thread: panel, file input, scroll, audio bridge, feature checks. Worker: transferred canvas, WebGL2, shaders, resize tiers, its own per-frame paint loop. The in-process renderer stays available as a fallback behind the same API as the worker proxy.

facade branch: worker path first, fallback second

const canOffscreen =
  typeof OffscreenCanvas !== "undefined" &&
  typeof Worker !== "undefined" &&
  typeof canvas.transferControlToOffscreen === "function";

renderer = canOffscreen
  ? createWorkerRenderer(rendererOpts)
  : null;

if (!renderer) {
  renderer = createMainThreadRenderer(rendererOpts);
}

If detection fails or transferControlToOffscreen() / new Worker() throws, createWorkerRenderer() returns null and the facade drops back to an in-process renderer. After a canvas transfers successfully, GL stays inside the worker; the main thread only sends setter-shaped messages.

main-thread proxy: transfer, serialize, dispatch

const off = canvas.transferControlToOffscreen();
const worker = createRenderWorker();

worker.postMessage(
  { type: "init", canvas: off, opts: buildWireInit(opts) },
  [off],
);

The adapter maps non-serializable pieces to wire-safe data: qualityUniforms becomes small tables by quality tier; UI-only shapes like format never cross the boundary. Structured clone rejects functions, so the protocol cannot be "whatever the editor had on hand."
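
A sketch of that adapter shape — qualityUniforms and format come from the post; treating qualityUniforms as a function of tier, and the tier names themselves, are assumptions:

wire-init adapter (sketch)

// Strip what structured clone rejects; precompute functions into plain data.
function buildWireInit(opts) {
  const { format, qualityUniforms, ...rest } = opts; // format stays UI-side
  const tiers = ["low", "medium", "high"]; // assumed tier names
  return {
    ...rest,
    // a function of quality tier becomes a lookup table of plain values
    qualityTables: Object.fromEntries(
      tiers.map((tier) => [tier, qualityUniforms(tier)]),
    ),
  };
}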

worker entry: same renderer factory, message pump in front

self.addEventListener("message", (ev) => {
  const msg = ev.data;
  if (msg.type === "init") return handleInit(msg.canvas, msg.opts);
  switch (msg.type) {
    case "setAudioLevels": renderer.setAudioLevels(msg.bass, msg.mid, msg.high); break;
    case "setViewport":    renderer.setViewport(msg.cssWidth, msg.cssHeight, msg.dpr); break;
    case "setScroll":      renderer.setScroll(msg.value); break;
    // …setShader, rebindShaderSpec, setTuneSnapshot, pause, resume, destroy…
  }
});

The worker entry reconstructs wire-safe shader specs, calls the same renderer factory the main thread can fall back to, then handles commands straight through: audio levels, viewport, scroll, shader swaps, tune snapshots, pulse, pause, resume, destroy.

v1 still sends three floats over postMessage per frame; cost is negligible on the main thread, and there is no SharedArrayBuffer reader in the worker yet. Wiring the worker to read the band bus on its own timer can wait until a trace proves it is worth it.