
Build notes

Building a browser VJ deck with AI: 97% prompt cache rate and $1.2k in API spend

slerp.audio is a browser-based VJ deck: local audio, shader controls, AudioWorklet DSP, worker-owned WebGL. The code came out of Cursor from specs, failing tests, bugs, and profiler screenshots. I did not write any code myself. This project took 13 days from start to finish.

97.2% on-demand input-side tokens served from prompt cache
$1,215 on-demand spend, 13-day window
2.58B on-demand tokens read from prompt cache
4.49B all usage (billed + subscription-included)

Token reuse

Cursor separates On-Demand (you pay per use) from Included (counted against subscription). Across this window the two together add up to about 4.49 billion tokens in Cursor’s totals. The table below breaks that out; charts and dollar figures further down use on-demand only, because that is what hits the invoice.

[Table: token volume by billing kind (billed vs subscription-included) — columns: Kind, Usage lines, Total tokens (B), Prompt cache reads (B). The combined row ignores two failed runs that carried no charge and no tokens.]

Most billed input was not fresh typing. It was cached project context while long agent threads kept touching the same paths and decisions.

[Table: on-demand input-token mix (paid rows only) — columns: Bucket, Tokens (B), Share.]

What that means in dollars

2.58B cached input tokens sat on the billed slice. At $0.50 per million cached tokens, those reads alone price out at about $1.29k. The full $1,214.84 on-demand line also covers output, cache writes, and the small fresh slice that missed cache.

[Table: cached-input cost from billed (on-demand) usage — columns: Measure, USD.]

Where the $1,215 went

Same 13-day window by model and Max Mode. Bigger implementation passes skewed toward Max; day-to-day edits skewed toward normal mode.

[Table: on-demand spend by model and Max Mode — columns: Model, Max, Usage lines, Cost, Share, Avg $/line. "Usage lines" counts how many billed requests fell into each bucket, not rows of this page.]

thinking-xhigh is most of the money (~92.6%). Max Mode is ~30.4% of spend on ~20.9% of lines (roughly 65% more $/line than the same model without Max). Those lines tend to be bigger pushes (radix-4 pass, SAB wiring, worker migration, long perf sessions).

What kept the cache hot

Long-lived context on one tree. Boring hygiene beat clever prompts.

  1. Long threads tied to one project. A transcript pass counted roughly 685 user messages and 8,300 assistant messages across 12 top-level project threads. The build stayed in long-running agent threads instead of disposable one-off chats. Same context, same files, repeats hit the cached prefix instead of rewriting it.
  2. Stable file paths. Renaming or relocating a file mid-thread invalidates context about that file, so rename refactors waited until the active thread was winding down.
  3. Short status checkpoints over giant re-prompts. "What shipped, what broke, what's next" in a few lines instead of pasting the whole tree each time (example after this list).
  4. One repro at a time when debugging. Mixing two unrelated bugs in one transcript inflates the prompt and rarely converges either bug faster.
  5. Incremental commits. Smaller diffs keep summaries anchored in the same tree snapshot the assistant was already reasoning about.
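
For the checkpoint style in item 3, a typical message was a few plain lines rather than a re-paste of the tree. Illustrative, not a real transcript line:

example status checkpoint

Shipped: radix-4 stages landed behind the same process() entry; tests green.
Broke: bass band reads zero when the tab is backgrounded.
Next: trace the worklet thread before touching the EMA attack constant.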

The audio engine

The shader cannot use raw music directly. It needs a few stable control signals it can read every frame: how hard the bass is hitting, how much midrange is present, how bright the top end is. Same bands as a DJ EQ, but read for animation instead of cut for sound.

The browser gives the worklet PCM samples: amplitude over time. That is useful for playback, but not for “how much bass is happening right now.” The missing step is frequency analysis. The FFT turns a short slice of audio into frequency energy; the engine then groups that energy into bass, mid, and high bands.

Concretely: a 2,048-point FFT runs every 256-sample hop (~5.8 ms at 44.1 kHz) inside an AudioWorkletProcessor. The band values are smoothed, written into a SharedArrayBuffer, and read by the renderer once per frame. No postMessage on the hot path.
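
A minimal sketch of that handoff, with the smoothing and processor plumbing omitted (variable names assumed):

band bus sketch

// Three floats shared across threads: [bass, mid, high].
// SharedArrayBuffer is only available once the COOP + COEP headers are set.
const sab = new SharedArrayBuffer(3 * Float32Array.BYTES_PER_ELEMENT);
const bands = new Float32Array(sab);

// Audio thread, once per 256-sample hop (smoothed band values assumed computed):
bands[0] = bassSmoothed;
bands[1] = midSmoothed;
bands[2] = highSmoothed;

// Main thread, once per render frame — plain reads, no postMessage:
const bass = bands[0], mid = bands[1], high = bands[2];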

[Diagram: audio engine architecture and DSP signal flow. Audio thread: PCM samples on a 256-sample hop → Hann window (precomputed once at module load) → 2,048-point radix-4 FFT (one r2 stage + five merged r4 stages) → per-bin |X|² in dB → bass/mid/high integration + EMA smoother (attack/release) → SharedArrayBuffer (zero-copy, lock-free, behind COOP + COEP). Main thread: per-frame read, then postMessage(bass, mid, high) — three floats per render frame — across the transferControlToOffscreen boundary to a render worker that owns the GL context and runs its own paint loop, uploading uBass, uMid, and uHigh into a WebGL2 fragment shader on an OffscreenCanvas.]
Signal flow: every 256 samples, the audio thread fills the shared buffer; every render frame, the main thread forwards three floats to the render worker via postMessage; the worker uploads them as uniforms into the OffscreenCanvas’s WebGL2 context.

What an FFT actually does, and why we need one

Audio in a computer is just a long list of numbers: at 44.1 kHz, 44,100 amplitude readings per second, one per channel. That list tells you when the speaker cone moves, but it does not directly tell you what is bass, what is mid, or what is treble in any given moment. To split a song into "how loud is each frequency right now," you need to convert a chunk of those samples from a timeline of amplitudes into a histogram of frequencies.

The conversion we want is the discrete Fourier transform (DFT). Feed in a block of N samples (here, 2,048 samples — about 46 ms of audio). Get back a frequency report: a row of N "buckets" (the textbook calls them frequency bins or DFT coefficients). Each bucket is a single complex number whose magnitude says how much of one frequency was present in the block, and whose phase says where that wave was in its cycle.

At a 44.1 kHz sample rate, a 2,048-point DFT spaces buckets about 21.5 Hz apart: bucket 0 is 0 Hz / DC, bucket 1 is about 21.5 Hz, bucket 2 is about 43 Hz, and so on. For real-valued audio the upper half of the buckets is the conjugate mirror of the lower half (X[N−k] = X[k]*), so the useful display side is 0 Hz → Nyquist (about 22.05 kHz). The deck does not show every bucket directly — it sums ranges of buckets into musical bands: bass defaults to 40–140 Hz, mid to 150–2,000 Hz, and high to 2,000–10,000 Hz.

[Diagram: bucket map from DC to Nyquist — 0 Hz (DC), ~43 Hz, 2 kHz, 10 kHz, 22.05 kHz (Nyquist), with bass 40–140 Hz, mid 150 Hz–2 kHz, and high 2–10 kHz marked as ranges of evenly spaced buckets.]
Bucket map for a 44.1 kHz / 2,048-point FFT. The raw buckets are evenly spaced by about 21.5 Hz; bass, mid, and high are ranges of those buckets, not separate transforms.
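
In code the band split is plain index arithmetic over the magnitude array. A sketch — the band defaults come from the post; averaging over the range is an assumption about how the deck normalises:

band split sketch

const SAMPLE_RATE = 44100;
const FFT_SIZE = 2048;
const HZ_PER_BIN = SAMPLE_RATE / FFT_SIZE; // ≈ 21.5 Hz per bucket

// Average the magnitudes of the buckets that fall inside [loHz, hiHz].
function bandLevel(magnitudes, loHz, hiHz) {
  const lo = Math.max(1, Math.round(loHz / HZ_PER_BIN));              // skip bucket 0 (DC)
  const hi = Math.min(FFT_SIZE / 2 - 1, Math.round(hiHz / HZ_PER_BIN)); // stay in the half-spectrum
  let sum = 0;
  for (let k = lo; k <= hi; k += 1) sum += magnitudes[k];
  return sum / (hi - lo + 1);
}

const bass = bandLevel(mag, 40, 140);
const mid  = bandLevel(mag, 150, 2000);
const high = bandLevel(mag, 2000, 10000);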

Computed straight from the DFT formula X[k] = Σ x[m] · e^(−2πikm/N), this is two nested loops: for every output bucket k, walk every input sample x[m] and add up its contribution. In plain English: test "how much 86 Hz is in this block?", then scan all 2,048 samples; test "how much 108 Hz?", then scan the same 2,048 samples again. Doing that for every bucket is O(N²) work, which at N = 2,048 is roughly 4 million complex multiplies per block. We have about 5.8 ms of CPU between blocks; the direct approach will not fit that budget.
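
For reference, the naive version is short to write and only expensive to run — a direct transcription of the formula:

naive DFT, O(N²)

// Every output bucket k scans every input sample m: N² complex multiplies.
function dftNaive(x) {
  const N = x.length;
  const re = new Float32Array(N);
  const im = new Float32Array(N);
  for (let k = 0; k < N; k += 1) {
    for (let m = 0; m < N; m += 1) {
      const ang = (-2 * Math.PI * k * m) / N;
      re[k] += x[m] * Math.cos(ang);
      im[k] += x[m] * Math.sin(ang);
    }
  }
  return { re, im };
}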

A Fast Fourier Transform is any algorithm that computes the same DFT — same numbers in, same numbers out — in O(N log N) work instead of O(N²). The FFT does not invent a new transform; it factors the DFT's matrix of complex exponentials into sparse pieces that share most of their arithmetic. At N = 2,048, (N/2) · log₂(N) ≈ 11,000 complex multiplies produce the same result — roughly 370 times cheaper, easily inside budget.

The speedup comes from avoiding repeated work. In the direct DFT, every bucket starts over and rediscovers little sums that other buckets already touched. The FFT keeps those smaller results and reuses them.

Think of the final result as the full frequency report for the whole 2,048-sample block. The FFT builds that report out of smaller reports. A small report might cover only a tiny slice of the block. The next round combines neighboring small reports into bigger reports. The next round combines those again. By the time the FFT reaches the full 2,048-sample report, most of the work has already been reused.

The classic version of this trick is the Cooley–Tukey algorithm (1965). It is divide-and-conquer in the same shape as merge sort — split, solve, combine — except the combine step is a complex-number "butterfly" instead of a merge of sorted lists, and what gets reused are partial sums of complex exponentials, not sorted sub-sequences.

The "log" in O(N log N) is the number of combine rounds, not a logarithm applied to the audio. "Chunk size" here means how many original samples one intermediate report covers. The smallest reports cover one sample each. After one combine round, each report covers 2 samples. Then 4, then 8, then 16, and so on. Reaching the final 2,048-sample report takes eleven doublings because 2¹¹ = 2,048. The audio values stay normal sample amplitudes; the algorithm is just grouping them more efficiently.

FFT combine rounds double the covered samples: round 0 → 1 sample; round 1 → 2 samples; round 2 → 4 samples; … ; round 11 → 2,048 samples.
The "log" is the number of combine rounds. Each round doubles how many original samples one intermediate report covers.

Four implementation ideas show up directly in the worklet code below:

  1. Complex frequency buckets. Each output bucket needs to say two things: how much of that frequency is present, and where that wave sits in its cycle. The first is magnitude. The second is phase. One real number only carries one measurement, so the FFT uses a complex number: one part for cosine alignment, one part for sine alignment.
  2. Bit-reversed input order. Repeatedly splitting even-indexed vs odd-indexed samples ends with the sample at index i sitting at the position whose binary digits are i's binary digits read backwards (bitReverse(i)). The iterative FFT shuffles the input that way once at the start, then never moves data again — the rest of the algorithm is all in-place arithmetic.
  3. Twiddle factors. When two intermediate reports come from different positions in the input window, their phases do not line up automatically. A twiddle factor is the fixed complex number that rotates one report by the exact angle needed before it is mixed with the others. Mathematically each one is an Nth root of unity, W_N^k = e^(−2πik/N) — basically k/N of the way around the unit circle (in the negative direction, because the DFT formula uses e^(−i…)). They depend only on N, so we compute all of them once at module load and read them as plain table lookups in the hot loop. The name comes from Gentleman and Sande, 1966.
  4. Butterflies. The combine step itself. Each one takes a pair of intermediate complex values (a, b), computes (a + W·b, a − W·b), and writes both back over the originals. One multiply, two add/subtract pairs, no extra memory. If you draw the data flow on paper you get an X / butterfly shape, hence the name.
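
All four ideas fit in a compact iterative radix-2 FFT. This sketch is illustrative, not the project's code — in particular it computes twiddles inline where the worklet reads precomputed tables:

iterative radix-2 sketch

// In-place forward FFT; real/imag are Float32Arrays, N a power of two.
function fft(real, imag, N) {
  // Idea 2: bit-reversed input order, shuffled once up front.
  for (let i = 1, j = 0; i < N; i += 1) {
    let bit = N >> 1;
    for (; j & bit; bit >>= 1) j ^= bit;
    j ^= bit;
    if (i < j) {
      [real[i], real[j]] = [real[j], real[i]];
      [imag[i], imag[j]] = [imag[j], imag[i]];
    }
  }
  // Combine rounds: report size doubles each pass — the log in O(N log N).
  for (let len = 2; len <= N; len <<= 1) {
    const step = (-2 * Math.PI) / len; // idea 3: twiddle angle (tables in the worklet)
    for (let base = 0; base < N; base += len) {
      for (let k = 0; k < len / 2; k += 1) {
        const wr = Math.cos(step * k), wi = Math.sin(step * k);
        const a = base + k, b = a + len / 2;
        // Idea 1: the complex multiply carries magnitude and phase together.
        const qr = wr * real[b] - wi * imag[b];
        const qi = wr * imag[b] + wi * real[b];
        // Idea 4: the butterfly — (p, q) → (p + W·q, p − W·q), in place.
        real[b] = real[a] - qr; imag[b] = imag[a] - qi;
        real[a] += qr;          imag[a] += qi;
      }
    }
  }
}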

Radix-4 is a Cooley–Tukey variant that groups two radix-2 combine rounds into one larger four-way round wherever the sample count allows it. Each combined round handles four values at once. The transform it computes is identical to the radix-2 result; the factorisation just produces fewer stages and lets the code share some twiddle arithmetic.

Compute once, never again

Two values are precomputed at module load and reused by the audio thread on every hop:

slerp-band-processor.worklet.js

// Built once at module load; not recomputed at audio rate.
const FFT_SIZE = 2048; // window size; defined once for the whole worklet
const hannWindow = new Float32Array(FFT_SIZE);
for (let i = 0; i < FFT_SIZE; i += 1) {
  hannWindow[i] = 0.5 * (1 - Math.cos((2 * Math.PI * i) / (FFT_SIZE - 1)));
}

// 11-bit bit-reversal table for N=2048 (~4 KB).
const bitReverseTable = new Uint16Array(FFT_SIZE);
let j = 0;
for (let i = 1; i < FFT_SIZE; i += 1) {
  let bit = FFT_SIZE >> 1;
  for (; j & bit; bit >>= 1) j ^= bit;
  j ^= bit;
  bitReverseTable[i] = j;
}
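
The twiddle tables read by the butterfly loop further down follow the same compute-once pattern. A sketch — the table names match that loop, the full-circle sizing is an assumption:

twiddle tables (sketch)

// W_N^k = e^(−2πik/N), split into real/imag planes for flat lookups.
// A full-circle table of N entries covers the 3k index used for W³.
const twiddleReal = new Float32Array(FFT_SIZE);
const twiddleImag = new Float32Array(FFT_SIZE);
for (let k = 0; k < FFT_SIZE; k += 1) {
  const angle = (-2 * Math.PI * k) / FFT_SIZE;
  twiddleReal[k] = Math.cos(angle);
  twiddleImag[k] = Math.sin(angle);
}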

Radix-2 → radix-4: the optimisation that didn't matter

The pipeline was hot while this block was tuned, so the work started in chrome://tracing, not the normal DevTools panels. The main-thread view is good for layout and UI work, but the worklet runs on its own audio thread; to see where the FFT actually lands, you need a trace that includes the AudioWorkletProcessor.process() calls.

Under that lens, the FFT was not the bottleneck; the cost sat elsewhere. The radix-4 change still landed (small diff, fewer multiplies on paper), but measured hop time on the worklet barely moved afterwards.

The exact sample count matters here. A 2,048-point FFT has eleven radix-2 stages because 2¹¹ = 2,048. Every radix-2 stage doubles the size of the intermediate reports: 2 samples, then 4, then 8, and so on until the full 2,048-sample frequency report is assembled.

Radix-4 groups those stages in pairs. Since one four-way combine covers the same growth as two two-way combines, the code can replace most of the eleven radix-2 stages with radix-4 stages. The useful factorisation is 2,048 = 2 · 4 · 4 · 4 · 4 · 4: one leftover radix-2 stage, then five radix-4 stages.

[Diagram: 2,048 = 2¹¹ = 2 · 4⁵ — eleven radix-2 stages regrouped into one radix-2 stage plus five radix-4 stages, six stages in total.]
Radix-4 does not change the transform. It groups pairs of radix-2 stages into larger four-way stages, leaving one two-way stage because 2,048 has an odd number of factors of two.

That factorisation is not the only true one. Writing 2,048 as eleven factors of two is also correct, and that is the plain radix-2 FFT. The reason 2 · 4⁵ is useful is that it packs in as many four-way stages as possible without changing the window size. It is the same transform, just grouped to do less repeated work.

The repeated unit of work is still a butterfly. A butterfly is not a new signal operation; it is the small "combine these reports into a bigger report" block inside the FFT. The radix tells you how many values go into that block: two values for radix-2, four values for radix-4.

The four values entering a radix-4 butterfly are already complex frequency measurements from smaller reports. Complex numbers are useful because a frequency measurement needs both magnitude and phase. Magnitude is "how much of this frequency is present." Phase is "where in the wave cycle is it?"

Phase matters because two matching waves can add very differently depending on where they are in the cycle. If both are at a peak, they reinforce. If one is at a peak while the other is halfway around its cycle, they cancel. The FFT combines frequency measurements that came from different positions in the input window, so it has to correct those phase offsets before adding them together.

That correction is the twiddle rotation. Picture a complex number as a point on an x/y plane: real is x, imaginary is y. Multiplying by a unit-length complex number turns that point around the origin without changing its length. In FFT terms, the amount stays the same, but the phase is moved to the right place.

[Diagram: twiddle rotation on the complex plane — p, p·W, and p·W² sit on the same circle around the origin; multiplying by W keeps the length and turns the angle, so magnitude stays while phase changes.]
A twiddle multiply turns a complex value around the origin. The point stays on the same circle, so its magnitude is unchanged; only the phase angle moves.

A radix-4 butterfly needs three of those rotations: W, W², and W³. They are not computed from scratch in the audio loop; they come from the precomputed twiddle tables. If the table index for W is k, then W² is at 2k and W³ is at 3k. The code gets those indices with one double and one add, then reads the table entries.

That is the whole optimisation in this block: load four complex values, rotate three of them, then combine all four into four outputs. Compared with doing the same work as separate radix-2 stages, the grouped version saves roughly 25% of the real-number multiplies on paper.

There is one special case in every stage: k = 0. At that index, all three rotations equal (1, 0), which means "do not rotate." The code handles that case separately so it can skip the complex multiplies entirely. The loop below is only the general case: k > 0.

Reading the loop is easier with four small translations in mind: L is the span of one quarter-report, so i0…i3 are the four lanes of a single butterfly; u1/u2/u3 are W, W², and W³ read from the twiddle tables; p0…p3 are the four complex inputs; and t0…t3 are the intermediate sums just before the four final writes.

inner radix-4 butterfly

for (let k = 1; k < L; k += 1) {
  const i0 = i + k;
  const i1 = i0 + L;
  const i2 = i1 + L;
  const i3 = i2 + L;

  const u1Idx = k * tableStride;
  const u2Idx = u1Idx << 1;        // W²: index doubles, no multiply
  const u3Idx = u1Idx + u2Idx;     // W³: still just an add

  const u1r = twiddleReal[u1Idx], u1i = twiddleImag[u1Idx];
  const u2r = twiddleReal[u2Idx], u2i = twiddleImag[u2Idx];
  const u3r = twiddleReal[u3Idx], u3i = twiddleImag[u3Idx];

  const p0r = real[i0], p0i = imag[i0];
  const p1r = real[i1], p1i = imag[i1];
  const p2r = real[i2], p2i = imag[i2];
  const p3r = real[i3], p3i = imag[i3];

  const br = u2r * p1r - u2i * p1i;  // p1 · W²
  const bi = u2r * p1i + u2i * p1r;
  const cr = u1r * p2r - u1i * p2i;  // p2 · W
  const ci = u1r * p2i + u1i * p2r;
  const dr = u3r * p3r - u3i * p3i;  // p3 · W³
  const di = u3r * p3i + u3i * p3r;

  const t0r = p0r + br, t0i = p0i + bi;
  const t1r = p0r - br, t1i = p0i - bi;
  const t2r = cr + dr,  t2i = ci + di;
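  // t3 = i·(c − d): a quarter-turn rotation done with swaps and sign flips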
  const t3r = di - ci,  t3i = cr - dr;

  real[i0] = t0r + t2r; imag[i0] = t0i + t2i;
  real[i1] = t1r - t3r; imag[i1] = t1i - t3i;
  real[i2] = t0r - t2r; imag[i2] = t0i - t2i;
  real[i3] = t1r + t3r; imag[i3] = t1i + t3i;
}

Further reading

Wikipedia has solid introductory articles for each of the pieces above, in roughly the order they appear in the post: Discrete Fourier transform, Nyquist frequency, Fast Fourier transform, Cooley–Tukey FFT algorithm, Bit-reversal permutation, Twiddle factor, Butterfly diagram, and Hann function.

Where the time actually went

Two bottlenecks showed up in Chrome’s Performance panel; neither was the FFT.

Layout thrash on panel-open

Compact mode stuttered for ~500 ms when the side panel mounted. The flame chart showed a long task with a familiar shape: read geometry, write a style, read geometry again. Each read forced a synchronous layout that the previous write had invalidated.

the pattern (synthesised, real shape)

// Before: read → write → read forces a layout per section.
for (const section of sections) {
  const w = panel.offsetWidth;            // read
  section.style.padding = `${w / 16}px`;     // write (invalidates layout)
  const h = section.offsetHeight;         // read forces layout *again*
  section.style.transform = `translateY(${h}px)`;
}

// After: batch reads, then writes. One layout pass total.
const widths  = sections.map(() => panel.offsetWidth);
const heights = sections.map((s) => s.offsetHeight);
for (const [i, section] of sections.entries()) {
  section.style.padding   = `${widths[i] / 16}px`;
  section.style.transform = `translateY(${heights[i]}px)`;
}

Faster to spot in the flame chart than in code review. Batching reads removed the cliff.

Texture thrash on resize

On mobile, every rotation reallocated framebuffer textures at the new viewport size, so the GPU paid for each allocation inside the frame that handled the resize. Fix: snap render scale to tiers, debounce resize, and reuse the texture handle within a tier when possible.
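
A sketch of that policy — the tier values, delay, and helper names are assumptions:

resize tier sketch

// Snap render scale to coarse tiers so small viewport changes never
// force a texture reallocation; only a tier change reallocates.
const TIERS = [0.5, 0.75, 1.0]; // fractions of full resolution (assumed)
let currentTier = 1.0;
let resizeTimer = 0;

function pickTier(want) {
  for (const t of TIERS) if (t >= want) return t; // smallest tier that covers
  return 1.0;
}

function onViewportResize(cssWidth, dpr) {
  clearTimeout(resizeTimer);
  resizeTimer = setTimeout(() => {
    const want = Math.min(1, (cssWidth * dpr) / FULL_RENDER_WIDTH); // assumed cap
    const tier = pickTier(want);
    if (tier === currentTier) return; // same tier: reuse the existing textures
    currentTier = tier;
    reallocateFramebuffers(tier);     // hypothetical helper — the only alloc path
  }, 150);
}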

Same failure mode as panel-open: read/write churn and allocation, not the inner loop.

Worker-owned rendering

Main thread: panel, file input, scroll, audio bridge, feature checks. Worker: transferred canvas, WebGL2, shaders, resize tiers, its own per-frame paint loop. The in-process renderer stays available as a fallback behind the same API as the worker proxy.

facade branch: worker path first, fallback second

const canOffscreen =
  typeof OffscreenCanvas !== "undefined" &&
  typeof Worker !== "undefined" &&
  typeof canvas.transferControlToOffscreen === "function";

renderer = canOffscreen
  ? createWorkerRenderer(rendererOpts)
  : null;

if (!renderer) {
  renderer = createMainThreadRenderer(rendererOpts);
}

If detection fails or transferControlToOffscreen() / new Worker() throws, createWorkerRenderer() returns null and the facade drops back to an in-process renderer. After a canvas transfers successfully, GL stays inside the worker; the main thread only sends setter-shaped messages.

main-thread proxy: transfer, serialize, dispatch

const off = canvas.transferControlToOffscreen();
const worker = createRenderWorker();

worker.postMessage(
  { type: "init", canvas: off, opts: buildWireInit(opts) },
  [off],
);

The adapter maps non-serializable pieces to wire-safe data: qualityUniforms becomes small tables by quality tier; UI-only shapes like format never cross the boundary. Structured clone rejects functions, so the protocol cannot be "whatever the editor had on hand."
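
A sketch of that adapter shape — qualityUniforms and format come from the post; treating qualityUniforms as a function of tier, and the tier names themselves, are assumptions:

wire-init adapter (sketch)

// Strip what structured clone rejects; precompute functions into plain data.
function buildWireInit(opts) {
  const { format, qualityUniforms, ...rest } = opts; // format stays UI-side
  const tiers = ["low", "medium", "high"]; // assumed tier names
  return {
    ...rest,
    // a function of quality tier becomes a lookup table of plain values
    qualityTables: Object.fromEntries(
      tiers.map((tier) => [tier, qualityUniforms(tier)]),
    ),
  };
}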

worker entry: same renderer factory, message pump in front

self.addEventListener("message", (ev) => {
  const msg = ev.data;
  if (msg.type === "init") return handleInit(msg.canvas, msg.opts);
  switch (msg.type) {
    case "setAudioLevels": renderer.setAudioLevels(msg.bass, msg.mid, msg.high); break;
    case "setViewport":    renderer.setViewport(msg.cssWidth, msg.cssHeight, msg.dpr); break;
    case "setScroll":      renderer.setScroll(msg.value); break;
    // …setShader, rebindShaderSpec, setTuneSnapshot, pause, resume, destroy…
  }
});

The worker entry reconstructs wire-safe shader specs, calls the same renderer factory the main thread can fall back to, then handles commands straight through: audio levels, viewport, scroll, shader swaps, tune snapshots, pulse, pause, resume, destroy.

v1 still sends three floats over postMessage per frame; cost is negligible on the main thread, and there is no SharedArrayBuffer reader in the worker yet. Wiring the worker to read the band bus on its own timer can wait until a trace proves it is worth it.