Reverse Doesn't Cut High Frequencies — It Cuts the Wrong High Frequencies
AI Music (Suno) Sound Quality Research — Notes from Phase N / O / v3.20
0. Intro — How Does Suno Audio Feel to You?
Suno is an AI service that generates music from text. You give it lyrics and a style, and out comes a vocal-pop track in tens of seconds. The quality is striking.
But if you spend a whole day listening to Suno output, many people report the same feeling:
"It's not bad — but I get a bit fatigued listening for a long time."
It's loud. It cuts through. The choruses build up. Yet something is subtly poking at you.
We've been researching whether the source of this "subtle poking" can be identified at the audio data level. The fruits of that research are baked into a browser-only mastering tool we call "torimasu (とりマス)".
Public URL: https://webmastering.pages.dev/ (no Ubuntu server, no GPU — runs entirely in the browser.)
This article is about torimasu's "reverse safe-d2" mode
torimasu has multiple modes — normal loudness-mastering, song classification, diagnosis. Each has its own purpose.
This article focuses on one experimental mode that specifically targets AI-generated audio (e.g. Suno). The URL is /safe-d2.html, accessible from the top page as "reverse safe-d2".
Public URL: https://webmastering.pages.dev/safe-d2.html
What does the reverse safe-d2 mode do?
In one sentence:
It detects "pathological processing traces" left in Suno output from the audio alone, and reverses them.
The design has two pillars:
- Layer 1 (Reverse) — Detects highshelf-type "processing traces" left in Suno output, and reverses them.
- Layer 2 (SafeD2) — Then smooths out micro-instabilities that cause listening fatigue.
This is not flashy mastering. The design philosophy is "don't add anything", "don't break anything", "lower the accident rate". (In contrast to traditional mastering = louder / flashier / more punch — this is a different beast.)
Where the research stands so far
This mode started at Phase E and progressed through Phase L step by step. Quick summary:
- Suno final WAV (= A) contains a highshelf-style "pathological" HF structure.
- Applying reverse processing to A makes it approach the Stems re-mixed back together (= T).
- After mastering (compression / limiter / loudness normalization), the improvement is amplified 1.4-1.6×.
- The detector runs a 3-stage gate: cv_ratio < 0.75 + persistence ≥ -30 dB/oct + confidence ≥ 0.80 (v3.6.x public).
Full chronology in Section 2.
But one big unknown remained
"Does this actually still work after going through YouTube/Spotify distribution?"
Even if the distance is small, it might evaporate the moment it hits AAC or Opus encoding for distribution. If that's true, all the research up to here would be "practically meaningless".
So we set up Phases N and O and checked it on real YouTube round-trips. While we were at it, we also tested a plausible follow-on hypothesis: "Maybe reverse-processing T itself makes it codec-resistant?"
- Phase N: The reverse improvement was preserved through YouTube. It actually expanded by +1.46×.
- Phase O: The "if reverse is good for A, it must be good for T too" hypothesis was completely rejected.
1. Vocabulary
Just three symbols to remember before we dive in:
| Symbol | Meaning |
|---|---|
| T | Stems re-mixed together — pre-master signal, our "reference" for verification. Made from Suno's "Get Stems" feature. |
| A | Suno final WAV. What you normally download. |
| revA | A with the reverse pipeline (highshelf cut) applied. |
Roughly:
- T = "what it should sound like" — used as the reference answer
- A = the Suno audio users actually listen to
- revA = what torimasu produces
One distance metric:
`distance(X, T)` — smaller = X is closer to T = closer to "the original sound"
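The concrete composite distance is defined in the reference docs listed at the end. As a minimal stand-in for intuition — not the research code — a log-magnitude STFT distance behaves the same way directionally: zero for identical signals, larger the further apart two spectra sit.

```python
import numpy as np

def stft_mag(x: np.ndarray, frame: int = 2048, hop: int = 512) -> np.ndarray:
    # Hann-windowed magnitude STFT, no external deps
    win = np.hanning(frame)
    n = 1 + (len(x) - frame) // hop
    return np.abs(np.array([np.fft.rfft(x[i*hop:i*hop + frame] * win)
                            for i in range(n)]))

def distance(x: np.ndarray, t: np.ndarray) -> float:
    """Toy stand-in for the article's composite distance(X, T):
    mean absolute log-magnitude spectral difference. Smaller = closer to T."""
    n = min(len(x), len(t))
    X, T = stft_mag(x[:n]), stft_mag(t[:n])
    return float(np.mean(np.abs(np.log10(X + 1e-9) - np.log10(T + 1e-9))))
```

Any metric with these two properties supports the arguments below; the exact weighting only changes the absolute numbers.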
2. The Path So Far — Phase E through L
Before getting to Phase N / O, let's quickly retrace the year-long path that brought us here. (Skipping internal detail — just "what did each Phase reveal?")
Phase E-0 — Why is A different from T?
The first question was simple:
"What's the actual difference between A (Suno final WAV) and T (Stems mix)?"
We compared them via STFT and ran a "per-axis sweep" (MP3 axis / stereo axis / limiter axis / EQ axis / mastering axis), checking which adjustment closes the gap. The result was very clear:
- MP3 axis: every bitrate made it worse → not the path
- Stereo narrowing: worse → not the path
- Limiter: no effect → not the path
- HF tilt +3 to +5 dB at 8 kHz: max improvement ★
- HF smoothing α=0.1 helps (some songs)
In other words, A was a version of T with HF EQ + light dynamics processing applied. Not "full mastering" with loudnorm.
Phase E-1 — Can we reverse it?
Now that we'd seen the degradation path, the next question was the inverse direction:
"Can shelf cut + light expansion bring A back to T?"
Result: yes.
- Semishigure: `highshelf=-5dB @10k` alone → composite distance Δ -0.144
- Yume wo Susuru: `highshelf=-5dB` + `compand expand_medium` → Δ -0.125
Mathematically clean: T→A's +5dB (Δ -0.13) and A→T's -5dB (Δ -0.14) are symmetric. This is when "the Reverse Mastering hypothesis" became viable.
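For readers who want to try the shelf themselves: a standard RBJ Audio-EQ-Cookbook high-shelf biquad with gain_db=-5 at 10 kHz approximates the Phase E-1 reverse filter. This is a sketch under that assumption — the actual pipeline's filter implementation may differ.

```python
import numpy as np

def highshelf_coeffs(gain_db: float, f0: float, sr: float):
    """RBJ Audio-EQ-Cookbook high-shelf biquad (shelf slope S = 1)."""
    A = 10.0 ** (gain_db / 40.0)
    w0 = 2 * np.pi * f0 / sr
    alpha = np.sin(w0) / 2 * np.sqrt(2.0)   # S = 1
    cosw, sqA = np.cos(w0), np.sqrt(A)
    b0 = A * ((A + 1) + (A - 1) * cosw + 2 * sqA * alpha)
    b1 = -2 * A * ((A - 1) + (A + 1) * cosw)
    b2 = A * ((A + 1) + (A - 1) * cosw - 2 * sqA * alpha)
    a0 = (A + 1) - (A - 1) * cosw + 2 * sqA * alpha
    a1 = 2 * ((A - 1) - (A + 1) * cosw)
    a2 = (A + 1) - (A - 1) * cosw - 2 * sqA * alpha
    return np.array([b0, b1, b2]) / a0, np.array([1.0, a1 / a0, a2 / a0])

def mag_at(b, a, freq: float, sr: float) -> float:
    # magnitude response of the biquad at one frequency
    z = np.exp(-2j * np.pi * freq / sr)   # z^-1 on the unit circle
    return abs((b[0] + b[1] * z + b[2] * z * z) /
               (a[0] + a[1] * z + a[2] * z * z))
```

A -5 dB shelf at 10 kHz leaves the low band at unity gain and settles at 10^(-5/20) ≈ 0.56 near Nyquist; running the coefficients through any biquad filter (per channel) reproduces the Phase E-1 treatment in spirit.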
Phase E-2 — Build a detector that works without T
The practical issue: users don't usually have T.
"Can we build a detector that judges shelf signature from A alone?"
We first tried cross-song baseline (compare against an average of other songs) — that failed. Inter-song variance (5.8 dB) was bigger than the A→T diff we wanted to detect (1.4-4 dB), so the SNR collapsed.
→ Pivot: within-song signature (read processing traces from A alone).
- v0.1 rejected → v0.2 pivot → v0.3 prototype → v0.4 completed (validated on 32-song batch)
- v0.4.1 hotfix (added cv_ratio < 0.75 gate to kill false positives)
Phase F / G — A pause to reality-check
With v0.4 ready, we ran a synthetic stress test (Phase F) and a signature separation analysis (Phase G).
- Phase F: slope alone yields an 84% false-positive rate.
- Phase G: T itself contains shelf-like signatures. Apparently a property of Suno's AI audio generation (vocoder / diffusion HF reconstruction).
- The hypothesis "Suno mastering boosts HF" was completely overturned. The reality is the opposite: A is the result of flattening / compressing / smoothing T's HF shape.
→ Promoted cv_ratio to the primary detection metric.
Phase H — Which reverse strategy actually works?
We compared 5 reverse strategies on 5 T/A pairs in parallel.
| Strategy | avg Δ vs A | wins |
|---|---|---|
| shelf_cut | -0.0017 | 3/5 |
| shelf+dyn | +0.0061 | 2/5 |
| smoothing_release | +0.2329 | 0/5 |
| dynamics_expansion | +0.1566 | 0/5 |
| smooth+dyn | +0.4190 | 0/5 |
shelf_cut isn't the "true inverse transform" but works as a low-dimensional approximation that closes perceptual distance with minimal side effects. Listening tests confirmed it was the most natural.
Phase I — Mapping the full degradation path
Tracked signature changes across all 4 stages: T → A → B (mp3) → C (mp4).
- The main culprit of degradation is concentrated in T→A (5/5 songs)
- A→B (Suno's mp3 export) is nearly transparent
- A→B→C is secondary degradation
Side discovery: 1 of 5 songs had no signature in T → signature emerges in A. This is a new type — the "signature creation type". shelf_cut doesn't work on these.
Phase J / K — persistence discovery, handling the "creation type"
Found that persistence_10k_plus (the 10-16 kHz slope) identifies "creation type" reliably:
- Standard type: feature at 8-10 kHz + continuity above 10 k
- Creation type: feature at 8-10 kHz only + sudden death above 10 k (steep dropoff worse than -30 dB/oct)
Across a 37-song corpus, found 7 creation candidates. Listening: applying reverse to creation candidates produced a "slightly thin" sensation. → Added "skip when persistence < -30" as v3.6.2.
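As a sketch, the persistence check and the creation-type skip can look like the following. The slope estimator here is an assumption (a linear fit of dB magnitude against log2 frequency over 10-16 kHz); the research code may compute it differently.

```python
import numpy as np

def persistence_10k_plus(freqs_hz: np.ndarray, mag_db: np.ndarray) -> float:
    """Assumed estimator: dB-per-octave slope of a linear fit of magnitude
    against log2(frequency) over the 10-16 kHz band."""
    band = (freqs_hz >= 10000) & (freqs_hz <= 16000)
    slope, _ = np.polyfit(np.log2(freqs_hz[band]), mag_db[band], 1)
    return float(slope)

def classify_signature(persistence_db_oct: float) -> str:
    # Phase J/K heuristic: steep 10-16 kHz dropoff = creation type (reverse is skipped)
    return "creation" if persistence_db_oct < -30 else "standard"
```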
Phase L — Confronting the biggest untested assumption
Every Phase up to here rode on one implicit assumption:
"smaller `distance(reverse(A), T)` = ultimately good for the user"
But the actual real-world flow looks like this:
A → reverse(A) → MASTERING (EQ/comp/limiter/LUFS) → distribution
↑↑↑ a non-linear transform that may amplify or kill the diff here
5 songs × 3 master chains = 15 trials:
- Digital Skyline: pre +0.061 → post +0.075 to +0.097 (amp 1.23-1.59)
- Zureta Ondo: pre +0.046 → post +0.038 to +0.074 (amp 0.81-1.60)
- Average amp for the improving group = +1.4× (the tendency is amplification, not attenuation)
In other words: a micro-scale improvement of ~0.06 survives the limiter/compressor and comes out amplified by up to 1.6× at macro scale.
Up to here
| Phase | one-line summary |
|---|---|
| E-0 | A is a version of T with HF EQ + light dynamics added |
| E-1 | shelf cut can reverse it |
| E-2 | Built within-song detector for A-alone judgment (cv_ratio gate) |
| F / G | T also has signatures; A is actually flattening T |
| H | Of 5 strategies, shelf_cut still wins |
| I | A→B/C are derivative; detect from A alone. Discovered creation type. |
| J / K | persistence skips creation type; v3.6.2 deployed |
| L | After master, improvement is amplified 1.4-1.6× (practical validity confirmed) |
All of this was internal experiments — audio + comparison only. We hadn't checked "what happens after going through YouTube distribution". That's what Phase N / O address.
3. Phase N: Verifying with Real YouTube Uploads
What we did
For 2 songs (Zureta Ondo / Digital Skyline), we prepared 3 versions each:
- T_mastered — Stems mix run through the same master chain
- A_mastered — Suno final WAV through the same master
- revA_mastered — A with the reverse pipeline applied, then mastered
Wrapped these as MP4 (black background + AAC 384k) and uploaded to YouTube. Immediately after publishing, downloaded them back as webm (= Opus 128k) via yt-dlp, and compared with the originals.
Identification — which URL is which?
Used spectral correlation to map each YT URL to T / A / revA:
| Song | YT URL | Verdict | corr |
|---|---|---|---|
| Zureta Ondo | 7QS_d_4yy_Q | T | +0.988 |
| Zureta Ondo | chGaLR1qPAo | A | +0.987 |
| Zureta Ondo | eEqnmKIAuok | revA | +0.989 |
| Digital Skyline | HkoLQEHSwxI | T | +0.997 |
| Digital Skyline | 8sQBBFTPx60 | A | +0.997 |
| Digital Skyline | WQ8Bc9QdPpc | revA | +0.996 |
All correlations > 0.96 — solid identification.
Results
| Song | pre improvement (post-master) | post-YT improvement | amp ratio |
|---|---|---|---|
| Zureta Ondo | +0.138 | +0.166 | +1.21× |
| Digital Skyline | +0.086 | +0.148 | +1.72× |
| Average | +0.112 | +0.157 | +1.46× |
(improvement = distance(A, T) - distance(revA, T) — i.e. how much closer revA is to T)
→ The reverse difference holds up — actually expands — after YouTube. This matches Phase L's prediction (1.4-1.6×) almost exactly. The "real-world distribution kills the effect" scenario is rejected.
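As a sanity check, the amp column reproduces (within rounding) from the table's own numbers. Note that the headline +1.46× is the mean of the per-song ratios, not the ratio of the column averages (0.157 / 0.112 ≈ 1.40):

```python
def improvement(d_A_T: float, d_revA_T: float) -> float:
    # positive = revA sits closer to T than A does
    return d_A_T - d_revA_T

# Phase N post-master / post-YT improvement values from the table above
pre  = {"Zureta Ondo": 0.138, "Digital Skyline": 0.086}
post = {"Zureta Ondo": 0.166, "Digital Skyline": 0.148}
amp  = {song: post[song] / pre[song] for song in pre}   # per-song amp ratios
mean_amp = sum(amp.values()) / len(amp)                  # ≈ 1.46
```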
Side observation — who got hurt the worst?
Looking at degradation per signal (distance from orig after YT roundtrip), there's a curious pattern:
T: yt vs orig distance = 1.34 ← largest
A: yt vs orig distance = 1.14
revA: yt vs orig distance = 1.06 ← smallest
"T is the most damaged by YouTube." A natural hypothesis arises:
"T has rich HF → AAC/Opus is weak in HF → so T degrades the most. Which means reverse is essentially a pre-emphasis (pre-equalization) that cuts HF in advance. If that's true, reverse-processing T should also improve codec resistance."
That's what Phase O tests.
4. Phase O: What Happens If You Reverse-Process T?
Hypothesis
"Reverse-processing the correct T should reduce post-Opus distortion" (the pre-emphasis hypothesis).
This is a classic idea — FM broadcasting's pre-emphasis/de-emphasis, CD pre-emphasis, tape bias adjustment are all variations on "compensate in advance for what the channel will do".
Experiment
- Apply the same shelf params (-3 to -4 dB @10kHz) detected from A to T_mastered.wav → revT_mastered.wav
- Roundtrip both T_mastered and revT_mastered through Opus 128k
- Compare both against the original T_mastered
Results
| Song | d(T_opus, T) | d(revT_opus, T) | pre-emph gain |
|---|---|---|---|
| Zureta Ondo | 1.335 | 1.433 | -0.098 |
| Digital Skyline | 1.340 | 1.514 | -0.174 |
| Average | 1.337 | 1.473 | -0.136 |
Hypothesis fully rejected. revT consistently does worse than T after Opus. It's now confirmed that reverse processing must NOT be applied to T.
The intermediate values are interesting too. Even before going through Opus, revT is already 0.48-0.63 distant from T. In other words, shelf cut alone purely adds distance by removing "correct HF". Opus, if anything, dutifully tries to preserve the HF.
5. The Big Reframing — 3 New Principles
Combining Phase N and O, the fundamental framing of the research changes. This is the heart of this article.
Principle 1: Reverse isn't "HF cut" — it's "wrong-HF removal"
| Signal | Meaning of HF | Result of applying reverse |
|---|---|---|
| T | Correct information | Worsens (moves away from T) |
| A | Suno-derived pathological structure | Improves (moves toward T) |
| revA | A with pathological structure reduced | Target state |
Reverse is neither a plain low-pass nor a plain HF suppressor. Phase O demonstrated this from the opposite direction: applied to correct HF (T), the same processing only adds distance. The pipeline is already latching onto a cue that separates "correct HF" from "pathological HF".
Principle 2: distance ≠ degradation
"T has the largest YouTube distance, 1.34" doesn't mean T was damaged. It means T has more HF information, so the diff measurement is more sensitive.
| Audio type | post-codec numerical distance | Listening |
|---|---|---|
| High quality (micro-phase / micro-HF / air / temporal continuity) | Large | Natural |
| Information-poor / already broken | Small | Not necessarily "good" |
This happens often in music listening. High-quality sources show large numerical diffs after codec, but listening-wise they sound natural. Conversely, sources that were broken to begin with show small post-codec diffs and small numbers — but that doesn't mean they sound good.
→ Reading numerical distance as "amount of degradation" is wrong; read it as "information density difference".
Principle 3: revA's strength comes from trajectory stabilization
| Old interpretation | New interpretation |
|---|---|
| Has YouTube/Opus resistance | Was structurally closer to T + the distribution chain didn't kill the diff |
| Pre-emphasis hypothesis (rejected in Phase O) | Trajectory stabilization (against codec chain) |
Mechanism: A's pathological HF / temporal irritation / unnatural entropy flow gets further amplified by Opus. revA had pre-reduced these — so the gap actually widens after distribution. So reverse isn't EQ correction — it's closer to trajectory stabilization against the codec chain.
This leads to one big conclusion:
"Premium feel" = NOT adding information, but REDUCING pathological structure.
Not "boost something". "Subtract the pathological structures that turn into noise". That's what Phase N / O showed.
6. v3.20: HF Pathology Analyzer (shadow mode)
Given these conclusions, the next step was clear:
Don't crank up the reverse strength — make the firing condition smarter. "Which songs / how much" should depend on the HF "pathology level".
But we don't apply it yet
This is critical. We can build proxies for "pathological HF", but it's not yet verified that they actually point at pathological things. The scary part is misclassifying high-quality, beautiful HF as pathological:
- Live recordings (jazz brush cymbal / orchestral hall ambience)
- Intentional tape hiss
- Vinyl texture
- Shoegaze / dream pop "wall" sound
- Long-tailed transients
These songs have "unstable but beautiful HF". If you carelessly fire the detector here, you strip the premium feel.
So v3.20 deploys in shadow mode.
| Phase | apply gate | strength scaling |
|---|---|---|
| Phase 1 (current v3.20) | Existing v3.6.x untouched | Not applied (fixed) |
| Phase 2 (future, conditional) | Existing maintained | For gate-passing songs only, scale 0.35-1.15× by pathology |
In other words: measure + display + telemetry only. The reverse pipeline itself is untouched.
The 3 metrics
Implementation: js/hf-pathology-analyzer.js (162 lines). Runs in Web Audio.
(a) hf_instability — temporal HF variation
hf_instability = std(frameEnergy) / mean(frameEnergy)
Is HF jumping around in time? Higher = more pathology candidate. But jazz brush cymbal also raises this, so it's not usable alone.
(b) hf_continuity — adjacent-frame structure similarity
hf_continuity = mean over t of cos_sim(hf_mag[t-1], hf_mag[t])
Is the spectral structure continuous over time? Lower = more pathological (structure splits frame-by-frame). This is the leading candidate.
(c) side_hf_randomness — Side channel chaos
mid = (L+R)/2, side = (L-R)/2
ratio[t] = ||side_hf[t]|| / ||mid_hf[t]||
side_hf_randomness = std(ratio) / mean(ratio)
Is the Side channel HF jumping randomly relative to Mid? A typical signature of artificial stereo spread / pseudo-air.
Composite
pathology_score = 0.45 * instability_norm
+ 0.35 * (1 - continuity_norm)
+ 0.20 * side_randomness_norm
Continuity is the leading hypothesis, instability is next, side is supporting. Weights are tentative — to be recalibrated from telemetry.
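A Python sketch of the three metrics follows — a port of the idea only, not of js/hf-pathology-analyzer.js itself. The 2048-sample non-overlapping frames and the 8 kHz HF cutoff are assumptions.

```python
import numpy as np

def hf_pathology(stereo: np.ndarray, sr: int = 44100,
                 frame: int = 2048, hf_cutoff_hz: float = 8000.0) -> dict:
    """stereo: shape (n_samples, 2), float. Returns the 3 raw metrics."""
    Lch, Rch = stereo[:, 0], stereo[:, 1]
    mid, side = (Lch + Rch) / 2, (Lch - Rch) / 2

    def hf_frames(x):
        # per-frame Hann-windowed HF magnitude spectra
        win, out = np.hanning(frame), []
        freqs = np.fft.rfftfreq(frame, 1 / sr)
        for i in range(len(x) // frame):
            spec = np.abs(np.fft.rfft(x[i*frame:(i+1)*frame] * win))
            out.append(spec[freqs >= hf_cutoff_hz])
        return np.array(out)                       # (frames, hf_bins)

    hf = hf_frames(mid)

    # (a) hf_instability = std / mean of per-frame HF energy
    energy = (hf ** 2).sum(axis=1)
    instability = energy.std() / (energy.mean() + 1e-12)

    # (b) hf_continuity = mean cosine similarity of adjacent HF spectra
    num = (hf[:-1] * hf[1:]).sum(axis=1)
    den = np.linalg.norm(hf[:-1], axis=1) * np.linalg.norm(hf[1:], axis=1) + 1e-12
    continuity = (num / den).mean()

    # (c) side_hf_randomness = std / mean of the Side/Mid HF level ratio
    side_hf = hf_frames(side)
    ratio = np.linalg.norm(side_hf, axis=1) / (np.linalg.norm(hf, axis=1) + 1e-12)
    side_rand = ratio.std() / (ratio.mean() + 1e-12)

    return {"hf_instability": float(instability),
            "hf_continuity": float(continuity),
            "side_hf_randomness": float(side_rand)}
```

The `*_norm` terms in the composite require corpus-calibrated normalization, which isn't public, so the sketch stops at the raw metrics.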
For Phase 2 (computed but unused now)
strength_gain_proposed = clamp(0.55 + pathology_score * 0.75, 0.35, 1.15)
Critically, the lower bound is NOT 0. "Leave a little" is safer than "fully off". In future Phase 2, this multiplier will scale shelf depth for songs that already passed the existing gate. But in the current version, this is computed only — not applied.
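The mapping is a plain clamped affine function; this one-liner reproduces, for pathology_score = 0.42, the ×0.87 shown in the Section 7 mock display:

```python
def strength_gain_proposed(pathology_score: float) -> float:
    # clamp(0.55 + score * 0.75, 0.35, 1.15) — the floor is deliberately non-zero
    return max(0.35, min(1.15, 0.55 + pathology_score * 0.75))
```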
7. UI display
Shadow mode is shown explicitly in the UI:
🔬 HF pathology (shadow mode) — measurement only, does not affect apply
pathology_score: 0.42 (42%) → mixed
[strength_gain (proposed): ×0.87, not applied]
▼ Breakdown (3 metrics)
hf_instability 0.512 (norm 0.43)
hf_continuity 0.871 (1-norm 0.41)
side_hf_randomness 0.328 (norm 0.27)
composite 0.45·inst + 0.35·(1-cont) + 0.20·side = 0.420
shadow mode note This score is currently NOT used for apply/strength.
Color coding: 0-0.4 green (stable) / 0.4-0.7 yellow (mixed) / 0.7- orange-red (pathological-like).
8. What data to collect — "missed cases" first
Now telemetry will start collecting data. But just looking at "improved" data is meaningless. The key to growing a 4th-generation detector is "missed cases".
Most important: high pathology + no improvement / worsening
Especially for songs like:
- Live recordings / jazz brush cymbal / intentional tape hiss
- Orchestral hall ambience
- Vinyl texture
- Shoegaze / dream pop
If pathology_score is high but delta_T ≥ 0 for these, that's proof that "instability ≠ pathology". A decisive hint to redesign the normalization.
Hypothesis-supporting data
Conversely, if Suno metallic sheen / unstable side HF / pseudo-air / fake shimmer / transient spray show high pathology + improvement, the 4th-gen detector direction is on target.
4 correlation axes
| Axis | Interpretation |
|---|---|
| pathology_score × delta_T | Detector validity (the main hypothesis) |
| hf_continuity × delta_T | Continuity-only hypothesis |
| side_hf_randomness × delta_T | Stereo instability hypothesis |
| pathology_score × YouTube amplification | Codec amplification hypothesis (Phase N tie-in) |
If the last one comes out positive, "codec chains amplify pathological structure" gets stronger.
9. Why "play defense" by design?
You might think this is a waste — we built a new detector and don't even apply it. The reasons are deliberate.
1. The existing gate is the crystallization of empirical rules
The 3-stage gate (cv_ratio < 0.75 + persistence ≥ -30 + confidence ≥ 0.80) is the result of slowly killing false positives one-by-one through the long Phase E-K trial-and-error. "The courage not to touch" is built into it.
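Spelled out in code, the gate is a plain conjunction of the three thresholds — a sketch; the production code carries more bookkeeping around these checks:

```python
def passes_gate(cv_ratio: float, persistence_db_oct: float, confidence: float) -> bool:
    """v3.6.x 3-stage apply gate as stated in this article."""
    return (cv_ratio < 0.75                 # within-song shelf signature present
            and persistence_db_oct >= -30   # not the "creation type" (v3.6.2 skip)
            and confidence >= 0.80)         # detector confidence floor
```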
2. Pathology scoring is still at the theory stage
Of the 3 metrics, continuity is a strong candidate alone, but real-corpus behavior is unknown. In particular, no proven way yet to separate "high-quality instability" from "pathological instability".
3. Separation is safe
Gate = safety (track-record-based) / Pathology = research (hypothesis-based) — we keep them separated. Build new things without breaking old things. When the new thing matures, only then connect it.
"Don't add automatic processing because of an improvement you don't understand." This is a project principle that's been carried since v3.6.x.
10. Summary
| Phase | Result |
|---|---|
| Phase N | Reverse improvement preserved through YouTube + amplified 1.46× (real, 2 songs) |
| Phase O | "Reverse T → codec-resistance" hypothesis fully rejected |
| New principle | Reverse is NOT HF cut — it's "wrong-HF removal" |
| New principle | distance ≠ degradation. Read it as information density difference. |
| New principle | Premium feel = NOT adding info, but REDUCING pathological structure |
| v3.20 | HF Pathology Analyzer implemented (shadow mode, no apply impact) |
The reverse pipeline is not just a low-pass — it's pinpointing Suno-specific pathological HF structure. That was the most important thing this round confirmed.
The next step is "make the firing condition smarter". Not crank up strength, but vary "which songs / how much" by HF pathology level. But cautiously. To avoid misclassifying high-quality HF, we collect data in shadow mode first.
11. Try it out
🔬 Open the reverse safe-d2 mode
App: https://webmastering.pages.dev/safe-d2.html
Browser-only (no Ubuntu server / no GPU). If you supply a Stems ZIP, pre-delta validation kicks in. Turning telemetry ON contributes verification data for the 4th-generation detector.
Top page (all torimasu modes): https://webmastering.pages.dev/
No improvement guarantee
This mode is NOT a "clearly sounds better" tool. "Indistinguishable in short A/B" + "slightly less fatiguing over long listening" — that's the success criterion. The value is in "not breaking" and "low accident rate".
(If you want flashy / louder, use torimasu's normal mastering mode instead.)
Related references (technical detail)
- Formulas / algorithms: `docs/PHASE_N_O_v3.20_REFERENCE.md` (in repo)
- Detector lineage (song classification): `docs/DETECT_REFERENCE.md` (in repo)
- Research scripts: `research/phase_n_youtube_roundtrip.py`, `research/phase_o_revT_pre_emphasis.py`
2026-05-11 / torimasu v3.20 deploy notes