Reverse Doesn't Cut High Frequencies — It Cuts the Wrong High Frequencies
AI Music (Suno) Sound Quality Research — Notes from Phase N / O / v3.20
0. Intro — How Does Suno Audio Feel to You?
Suno is an AI service that generates music from text. You give it lyrics and a style, and out comes a vocal-pop track in tens of seconds. The quality is striking.
But if you spend a whole day listening to Suno output, many people report the same feeling:
"It's not bad — but I get a bit fatigued listening for a long time."
It's loud. It cuts through. The choruses build up. Yet something is subtly poking at you.
We've been researching whether the source of this "subtle poking" can be identified at the audio data level. The fruits of that research are baked into a browser-only mastering tool we call "torimasu (とりマス)".
Public URL: https://webmastering.pages.dev/ (no Ubuntu server, no GPU — runs entirely in the browser.)
This article is about torimasu's "reverse safe-d2" mode
torimasu has multiple modes — normal loudness-mastering, song classification, diagnosis. Each has its own purpose.
This article focuses on one experimental mode that specifically targets AI-generated audio (e.g. Suno). The URL is /safe-d2.html, accessible from the top page as "reverse safe-d2".
Public URL: https://webmastering.pages.dev/safe-d2.html
What does the reverse safe-d2 mode do?
In one sentence:
It detects "pathological processing traces" left in Suno output from the audio alone, and reverses them.
The design has two pillars:
- Layer 1 (Reverse) — Detects highshelf-type "processing traces" left in Suno output, and reverses them.
- Layer 2 (SafeD2) — Then smooths out micro-instabilities that cause listening fatigue.
This is not flashy mastering. The design philosophy is "don't add anything", "don't break anything", "lower the accident rate". (In contrast to traditional mastering = louder / flashier / more punch — this is a different beast.)
Where the research stands so far
This mode started at Phase E and progressed through Phase L step by step. Quick summary:
- Suno final WAV (= A) contains a highshelf-style "pathological" HF structure.
- Applying reverse processing to A makes it approach the Stems re-mixed back together (= T).
- After mastering (compression / limiter / loudness normalization), the improvement is amplified 1.4-1.6×.
- The detector runs a 3-stage gate: cv_ratio < 0.75 + persistence ≥ -30 dB/oct + confidence ≥ 0.80 (v3.6.x public).
Full chronology in Section 2.
But one big unknown remained
"Does this actually still work after going through YouTube/Spotify distribution?"
Even if the distance is small, it might evaporate the moment it hits AAC or Opus encoding for distribution. If that's true, all the research up to here would be "practically meaningless".
So we set up Phases N and O and checked it on real YouTube round-trips. While we were at it, we also tested a plausible follow-on hypothesis: "Maybe reverse-processing T itself makes it codec-resistant?"
- Phase N: The reverse improvement was preserved through YouTube. It actually expanded by +1.46×.
- Phase O: The "if reverse is good for A, it must be good for T too" hypothesis was completely rejected.
1. Vocabulary
Just three symbols to remember before we dive in:
| Symbol | Meaning |
|---|---|
| T | Stems re-mixed together — pre-master signal, our "reference" for verification. Made from Suno's "Get Stems" feature. |
| A | Suno final WAV. What you normally download. |
| revA | A with the reverse pipeline (highshelf cut) applied. |
Roughly:
- T = "what it should sound like" — used as the reference answer
- A = the Suno audio users actually listen to
- revA = what torimasu produces
One distance metric:
`distance(X, T)` — smaller = X is closer to T = closer to "the original sound"
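The concrete composite distance is defined in the reference docs listed at the end. As a minimal stand-in for intuition — not the research code — a log-magnitude STFT distance behaves the same way directionally: zero for identical signals, larger the further apart two spectra sit.

```python
import numpy as np

def stft_mag(x: np.ndarray, frame: int = 2048, hop: int = 512) -> np.ndarray:
    # Hann-windowed magnitude STFT, no external deps
    win = np.hanning(frame)
    n = 1 + (len(x) - frame) // hop
    return np.abs(np.array([np.fft.rfft(x[i*hop:i*hop + frame] * win)
                            for i in range(n)]))

def distance(x: np.ndarray, t: np.ndarray) -> float:
    """Toy stand-in for the article's composite distance(X, T):
    mean absolute log-magnitude spectral difference. Smaller = closer to T."""
    n = min(len(x), len(t))
    X, T = stft_mag(x[:n]), stft_mag(t[:n])
    return float(np.mean(np.abs(np.log10(X + 1e-9) - np.log10(T + 1e-9))))
```

Any metric with these two properties supports the arguments below; the exact weighting only changes the absolute numbers.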
2. The Path So Far — Phase E through L
Before getting to Phase N / O, let's quickly retrace the year-long path that brought us here. (Skipping internal detail — just "what did each Phase reveal?")
Phase E-0 — Why is A different from T?
The first question was simple:
"What's the actual difference between A (Suno final WAV) and T (Stems mix)?"
We compared them via STFT and ran a "per-axis sweep" (MP3 axis / stereo axis / limiter axis / EQ axis / mastering axis), checking which adjustment closes the gap. The result was very clear:
- MP3 axis: every bitrate made it worse → not the path
- Stereo narrowing: worse → not the path
- Limiter: no effect → not the path
- HF tilt +3 to +5 dB at 8 kHz: max improvement ★
- HF smoothing α=0.1 helps (some songs)
In other words, A was a version of T with HF EQ + light dynamics processing applied. Not "full mastering" with loudnorm.
Phase E-1 — Can we reverse it?
Now that we'd seen the degradation path, the next question was the inverse direction:
"Can shelf cut + light expansion bring A back to T?"
Result: yes.
- Semishigure: `highshelf=-5dB @10k` alone → composite distance Δ -0.144
- Yume wo Susuru: `highshelf=-5dB` + `compand expand_medium` → Δ -0.125
Mathematically clean: T→A's +5dB (Δ -0.13) and A→T's -5dB (Δ -0.14) are symmetric. This is when "the Reverse Mastering hypothesis" became viable.
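For readers who want to try the shelf themselves: a standard RBJ Audio-EQ-Cookbook high-shelf biquad with gain_db=-5 at 10 kHz approximates the Phase E-1 reverse filter. This is a sketch under that assumption — the actual pipeline's filter implementation may differ.

```python
import numpy as np

def highshelf_coeffs(gain_db: float, f0: float, sr: float):
    """RBJ Audio-EQ-Cookbook high-shelf biquad (shelf slope S = 1)."""
    A = 10.0 ** (gain_db / 40.0)
    w0 = 2 * np.pi * f0 / sr
    alpha = np.sin(w0) / 2 * np.sqrt(2.0)   # S = 1
    cosw, sqA = np.cos(w0), np.sqrt(A)
    b0 = A * ((A + 1) + (A - 1) * cosw + 2 * sqA * alpha)
    b1 = -2 * A * ((A - 1) + (A + 1) * cosw)
    b2 = A * ((A + 1) + (A - 1) * cosw - 2 * sqA * alpha)
    a0 = (A + 1) - (A - 1) * cosw + 2 * sqA * alpha
    a1 = 2 * ((A - 1) - (A + 1) * cosw)
    a2 = (A + 1) - (A - 1) * cosw - 2 * sqA * alpha
    return np.array([b0, b1, b2]) / a0, np.array([1.0, a1 / a0, a2 / a0])

def mag_at(b, a, freq: float, sr: float) -> float:
    # magnitude response of the biquad at one frequency
    z = np.exp(-2j * np.pi * freq / sr)   # z^-1 on the unit circle
    return abs((b[0] + b[1] * z + b[2] * z * z) /
               (a[0] + a[1] * z + a[2] * z * z))
```

A -5 dB shelf at 10 kHz leaves the low band at unity gain and settles at 10^(-5/20) ≈ 0.56 near Nyquist; running the coefficients through any biquad filter (per channel) reproduces the Phase E-1 treatment in spirit.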
Phase E-2 — Build a detector that works without T
The practical issue: users don't usually have T.
"Can we build a detector that judges shelf signature from A alone?"
We first tried cross-song baseline (compare against an average of other songs) — that failed. Inter-song variance (5.8 dB) was bigger than the A→T diff we wanted to detect (1.4-4 dB), so the SNR collapsed.
→ Pivot: within-song signature (read processing traces from A alone).
- v0.1 rejected → v0.2 pivot → v0.3 prototype → v0.4 completed (validated on 32-song batch)
- v0.4.1 hotfix (added cv_ratio < 0.75 gate to kill false positives)
Phase F / G — A pause to reality-check
With v0.4 ready, we ran a synthetic stress test (Phase F) and a signature separation analysis (Phase G).
- Phase F: slope alone yields an 84% false-positive rate.
- Phase G: T itself contains shelf-like signatures. Apparently a property of Suno's AI audio generation (vocoder / diffusion HF reconstruction).
- The hypothesis "Suno mastering boosts HF" was completely overturned. The reality is the opposite: A is the result of flattening / compressing / smoothing T's HF shape.
→ Promoted cv_ratio to the primary detection metric.
Phase H — Which reverse strategy actually works?
We compared 5 reverse strategies on 5 T/A pairs in parallel.
| Strategy | avg Δ vs A | wins |
|---|---|---|
| shelf_cut | -0.0017 | 3/5 |
| shelf+dyn | +0.0061 | 2/5 |
| smoothing_release | +0.2329 | 0/5 |
| dynamics_expansion | +0.1566 | 0/5 |
| smooth+dyn | +0.4190 | 0/5 |
shelf_cut isn't the "true inverse transform" but works as a low-dimensional approximation that closes perceptual distance with minimal side effects. Listening tests confirmed it was the most natural.
Phase I — Mapping the full degradation path
Tracked signature changes across all 4 stages: T → A → B (mp3) → C (mp4).
- The main culprit of degradation is concentrated in T→A (5/5 songs)
- A→B (Suno's mp3 export) is nearly transparent
- A→B→C is secondary degradation
Side discovery: 1 of 5 songs had no signature in T → signature emerges in A. This is a new type — the "signature creation type". shelf_cut doesn't work on these.
Phase J / K — persistence discovery, handling the "creation type"
Found that persistence_10k_plus (the 10-16 kHz slope) identifies "creation type" reliably:
- Standard type: feature at 8-10 kHz + continuity above 10 k
- Creation type: feature at 8-10 kHz only + sudden death above 10 k (steep dropoff worse than -30 dB/oct)
Across a 37-song corpus, found 7 creation candidates. Listening: applying reverse to creation candidates produced a "slightly thin" sensation. → Added "skip when persistence < -30" as v3.6.2.
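As a sketch, the persistence check and the creation-type skip can look like the following. The slope estimator here is an assumption (a linear fit of dB magnitude against log2 frequency over 10-16 kHz); the research code may compute it differently.

```python
import numpy as np

def persistence_10k_plus(freqs_hz: np.ndarray, mag_db: np.ndarray) -> float:
    """Assumed estimator: dB-per-octave slope of a linear fit of magnitude
    against log2(frequency) over the 10-16 kHz band."""
    band = (freqs_hz >= 10000) & (freqs_hz <= 16000)
    slope, _ = np.polyfit(np.log2(freqs_hz[band]), mag_db[band], 1)
    return float(slope)

def classify_signature(persistence_db_oct: float) -> str:
    # Phase J/K heuristic: steep 10-16 kHz dropoff = creation type (reverse is skipped)
    return "creation" if persistence_db_oct < -30 else "standard"
```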
Phase L — Confronting the biggest untested assumption
Every Phase up to here rode on one implicit assumption:
"smaller `distance(reverse(A), T)` = ultimately good for the user"
But the actual real-world flow looks like this:
A → reverse(A) → MASTERING (EQ/comp/limiter/LUFS) → distribution
↑↑↑ a non-linear transform that may amplify or kill the diff here
5 songs × 3 master chains = 15 trials:
- Digital Skyline: pre +0.061 → post +0.075 to +0.097 (amp 1.23-1.59)
- Zureta Ondo: pre +0.046 → post +0.038 to +0.074 (amp 0.81-1.60)
- Average amp for the improving group = +1.4× (the tendency is amplification, not attenuation)
In other words: a micro-scale improvement of ~0.06 survives the limiter/compressor and comes out amplified by up to 1.6× at macro scale.
Up to here
| Phase | one-line summary |
|---|---|
| E-0 | A is a version of T with HF EQ + light dynamics added |
| E-1 | shelf cut can reverse it |
| E-2 | Built within-song detector for A-alone judgment (cv_ratio gate) |
| F / G | T also has signatures; A is actually flattening T |
| H | Of 5 strategies, shelf_cut still wins |
| I | A→B/C are derivative; detect from A alone. Discovered creation type. |
| J / K | persistence skips creation type; v3.6.2 deployed |
| L | After master, improvement is amplified 1.4-1.6× (practical validity confirmed) |
All of this was internal experiments — audio + comparison only. We hadn't checked "what happens after going through YouTube distribution". That's what Phase N / O address.
3. Phase N: Verifying with Real YouTube Uploads
What we did
For 2 songs (Zureta Ondo / Digital Skyline), we prepared 3 versions each:
- T_mastered — Stems mix run through the same master chain
- A_mastered — Suno final WAV through the same master
- revA_mastered — A with the reverse pipeline applied, then mastered
Wrapped these as MP4 (black background + AAC 384k) and uploaded to YouTube. Immediately after publishing, downloaded them back as webm (= Opus 128k) via yt-dlp, and compared with the originals.
Identification — which URL is which?
Used spectral correlation to map each YT URL to T / A / revA:
| Song | YT URL | Verdict | corr |
|---|---|---|---|
| Zureta Ondo | 7QS_d_4yy_Q | T | +0.988 |
| Zureta Ondo | chGaLR1qPAo | A | +0.987 |
| Zureta Ondo | eEqnmKIAuok | revA | +0.989 |
| Digital Skyline | HkoLQEHSwxI | T | +0.997 |
| Digital Skyline | 8sQBBFTPx60 | A | +0.997 |
| Digital Skyline | WQ8Bc9QdPpc | revA | +0.996 |
All correlations > 0.96 — solid identification.
Results
| Song | pre improvement (post-master) | post-YT improvement | amp ratio |
|---|---|---|---|
| Zureta Ondo | +0.138 | +0.166 | +1.21× |
| Digital Skyline | +0.086 | +0.148 | +1.72× |
| Average | +0.112 | +0.157 | +1.46× |
(improvement = distance(A, T) - distance(revA, T) — i.e. how much closer revA is to T)
→ The reverse difference holds up — actually expands — after YouTube. This matches Phase L's prediction (1.4-1.6×) almost exactly. The "real-world distribution kills the effect" scenario is rejected.
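As a sanity check, the amp column reproduces (within rounding) from the table's own numbers. Note that the headline +1.46× is the mean of the per-song ratios, not the ratio of the column averages (0.157 / 0.112 ≈ 1.40):

```python
def improvement(d_A_T: float, d_revA_T: float) -> float:
    # positive = revA sits closer to T than A does
    return d_A_T - d_revA_T

# Phase N post-master / post-YT improvement values from the table above
pre  = {"Zureta Ondo": 0.138, "Digital Skyline": 0.086}
post = {"Zureta Ondo": 0.166, "Digital Skyline": 0.148}
amp  = {song: post[song] / pre[song] for song in pre}   # per-song amp ratios
mean_amp = sum(amp.values()) / len(amp)                  # ≈ 1.46
```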
Side observation — who got hurt the worst?
Looking at degradation per signal (distance from orig after YT roundtrip), there's a curious pattern:
T: yt vs orig distance = 1.34 ← largest
A: yt vs orig distance = 1.14
revA: yt vs orig distance = 1.06 ← smallest
"T is the most damaged by YouTube." A natural hypothesis arises:
"T has rich HF → AAC/Opus is weak in HF → so T degrades the most. Which means reverse is essentially a pre-emphasis (pre-equalization) that cuts HF in advance. If that's true, reverse-processing T should also improve codec resistance."
That's what Phase O tests.
4. Phase O: What Happens If You Reverse-Process T?
Hypothesis
"Reverse-processing the correct T should reduce post-Opus distortion" (the pre-emphasis hypothesis).
This is a classic idea — FM broadcasting's pre-emphasis/de-emphasis, CD pre-emphasis, tape bias adjustment are all variations on "compensate in advance for what the channel will do".
Experiment
- Apply the same shelf params (-3 to -4 dB @10kHz) detected from A to T_mastered.wav → revT_mastered.wav
- Roundtrip both T_mastered and revT_mastered through Opus 128k
- Compare both against the original T_mastered
Results
| Song | d(T_opus, T) | d(revT_opus, T) | pre-emph gain |
|---|---|---|---|
| Zureta Ondo | 1.335 | 1.433 | -0.098 |
| Digital Skyline | 1.340 | 1.514 | -0.174 |
| Average | 1.337 | 1.473 | -0.136 |
Hypothesis fully rejected. revT consistently does worse than T after Opus. It's now confirmed that reverse processing must NOT be applied to T.
The intermediate values are interesting too. Even before going through Opus, revT is already 0.48-0.63 distant from T. In other words, shelf cut alone purely adds distance by removing "correct HF". Opus, if anything, dutifully tries to preserve the HF.
5. The Big Reframing — 3 New Principles
Combining Phase N and O, the fundamental framing of the research changes. This is the heart of this article.
Principle 1: Reverse isn't "HF cut" — it's "wrong-HF removal"
| Signal | Meaning of HF | Result of applying reverse |
|---|---|---|
| T | Correct information | Worsens (moves away from T) |
| A | Suno-derived pathological structure | Improves (moves toward T) |
| revA | A with pathological structure reduced | Target state |
Reverse is neither a plain low-pass nor a plain HF suppressor. Phase O demonstrated this from the opposite direction: applied to correct HF (T), the same processing only adds distance. The pipeline is already latching onto a cue that separates "correct HF" from "pathological HF".
Principle 2: distance ≠ degradation
"T has the largest YouTube distance, 1.34" doesn't mean T was damaged. It means T has more HF information, so the diff measurement is more sensitive.
| Audio type | post-codec numerical distance | Listening |
|---|---|---|
| High quality (micro-phase / micro-HF / air / temporal continuity) | Large | Natural |
| Information-poor / already broken | Small | Not necessarily "good" |
This happens often in music listening. High-quality sources show large numerical diffs after codec, but listening-wise they sound natural. Conversely, sources that were broken to begin with show small post-codec diffs and small numbers — but that doesn't mean they sound good.
→ Reading numerical distance as "amount of degradation" is wrong; read it as "information density difference".
Principle 3: revA's strength comes from trajectory stabilization
| Old interpretation | New interpretation |
|---|---|
| Has YouTube/Opus resistance | Was structurally closer to T + the distribution chain didn't kill the diff |
| Pre-emphasis hypothesis (rejected in Phase O) | Trajectory stabilization (against codec chain) |
Mechanism: A's pathological HF / temporal irritation / unnatural entropy flow gets further amplified by Opus. revA had pre-reduced these — so the gap actually widens after distribution. So reverse isn't EQ correction — it's closer to trajectory stabilization against the codec chain.
This leads to one big conclusion:
"Premium feel" = NOT adding information, but REDUCING pathological structure.
Not "boost something". "Subtract the pathological structures that turn into noise". That's what Phase N / O showed.
6. v3.20: HF Pathology Analyzer (shadow mode)
Given these conclusions, the next step was clear:
Don't crank up the reverse strength — make the firing condition smarter. "Which songs / how much" should depend on the HF "pathology level".
But we don't apply it yet
This is critical. We can build proxies for "pathological HF", but it's not yet verified that they actually point at pathological things. The scary part is misclassifying high-quality, beautiful HF as pathological:
- Live recordings (jazz brush cymbal / orchestral hall ambience)
- Intentional tape hiss
- Vinyl texture
- Shoegaze / dream pop "wall" sound
- Long-tailed transients
These songs have "unstable but beautiful HF". If you carelessly fire the detector here, you strip the premium feel.
So v3.20 deploys in shadow mode.
| Phase | apply gate | strength scaling |
|---|---|---|
| Phase 1 (current v3.20) | Existing v3.6.x untouched | Not applied (fixed) |
| Phase 2 (future, conditional) | Existing maintained | For gate-passing songs only, scale 0.35-1.15× by pathology |
In other words: measure + display + telemetry only. The reverse pipeline itself is untouched.
The 3 metrics
Implementation: js/hf-pathology-analyzer.js (162 lines). Runs in Web Audio.
(a) hf_instability — temporal HF variation
hf_instability = std(frameEnergy) / mean(frameEnergy)
Is HF jumping around in time? Higher = more pathology candidate. But jazz brush cymbal also raises this, so it's not usable alone.
(b) hf_continuity — adjacent-frame structure similarity
hf_continuity = mean over t of cos_sim(hf_mag[t-1], hf_mag[t])
Is the spectral structure continuous over time? Lower = more pathological (structure splits frame-by-frame). This is the leading candidate.
(c) side_hf_randomness — Side channel chaos
mid = (L+R)/2, side = (L-R)/2
ratio[t] = ||side_hf[t]|| / ||mid_hf[t]||
side_hf_randomness = std(ratio) / mean(ratio)
Is the Side channel HF jumping randomly relative to Mid? A typical signature of artificial stereo spread / pseudo-air.
Composite
pathology_score = 0.45 * instability_norm
+ 0.35 * (1 - continuity_norm)
+ 0.20 * side_randomness_norm
Continuity is the leading hypothesis, instability is next, side is supporting. Weights are tentative — to be recalibrated from telemetry.
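A Python sketch of the three metrics follows — a port of the idea only, not of js/hf-pathology-analyzer.js itself. The 2048-sample non-overlapping frames and the 8 kHz HF cutoff are assumptions.

```python
import numpy as np

def hf_pathology(stereo: np.ndarray, sr: int = 44100,
                 frame: int = 2048, hf_cutoff_hz: float = 8000.0) -> dict:
    """stereo: shape (n_samples, 2), float. Returns the 3 raw metrics."""
    Lch, Rch = stereo[:, 0], stereo[:, 1]
    mid, side = (Lch + Rch) / 2, (Lch - Rch) / 2

    def hf_frames(x):
        # per-frame Hann-windowed HF magnitude spectra
        win, out = np.hanning(frame), []
        freqs = np.fft.rfftfreq(frame, 1 / sr)
        for i in range(len(x) // frame):
            spec = np.abs(np.fft.rfft(x[i*frame:(i+1)*frame] * win))
            out.append(spec[freqs >= hf_cutoff_hz])
        return np.array(out)                       # (frames, hf_bins)

    hf = hf_frames(mid)

    # (a) hf_instability = std / mean of per-frame HF energy
    energy = (hf ** 2).sum(axis=1)
    instability = energy.std() / (energy.mean() + 1e-12)

    # (b) hf_continuity = mean cosine similarity of adjacent HF spectra
    num = (hf[:-1] * hf[1:]).sum(axis=1)
    den = np.linalg.norm(hf[:-1], axis=1) * np.linalg.norm(hf[1:], axis=1) + 1e-12
    continuity = (num / den).mean()

    # (c) side_hf_randomness = std / mean of the Side/Mid HF level ratio
    side_hf = hf_frames(side)
    ratio = np.linalg.norm(side_hf, axis=1) / (np.linalg.norm(hf, axis=1) + 1e-12)
    side_rand = ratio.std() / (ratio.mean() + 1e-12)

    return {"hf_instability": float(instability),
            "hf_continuity": float(continuity),
            "side_hf_randomness": float(side_rand)}
```

The `*_norm` terms in the composite require corpus-calibrated normalization, which isn't public, so the sketch stops at the raw metrics.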
For Phase 2 (computed but unused now)
strength_gain_proposed = clamp(0.55 + pathology_score * 0.75, 0.35, 1.15)
Critically, the lower bound is NOT 0. "Leave a little" is safer than "fully off". In future Phase 2, this multiplier will scale shelf depth for songs that already passed the existing gate. But in the current version, this is computed only — not applied.
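The mapping is a plain clamped affine function; this one-liner reproduces, for pathology_score = 0.42, the ×0.87 shown in the Section 7 mock display:

```python
def strength_gain_proposed(pathology_score: float) -> float:
    # clamp(0.55 + score * 0.75, 0.35, 1.15) — the floor is deliberately non-zero
    return max(0.35, min(1.15, 0.55 + pathology_score * 0.75))
```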
7. UI display
Shadow mode is shown explicitly in the UI:
🔬 HF pathology (shadow mode) — measurement only, does not affect apply
pathology_score: 0.42 (42%) → mixed
[strength_gain (proposed): ×0.87, not applied]
▼ Breakdown (3 metrics)
hf_instability 0.512 (norm 0.43)
hf_continuity 0.871 (1-norm 0.41)
side_hf_randomness 0.328 (norm 0.27)
composite 0.45·inst + 0.35·(1-cont) + 0.20·side = 0.420
shadow mode note This score is currently NOT used for apply/strength.
Color coding: 0-0.4 green (stable) / 0.4-0.7 yellow (mixed) / 0.7- orange-red (pathological-like).
8. What data to collect — "missed cases" first
Now telemetry will start collecting data. But just looking at "improved" data is meaningless. The key to growing a 4th-generation detector is "missed cases".
Most important: high pathology + no improvement / worsening
Especially for songs like:
- Live recordings / jazz brush cymbal / intentional tape hiss
- Orchestral hall ambience
- Vinyl texture
- Shoegaze / dream pop
If pathology_score is high but delta_T ≥ 0 for these, that's proof that "instability ≠ pathology". A decisive hint to redesign the normalization.
Hypothesis-supporting data
Conversely, if Suno metallic sheen / unstable side HF / pseudo-air / fake shimmer / transient spray show high pathology + improvement, the 4th-gen detector direction is on target.
4 correlation axes
| Axis | Interpretation |
|---|---|
| pathology_score × delta_T | Detector validity (the main hypothesis) |
| hf_continuity × delta_T | Continuity-only hypothesis |
| side_hf_randomness × delta_T | Stereo instability hypothesis |
| pathology_score × YouTube amplification | Codec amplification hypothesis (Phase N tie-in) |
If the last one comes out positive, "codec chains amplify pathological structure" gets stronger.
9. Why "play defense" by design?
You might think this is a waste — we built a new detector and don't even apply it. The reasons are deliberate.
1. The existing gate is the crystallization of empirical rules
The 3-stage gate (cv_ratio < 0.75 + persistence ≥ -30 + confidence ≥ 0.80) is the result of slowly killing false positives one-by-one through the long Phase E-K trial-and-error. "The courage not to touch" is built into it.
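Spelled out in code, the gate is a plain conjunction of the three thresholds — a sketch; the production code carries more bookkeeping around these checks:

```python
def passes_gate(cv_ratio: float, persistence_db_oct: float, confidence: float) -> bool:
    """v3.6.x 3-stage apply gate as stated in this article."""
    return (cv_ratio < 0.75                 # within-song shelf signature present
            and persistence_db_oct >= -30   # not the "creation type" (v3.6.2 skip)
            and confidence >= 0.80)         # detector confidence floor
```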
2. Pathology scoring is still at the theory stage
Of the 3 metrics, continuity is a strong candidate alone, but real-corpus behavior is unknown. In particular, no proven way yet to separate "high-quality instability" from "pathological instability".
3. Separation is safe
Gate = safety (track-record-based) / Pathology = research (hypothesis-based) — we keep them separated. Build new things without breaking old things. When the new thing matures, only then connect it.
"Don't add automatic processing because of an improvement you don't understand." This is a project principle that's been carried since v3.6.x.
10. Summary
| Phase | Result |
|---|---|
| Phase N | Reverse improvement preserved through YouTube + amplified 1.46× (real, 2 songs) |
| Phase O | "Reverse T → codec-resistance" hypothesis fully rejected |
| New principle | Reverse is NOT HF cut — it's "wrong-HF removal" |
| New principle | distance ≠ degradation. Read it as information density difference. |
| New principle | Premium feel = NOT adding info, but REDUCING pathological structure |
| v3.20 | HF Pathology Analyzer implemented (shadow mode, no apply impact) |
The reverse pipeline is not just a low-pass — it's pinpointing Suno-specific pathological HF structure. That was the most important thing this round confirmed.
The next step is "make the firing condition smarter". Not crank up strength, but vary "which songs / how much" by HF pathology level. But cautiously. To avoid misclassifying high-quality HF, we collect data in shadow mode first.
11. Try it out
🔬 Open the reverse safe-d2 mode
App: https://webmastering.pages.dev/safe-d2.html
Browser-only (no Ubuntu server / no GPU). If you supply a Stems ZIP, pre-delta validation kicks in. Turning telemetry ON contributes verification data for the 4th-generation detector.
Top page (all torimasu modes): https://webmastering.pages.dev/
No improvement guarantee
This mode is NOT a "clearly sounds better" tool. "Indistinguishable in short A/B" + "slightly less fatiguing over long listening" — that's the success criterion. The value is in "not breaking" and "low accident rate".
(If you want flashy / louder, use torimasu's normal mastering mode instead.)
Related references (technical detail)
- Formulas / algorithms: `docs/PHASE_N_O_v3.20_REFERENCE.md` (in repo)
- Detector lineage (song classification): `docs/DETECT_REFERENCE.md` (in repo)
- Research scripts: `research/phase_n_youtube_roundtrip.py`, `research/phase_o_revT_pre_emphasis.py`
2026-05-11 / torimasu v3.20 deploy notes