torimasu (とりマス) · reverse safe-d2 mode

Reverse Doesn't Cut High Frequencies — It Cuts the Wrong High Frequencies

AI Music (Suno) Sound Quality Research — Notes from Phase N / O / v3.20

Published: 2026-05-11

Audience: People interested in AI music sound quality / DSP engineers / Mastering folks / Suno users

App URL: https://webmastering.pages.dev/safe-d2.html

0. Intro — How Does Suno Audio Feel to You?

Suno is an AI service that generates music from text. You give it lyrics and a style, and out comes a vocal-pop track in tens of seconds. The quality is striking.

But after a whole day of listening to Suno output, many people report the same feeling:

"It's not bad — but I get a bit fatigued listening for a long time."

It's loud. It cuts through. The choruses build up. Yet something is subtly poking at you.

We've been researching whether the source of this "subtle poking" can be identified at the audio data level. The fruits of that research are baked into a browser-only mastering tool we call "torimasu (とりマス)".

Public URL: https://webmastering.pages.dev/ (no Ubuntu server, no GPU — runs entirely in the browser.)

This article is about torimasu's "reverse safe-d2" mode

torimasu has multiple modes — normal loudness-mastering, song classification, diagnosis. Each has its own purpose.

This article focuses on one experimental mode that specifically targets AI-generated audio (e.g. Suno). The URL is /safe-d2.html, accessible from the top page as "reverse safe-d2".

Public URL: https://webmastering.pages.dev/safe-d2.html

What does the reverse safe-d2 mode do?

In one sentence:

It detects "pathological processing traces" left in Suno output from the audio alone, and reverses them.

The design has two pillars: detection (read the processing traces from the audio alone) and reversal (undo them).

This is not flashy mastering. The design philosophy is "don't add anything", "don't break anything", "lower the accident rate". (In contrast to traditional mastering = louder / flashier / more punch — this is a different beast.)

Where the research stands so far

This mode started at Phase E and progressed step by step through Phase L. The full chronology is in Section 2.

But one big unknown remained

"Does this actually still work after going through YouTube/Spotify distribution?"

Even if the distance is small, it might evaporate the moment it hits AAC or Opus encoding for distribution. If that's true, all the research up to here would be "practically meaningless".

So we set up Phase N and O and checked it on real YouTube round-trips. While we were at it, we also tested a plausible-sounding hypothesis: "Maybe reverse-processing T itself makes it codec-resistant?"

TL;DR: the combination of these two fundamentally reframed the research. This article is the record of that.


1. Vocabulary

Just three symbols to remember before we dive in:

Symbol | Meaning
T      | Stems re-mixed together — the pre-master signal, our "reference" for verification. Made with Suno's "Get Stems" feature.
A      | Suno's final WAV. What you normally download.
revA   | A with the reverse pipeline (highshelf cut) applied.

Roughly: T is the reference "original sound", A is what Suno actually hands you, and revA is our attempt to bring A back toward T.

One distance metric to remember:

smaller distance(X, T) = X is closer to T = closer to "the original sound"
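
The article never pins down the exact distance function, so treat the following as a minimal sketch of the idea only: a mean log-magnitude spectral distance over STFT frames. The function name, frame layout, and the log-L2 choice are all assumptions here, not torimasu's actual metric.

// Hypothetical sketch of distance(X, T). magX and magT are same-shaped
// arrays of per-frame magnitude spectra (e.g. from an STFT of X and T).
function spectralDistance(magX, magT) {
  const eps = 1e-9;
  let sum = 0, count = 0;
  for (let t = 0; t < magX.length; t++) {
    for (let k = 0; k < magX[t].length; k++) {
      const d = Math.log(magX[t][k] + eps) - Math.log(magT[t][k] + eps);
      sum += d * d;
      count++;
    }
  }
  return Math.sqrt(sum / count); // smaller = closer to T
}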



2. The Path So Far — Phase E through L

Before getting to Phase N / O, let's quickly retrace the year-long path that brought us here. (Skipping internal detail — just "what did each Phase reveal?")

Phase E-0 — Why is A different from T?

The first question was simple:

"What's the actual difference between A (Suno final WAV) and T (Stems mix)?"

We compared them via STFT and ran a "per-axis sweep" (MP3 axis / stereo axis / limiter axis / EQ axis / mastering axis), checking which adjustment closes the gap. The result was very clear: A is a version of T with HF EQ + light dynamics processing applied. Not "full mastering" with loudnorm.

Phase E-1 — Can we reverse it?

Now that we'd seen the degradation path, the next question was the inverse direction:

"Can shelf cut + light expansion bring A back to T?"

Result: yes.

Mathematically clean, too: the +5 dB shelf of T→A (Δ -0.13) and the -5 dB shelf of A→T (Δ -0.14) are symmetric. This is when the "Reverse Mastering hypothesis" became viable.
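
For concreteness, here is a minimal Web Audio sketch of the reverse direction: a -5 dB high shelf, using the 10 kHz corner quoted later in Phase O. The rendering scaffold is assumed; only "highshelf cut" itself comes from the article, and torimasu's exact filter parameters may differ.

// Apply the reverse shelf (A → T direction) to a decoded AudioBuffer.
async function applyReverseShelf(buffer) {
  const ctx = new OfflineAudioContext(
    buffer.numberOfChannels, buffer.length, buffer.sampleRate);
  const src = ctx.createBufferSource();
  src.buffer = buffer;
  const shelf = ctx.createBiquadFilter();
  shelf.type = "highshelf";
  shelf.frequency.value = 10000; // shelf corner: 10 kHz
  shelf.gain.value = -5;         // dB; the A→T direction from this experiment
  src.connect(shelf).connect(ctx.destination);
  src.start();
  return ctx.startRendering();   // resolves to the processed AudioBuffer
}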

Phase E-2 — Build a detector that works without T

The practical issue: users don't usually have T.

"Can we build a detector that judges shelf signature from A alone?"

We first tried cross-song baseline (compare against an average of other songs) — that failed. Inter-song variance (5.8 dB) was bigger than the A→T diff we wanted to detect (1.4-4 dB), so the SNR collapsed.

→ Pivot: within-song signature (read processing traces from A alone).

Phase F / G — A pause to reality-check

With v0.4 ready, we ran a synthetic stress test (Phase F) and a signature separation analysis (Phase G). The surprise: T also carries signatures of its own; A is actually flattening T.

→ Promoted cv_ratio to the primary detection metric.

Phase H — Which reverse strategy actually works?

We compared 5 reverse strategies on 5 T/A pairs in parallel.

Strategy           | avg Δ vs A | wins
shelf_cut          | -0.0017    | 3/5
shelf+dyn          | +0.0061    | 2/5
smoothing_release  | +0.2329    | 0/5
dynamics_expansion | +0.1566    | 0/5
smooth+dyn         | +0.4190    | 0/5

shelf_cut isn't the "true inverse transform" but works as a low-dimensional approximation that closes perceptual distance with minimal side effects. Listening tests confirmed it was the most natural.

Phase I — Mapping the full degradation path

Tracked signature changes across all 4 stages: T → A → B (mp3) → C (mp4).

Side discovery: 1 of 5 songs had no signature in T → signature emerges in A. This is a new type — the "signature creation type". shelf_cut doesn't work on these.

Phase J / K — persistence discovery, handling the "creation type"

Found that persistence_10k_plus (the 10-16 kHz slope) reliably identifies the "creation type".

Across a 37-song corpus, we found 7 creation candidates. Listening: applying reverse to creation candidates produced a "slightly thin" sensation. → Added "skip when persistence < -30" as v3.6.2.

Phase L — Confronting the biggest untested assumption

Every Phase up to here rode on one implicit assumption:

"small distance(reverse(A), T) = ultimately good for the user"

But the actual real-world flow looks like this:

A → reverse(A) → MASTERING (EQ/comp/limiter/LUFS) → distribution
                 ↑↑↑ a non-linear transform that may amplify or kill the diff here

5 songs × 3 master chains = 15 trials. In other words: the micro-scale improvement of -0.06 gets amplified 1.6× to macro scale by the limiter/compressor.

Up to here

Phase | one-line summary
E-0   | A is a version of T with HF EQ + light dynamics added
E-1   | shelf cut can reverse it
E-2   | Built a within-song detector for A-alone judgment (cv_ratio gate)
F / G | T also has signatures; A is actually flattening T
H     | Of 5 strategies, shelf_cut still wins
I     | A→B/C are derivative; detect from A alone. Discovered the creation type.
J / K | persistence skips the creation type; v3.6.2 deployed
L     | After mastering, improvement is amplified 1.4-1.6× (practical validity confirmed)

All of this was internal experiments — audio + comparison only. We hadn't checked "what happens after going through YouTube distribution". That's what Phase N / O address.



3. Phase N: Verifying with Real YouTube Uploads

What we did

For 2 songs (Zureta Ondo / Digital Skyline), we prepared 3 versions each: T, A, and revA.

Wrapped these as MP4 (black background + AAC 384k) and uploaded them to YouTube. Immediately after publishing, we downloaded them back as webm (= Opus 128k) via yt-dlp and compared with the originals.

Identification — which URL is which?

Used spectral correlation to map each YT URL to T / A / revA:

Song            | YT URL      | Verdict | corr
Zureta Ondo     | 7QS_d_4yy_Q | T       | +0.988
                | chGaLR1qPAo | A       | +0.987
                | eEqnmKIAuok | revA    | +0.989
Digital Skyline | HkoLQEHSwxI | T       | +0.997
                | 8sQBBFTPx60 | A       | +0.997
                | WQ8Bc9QdPpc | revA    | +0.996

All correlations > 0.96 — solid identification.
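
The exact correlation features aren't specified in the article; a plausible minimal sketch is Pearson correlation between long-term average log-magnitude spectra, which is what the snippet below assumes (the feature choice is mine, not the article's).

// Pearson correlation between two average log-spectra (same-length arrays).
// Each YT download gets assigned to whichever of {T, A, revA} correlates highest.
function pearson(a, b) {
  const n = a.length;
  let ma = 0, mb = 0;
  for (let i = 0; i < n; i++) { ma += a[i]; mb += b[i]; }
  ma /= n; mb /= n;
  let cov = 0, va = 0, vb = 0;
  for (let i = 0; i < n; i++) {
    const da = a[i] - ma, db = b[i] - mb;
    cov += da * db; va += da * da; vb += db * db;
  }
  return cov / Math.sqrt(va * vb);
}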

Results

Song            | pre improvement (post-master) | post-YT improvement | amp ratio
Zureta Ondo     | +0.138                        | +0.166              | 1.21×
Digital Skyline | +0.086                        | +0.148              | 1.72×
Average         | +0.112                        | +0.157              | 1.46×

(improvement = distance(A, T) - distance(revA, T) — i.e. how much closer revA is to T)

The reverse difference holds up — actually expands — after YouTube. This matches Phase L's prediction (1.4-1.6×) almost exactly. The "real-world distribution kills the effect" scenario is rejected.

Side observation — who got hurt the worst?

Looking at degradation per signal (distance from orig after YT roundtrip), there's a curious pattern:

T:    yt vs orig distance = 1.34  ← largest
A:    yt vs orig distance = 1.14
revA: yt vs orig distance = 1.06  ← smallest

"T is the most damaged by YouTube." A natural hypothesis arises:

"T has rich HF → AAC/Opus is weak in HF → so T degrades the most. Which means reverse is essentially a pre-emphasis (pre-equalization) that cuts HF in advance. If that's true, reverse-processing T should also improve codec resistance."

That's what Phase O tests.



4. Phase O: What Happens If You Reverse-Process T?

Hypothesis

"Reverse-processing the correct T should reduce post-Opus distortion" (the pre-emphasis hypothesis).

This is a classic idea. Broadcast pre-emphasis/de-emphasis, CD pre-emphasis, tape bias adjustment — all flavors of "if it gets cut later, cut it first".

Experiment

  1. Apply the same shelf params (-3 to -4 dB @10kHz) detected from A to T_mastered.wav → revT_mastered.wav
  2. Roundtrip both T_mastered and revT_mastered through Opus 128k
  3. Compare both against the original T_mastered
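
As a concrete stand-in for step 2, a small Node sketch that shells out to ffmpeg. The article only says "Opus 128k roundtrip"; the ffmpeg invocation and file names here are assumptions.

// Opus 128k encode → decode roundtrip. Assumes ffmpeg is on PATH.
const { execSync } = require("child_process");

function opusRoundtrip(inWav, outWav) {
  execSync(`ffmpeg -y -i "${inWav}" -c:a libopus -b:a 128k tmp.opus`);
  execSync(`ffmpeg -y -i tmp.opus "${outWav}"`); // decode back to WAV
}

opusRoundtrip("T_mastered.wav", "T_opus.wav");
opusRoundtrip("revT_mastered.wav", "revT_opus.wav");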

Results

Song            | d(T_opus, T) | d(revT_opus, T) | pre-emph gain
Zureta Ondo     | 1.335        | 1.433           | -0.098
Digital Skyline | 1.340        | 1.514           | -0.174
Average         | 1.337        | 1.473           | -0.136

Hypothesis fully rejected. revT consistently does worse than T after Opus. It's now confirmed that reverse processing must NOT be applied to T.

The intermediate values are interesting too. Even before going through Opus, revT is already 0.48-0.63 distant from T. In other words, shelf cut alone purely adds distance by removing "correct HF". Opus, if anything, dutifully tries to preserve the HF.



5. The Big Reframing — 3 New Principles

Combining Phase N and O, the fundamental framing of the research changes. This is the heart of this article.

Principle 1: Reverse isn't "HF cut" — it's "wrong-HF removal"

Signal | Meaning of its HF                     | Result of applying reverse
T      | Correct information                   | Worsens (moves away from T)
A      | Suno-derived pathological structure   | Improves (moves toward T)
revA   | A with pathological structure reduced | Target state

Reverse is neither a plain low-pass nor a plain HF suppressor. That it pinpoints Suno-specific signatures was proven indirectly by Phase O's negative result: the pipeline is already grasping the cue that distinguishes "correct HF" from "pathological HF".

Principle 2: distance ≠ degradation

"T has the largest YouTube distance, 1.34" doesn't mean T was damaged. It means T has more HF information, so the diff measurement is more sensitive.

Audio type                                                        | Post-codec numerical distance | Listening
High quality (micro-phase / micro-HF / air / temporal continuity) | Large                         | Natural
Information-poor / already broken                                 | Small                         | Not necessarily "good"
This happens often in music listening. High-quality sources show large numerical diffs after a codec pass yet still sound natural; sources that were broken to begin with show small post-codec diffs, but that doesn't mean they sound good.

Reading numerical distance as "amount of degradation" is wrong; read it as "information density difference".

Principle 3: revA's strength comes from trajectory stabilization

Old interpretation                            | New interpretation
Has YouTube/Opus resistance                   | Was structurally closer to T + the distribution chain didn't kill the diff
Pre-emphasis hypothesis (rejected in Phase O) | Trajectory stabilization (against the codec chain)

Mechanism: A's pathological HF / temporal irritation / unnatural entropy flow gets further amplified by Opus. revA had pre-reduced these — so the gap actually widens after distribution. So reverse isn't EQ correction — it's closer to trajectory stabilization against the codec chain.

This leads to one big conclusion:

"Premium feel" = NOT adding information, but REDUCING pathological structure.

Not "boost something". "Subtract the pathological structures that turn into noise". That's what Phase N / O showed.



6. v3.20: HF Pathology Analyzer (shadow mode)

Given these conclusions, the next step was clear:

Don't crank up the reverse strength — make the firing condition smarter. "Which songs / how much" should depend on the HF "pathology level".

But we don't apply it yet

This is critical. We can build proxies for "pathological HF", but it's not yet verified that they actually point at pathological things. The scary part is misclassifying high-quality, beautiful HF as pathological (jazz brush cymbals, for example; see metric (a) below). Such songs have "unstable but beautiful HF". If you carelessly fire the detector here, you strip the premium feel.

So v3.20 deploys in shadow mode.

Phase                         | Apply gate                | Strength scaling
Phase 1 (current v3.20)       | Existing v3.6.x untouched | Not applied (fixed)
Phase 2 (future, conditional) | Existing gate maintained  | For gate-passing songs only, scale 0.35-1.15× by pathology

In other words: measure + display + telemetry only. The reverse pipeline itself is untouched.

The 3 metrics

Implementation: js/hf-pathology-analyzer.js (162 lines). Runs in Web Audio.

(a) hf_instability — temporal HF variation

hf_instability = std(frameEnergy) / mean(frameEnergy)

Is HF jumping around in time? Higher = more pathology candidate. But jazz brush cymbal also raises this, so it's not usable alone.

(b) hf_continuity — adjacent-frame structure similarity

hf_continuity = mean over t of cos_sim(hf_mag[t-1], hf_mag[t])

Is the spectral structure continuous over time? Lower = more pathological (structure splits frame-by-frame). This is the leading candidate.

(c) side_hf_randomness — Side channel chaos

mid = (L+R)/2, side = (L-R)/2
ratio[t] = ||side_hf[t]|| / ||mid_hf[t]||
side_hf_randomness = std(ratio) / mean(ratio)

Is the Side channel HF jumping randomly relative to Mid? A typical signature of artificial stereo spread / pseudo-air.

Composite

pathology_score = 0.45 * instability_norm
                + 0.35 * (1 - continuity_norm)
                + 0.20 * side_randomness_norm

Continuity is the leading hypothesis, instability is next, side is supporting. Weights are tentative — to be recalibrated from telemetry.
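
Putting the three formulas together: a self-contained sketch of the shadow-mode metrics. The STFT/HF-band plumbing and the identity *_norm mapping are assumptions; only the formulas and weights come from the article (the real code lives in js/hf-pathology-analyzer.js).

// hfMag: per-frame HF magnitude vectors; sideHf/midHf: same for Side/Mid.
const mean = xs => xs.reduce((s, x) => s + x, 0) / xs.length;
const std  = xs => { const m = mean(xs);
  return Math.sqrt(mean(xs.map(x => (x - m) ** 2))); };
const norm2  = v => Math.sqrt(v.reduce((s, x) => s + x * x, 0));
const cosSim = (a, b) =>
  a.reduce((s, x, i) => s + x * b[i], 0) / ((norm2(a) * norm2(b)) || 1e-9);

// (a) temporal HF variation: coefficient of variation of frame energy
function hfInstability(hfMag) {
  const e = hfMag.map(v => norm2(v));
  return std(e) / (mean(e) || 1e-9);
}

// (b) adjacent-frame structural similarity: higher = more continuous
function hfContinuity(hfMag) {
  const sims = [];
  for (let t = 1; t < hfMag.length; t++) sims.push(cosSim(hfMag[t - 1], hfMag[t]));
  return mean(sims);
}

// (c) Side-vs-Mid HF ratio jitter
function sideHfRandomness(sideHf, midHf) {
  const ratio = sideHf.map((s, t) => norm2(s) / (norm2(midHf[t]) || 1e-9));
  return std(ratio) / (mean(ratio) || 1e-9);
}

// Composite (weights from the article; the *_norm mapping is taken as identity here)
function pathologyScore(instNorm, contNorm, sideNorm) {
  return 0.45 * instNorm + 0.35 * (1 - contNorm) + 0.20 * sideNorm;
}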

For Phase 2 (computed but unused now)

strength_gain_proposed = clamp(0.55 + pathology_score * 0.75, 0.35, 1.15)

Critically, the lower bound is NOT 0. "Leave a little" is safer than "fully off". In future Phase 2, this multiplier will scale shelf depth for songs that already passed the existing gate. But in the current version, this is computed only — not applied.
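
The mapping itself is a direct transcription of the formula above:

// Proposed (computed but unused) strength multiplier from pathology_score.
const clamp = (x, lo, hi) => Math.min(hi, Math.max(lo, x));
function strengthGainProposed(pathologyScore) {
  return clamp(0.55 + pathologyScore * 0.75, 0.35, 1.15); // lower bound 0.35, never 0
}
// e.g. strengthGainProposed(0.42) ≈ 0.87, matching the UI example in the next section.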



7. UI display

Shadow mode is shown explicitly in the UI:

🔬 HF pathology (shadow mode) — measurement only, does not affect apply

pathology_score: 0.42 (42%) → mixed
[strength_gain (proposed): ×0.87, not applied]

▼ Breakdown (3 metrics)
  hf_instability         0.512 (norm 0.43)
  hf_continuity          0.871 (1-norm 0.41)
  side_hf_randomness     0.328 (norm 0.27)
  composite              0.45·inst + 0.35·(1-cont) + 0.20·side = 0.420
  shadow mode note       This score is currently NOT used for apply/strength.

Color coding: 0-0.4 green (stable) / 0.4-0.7 yellow (mixed) / 0.7- orange-red (pathological-like).
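
The color thresholds translate directly:

// UI color bucket for pathology_score (thresholds as listed above).
function pathologyColor(score) {
  if (score < 0.4) return "green";   // stable
  if (score < 0.7) return "yellow";  // mixed
  return "orange-red";               // pathological-like
}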



8. What data to collect — "missed cases" first

Now telemetry will start collecting data. But just looking at "improved" data is meaningless. The key to growing a 4th-generation detector is "missed cases".

Most important: high pathology + no improvement / worsening

Especially for songs with "unstable but beautiful HF", such as jazz brush cymbals.

If pathology_score is high but delta_T ≥ 0 for these, that's proof that "instability ≠ pathology". A decisive hint to redesign the normalization.

Hypothesis-supporting data

Conversely, if Suno metallic sheen / unstable side HF / pseudo-air / fake shimmer / transient spray show high pathology + improvement, the 4th-gen detector direction is on target.

4 correlation axes

Axis                                    | Interpretation
pathology_score × delta_T               | Detector validity (the main hypothesis)
hf_continuity × delta_T                 | Continuity-only hypothesis
side_hf_randomness × delta_T            | Stereo instability hypothesis
pathology_score × YouTube amplification | Codec amplification hypothesis (Phase N tie-in)

If the last one comes out positive, "codec chains amplify pathological structure" gets stronger.



9. Why "play defense" by design?

You might think "what a waste: you built a new detector and don't even apply it." There are three reasons.

1. The existing gate is the crystallization of empirical rules

The 3-stage gate (cv_ratio < 0.75 + persistence ≥ -30 + confidence ≥ 0.80) is the result of slowly killing false positives one-by-one through the long Phase E-K trial-and-error. "The courage not to touch" is built into it.
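
For reference, the gate as a predicate. Thresholds are from the text; the field names are hypothetical.

// The existing v3.6.x apply gate: all three conditions must hold.
function passesGate(sig) {
  return sig.cvRatio < 0.75       // within-song shelf signature present
      && sig.persistence >= -30   // skip the "creation type" (Phase J/K rule)
      && sig.confidence >= 0.80;  // detector confidence
}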

2. Pathology is still in theoretical stage

Of the 3 metrics, continuity is a strong candidate alone, but real-corpus behavior is unknown. In particular, no proven way yet to separate "high-quality instability" from "pathological instability".

3. Separation is safe

Gate = safety (track-record-based) / Pathology = research (hypothesis-based) — we keep them separated. Build new things without breaking old things. When the new thing matures, only then connect it.

"Don't add automatic processing because of an improvement you don't understand." This is a project principle that's been carried since v3.6.x.



10. Summary

Phase         | Result
Phase N       | Reverse improvement preserved through YouTube and amplified 1.46× (real uploads, 2 songs)
Phase O       | "Reverse T → codec resistance" hypothesis fully rejected
New principle | Reverse is NOT HF cut; it's "wrong-HF removal"
New principle | distance ≠ degradation; read it as information density difference
New principle | Premium feel = NOT adding info, but REDUCING pathological structure
v3.20         | HF Pathology Analyzer implemented (shadow mode, no apply impact)

The reverse pipeline is not just a low-pass — it's pinpointing Suno-specific pathological HF structure. That was the most important thing this round confirmed.

The next step is "make the firing condition smarter". Not crank up strength, but vary "which songs / how much" by HF pathology level. But cautiously. To avoid misclassifying high-quality HF, we collect data in shadow mode first.



11. Try it out

🔬 Open the reverse safe-d2 mode

App: https://webmastering.pages.dev/safe-d2.html

Browser-only (no Ubuntu server / no GPU). If you supply a Stems ZIP, pre-delta validation kicks in. Turning telemetry ON contributes verification data for the 4th-generation detector.

Top page (all torimasu modes): https://webmastering.pages.dev/

No improvement guarantee

This mode is NOT a "clearly sounds better" tool. "Indistinguishable in short A/B" + "slightly less fatiguing over long listening" — that's the success criterion. The value is in "not breaking" and "low accident rate".

(If you want flashy / louder, use torimasu's normal mastering mode instead.)


Related references (technical detail)

2026-05-11 / torimasu v3.20 deploy notes