Why spectral analysis works

AI generators synthesise audio through neural network decoders. That synthesis process leaves consistent artifacts in the frequency domain — patterns that are statistically distinct from how audio behaves when it is captured from real instruments and real acoustic spaces. These artifacts persist even after standard mastering and compression, because they are baked into the signal at the generation stage, not added by post-processing.

Mastering can change loudness, tonal balance, and dynamic range. It cannot retroactively introduce the physical properties that natural recordings carry: room reflections, microphone phase relationships, the analog noise floor, and the genuine randomness of acoustic bleed between sources. These are exactly the things neural decoders fail to replicate convincingly.

Three signals stand out consistently across AI-generated music. Together, they form the spectral fingerprint.

Signal 1 — The high-frequency hole at 17–19 kHz

What it is: AI-generated audio rolls off sharply before it reaches 20 kHz. The 17–19 kHz band contains far less energy than it does in well-recorded human music. We measure this as normalised subband energy: the proportion of total signal energy present in that narrow band.

17–19 kHz subband energy
AI generators:    0.07 – 0.34
Human recordings: 0.756 – 0.831
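To make the measurement concrete, here is a minimal sketch of a subband energy ratio in Python with NumPy. It is an illustration, not our production code: the function name, the Hann window, and the normalisation (band energy divided by total spectral energy) are assumptions, so its output is not directly comparable to the ranges listed above.

```python
import numpy as np

def subband_energy_ratio(samples, sample_rate, low_hz=17_000.0, high_hz=19_000.0):
    """Share of total spectral energy inside [low_hz, high_hz].

    Assumed normalisation: band energy / total energy. Hypothetical helper.
    """
    x = np.asarray(samples, dtype=np.float64)
    if x.ndim == 2:
        x = x.mean(axis=1)  # (n_samples, n_channels) -> mono average
    if sample_rate / 2 <= high_hz:
        raise ValueError("sample rate is too low to represent the band")
    # Power spectrum of the Hann-windowed signal.
    power = np.abs(np.fft.rfft(x * np.hanning(len(x)))) ** 2
    freqs = np.fft.rfftfreq(len(x), d=1.0 / sample_rate)
    in_band = (freqs >= low_hz) & (freqs <= high_hz)
    total = power.sum()
    return float(power[in_band].sum() / total) if total > 0 else 0.0

# Synthetic check: a 1 kHz tone alone, then with an equally loud 18 kHz tone.
sample_rate = 48_000
t = np.arange(sample_rate) / sample_rate
tone_1k = np.sin(2 * np.pi * 1_000 * t)
tone_18k = np.sin(2 * np.pi * 18_000 * t)
print(subband_energy_ratio(tone_1k, sample_rate))             # close to 0
print(subband_energy_ratio(tone_1k + tone_18k, sample_rate))  # close to 0.5
```

Note that the file must be sampled at more than 38 kHz for the 17–19 kHz band to exist at all, which is why the sketch rejects lower sample rates instead of returning a misleading zero.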

Why it happens: The decoder architecture of AI music models compresses and reconstructs audio up to a frequency ceiling that does not match natural recordings. The model learns to generate convincing content in the range that matters most for perceived musical quality — roughly 20 Hz to 15 kHz — but content above that ceiling is sparse or absent. The cutoff is a byproduct of training data compression, decoder resolution limits, and the fact that very high frequencies contribute little to perceived musical character.

What it sounds like: On well-calibrated speakers or open-back headphones, AI tracks can sound slightly "sealed off" at the top — missing the air, shimmer, and physical presence of real instruments and recorded rooms. It is subtle and easy to miss by ear. In the frequency data, it is not subtle at all.

Signal 2 — Phase coherence anomaly

What it is: Despite being synthesised, AI-generated music has paradoxically low phase coherence in the high-frequency range. Phase coherence measures how consistently the left and right channels share phase relationships across frequency — a value closer to 1 indicates tightly controlled phase; lower values indicate randomness or inconsistency.

High-frequency phase coherence
AI generators:    0.071 – 0.085
Human recordings: 0.097 – 0.112

Why it happens: This is counterintuitive — you might expect a machine to be more coherent than a human recording, not less. But AI generation introduces phase randomness at high frequencies that does not match the controlled phase relationships present in well-engineered recordings. Engineers spend considerable effort managing phase: matched microphone placement, polarity alignment, phase-correcting EQ. The generation process does not replicate this discipline. The result is subtle incoherence that is absent from professionally engineered music.

How to detect it: You cannot hear this signal or spot it by looking at a waveform. It requires a complex-valued Short-Time Fourier Transform (STFT) of both channels across the full frequency range, from which the phase angle relationship between left and right is computed at each frequency bin over time. It is a purely analytical measurement.
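The kind of analysis described above can be sketched as follows. This is one plausible formulation, not necessarily the one our engine uses: it takes the phase difference between channels in every STFT bin, averages the resulting unit phasors over time, and reports the mean magnitude over the high-frequency bins. The 10 kHz lower bound, the frame sizes, and the function names are all assumptions.

```python
import numpy as np

def stft(x, frame_len=2048, hop=512):
    """Complex STFT via Hann-windowed frames. Returns shape (n_frames, n_bins)."""
    x = np.asarray(x, dtype=np.float64)
    if len(x) < frame_len:
        raise ValueError("signal is shorter than one frame")
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

def hf_phase_coherence(left, right, sample_rate, min_hz=10_000.0,
                       frame_len=2048, hop=512):
    """Mean inter-channel phase consistency above min_hz (1 = stable, ~0 = random)."""
    spec_l = stft(left, frame_len, hop)
    spec_r = stft(right, frame_len, hop)
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / sample_rate)
    band = freqs >= min_hz
    # Cross-spectrum: its angle is the L/R phase difference in each bin.
    cross = spec_l[:, band] * np.conj(spec_r[:, band])
    phasors = cross / (np.abs(cross) + 1e-12)   # keep phase, drop magnitude
    per_bin = np.abs(phasors.mean(axis=0))      # consistency over time
    return float(per_bin.mean())

# Synthetic check with white noise.
rng = np.random.default_rng(0)
sample_rate = 48_000
shared = rng.standard_normal(sample_rate)
other = rng.standard_normal(sample_rate)
print(hf_phase_coherence(shared, shared, sample_rate))  # close to 1
print(hf_phase_coherence(shared, other, sample_rate))   # much lower
```

Averaging unit phasors rather than raw phase angles avoids wrap-around problems at ±π, which is the main reason for this design choice.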

Signal 3 — Near-mono high frequencies

What it is: AI generators produce high-frequency content that is nearly identical in the left and right channels — far more correlated than what you find in human recordings. We measure this as the L/R correlation coefficient in the high-frequency band, where 1.0 is perfect mono and 0.0 is completely independent channels.

High-frequency L/R stereo correlation
AI generators:    ~0.94
Human recordings: ~0.79
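A correlation coefficient of this kind can be sketched in a few lines. The example below is an assumption-laden illustration: it isolates the high band with a simple FFT mask (a production system would more likely use a proper filter), treats everything above 10 kHz as "high-frequency", and uses hypothetical function names.

```python
import numpy as np

def highpass_fft(x, sample_rate, min_hz):
    """Zero every FFT bin below min_hz and return the time-domain signal."""
    spectrum = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), d=1.0 / sample_rate)
    spectrum[freqs < min_hz] = 0.0
    return np.fft.irfft(spectrum, n=len(x))

def hf_stereo_correlation(left, right, sample_rate, min_hz=10_000.0):
    """Pearson correlation between the high-passed left and right channels."""
    l = highpass_fft(np.asarray(left, dtype=np.float64), sample_rate, min_hz)
    r = highpass_fft(np.asarray(right, dtype=np.float64), sample_rate, min_hz)
    l = l - l.mean()
    r = r - r.mean()
    denom = np.sqrt((l * l).sum() * (r * r).sum())
    return float((l * r).sum() / denom) if denom > 0 else 0.0

# Synthetic check: identical channels vs. channels that share half their content.
rng = np.random.default_rng(0)
sample_rate = 48_000
shared = rng.standard_normal(sample_rate)
left = shared + rng.standard_normal(sample_rate)
right = shared + rng.standard_normal(sample_rate)
print(hf_stereo_correlation(shared, shared, sample_rate))  # close to 1.0
print(hf_stereo_correlation(left, right, sample_rate))     # close to 0.5
```

A Pearson coefficient can also go negative (down to -1.0 for polarity-inverted channels), so values below 0.0 are possible in this sketch even though the scale above is described from 0.0 to 1.0.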

Why it happens: AI models generate mono or near-mono high-frequency content and apply minimal decorrelation in post. Natural recordings accumulate genuine stereo width at high frequencies through physical processes: room reflections arriving at different times and angles, variation in microphone placement between channels, instrument bleed from off-axis sources, and the natural acoustic divergence between left and right microphones in a stereo pair. None of these processes exist in neural synthesis — the model has no room, no microphone, and no physical source to capture.

How to hear it: Switch between the stereo mix and a mono fold-down. Human recordings change noticeably in the high-frequency range: cymbals shift in character, reverb tails narrow, room air changes texture. AI tracks barely change, because their high-frequency content was effectively mono to begin with.
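The same listening test can be approximated numerically. The sketch below is illustrative only: it compares high-frequency energy in the mono fold-down, (L + R) / 2, with the average high-frequency energy of the two original channels. The 10 kHz boundary and the function name are assumptions.

```python
import numpy as np

def hf_energy_retained_in_mono(left, right, sample_rate, min_hz=10_000.0):
    """Fraction of high-frequency energy that survives a mono fold-down."""
    left = np.asarray(left, dtype=np.float64)
    right = np.asarray(right, dtype=np.float64)
    freqs = np.fft.rfftfreq(len(left), d=1.0 / sample_rate)
    band = freqs >= min_hz

    def band_energy(x):
        return float((np.abs(np.fft.rfft(x))[band] ** 2).sum())

    stereo = 0.5 * (band_energy(left) + band_energy(right))
    mono = band_energy(0.5 * (left + right))
    return mono / stereo if stereo > 0 else 0.0

# Synthetic check: identical channels lose nothing, independent noise loses about half.
rng = np.random.default_rng(0)
sample_rate = 48_000
a = rng.standard_normal(sample_rate)
b = rng.standard_normal(sample_rate)
print(hf_energy_retained_in_mono(a, a, sample_rate))  # close to 1.0
print(hf_energy_retained_in_mono(a, b, sample_rate))  # close to 0.5
```

If both channels carry equal high-frequency power and have correlation ρ, the expected retained fraction is (1 + ρ) / 2. For the approximate correlations quoted above, that gives (1 + 0.94) / 2 = 0.97 for AI tracks and (1 + 0.79) / 2 = 0.895 for human recordings.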

The 9 detection rules

We combine the three metrics above, together with a high-frequency crest factor, into 9 AND-combination rules. Each rule fires only when both of its metric thresholds are crossed at the same time. This two-signal requirement dramatically reduces false positives: a track has to show two independent anomalies, not just one, before a rule activates.

Rule                           | Signals combined                        | AI recall | False positive rate
phase_low + subband17_low      | Phase coherence + HF energy             | 81.8%     | 0.0%
phase_low + stereo_score_high  | Phase coherence + L/R correlation       | 87.9%     | 2.6%
phase_low + stereo_med_high    | Phase coherence + L/R correlation (mid) | 90.9%     | 4.1%
phase_low + hf_crest_low       | Phase coherence + HF crest factor       | 78.8%     | 4.4%
subband17_low + stereo_score   | HF energy + L/R correlation             | 69.7%     | 1.1%
subband17_low + stereo_med     | HF energy + L/R correlation (mid)       | 72.7%     | 1.1%
subband17_low + hf_crest_low   | HF energy + HF crest factor             | 75.8%     | 1.1%
stereo_score + hf_crest_low    | L/R correlation + HF crest factor       | 66.7%     | 5.2%
stereo_med + hf_crest_low      | L/R correlation (mid) + HF crest factor | 69.7%     | 12.2%

The rules with the lowest false positive rates — particularly phase_low + subband17_low at 0.0% FP — are treated as high-confidence signals. Rules with higher false positive rates contribute to the verdict but are weighted accordingly. No single rule is determinative on its own.
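The AND-combination logic itself is simple, and a small sketch may help. The flag names below are taken from the rule table, but two things are assumptions: that "stereo_score" and "stereo_score_high" (and likewise "stereo_med" and "stereo_med_high") refer to the same flag, and that each flag has already been computed by comparing a metric with its threshold. The thresholds are not reproduced here.

```python
from typing import Dict, List, Tuple

# Each rule is a pair of flags that must BOTH be set for the rule to fire.
RULES: List[Tuple[str, str]] = [
    ("phase_low", "subband17_low"),
    ("phase_low", "stereo_score_high"),
    ("phase_low", "stereo_med_high"),
    ("phase_low", "hf_crest_low"),
    ("subband17_low", "stereo_score_high"),
    ("subband17_low", "stereo_med_high"),
    ("subband17_low", "hf_crest_low"),
    ("stereo_score_high", "hf_crest_low"),
    ("stereo_med_high", "hf_crest_low"),
]

def fired_rules(flags: Dict[str, bool]) -> List[str]:
    """Return the names of all rules whose two flags are both True."""
    return [
        f"{first} + {second}"
        for first, second in RULES
        if flags.get(first, False) and flags.get(second, False)
    ]

# Hypothetical track: low phase coherence and a 17-19 kHz hole, nothing else.
example_flags = {
    "phase_low": True,
    "subband17_low": True,
    "stereo_score_high": False,
    "stereo_med_high": False,
    "hf_crest_low": False,
}
print(fired_rules(example_flags))  # ['phase_low + subband17_low']
```

A single anomalous metric never fires a rule on its own in this scheme, which is the property that keeps false positives down.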

The overall verdict is determined by how many rules fire and which ones:

Very likely AI-generated: 2 or more high-confidence rules triggered
Likely AI-generated: 1 high-confidence rule + 1 other rule, or 3 or more rules total
Suspicious: 2 rules triggered
Weak AI signal: 1 rule triggered
Likely human-made: 0 rules triggered
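The verdict tiers translate directly into a short decision function. The sketch below follows the tiers as listed, checked from strongest to weakest. Which rules count as high-confidence is passed in as a parameter, and the set used in the example is hypothetical, chosen only to exercise the logic.

```python
from typing import Iterable, Set

def verdict(fired: Iterable[str], high_confidence: Set[str]) -> str:
    """Map the fired rules to a verdict label, strongest tier first."""
    fired = list(fired)
    total = len(fired)
    strong = sum(1 for rule in fired if rule in high_confidence)
    if strong >= 2:
        return "Very likely AI-generated"
    if (strong >= 1 and total >= 2) or total >= 3:
        return "Likely AI-generated"
    if total == 2:
        return "Suspicious"
    if total == 1:
        return "Weak AI signal"
    return "Likely human-made"

# Hypothetical high-confidence set, for illustration only.
HIGH_CONFIDENCE = {
    "phase_low + subband17_low",
    "subband17_low + stereo_score",
    "subband17_low + stereo_med",
    "subband17_low + hf_crest_low",
}

print(verdict(["phase_low + subband17_low", "phase_low + hf_crest_low"],
              HIGH_CONFIDENCE))  # Likely AI-generated
```

Checking the tiers in descending order matters: two fired rules land in "Suspicious" only when neither of them is high-confidence.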

What we don't detect (and why we say so)

The engine is trained on music from all major AI generators. Detection accuracy may vary per generator — some produce artifacts that map more cleanly onto our rule thresholds than others, and as generators evolve their fingerprints shift. We update the model accordingly, but we do not claim universal detection across every tool at every output setting.

There are also processing conditions that reduce detection reliability:

This is why we show you the actual metric values, not just the verdict. If a track lands in an ambiguous range, you can see which specific number is close to the threshold — and decide whether the context justifies a closer look.

Why transparency matters

Most detectors give you a black-box score. We show you the actual metric values, which rules fired, and the confidence level of each rule. The result is auditable, not just a number with a label attached to it.

If our detector flags your track incorrectly, you can see exactly which signal pushed the score over the threshold. You can then make an informed decision about what to do next: provide additional documentation, investigate whether a specific processing step is responsible, or contact the platform with a clear explanation of what the data shows.

This matters in three specific contexts:

The core principle

Detection without explanation is not accountability — it is just automation. TrackVerifier is built on the premise that the people affected by a verdict deserve to understand it. That means showing the numbers, naming the rules, and being explicit about what the system cannot detect as well as what it can.