How We Detect AI Music — Spectral Fingerprint Analysis Explained

Most AI music detectors give you a verdict with no explanation. We built TrackVerifier differently — every metric is visible, every rule is named. This article explains exactly what we look for and why it works.

Why spectral analysis works

AI generators synthesise audio through neural network decoders. That synthesis process leaves consistent artifacts in the frequency domain — patterns that are statistically distinct from how audio behaves when it is captured from real instruments and real acoustic spaces. These artifacts persist even after standard mastering and compression, because they are baked into the signal at the generation stage, not added by post-processing.

Mastering can change loudness, tonal balance, and dynamic range. It cannot retroactively introduce the physical properties that natural recordings carry: room reflections, microphone phase relationships, the analog noise floor, and the genuine randomness of acoustic bleed between sources. These are exactly the things neural decoders fail to replicate convincingly.

Three signals stand out consistently across AI-generated music. Together, they form the spectral fingerprint.

Signal 1 — The high-frequency hole at 17–19 kHz

What it is: AI generators cut off sharply just below 20 kHz. The 17–19 kHz band contains almost no energy compared to what you find in well-recorded human music. We measure this as normalised subband energy — the proportion of total signal energy present in that narrow band.

17–19 kHz subband energy
AI generators	0.07 – 0.34
Human recordings	0.756 – 0.831

Why it happens: The decoder architecture of AI music models compresses and reconstructs audio up to a frequency ceiling that does not match natural recordings. The model learns to generate convincing content in the range that matters most for perceived musical quality — roughly 20 Hz to 15 kHz — but content above that ceiling is sparse or absent. The cutoff is a byproduct of training data compression, decoder resolution limits, and the fact that very high frequencies contribute little to perceived musical character.

What it sounds like: On well-calibrated speakers or open-back headphones, AI tracks can sound slightly "sealed off" at the top — missing the air, shimmer, and physical presence of real instruments and recorded rooms. It is subtle and easy to miss by ear. In the frequency data, it is not subtle at all.
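As a rough illustration of the measurement, the sketch below estimates what share of a signal's energy sits in the 17–19 kHz band. It is not TrackVerifier's implementation: the function name `band_energy_fraction`, the 512-sample Hann-windowed frames, and the plain "band energy divided by total energy" normalisation are all assumptions made for the example. The article does not spell out its exact normalisation, so values from this sketch are not expected to line up with the ranges quoted above.

```python
import cmath
import math


def band_energy_fraction(samples, sample_rate, lo_hz=17000.0, hi_hz=19000.0, frame=512):
    """Share of windowed signal energy between lo_hz and hi_hz (0.0 to 1.0).

    Hypothetical helper: frame size, window and normalisation are assumptions.
    """
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame) for n in range(frame)]
    # DFT bins whose centre frequency lies inside the band (and below Nyquist).
    bins = [k for k in range(1, frame // 2)
            if lo_hz <= k * sample_rate / frame <= hi_hz]
    band_energy = 0.0
    total_energy = 0.0
    for start in range(0, len(samples) - frame + 1, frame):
        x = [samples[start + n] * window[n] for n in range(frame)]
        total_energy += sum(v * v for v in x)
        for k in bins:
            step = -2j * math.pi * k / frame
            coeff = sum(x[n] * cmath.exp(step * n) for n in range(frame))
            # Parseval: every positive-frequency bin has a mirrored negative twin.
            band_energy += 2.0 * abs(coeff) ** 2 / frame
    return band_energy / total_energy if total_energy > 0 else 0.0
```

Fed a mono list of float samples at 44.1 or 48 kHz, the function returns a value near 0 when the band is empty and near 1 when almost all energy sits inside it. A stereo file would need to be reduced to one channel, or measured per channel, first.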

Signal 2 — Phase coherence anomaly

What it is: Despite being synthesised, AI-generated music has paradoxically low phase coherence in the high-frequency range. Phase coherence measures how consistently the left and right channels share phase relationships across frequency — a value closer to 1 indicates tightly controlled phase; lower values indicate randomness or inconsistency.

High-frequency phase coherence
AI generators	0.071 – 0.085
Human recordings	0.097 – 0.112

Why it happens: This is counterintuitive — you might expect a machine to be more coherent than a human recording, not less. But AI generation introduces phase randomness at high frequencies that does not match the controlled phase relationships present in well-engineered recordings. Engineers spend considerable effort managing phase: matched microphone placement, polarity alignment, phase-correcting EQ. The generation process does not replicate this discipline. The result is subtle incoherence that is absent from professionally engineered music.

How to detect it: This signal is not something you can hear or spot by looking at a waveform. It requires a complex-valued Short-Time Fourier Transform (STFT) of both channels across the full frequency range, followed by a comparison of the phase angle between left and right at each frequency bin over time. It is a purely analytical measurement.
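One common way to turn "phase relationship at each bin over time" into a single number is to average the unit phasor of the left/right phase difference across frames, take its magnitude per bin, and then average across the high-frequency bins. The sketch below does that. The formula, the 10–20 kHz band limits, the frame and hop sizes, and the names `band_spectra` and `hf_phase_coherence` are assumptions for illustration; the article does not publish its exact definition, so the output is not expected to match the ranges above.

```python
import cmath
import math


def band_spectra(samples, sample_rate, lo_hz, hi_hz, frame=256, hop=128):
    """Hann-windowed STFT restricted to bins between lo_hz and hi_hz.

    Returns one list of complex coefficients per frame.
    """
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame) for n in range(frame)]
    bins = [k for k in range(1, frame // 2)
            if lo_hz <= k * sample_rate / frame <= hi_hz]
    twiddles = [[cmath.exp(-2j * math.pi * k * n / frame) for n in range(frame)]
                for k in bins]
    spectra = []
    for start in range(0, len(samples) - frame + 1, hop):
        x = [samples[start + n] * window[n] for n in range(frame)]
        spectra.append([sum(a * b for a, b in zip(x, tw)) for tw in twiddles])
    return spectra


def hf_phase_coherence(left, right, sample_rate, lo_hz=10000.0, hi_hz=20000.0):
    """Mean over HF bins of |average unit phasor of the L/R phase difference|.

    1.0 means the phase difference is constant over time; values near 0
    mean it is effectively random. Band limits are assumed, not published.
    """
    spec_l = band_spectra(left, sample_rate, lo_hz, hi_hz)
    spec_r = band_spectra(right, sample_rate, lo_hz, hi_hz)
    if not spec_l or not spec_l[0]:
        return 0.0
    per_bin = []
    for i in range(len(spec_l[0])):
        acc, count = 0j, 0
        for frame_l, frame_r in zip(spec_l, spec_r):
            cross = frame_l[i] * frame_r[i].conjugate()
            if abs(cross) > 0.0:
                acc += cross / abs(cross)  # keep the phase difference, drop magnitude
                count += 1
        if count:
            per_bin.append(abs(acc) / count)
    return sum(per_bin) / len(per_bin) if per_bin else 0.0
```

Note that with only a handful of frames even unrelated channels score well above zero, because the average of a few random phasors rarely cancels completely. Any threshold therefore depends on track length and frame settings.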

Signal 3 — Near-mono high frequencies

What it is: AI generators produce high-frequency content that is nearly identical in the left and right channels — far more correlated than what you find in human recordings. We measure this as the L/R correlation coefficient in the high-frequency band, where 1.0 is perfect mono and 0.0 is completely independent channels.

High-frequency L/R stereo correlation
AI generators	~0.94
Human recordings	~0.79

Why it happens: AI models generate mono or near-mono high-frequency content and apply minimal decorrelation in post. Natural recordings accumulate genuine stereo width at high frequencies through physical processes: room reflections arriving at different times and angles, variation in microphone placement between channels, instrument bleed from off-axis sources, and the natural acoustic divergence between left and right microphones in a stereo pair. None of these processes exist in neural synthesis — the model has no room, no microphone, and no physical source to capture.

How to hear it: Sum the track to mono and compare it with the stereo original. Human recordings change noticeably in the high-frequency range: cymbals shift in character, reverb tails narrow, room air changes texture. AI tracks barely change, because their high-frequency content was effectively mono to begin with.
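The same comparison can be made numerically. The sketch below computes a correlation coefficient between the left and right channels using only the spectral content inside a high-frequency band, which is equivalent to correlating the two band-limited signals. The 10–20 kHz band, the frame settings, and the name `hf_stereo_correlation` are assumptions for illustration rather than TrackVerifier's actual parameters.

```python
import cmath
import math


def hf_stereo_correlation(left, right, sample_rate,
                          lo_hz=10000.0, hi_hz=20000.0, frame=256, hop=128):
    """Band-limited L/R correlation: 1.0 is mono, 0.0 is unrelated channels.

    Hypothetical helper; the band limits and frame sizes are assumed values.
    """
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame) for n in range(frame)]
    bins = [k for k in range(1, frame // 2)
            if lo_hz <= k * sample_rate / frame <= hi_hz]
    twiddles = [[cmath.exp(-2j * math.pi * k * n / frame) for n in range(frame)]
                for k in bins]
    cross = 0.0
    energy_l = 0.0
    energy_r = 0.0
    usable = min(len(left), len(right))
    for start in range(0, usable - frame + 1, hop):
        xl = [left[start + n] * window[n] for n in range(frame)]
        xr = [right[start + n] * window[n] for n in range(frame)]
        for tw in twiddles:
            cl = sum(a * b for a, b in zip(xl, tw))
            cr = sum(a * b for a, b in zip(xr, tw))
            # Real part of the cross-spectrum is the in-phase (shared) energy.
            cross += (cl * cr.conjugate()).real
            energy_l += abs(cl) ** 2
            energy_r += abs(cr) ** 2
    denom = math.sqrt(energy_l * energy_r)
    return cross / denom if denom > 0 else 0.0
```

Identical channels return 1.0 and a polarity-inverted copy returns -1.0, so the result behaves like an ordinary correlation coefficient restricted to the chosen band.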

The 9 detection rules

We combine the metrics above, together with two supporting measurements that appear in the table below (a second L/R correlation measure, labelled "mid", and a high-frequency crest factor), into 9 AND-combination rules. Each rule fires only when both of its metric thresholds are crossed at the same time. This two-signal requirement sharply reduces false positives: a track needs to trigger two independent anomalies, not just one, to activate a rule.

Rule	Signals combined	AI recall	False positive rate
phase_low + subband17_low	Phase coherence + HF energy	81.8%	0.0%
phase_low + stereo_score_high	Phase coherence + L/R correlation	87.9%	2.6%
phase_low + stereo_med_high	Phase coherence + L/R correlation (mid)	90.9%	4.1%
phase_low + hf_crest_low	Phase coherence + HF crest factor	78.8%	4.4%
subband17_low + stereo_score	HF energy + L/R correlation	69.7%	1.1%
subband17_low + stereo_med	HF energy + L/R correlation (mid)	72.7%	1.1%
subband17_low + hf_crest_low	HF energy + HF crest factor	75.8%	1.1%
stereo_score + hf_crest_low	L/R correlation + HF crest factor	66.7%	5.2%
stereo_med + hf_crest_low	L/R correlation (mid) + HF crest factor	69.7%	12.2%

The rules with the lowest false positive rates, particularly phase_low + subband17_low at 0.0%, are treated as high-confidence signals. Rules with higher false positive rates still contribute to the verdict but carry less weight. No single rule is determinative on its own.

The overall verdict depends on how many rules fire and which ones. The tiers are checked from top to bottom, and the first match applies:

Very likely AI-generated: 2 or more high-confidence rules triggered
Likely AI-generated: 1 high-confidence rule + 1 other rule, or 3 or more rules total
Suspicious: 2 rules triggered
Weak AI signal: 1 rule triggered
Likely human-made: 0 rules triggered
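The rule table and verdict tiers can be sketched as a small piece of logic. This is an illustration under stated assumptions, not the production engine: the flag names reuse the table's identifiers (with stereo_score and stereo_med taken to mean the same flags as stereo_score_high and stereo_med_high), the metric thresholds are left out because the article does not publish them, the tiers are assumed to be evaluated top to bottom, and the set of high-confidence rules is passed in by the caller because the article names only one of them explicitly.

```python
# Each rule is a pair of boolean flags that must both be set (AND-combination).
RULES = [
    ("phase_low", "subband17_low"),
    ("phase_low", "stereo_score_high"),
    ("phase_low", "stereo_med_high"),
    ("phase_low", "hf_crest_low"),
    ("subband17_low", "stereo_score_high"),
    ("subband17_low", "stereo_med_high"),
    ("subband17_low", "hf_crest_low"),
    ("stereo_score_high", "hf_crest_low"),
    ("stereo_med_high", "hf_crest_low"),
]


def fired_rules(flags):
    """Return the rules whose two flags are both true in the flags dict."""
    return [rule for rule in RULES if flags.get(rule[0]) and flags.get(rule[1])]


def verdict(fired, high_confidence):
    """Map fired rules to a verdict tier, checking tiers from the top down.

    high_confidence is a set of rule tuples supplied by the caller (assumed).
    """
    total = len(fired)
    strong = sum(1 for rule in fired if rule in high_confidence)
    if strong >= 2:
        return "Very likely AI-generated"
    if (strong >= 1 and total >= 2) or total >= 3:
        return "Likely AI-generated"
    if total == 2:
        return "Suspicious"
    if total == 1:
        return "Weak AI signal"
    return "Likely human-made"
```

For example, a track with phase_low, subband17_low and hf_crest_low all set fires three rules. If only phase_low + subband17_low is treated as high-confidence, that yields "Likely AI-generated"; if subband17_low + hf_crest_low is also counted as high-confidence, it becomes "Very likely AI-generated".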

What we don't detect (and why we say so)

The engine is trained on music from all major AI generators. Detection accuracy may vary per generator — some produce artifacts that map more cleanly onto our rule thresholds than others, and as generators evolve their fingerprints shift. We update the model accordingly, but we do not claim universal detection across every tool at every output setting.

There are also processing conditions that reduce detection reliability:

Heavy post-processing: Tracks that have been re-recorded through analog hardware, heavily pitch-shifted, or time-stretched significantly may score lower. Some processing paths genuinely alter the spectral fingerprint rather than simply masking it.
Low-bitrate MP3 encoding: Files compressed below approximately 128 kbps have already had high-frequency content removed by the encoder. This destroys the 17–19 kHz signal and the stereo correlation data we rely on. In these cases, TrackVerifier returns an "insufficient HF content" warning rather than a false human verdict.
AI-assisted human recordings: A track recorded live but processed through AI-based mixing or mastering tools may show partial fingerprints. We report the raw numbers precisely because these edge cases require human judgment.

This is why we show you the actual metric values, not just the verdict. If a track lands in an ambiguous range, you can see which specific number is close to the threshold — and decide whether the context justifies a closer look.

Why transparency matters

Most detectors give you a black-box score. We show you the actual metric values, which rules fired, and the confidence level of each rule. The result is auditable, not just a number with a label attached to it.

If our detector flags your track incorrectly, you can see exactly which signal pushed the score over the threshold. You can then make an informed decision about what to do next: provide additional documentation, investigate whether a specific processing step is responsible, or contact the platform with a clear explanation of what the data shows.

This matters in three specific contexts:

Artists challenging a rejection: A named rule with a specific metric value is a starting point for a conversation. "The detector fired because of stereo correlation in the HF band" is actionable. "Your score was 73" is not.
Labels building auditable A&R processes: If your A&R workflow uses automated screening, you need to be able to show what the system found and why. A transparent per-track log satisfies that requirement. A black-box verdict does not.
Sync agencies maintaining compliance trails: Sync licensing increasingly requires documented due diligence on AI provenance. A structured output showing metric values and rule activations is a document. A pass/fail result is not.
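To make "a structured output showing metric values and rule activations" concrete, here is one possible shape for a per-track record. The class name `TrackReport`, every field name, and the sample values are hypothetical; they illustrate the idea of an auditable log entry and do not describe TrackVerifier's actual output format.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class TrackReport:
    """Hypothetical per-track audit record; all field names are assumptions."""
    track_id: str
    metrics: dict       # metric name -> measured value
    rules_fired: list   # names of the rules that triggered
    verdict: str
    warnings: list = field(default_factory=list)

    def to_json(self):
        # Sorted keys keep the log stable and easy to diff between runs.
        return json.dumps(asdict(self), indent=2, sort_keys=True)


# Illustrative values only, chosen to fall inside the AI ranges quoted earlier.
report = TrackReport(
    track_id="example-track",
    metrics={"subband17": 0.21, "phase_coherence": 0.078, "stereo_score": 0.94},
    rules_fired=["phase_low + subband17_low"],
    verdict="Weak AI signal",
)
print(report.to_json())
```

A record like this can be stored alongside the submission, so a later dispute can point to the exact metric and rule involved rather than to a bare score.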

The core principle

Detection without explanation is not accountability — it is just automation. TrackVerifier is built on the premise that the people affected by a verdict deserve to understand it. That means showing the numbers, naming the rules, and being explicit about what the system cannot detect as well as what it can.
