A university professor submitted a paragraph from a 1987 academic paper about thermodynamics to leading AI text detectors. GPTZero classified it as 94% AI-generated; Originality.ai scored it 88% AI-generated. The text was written entirely by a human, 35 years before ChatGPT was released. This is not a fringe failure: research from Stanford (2024) found false positive rates of 9–16% on non-native English writing and formal academic prose, precisely the text that most resembles LLM output.
Understanding why this happens lets you interpret a detection score rather than treat it as a verdict.
The Three Detection Mechanisms
| Method | How it works | Weakness |
|---|---|---|
| Perplexity scoring | Measures how surprising each word choice is to a language model. LLM output tends toward predictable words; human writing contains more surprising choices. | Formal, precise writing is also low-perplexity, so academic and legal text triggers false positives |
| Burstiness analysis | Human writing alternates between short and long sentences irregularly. LLM writing is more uniform. | Professional editors smooth out burstiness; edited human writing looks more AI-like |
| Watermark detection | Detects cryptographic watermarks embedded by some LLMs at generation time. | Only works if the generating model embedded a watermark, and most public APIs do not |
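
The first two signals in the table can be illustrated with a toy sketch. This is not how any commercial detector is implemented: real detectors score perplexity with a large language model, whereas the hypothetical `unigram_perplexity` below substitutes a smoothed word-frequency model so the example runs on the standard library alone. The hypothetical `burstiness` function uses the coefficient of variation of sentence lengths, which is one plausible way to quantify sentence-length irregularity, assumed here for illustration.

```python
import math
import re
import statistics
from collections import Counter


def unigram_perplexity(text: str, reference: str) -> float:
    """Perplexity of `text` under a unigram model fit on `reference`.

    Toy stand-in for LLM-based perplexity. Add-one smoothing keeps
    unseen words from receiving zero probability.
    """
    ref_tokens = re.findall(r"[a-z']+", reference.lower())
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        raise ValueError("text contains no word tokens")
    counts = Counter(ref_tokens)
    vocab_size = len(counts) + 1  # one extra bucket for unseen words
    total = len(ref_tokens)
    log_prob = sum(
        math.log((counts[tok] + 1) / (total + vocab_size)) for tok in tokens
    )
    # Perplexity is the exponential of the average negative log-probability.
    return math.exp(-log_prob / len(tokens))


def burstiness(text: str) -> float:
    """Coefficient of variation of sentence lengths, measured in words.

    0.0 means every sentence has the same length; larger values mean
    more irregular alternation between short and long sentences.
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    lengths = [len(s.split()) for s in sentences]
    if len(lengths) < 2:
        return 0.0
    return statistics.pstdev(lengths) / statistics.mean(lengths)
```

Text built from words the reference model has seen often scores a lower perplexity than text full of unseen words, and evenly sized sentences score a burstiness of zero. Both behaviours mirror the weaknesses in the table: precise, uniform prose pushes both numbers toward the "AI-like" end regardless of who wrote it.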
What the Score Actually Means
A score of "85% AI-generated" does not mean that 85% of the text was generated by AI. It means the detector rates the text's statistical properties as closer to the AI-written examples it was trained on than to the human-written ones, with 85% expressing how strong that resemblance is. Two texts can receive the same score for completely different reasons: one because it was actually AI-written, the other because its human author writes in a clear, structured style.
When to Trust the Score
- High confidence (above 90%): On informal, conversational text (forum posts, casual emails, personal narratives), a score above 90% is a meaningful signal. Human writing in these registers is highly variable, so it is unlikely to match the AI pattern by accident.
- Low confidence (50–80%): The score is ambiguous. Do not use it as evidence in an academic integrity case.
- Academic or technical prose: Treat any score under 95% as noise. The false positive risk is too high.
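
The guidance above can be written down as a small lookup, which makes the gaps in it explicit. This is a minimal sketch: the function name `interpret_score`, the register labels `"informal"` and `"academic"`, and the returned strings are all hypothetical, and only the thresholds come from the list. Cases the list does not address are reported as such rather than guessed.

```python
def interpret_score(score: float, register: str) -> str:
    """Map a detector score (0-100) and a text register to a reading.

    `register` is "informal" or "academic" (hypothetical labels).
    Thresholds follow the guidance in the list above.
    """
    if not 0.0 <= score <= 100.0:
        raise ValueError("score must be between 0 and 100")
    if register not in ("informal", "academic"):
        raise ValueError("register must be 'informal' or 'academic'")

    if register == "academic":
        # Academic or technical prose: anything under 95% is noise.
        return "noise" if score < 95.0 else "not covered"

    # Informal, conversational text.
    if score > 90.0:
        return "meaningful signal"
    if 50.0 <= score <= 80.0:
        return "ambiguous"
    # The guidance says nothing about 80-90% or below 50%.
    return "not covered"
```

For example, `interpret_score(94, "academic")` returns `"noise"`, while the same score on an informal forum post returns `"meaningful signal"`. Writing the rules this way shows why the register has to be known before the number can be read.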
Limitations
The detector is blind to paraphrased AI content (AI output rewritten by a human), mixed authorship (human outline + AI expansion + human edit), and content generated by models released after the detector's training cutoff. It is a probabilistic screen, not forensic evidence.
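
One way to see why a flag works as a screen rather than as evidence is to apply Bayes' rule to the false positive rates quoted earlier. The sketch below is illustrative only: the function name `posterior_ai_given_flag` is hypothetical, and the 10% share of AI-written submissions and the 90% detection rate are assumptions chosen for the example, not measured values. Only the 9% and 16% false positive rates come from the text.

```python
def posterior_ai_given_flag(
    prior_ai: float, detection_rate: float, false_positive_rate: float
) -> float:
    """P(text is AI-written | detector flagged it), by Bayes' rule.

    prior_ai: assumed share of submissions that are AI-written.
    detection_rate: assumed share of AI-written texts that get flagged.
    false_positive_rate: share of human-written texts that get flagged.
    """
    for name, value in (
        ("prior_ai", prior_ai),
        ("detection_rate", detection_rate),
        ("false_positive_rate", false_positive_rate),
    ):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1")
    flagged_ai = detection_rate * prior_ai
    flagged_human = false_positive_rate * (1.0 - prior_ai)
    if flagged_ai + flagged_human == 0.0:
        raise ValueError("no text would ever be flagged with these inputs")
    return flagged_ai / (flagged_ai + flagged_human)


if __name__ == "__main__":
    # Assumed inputs: 10% of submissions AI-written, 90% detection rate.
    for fpr in (0.09, 0.16):
        p = posterior_ai_given_flag(0.10, 0.90, fpr)
        print(f"false positive rate {fpr:.0%}: P(AI | flagged) = {p:.1%}")
```

Under these assumed inputs, a flagged text is AI-written with a probability of roughly 53% at a 9% false positive rate and roughly 38% at 16%. The exact figures move with the assumptions, but the pattern holds whenever most submissions are human-written: a large share of flags land on human authors.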
