AI detection tools systematically disadvantage non-native English writers, whose work is falsely flagged at rates up to twelve times higher than that of native English speakers. The foundational study by Liang et al. (2023) demonstrated that seven widely used detectors misclassified, on average, 61.3% of non-native English essays as AI-generated, compared with roughly 5.1% for native English writing.
Subsequent research through 2025 has broadly confirmed this structural bias, though one contradictory study using custom-built (non-commercial) detectors found it could be mitigated with representative training data. The bias stems from an overlap between the linguistic features of second-language English writing — lower perplexity, limited vocabulary, reduced syntactic variety — and the statistical signatures that detectors associate with AI-generated text.
In the UK, where international students comprise a substantial and financially critical segment of the higher education population, this bias creates acute equity risks that existing regulatory frameworks have largely failed to address.
The Office of the Independent Adjudicator (OIA) published its first casework guidance on AI and academic misconduct in July 2025, explicitly warning that detection tools may be biased against non-native English speakers and students with disabilities.
Yet the Office for Students (OfS) has issued no specific guidance on AI detection tools, their limitations, or their disproportionate impact on international students. This represents the most significant gap in the current UK regulatory framework. Australia's TEQSA and the EU AI Act both offer more structured approaches, with the latter classifying AI systems used in educational assessment as "high-risk" with mandatory bias testing requirements by August 2026.
| Tool | Vendor-Claimed FPR | Independent FPR (General) | FPR for Non-Native English Writers | Source |
|---|---|---|---|---|
| Turnitin | <1% (≥300 words, ≥20% AI) | 2–7% (Temple, Washington Post) | 0.014 vs 0.013 (vendor claim; no independent confirmation) | Turnitin blog; Temple University evaluation |
| Originality.ai | <1% | Variable (76–99% accuracy range) | ~8.3% EFL vs 0% native (borderline p=0.0586) | Pratama (2025); Scribbr (2024) |
| GPTZero | <1% | 10–20% in some studies; 80% accuracy (PubMed) | Claims 1.1% on TOEFL essays (self-reported) | Various; GPTZero benchmark |
| Copyleaks | 0.2% | ~5% (GPTZero benchmark); widely variable | 100% accuracy on L1+L2 in one study (JALT 2024) | Copyleaks; JALT study |
| Winston AI | Not specified | 75–86.5% accuracy | 35% higher FPR for non-English content | HumanizeAI review |
| Across 7 detectors | — | — | 61.3% average (TOEFL essays) | Liang et al. (2023) |
The disproportionate false positive rate for non-native English writers may constitute indirect discrimination under Section 19 of the Equality Act 2010. International students face compounding consequences that native English-speaking students do not: visa jeopardy from academic misconduct findings, inability to transfer institutions easily, scholarship revocation, and the psychological burden of accusation in an unfamiliar legal and cultural system.
No UK university has published an Equality Impact Assessment for its deployment of AI detection tools, despite the Public Sector Equality Duty requiring such assessments for policies that affect protected characteristics including race and national origin.
The post-2023 literature builds a consistent picture of AI detection tools performing inequitably across language proficiency levels. Liang et al. (2023), published in Patterns (Cell Press), remains the foundational study.
Testing seven detectors on 91 TOEFL essays (non-native) and 88 Hewlett Foundation essays (native), the Stanford team found that 89 of the 91 TOEFL essays (about 97.8%) were flagged by at least one detector, with 18 (about 19.8%) unanimously misclassified by all seven.
The mechanism is straightforward: non-native writers use more predictable vocabulary and simpler sentence structures, producing text with lower perplexity — the same statistical property detectors use to identify AI output.
When the researchers used ChatGPT to "enhance word choices to sound more like a native speaker," misclassification rates dropped significantly. When native essays were deliberately simplified, their misclassification as AI-generated increased.
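To make the perplexity mechanism concrete, the following is a minimal Python sketch, not the method of any detector discussed here. Commercial detectors score text with large neural language models; this toy substitutes an add-one-smoothed unigram model, and the function name `unigram_perplexity` and the sample sentences are illustrative assumptions. It shows only the direction of the effect: text built from words the model has seen often scores lower perplexity than text built from rarer words.

```python
import math
from collections import Counter


def unigram_perplexity(text: str, reference: str) -> float:
    """Perplexity of `text` under an add-one-smoothed unigram model of `reference`.

    Toy stand-in (an assumption): real detectors use neural language models.
    """
    ref_tokens = reference.lower().split()
    counts = Counter(ref_tokens)
    vocab_size = len(counts) + 1  # one extra slot shared by all unseen words
    total = len(ref_tokens)

    tokens = text.lower().split()
    if not tokens:
        raise ValueError("text must contain at least one token")

    # Average negative log-probability per token, then exponentiate.
    log_prob = sum(
        math.log((counts[tok] + 1) / (total + vocab_size)) for tok in tokens
    )
    return math.exp(-log_prob / len(tokens))


# Hypothetical reference text standing in for a model's training data.
reference = "the cat sat on the mat the dog sat on the rug"
common_words = "the cat sat on the mat"
rare_words = "felines repose upon woven textiles"

print(unigram_perplexity(common_words, reference))  # lower: frequent words
print(unigram_perplexity(rare_words, reference))    # higher: unseen words
```

A detector that treats low perplexity as a sign of machine authorship would, on this toy measure, be more suspicious of the first sentence than the second, even though both are human-written.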
Pratama (2025), published in PeerJ Computer Science, extended this work by examining 108 scholarly abstracts stratified by discipline and native/non-native authorship, testing GPTZero, ZeroGPT, and DetectGPT. The study found "notable accuracy-bias trade-offs disproportionately affecting non-native speakers and certain disciplines,"
with GPTZero achieving the highest accuracy (98.15%) but still exhibiting bias. A separate analysis of Originality.ai's Lite model found 99.07% accuracy on non-native texts versus 100% on native texts — a small but borderline-significant gap (Fisher's p=0.0586).
ZeroGPT (64.35% accuracy) and DetectGPT (54.63% accuracy) performed poorly on texts generated by newer LLMs.
Perkins et al. (2024), published in the International Journal of Educational Technology in Higher Education, tested six detectors against 805 samples including adversarially modified text. Their "write as NNES with IELTS Band Level 6" technique successfully evaded detection,
providing experimental confirmation that NNES-like writing patterns reduce detector sensitivity. The study concluded that detectors "cannot currently be recommended for determining academic integrity violations due to accuracy limitations and the potential for false accusation."
Weber-Wulff et al. (2023), in the International Journal for Educational Integrity, conducted the most comprehensive early comparison of 14 tools. All scored below 80% accuracy,
with only five exceeding 70%.
While this study did not specifically stratify by native/non-native authorship, it established that detectors exhibited a systematic bias toward classifying text as human-written
(high false negatives) and were easily defeated by paraphrasing or machine translation.
The principal contradictory finding comes from Jiang et al. (2024), published in Computers & Education by researchers at ETS (the GRE administrator). Using approximately 10,000 GRE essays and custom-built detectors incorporating e-rater linguistic features and perplexity measures, they found no evidence of bias against non-native English speakers. This result is significant but must be interpreted carefully: the detectors were purpose-built for the study with representative training data, not commercial tools available to universities. The finding suggests that bias is a function of training data composition and tool design rather than an inherent limitation of detection methodology
— but this distinction is immaterial for institutions deploying commercial tools that have not demonstrated equivalent debiasing.
Turnitin dominates UK higher education, serving the majority of institutions and processing over 200 million papers globally through its AI detector.
The company claims a false positive rate below 1%
for documents of 300 words or more where at least 20% of content is AI-generated.
Turnitin conducted its own ELL bias evaluation using approximately 9,000 ELL documents (including samples from the ICNALE and PELIC corpora) and reported FPR of 0.86% for L2 writers versus 0.87% for L1 writers — essentially identical.
However, this evaluation has not been published in a peer-reviewed journal and, as a vendor self-study, carries a clear conflict of interest. Independent evaluations paint a different picture: Temple University found 77% accuracy with a 7% mis-flag rate for genuine human writing; the Washington Post found roughly 50% false positives in a small sample; and Weber-Wulff et al. placed Turnitin below 80% accuracy.
Turnitin has made substantive technical changes since 2023
— raising its minimum word count from 150 to 300,
suppressing scores in the 1–19% range (displayed as an asterisk),
and adding an AI paraphrasing detection model in July 2024. Several prominent institutions have disabled Turnitin's AI detector, including Vanderbilt University, King's College London, Ulster University, and the University of Nottingham, the last of which found "little correlation between human detection of AI and the tool's detection."
Originality.ai offers three models — Lite, Turbo, and Academic — with claimed accuracy of 98–99%. Pratama (2025) found Originality Lite achieved the highest overall accuracy among tested tools at 98.61%, with a small but borderline-significant gap for non-native writers.
A Scribbr evaluation in 2024, however, found only 76% overall accuracy,
substantially below vendor claims. Originality.ai was the only tool to catch AI paraphrasing more than 50% of the time,
suggesting it may be better suited for detecting sophisticated AI use. Its market presence in UK higher education remains limited compared to Turnitin.
GPTZero claims to be "the only AI detector de-biased for ESL learners," reporting a 1.1% false positive rate on TOEFL essays
following deliberate de-biasing efforts including tagged educational data, representative ESL datasets, and text pre-classification. These claims have not been independently verified in peer-reviewed literature. A PubMed study testing GPTZero on medical texts found only 80% accuracy with a 10% false positive rate and 35% false negative rate.
Pratama (2025) found 98.15% accuracy. GPTZero has raised $10 million in Series A funding
and is growing as a secondary tool in UK academia
but lacks Turnitin's institutional integration.
Copyleaks showed the strongest bias-mitigation result in one study: a JALT 2024 evaluation found it achieved 100% accuracy with zero false positives across both L1 and L2 datasets — one of only two tools (alongside Undetectable AI) to do so. However, GPTZero's own benchmark placed Copyleaks at 90.7% accuracy with approximately 5% false positive rate,
and other evaluations show widely variable results. Copyleaks bundles AI detection without additional charges,
making it an attractive Turnitin alternative for budget-conscious institutions.
Winston AI presents the most concerning profile for equity purposes. Despite claiming 99.98% accuracy, independent testing finds 75–86.5% accuracy, and one evaluation documented a 35% higher false positive rate for non-English content.
No published de-biasing efforts exist. Its minimal presence in UK higher education limits its immediate policy relevance.
A critical cross-cutting finding is that all vendors claim accuracy rates of 98–99.98%, yet independent testing consistently finds accuracy in the 75–93% range depending on context. The US Federal Trade Commission underscored this gap in 2025 by settling with Workado (Content at Scale/BrandWell) for falsely claiming 98% accuracy when independent testing showed only 53%.
The systematic overlap between non-native English writing and AI-generated text centres on six measurable linguistic properties. Low text perplexity is the primary mechanism: AI detectors flag statistically "predictable" text,
and non-native writers rely on high-frequency vocabulary and common constructions that produce low perplexity scores.
Liang et al. demonstrated this causally — essays unanimously misclassified exhibited significantly lower perplexity than those correctly classified,
and artificially enhancing vocabulary reduced misclassification while artificially simplifying it increased misclassification.
Limited lexical diversity compounds this effect, as L2 writers draw from a smaller active vocabulary that mirrors the "average" language patterns of LLM training data.
Simpler sentence structures and reduced syntactic variety produce more uniform text that lacks the characteristic "burstiness" of native human writing — the natural alternation between short and long, simple and complex sentences.
Formulaic language, including collocations, set phrases, and academic register conventions that L2 writers rely on as communication strategies, further mimics AI output patterns. Finally, predictable word sequences arising from limited grammatical range produce exactly the low next-word-prediction surprisal that perplexity-based detectors are designed to flag.
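Two of these properties can be approximated with a few lines of Python. This is a rough sketch under stated assumptions: `burstiness` here is the coefficient of variation of sentence length, and `type_token_ratio` is distinct words over total words. Both are common simple proxies rather than features any named detector is known to use, and the sample texts are invented for illustration.

```python
import re
import statistics
from typing import List


def sentence_lengths(text: str) -> List[int]:
    """Word count of each sentence, splitting on ., ! and ?."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return [len(s.split()) for s in sentences]


def burstiness(text: str) -> float:
    """Coefficient of variation of sentence length (0.0 means perfectly uniform)."""
    lengths = sentence_lengths(text)
    if len(lengths) < 2:
        return 0.0
    return statistics.pstdev(lengths) / statistics.mean(lengths)


def type_token_ratio(text: str) -> float:
    """Distinct words divided by total words; lower means more repetition."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return len(set(tokens)) / len(tokens) if tokens else 0.0


# Invented samples: evenly sized simple sentences versus mixed lengths.
uniform = "The test was hard. The room was cold. The day was long."
varied = "Stop. The examination, which nobody had expected, ran for three exhausting hours."

print(burstiness(uniform), type_token_ratio(uniform))
print(burstiness(varied), type_token_ratio(varied))
```

The uniform sample scores zero on this burstiness proxy and repeats more of its vocabulary, which is the kind of profile the passage above describes as overlapping with AI output.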
As Stanford's James Zou explained: "If you use common English words, the detectors will give a low perplexity score, meaning my essay is likely to be flagged as AI-generated. If you use complex and fancier words, then it's more likely to be classified as human written."
This creates a structurally discriminatory dynamic where the detector rewards linguistic privilege.
The most significant UK evidence emerged in July 2025 when the Office of the Independent Adjudicator (OIA) published six case summaries involving AI detection and academic misconduct.
In one case, an international student received a zero mark after Turnitin flagged "substantial amounts" of AI-generated content; the student had used Grammarly, believing that as a non-native speaker they were permitted to do so.
The OIA upheld the complaint, finding the university had not provided a fair opportunity to respond.
In another case, an international student who used Google to find synonyms was found guilty of misconduct based on Turnitin flags. The OIA ruled that the university "did not consider whether Turnitin's AI detection might be less reliable for non-native English speakers, which was relevant given the student's international status."
A third case involved an autistic student given a zero mark whose writing style was flagged as AI-generated; when the university reconsidered, it found no misconduct had occurred.
These cases establish an OIA precedent that universities must consider detection tool limitations for vulnerable populations.
Internationally, litigation is developing rapidly. In February 2025, a French national enrolled in Yale's Executive MBA programme sued the university after being suspended for one year based on GPTZero flags, alleging national origin discrimination and coercion; a dean allegedly referenced visa revocation during the investigation.
In January 2026, a New York court ruled in Newby v. Adelphi University that the institution's process was "arbitrary and capricious" after an autistic student's paper was scored 100% AI-generated by Turnitin while two other detectors classified it as human-written. The court ordered the university to expunge the violation and rescind all sanctions.
At UC Davis, history senior William Quarterman experienced "full-blown panic attacks" after GPTZero flagged his take-home exam; he was cleared after demonstrating the tool also falsely flagged Martin Luther King Jr.'s "I Have a Dream" speech.
The Markup's 2023 investigation documented a pattern at Johns Hopkins University where instructor Taylor Hahn noticed Turnitin systematically flagging international students' writing. In one case, a student immediately produced drafts and highlighted PDFs proving authentic authorship; in another, Hahn had personally worked with the student through the drafting process, only for the submitted paper to be flagged.
The University of Bristol launched its "AIvsAI" research project specifically to investigate how Turnitin has led to "numerous false accusations" consuming "extraordinary time and resources to investigate and, in often cases, dismiss."
The psychological and academic consequences are severe and compounding. Students report panic attacks, insomnia, declining grades during lengthy investigation periods, and permanent damage to their academic records
— even when cleared, investigation records must often be self-reported to professional bodies. International students face the additional threat of visa revocation, scholarship loss, and reputational damage in contexts where transferring institutions is impractical.
Education consultant Lucie Vágnerová reports that accused students frequently require counselling, with misconduct processes "often tak[ing] at least several weeks, if not months... really deeply affecting their mental health."
Scale estimates suggest the problem is substantial. A Guardian investigation (June 2025) found approximately 7,000 proven cases of AI-assisted cheating recorded across UK universities in 2023–24, equalling 5.1 cases per 1,000 students.
However, there is no systematic data on how many investigations resulted from false positives. At a university processing 75,000 human-written papers annually, even a conservative 2% false positive rate would generate 1,500 wrongful flags, each requiring investigation and causing student distress.
The true false positive rate for international students is almost certainly higher than the institutional average.
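The arithmetic behind that estimate is simple enough to check directly. The sketch below is illustrative Python, and the function name `expected_false_flags` is an assumption. Like the estimate above, it treats every submitted paper as human-written, so the false positive rate applies to the whole volume. The 1% and 7% rows reuse the vendor-claimed ceiling and the Temple University figure cited earlier, purely for comparison.

```python
def expected_false_flags(papers: int, false_positive_rate: float) -> int:
    """Expected number of human-written papers wrongly flagged as AI-generated."""
    if not 0.0 <= false_positive_rate <= 1.0:
        raise ValueError("false_positive_rate must be between 0 and 1")
    return round(papers * false_positive_rate)


papers_per_year = 75_000  # volume used in the example in the text

for rate in (0.01, 0.02, 0.07):
    flags = expected_false_flags(papers_per_year, rate)
    print(f"FPR {rate:.0%}: about {flags:,} wrongful flags per year")
```

Even at the vendor-claimed ceiling of 1%, the same volume would yield hundreds of wrongful flags a year, which is why the choice of threshold and follow-up process matters as much as the headline rate.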
UK universities should adopt a precautionary approach to AI detection tool procurement that reflects the evidence of systematic bias against non-native English writers. No AI detection tool should be procured or deployed without a formal Equality Impact Assessment that specifically evaluates false positive rates across language proficiency levels, disability profiles, and other protected characteristics. This is not merely good practice — it is arguably required under the Public Sector Equality Duty (Section 149, Equality Act 2010) for institutions that are public authorities or perform public functions.
Procurement specifications should require vendors to provide independently verified accuracy data disaggregated by writer demographics, including native/non-native English status, specific L1 backgrounds, and English proficiency levels. Vendors should be required to disclose training data composition, technical methodology, and the results of any internal bias testing. The absence of such data should be treated as a disqualifying factor. Contracts should include performance monitoring clauses requiring ongoing bias auditing and the right to suspend or terminate if independent testing reveals disproportionate impact.
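As an illustration of the disaggregated reporting such a specification might require, here is a minimal Python sketch. It assumes a hypothetical audit set of texts known to be human-written, each tagged with a writer group and whether the detector flagged it. The function names and the sample counts are invented, not drawn from any vendor or study.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def false_positive_rates(records: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Share of known-human texts flagged as AI, computed per writer group."""
    flagged: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for group, was_flagged in records:
        total[group] += 1
        flagged[group] += int(was_flagged)
    return {group: flagged[group] / total[group] for group in total}


def disparity_ratio(rates: Dict[str, float], group: str, baseline: str) -> float:
    """How many times higher the group's false positive rate is than the baseline's."""
    if rates[baseline] == 0:
        # Undefined when both are zero; unbounded when only the baseline is zero.
        return math.inf if rates[group] > 0 else math.nan
    return rates[group] / rates[baseline]


# Synthetic audit data (hypothetical): every text here is human-written.
records = (
    [("non_native", True)] * 6 + [("non_native", False)] * 4
    + [("native", True)] * 1 + [("native", False)] * 19
)

rates = false_positive_rates(records)
print(rates)
print(disparity_ratio(rates, "non_native", "native"))
```

With real audit data, small group sizes make these rates noisy, so a credible report would also need confidence intervals or a significance test alongside the raw ratio.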
Institutions should follow JISC's guidance that AI detection results constitute "only preliminary guidance" and can never serve as proof. Detection scores should be excluded from initial misconduct panels to prevent anchoring bias,
following the recommendation from Newcastle University's Dr David Grundy that AI flags constitute "fruit of the poisoned tree" in evidence terms.
Where detection tools are used, a minimum of two independent tools should be required before any investigation proceeds, and discordant results (as in the Adelphi case, where one tool scored 100% AI and two scored human) should automatically terminate the inquiry.
Turnitin's decision to suppress scores in the 1–19% range (displaying an asterisk instead) represents a partial acknowledgement of unreliability at lower confidence levels. Universities should consider raising their institutional threshold significantly higher — a score below 40–50% should not trigger any investigative action given the documented false positive rates.
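One way to read the two-tool and threshold recommendations together is as a simple gating rule. The following Python sketch is a hypothetical encoding, not an established procedure: the function name `triage`, the outcome labels, and the default threshold of 0.5 (the upper end of the 40–50% range suggested above) are all assumptions. Even the `human_review` outcome means only that a person looks at the work; it is not a finding.

```python
from typing import Mapping


def triage(scores: Mapping[str, float], threshold: float = 0.5) -> str:
    """Return "human_review" only when at least two tools all meet the threshold."""
    if len(scores) < 2:
        return "no_action"  # a single tool is never sufficient
    if all(score >= threshold for score in scores.values()):
        return "human_review"
    return "no_action"  # discordant or low scores end the inquiry


# Pattern like the Adelphi case: one tool at 100%, two others calling it human.
# The 0.05 and 0.02 values are placeholders; the source does not give them.
print(triage({"tool_a": 1.00, "tool_b": 0.05, "tool_c": 0.02}))  # no_action
print(triage({"tool_a": 0.92}))                                   # no_action
print(triage({"tool_a": 0.92, "tool_b": 0.81}))                   # human_review
```

Requiring agreement rather than a majority is a deliberately conservative choice here: it trades missed detections for fewer wrongful investigations, consistent with the precautionary stance recommended above.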
The evidence strongly supports a strategic shift from detection to assessment redesign as the primary response to generative AI. The QAA has described this as "a generational incentive for providers to require their programme and module teams to review and, where necessary, reimagine assessment strategies." Several approaches show particular promise:
However, assessment redesign is not a complete solution. Kofinas (2025), writing in the British Journal of Educational Technology, warns that even authentic assessments are "neither a shield for academic integrity nor an immediate solution" — generative AI can engage with real-world tasks, case studies, and reflective exercises. Dawson and Liu (2025) argue that discursive frameworks "remain powerless to prevent AI use when they rely solely on student compliance."
A mixed approach combining structural assessment reform with proportionate, equity-conscious use of detection as one data point among many is likely the most defensible position.
The OIA's July 2025 casework guidance establishes important principles that should be formalised into institutional policy.
The burden of proof must rest with the institution, not the student.
AI detection scores are probabilistic outputs from opaque systems, not evidence of misconduct, and must be treated as such.
Students should be informed of any AI detection score attributed to their work, provided with a plain-language explanation of the tool's limitations (including known bias against non-native English writers), and given adequate time and support to respond.
Universities should implement mandatory bias awareness training for all staff involved in academic misconduct investigations, covering the documented false positive disparities for non-native English writers, students with disabilities, and writers with distinctive stylistic profiles.
Investigation panels should be required to consider whether a student's language background, disability status, or writing style could explain an elevated AI detection score before proceeding.
Specific protections for international students should include: access to specialist advisors who understand both the technical limitations of AI detection and the immigration consequences of misconduct findings; extended response periods that account for the additional stress of proceedings conducted in a second language; and an absolute prohibition on any reference to visa status during misconduct proceedings (as allegedly occurred in the Yale case).
A right of appeal on technical grounds should be guaranteed, including the right to submit the same work to alternative detection tools and to present expert evidence on detector limitations. Given the Adelphi precedent, institutions should recognise that contradictory results from different tools fundamentally undermine any single tool's evidentiary weight.
Finally, the OfS should issue sector-wide guidance comparable to TEQSA's approach in Australia, requiring every institution to submit an action plan addressing AI detection bias risks, publish its AI detection policy with disaggregated accuracy data, and report annually on misconduct outcomes by student demographic characteristics to enable monitoring for disparate impact. The UK's current approach — advisory guidance from QAA and JISC with no regulatory enforcement — is inadequate given the scale of the equity risk and leaves the UK significantly behind both Australia and the EU in protecting students from algorithmic discrimination in educational assessment.
| Citation | DOI/URL | Key Contribution | Notes |
|---|---|---|---|
| Liang, Yuksekgonul, Mao, Wu & Zou (2023). "GPT detectors are biased against non-native English writers." Patterns, 4(7), 100779 | 10.1016/j.patter.2023.100779 | Foundational study: 61.3% FPR for non-native writing across 7 detectors | Limitations: small sample (91 TOEFL essays); tested 2023-era tools |
| Weber-Wulff et al. (2023). "Testing of detection tools for AI-generated text." Int'l Journal for Educational Integrity, 19, 26 | 10.1007/s40979-023-00146-z | All 14 tools <80% accuracy; established baseline unreliability | Did not stratify by native/non-native status |
| Pratama (2025). "The accuracy-bias trade-offs in AI text detection tools." PeerJ Computer Science, 11, e2953 | 10.7717/peerj-cs.2953 | First study on AI-assisted text fairness; non-native bias confirmed | Tests GPTZero, ZeroGPT, DetectGPT on scholarly abstracts |
| Jiang, Hao, Fauss & Li (2024). "Detecting ChatGPT-generated essays in a large-scale writing assessment." Computers & Education, 217, 105070 | 10.1016/j.compedu.2024.105070 | CONTRADICTORY: No bias found with custom-built detectors and representative training data | Used custom (not commercial) detectors; does not refute bias in tools available to universities |
| Perkins et al. (2024). "Simple techniques to bypass GenAI text detectors." Int'l J. of Educational Technology in Higher Ed., 21, 53 | 10.1186/s41239-024-00487-w | Confirmed NNES bias mechanism; 17.4% accuracy reduction with adversarial techniques | Recommends against using detectors for misconduct determinations |
| Giray (2024). "The Problem with False Positives." The Serials Librarian, 85(5–6), 181–189 | 10.1080/0361526X.2024.2433256 | Qualitative documentation of disproportionate impact on non-native scholars | No quantitative false positive data |
| Walters (2023). "The effectiveness of AI content detection tools." Open Information Science | 10.1515/opis-2022-0158 | 16-tool comparison; top 3: Copyleaks, Turnitin, Originality.ai | Did not specifically test native/non-native bias |
| Kofinas (2025). Authentic assessment and AI. British Journal of Educational Technology | 10.1111/bjet.13585 | Warns authentic assessments are not immune to GenAI | Important caveat for assessment redesign strategies |
| Source | URL | Relevance |
|---|---|---|
| OIA casework note: AI and academic misconduct (July 2025) | oiahe.org.uk — AI and academic misconduct casework note | First UK ombudsman guidance; establishes bias consideration requirement |
| Russell Group: 5 Principles on GenAI in Education (July 2023) | russellgroup.ac.uk — principles on generative AI tools | Advisory framework for 24 research-intensive universities |
| QAA: Generative AI guidance and resources | qaa.ac.uk — sector resources on generative artificial intelligence | Explicit caution against AI detection tools |
| JISC: AI Detection and Assessment update (2025) | jisc.ac.uk — innovation: artificial intelligence | Most detailed UK technical guidance; "cannot prove conclusively" |
| JISC: FE College AI Principles | jisc.ac.uk — further education and skills AI principles | Explicit warning on discrimination risk |
| UK Data and AI Ethics Framework | gov.uk — data ethics framework | Requires Equality Impact Assessments for AI procurement |
| ICO: AI fairness and discrimination guidance | ico.org.uk — AI and data protection guidance on fairness | Legal requirements for bias testing under GDPR |
| CDEI: Review into Bias in Algorithmic Decision-Making (2020) | Published via gov.uk assets | Recommendation 12: EHRC capacity for algorithmic discrimination |
| TEQSA: Enacting Assessment Reform in a Time of AI (Sep 2025) | teqsa.gov.au — assessment reform guidance | Australia's mandatory sector-wide approach; comparator model |
| EU AI Act (entered into force August 2024) | artificialintelligenceact.eu | Classifies educational assessment AI as high-risk; compliance by August 2026 |
| POST Briefing PN-0712: AI in education (January 2024) | UK Parliament research briefings | References bias concerns in educational AI |
| OfS blog: Approach to AI (2025) | officeforstudents.org.uk — blog on AI approach | Confirms no specific detection guidance; "principles-based" approach |
| Source | URL | Relevance | Notes |
|---|---|---|---|
| Times Higher Education: Students win AI plagiarism appeals (Jul 2025) | timeshighereducation.com | OIA case summaries coverage | Primary UK case study source |
| The Markup: AI detection tools falsely accuse international students (Aug 2023) | themarkup.org — machine learning investigation | Johns Hopkins pattern documentation | Investigative journalism; high reliability |
| Yale Daily News: SOM student sues Yale (Feb 2025) | yaledailynews.com | French national discrimination lawsuit | US case; pending |
| Inside Higher Ed: Adelphi student wins AI plagiarism lawsuit (Feb 2026) | insidehighered.com | Court rules process "arbitrary and capricious" | Landmark US ruling |
| Rolling Stone: Student accused via Turnitin (2023) | rollingstone.com | Louise Stivers / UC Davis case | Psychological impact documentation |
| The Guardian: UK university AI cheating investigation (Jun 2025) | theguardian.com | ~7,000 proven AI cases in 2023–24 across UK | FOI-based investigation |
| Newcastle University blog: Grundy (2025) on AI-flagged misconduct | blogs.ncl.ac.uk | Detailed procedural fairness analysis | Academic blog; strong analytical framework |
| 5SAH Barristers: AI in universities — legal analysis (May 2025) | 5sah.co.uk | No statutory framework for AI in academic misconduct | UK legal analysis |
| Crowell & Moring: Ivy League AI lawsuit analysis | crowell.com | Yale case legal implications | US law firm analysis |
| Bristol University: AIvsAI research project | research-information.bris.ac.uk | Investigation of Turnitin false accusations | Ongoing UK research |
| Source | Finding | Contradicts |
|---|---|---|
| Jiang et al. (2024), Computers & Education | No bias in custom-built detectors with representative training data | Liang et al. (2023) finding of systematic bias — but tests different (non-commercial) tools |
| Turnitin ELL evaluation (2023–24, self-published) | FPR of 0.86% (L2) vs 0.87% (L1) — no statistically significant bias | Independent studies showing 2–7% FPR; conflict of interest (vendor self-study) |
| Copyleaks evaluation (Aug 2024, self-published) | 99.84% accuracy across non-native datasets (12/7,482 misclassified) | Weber-Wulff et al. finding of <80% accuracy for all tools; industry-sponsored |
| GPTZero ESL de-biasing claims | 1.1% FPR on TOEFL essays | Not independently verified; earlier PubMed study found 10% FPR |
The following areas lack adequate evidence for definitive conclusions: (1) no large-scale studies examine specific L1 backgrounds (e.g., Chinese, Arabic, Japanese) independently — the field treats "non-native English" as monolithic; (2) no UK-specific data exists on the number of students falsely accused through AI detection tools; (3) neither the ICLE nor BAWE corpus has been used in published AI detection bias research, despite their obvious relevance; (4) no Equality Impact Assessment for AI detection tool deployment has been published by any UK university; (5) vendor accuracy claims for tools updated in 2024–2025 have not been subjected to independent peer-reviewed evaluation; and (6) the rapidly evolving nature of both AI generation and detection means that point-in-time evaluations become outdated within months.