Content is user-generated and unverified.

AI Detection Tool Bias Against Non-Native English Writers in UK Higher Education

1. Executive Summary

AI detection tools systematically disadvantage non-native English writers, generating false positive rates up to twelve times higher for international students than for native English speakers. The foundational study by Liang et al. (2023) demonstrated that seven widely used detectors misclassified 61.3% of non-native English essays as AI-generated, compared with roughly 5.1% for native English writing. Subsequent research through 2025 has broadly confirmed this structural bias, though one contradictory study using custom-built (non-commercial) detectors found it could be mitigated with representative training data. The bias stems from an overlap between the linguistic features of second-language English writing — lower perplexity, limited vocabulary, reduced syntactic variety — and the statistical signatures that detectors associate with AI-generated text. In the UK, where international students comprise a substantial and financially critical segment of the higher education population, this bias creates acute equity risks that existing regulatory frameworks have largely failed to address.

The Office of the Independent Adjudicator (OIA) published its first casework guidance on AI and academic misconduct in July 2025, explicitly warning that detection tools may be biased against non-native English speakers and students with disabilities. Yet the Office for Students (OfS) has issued no specific guidance on AI detection tools, their limitations, or their disproportionate impact on international students. This represents the most significant gap in the current UK regulatory framework. Australia's TEQSA and the EU AI Act both offer more structured approaches, with the latter classifying AI systems used in educational assessment as "high-risk" with mandatory bias testing requirements by August 2026.

False positive rates by tool and population

Tool	Vendor-Claimed FPR	Independent FPR or Accuracy (General)	FPR for Non-Native English Writers	Source
Turnitin	<1% (≥300 words, ≥20% AI)	2–7% (Temple, Washington Post)	0.014 vs 0.013 (vendor claim; no independent confirmation)	Turnitin blog; Temple University evaluation
Originality.ai	<1%	Variable (76–99% accuracy range)	~8.3% EFL vs 0% native (borderline p=0.0586)	Pratama (2025); Scribbr (2024)
GPTZero	<1%	10–20% in some studies; 80% accuracy (PubMed-indexed study)	Claims 1.1% on TOEFL essays (self-reported)	Various; GPTZero benchmark
Copyleaks	0.2%	~5% (GPTZero benchmark); widely variable	100% accuracy on L1+L2 in one study (JALT 2024)	Copyleaks; JALT study
Winston AI	Not specified	75–86.5% accuracy	35% higher FPR for non-English content	HumanizeAI review
Across 7 detectors	—	—	61.3% average (TOEFL essays)	Liang et al. (2023)

Core equity implications

The disproportionate false positive rate for non-native English writers may amount to indirect discrimination under Section 19 of the Equality Act 2010. International students face compounding consequences that native English-speaking students do not: visa jeopardy from academic misconduct findings, inability to transfer institutions easily, scholarship revocation, and the psychological burden of accusation in an unfamiliar legal and cultural system. No UK university has published an Equality Impact Assessment for its deployment of AI detection tools, even though the Public Sector Equality Duty arguably requires such assessments for policies that affect protected characteristics, including race, which covers national origin.

2. Detailed Analysis

2.1 The evidence base confirms systematic bias, with one important exception

The post-2023 literature builds a consistent picture of AI detection tools performing inequitably across language proficiency levels. Liang et al. (2023), published in Patterns (Cell Press), remains the foundational study. Testing seven detectors on 91 TOEFL essays (non-native) and 88 Hewlett Foundation essays (native), the Stanford team found that 89 of the 91 TOEFL essays (97.8%) were flagged by at least one detector, with 18 (19.8%) unanimously misclassified by all seven. The mechanism is straightforward: non-native writers use more predictable vocabulary and simpler sentence structures, producing text with lower perplexity, which is the same statistical property detectors use to identify AI output. When the researchers used ChatGPT to "enhance word choices to sound more like a native speaker," misclassification rates dropped significantly. When native essays were deliberately simplified, their misclassification as AI-generated increased.

Pratama (2025), published in PeerJ Computer Science, extended this work by examining 108 scholarly abstracts stratified by discipline and native/non-native authorship, testing GPTZero, ZeroGPT, and DetectGPT. The study found "notable accuracy-bias trade-offs disproportionately affecting non-native speakers and certain disciplines," with GPTZero achieving the highest accuracy (98.15%) but still exhibiting bias. A separate analysis of Originality.ai's Lite model found 99.07% accuracy on non-native texts versus 100% on native texts — a small but borderline-significant gap (Fisher's p=0.0586). ZeroGPT (64.35% accuracy) and DetectGPT (54.63% accuracy) performed poorly on texts generated by newer LLMs.
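To make a borderline result such as p=0.0586 more concrete, the following is a minimal sketch of a two-sided Fisher's exact test on a 2x2 table of misclassified versus correctly classified texts. It is not Pratama's code: the function name `fisher_exact_two_sided` is hypothetical, and the counts in the usage lines are invented for illustration rather than taken from the study.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]].

    Rows are groups (e.g. non-native, native); columns are outcomes
    (e.g. misclassified, correctly classified).
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    total = row1 + row2

    def table_prob(x: int) -> float:
        # Hypergeometric probability of x in the top-left cell
        # when all row and column totals are held fixed.
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    observed = table_prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Sum the probability of every table that is no more likely than the
    # observed one (small tolerance guards against float rounding).
    return sum(
        p
        for p in (table_prob(x) for x in range(lowest, highest + 1))
        if p <= observed * (1 + 1e-9)
    )


# Hypothetical counts, NOT study data:
# 4 of 54 non-native texts misclassified, 0 of 54 native texts misclassified.
p_value = fisher_exact_two_sided(4, 50, 0, 54)
print(f"two-sided p = {p_value:.4f}")
```

With samples of this size, a handful of misclassifications moves the p-value a long way, which is one reason borderline results call for replication on larger corpora.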

Perkins et al. (2024), published in the International Journal of Educational Technology in Higher Education, tested six detectors against 805 samples including adversarially modified text. Their "write as NNES with IELTS Band Level 6" technique successfully evaded detection, providing experimental confirmation that NNES-like writing patterns reduce detector sensitivity. The study concluded that detectors "cannot currently be recommended for determining academic integrity violations due to accuracy limitations and the potential for false accusation."

Weber-Wulff et al. (2023), in the International Journal for Educational Integrity, conducted the most comprehensive early comparison of 14 tools. All scored below 80% accuracy, with only five exceeding 70%. While this study did not specifically stratify by native/non-native authorship, it established that detectors exhibited a systematic bias toward classifying text as human-written (high false negatives) and were easily defeated by paraphrasing or machine translation.

The principal contradictory finding comes from Jiang et al. (2024), published in Computers & Education by researchers at ETS (the GRE administrator). Using approximately 10,000 GRE essays and custom-built detectors incorporating e-rater linguistic features and perplexity measures, they found no evidence of bias against non-native English speakers. This result is significant but must be interpreted carefully: the detectors were purpose-built for the study with representative training data, not commercial tools available to universities. The finding suggests that bias is a function of training data composition and tool design rather than an inherent limitation of detection methodology — but this distinction is immaterial for institutions deploying commercial tools that have not demonstrated equivalent debiasing.

2.2 Tool-by-tool accuracy and bias evidence

Turnitin dominates UK higher education, serving the majority of institutions and processing over 200 million papers globally through its AI detector. The company claims a false positive rate below 1% for documents of 300 words or more where at least 20% of content is AI-generated. Turnitin conducted its own ELL bias evaluation using approximately 9,000 ELL documents (including samples from the ICNALE and PELIC corpora) and reported a false positive rate of 0.86% for L2 writers versus 0.87% for L1 writers, which is essentially identical. However, this research has not been published in a peer-reviewed journal and is subject to a clear conflict of interest. Independent evaluations paint a different picture: Temple University found 77% accuracy with a 7% mis-flag rate for genuine human writing; the Washington Post found roughly 50% false positives in a small sample; and Weber-Wulff et al. placed Turnitin below 80% accuracy. Turnitin has made substantive technical changes since 2023: raising its minimum word count from 150 to 300, suppressing scores in the 1–19% range (displayed as an asterisk), and adding an AI paraphrasing detection model in July 2024. Several prominent institutions have disabled Turnitin's AI detector, including Vanderbilt University, King's College London, Ulster University, and the University of Nottingham, the last of which found "little correlation between human detection of AI and the tool's detection."

Originality.ai offers three models — Lite, Turbo, and Academic — with claimed accuracy of 98–99%. Pratama (2025) found Originality Lite achieved the highest overall accuracy among tested tools at 98.61%, with a small but borderline-significant gap for non-native writers. A Scribbr evaluation in 2024, however, found only 76% overall accuracy, substantially below vendor claims. Originality.ai was the only tool to catch AI paraphrasing more than 50% of the time, suggesting it may be better suited for detecting sophisticated AI use. Its market presence in UK higher education remains limited compared to Turnitin.

GPTZero claims to be "the only AI detector de-biased for ESL learners," reporting a 1.1% false positive rate on TOEFL essays following deliberate de-biasing efforts including tagged educational data, representative ESL datasets, and text pre-classification. These claims have not been independently verified in peer-reviewed literature. A PubMed-indexed study testing GPTZero on medical texts found only 80% accuracy with a 10% false positive rate and 35% false negative rate. By contrast, Pratama (2025) found 98.15% accuracy on scholarly abstracts. GPTZero has raised $10 million in Series A funding and is growing as a secondary tool in UK academia but lacks Turnitin's institutional integration.

Copyleaks showed the strongest bias-mitigation result in one study: a JALT 2024 evaluation found it achieved 100% accuracy with zero false positives across both L1 and L2 datasets — one of only two tools (alongside Undetectable AI) to do so. However, GPTZero's own benchmark placed Copyleaks at 90.7% accuracy with approximately 5% false positive rate, and other evaluations show widely variable results. Copyleaks bundles AI detection without additional charges, making it an attractive Turnitin alternative for budget-conscious institutions.

Winston AI presents the most concerning profile for equity purposes. Despite claiming 99.98% accuracy, independent testing finds 75–86.5% accuracy, and one evaluation documented a 35% higher false positive rate for non-English content. No published de-biasing efforts exist. Its minimal presence in UK higher education limits its immediate policy relevance.

A critical cross-cutting finding is that all vendors claim accuracy rates of 98–99.98%, yet independent testing consistently finds accuracy in the 75–93% range depending on context. The US Federal Trade Commission underscored this gap in 2025 by settling with Workado (Content at Scale/BrandWell) for falsely claiming 98% accuracy when independent testing showed only 53%.

2.3 Linguistic features that trigger misclassification

The systematic overlap between non-native English writing and AI-generated text centres on six measurable linguistic properties. Low text perplexity is the primary mechanism: AI detectors flag statistically "predictable" text, and non-native writers rely on high-frequency vocabulary and common constructions that produce low perplexity scores. Liang et al. demonstrated this causally — essays unanimously misclassified exhibited significantly lower perplexity than those correctly classified, and artificially enhancing vocabulary reduced misclassification while artificially simplifying it increased misclassification. Limited lexical diversity compounds this effect, as L2 writers draw from a smaller active vocabulary that mirrors the "average" language patterns of LLM training data. Simpler sentence structures and reduced syntactic variety produce more uniform text that lacks the characteristic "burstiness" of native human writing — the natural alternation between short and long, simple and complex sentences. Formulaic language, including collocations, set phrases, and academic register conventions that L2 writers rely on as communication strategies, further mimics AI output patterns. Finally, predictable word sequences arising from limited grammatical range produce exactly the low next-word-prediction surprisal that perplexity-based detectors are designed to flag.
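The two quantities at the centre of this mechanism can be sketched in a few lines. This is an illustrative sketch, not any vendor's method: perplexity is computed from per-token probabilities that are assumed to come from some language model (the numbers below are hypothetical), and the coefficient of variation of sentence lengths is used as a rough stand-in for "burstiness", which has no single agreed definition. Both function names are hypothetical.

```python
import math
import re
import statistics


def perplexity(token_probs: list[float]) -> float:
    """Perplexity = exp(mean negative log-probability of the tokens)."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    mean_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(mean_nll)


def sentence_length_variation(text: str) -> float:
    """Coefficient of variation of sentence lengths, in words.

    Assumption: used here only as a crude proxy for burstiness.
    Returns 0.0 when there are fewer than two sentences.
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    lengths = [len(s.split()) for s in sentences]
    if len(lengths) < 2:
        return 0.0
    return statistics.pstdev(lengths) / statistics.mean(lengths)


# Hypothetical per-token probabilities assigned by a language model.
predictable = [0.40, 0.35, 0.50, 0.45]  # common words, common constructions
varied = [0.40, 0.02, 0.50, 0.05]       # includes rarer word choices

print(perplexity(predictable) < perplexity(varied))  # True

uniform_text = "I like the city. It has good food. The people are kind."
mixed_text = "I like it. The food, which I first tried on a rainy night in March, is superb."
print(sentence_length_variation(uniform_text) < sentence_length_variation(mixed_text))  # True
```

A detector that treats low values of both measures as evidence of machine authorship will, by construction, be more likely to flag the first text in each pair.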

As Stanford's James Zou explained: "If you use common English words, the detectors will give a low perplexity score, meaning my essay is likely to be flagged as AI-generated. If you use complex and fancier words, then it's more likely to be classified as human written." This creates a structurally discriminatory dynamic where the detector rewards linguistic privilege.

2.4 Documented cases of false accusations

The most significant UK evidence emerged in July 2025 when the Office of the Independent Adjudicator (OIA) published six case summaries involving AI detection and academic misconduct. In one case, an international student received a zero mark after Turnitin flagged "substantial amounts" of AI-generated content; the student had used Grammarly, believing it was permitted as a non-native speaker. The OIA upheld the complaint, finding the university had not provided a fair opportunity to respond. In another case, an international student who used Google to find synonyms was found guilty of misconduct based on Turnitin flags. The OIA ruled that the university "did not consider whether Turnitin's AI detection might be less reliable for non-native English speakers, which was relevant given the student's international status." A third case involved an autistic student given a zero mark whose writing style was flagged as AI-generated; when the university reconsidered, it found no misconduct had occurred. These cases establish an OIA precedent that universities must consider detection tool limitations for vulnerable populations.

Internationally, the case law is developing rapidly. In February 2025, a French national enrolled in Yale's Executive MBA programme sued the university after being suspended for one year based on GPTZero flags, alleging national origin discrimination and coercion — a dean allegedly referenced visa revocation during the investigation. In January 2026, a New York court ruled in Newby v. Adelphi University that the institution's process was "arbitrary and capricious" after an autistic student's paper was scored 100% AI-generated by Turnitin while two other detectors classified it as human-written. The court ordered the university to expunge the violation and rescind all sanctions. At UC Davis, history senior William Quarterman experienced "full-blown panic attacks" after GPTZero flagged his take-home exam; he was cleared after demonstrating the tool also falsely flagged Martin Luther King Jr.'s "I Have a Dream" speech.

The Markup's 2023 investigation documented a pattern at Johns Hopkins University where instructor Taylor Hahn noticed Turnitin systematically flagging international students' writing. In one case, a student immediately produced drafts and highlighted PDFs proving authentic authorship; in another, Hahn had personally worked with the student through the drafting process, only for the submitted paper to be flagged. The University of Bristol launched its "AIvsAI" research project specifically to investigate how Turnitin has led to "numerous false accusations" consuming "extraordinary time and resources to investigate and, in often cases, dismiss."

The psychological and academic consequences are severe and compounding. Students report panic attacks, insomnia, declining grades during lengthy investigation periods, and permanent damage to their academic records — even when cleared, investigation records must often be self-reported to professional bodies. International students face the additional threat of visa revocation, scholarship loss, and reputational damage in contexts where transferring institutions is impractical. Education consultant Lucie Vágnerová reports that accused students frequently require counselling, with misconduct processes "often tak[ing] at least several weeks, if not months... really deeply affecting their mental health."

Scale estimates suggest the problem is substantial. A Guardian investigation (June 2025) found approximately 7,000 proven cases of AI-assisted cheating recorded across UK universities in 2023–24, equalling 5.1 cases per 1,000 students. However, there is no systematic data on how many investigations resulted from false positives. At a university processing 75,000 papers annually, even a conservative 2% false positive rate would generate 1,500 wrongful flags — each requiring investigation and causing student distress. The true false positive rate for international students is almost certainly higher than the institutional average.
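The arithmetic behind the 1,500 figure, and how it scales with the false positive rate, can be sketched as follows. The function name is hypothetical, and the calculation assumes, as the estimate above implicitly does, that every submitted paper is human-written. The three rates are the ones cited in this report: the 1% vendor-claimed ceiling, the 2% used above, and the 7% mis-flag rate from the Temple University evaluation.

```python
def expected_false_flags(human_written_papers: int, false_positive_rate: float) -> float:
    """Expected number of human-written papers wrongly flagged as AI-generated."""
    if not 0.0 <= false_positive_rate <= 1.0:
        raise ValueError("false_positive_rate must be between 0 and 1")
    return human_written_papers * false_positive_rate


papers_per_year = 75_000  # figure used in the text above

for rate in (0.01, 0.02, 0.07):
    flags = expected_false_flags(papers_per_year, rate)
    print(f"FPR {rate:.0%}: about {flags:,.0f} wrongful flags per year")
```

The 2% line reproduces the 1,500 wrongful flags quoted above; the other lines show how sensitive the total is to which published rate is believed.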

3. Policy Recommendations

3.1 Procurement and deployment of AI detection tools

UK universities should adopt a precautionary approach to AI detection tool procurement that reflects the evidence of systematic bias against non-native English writers. No AI detection tool should be procured or deployed without a formal Equality Impact Assessment that specifically evaluates false positive rates across language proficiency levels, disability profiles, and other protected characteristics. This is not merely good practice — it is arguably required under the Public Sector Equality Duty (Section 149, Equality Act 2010) for institutions that are public authorities or perform public functions.

Procurement specifications should require vendors to provide independently verified accuracy data disaggregated by writer demographics, including native/non-native English status, specific L1 backgrounds, and English proficiency levels. Vendors should be required to disclose training data composition, technical methodology, and the results of any internal bias testing. The absence of such data should be treated as a disqualifying factor. Contracts should include performance monitoring clauses requiring ongoing bias auditing and the right to suspend or terminate if independent testing reveals disproportionate impact.
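As an illustration of what "disaggregated" accuracy data means in practice, the sketch below computes a false positive rate per writer group from audit records. It is a minimal example under stated assumptions: every record is a text known to be human-written, the group labels and counts are invented, and the function name is hypothetical.

```python
from collections import defaultdict
from typing import Iterable


def false_positive_rate_by_group(
    records: Iterable[tuple[str, bool]],
) -> dict[str, float]:
    """Per-group false positive rate.

    Each record is (group_label, flagged_as_ai) for a text that is
    known to be human-written, so every flag counts as a false positive.
    """
    flagged: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for group, was_flagged in records:
        total[group] += 1
        if was_flagged:
            flagged[group] += 1
    return {group: flagged[group] / total[group] for group in total}


# Hypothetical audit sample, NOT real data.
audit = (
    [("native", False)] * 19
    + [("native", True)] * 1
    + [("non-native", False)] * 7
    + [("non-native", True)] * 3
)

rates = false_positive_rate_by_group(audit)
for group, rate in rates.items():
    print(f"{group}: {rate:.0%} of human-written texts flagged")
```

A single pooled rate for this invented sample would be about 13%, which hides the gap between the two groups; that is the reason for asking vendors to report each group separately.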

Institutions should follow JISC's guidance that AI detection results constitute "only preliminary guidance" and can never serve as proof. Detection scores should be excluded from initial misconduct panels to prevent anchoring bias, following the recommendation from Newcastle University's Dr David Grundy that AI flags constitute "fruit of the poisoned tree" in evidence terms. Where detection tools are used, a minimum of two independent tools should be required before any investigation proceeds, and discordant results (as in the Adelphi case, where one tool scored 100% AI and two scored human) should automatically terminate the inquiry.

Turnitin's decision to suppress scores in the 1–19% range (displaying an asterisk instead) represents a partial acknowledgement of unreliability at lower confidence levels. Universities should consider raising their institutional threshold significantly higher — a score below 40–50% should not trigger any investigative action given the documented false positive rates.
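The two-tool requirement and the higher threshold recommended above could be combined into a simple gating rule. The sketch below is one possible reading of those recommendations, not an established procedure: the function name, the tool names, and the default threshold of 0.5 (the upper end of the 40–50% range suggested above) are all assumptions. It decides only whether detector output may be considered at all, never whether misconduct occurred.

```python
def flags_may_be_considered(scores: dict[str, float], threshold: float = 0.5) -> bool:
    """Return True only if at least two tools were used and all of them agree.

    `scores` maps a tool name to the fraction of the text (0.0 to 1.0)
    that the tool attributes to AI. Discordant results, or any score
    below the threshold, mean the flag is set aside.
    """
    if len(scores) < 2:
        return False  # a single tool is never sufficient
    return all(score >= threshold for score in scores.values())


# Hypothetical scores modelled on the Adelphi pattern:
# one tool reports 100% AI, two others classify the text as human-written.
adelphi_like = {"tool_a": 1.00, "tool_b": 0.00, "tool_c": 0.00}
print(flags_may_be_considered(adelphi_like))  # False

# Hypothetical concordant case above the threshold.
concordant = {"tool_a": 0.80, "tool_b": 0.65}
print(flags_may_be_considered(concordant))  # True
```

Even when the rule returns True, the result is only a reason to look for other evidence, in line with the JISC position that detection output is preliminary guidance.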

3.2 Alternatives to detection-based approaches

The evidence strongly supports a strategic shift from detection to assessment redesign as the primary response to generative AI. The QAA has described this as "a generational incentive for providers to require their programme and module teams to review and, where necessary, reimagine assessment strategies." Several approaches show particular promise:

Oral examinations and viva voce defences allow students to demonstrate understanding of submitted work and are inherently resistant to AI fraud. They also provide a more equitable assessment environment for international students whose written English may not reflect their subject knowledge. TEQSA identifies interactive oral assessments as a good practice exemplar.
Process-based assessment requires documented evidence of the learning journey — research logs, iterative drafts, prompt documentation, and reflective commentary. This shifts the evidentiary burden from proving what a student did not do (use AI) to demonstrating what they did do (engage in learning).
The AI Assessment Scale (AIAS) developed by Perkins et al. (2024) provides a structured framework for defining appropriate levels of AI use across different assessment tasks, from "no AI" through "full AI integration." Multiple UK universities including Ulster and Queen Margaret University Edinburgh have adopted this framework.
Supervised in-person assessments can be reserved for high-stakes evaluations; the University of Sydney, for example, is moving to require in-person assessment for all online programmes by 2027.

However, assessment redesign is not a complete solution. Kofinas (2025), writing in the British Journal of Educational Technology, warns that even authentic assessments are "neither a shield for academic integrity nor an immediate solution" — generative AI can engage with real-world tasks, case studies, and reflective exercises. Dawson and Liu (2025) argue that discursive frameworks "remain powerless to prevent AI use when they rely solely on student compliance." A mixed approach combining structural assessment reform with proportionate, equity-conscious use of detection as one data point among many is likely the most defensible position.

3.3 Equity safeguards and due process protections

The OIA's July 2025 casework guidance establishes important principles that should be formalised into institutional policy. The burden of proof must rest with the institution, not the student. AI detection scores are probabilistic outputs from opaque systems, not evidence of misconduct, and must be treated as such. Students should be informed of any AI detection score attributed to their work, provided with a plain-language explanation of the tool's limitations (including known bias against non-native English writers), and given adequate time and support to respond.

Universities should implement mandatory bias awareness training for all staff involved in academic misconduct investigations, covering the documented false positive disparities for non-native English writers, students with disabilities, and writers with distinctive stylistic profiles. Investigation panels should be required to consider whether a student's language background, disability status, or writing style could explain an elevated AI detection score before proceeding.

Specific protections for international students should include: access to specialist advisors who understand both the technical limitations of AI detection and the immigration consequences of misconduct findings; extended response periods that account for the additional stress of proceedings conducted in a second language; and an absolute prohibition on any reference to visa status during misconduct proceedings (such a reference allegedly occurred in the Yale case).

A right of appeal on technical grounds should be guaranteed, including the right to submit the same work to alternative detection tools and to present expert evidence on detector limitations. Given the Adelphi precedent, institutions should recognise that contradictory results from different tools fundamentally undermine any single tool's evidentiary weight.

Finally, the OfS should issue sector-wide guidance comparable to TEQSA's approach in Australia, requiring every institution to submit an action plan addressing AI detection bias risks, publish its AI detection policy with disaggregated accuracy data, and report annually on misconduct outcomes by student demographic characteristics to enable monitoring for disparate impact. The UK's current approach — advisory guidance from QAA and JISC with no regulatory enforcement — is inadequate given the scale of the equity risk and leaves the UK significantly behind both Australia and the EU in protecting students from algorithmic discrimination in educational assessment.

4. Source List

Peer-reviewed studies (highest reliability)

Citation	DOI/URL	Key Contribution	Notes
Liang, Yuksekgonul, Mao, Wu & Zou (2023). "GPT detectors are biased against non-native English writers." Patterns, 4(7), 100779	10.1016/j.patter.2023.100779	Foundational study: 61.3% FPR for non-native writing across 7 detectors	Limitations: small sample (91 TOEFL essays); tested 2023-era tools
Weber-Wulff et al. (2023). "Testing of detection tools for AI-generated text." Int'l Journal for Educational Integrity, 19, 26	10.1007/s40979-023-00146-z	All 14 tools <80% accuracy; established baseline unreliability	Did not stratify by native/non-native status
Pratama (2025). "The accuracy-bias trade-offs in AI text detection tools." PeerJ Computer Science, 11, e2953	10.7717/peerj-cs.2953	First study on AI-assisted text fairness; non-native bias confirmed	Tests GPTZero, ZeroGPT, DetectGPT on scholarly abstracts
Jiang, Hao, Fauss & Li (2024). "Detecting ChatGPT-generated essays in a large-scale writing assessment." Computers & Education, 217, 105070	10.1016/j.compedu.2024.105070	CONTRADICTORY: No bias found with custom-built detectors and representative training data	Used custom (not commercial) detectors; does not refute bias in tools available to universities
Perkins et al. (2024). "Simple techniques to bypass GenAI text detectors." Int'l J. of Educational Technology in Higher Ed., 21, 53	10.1186/s41239-024-00487-w	Confirmed NNES bias mechanism; 17.4% accuracy reduction with adversarial techniques	Recommends against using detectors for misconduct determinations
Giray (2024). "The Problem with False Positives." The Serials Librarian, 85(5–6), 181–189	10.1080/0361526X.2024.2433256	Qualitative documentation of disproportionate impact on non-native scholars	No quantitative false positive data
Walters (2023). "The effectiveness of AI content detection tools." Open Information Science	10.1515/opis-2022-0158	16-tool comparison; top 3: Copyleaks, Turnitin, Originality.ai	Did not specifically test native/non-native bias
Kofinas (2025). Authentic assessment and AI. British Journal of Educational Technology	10.1111/bjet.13585	Warns authentic assessments are not immune to GenAI	Important caveat for assessment redesign strategies

Policy documents and institutional guidance

Source	URL	Relevance
OIA casework note: AI and academic misconduct (July 2025)	oiahe.org.uk — AI and academic misconduct casework note	First UK ombudsman guidance; establishes bias consideration requirement
Russell Group: 5 Principles on GenAI in Education (July 2023)	russellgroup.ac.uk — principles on generative AI tools	Advisory framework for 24 research-intensive universities
QAA: Generative AI guidance and resources	qaa.ac.uk — sector resources on generative artificial intelligence	Explicit caution against AI detection tools
JISC: AI Detection and Assessment update (2025)	jisc.ac.uk — innovation: artificial intelligence	Most detailed UK technical guidance; "cannot prove conclusively"
JISC: FE College AI Principles	jisc.ac.uk — further education and skills AI principles	Explicit warning on discrimination risk
UK Data and AI Ethics Framework	gov.uk — data ethics framework	Requires Equality Impact Assessments for AI procurement
ICO: AI fairness and discrimination guidance	ico.org.uk — AI and data protection guidance on fairness	Legal requirements for bias testing under GDPR
CDEI: Review into Bias in Algorithmic Decision-Making (2020)	Published via gov.uk assets	Recommendation 12: EHRC capacity for algorithmic discrimination
TEQSA: Enacting Assessment Reform in a Time of AI (Sep 2025)	teqsa.gov.au — assessment reform guidance	Australia's mandatory sector-wide approach; comparator model
EU AI Act (entered into force August 2024)	artificialintelligenceact.eu	Classifies educational assessment AI as high-risk; compliance by August 2026
POST Briefing PN-0712: AI in education (January 2024)	UK Parliament research briefings	References bias concerns in educational AI
OfS blog: Approach to AI (2025)	officeforstudents.org.uk — blog on AI approach	Confirms no specific detection guidance; "principles-based" approach

Journalism, case studies, and legal analyses

Source	URL	Relevance	Notes
Times Higher Education: Students win AI plagiarism appeals (Jul 2025)	timeshighereducation.com	OIA case summaries coverage	Primary UK case study source
The Markup: AI detection tools falsely accuse international students (Aug 2023)	themarkup.org — machine learning investigation	Johns Hopkins pattern documentation	Investigative journalism; high reliability
Yale Daily News: SOM student sues Yale (Feb 2025)	yaledailynews.com	French national discrimination lawsuit	US case; pending
Inside Higher Ed: Adelphi student wins AI plagiarism lawsuit (Feb 2026)	insidehighered.com	Court rules process "arbitrary and capricious"	Landmark US ruling
Rolling Stone: Student accused via Turnitin (2023)	rollingstone.com	Louise Stivers / UC Davis case	Psychological impact documentation
The Guardian: UK university AI cheating investigation (Jun 2025)	theguardian.com	~7,000 proven AI cases in 2023–24 across UK	FOI-based investigation
Newcastle University blog: Grundy (2025) on AI-flagged misconduct	blogs.ncl.ac.uk	Detailed procedural fairness analysis	Academic blog; strong analytical framework
5SAH Barristers: AI in universities — legal analysis (May 2025)	5sah.co.uk	No statutory framework for AI in academic misconduct	UK legal analysis
Crowell & Moring: Ivy League AI lawsuit analysis	crowell.com	Yale case legal implications	US law firm analysis
Bristol University: AIvsAI research project	research-information.bris.ac.uk	Investigation of Turnitin false accusations	Ongoing UK research

Sources with contradictory findings (flagged)

Source	Finding	Contradicts
Jiang et al. (2024), Computers & Education	No bias in custom-built detectors with representative training data	Liang et al. (2023) finding of systematic bias — but tests different (non-commercial) tools
Turnitin ELL evaluation (2023–24, self-published)	FPR of 0.86% (L2) vs 0.87% (L1) — no statistically significant bias	Independent studies showing 2–7% FPR; conflict of interest (vendor self-study)
Copyleaks evaluation (Aug 2024, self-published)	99.84% accuracy across non-native datasets (12/7,482 misclassified)	Weber-Wulff et al. finding of <80% accuracy for all tools; industry-sponsored
GPTZero ESL de-biasing claims	1.1% FPR on TOEFL essays	Not independently verified; earlier PubMed study found 10% FPR

Data insufficiency notes

The following areas lack adequate evidence for definitive conclusions: (1) no large-scale studies examine specific L1 backgrounds (e.g., Chinese, Arabic, Japanese) independently — the field treats "non-native English" as monolithic; (2) no UK-specific data exists on the number of students falsely accused through AI detection tools; (3) neither the ICLE nor BAWE corpus has been used in published AI detection bias research, despite their obvious relevance; (4) no Equality Impact Assessment for AI detection tool deployment has been published by any UK university; (5) vendor accuracy claims for tools updated in 2024–2025 have not been subjected to independent peer-reviewed evaluation; and (6) the rapidly evolving nature of both AI generation and detection means that point-in-time evaluations become outdated within months.
