Why AI Citation Fabrication Is Harder to Catch in the Humanities

On May 7, 2026, The Lancet published a short research letter with an alarming finding. A team led by Maxim Topaz at Columbia University's School of Nursing built an automated system to verify references across the biomedical literature. It scanned roughly 2.5 million papers and 125.6 million structured references from PubMed Central's Open Access collection, including 97.1 million references with PubMed IDs, and it found something that made editors take notice: 4,046 references that point to papers which do not appear to exist. The rate of these fabricated citations rose more than twelvefold between 2023 and 2025, with the steepest climb beginning in mid-2024, around the time AI writing tools went mainstream.

You can read the original letter in The Lancet and the companion editorial, Bauchner et al., "Fabricated references: a new threat to editorial integrity." The Columbia School of Nursing summary is freely available if the journal is paywalled for you.

The coverage that followed framed this as a detection problem. Journals need tools to catch fake citations before publication. Several large publishers already have them. And for biomedicine, that framing is roughly correct.

For the humanities, it is the wrong framing for a problem that is, if anything, harder.

Why fabrication detection is the easy case in biomedicine

The reason the Columbia team could build a working detector is that biomedical literature is almost completely indexed. Nearly every legitimate paper has a PubMed identifier. The team verified the references that carried a PMID, which was 77% of them, and the question for each was binary and answerable: does a real paper with this identifier exist, or not?

That is a clean lookup. When 95% or more of the relevant literature lives in a structured, queryable database, "this reference points to nothing" is a signal you can trust. A citation that fails the lookup is genuinely suspicious, because almost everything real would have passed.
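The lookup described above can be sketched in a few lines of Python. This is only an illustration of the binary check, not the Columbia team's system: the `KNOWN_PMIDS` set stands in for a real index, and the identifiers and helper names are made up for the example.

```python
# Hypothetical stand-in for a comprehensive index. A real check would
# query a bibliographic database rather than an in-memory set.
KNOWN_PMIDS = {"10000001", "10000002", "10000003"}


def pmid_exists(pmid: str) -> bool:
    """Binary check: does any indexed paper carry this identifier?"""
    return pmid.strip() in KNOWN_PMIDS


def split_references(pmids: list[str]) -> tuple[list[str], list[str]]:
    """Separate identifiers that resolve from those that do not."""
    found = [p for p in pmids if pmid_exists(p)]
    missing = [p for p in pmids if not pmid_exists(p)]
    return found, missing


found, missing = split_references(["10000001", "99999999"])
print("resolved:", found)
print("no match:", missing)
```

The check is only as trustworthy as the index behind it: "no match" means something precisely because nearly everything real is expected to be in the set.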

This is also why the big medical journals can build detection in-house. The hard part was never the verification logic; it was having a comprehensive database to verify against. Biomedicine has one.

It is worth noting that even the study's authors were careful about what their number means. Topaz told Nature the finding is a "lower bound of true prevalence," and that the team is "scratching the tip of the iceberg." The letter itself is candid about where its method runs out: the analysis excluded the 23% of references that had no PubMed identifier, and the authors note that fabricated references "might be more common among non-indexed sources, including grey literature, websites, and books," which would mean their figure undercounts. Read that limitation again, because it is the whole of the humanities problem stated in a single clause. Non-indexed books and grey literature are not the edge case in the humanities. They are the main case.

One research-integrity scholar, quoted in Retraction Watch, pushed further, arguing the more important problem is not the wholly invented citation but the one that exists yet does not support the claim attached to it.

Hold onto that point too. It matters more in the humanities than anywhere else.

Why the humanities version is harder

Now picture a dissertation chapter in classics, theology, or intellectual history. Its footnotes cite a translated patristic text in the Sources Chrétiennes series, a 1931 monograph from a German university press, an article in a regional history journal that predates digital indexing, an edited volume whose chapter authors differ from its editors, and a primary source quoted from a critical edition published in 1908.

Almost none of that is reliably covered by the scholarly databases. Crossref indexes humanities work unevenly, with thin coverage of older and non-English material; OpenAlex and Google Books reach further into monographs and pre-digital scholarship but are far from complete. Large parts of the humanities were published before any of these databases existed, in languages and series that were never fully indexed, by presses that have since merged, moved, or closed. The clean binary that makes biomedical detection work (the reference either exists in the database or it does not) simply breaks. In the humanities, a reference that returns no database match is not evidence of fabrication. It is the normal condition of a great deal of perfectly real scholarship.

This produces two failure modes that the medical detection model handles badly.

The first is the false accusation. A naive "fabrication detector" pointed at a humanities bibliography would flag a large share of legitimate references simply because they are not indexed. Telling a historian that their citation to a 1908 critical edition is "fake" because Crossref has never heard of it is not a useful result. It is noise that trains the user to ignore the tool.
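The arithmetic behind that noise can be made concrete. The sketch below assumes a simplified model in which every fabricated reference fails the lookup and a real reference fails only when the database does not cover it. The function name and the input numbers are illustrative assumptions, not figures from the Lancet letter.

```python
def share_of_flags_that_are_fabricated(coverage: float, fabrication_rate: float) -> float:
    """P(fabricated | no database match) under the simplified model.

    Assumes every fabricated reference fails the lookup, and a real
    reference fails only when the database does not cover it.
    """
    if not (0.0 <= coverage <= 1.0 and 0.0 <= fabrication_rate <= 1.0):
        raise ValueError("coverage and fabrication_rate must be between 0 and 1")
    fake_flags = fabrication_rate
    real_flags = (1.0 - fabrication_rate) * (1.0 - coverage)
    total = fake_flags + real_flags
    return fake_flags / total if total else 0.0


# Illustrative inputs only: a 1% fabrication rate under two coverage levels.
for coverage in (0.99, 0.50):
    share = share_of_flags_that_are_fabricated(coverage, fabrication_rate=0.01)
    print(f"coverage {coverage:.0%}: {share:.1%} of flagged references are fabricated")
```

With these made-up inputs, roughly half of the flags are genuine fabrications at 99% coverage, while at 50% coverage only about one flag in fifty is. The same detector goes from useful to mostly wrong without a single line of its logic changing.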

The second is the harder real problem that the Retraction Watch critic named. AI writing tools do not only invent references that point nowhere. They also produce references that look real and carry a real-sounding author but are attached to a claim the source does not actually make, and they conflate two genuine works into one plausible-looking hybrid. These are not caught by asking "does it exist," because in a sense they do exist, just not in the form cited. Detecting them requires reconstructing what the researcher meant and cross-referencing the fragments against the actual historical publication record, which is a judgment task, not a binary lookup.

(If you need a verification engine built specifically for these kinds of humanities sources, Citation Master audits your bibliography without false accusations.)

"Does it exist" is the wrong question

The biomedical detector asks one question: real or fake. That works when the answer is reliably knowable.

In the humanities, the useful questions are different and plural. Is this the right edition, given that the same ancient text exists in dozens? Is "Schroeder 1976" the same work as the fuller citation three footnotes earlier? Was this volume actually published in 1975 rather than 1976? Is this series number correct, given that Sources Chrétiennes 464 and 464-465 are different things? Does the shorthand "Morlet, 2011" refer to a journal article or a chapter in an edited collection?

None of these are answerable with a binary fabrication check. All of them are common, and all of them are the difference between a bibliography that holds up under scrutiny and one that quietly does not.

What an honest tool does instead

This is why a verification tool built for the humanities cannot simply be a fabrication detector with a different logo. The job is not to render a verdict of real or fake. The job is to verify what can be verified, reconstruct what can be reconstructed, and flag the rest honestly, with enough context that the researcher can make the final call.

We wrote in detail about the accidental version of this problem, the drift and typos and swapped initials that creep into even published bibliographies, in our earlier piece on auditing your bibliography before your reviewer does. Fabrication and AI-introduced error are the newer, sharper edge of the same underlying need: a bibliography you can actually trust.

The distinction we hold to is simple. A tool should not tell you a citation is fabricated. It should tell you it could not verify the citation, show you exactly what it checked and what did not match, and let you decide. The difference is not pedantic. "Fabricated" is an accusation that, applied to an unindexed but genuine humanities source, is simply wrong. "Unverified, here is what I found and did not find" is a true statement that helps you.
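One way to hold a tool to that distinction is to make the accusation impossible to express in its output. The sketch below is a hypothetical data model, not Citation Master's actual implementation: the class names, fields, and the example check results are all invented for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    VERIFIED = "verified"
    UNVERIFIED = "unverified"  # deliberately no FABRICATED member


@dataclass
class Check:
    source: str    # where the tool looked, e.g. a database name
    matched: bool  # whether that source returned a matching record
    note: str = ""


@dataclass
class Report:
    citation: str
    checks: list[Check] = field(default_factory=list)

    @property
    def status(self) -> Status:
        # Verified if any source matched; otherwise simply unverified.
        return Status.VERIFIED if any(c.matched for c in self.checks) else Status.UNVERIFIED

    def summary(self) -> str:
        lines = [f"{self.citation}: {self.status.value}"]
        for c in self.checks:
            outcome = "match" if c.matched else "no match"
            lines.append(f"  - {c.source}: {outcome}" + (f" ({c.note})" if c.note else ""))
        return "\n".join(lines)


# Invented example results for an unindexed but possibly genuine source.
report = Report(
    "1908 critical edition",
    [Check("Crossref", False, "no record found"), Check("OpenAlex", False)],
)
print(report.summary())
```

The report carries the evidence (what was checked, what did not match) alongside the status, so the researcher sees the basis for "unverified" instead of a bare verdict.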

A case in point

Consider the kind of reference that defeats a binary detector. A footnote reads only: "Schroeder. 1976." No title, no publisher, no volume. A fabrication checker pointed at a database would find no match and could, on the medical model, flag it as suspect. An AI tool working blind, with nothing to check against, might invent a plausible 1976 edition to make the fragment resolve cleanly.

Neither response is right. The reference is real; the researcher simply used shorthand. The correct behavior is to reconstruct it: cross-reference the surname, the discipline, and the publication history to identify it as Guy Schroeder's translation and annotation of Eusebius's La Préparation évangélique, volume 215 of the Sources Chrétiennes series, published by Éditions du Cerf. And then, having reconstructed it, to notice the genuine error hiding inside the shorthand: that volume was published in 1975, not 1976. We break down exactly how the engine caught this specific 1975 vs 1976 typo in our full case study.

That is not detection. It is reconstruction followed by verification followed by an honest flag, with the correction offered and the final judgment left to the researcher. It is the opposite of the binary real-or-fake model, and it is the only model that actually works on the sources humanities scholars cite.
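The reconstruct, verify, flag sequence can be sketched as follows. This is a toy illustration under stated assumptions, not the engine described in the case study: the one-entry `CATALOGUE`, the helper names, and the one-year tolerance are all hypothetical, and a real system would draw candidates from many sources and weigh discipline and publication history as well.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    author: str
    title: str
    series: str
    volume: int
    year: int


# Hypothetical one-entry catalogue holding the record named in the text.
CATALOGUE = [
    Record("Schroeder", "La Préparation évangélique", "Sources Chrétiennes", 215, 1975),
]


def parse_shorthand(footnote: str) -> tuple[str, int]:
    """Extract a surname and a four-digit year from shorthand like 'Schroeder. 1976.'"""
    match = re.search(r"([^\W\d_]+)\W+(\d{4})", footnote)
    if match is None:
        raise ValueError(f"cannot parse shorthand: {footnote!r}")
    return match.group(1), int(match.group(2))


def reconstruct(footnote: str, year_tolerance: int = 1) -> list[str]:
    """Return human-readable findings, never a real-or-fake verdict."""
    surname, year = parse_shorthand(footnote)
    findings = []
    for rec in CATALOGUE:
        if rec.author.lower() != surname.lower():
            continue
        if abs(rec.year - year) > year_tolerance:
            continue  # too far off to be a plausible slip of the same work
        findings.append(
            f"candidate: {rec.author}, {rec.title}, {rec.series} {rec.volume} ({rec.year})"
        )
        if rec.year != year:
            findings.append(f"flag: cited year {year} differs from recorded year {rec.year}")
    if not findings:
        findings.append("unverified: no candidate found in the sources checked")
    return findings


for line in reconstruct("Schroeder. 1976."):
    print(line)
```

The tolerance is the design choice that matters here: an exact-match lookup on 1976 would return nothing, whereas allowing a near miss surfaces the candidate and turns the discrepancy into a flag the researcher can confirm or reject.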

What this means for you

If you work in biomedicine, the journals are building fabrication detection, and that is appropriate to your field. The databases are comprehensive enough to make it work.

If you work in the humanities or the qualitative social sciences, the headline finding still matters, because AI writing tools introduce the same errors into your drafts. But the medical fix does not transfer. You do not need a tool that declares references fake. You need one that reviews your messy draft, reconstructs your shorthand, verifies what the databases and the open web can confirm, and tells you honestly and specifically what it could not, so that the references you cannot verify by lookup get your attention rather than a false verdict.

Detection is the easy half of the problem, and it is the half your field is least equipped to rely on. The half that matters for humanities work is judgment: reconstruction, comparison, and honest flagging. That is the half worth building for.

Ready to audit your citations?

Citation Master reviews your actual draft (footnotes, shorthand, multilingual sources and all), verifies every citation it can against scholarly databases and targeted web search, and flags what it cannot, with an explanation rather than a verdict. What gets verified is reformatted to your chosen style. What does not is flagged separately, never invented.

Start with 50 free verifications today. No credit card required.
