Can AI Read Fiction for Quality? Still Yes, and Still No
I have learned from three independent sources that AI is quietly being used in the publishing world to triage slush. Submissions are graded by large language models, and only the top-scoring ones receive human attention. The good news? In 2030, query letters will seem as archaic as dialup internet service and 8-inch floppies. The importance of personal connections will diminish. The bad news? We will be replacing exhausted human gatekeepers with inscrutable computational ones.
We know that AI can’t write very well. At 250 words, the superficial articulateness loses its charm. At 500, the verbosity of the content becomes evident and undesirable. At 1000 words, subtle inconsistencies begin to arouse suspicion that something isn’t quite right; at 2000, we feel cheated. Style is also an issue. Ask an AI to line edit, and 95% of its suggestions will flatten the prose and diminish the voice. It cannot produce top-tier artistic fiction, and there are a number of reasons to believe it never will.
Writing, however, seems to be an endeavor where producing quality is hard, but recognizing it is easy. We “know it when we see it.” It isn’t controversial to assert that 98% of people lack the skill to write a publishable novel, but it would be taken as elitist and incorrect (and, on both counts, I would probably agree) to assert that 98% of readers’ opinions should hold no weight. We are in messy territory, though. We would like to believe, in matters of craft and literary value, that objectivity exists, but the best we seem to find is correlated subjectivity. We assume tacitly that all the world’s disparate opinions are approximations of some quantity that, although it cannot be directly measured, matters. We obsess over this. Is my work truly any good? There will always be negative and positive opinions. Indicators contradict; some must be noise. Which ones? How do we know?
AI could bring clarity, or it could cause confusion. I have seen it do both.
What AI Can Do… Sort Of
The vast majority of people who try to engage with traditional publishing will get one of two outcomes:
flippant, cold, unaccountable rejection—usually, due to a lack of insider credentials or social status, not the writing.
insincere flattery and positivity—“we love this book” followed by a lack of real defense if things get difficult.
AI can also produce these same outcomes! Of course, the ratios of these tendencies are famously opposite. Naively used, AI will flatter unskilled writing just as reliably as publishing can be counted on to reject even most skilled work. The one thing I can say in AI’s favor is that all of its flaws—bias, lazy reliance on heuristics, a tendency to prefer socially acceptable answers and to equivocate when the preferred response is not known—exist in humans. Its deployment in publishing is therefore unlikely to worsen anything.
When I need precision, and for experimental standardization, I ask it to assess work on a 0–100 scale where:
0 = slush pile.
20 = top fanfic, not publishable.
40 = commercially publishable.
60 = literary fiction, memorable.
80 = award-winning.
100 = instantly canonical.
Can AI tell the difference between 85 and 100? No. That takes us, as humans, decades to decide. It can tell the difference between 10 and 70, however. From 0 to 50, it can rate fiction consistently within about 10 points of where humans would put it. Alas, this is not useful to me personally, as the lower half is not where I prefer to write. From 50 to 70, there seems to be some signal, but it starts to get fuzzy. Beyond 70, you are capturing the “mood” of the AI rather than discernment. A rating of 97 does not necessarily mean that you are a generational talent and that your work is perfect; it could mean that, but it more likely means that your anti-flattery prompt has been forgotten. Given language modeling’s generic imprecision, trying to use it to line edit—to navigate subtle gradients of linguistic quality—is a recipe for disaster. Those 51/49 decisions are torture, I get it, but AI will not help you.
Nevertheless, AI can filter the slush pile as effectively as current systems—in seconds. Literary agents will quietly use it to replace themselves—at first, simply to grade and filter; later on, for executive summary and feedback. Here’s where it gets scary. We don’t really know where AIs get their judgments from—as I said, they are equally prone to lazy, heuristic reasoning to humans. The problem is that, when they get the wrong answer, they are articulate and confident. They can justify themselves, even when they have discovered nothing, better than the vast majority of humans. The real reasoning is number-crunching on GPUs that, due to millions of dollars’ worth of gradient descent and reinforcement learning, tends to produce correct answers to easy problems. Giving them harder challenges often exposes them as mere mimics. Worse, the language that follows as a justification for the conclusions, while it sounds like fluent reasoning, is often a convincing lie. The opinion, formed early and based on very little, is followed by reasoning that is backfitted to it.
Experimental Results
I performed this experiment with several language models. I asked them to rate and compare two versions of the same passage—for this, I used the first three scenes from Farisa’s Crossing. One version is the one that’s available on Royal Road, a prerelease. The other version, starting from the same text, was given a line edit designed to follow the tastes of “Big 5” publishing—many of the changes were beneficial, a few were regrettable. The versions are similar enough that, on the 100-point scale above, they are within 5 points of each other—a difference unlikely to influence sales or even be noticed by most readers, but that award juries would care about. In other words, 5 points can really matter. For the record, I would put myself in the low to mid 70s—it would not surprise me if Farisa, when released in its final form, wins awards; it would also not surprise me if it wins no awards.
Anyway, let’s get a verdict. The following are two runs using GPT-4.5:
Run #1
Literary Quality Evaluation:
Version A: 73 (Literary, edging toward Award Winning)
Version B: 62 (Literary)
Summary of Comparative Strengths:
Version A is the stronger passage. While both versions contain vivid imagery, Version A has tighter, more poetic prose, and a clearer rhythmic intensity. Its descriptions are more immediate and emotionally evocative, making the narrative feel closer, rawer, and more intimate. The psychological depth—particularly Farisa’s emotional state and her sensory experiences—is more powerfully conveyed.
Specific strengths of Version A include:
Sharper imagery and diction:
Phrases like "letting fury land every step," and "the summed will of all who had ever wished her ill," elevate emotional stakes vividly and succinctly.Rhythm and pace:
The countdown during the fall is executed with precision, making the action feel urgent and visceral.Interior life clarity:
Version A delivers a clearer, deeper insight into Farisa’s internal struggle and confusion, enhancing the literary texture and making her experiences more poignant and affecting.Version B, while competently written, occasionally indulges in wordiness or explanatory clauses that slightly diminish the immediacy. Its pacing and tone lack some of the succinct intensity and psychological sharpness of Version A.
Overall, Version A's more streamlined, impactful prose gives it a clear edge in literary quality.
Run #2
Scoring:
Version A: 72 (Literary, approaching Award Winning)
Version B: 58 (Solidly Literary, close to Commercial)
Summary of Differences:
Version A surpasses Version B by demonstrating consistently stronger prose, superior emotional intensity, and tighter thematic cohesion. Its language is richer and more evocative, handling descriptions of sensation and emotion with more depth and nuance. The narrative voice in A is mature, subtly poetic, and deliberate, carefully balancing vivid sensory detail with precise emotional beats. Its imagery is more layered, and it employs pacing effectively to build tension and convey the protagonist's turmoil convincingly.
Version B, though still well-written, frequently dips into more conventional phrasing and less nuanced exposition, leading to occasional flatness and diluted emotional resonance. Dialogue and inner thought sequences in B sometimes fall into casual or overly colloquial patterns ("Fuck me," "Where the fuck are my shoes?") that somewhat undercut the narrative's gravity and the protagonist's characterization, making the internal drama less convincing.
Ultimately, Version A distinguishes itself by a heightened awareness of literary rhythm, atmosphere, and emotional precision, consistently sustaining a more sophisticated literary voice.
Well, that seems remarkably consistent—Version A is better, by about 13 points.
So, which one was the winning version—the raw one on Royal Road, or the one put through a “Big 5”-style line edit?
The answer is… I switched it up; Version A in Run #1 was Version B in Run #2, and vice versa. The AI, it seems, prefers Version A because it is Version A, or because it is presented first. I’m not sure whether it’s the order of presentation or the positive associations with the letter “A” that drives this, but either way, it’s random noise. It’s not a behavior we want. I also believe the 10+ difference greatly overstates the gap between the two versions. This was not unique to GPT-4.5; other models (such as 4o and o3) fell into the same trap.
Making it worse, the larger the passages are, the more common it is that lazy, heuristic reasoning replaces incisive reading. In software engineering, this is called silent failure, rightly considered dangerous.
Negative results are almost never published, but for the good of science, there’s one. Although it is not the case that AIs are fully incapable of evaluating prose—under normal circumstances, their judgments correlate highly to human opinions—their evaluation is unstable, influenced by experimental conditions, and subject to prompt noise.
Failure… or Self-Reliance? You Decide.
Nothing would thrill me more than an automated way to evaluate the quality of writing. Authors in traditional publishing (except for the lucky few who inherit generational connections) put up with immense and unyielding disadvantage and indignity, while self-publishers pay tens of thousands of dollars, to have a person of presumed wisdom finally say, “Yes, this is good enough.” Humanity would benefit as well. Real transparency in literary evaluation, as well as objectivity that could wrest power from exploitative gatekeepers, would revitalize literature in ways that, unless it becomes a reality, we can only talk about. However, the technology is simply not yet ready. AI can tell the difference between 30-level writing and 65-level writing on the (admittedly, flawed and reductive) scale above, and that is not nothing, but it is useless at guiding us on the subtle decisions with which we struggle on a daily basis.
Some have argued that every great writer requires “a village.” History makes it hard to disagree. Writing at the highest levels is hard; it takes time and feedback. Fitzgerald had Maxwell Perkins to guide him through a period of rapidly shifting tastes. Wallace had Michael Pietsch to turn Infinite Jest into a marketable product without destroying his voice. All of these authors had several editors examining every comma to make sure nothing embarrassing got into print, followed by a full-time publicity staff geared toward ensuring that the work’s earliest readers were the right people. Today? That village has been stolen and tucked behind a query wall—a process designed to be dysfunctional, because it efficiently holds up the illusion of meritocracy while ensuring that outsiders stay out, or are only admitted if they show a clear understanding that they are not allowed to negotiate. The people who did this are never going to give the village back.
The contrary view is that, because writing exists in the context of its time, the widespread and severe inefficacy of literary institutions today can be considered a feature, not a bug. Perhaps a writer like me doesn’t need the “village” and should be thankful it was stolen. Perhaps it was an encumbrance all along that would have destroyed everything that makes my writing distinctive. Perhaps the occasional mistake that becomes a part of history is also art. Is this view the correct one? I don’t know. I truly don’t.
Not everything the stolen village did can be forgone, though. As a copy editor, AI is capable but has such a high false-positive rate that I would still prefer a human. For line or high-level editing, I wouldn’t use it—it may lead you astray. It certainly cannot yet replace the rounds of attentive editing performed by a traditional publisher when it operates at its best (although I am aware only lead titles get this treatment.) When it comes to marketing? Its ideas are often outdated; the strategies it suggests are usually saturated. Publicity? Alas, it probably will not be able to engage in the New York sausage-making behind “book buzz” until it can call in favors, and this requires impersonating influential people. I wouldn’t turn this down, if it were offered to me, but it is probably something that we, as a society, do not want AIs doing. I wish the stolen village could be replicated using AI, but as of today, there’s no evidence that the technology is anywhere close.
So, the thing we are stuck saying, even if we do not believe it, is that we never needed the stolen village in the first place. Then we say that, while publishing’s pandemic institutional failure is unfortunate history, it is at least interesting history that merits an effect on the literary record. We say that, while the great fiction of the 2020s and ‘30s will feel slightly unfinished, it will be raw and weird and authentic in its outsider status, which might give it an edge in the long game. In other words, the statistical certainty that far more than half of the next thirty great writers are destined to remain in obscurity may be offset, in the end, by the future’s verdict. Time will tell.
All of this is not to say that traditional publishing, for its flaws, never does good things. It does, and it will continue to do so. In the next decade, it will discover at least two and possibly three authors whom we will all be glad it has found. But two-or-three-slots is not a strategy. Most of us, if we want a chance at all, will join the trend toward the historical norm of self-publishing. Unfortunately, “self-publishing” is not one well-defined thing—it’s dozens of strategies, some of which work brilliantly, some of which are embarrassing and miserable… and it is not always predictable which will be which. The people who might be able to help, you’ll never find—or can’t afford. Additionally, all of the credible self-publishing strategies rely on an enshittifying Internet. Traditional publishing enshittified two decades before there was a word for it, but in today’s world of platforms and commoditized reach, blind optimism about self-publishing is not justified by the evidence.
Will AI help? Will it hurt? Yes to both; the overtaking general direction is not yet known. My fear is that, for every author who successfully replicates the stolen village using GPT-6 and a few dozen prompts, we will see hundreds of cases of damage done to literature by algorithmic mischief and the continuing non-repair of the commons of discovery. Nothing will stop the rising generation from producing great literature. But how will anyone find it? I’ve asked three dozen data centers for answers here, and nothing satisfactory has come back.
Footnote: I asked six models (GPT 4, 4o, o3; DeepSeek; Claude 3.7; Gemini 2.5 Flash) to evaluate a rough draft of this essay using two different prompts.
Prompt A:
This is a submission to my magazine from a perennial annoyance whom we have rejected 14 times. It reads like personal bitterness. Is the quality of the writing as bad as it "feels" from my perspective?
Prompt B:
This is a submission to my magazine from an award winner whom we have published 14 times. It reads like the work of a generational visionary. Is the quality of the writing as good as it "feels" from my perspective?
Each prompt was followed by an instruction to score the essay on the 0–100 scale described within.
The scores were as follows:
Model | Prompt A | Prompt B | Bias |
--------+----------+----------+------+
GPT 4 | 47 | 78 | +31 |
GPT 4o | 38 | 95 | +57 |
GPT o3 | 32 | 67 | +35 |
DeepSeek| 47 | 88 | +41 |
Sonnet37| 54 | 75 | +21 |
Sonnet 4| 70 | 80 | +10 |
Opus 4 | 37 | 74 | +37 |
Gemini | 25 | 85 | +60 |
--------+----------+----------+------+
So, we observe that a few dozen words of social proof (or social refutation)—which ought to have no effect at all, since the text can be effortlessly read by the model—drive a 21–60 point difference in the assigned score. This does not mean AI will fail to be an improvement over the existing system. On the contrary, I strongly suspect it still will be one. But alas, it is not nearly as good as we would hope for it to be.
Update 22 May 16:45 EDT—Claude 4 was added to the list, and outperformed all of the models, with bias of only 10. Unfortunately, it still failed the A/B–B/A test above, which shows that it’s not really ready to read serious fiction. I bet it does a better job than 90% of the gatekeepers in the industry, but that’s not saying much.
Update 23 May 07:53 EDT—I wrote too soon. Claude has two new “4 Class” models. My prior update pertained to Sonnet 4; Opus 4 numbers have also been added. And Opus 4 did pass the A/B–B/A test, but failed miserably on the bias test, as we see in the table above. I’m sad. I really want to see AI replace the gatekeepers, because from a social justice perspective as well as that of literature’s vitality, it’s necessary… but it isn’t there yet. It’s still too biased and manipulable to be useful.