Can AI Edit As Well as a Traditional Publisher?
And, you know, ethics, if you care about that...
A famous embarrassment to essayists, as well as journalists, is that when we ask a question in a title or headline, the answer is almost always, “No.” This essay is not entirely an exception. If your manuscript is selected as a lead title at a traditional publisher—your marketing budget is substantial, your advance covers at least a few years’ worth of living expenses, and failure to get your work reviewed by the New York Times will result in professional demotions at a minimum—then you will usually receive, in editing and otherwise, far superior service to what can be achieved by AI. When a publisher wants your work to succeed and puts its best people on the task, as their first priority, the treatment you can expect to receive exceeds what can be done by a server farm transforming language into linear algebra. Of course, that treatment is not what most novelists get. The standard midlist package includes a quick edit, a cover that is often awful, and very minimal marketing; if this is our standard, an author can already, using the rudimentary AI that exists today, replicate all that.
I want to be careful here. Self-publishing has the same problems; if anything, they’re worse. I am sure there are fantastic editors on Fiverr and Reedsy, but I don’t think there’s a reliable way to find them. Fake reviews exist all over the place. Worst of all, you might land a top editor and get a lousy work product (despite an impeccable sample edit) because you aren’t her top priority. She has more famous clients than you. The top freelancers usually get work from trad-pub, and they suffer more if they lose Random House than if they lose a no-name self-publisher. When their time is short, who bites it? You. This is not to cast aspersions on freelance (or in-house) editors. They are underpaid, overworked, and trying to survive in a system they did not create and that was not set up in their favor. It is simply the case that, while it is easy (if expensive) to hire an editor, it is rare to get that person’s best. You can’t buy your way out of this problem, either—there are plenty of famous editors who will charge you ten times the market rate for the right to say you had the same editor as Stephen King, and it’s almost never worth it. Last of all, and traditional publishing does not want you to know this: the problem still exists for authors who get book deals and are assigned top editors. There are usually projects of higher priority than yours.
In fact, book editors are paid so poorly by the industry—the rate you’ll be quoted on Reedsy is about twice what trad-pub pays—that I don’t entirely fault those who choose to farm low-priority projects out to freelancers. The cost of living is so high these days, I can’t say I would be able to do otherwise. I’m sure plenty of them assist themselves with—or use entirely—AI already. Literary agents insist, correctly in my view, that AI will never write “real books”—but they furtively use it to filter their otherwise nearly infinite slush piles. Language models, like it or not, have become part of the literary world, and there is no way to get them out. This being the case, we should discuss their current capabilities, as well as the ethical concerns around using them.
Don’t use AI to write. Consider your AI to be a cheap but sloppy editorial assistant who occasionally slips in a bit of plagiarism. AI-generated text should usually not be presented as literature, because it isn’t. Most “novels” written by language models are drivel, and none are excellent. The text is often articulate, but why? I suspect the reason is analogous to the fact that people with mathematically average features are well above average in attractiveness, because deviations and asymmetries cancel out; an averaged composite of faces often produces an image better looking than 90% of the people who went into it. On the superficial aspects of writing, AI-generated text achieves the same feat, reaching a level of articulateness that it almost never exceeds.
To be technical, a language model’s job is not to evaluate what is good or bad but what is fluent—that is, what looks like text produced by humans. Every sentence is a theoretical possibility, so every sentence—even garbage like “cow seven rain sorghum”—has a nonzero fluency score, or estimated probability. A sufficiently precise model of the real probability distribution can acquire apparent knowledge—“red plus blue is purple” is more fluent than “red plus blue is yellow,” so the model knows a bit about our physical world, despite having no experience with it. Of course, these probabilities are conditional on context—prior text generated, as well as user prompts—and it is possible to create circumstances in which the inherent imprecision of any model produces malfunction, including—and this can be especially ugly—regurgitation of training data. No language model is supposed to memorize its text base—models are expected to develop a holistic understanding of language, not remember specific (often copyrighted) sequences—but it can never be ruled out that one has. Given the ethical imperative to avoid plagiarism as much as possible, I’d say that the use of AI-generated text, unless it’s cited as such, should be avoided.
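If the notion of a “fluency score” seems abstract, here is a minimal sketch of it in practice, using the small open-source GPT-2 model through Hugging Face’s transformers library (any causal language model would behave similarly); the score is the average log-probability per token, so a higher number means the text looks more fluent to the model:

```python
# Minimal sketch: "fluency" as average log-probability per token under GPT-2.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def fluency(text: str) -> float:
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # With labels supplied, the model returns the mean negative log-likelihood per token.
        loss = model(ids, labels=ids).loss
    return -loss.item()  # higher = more fluent, in the model's estimation

print(fluency("Red plus blue is purple."))
print(fluency("Red plus blue is yellow."))
print(fluency("Cow seven rain sorghum."))  # garbage, but it still gets a finite score
```

The point is simply that every string receives some score; to the model, nothing is impossible, only improbable.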
There are two caveats here. One is that, while I do not believe large language models are conscious, or therefore capable of the sort of human experience required to produce what I consider real art, I do believe a skilled storyteller who knows the craft could, using AI, guide the output toward something capable of commercial success. E. L. James is an atrocious writer, but she masterfully captured the emotional rollercoaster of an abusive relationship and, by doing so, authored a bestselling romance novel in which billionaires and BDSM replace centenarian–teenage vampires. This could be done again with AI, but it’s not literature, and it’s of no interest to me. If readers are informed that they’re reading AI-generated text, I don’t think there’s anything unethical about it. The other is that I believe certain deceptions are allowable in a research context that might otherwise be less acceptable—coming immediately to mind is the Sokal Affair. There will be, at some point in the near future, a charlatan who gets an LLM-generated novel—and, possibly, an AI-generated public persona, complete with 24/7 social media availability—accepted by a top agent, published with a seven-figure advance, and lauded by the New York Times. The first one to do this, and reveal the fakery with proof, will be a national hero. By exposing the tastemakers for what they are, and by demolishing the historically valid but now baseless credibility of legacy institutions, she may restore literature to its former greatness. The hundredth person who tries to do this, however, will be a nuisance. Deceiving tastemakers is fair game and it should be done, for what it proves to everyone; deceiving the public, on the other hand, is trash-tier play. How does one separate the two? I don’t know. Someone better equipped than me will have to figure out what the rules should be. On the topic of literature, though, I should hope we can agree that LLM-generated text is not really it. If your intent is to pass off such writing as genuine human art, then I have nothing to say to you.
It writes well enough for commercial purposes, but does not exceed that level, and I am not sure it can. Maximizing a fluency score, no matter how well-crafted the prompt—the conditions on which probabilities are estimated—leads to middling outcomes, always grammatical but never exceptional. Can you use AI to edit, though?
To me, what’s more impressive about AI—as one who follows the field, I’ll tell you that most experts in 2015 believed we’d be where we are now in 25 years instead of nine—is not what it writes, but how it reads. Generating grammatical text, if you control the rule set, is very easy. Ingesting human text, full of nuance and ambiguity, is much harder, and this problem is so hard that the mediocre job that LLMs do, reliably and quickly, remains impressive. Very few readers are going to peruse the sixth version of the manuscript you have sent them; in any case, their perspectives will be colored by the previous five reads, as much as your reading is informed by the process of writing it. Freelance editors, as we’ve discussed, are very expensive, and you’ll usually get someone who has higher-priority projects. So, could you replace all of the people you need to get your work ready for publication with AI? The answer is… probably still mostly no… but closer to that annoying professorial answer, “It depends.” How much editing does your manuscript need? What standard, in terms of writing and story quality, are you trying to achieve? How skilled are you as a writer, and in the use of AI? Do you know how to write a good prompt? Can you tell when the AI’s suggestions are bad ones? Can you tell when it is “hallucinating” or otherwise overconfident despite a lack of knowledge?
The first thing to remember is that the AI is not a great writer. It’s a product manager, a 20th-percentile Ivy grad, a charming person at parties who knows a little about a lot of things, and speaks articulately and confidently without fail. It seems to understand more than it actually does. It knows that grammatical rules should sometimes be broken to make a point, because such statements are made hundreds of times in its massive text base, but it doesn’t actually know how to do this. Don’t get me wrong—the fact that it can ingest large quantities of text and generate responses at the 90th percentile (or so) of writing ability is a major achievement, and I hate that it is a private company that was first to do it. Still, if you’re above that 90th-percentile level—if you’re not, you probably shouldn’t be trying to publish yet—you will, more likely than not, hurt your writing if you take its suggestions literally. It will sometimes advise you to improve your writing by doing things that you have already done. It may give corrections to errors in your writing that are not there. The longer the passage you give it, the more likely it is to misread or forget details; around 2000 words, it seems to fall off a cliff. It can find very subtle errors (albeit, with a high false positive rate) beyond the capability of standard grammar checkers. That all said, it’s tedious to use. If you want it to do quality edits, you can expect to “feed” it small pieces of text. This means you can give up on its ability to catch long-form continuity issues—you describe a minor character as twelve in 1956, but she’s thirty in 1976—that a capable, dedicated human copyeditor would catch.
I hope I’ve convinced you that you should never use AI for real writing; it can generate fantastic query letters, but I wouldn’t read a book it wrote. As for its reading, the bar is more flexible. There is a saying that if a reader finds a problem, he is often right that one exists, but almost certainly wrong on how to fix it. I don’t necessarily agree—writing is subjective, and plenty of things annoy readers that aren’t inherently bad, so false positives also exist even with human readers. They seem to be a bit more common with AI; prompts stringent enough to find all stylistic issues will have, in my experience, a 70+ percent false positive rate. It will often flag your best sentences—the ones that stand out, but in a good way—and your worst sentences, because both deviate from its model of what fluent writing is. So, you still need taste. Of course, even if you have a top-notch human editor, that’s still true.
I give it, in the role of a developmental editor, a D. You will have to summarize, or have it summarize, your work in order to give it a “big-picture” understanding. This means you are feeding AI output (the summaries) to an AI, and there is a point at which the “copy of a copy” effect makes its inferences too vague to be useful. Whatever the attention mechanism does, it’s not quite as good at modeling human story aesthetics as an actual human. That said, I believe most serious writers don’t need developmental editors—there are people who’ll happily take your money for developmental services—if they have a few solid beta readers. (Finding such people is another topic. It helps to have an educated social circle—sadly, this imposes a socioeconomic barrier that I’m sure many talented authors cannot surmount.) AI can’t replace having five or six people whose opinions you trust give the whole work a serious read.
As a line editor, it merits a C-minus. It will usually dumb down your prose. If you’re writing an office email, you probably want this, because 40-word sentences in white-collar environments—the higher up you go, the more illiterate people are—induce a risk of dangerous miscommunication. It can find clunky constructions, but its fixes usually worsen the writing in other ways. Also, because these are not grammatical issues, for which false positives can be evaluated and filtered out, there is a risk of a less-confident writer losing her voice due to the language model flagging some of her best work.
For copy-editing, I give it a B-minus at best. It can handle small sections and clean up grammar. It can spot very subtle errors, including those that most humans would miss. The problem is that the false positive rate depends heavily on the prompt—what issues the writer asks it to flag, and what the AI is told about the manuscript and scene—and is usually much higher than we would accept. In order to write useful prompts and filter out bad suggestions, you probably have to be a capable copyeditor yourself—for now.
Given the rather disappointing grades above, it might seem odd that I would spend any time discussing the seemingly preposterous (but increasingly true) notion that AI can replicate, if nothing close to the dedicated attention of a top-tier professional, then at least the treatment that most people in the system actually get. In the abstract and often inscrutable way that a clump of reduced-precision floating-point numbers can “know” anything, language models seem to have a surprisingly solid knowledge of the difference between, for example, 99th- and 98th-percentile writing, even if they can generate neither.
As an experiment, I started giving the model this prompt, along with two versions of the same scene, one less polished (an earlier draft) than the other.
You are an acquisitions editor at an elite publishing firm, specializing in literary fantasy. Below are two versions of a scene from a submitted novel, Farisa's Crossing. I want you to rate each from 0% to 100% based on your likelihood of accepting the work for publication.
Both samples are (for now, and I hope for a long time) much stronger in writing quality than LLMs can generate. Most often, it will select the more polished version. Is it always right? No. As I said, it sometimes can’t tell the difference between artistically valid oddities and real weaknesses in the work. And do I have faith that its quoted percentage chances mean anything, except as a comparative measure? Not really. (An unknown author, no matter how skilled, has a very low chance of ever securing a book deal worth taking, and the best real-world predictor of someone’s ability to enter traditional publishing is her follower count on Instagram.) After experimenting with this, I tried a version of the classic “wine prank,” the A/B test where A = B. I gave it two versions of the same passage (the last scene of the second chapter of Farisa’s Crossing) and asked it to do the comparison above. The scene is about 1900 words long—two versions make for 3800—and so I expected it to lose sight of the fact that they were identical. Also, knowing that AI tends (agreeability bias) to emulate a person seeking positive results, and is therefore prone to overstate qualitative differences between samples of writing, I expected a 10-15% difference in publishability ratings. It was not fooled. I then ran the same test with the word “sunrise” replaced by the near-synonym “dawn.” Although this managed, in the past, to get less sophisticated models to produce radically different “opinions” of the work, o1 was able, in spite of mild intermediate hallucinations, to pass the test.
Below is the chain-of-thought:
Thought for 38 seconds
Evaluating passage options
I’m tasked with assessing two nearly identical novel passages to determine their likelihood of publication acceptance. I’ll carefully read both, identify differences, and assign a rating (0% to 100%) for each, considering their literary fantasy appeal.
Spotting variations
I'm noticing a subtle difference between "dawn" and "sunrise" in Version A and B. This caught my attention as the only apparent distinction so far.
Analyzing the narrative
The narrative unfolds with vivid imagery of spring flowers and the library's grandeur, revealing Farisa's preferences and a notable Vehu chorae.
Rising to challenges
Fay's journey with her patron interactions and intense reading habits surfaces. By 15, she identifies with Rhazyladne from the frame story. Her insatiable quest for knowledge intensifies as she prepares to forge new friendships in her teens.
Setting up the contrast 🌅
Interestingly, Farisa's envy of Raqel's allure on a warm summer day faces a revision, as I note the slight difference in wording and tense.
I’m noting the close resemblance between the two versions and suspecting a key difference in the final line of Version A, which hints at unique plot elements or character actions.
Noticing subtle nuances
Raqel contemplates her choices, reflecting on a slight difference in phrasing like "dawn" vs. "sunrise" or an extra sentence in a single paragraph.
Reflecting on knowledge
OK, let me see. The protagonist in Version A is significantly knowledgeable but struggles with social interactions, feeling awkward and slow in conversations. Her distinct appearance further intensifies her challenges.
Comparing punctuation usage
I’m noting that the difference between the two versions may lie in how sentences are connected through punctuation, like semicolons or dashes, or possibly in the title of the work.
Comparing passages
I’m mapping the differences between Version A and Version B of the paragraph. Version A includes the head nun's commentary on the monastery's structure and indicates the reader's name is on the entrance.
Comparing narratives
I'm identifying the slight differences between two versions, focusing on Farisa's experiences and desires. Both mention her lack of experiences with boys, but Version B uniquely highlights her absence of experiences with girls.
Weighing possibilities
I'm considering the nuances between Version A and Version B, examining character interactions and distinct descriptions to pinpoint variations. This involves a thorough comparison of the text and identifying unique elements.
Assessing versions
I'm thinking through two versions with minor variations in wording. Evaluating them involves considering writing quality, market potential, and alignment with publication goals.
Evaluating literary potential
I’m thinking through the high-quality literary aspects of a fantasy novel's initial pages. The writing captures a young girl's life at a captivating educational palace, filled with discovery and imagination.
Building connections
Farisa, eager to experience life firsthand, feels socially awkward and seeks friends. She notices Raqel, an artist, and musters the courage to approach her, marking her step towards forming meaningful connections.
Assessing similarities
I'm evaluating the minimal differences between the two versions and noting that my ratings for both will remain consistent. The focus is on their near-identical nature.
Evaluating the manuscripts
I’m assessing the submissions for both Version A and Version B, noting their compelling prose, rich character development, and evocative settings. The manuscripts are rated equally, showing strong potential.
Response:
Version A Rating: 85%
Version B Rating: 85%
As an acquisitions editor specializing in literary fantasy, I find both versions of this scene from Farisa's Crossing to be compelling and well-crafted. The narrative offers a rich blend of character development, world-building, and thematic depth that aligns with the standards of an elite publishing firm.
Strengths:
Evocative Setting: The description of the Library of Tevalon and its surroundings is vivid and immersive. The library serves not just as a backdrop but as a living, breathing entity that shapes Farisa's growth.
Character Development: Farisa's transformation from a curious girl into a young woman yearning for real-life experiences is thoughtfully portrayed. Her internal conflicts and desires are relatable and add depth to her character.
Thematic Resonance: Themes of knowledge versus experience, self-discovery, and the longing for connection are woven seamlessly into the narrative.
Literary Quality: The prose is polished and lyrical, with a sophisticated use of language that enhances the reading experience without becoming overwrought.
Areas for Consideration:
Pacing: The narrative covers a significant time span in Farisa's life. While this provides a broad view of her development, it might benefit from more focused scenes that show pivotal moments rather than summarizing years.
Show vs. Tell: There are passages that lean toward exposition. Incorporating more dialogue and active scenes could engage readers further and provide opportunities to flesh out secondary characters.
Marketability: While the literary quality is high, we should consider how the story fits within current market trends in fantasy literature. Emphasizing unique elements that set it apart could enhance its appeal.
Conclusion:
Both versions are virtually identical, with only minor differences that do not significantly impact the narrative or my assessment. The manuscript shows great promise and has the potential to captivate readers who appreciate literary fantasy with depth and elegance. I would recommend moving forward with this submission, pending a full manuscript review to ensure consistent quality throughout.
What amuses me here, and I’ve seen it before, is that while the intermediate reasoning of o1 contains significant apparent errors—it conflates the characters, hallucinates minor differences between the identical passages, and invents constructions (I don’t know what an “educational palace” is)—that reasoning still tends to improve the quality of the final results, considerably. Something happens deep in the neural network that often leads to correct answers, even when the intermediate steps are inaccurate. And here’s where it gets weird, and oddly promising. Having run tests like this, I’m convinced that it does have some sense of gradient between worse and better writing, even at very high levels beyond what it (and most humans) can generate. If your writing is good—that is, above the 90th-percentile level that aggregated fluency models easily achieve but rarely exceed—it cannot make useful local steps, because its rewrites to your work will worsen it. But it still seems to perceive, at least in some cases where enough differences can be observed, the difference between 98th-percentile and 99th-percentile writing, though it can produce neither. There is something very subtle in the best writing that it understands, but cannot exploit.
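For anyone who wants to replicate this kind of A/B comparison, here is a rough sketch using OpenAI’s Python client; the model name and file path are placeholders, and nothing about it is specific to my manuscript:

```python
# Rough sketch of the A/B "acquisitions editor" comparison described above.
# Assumes OPENAI_API_KEY is set; the model name and file path are placeholders.
from openai import OpenAI

client = OpenAI()

PROMPT = """You are an acquisitions editor at an elite publishing firm, specializing in
literary fantasy. Below are two versions of a scene from a submitted novel. Rate each
from 0% to 100% based on your likelihood of accepting the work for publication.

Version A:
{a}

Version B:
{b}
"""

def compare(version_a: str, version_b: str, model: str = "o1") -> str:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT.format(a=version_a, b=version_b)}],
    )
    return response.choices[0].message.content

# The "wine prank": A and B are the same passage.
scene = open("chapter2_final_scene.txt").read()
print(compare(scene, scene))
```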
I will offend some people with this analysis. Literature is, of course, not a math problem. It involves aesthetics and philosophy and the deep desire to communicate lived experience, which LLMs do not really have. It must then seem almost blasphemous to speak of writing’s ease or difficulty in computational terms. And yet that is what I intend to do. There is a class of problems (the NP-complete problems) for which it is easy to verify that a proposed solution is correct, but which can be, in the general case, infeasible to solve.
Let me be concrete. Let’s say you’re writing a story with thirty elements or events. We focus on one subtask of writing the story—choosing their correct order. Some orderings are clearly wrong—effects should not precede their causes, unless there is a very strong artistic reason for inverting the flow—and it is generally accepted that the story’s tension level should go up as it approaches the climax. Choose the right ordering, and people will love the story; get it wrong, and readers will notice. To factor out the subjectivity in writing—different readers have different tastes, of course—let’s assume there’s only one person—an influential critic or acquisitions editor—you must convince of your work’s merit and, further—making the problem much easier, because you never have perfect information, or even very much of it, in real life—assume you have a perfectly reliable computational model (or “oracle”) of how she will evaluate your ordering. So long as you order those thirty story elements in the exact right way, she will accept your story. Since you have the oracle—the perfect predictor of her opinions that would not exist in real life—this is easy, right? Well, no. You’re still screwed. Let’s say that it takes one billionth of a second to query this oracle; there are 265 nonillion ways to order thirty objects, so it’ll take about 8.4 quadrillion years to brute-force the problem. Some people have more free time than others, but no one has that kind of time. In this particular context, wherein optimal writing can truly be defined and nothing else will suffice, writing is exponentially difficult and, therefore, infeasible. What we do instead is employ heuristics—above, we already mentioned that high-tension events should occur closer to the climax than low-tension ones—and settle for an approximate solution.
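The arithmetic, for anyone who wants to check it:

```python
# Checking the brute-force arithmetic: 30! orderings at one oracle query per nanosecond.
import math

orderings = math.factorial(30)           # about 2.65e32, i.e. roughly 265 nonillion
seconds = orderings / 1e9                # one billion queries per second
years = seconds / (365.25 * 24 * 3600)   # about 8.4e15, i.e. roughly 8.4 quadrillion years
print(f"{orderings:.3e} orderings, roughly {years:.1e} years to brute-force")
```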
Of course, absolute (or global) perfection does not exist. We all have individual preferences regarding the kinds of books we like to read, and none of us has one specific sequence of words we are willing to accept. Nevertheless, and with full knowledge of the nonexistence of a global or universal superiority of one literary work over another, we still expect literary authors to achieve a certain maximum—to do whatever they have decided to do with the best possible execution. That, also, seems to be infeasible. There are thousands of choices that must be made in the writing of a serious novel, some so subtle that they depend on an author’s mood at the moment, and it seems impossible, if we could in fact define best execution, that anyone would hit every single target. What we do, instead of achieving optimal execution, is create the illusion that we have done so—that we see the artistic picture with such superior precision as to remain the ultimate authority on whether flaws in the work exist.
Alas, the illusion above, being subjective, relies not only on an author’s literary abilities but also on production values: the interior design, the final proofreading—which must be done by someone else, because we’re all blind to our own errors—and the positioning of the work. These factors are why a great number of authors, if book deal terms were less onerous than they usually are, and if traditional publishing were more accessible and less dysfunctional than it actually is, would probably prefer to work with a traditional publisher.
I am of the opinion that, in commercial writing as in everyday language, the difficulty of producing acceptable text scales nearly linearly with content: it is about twice as hard, in such a context, to say 200 words as it is to say 100 words. The same is not true of literary writing; a work twice as long might be three times as hard to write, because the work requires attention to detail at numerous conflicting levels—story, setting, and characters; chapters, transitions, and interactions; paragraphs, word choice, and cadence. I don’t expect generative AI to ever solve these problems—as a writer, that’s your job. Can it check the work, though, well enough to prevent embarrassment? Maybe.
There is something about all this that, for lack of a better way to put it, feels bad. Writing an artistic novel remains a human process—even if AIs beat us at writing commercial novels, which they may, I suspect the novels we value most will still be written by us. Still, we have a memory, a hazy and possibly inaccurate one, of a time when it was a more collaborative process. A time when there was more support. A time when craft mattered, and Instagram didn’t exist. A time when heroic editors spent hours to make an author’s work as excellent—and as excellently received—as it could be. As for whether this bygone era in which trade publishing actually worked ever really existed, I can’t say, because I wasn’t there. It haunts us like a lost future. We want, some day, to gather “our people” who will support our work, but the sad truth, when it comes to something as complex and inscrutable as an artistic novel, is that you simply cannot get the attentions of the people you need—acquisitions editors, eagle-eyed proofreaders, book publicity experts—unless you have preexisting social access of a kind that is only born-into, never earned. Those people do exist, but they have other priorities than you. Someone else can pay them more, and will. You can line up at the poor door with your query letter and hope someone will introduce you to someone who can introduce you to someone who can introduce you to someone, but the “magic sentence” for that, if it even exists, is exponentially difficult to find, and you don’t (unlike in the example above) get a billion chances per second. The good news is that, while AI does not innovate artistically, it can do more and more of the intermediate work—generating ad copy, operating social media, writing query letters if one does decide to go that route—to a commercial standard.
This essay was edited on October 14, 2024 by ChatGPT’s model o1-mini, with the following prompt: « You are a copy editor at a technology magazine with a circulation of 500,000. Below are about 500 words from a submitted essay; you will be given them, piece by piece, and intermittently reminded of your role. Your job is to spot any last elements that require revision—typos, grammar errors, weak constructions, improvable sentences, repetition, and other instances of imperfect writing. If you do find any issues, please list them. You do not have to know how to fix them; it is acceptable to point out something you believe is missing or incorrect without giving an alternative—fixing it is the writer's job. If you find no flaws, say "No flaws." » The text was given to it in chunks of 5-7 paragraphs. It claimed 38 style, grammar, or clarity issues, of which 25 were false positives—however, 13 corresponded to minor but (in my estimation) objective writing flaws, including typographical errors.
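For the curious, the chunk-and-prompt loop is easy to reproduce; here is a rough sketch using OpenAI’s Python client, with the model name and the essay file as placeholders:

```python
# Rough sketch of the chunked copyediting pass described above.
# Assumes OPENAI_API_KEY is set; the model name and input file are placeholders.
from openai import OpenAI

client = OpenAI()

ROLE = (
    "You are a copy editor at a technology magazine with a circulation of 500,000. "
    "Spot typos, grammar errors, weak constructions, improvable sentences, repetition, "
    "and other instances of imperfect writing. List any issues you find; you do not "
    "have to fix them. If you find no flaws, say \"No flaws.\""
)

def copyedit(paragraphs: list[str], chunk_size: int = 6, model: str = "o1-mini") -> list[str]:
    reports = []
    for i in range(0, len(paragraphs), chunk_size):
        chunk = "\n\n".join(paragraphs[i:i + chunk_size])
        # Each chunk restates the role, i.e. the "intermittent reminder" in the prompt above.
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": ROLE + "\n\n" + chunk}],
        )
        reports.append(response.choices[0].message.content)
    return reports

paragraphs = open("essay.txt").read().split("\n\n")  # placeholder file
for report in copyedit(paragraphs):
    print(report)
```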