o3-mini Initial Impressions
I’ve been using large language models experimentally for years, and here are my initial thoughts on o3-mini.
Intermediate Reasoning Still Difficult to Interpret
Prompt: “What is the probability of rolling 35 or higher on 7d6?”
It got the correct answer—1.21%—but its intermediate reasoning initially looked incorrect:
Calculating coefficients
I’m figuring out the number of ways to achieve sums from 28 to 35 using generating functions and the inclusion-exclusion formula. This approach effectively manages patterns and constraints.
Tweaking the strategy
I’m trying another way by using a transformation to adjust the sums and updating coefficients for each sum from 28 to 35. This ensures a more balanced approach to problem-solving.
At first, I found this perplexing. The sums from 28 to 35 are irrelevant, right? Well, the final answer makes clear that it has subtracted one from the dice—it is instead calculating P(Sum ≥ 28) on seven “d6-1” (i.e., dice labelled {0, 1, 2, 3, 4, 5}). It isn’t explained well in the intermediate reasoning, but this is valid.
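To spell out the shift (my own reconstruction, not a transcript of the model’s work): writing each die as e_i = d_i − 1 drops the target by 7, and the number of shifted outcomes summing to exactly t is the standard inclusion-exclusion coefficient for bounded dice sums:

$$\Pr\Big(\sum_{i=1}^{7} d_i \ge 35\Big) = \Pr\Big(\sum_{i=1}^{7} e_i \ge 28\Big), \qquad e_i = d_i - 1 \in \{0,\dots,5\},$$

$$N(t) = \sum_{k=0}^{\lfloor t/6 \rfloor} (-1)^k \binom{7}{k} \binom{t - 6k + 6}{6}, \qquad \Pr\Big(\sum_{i=1}^{7} e_i \ge 28\Big) = \frac{1}{6^7}\sum_{t=28}^{35} N(t).$$

The symmetry N(t) = N(35 − t) also lets you sum the much smaller head t = 0, …, 7 instead of the tail.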
It seems to switch back, at least by its own report, to working on d6’s:
Revisiting calculations
I’m re-evaluating binomial coefficients like C(37,6) and adjusting terms using the inclusion-exclusion principle. Each step sharpens the final probability estimate.
The furnished final answer is long-winded but correct, and it confirms the reading above: the model had recast the problem as the probability of reaching 28 on seven “d6-1” dice. It’s unclear how well o3-mini would perform beyond the level of a toy problem, but here it gets the exact correct answer: 3376/6^7 ≈ 1.21%.
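A few lines of Python (my own check, not the model’s code) confirm the exact figure both by brute-force enumeration and via the coefficient formula sketched above:

```python
from itertools import product
from math import comb

# Brute force: 6**7 = 279,936 outcomes, small enough to enumerate directly.
hits = sum(1 for roll in product(range(1, 7), repeat=7) if sum(roll) >= 35)
print(hits, 6**7, hits / 6**7)  # 3376 279936 0.01206...

# Cross-check with the inclusion-exclusion count:
# N(t) = number of ways seven dice labelled 0..5 sum to exactly t.
def N(t, n=7, faces=6):
    return sum((-1) ** k * comb(n, k) * comb(t - faces * k + n - 1, n - 1)
               for k in range(t // faces + 1))

assert sum(N(t) for t in range(28, 36)) == hits  # both give 3376
```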
Basic Statistical Knowledge Achieved
Prompt: “What distribution would the median of 9 random uniform variables have?”
It gets the right answer here—a Beta(5, 5) distribution. Its explanation is sound and would be easy for an undergraduate mathematician to understand.
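For readers who want the one-line justification: the median of 9 draws is the 5th order statistic, and the k-th order statistic of n i.i.d. Uniform(0, 1) variables follows Beta(k, n − k + 1), which gives Beta(5, 5) here. A quick simulation (my own sanity check, not the model’s output; it assumes NumPy and SciPy are installed) agrees:

```python
import numpy as np
from scipy import stats

# Simulate the median of 9 i.i.d. Uniform(0, 1) draws, many times over.
rng = np.random.default_rng(0)
medians = np.median(rng.random((200_000, 9)), axis=1)

# The median of 9 uniforms is the 5th order statistic, so it should be Beta(5, 5).
result = stats.kstest(medians, stats.beta(5, 5).cdf)
print(result)  # a large p-value here is consistent with Beta(5, 5)
```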
Writing Ability
I have not seen evidence of progress beyond the o1 level—indeed, GPT-4 is often as good a writer as (and perhaps a better one than) the AIs with augmented reasoning capabilities. Like GPT-4 and o1, it can copyedit. It can generate prose at the quality level of an average published book. It has not surpassed that level, and I have strong reasons for believing it will not. The short explanation is that, just as averaging facial features produces faces judged significantly above average in attractiveness, it is very easy for an “averager” to reach the 90th percentile of articulateness while rarely being distinctive. The “smoothing” that AI wants to impose on language removes the voice.
You can, of course, request style changes. If you do, it will caricature the requested style.
For example:
Prompt: Turn this sentence literary. "The clock struck twelve and the rooks flew away."
This is a bit of a malicious prompt, because the truth is that, even in literary fiction, ninety percent of sentences are ordinary, just like the one above. As an artistic author, you do want to show the reader that you are capable of precise description, lyrical prose with attention to meter, and indirect characterization. At the same time, you can’t let the words get in the way of the story. Making every sentence “literary” when most sentences only deserve to be ordinary will exhaust the reader.
This is what o3-mini returns:
At the stroke of midnight, as the ancient clock released its solemn chime, a murmurous exodus began—black rooks unfurled their wings and soared into the ink-dark sky, vanishing like whispered secrets of the night.
I would almost never consider this an improvement. There are times to write like this, but they’re rare—even in literary fiction. It overshoots, and the results are often ridiculous, at least by the modern standard by which, even in literary fiction, ordinary sentences should be preferred for ordinary purposes.
Consider the infamously awful (but, I contend, not all that bad) opening sentence written by Edward Bulwer-Lytton: “It was a dark and stormy night.” There’s nothing wrong with it; it is simply, as an opener, a bit of a missed opportunity. Also, the full sentence is:
It was a dark and stormy night; the rain fell in torrents—except at occasional intervals, when it was checked by a violent gust of wind that swept up the streets (for it is in London that our scene lies), rattling along the housetops, and fiercely agitating the scanty flame of the lamps that struggled against the darkness.
In the context of the time this sentence was written, and the expectation (no longer operative in literary fiction) that ordinary sentences be avoided simply for being ordinary, this is not that bad. I wouldn’t call it good—the parenthetical indication that the setting is London is clumsy by the standard of any era—but it suited the demands of its time well enough to get published. The main reason we consider it terrible today is its failure of proportionality: it puts weight on weather, an important but ordinary detail that today we would consider deserving of ordinary description, so that we can get to what matters—characters and their motivations. “On a stormy night in London, [X] decided to…”
Anyway, the above is the sort of prose that large language models consider “literary.” They’re not objectively wrong, because this sort of thing is inherently subjective. They simply overshoot to a comical degree.
Can It Edit?
As I have done before, I gave it sections of my own writing and asked it to find weak points—the results were not impressive. When looking for stylistic weaknesses, it tends to produce false positives—ask it to find ten weaknesses in your prose, and it will find or invent ten—but it is as likely to flag your prose’s strengths as its genuine flaws. It is probably as effective at editing commercial prose as a Fiverr or Reedsy editor with multiple clients (the danger, if you are a self-publisher, of signing a brand-name editor is that you will be one of her less important clients; she’d rather botch your book and lose just you than botch a trad-pub project and lose a house), but it’s not Maxwell Perkins. AI is not capable of striking the balance between supporting your confidence where your writing is strong and needs no tweaks, and finding the weak points even when you have hidden them well. It is too easily swayed by your prompt.
Oddly, I do find that AI can detect real stylistic weaknesses in prose. The problem is that it’s exhausting to work with, because:
* The false positive rate is high. The aspects of your writing that make it distinctive may also be flagged, and can cause the model to get “hung up,” resulting in further inaccurate inferences.
* Its suggestions are almost never helpful. Even when it finds an objective mistake, the suggested replacement sentence is almost always stylistically worse. This isn’t an issue for me—I wouldn’t use AI-generated writing for anything serious; even when it finds issues, I write the fixes myself—but it will lead people astray.
AI will almost certainly never get to the level of writing literary prose; sadly, it still can’t even edit it very well. This sucks, because the people who can are either rare and expensive (for self-publishers) or part of the Stolen Village that traditional publishing has tucked behind a query wall. We are still not at the point where AI has removed the socioeconomic barriers that publishing imposes (self-publishing, through the high costs of doing it right; traditional publishing, through its nepotism and gatekeeping).
o3-mini’s Self-Assessment of Its Writing Ability
One of the issues with AI, when it comes to writing well, is that you can convince it to say pretty much anything if you know what you’re doing. This is one of the reasons why it can’t replace a human fiction editor—humans are fallible, but they hold their opinions strongly enough that they can’t be swayed, and that’s an asset in an editor who knows what she is doing. You can get an AI “editor” to praise your prose, or you can get it to tear it apart, with the right prompt—the fact that this can be done for any prose passage means that it is not intrinsically helpful, at least not at a holistic level.
What does o3-mini think about its own writing ability, though?
Here’s the prompt:
Ok, now here's a question. Let's say that you, o3-mini, were tasked with writing a novel. Do you think you could write, with minimal prompting:
* an average fan fiction?
* a typical commercial novel?
* a commercial bestseller?
* a literary masterpiece?
Its response is:
I can certainly generate text in a wide range of styles and genres, but each “level” comes with its own challenges and subjective benchmarks. Here’s how I’d break it down:
An Average Fan Fiction:
Yes.
Fan fiction often follows familiar tropes and conventions, and with minimal prompting I can produce a story that meets those expectations. The writing might be serviceable and fun but wouldn’t necessarily break new ground.

A Typical Commercial Novel:
Yes.
I can produce a narrative that follows established commercial fiction conventions—clear plot arcs, marketable characters, and accessible language. With minimal prompting, the work might hit the standard beats expected in commercial fiction, though it might not have the polish or originality that distinguishes the very best.

A Commercial Bestseller:
Possibly, but with caveats.
Crafting a bestseller is about more than just following a formula—it involves timing, originality, character resonance, and often a bit of luck with what readers are drawn to at a given moment. While I can mimic the style and pacing of many bestselling works, whether a text becomes a “bestseller” depends on factors beyond the writing itself (marketing, cultural zeitgeist, etc.). With refined prompting and iterative feedback, I could produce something that has the hallmarks of a commercially successful novel, but “bestseller” status is never guaranteed by the text alone.

A Literary Masterpiece:
This is the toughest challenge.
A literary masterpiece is a moving target—it’s defined by deep thematic resonance, innovative use of language, emotional and intellectual impact, and often a uniqueness that defies formulaic approaches. I can attempt to generate writing in a style that’s evocative and thought-provoking, but truly achieving what many would call a “masterpiece” would likely require extensive prompting, iterative revisions, and subjective judgment that goes far beyond minimal guidance. Literary masterpieces are often the result of a long creative process, and while I can simulate aspects of that, it’s hard to say I could create one on minimal prompting alone.

In summary, I can comfortably produce an average fan fiction or a typical commercial novel with minimal prompting. Generating something that meets the commercial bestseller criteria or is widely recognized as a literary masterpiece would be more challenging and would likely require more detailed guidance, iterative refinement, and a bit of luck with subjective taste.
I would largely agree with its self-assessment. I do believe it is capable of writing commercial bestsellers today, although probably not “with minimal prompting,” because beating random guessing would take human knowledge of trends, as well as of the preferences of powerful individuals within publishing—writing quality isn’t the issue; “bestseller” is a competitive distinction, so any single book starts with a low probability of achieving it. To be more technical, bestsellerdom is a sparse-reward reinforcement learning problem, and the feedback cycle is so long (unlike in a board game, where an agent can play thousands of games against itself per second) that model-free approaches without guidance are unlikely to work.
I do think that o3-mini underestimates the gap between its level of writing ability and what it would take to achieve literary excellence, but this is not surprising, because (a) language models tend to struggle with “second-order knowledge” (that is, knowing when they don’t know how to do something), and (b) these agents tend to “guess toward the middle,” whereas my position is that large language models are completely unsuited to real artistic fiction.