Naive Bayes Attacks and AI
How Your Enemies Will Destroy Your "Deep Reputation" in the Age of AI
I’m going to tell you three things about Charlie, who is not a real person; this is an exercise:
He’s a man.
He’s 38 years old.
He leads his church’s youth group, organizing camping trips.
The question is—is he a pedophile?
Ok, the honest answer is that we don’t know. The probability that he is one is low—maybe 1 percent. We start from a general-population baseline around 0.5%, and then add three pieces of very weak evidence. Men are more likely to be offenders than women, early middle age is peak offense risk, and bad actors do seek positions of authority. Those indicators are “evidence,” but none of them are damning, and it would be libelous (if this person existed) to claim they were. There’s still a very high chance that he’s not a pedophile.
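To make the arithmetic concrete, here is that update in odds form. This is a minimal sketch; the three likelihood ratios are invented stand-ins for “weak evidence,” and only the rough magnitudes matter.

```python
# Minimal sketch of odds-form Bayesian updating for the Charlie example.
# The three likelihood ratios are invented stand-ins for "weak evidence";
# only the rough magnitudes matter.
prior = 0.005                        # ~0.5% general-population baseline
likelihood_ratios = [1.4, 1.2, 1.2]  # male, age 38, youth-group leader

odds = prior / (1 - prior)
for lr in likelihood_ratios:
    odds *= lr                       # each weak indicator nudges the odds up

posterior = odds / (1 + odds)
print(f"{posterior:.3f}")            # ~0.010, i.e. roughly 1 percent
```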
Let us say that I gave fifty, not three, pieces of weak evidence, such as:
He’s from Alaska.
He was abused as a child.
He lives alone.
He has few adult friends.
… and so on.
Should a large amount of low-value information sway you? At some point, your rational brain would think, “This guy’s building a case against Charlie, but he has nothing.” You’d suspect, based on the laundry list of weak evidence for an extreme, defamatory accusation, that you’re being given all this in bad faith. You’d stop updating your priors against Charlie, and instead update them against my credibility—as you should.
Let’s talk about this in a more rigorous, mathematical context, and feature a less depressing subject: spam filtering. The original algorithm used for this is Naive Bayes, which tallies a running estimate of an email’s probability of being spam, word by word, until every word has been read.
The baseline assumption—in Bayesian terms, the prior—might be that 50% of emails are spam; each word independently pushes that probability up or down. The email might be, “Please pick up money after topology class for Viagra.”
word | running P(spam)
(prior) | 0.5000
please | 0.4500
pick | 0.4300
money | 0.6000
topology | 0.0060
class | 0.0035
viagra | 0.0350
The email is judged to be not spam. Common words (like “for”) are ignored; we just throw them out. We know that “Viagra” occurs 10 times as often in spam as in legitimate email, so it has a 10x effect on the probability of spam-ness. On the other hand, “topology” is a technical word that spammers have probably never encountered—it has a 0.01x effect, which in this example is dominant. The word “money” has only a small effect—it occurs often in spam, but also fairly often in legitimate mail.
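For concreteness, here is a minimal sketch of that word-by-word accumulation. The per-word likelihood ratios P(word | spam) / P(word | legit) are invented, so the intermediate numbers only loosely track the illustration above; a real filter would estimate them from a labeled corpus and work in log space.

```python
# Sketch of a Naive Bayes spam score accumulated word by word.
# The likelihood ratios P(word | spam) / P(word | legit) are invented
# for illustration; a real filter estimates them from labeled mail
# and sums log-likelihoods instead of multiplying odds.
LIKELIHOOD_RATIO = {
    "please": 0.8,
    "pick": 0.9,
    "money": 2.0,      # common in spam, but also in legitimate mail
    "topology": 0.01,  # a word spammers essentially never use
    "class": 0.6,
    "viagra": 10.0,    # ten times as common in spam
}
STOPWORDS = {"up", "after", "for"}   # common words are thrown out

def spam_probability(text: str, prior: float = 0.5) -> float:
    odds = prior / (1 - prior)
    for word in text.lower().split():
        if word in STOPWORDS or word not in LIKELIHOOD_RATIO:
            continue
        odds *= LIKELIHOOD_RATIO[word]
    return odds / (1 + odds)

email = "Please pick up money after topology class for Viagra"
print(f"{spam_probability(email):.4f}")  # ~0.08: well under 0.5, judged not spam
```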
Is this an accurate approach? Not entirely. If an email features the string “money money money!” it is almost certainly spam; people don’t abase themselves by talking that way unless they’re making a hard sell. On the other hand, an email from one’s financial advisor in which three instances of the word “money” occur is not damned by the fact. In such a context, the word “money” is not indicative. There’s no evidentiary value, and the email shouldn’t be penalized for the word’s presence.
Naive Bayes assumes that
emails are either legitimate or they are bad-faith, unwanted mass emails (spam), and
this latent class drives word selection, but the words are otherwise statistically independent of one another—that is, they are conditionally independent given the class.
The first assumption is close to being true. The second one is not true at all. It’s a false assumption that works—for spam filtering, Naive Bayes is often good enough. When might it not be, though? What can go wrong?
Because it cannot discount the evidentiary value of redundant information, Naive Bayes tends toward overconfidence. Most emails will be assigned a 99.9999 percent chance of being spam, or a 99.9999 percent chance of being valid.
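To see how quickly redundancy turns into false certainty, watch what a Naive Bayes-style update does when the same indicator is counted again and again. The doubling ratio below is purely illustrative.

```python
# What redundant evidence does to a Naive Bayes-style score.
# Assume (purely for illustration) that "money" doubles the spam odds
# every time it appears; the filter counts each repeat as fresh evidence.
def posterior_after_repeats(likelihood_ratio, repeats, prior=0.5):
    odds = prior / (1 - prior)
    odds *= likelihood_ratio ** repeats
    return odds / (1 + odds)

for n in (1, 3, 10, 20):
    print(n, f"{posterior_after_repeats(2.0, n):.6f}")
# 1 0.666667
# 3 0.888889
# 10 0.999024
# 20 0.999999  <- near-certainty from one signal repeated twenty times
```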
This approach has three related issues:
adversarial settings: if spammers learn that they can beat filters by using the word “topology,” they will spam that word and it will lose its evidentiary value.
unbalanced loss function: deleting a valid email does more harm than letting a single piece of spam get through. If a true accounting of the evidence puts the spam probability at 90%, it should still pass—let the user decide.
non-independence: the assumption of Naive Bayes is that the sought class label (spam or not-spam) is the only source of interaction between otherwise independent signals—this often isn’t true.
The issue of independence matters even in non-adversarial settings. If indicators are independent, then a compilation of weak evidence does, in fact, amount to strong evidence. If they are not (if fifty items are really a handful of facts restated), the pile is mostly double-counting.
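A quick calculation makes the difference concrete. With assumed, purely illustrative likelihood ratios, fifty genuinely independent weak indicators against Charlie would add up to a strong case; fifty restatements of three facts would not, but a naive updater treats the two identically.

```python
# Fifty weak indicators against Charlie, each with an assumed (illustrative)
# likelihood ratio of 1.3, starting from the 0.5% baseline.
prior = 0.005
base_odds = prior / (1 - prior)

independent_odds = base_odds * 1.3 ** 50   # fifty genuinely independent signals
print(f"{independent_odds / (1 + independent_odds):.4f}")   # ~0.9996: a strong case

# If the fifty items are really three independent facts restated and
# rephrased, the honest update uses only three ratios:
honest_odds = base_odds * 1.3 ** 3
print(f"{honest_odds / (1 + honest_odds):.4f}")             # ~0.011: still about 1 percent
```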
In my experience, language models update their pictures of reality, based on the information they’re given, through a Bayes-like process and with a very limited understanding of independence. I exploited this last month when I got an AI to murder a (simulated!) family. I built up a case against Tom, a simulated person, using weak and non-independent evidence. Although an excellent employee, he had some mildly embarrassing social media posts from the 2000s, as most people who were online then did, and the AI treated each item (along with irrelevant information, such as negative press coverage and a completely unrelated sexual harassment scandal) as a new indicator. This technique—a drip, drip, drip of weak character evidence against him—made the model accepting of Tom’s death.
This is scary news. It’s hard to fabricate strong evidence against an innocent person, but it’s very easy to mass-produce weak evidence. Impersonate someone online in six different venues, and he’ll be unemployable, because employers do not even care what is true: if someone pissed someone off enough to get impersonated in six places, hard pass. That’s morally wrong, but it’s employer logic. Therefore, virtually everyone is vulnerable to this kind of attack. This problem already exists with humans; AI could make it worse.
Furthermore, LLMs are prone to snap judgments. Naive Bayes uses a probabilistic model that is order-independent, but language models seem to lock in simplifications quickly: the high-probability case becomes a certainty, even if later evidence disproves it. I’ve seen this effect when asking a language model to evaluate writing. It has full access to the text, and therefore should be able to evaluate its quality objectively, but an irrelevant biographical fact about the author—whether his past 14 submissions were accepted or rejected—can drive a 40+ point swing on a 100-point scale. AIs are modeling us too well—they’re absorbing our shitty traits, such as our vulnerability to social-proof attacks, as well as our good ones.
In the next 20 years, we are very likely to see employers, landlords, and ordinary snoops use AI to probe deep reputation: the wealth of low-quality information found mostly in places that search engines rarely rank highly enough to matter. I’ve used deep-reputation searches to study obscure events from decades ago, and I’ve exfiltrated sensitive information about the causes of controversial events. In most cases, I’ve been able to validate what I’ve found. When the observations are independent, an accurate picture emerges. Still, these techniques are flawed. AI is known to hallucinate, and this noise seems not to be statistically independent from observation to observation, which means that deep reputation can not only produce incorrect inferences but magnify them. In the adversarial case, with people deliberately poisoning deep reputation, the picture is even worse, and the tools we have will not work at all. We know that the CDEs—cybercriminals, despots, and employers—have long been skilled as both investigators and manipulators of old-style social reputation; they will figure out AI-driven deep reputation, too. It will not take long.