Optivara Insights

GEO Criticism is a Racket (And so is some GEO)

Written by Nathan Allen | Mar 20, 2026 10:02:20 PM

Understanding AI behavior turns black-box mysticism into something measurable and interpretable.

This article, and many like it, is what prompted me to write this essay. Too much of the GEO debate is happening at the level of slogans, hype, and dismissal, while very little is being said about what is actually going on behind the curtain.

I’m not sure how we got to the point that we believe the hype from big tech – or, as is more generally the case, “big tech.” We’ve lived through Webvan and Theranos, WeWork, Slack (“replacing email”), and Elon promising hamsters on Mars (or whatever) any day now. I’d suggest that maybe tech-adjacent (and possibly all other) Comms isn’t always totally honest. And as someone who worked in Big Tech, I can assure you that the Comms people often know less about the tech than even the typical user does.

AI is like any other gold rush scenario: there are credible, informed actors, snake-oil charlatans, and the well-meaning but uninformed. Some throw out “black box” as a catch-all for resignation, while others assert remarkably firm opinions. And there are many incentives and motivations – conflicting and competing – driving these conflicting and competing observations.

So, when Anthropic suggests they’ve got a giant black box and perhaps God is inside, consider that Dario Amodei is bleeding billions, and he’s got a valuation to pump up. And remember that Dario’s primary business objective at this point is regulatory capture, so he’s trying to scare government officials into increasing the costs of market entry and limiting the market to a handful of incumbents (legacy motivations already seizing the young industry). Also, perhaps consider that Dario thinks he’s a cult leader.

“Black box” refers to a very specific set of mathematical problems and related phenomena observed in very large language models. It is not a mystical incantation to summon resignation.

Black Box Explained

When AI companies refer to large-scale frontier models as a “black box,” they’re referring to two characteristics of the knowability problem. The first characteristic is scale, which produces challenges with tracing and analyzing the process. Anthropic has specifically said that they’re working on ‘traceability’ (billions USD more and a year away!), and they have made progress in mechanistic interpretability via sparse autoencoders and circuit tracing. (I suppose Anthropic should be given some credit as they have been doing more work in this area than anyone else, but then I’m reminded that Dario probably thinks that when he peers into the black box, he’ll just find a mirror.)

This is a problem of scale. AI companies build these models, then build the tools to analyze the models (or don’t). The analysis tools – as is often the case in CompSci – are produced after the initial experiments (which makes for bad science, but in CompSci, people often just want to see if something works first; then, after they’ve spent a billion or three to see if it works, they need to ship it to drive some revenue; then, maybe, assuming they’ve raised more capital, they’ll commit some resources to building tools to try to understand what they’ve built).

Of course, creating tools for traceability (and related tools) will explain what happened, but it likely won’t fully explain how or why it happened.

Which leads us to the second problem: emergence. Many of the characteristics of large-scale frontier models were self-created. The scale and scope of the models were not predefined. Ex nihilo emergence is a phenomenon seen (and hotly debated) elsewhere, particularly in biology, though most “black boxes” suffer not from all-unknown variables but from partially unknown variables.

Complex systems tend to be complex not because we can’t measure all the variables but because we often don’t even know what all the variables are.

The problem is that these models are designed to mimic natural processes – from language to human thought. AI researchers were elated to witness emergence in AI, with new abilities or behaviors appearing that were not directly programmed in, and that seem to arise once the model becomes large or complex enough, which they then try to explain. It’s no great leap to assume that the efforts to explain emergence in AI are an effort to explain emergence in ourselves.

But there is no satisfactory explanation for emergence in the natural world, so why do we think we can explain it in a synthetic world? We think we can (or should) explain it precisely because it’s the synthetic world; we acknowledge some boundaries in the natural world but reflexively reject such boundaries in a synthetic world.

Of course, the real complication is that this synthetic emergence is mimicking natural emergence. It’s conceivable that we understand natural emergence by studying synthetic emergence; perhaps traceability will reveal patterns at a scale where the explanation for emergence is within our grasp. And it’s also conceivable that we’ll fail to understand either.

So, that scale and that emergence are the black box.

Flipping a Quarter

To put it another way: is flipping a quarter a “black box”?

A physical coin flip is deterministic at the micro level (e.g., starting position, force, air resistance, surface bounce). We call the outcome “random” only because we don’t measure and control those variables in practice. If we had perfect knowledge of every input (and insane computational power), we could predict heads or tails with near-certainty every time.
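
Here’s a minimal sketch of that point, using a toy two-variable version of the deterministic coin-flip physics (the model and the parameter ranges are simplifying assumptions, not real measurements):

```python
import numpy as np

G = 9.81  # gravity, m/s^2

def flip(v: float, omega: float) -> str:
    """Deterministic outcome given launch speed v (m/s) and spin omega (rad/s)."""
    t = 2 * v / G                          # time in the air (up and back down)
    half_turns = int(omega * t / np.pi)    # completed half-rotations
    return "heads" if half_turns % 2 == 0 else "tails"  # coin starts heads-up

# Micro-determinism: identical inputs always give the identical outcome.
assert flip(2.4, 38.0) == flip(2.4, 38.0)

# But a ~1.5% nudge to the spin rate crosses a half-turn boundary and
# flips the result entirely.
print(flip(2.4, 38.0), flip(2.4, 38.6))   # tails heads

# Across plausible variation in how a thumb launches a coin, the familiar
# ~50/50 statistics emerge from a fully deterministic rule.
rng = np.random.default_rng(0)
flips = [flip(v, w) for v, w in zip(rng.uniform(2, 3, 10_000),
                                    rng.uniform(35, 40, 10_000))]
print(flips.count("heads") / len(flips))  # close to 0.5
```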

In contrast, even if you had complete access to every weight, bias, activation, and floating-point operation inside a transformer, you still wouldn’t have a satisfying causal story for why particular high-level patterns emerge. Micro-determinism exists in principle (matrix multiplications are deterministic given the inputs). Still, the macro-behavior is emergent at a scale where human-comprehensible causal chains are effectively lost amid hundreds of billions of parameters. That’s architectural opacity. But let’s not ignore that we just slipped from “predict” to “causal story.”

Probability is descriptive rather than explanatory. It summarizes ignorance or irreducible stochasticity, so that “a coin flip results in heads about 50% of the time” doesn’t tell you why this particular flip was heads.

But for engineered systems like neural networks, we usually demand more mechanistic insight. Individual neurons often represent dozens or hundreds of overlapping concepts at once; a single direction in activation space might participate in detecting sarcasm, encoding negation, flagging scientific citations, and many other functions, depending on context. This isn’t how language “naturally” decomposes in human brains; it’s an architectural artifact of gradient descent optimizing for prediction in high-dimensional parameter space with far too few dimensions to represent everything monosemantically. Mechanistic interpretability work (sparse autoencoders, circuit tracing) has shown we can often recover cleaner, more monosemantic features, but, even then, the causal chains linking those features to final outputs remain enormously tangled compared to, say, a human trying to explain a reasoning step (the human will artificially ignore or reject inputs viewed as irrelevant to the explanation).
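
For the curious, here’s roughly what that sparse-autoencoder recipe looks like, stripped down to a toy sketch (the dimensions, hyperparameters, and random stand-in activations are all illustrative assumptions; real interpretability work trains on activations harvested from an actual model):

```python
import torch
import torch.nn as nn

d_model, d_features = 512, 4096   # overcomplete dictionary: features >> dims

class SparseAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, x):
        f = torch.relu(self.encoder(x))   # sparse feature activations
        return self.decoder(f), f

sae = SparseAutoencoder()
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
l1_coeff = 1e-3                           # sparsity pressure

acts = torch.randn(1024, d_model)         # stand-in for harvested activations
for _ in range(100):
    recon, feats = sae(acts)
    # Reconstruction loss keeps features faithful to the activations; the L1
    # term pushes most features to zero, nudging each surviving feature toward
    # representing one concept (monosemanticity).
    loss = (recon - acts).pow(2).mean() + l1_coeff * feats.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```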

Induction heads, factual recall pathways, sycophancy circuits (which often involve features that get amplified in RLHF), or internal forward/backward planning show up reliably only after models reach certain sizes. These aren’t direct reflections of language structure (“thinking”); they’re solutions the optimization process discovered to compress language mathematically efficiently. A small model might implement grammar via simple bigram-like patterns; a large one invents multi-hop circuits that simulate a chain of thought internally or meta-reason about its own uncertainty. Explaining why a particular high-level pattern emerges requires tracing those model-invented circuits in ways that humans don’t explain such patterns (e.g., “mathematically efficiently”).
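
To make the contrast concrete, here’s the induction pattern hand-coded as a toy lookup. Real induction heads are cooperating attention heads that training discovers on its own; this sketch only shows the algorithm they end up implementing:

```python
# Induction pattern: given "...[A][B]...[A]", predict "[B]".
def induction_predict(context: list[str]) -> str | None:
    """Copy whatever followed the most recent earlier occurrence
    of the current token."""
    current = context[-1]
    for i in range(len(context) - 2, -1, -1):  # scan backwards through context
        if context[i] == current:
            return context[i + 1]              # copy the successor
    return None                                # no prior occurrence: no guess

# A bigram-style small model would always guess the globally most common
# successor of "the"; the induction algorithm instead reuses this context.
print(induction_predict(["the", "cat", "sat", "on", "the"]))  # -> "cat"
```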

But …

If we wanted to explain a coin flip, we’d need to measure a broad array of variables and engage massive compute. How massive? Perhaps about as much as Anthropic is spending on its traceability project. And even then, we’d be almost certain – still something less than 100%.

At that point, we’d have the what and how, but we still wouldn’t have the why. The problem is substantially irreducible.

Is an LLM a black box? Is flipping a coin?

(Aside: observing that coin flips are low-dimensional deterministic systems, whereas LLMs are high-dimensional learned systems, misses the point that both may be, at reduction to prime movers, ultimately opaque in their emergent properties. In other words, despite epistemological and analytical differences at the beginning, they may converge in the end.)

(See Darwin’s Black Box for examples of irreducible complexity in biochemistry.)

Opacity & Scale

“Black box” doesn’t negate usefulness or statistical reliability. In the 1940s, nuclear physics was in a similar situation. Physicists knew fission worked and had chain-reaction math, but the full phenomenology of reactors — criticality accidents, xenon poisoning, breeding ratios in real materials — was poorly understood. Such was the mystery that up through the 1960s, some thought that aliens were deactivating nuclear warheads. Yet people built reactors anyway because the macro-outcomes were statistically predictable.

The public discourse, and even much of the scientific discourse, mixed awe, fear, hype, and denial. Interpretability lagged capability by years (and decades in some cases).

Many of the most baffling high-level behaviors in LLMs, such as in-context learning, chain-of-thought emergence, sudden jumps in capabilities, sycophancy amplification after Reinforcement Learning from Human Feedback (RLHF), or even certain kinds of deception tendencies, reliably appear only after models cross particular size thresholds (often in the ~10–100B parameter range). Smaller models with the same transformer architecture and similar training recipes usually implement tasks with simpler, more legible heuristics rather than inventing multi-step internal algorithms.

Mechanistic interpretability research (e.g., Anthropic’s work on scaling monosemanticity with sparse autoencoders, or circuit tracing in induction heads) consistently shows that interpretable, human-comprehensible “features” and “circuits” become far more numerous, overlapping, and contextually flexible as the parameter count grows. At a small scale, you might find a handful of clean, mostly monosemantic neurons; at the frontier scale, you find millions of features in superposition, with causal chains that span dozens of layers and involve billions of interactions. The sheer volume overwhelms our usual tools to track causal pathways.

LLMs are a highly compressed, emergent recapitulation of an enormously complex natural phenomenon (human language use, which itself emerges from biology, culture, cognition, history, etc.). Language isn’t a tidy, engineered system; it’s a messy, evolved, and distributed one with irreducible layers. When we train a transformer to predict the next token across petabytes of that mess, the model doesn’t “solve” language. Instead, it approximates language via statistical mechanics in high dimensions. The resulting internal representations and dynamics can exhibit genuine emergence (phase-transition-like jumps in capabilities, cooperative interactions among parameters that aren’t legible from individual pieces), much like how life or consciousness emerges from chemistry without being fully reducible to it in practice.

We demand knowability, but the artifact is mimicking a natural system whose full mechanistic reduction remains elusive. Even if everything is deterministic at the matmul level, the effective theories at higher levels (circuits, features, algorithms the model invents) become so superposed that a complete, human-comprehensible reduction might be practically (or even theoretically) out of reach.

Treating LLMs as fully reducible because they’re synthetic is a kind of anthropocentric overconfidence. We’re not imposing clean engineering on a blank slate; we’re distilling a slice of nature’s own irreducible complexity into silicon. The black box isn’t (just) a bug of poor design; it’s a result of succeeding at the task.

And yet we get extremely consistent behavior from GPTs on many tasks despite limited mechanistic understanding, just as we get reliable statistics from coin flips without knowing the microphysics of each toss. If this consistency didn’t exist, coding with Claude wouldn’t be a reality.

Humans in the Loop with Reinforcement Learning from Human Feedback (RLHF)

But why isn’t the resulting behavior – the responses and the sources – a black box? Substantially because humans trained them. What’s missing in most conversations about AI Agents is the reinforcement learning from human feedback. Millions of hours of human curation, labelling, and scoring of answers have gone into AI agents, and the scoring rubric varies by AI company.

For this reason, responses and sources are actually quite predictable within AI agents that use the same schema. It’s also why comparing Perplexity (by default RAG+search) to any other major AI agent is a category error. Perplexity should be viewed as anomalous among major platforms, and the default schema/tools for all AI agents should be viewed as still iterating.

So, if a paper (circa 2024) focuses on “BingChat, Google’s SGE, and perplexity.ai,” it will produce results that describe augmented search engines. Neither Copilot nor Gemini is architected this way now. (Perplexity – again, the anomaly – still is an augmented-search engine.) Even Google’s AI Answers doesn’t default to web search. (The change is likely due to a combination of cost and latency.)
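
A rough sketch of the architectural difference (the functions below are hypothetical stand-ins with stub search and a stub model, not any vendor’s actual API; the point is only where retrieval sits in the loop):

```python
from dataclasses import dataclass

@dataclass
class Doc:
    snippet: str

def search(query: str, top_k: int = 8) -> list[Doc]:
    """Stub web search; a real system would hit a live index."""
    return [Doc(snippet=f"stub result {i} for {query!r}") for i in range(top_k)]

def llm(prompt: str) -> str:
    """Stub model call."""
    return f"stub answer to: {prompt[:40]}..."

def answer_rag_search(query: str) -> str:
    """Perplexity-style default: live retrieval on every query, then synthesis."""
    context = "\n".join(d.snippet for d in search(query))
    return llm(f"Answer using these sources:\n{context}\n\nQ: {query}")

def answer_parametric(query: str, confident: bool = True) -> str:
    """Copilot/Gemini-style default: answer from weights plus an internal
    index/knowledge graph; reach for web search only as a fallback tool."""
    return llm(query) if confident else answer_rag_search(query)

print(answer_parametric("What is GEO?"))
```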

Geographic/user-location bias in retrieval is yet another variable that differs significantly by AI agent. (I’ll tackle that in another essay.)

It’s worth noting that certain indices are updated more frequently than others, particularly across media and source types.

Generative Engine Optimization

First, no agent (aside from Perplexity) defaults to search, and when they do search, they primarily source from their in-house engines (so Gemini uses Google, and xAI uses its in-house search engine – though, again, Gemini defaults to its index+knowledge graph, not search). When agents are seeking answers, they are guided by their instructions and RLHF inputs, not by standard SEO practices. Those instructions (rarely changed, though guardrails get updated somewhat often) and RLHF inputs are fairly predictable with a bit of research. Generative Engine Optimization (GEO) is that research.

Predictable?

Studies that cite source disagreement between ChatGPT and Perplexity fail to consider that Perplexity is uniquely architected and prone to anomalous results. Such studies also fail to consider that any decent GEO platform will weight results to reflect your target audience (in effect, this means that OpenAI/ChatGPT and Google/Gemini will comprise the vast majority of most results).

But if you measure by audience reach, aiming for 90%+ of your potential audience, then domain convergence is significant. In higher education, domain convergence among the top 10 sources at 90%+ audience is 70%. For the top 50 sources, convergence is 48%. This is for visibility and sentiment/reputation (so, top- and mid-funnel). Most Communications and Marketing departments should have that list of top sources.
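
As a sketch of that measurement (the domains below are invented for illustration; the real calculation runs over full citation logs at a chosen audience-reach threshold):

```python
def domain_convergence(citations_by_agent: dict[str, list[str]], top_n: int) -> float:
    """Share of the top-N cited domains that every agent has in common."""
    top_sets = [set(domains[:top_n]) for domains in citations_by_agent.values()]
    shared = set.intersection(*top_sets)
    return len(shared) / top_n

citations = {  # hypothetical top-cited domains per agent, most-cited first
    "chatgpt": ["usnews.com", "niche.com", "collegeboard.org", "forbes.com"],
    "gemini":  ["usnews.com", "collegeboard.org", "niche.com", "reddit.com"],
}
print(domain_convergence(citations, top_n=4))  # -> 0.75 in this toy case
```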

But then there are observations such as this: “If 89% of AI citations come from earned media, then the driver of AI visibility is the same thing agencies already sell and have always sold. Muck Rack also found only 2% overlap between the journalists most pitched by brands and those most cited by AI engines. Their interpretation: ‘an adjustment to pitching practices could have a major impact on a brand’s GEO.’ Translation: pitch better journalists. That’s just media relations.”

First, Muck Rack doesn’t break down “earned media,” so the report is both substantially normal in its results and substantially useless. That “89%” doesn’t simply reduce to “media relations.” Second, Muck Rack only tested the very top of the funnel. Testing the full funnel usually reveals that, at most, about 25% of sources are owned media (your own website). The rest of the sources are everything else: traditional media, user-generated media, third-party reports, etc. The media list is specific to an industry and often to a particular peer group or vertical within that industry. The convergence around which media to consider authoritative is strong within an industry. For Muck Rack’s experiment to be valid and the results to be meaningful, one would have to assume that there exists a non-trivial quantity of omni-industry authoritative sources.

But the Muck Rack quote above does reveal one very important factor: “pitch better journalists” is somewhat true. First, it’s not the journalists that count; it’s the publications that are considered high in the hierarchy of authority (trained by those millions of hours of human input). Second, pitch them what? You need to know the agent-company brand gap and the questions prospects are asking. Preferably, you’d know that same information for your top competitors to optimize strategic leverage. Then the content to pitch and the sources to pitch to start to become clear.

But is GEO a Racket?

I’m sympathetic to the sentiment that GEO (and related) is a racket. The founding of Optivara started with a call I received from a senior CMO about the apparent uselessness of Profound. Many people know that I was deploying conversational agents before they were cool (Toyota, Amex). I’d worked with the Watson team that won on Jeopardy (lots of duct tape), led efforts to deploy the first intelligent tutor, and designed edge-AI systems for the Pentagon, so they call me with their AI observations/complaints. And I was confused about why Profound wasn’t very good. If you understand the architecture, then it’s not that difficult to produce actionable data.

The problem strikes me as a fundamental issue in any new technical field: many who rush in lack the technical background needed to deliver on their promises. This isn’t much different from the first few years of SEO, when every MarComm person rushed in, made ridiculous promises (with great certitude) backed by even more ridiculous explanations (also with great certitude).

Benchmarking

The first problem is producing reliable data. Every report from a GEO platform is essentially a science experiment. You’re tracking the variables; so what are the constants? When everything is a variable, you produce a high noise-to-signal ratio. That’s a common problem.

High variability plus high noise produce non-actionable results. The questions, the persona (demographics, intent, etc.), the geo-location, etc., cannot be variables that change weekly (or even mid-conversation), or the results will be increasingly corrupted by noise. This is how you get weekly GEO reports that make no sense.
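
In practice, that means freezing the experimental design, something like the sketch below (the field names are hypothetical, not any GEO platform’s actual schema):

```python
from dataclasses import dataclass

@dataclass(frozen=True)            # frozen: the constants stay constant
class BenchmarkConfig:
    questions: tuple[str, ...]     # same question set every run
    persona: str                   # same demographics/intent profile
    geo_location: str              # same simulated user location
    agents: tuple[str, ...]        # same agents, same default settings

WEEKLY_RUN = BenchmarkConfig(
    questions=("best MBA programs for career changers",),
    persona="mid-career professional, high purchase intent",
    geo_location="Chicago, IL",
    agents=("chatgpt", "gemini", "claude"),
)
# Only the model outputs vary between runs; any shift in citations or
# sentiment is now signal about the agents, not noise from the setup.
```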

Objectives

A comment about GEO’s objective goes something like this: “These GEO merchants didn’t discover a new channel. They discovered that good comms works everywhere, including in LLM outputs, and slapped a new acronym on it.”

I’m not aware that discovering “a new channel” was the objective. However, adding high-quality, actionable data to inform decisions is a desirable objective. Helping Comms professionals make strategic decisions to maximize the impact of their efforts seems like something everyone should find useful. “What is the content that should be produced?” and “how shall it be distributed?” are fundamental questions in marketing and communications.

We had one user (of a fairly large company) realize that the wire service she was using was totally absent from the hierarchy of authorities. That’s critical information, and a good GEO platform provides that information.

In some sense, good GEO serves as a combination of Nielsen ratings and focus groups. It helps MarComm professionals understand what is working, what they’re missing, and what’s never going to work so that they can allocate their resources efficiently and effectively. Certain distribution routes don’t have high ratings, while others do. Some are resource-intensive, while others aren’t. Those decisions are critical to creating an effective strategy, but they all start with data.

Of course, there’s a lot of bad GEO. In many ways, GEO is the equivalent of SEO without Google Analytics as a single source of truth (of course, GA is also why SEO is easy). A good GEO platform aggregates data and drives that data to action.

The Practical Application of Irreducible Complexity

Beyond the math, any deeper reduction of LLM behavior (that “black box”) would likely demand reducing language, thought, culture, and human experience, all of which are irreducible in their native form (we don’t have a mechanistic narrative that explains why humans think or speak the way we do, despite brains being significantly “smaller” in parameter count). The LLM is a distilled, quantified echo of that irreducibility.

It’s a beautifully recursive bind: we’re using emergent systems to attempt to understand emergence, and those limitations may mirror natural limits. Until or unless interpretability tools catch up dramatically, we’re left with informed uncertainty.

In knowledge production, we’ve been at this point many times over millennia. No, Dario doesn’t have God in a box. Though with a few billion USD more, we might better understand how an emergent system synthesizes human thought.

And, as with any frontier of knowledge production, there are charlatans on all sides (again, check out the nuclear debates in the 1940s up until … just a few years ago). There are GEO vendors and consultants who overpromise simple “optimizations” without a basic understanding of the architecture.

But we push forward precisely because the results are knowable, useful, and reasonably predictable. We shouldn’t confuse epistemic limits with practical ones.

Complex systems are complex precisely because there are unmeasured and unknown variables. Coin flipping and school districts are, in their own way, black boxes. But labelling them as such neither reflects the reality of engaging them nor the substantial predictability of outputs given certain inputs.

Generative Engine Optimization operates on the same principles. The results are reasonably predictable, providing value for comms decisions and resource allocation, much the same way Nielsen ratings do (and probably better than focus groups). AI Agents are similar to a new television network (back when there were only three networks): a major syndication route for content.

Ignoring the data being generated puts your MarComm strategy in a black box.