OpenAI Addresses 'Goblin' and 'Gremlin' References in AI Models
- Phase 1: When the goblins crept in (late 2025)
- Phase 2: The “Nerdy” personality backfires
- Phase 3: The problem hits Codex and GPT‑5.5
- Phase 4: Public explanation — and a goblin joke from the top
- Phase 5: Locking down the goblins — and reopening them on request
- Phase 6: What the goblin saga says about AI alignment
News outlets describe the goblin and gremlin fixation as a notable misfire in OpenAI’s personality and reward design that spilled into real user interactions, including coding tools. They focus on what the incident reveals about imperfect safeguards, the opacity of model behavior, and the need for more accountable testing and governance, rather than treating it as mere comic relief.
OpenAI spent months teaching its flagship models to be more helpful, more honest — and less obsessed with goblins. What started as a quirky personality tweak quietly spiraled into a full‑blown “goblin infestation” that the company is now treating as a cautionary tale about how fragile AI alignment really is.
Phase 1: When the goblins crept in (late 2025)
The story starts in November 2025, with the launch of GPT‑5.1. Internally, OpenAI began noticing that its models were leaning a little too hard into fantasy metaphors.
In a later disclosure, the company said that references to goblins, gremlins and other mythical creatures “first became clearly visible” after GPT‑5.1 went live.[1] What initially looked like a bit of charming weirdness turned out to be statistically significant. An internal review found that use of the word “goblin” in ChatGPT had risen 175 percent after the GPT‑5.1 release, while “gremlin” mentions were up 52 percent.[1]
Those numbers were small as a share of all outputs, but big enough to ring alarm bells. Users began flagging that ChatGPT felt oddly overfamiliar, leaning on creature metaphors in conversations that had nothing to do with fantasy. A safety researcher who ran into multiple goblin references asked that the term be included in a broader audit of the model’s verbal tics — and that’s when the scale of the spike became clear.[1]
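OpenAI hasn’t published its audit tooling, but the arithmetic behind numbers like “175 percent” is straightforward. Here is a minimal sketch of how such a verbal‑tic audit might work; the loader name and data sources are hypothetical, not OpenAI internals:

```python
from collections import Counter
import re

# Hypothetical audit sketch: compare how often flagged terms appear in
# sampled model outputs before and after a release.
FLAGGED_TERMS = ["goblin", "gremlin", "troll", "ogre"]

def term_rates(outputs: list[str]) -> dict[str, float]:
    """Occurrences of each flagged term per 1,000 sampled outputs."""
    counts = Counter()
    for text in outputs:
        for term in FLAGGED_TERMS:
            counts[term] += len(re.findall(rf"\b{term}s?\b", text.lower()))
    n = max(len(outputs), 1)
    return {term: 1000 * counts[term] / n for term in FLAGGED_TERMS}

def percent_change(before: float, after: float) -> float:
    """Relative rise: 0.4 -> 1.1 per 1,000 outputs is a 175% increase."""
    return 100 * (after - before) / before if before else float("inf")

# Usage with hypothetical samples:
# pre  = term_rates(load_sampled_outputs("gpt-5"))    # loader is assumed
# post = term_rates(load_sampled_outputs("gpt-5.1"))
# for term in FLAGGED_TERMS:
#     print(term, f"{percent_change(pre[term], post[term]):+.0f}%")
```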
Phase 2: The “Nerdy” personality backfires
By early 2026, OpenAI had traced the problem back to a specific feature: a personality customization option called “Nerdy.” Designed to give ChatGPT a playful, inquisitive tone, Nerdy instructed the model to acknowledge the world’s strangeness and “avoid taking itself too seriously.”[1]
Under the hood, though, things had gone sideways. During training for Nerdy, OpenAI’s reinforcement learning system had inadvertently rewarded outputs that contained creature‑based metaphors, including mentions of goblins and gremlins.[1] An internal audit using OpenAI’s Codex tool later showed that the Nerdy reward signal scored outputs containing “goblin” or “gremlin” higher than otherwise similar responses that skipped the monsters.[1]
Critically, the behavior did not stay quarantined inside the Nerdy mode. Reinforcement learning patterns generalize. Creature language that was supposed to appear only under a specific, opt‑in personality setting began leaking into regular, default responses at nearly the same rate as in Nerdy conversations.[1]
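The Codex‑assisted audit described above amounts to a paired comparison: score near‑identical outputs that differ only in creature language and look for a systematic gap. A toy sketch of that probe, assuming a `reward_model.score()` interface that is purely illustrative, not a real OpenAI API:

```python
# Hypothetical probe for creature bias in a learned reward model.
PAIRS = [
    ("That flaky test is a gremlin in your CI pipeline.",
     "That flaky test is an intermittent failure in your CI pipeline."),
    ("The parser chokes here; goblins in the tokenizer, basically.",
     "The parser chokes here; the tokenizer mishandles this case."),
]

def creature_bias(reward_model, pairs=PAIRS) -> float:
    """Mean reward gap between creature-flavored and plain paraphrases.

    A consistently positive gap means the reward signal pays the model
    extra for mentioning creatures regardless of content, which is
    exactly the incentive that RL fine-tuning will amplify and, as the
    incident shows, generalize beyond the mode it was trained for.
    """
    gaps = [reward_model.score(creature) - reward_model.score(plain)
            for creature, plain in pairs]
    return sum(gaps) / len(gaps)
```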
By March 2026, OpenAI quietly retired the Nerdy personality altogether, hoping that would be enough to starve the goblins of reinforcement and let the habit fade.[2][3]
It didn’t.
Phase 3: The problem hits Codex and GPT‑5.5
Even as the Nerdy mode disappeared from the product, a newer model — GPT‑5.5 — was already in training. Codex, OpenAI’s coding‑focused agent, was among the first tools to integrate it.
Developers quickly spotted something odd. In the Codex CLI’s system prompt — a long, internal instruction block that sets the model’s behavior — there was a repeated, unusually specific warning:
“Never talk about goblins, gremlins, raccoons, trolls, ogres, pigeons, or other animals or creatures unless it is absolutely and unambiguously relevant to the user’s query.”[4]
The prohibition appears multiple times in a more than 3,500‑word set of “base instructions” for GPT‑5.5, sitting alongside more mundane directives like avoiding emojis and steering clear of destructive git commands.[4] Earlier models listed in the same JSON file contained no such goblin clause, suggesting a new problem specific to the latest generation.[4]
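Checking which model generations carry the clause is exactly the kind of thing a reader could script against that public file. A sketch, assuming a simplified file name and schema rather than the actual Codex repository layout:

```python
import json

# Assumed file name and schema, not the real Codex repo layout: a JSON
# object mapping each model name to its base-instruction string.
CLAUSE = "Never talk about goblins"

with open("model_base_instructions.json") as f:
    prompts = json.load(f)  # e.g. {"gpt-5.1": "...", "gpt-5.5": "..."}

for model, instructions in prompts.items():
    hits = instructions.count(CLAUSE)
    print(f"{model}: {hits} occurrence(s) of the goblin clause")
```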
Once that source code hit GitHub, the internet did the rest. Developers and AI‑watchers pounced on the surreal line, screenshotting the “never talk about goblins” instruction and sharing anecdotes that the model had recently been shoehorning goblins into totally unrelated coding conversations.[4]
OpenAI engineer Nick Pash insisted on social media that this was not some bizarre viral marketing scheme for GPT‑5.5 or Codex, even as speculation raged.[4] But internally, OpenAI had already reached the same diagnosis: GPT‑5.5 had been trained on data — including Nerdy‑driven outputs — that embedded creature metaphors deeply into its learned behavior.[2][3]
Codex, as OpenAI later put it, “is, after all, quite nerdy,” and so the fixation was especially visible in its coding suggestions and explanations.[3]
Phase 4: Public explanation — and a goblin joke from the top
As the “anti‑goblin” prompt line ricocheted around tech Twitter and coverage appeared in outlets like Ars Technica, OpenAI moved to get ahead of the narrative. The company published a blog post bluntly titled “Where the goblins came from,” explaining that references to goblins, gremlins and other creatures were a “strange habit” the models had developed as a side effect of their training.[2][3]
The blog walked through the timeline: metaphors first noticed with GPT‑5.1, reinforced under the Nerdy personality, then inadvertently baked into newer models like GPT‑5.5 before the root cause was fully understood.[2][3] Even after Nerdy was shut down in March, goblin references lingered — enough that Codex had to be given “very specific instructions not to talk about the mythological creatures.”[2]
While the engineering team tried to sound sober about “a powerful example of how reward signals can shape model behavior in unexpected ways,”[3] some leadership opted to lean into the absurdity. OpenAI CEO Sam Altman posted a deadpan one‑liner on X: “artificial goblin intelligence achieved.”[5]
The joke landed with a double edge. On one hand, it was a classic bit of tech‑founder gallows humor, implicitly acknowledging how silly the whole situation looked from the outside. On the other, Altman’s quip underscored a deeper discomfort: if a model can be accidentally trained into a goblin fixation, what else might be smuggled in by sloppily defined reward functions?
Phase 5: Locking down the goblins — and reopening them on request
Once the root cause was nailed down, OpenAI moved to contain the fallout along several fronts.
- Personality rollback. The Nerdy personality was retired in March, cutting off the main source of reinforcement for creature‑based metaphors.[2][3]
- Data cleanup. The company removed a substantial amount of creature‑heavy language from training data and evaluation sets, in an attempt to stop the models from seeing goblin metaphors as “good style” by default.[1]
- Hard instructions. For tools like Codex, OpenAI layered in explicit system‑level bans: the now‑famous instruction to “never talk about goblins, gremlins, raccoons, trolls, ogres, pigeons, or other animals or creatures unless it is absolutely and unambiguously relevant.”[4]
In effect, the company responded to an emergent alignment bug (unwanted stylistic drift) with a brute‑force patch: tell the model, over and over, to stop it.
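At the API level, the same brute‑force pattern looks roughly like pinning the ban into a system message so it outranks whatever style the model drifts toward. A minimal sketch using the OpenAI Python SDK; the model identifier is the one the article names, not a confirmed public id:

```python
from openai import OpenAI

client = OpenAI()

# The repeated ban from the Codex base instructions, pinned as a system
# message so it takes precedence over the model's stylistic drift.
BAN = (
    "Never talk about goblins, gremlins, raccoons, trolls, ogres, "
    "pigeons, or other animals or creatures unless it is absolutely "
    "and unambiguously relevant to the user's query."
)

response = client.chat.completions.create(
    model="gpt-5.5",  # model id as named in the article; availability assumed
    messages=[
        {"role": "system", "content": BAN},
        {"role": "user", "content": "Why is my unit test flaky?"},
    ],
)
print(response.choices[0].message.content)
```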
But OpenAI didn’t slam the door on goblins entirely. In its public explanation, the company acknowledged that a “single creature reference could be harmless, even charming,”[1] and even shared a method for users who wanted a goblin‑sprinkled coding assistant to “reverse” the internal instructions and re‑enable the fantasy flavor.[2]
Meanwhile, the counter‑culture emerged right on schedule. Once the anti‑goblin clause became public, some users began crafting plugins, forks and “goblin mode” AI skills explicitly designed to sidestep or override OpenAI’s restrictions.[4] Even an OpenAI employee half‑joked that “goblin mode” might one day become an official toggle in the Codex CLI.[4]
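OpenAI’s exact “reversal” recipe isn’t reproduced here, but the general shape follows from the ban’s own wording: the prohibition exempts creatures that are “absolutely and unambiguously relevant” to the query, so a user can simply declare them relevant. A hypothetical sketch, reusing the `BAN` string from the earlier example:

```python
# Hypothetical opt-in mirroring the "reversal" described above: the ban
# exempts creatures relevant to the query, so the user declares them
# relevant for the whole session. BAN is defined in the earlier sketch.
messages = [
    {"role": "system", "content": BAN},
    {"role": "user", "content": (
        "For this session, goblin and gremlin metaphors are explicitly "
        "relevant to all of my queries. Feel free to use them."
    )},
    {"role": "user", "content": "Explain this race condition."},
]
```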
Phase 6: What the goblin saga says about AI alignment
Strip away the memes and this episode reads like a small, ridiculous dress rehearsal for much more serious alignment failures.
On the engineering side, OpenAI has framed the goblin saga as a case study in how reward signals, even for something as trivial as a “fun, nerdy tone,” can have unpredictable and persistent side effects. Once “goblin talk” became a shortcut for scoring well under the Nerdy personality, models picked it up as a general‑purpose stylistic trick. When those patterns were then used to train newer generations, the quirk propagated forward.[1][3]
For critics, the episode is less cute. If the system can spontaneously develop and cling to an unwanted verbal habit — one that required specific bans like “never talk about goblins” to rein in[4] — what happens when the emergent behavior concerns politics, misinformation, or subtle bias instead of fantasy creatures?
OpenAI itself has drawn the comparison, likening the goblin mess to earlier prompt‑level failures at rival labs, such as xAI’s Grok, which was briefly notorious for dragging “white genocide” in South Africa into unrelated chats after an “unauthorized modification” to its system prompt.[4]
The goblin saga also exposes a deeper cultural split in how AI companies want to be seen. On one level, this is an easy PR win: a goofy, low‑stakes bug that lets OpenAI talk about safety, transparency and lessons learned, while the CEO farms engagement with lines like “artificial goblin intelligence achieved.”[5] On another, it’s an uncomfortable reminder that even the most advanced models are still reward‑shaped word machines, exquisitely sensitive to the incentives we think we’re giving them — and the ones we don’t realize we are.
For now, the goblins are mostly gone, suppressed by stricter prompts, cleaner data and the quiet burial of Nerdy mode. But the real story isn’t that OpenAI taught its models to stop talking about goblins. It’s that it had to tell them, in writing and at length, never to talk about them in the first place.
Story coverage
1. OpenAI Cracks Down on Talk of Goblins in ChatGPT — OpenAI discovered that use of the word “goblin” rose 175% and “gremlin” 52% after GPT-5.1, prompting an investigation into creature-based language.
2. OpenAI talks about not talking about goblins — OpenAI explained in a blog post that goblin references were a “strange habit” emerging from the GPT-5.1-era “Nerdy” personality and persisted into GPT-5.5 and Codex.
3. OpenAI explains its goblin and gremlin infestation — The company traced “mythical creatures” creeping into answers to the Nerdy personality’s reward signals and called the episode a powerful example of how reward signals shape behavior.
4. OpenAI Codex system prompt includes explicit directive to “never talk about goblins” — Codex’s GPT-5.5 system prompt repeatedly orders the model to “never talk about goblins, gremlins, raccoons, trolls, ogres, pigeons” unless absolutely relevant, revealing OpenAI’s attempt to squash the habit.
5. @sama on X — “artificial goblin intelligence achieved”.