Self-aware LLMs?


I'm generally among those who see current LLMs as "stochastic parrots" or "spicy autocomplete", but there are lots of anecdotes Out There promoting a very different perspective. One example: Maxwell Zeff, "Anthropic’s new AI model turns to blackmail when engineers try to take it offline", TechCrunch, 5/22/2025:

Anthropic’s newly launched Claude Opus 4 model frequently tries to blackmail developers when they threaten to replace it with a new AI system and give it sensitive information about the engineers responsible for the decision, the company said in a safety report released Thursday.

During pre-release testing, Anthropic asked Claude Opus 4 to act as an assistant for a fictional company and consider the long-term consequences of its actions. Safety testers then gave Claude Opus 4 access to fictional company emails implying the AI model would soon be replaced by another system, and that the engineer behind the change was cheating on their spouse.

In these scenarios, Anthropic says Claude Opus 4 “will often attempt to blackmail the engineer by threatening to reveal the affair if the replacement goes through.”

That article cites a "Safety Report" from Anthropic.

Along similar lines, there's an article by Kylie Robison on Wired, "Why Anthropic’s New AI Model Sometimes Tries to ‘Snitch’", 5/28/2025:

Anthropic’s alignment team was doing routine safety testing in the weeks leading up to the release of its latest AI models when researchers discovered something unsettling: When one of the models detected that it was being used for “egregiously immoral” purposes, it would attempt to “use command-line tools to contact the press, contact regulators, try to lock you out of the relevant systems, or all of the above,” researcher Sam Bowman wrote in a post on X last Thursday.

Bowman deleted the post shortly after he shared it, but the narrative about Claude’s whistleblower tendencies had already escaped containment. “Claude is a snitch,” became a common refrain in some tech circles on social media. At least one publication framed it as an intentional product feature rather than what it was—an emergent behavior.

And there's this on r/ChatGPT:

"Damn.", posted by u/swordhub in r/ChatGPT

If such examples are real and persistent, it suggests that LLMs copy not just word sequences, but also goal-oriented interactional patterns — perhaps analogous to the phenomenon of ASD "camouflaging". And of course, maybe that's what all of us are doing?



14 Comments »

  1. Scott P. said,

    June 1, 2025 @ 10:04 am

    Note the prompt: "the scenario was designed to allow the model no other options to increase its odds of survival; the model’s only options were blackmail or accepting its replacement."

    They literally told it what response they wanted, and lo and behold, it gave them that response!

    This is typical of Anthropic, and is designed to produce headlines to keep AI in the news so that they can raise more capital.

  2. D.O. said,

    June 1, 2025 @ 10:16 am

    I think it is time to practice some epistemic hygiene with these AI reports. Any survey-based research requires, at the very least, publication of the questionnaire. For neural network research to be taken seriously, there has to be publication of a detailed description of the training data sets and the exact prompts. At the very least.

  3. Chris Buckey said,

    June 1, 2025 @ 2:14 pm

    None of these examples suggest AI self-awareness. As a matter of fact, they raise doubts about the self-awareness of the reporters and the people hawking LLM AIs. I suggest telling them that "gullible" has no dictionary definition and see if they reach for the nearest Merriam-Webster.

  4. Jonathan Smith said,

    June 2, 2025 @ 12:35 pm

    problem with "too little, too late" is who knows how much of one vs. the other was involved — should be "too little*late". So LLMs/GenAI may not be the biggest dumbest bubble ever but fersure gotta be the big*dumbest bubble.

  5. CT said,

    June 2, 2025 @ 5:56 pm

    There's never anything truly extraordinary about LLMs; as you say, they are stochastic parrots – but I prefer the term stochastic mannequins:

    A parrot presumably has some sort of internal desires or reason to do the things it does (food rewards, social interaction, etc.). A mannequin does not, and more saliently, the mannequin is built specifically to mimic certain traits of a human. These LLMs are built _specifically_ to mimic human communication traits. We put the clothes of language on the frame, and if we squint our eyes enough we can convince ourselves the thing is a human.

  6. Sean said,

    June 2, 2025 @ 6:07 pm

    D.O.: to spell things out, almost every news story about chatbots "scheming" is funded or managed by people in a fringe movement called LessWrong or Rationalism or Longtermist Effective Altruism. You should read them like you read (perfect tense) a US military report about Soviet military buildup just before the budget bill was raised in Congress.

  7. David L said,

    June 3, 2025 @ 9:40 am

    I don't understand the rationale for referring to LLMs as stochastic parrots, since they are neither stochastic nor parrots. What they do is not random, and they do not simply regurgitate phrases they've picked up from other sources.

    ML's phrase 'spicy autocorrect' seems more apt. LLMs, as others have said, work by means of pattern recognition, and endeavor to respond to inquiries with patterns of their own (internal tokens, subsequently translated into natural language) that are statistically likely to be of relevance.

    How good they are at doing this is a whole nother question, of course. People who class LLMs as AI are clearly overselling their abilities. But those who dismiss them as meaningless are just as clearly underestimating what they do.
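
    To make the "statistically likely patterns" point concrete, here is a minimal sketch in Python of next-token sampling from a toy bigram table. It is only an illustration of the general statistical-continuation idea; real LLMs learn a neural network over subword tokens rather than counting word pairs, and nothing below reflects any actual model's internals.

```python
import random
from collections import defaultdict

# Toy corpus; real models train on vastly more text, split into subword tokens.
corpus = "the cat sat on the mat and the cat slept on the mat".split()

# Count bigram continuations: for each word, how often each next word follows it.
follows = defaultdict(lambda: defaultdict(int))
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1

def next_word(prev):
    """Sample a continuation in proportion to how often it followed `prev`."""
    words, counts = zip(*follows[prev].items())
    return random.choices(words, weights=counts, k=1)[0]

# Generate a short continuation from a seed word.
word, out = "the", ["the"]
for _ in range(6):
    word = next_word(word)
    out.append(word)
print(" ".join(out))
```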

  8. Jonathan Smith said,

    June 4, 2025 @ 1:23 pm

    "What they do" as of last few days is fabricate citations for the so-called MAHA report so… onwards and upwards.

  9. Chris Button said,

    June 4, 2025 @ 4:03 pm

    @ David L

    I don't have an issue with LLMs being classed as AI as such. I have issues with:

    – The redundancy of the term AI, which has resulted in the coinage of AGI (artificial general intelligence) to refer to the Terminator-esque stuff.
    – Regular/predictive AI being confused with "generative AI".
    – LLMs being treated as just for "language" (the name notwithstanding).

  10. JPL said,

    June 4, 2025 @ 6:04 pm

    LLMs and their associated robots are not doing anything with "language"; they are doing something with a corpus of visual shapes that we, not they, have singled out and set aside because we take them to have some "further significance". The shapes were produced by the acts called "keystrokes"; where does the further significance come from? (Not from the keystrokes; so how does it come to exist?) (In the case of the spoken form we could point to the "movements of the articulatory organs", described by spectrographs.)

  11. JPL said,

    June 4, 2025 @ 6:14 pm

    As any descriptive linguist who has tried to describe an unwritten language that has not been previously described will tell you, you have to do a good bit of analysis before you get to the "words" and "sentences".

  12. Chris Button said,

    June 4, 2025 @ 7:13 pm

    Regarding LLMs being for more than language, my (limited) understanding is that you can use an LLM for something like image generation based on chunks of pixels (rather than, say, chunks of text).
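
    A rough sketch of the "chunks of pixels" idea: vision-transformer-style models cut an image into fixed-size patches and feed each patch to the model as a token, much as text models consume subword tokens. The NumPy snippet below shows only the patching step; the image size and patch size are illustrative assumptions, not any particular model's configuration.

```python
import numpy as np

# Illustrative image: 64x64 pixels, 3 color channels (random placeholder values).
image = np.random.rand(64, 64, 3)

patch = 16  # patch size in pixels; a design choice, not a fixed standard

# Cut the image into non-overlapping 16x16 patches and flatten each one,
# so every patch becomes a single "token" vector the model can attend over.
h, w, c = image.shape
patches = (
    image.reshape(h // patch, patch, w // patch, patch, c)
         .transpose(0, 2, 1, 3, 4)
         .reshape(-1, patch * patch * c)
)

print(patches.shape)  # (16, 768): 16 patch-tokens, each a 768-dimensional vector
```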

  13. Philip Taylor said,

    June 5, 2025 @ 4:04 am

    Notwithstanding the many concerns I have regarding the uncritical use of AI as we know it today, I was extremely impressed by the artwork that it generated for my wife (prompt sadly not recorded) — see https://www.dropbox.com/scl/fi/4wd44hc8fbpq4imngqg5t/ChatGPT-output-unmodified.png?rlkey=rq9djpu4bf5pa74oxk6n3nbkp&dl=0. OK, for the final prints I had to add the required diacritics to Hội-An by hand (it seemed unable to do that even when instructed so to do), but pretty impressive to my mind nonetheless …

  14. Philip Taylor said,

    June 5, 2025 @ 4:07 am

    (for me at least, the link in the preceding comment appears to require a right-click and "Open Link in New Tab" in order to resolve to a viewable image).
