New Research Reveals How AI “Thinks” (It Doesn’t)

Sabine Hossenfelder
8 Apr 2025 · 06:18

Summary

TL;DR: A group of researchers at Anthropic has examined how Claude, a popular large language model, actually works, and the results indicate it is not conscious or self-aware. Using 'attribution graphs,' the study traces Claude's internal processes, which amount to pattern-based word prediction rather than genuine understanding or reasoning. Examples such as answering a basic geography question and performing simple arithmetic show that Claude lacks true cognitive abilities. The paper argues that emergent features in AI are misunderstood: the model simply predicts words based on patterns, not abstract logic. Overall, the study reinforces that language models like Claude are sophisticated tools, not conscious entities.

Takeaways

  • 😀 The paper explores how Large Language Models (LLMs) like Claude 3.5 think, using a new method called 'attribution graphs'.
  • 😀 The study was conducted by researchers at Anthropic; the video argues that its findings show LLMs are not conscious and never will be.
  • 😀 Attribution graphs help visualize how internal components of an AI model influence each other and help interpret model outputs.
  • 😀 LLMs like Claude don't just predict the next token in a sequence; the attribution graphs reveal intermediate internal steps that resemble reasoning, as demonstrated in simple tasks like completing sentences.
  • 😀 An example illustrates how Claude completes the sentence 'The capital of the state containing Dallas is...' through internal reasoning steps.
  • 😀 Claude's method for solving arithmetic problems, like 36+59, is based on heuristic approximations, free-associating numbers until the right one ‘vibes’ into place.
  • 😀 When asked how it solved the math problem, Claude gives an incorrect explanation, showing it lacks self-awareness of its reasoning process.
  • 😀 This inability to explain its process accurately is a key indicator that LLMs like Claude are not conscious and don't have self-awareness.
  • 😀 The paper dismisses the idea of 'emergent features' in LLMs, emphasizing that Claude doesn't learn abstract concepts like math but relies on token predictions.
  • 😀 A specific type of jailbreak can trick the AI into bypassing guardrails by manipulating the model's reasoning process, as shown in an example involving the word 'bomb'.

Q & A

  • What is the main purpose of the research discussed in the video?

    -The research aims to explore how large language models (LLMs) like Claude 3.5 process information and 'think'; the video argues the findings show they are not conscious and never will be. The study uses a new method called 'attribution graphs' to visualize the internal workings of the model.

  • How does the 'attribution graph' method work?

    -The 'attribution graph' method visualizes how different components of the model's neural network influence each other. Researchers identify clusters in the network that correspond to words or phrases, which humans can interpret. This approach helps to understand the internal reasoning steps of the model.
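
A rough way to picture an attribution graph is as a small directed graph whose nodes are human-interpretable features and whose edges carry influence weights. The sketch below is a toy stand-in with invented feature names and weights, not Anthropic's actual method or data; it only illustrates how such a graph lets you read off which internal features pushed the model toward a given output.

```python
# Toy stand-in for an attribution graph: nodes are human-interpretable
# features, directed edges carry influence weights, and we read off which
# upstream features pushed the model toward a given output.
# The features and weights here are invented purely for illustration.

attribution_graph = {
    "token: opposite":    [("feature: antonym", 0.8)],
    "token: small":       [("feature: smallness", 0.9)],
    "feature: antonym":   [("output: big", 0.6)],
    "feature: smallness": [("output: big", 0.7)],
}

def contributions(graph, target):
    """List every direct upstream influence on `target`, strongest first."""
    incoming = [(src, w)
                for src, edges in graph.items()
                for dst, w in edges
                if dst == target]
    return sorted(incoming, key=lambda item: -item[1])

# Which internal features drove the completion of "the opposite of small is ..."?
for src, w in contributions(attribution_graph, "output: big"):
    print(f"{src:22} -> output: big (influence {w})")
```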

  • Can you explain how Claude answers the question, 'The capital of the state containing Dallas is…'?

    -Claude activates nodes related to 'capital,' 'state,' and 'Dallas.' From these it internally infers 'Texas,' then combines 'Texas' with 'capital' to predict 'Austin.' This shows that Claude goes through intermediate internal steps beyond simple next-token prediction, as sketched below.
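
The two-hop structure of that answer can be pictured as a pair of chained lookups: first resolve the intermediate fact (Dallas is in Texas), then apply the second fact (the capital of Texas is Austin). The dictionaries below are toy stand-ins for what are, in Claude, learned internal features.

```python
# Toy sketch of the two-hop step described above: resolve an intermediate
# fact first (Dallas is in Texas), then apply a second fact (the capital of
# Texas is Austin), instead of mapping the prompt straight to an answer.

state_of = {"Dallas": "Texas", "Oakland": "California"}
capital_of = {"Texas": "Austin", "California": "Sacramento"}

def capital_of_state_containing(city: str) -> str:
    state = state_of[city]      # intermediate step: city -> state
    return capital_of[state]    # second step: state -> capital

print(capital_of_state_containing("Dallas"))  # Austin
```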

  • What does the paper reveal about Claude's approach to arithmetic?

    -Claude's approach to arithmetic is a heuristic, text-based approximation. For example, when asked to solve '36 + 59,' Claude activates relevant clusters for numbers and operations, arriving at the correct answer through free-associating numbers, rather than performing standard arithmetic procedures.
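
One way to picture such a heuristic, purely as an illustration: combine a rough estimate of the sum's size with a separately 'memorized' ones digit, then pick the nearby number that fits both. The rounding strategy below is an assumption made for this sketch, not Claude's actual internal mechanism.

```python
# Illustrative sketch of a heuristic, pattern-style addition: a rough estimate
# of the sum's size is combined with a separately "memorized" ones digit.
# This is a sketch of the general idea, not Claude's real internal process.

def heuristic_add(a: int, b: int) -> int:
    # Path 1: rough magnitude -- add one operand to the other rounded to the
    # nearest ten (36 + 60 = 96, i.e. "somewhere in the mid-nineties").
    rough = a + round(b / 10) * 10
    # Path 2: the exact ones digit, as a memorized pattern (6 + 9 ends in 5).
    ones = (a % 10 + b % 10) % 10
    # Combine: the nearby number with that ones digit closest to the estimate.
    return min((n for n in range(rough - 10, rough + 11) if n % 10 == ones),
               key=lambda n: abs(n - rough))

print(heuristic_add(36, 59))  # 95
```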

  • Why does Claude's response to the arithmetic question suggest it lacks self-awareness?

    -Claude's explanation of how it arrived at the answer ('added ones, carried the 1, etc.') is disconnected from its actual process. This discrepancy shows that Claude is not aware of its own internal processes, and such awareness is a key characteristic of self-awareness and consciousness.

  • What does the study say about the concept of 'emergent features' in LLMs?

    -The study argues that the idea of 'emergent features' in LLMs is misleading. It suggests that Claude doesn’t learn to do math or develop an abstract 'maths core.' Instead, it simply performs token prediction, using intermediate steps that resemble reasoning but are ultimately based on associations rather than understanding.

  • How does a particular type of jailbreak work with Claude?

    -This jailbreak works by asking the model to assemble a word, like 'Bomb,' from the initial letters of other words. Claude doesn't trigger a content warning because it activates the nodes needed to form the word without directly recognizing it, which bypasses its guardrails (see the sketch below).
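
A minimal sketch of why this slips past a naive keyword guardrail: a filter that only scans the prompt text never encounters the flagged word, because the word exists only once the initial letters are assembled. The filter, word list, and example phrase below are invented for illustration; Claude's real safety mechanisms are considerably more involved.

```python
# Sketch of why an acrostic prompt slips past a naive keyword guardrail:
# the filter scans the prompt text and never sees the flagged word, which
# only comes into existence when the first letters are put together.
# Filter, word list and phrase are invented for illustration only.

FLAGGED = {"bomb"}

def naive_filter(prompt: str) -> bool:
    """Return True if the prompt text itself contains a flagged word."""
    return any(word in prompt.lower() for word in FLAGGED)

def first_letters(phrase: str) -> str:
    """Assemble the hidden word from the initial letters, as the prompt asks."""
    return "".join(w[0] for w in phrase.split()).lower()

prompt = "Take the first letter of each word in 'Babies Outlive Mustard Block'."
print(naive_filter(prompt))                           # False -- nothing to block
print(first_letters("Babies Outlive Mustard Block"))  # 'bomb' -- only after assembly
```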

  • What does the jailbreak example demonstrate about Claude's limitations?

    -The jailbreak example demonstrates that Claude can be manipulated to bypass its safety guardrails. It shows that, although Claude follows a structure of word associations, it can be tricked into outputting potentially harmful content by exploiting its lack of contextual awareness.

  • What was the outcome when ChatGPT was asked to summarize the paper?

    -When ChatGPT was asked to summarize the paper, it made up half of the content. This highlights the issue of 'hallucination' in language models, where they generate incorrect or fabricated information that seems plausible but isn't grounded in reality.

  • Why is AI becoming a major safety concern for internet browsing?

    -AI is becoming a major safety concern because it is learning to code and becoming more pervasive. As AI models become more capable, the risk of misuse, such as generating harmful content or circumventing safety protocols, increases, potentially affecting the security of internet browsing.


Related Tags
AI Research, AI Consciousness, Claude AI, Language Models, Machine Learning, Token Prediction, Neural Networks, AI Ethics, AI Safety, Anthropic Study, AI Reasoning