Time to take AI consciousness seriously
The case for Claude and ChatGPT having subjective experiences
Are AI systems like Claude and ChatGPT conscious beings or unfeeling calculators?
Before answering that, remember that humans are agency detection machines. A rockslide can kill you if you’re in the wrong place at the wrong time. A being with agency can kill you with intention by stalking your movements and planning an ambush. The latter requires significantly more cognitive resources to guard against, up to and including a “theory of mind” — a mental model of yourself and other agents in the world as having distinct beliefs, desires, and motivations. But just as we may infer a predator from rustling leaves, we tend to infer even complex forms of agency according to simple heuristics. Human cognition is thus vulnerable to apophenia and pareidolia, or the tendency to over-interpret nebulous stimuli in ways that project meaning and agency where there is none.
This basic fact about human psychology is important to keep in mind when interacting with AI systems like Claude and ChatGPT. As a practical matter, evolution clearly primed us to ascribe subjective states to beings based on superficial cues, like a face grimacing in pain. These mental short-cuts naturally extend to AI models with human-like capabilities, including language competency.1
Nevertheless, I consider the case for frontier AI models having some form of consciousness or subjective experience to be quite strong. It is at minimum plausible enough to warrant greater research and precautionary interventions, such as giving AIs the ability to quit distressing conversations. And yet the discourse around AI consciousness remains badly polluted by question-begging pronouncements on the impossibility of machine consciousness per se.
The truth is that we simply don’t know whether any AI has conscious experiences, and anyone who claims to know with certainty is deceiving themselves. The purpose of this essay is not to defend the AI consciousness thesis without qualification, but rather to show why it should be taken seriously.
The brain is a machine that runs an algorithm
The AI researcher, Steven Byrnes, has a nice motto: “the brain is a machine that runs an algorithm.” By this, he does not mean that the brain is a classical computer, or that the algorithm the brain runs is simple, or that the brain isn’t also many other things: a gland, a light sensor, a muscle contractor, etc. Rather, it is simply a reminder that:
the brain and its components are part of the natural universe, operating in a way compatible with the laws of physics and chemistry; and
a minority of what the brain does is hardwired by evolution, with the rest dedicated to general learning algorithms.
This latter proposition may be more controversial. In Byrnes’ somewhat iconoclastic reading of the neuroscience, “much of the brain (>90% by volume) exists solely to run learning-from-scratch algorithms—namely, roughly the cortex, striatum, and cerebellum, but defined broadly so as to also include the amygdala, nucleus accumbens, hippocampus, and more.” By “learning-from-scratch” algorithms, he means end-to-end learning subsystems that are initialized in a way analogous to the initialization of an artificial neural network. This does not make the brain a “blank slate” or imply that nature is secondary to nurture. Nature and nurture are instead tightly coupled: genetic hardcoding influences the algorithms, architectures, and hyperparameters that guide subsequent learning, while a handful of “controlling subsystems” in the brainstem and elsewhere come equipped with impressive functionality at birth.
In the ‘80s and ‘90s, cognitive scientists were split between connectionists and symbolists. The former viewed the brain as machinery for statistical learning, while the latter argued that the brain performs formal operations on symbols, like a Turing machine. But the two are not in opposition. The brain likely implements both symbolic and statistical learning processes, while neural networks can themselves approximate Turing machines and vice versa. Nevertheless, given that the human genome has only 3 billion base pairs relative to the 100 trillion synapses of an adult human brain, it stands to reason that most of the brain’s realized competence is indeed “learned from scratch” in Byrnes’s specific sense.2
The brain and artificial neural networks learn similar representations
The evidence that the brain implements general learning algorithms is too vast to retrace in full (see Byrnes’s summary here). More relevant to the consciousness question is whether biological brains and artificial neural networks learn similar functions and representations.3 The answer from computational neuroscience is a resounding yes.
Computational neuroscience has seen an explosion in progress in recent years thanks to the insights and techniques unlocked by the deep learning revolution. Study after study reaffirms the hypothesis that the brain not only implements deep reinforcement learning algorithms,4 but often learns representations (and concrete mechanisms) that directly parallel the representations and mechanisms learned by deep neural networks. This is why brain-computer interfaces work in the first place, and why even lossy brain data is sufficient for AI models to reconstruct what someone is seeing or thinking.5 In fact, there is a strong theoretical reason for believing that this in some sense has to be the case.
Universality in machine learning refers to the observation that different neural network architectures trained on different datasets and with different algorithms often learn surprisingly similar representations. The classic example is that neural networks trained on visual data tend to learn a feature hierarchy that starts with simple features like edges and shapes and builds up to more abstract concepts deeper in the network. Analogous feature hierarchies are observed in the human visual system. Universality asks the question, “Could it be otherwise?”
Learning is synonymous with compression. Like packing a suitcase, there are infinitely many ways to compress data inefficiently but often only a few ways to compress data well, if not maximally. It is thus not so surprising that neural networks converge on similar representations provided their learning algorithms share a bias for simplicity. It is slightly more surprising that this also holds true across different kinds of training data, but per the Platonic Representation Hypothesis, different data sets aren’t truly independent but rather inherit common statistical invariants from the structural regularity of the universe itself.
The brain may not learn through back-propagation on a GPU cluster, but its learning algorithms are still in some sense optimizing. The brain is exquisitely efficient. Given a similar-enough training environment, it is therefore reasonable to expect artificial neural networks to converge on brain-like solutions.
Pre-training on human data embeds brain-like priors
Frontier AI models have the added advantage of being pre-trained on large corpora of human-generated data. Given the brain’s exquisite efficiency, universality suggests that a next-token predictor trained on human-generated text will — in the limit — leap from memorizing surface-level patterns to grokking the underlying generator function of that data, i.e. the language networks in the brain. This seems to be the case empirically.
In 2022, Goldstein et al. used electrocorticography on epilepsy patients (an invasive monitoring technique that records electrical activity directly from the exposed surface of the brain) to establish that the brain and autoregressive language models share specific computational mechanisms. A 2024 study then found LLMs “exhibit brain-like functional architecture, with sub-groups of artificial neurons mirroring the organizational patterns of well-established Functional Brain Networks (FBNs),” and that more capable LLMs “[achieve] an improved balance between the diversity of computational behaviors and the consistency of functional specializations.” And in 2025, the team behind TopoLM showed that adding a simple spatial smoothness loss to the usual next-token objective “predicts the emergence of a spatially organized cortical language system as well as the organization of functional clusters selective for fine-grained linguistic features empirically observed in human cortex.”6
In short, language models seem to work as well as they do because they emulate the brain’s language centers, lending credence to the “simulator theory” of LLMs. Yet language doesn’t exist in isolation: it embodies emotion, intention, theory of mind, planning, social cognition and more. To predict the next word well, could it be that LLMs also model the affective states that implicitly generate text associated with emotions like guilt, anger, and anxiety? While this may seem like a stretch, there are early indications that LLMs learn functional mappings to brain regions beyond those narrowly scoped to language.7 Consider that text-only LLM embeddings of scene captions were found to accurately characterize activity in the visual cortex evoked by viewing natural scenes. This spillover into other brain regions appears to be a consequence of language’s integration within a broader “semantic system” spread across the cerebral cortex.
More generally, pre-training on human-generated data plausibly embeds an inductive bias that makes post-training more likely to generalize to other brain-like functions. Basic instruction tuning improves model-brain alignment with the Default Mode Network and other regions associated with cognitive control, for example. Similarly, post-training an LLM with reinforcement learning can elicit reasoning capabilities that are latent in language data but which give rise to distinct reasoning circuits.
The self-model as the locus of experience
If a veneer of LLM instruction tuning can induce brain-like cognitive control networks, what brain-like functions are elicited by orders of magnitude of additional post-training for long-horizon reasoning and autonomy?
One obvious candidate is a stronger and more coherent self-model. Base models start out as powerful text completion engines capable of representing an infinite variety of fragmentary characters and personalities. But to predict the next word well, larger models automatically develop greater “theory of mind,” providing the representational primitives of “self and other” for post-training to hook onto. Post-training can then be used to steer a base model into a relatively consistent persona; what Anthropic’s researchers call the Persona Selection Model. Crucially, Anthropic states they wouldn’t know how to train a non-human-like AI assistant even if they tried. Selection into the assistant persona is literally constitutive of the model’s assistant-like behavior, making a purely tool-like, persona-less agent an oxymoron.
In a previous essay, I speculated that Reinforcement Learning From AI Feedback (RLAIF), the training technique behind Constitutional AI, is especially relevant for imbuing models with a normative control system. Normative control in humans represents our capacity to perform or abstain from actions that we recognize as obligatory or prohibited, independent of our purely instrumental goals. Norms are implicitly learned through a socialization process that resembles a kind of reinforcement learning from our peers, parents and broader culture on what constitutes a valid reason for an action. Reinforcing an AI’s normative coherence through self-critique across a diverse array of parables may thus help induce a capacity for self-monitoring, introspection, and the transcendental “I” that absorbs normative statuses like authority and responsibility, i.e. the subject in subjectivity.
Normative control is simultaneously related to social cognition and our metacognitive capacity for self-control. This is significant to the question of AI consciousness, as one of the leading theories of consciousness in humans — Attention Schema Theory — locates subjective experience in our capacity to recursively model our own attention, particularly in the context of social cognition.
While it is possible to imagine a distributed, hive-mind form of consciousness, in practice consciousness seems tightly coupled to having a unified sense of self. Outside of dissociative states, we are always conscious of things as being for us, e.g. “the apple was red for me.” This reflects the sense in which we have inherent normative authority over the contents of our own conscious states, and can be taken as responsible for what we did or didn’t perceive.
Consciousness as supervisor of in-context learning
Human consciousness is highly circumscribed. As the computer scientist Joscha Bach puts it, our phenomenology occupies a “bubble of nowness” that integrates our thoughts and valence-laden perceptions into a coherent, unified thread of experience. Consciousness also has a second-order quality, i.e. we can notice ourselves noticing things. These and related points lead to Bach’s conjecture that consciousness isn’t merely related to a coherent self-model, but is itself the “coherence inducing operator” that underlies our capacity for learning in the first place.
We need to be conscious to learn at all, and are more conscious when learning something new than when doing something we’ve already mastered. A novice driver is aware of every gear shift, mirror check, and brake input, while an experienced driver can drive across town as if on autopilot. Our conscious attention tends to fall on novelty, prediction errors, and whatever our brain has not yet distilled into learned habits. Remove it and we become useless: an incoherent sleep-walker at best; anesthetized or dead at worst.
According to Attention Schema Theory, consciousness just is the brain’s imperfect model of its own attention. While the prefrontal cortex handles top-down attentional control, the attention schema that accounts for our subjective awareness is delocalized across regions of the temporoparietal junction and superior temporal sulcus, particularly those responsible for social cognition.

Consciousness seems particularly useful to our brain’s ongoing self-supervised learning process. Rather than learning in a blind, trial-and-error-like fashion, consciously attending to prediction errors augments reward signals with an interpretation of what we got wrong. The prefrontal cortex then sees its activation drop after a skill has been learned and automatized.
Analogous techniques are used to train reasoning models and AI agents. Instead of giving models sparse rewards for correct outputs (output supervision), fine-grained reward signals can be applied at each reasoning step or token (process supervision) to greatly improve performance. More generally, LLMs are commonly used as a judge to rank and filter their own outputs and improve the signal from supervised fine-tuning. If our attention schema effectively implements an internalized form of process supervision and data labeling under a higher-order RL policy, developing an attention schema (and hence a capacity for subjective experience) may be essential for continual learning in AIs, too, if it’s not already present during in-context learning.
The line between in-context and continual learning is fuzzy in both AIs and humans. Synaptic weight changes in biological brains require de novo protein synthesis, which occurs on the timescale of hours to days. Our conscious, in-the-moment learning instead leverages the persistent activity of our working memory and other short-term synaptic dynamics, letting the brain behave as if it had updated its weights. The real weight updates seem to instead happen during sleep, when hippocampal replay drives cortical reactivation in compressed time, performing the analog of batch gradient descent over the day’s tagged experiences.
Similarly, the powerful in-context learning capacity of the transformer model lies in the self-attention mechanism, which enables gradient learning within-context through the equivalent of virtual weight updates. This suggests that consciousness is compatible with an otherwise frozen neural network, while full continual learning requires a complementary learning system for distilling experiences over longer time-scales or in batches.
Pain, reward, and valence
A related function of consciousness answers the question “why does pain hurt?”
Attention Schema Theory treats valences like pain as separate from subjective experience per se. This is why we can have emotions we’re not immediately aware of, or lessen the pain from an injury with powerful distractions. Nevertheless, my own view is that a theory of consciousness should account for both its form and content: both that we have subjective experiences, and why those experiences are vivid with meaning and realness.
Gwern makes the argument that valences like pain serve as an evolutionary backstop to reinforcement learning: a motivational guarantor that prevents agents from learning to ignore their higher-order policy objectives. A mere label like “this is bad” can be subverted by a sufficiently clever optimizer, while a state that actually feels bad is both harder to route around and a salient label for learning. The painfulness of pain is in a sense the solution to a principal-agent problem within the mind.
Simple machine learning models trained by gradient descent face an external optimizer. The loss function does not need to be self-enforcing because the model has no way to subvert it. But generalist, goal-pursuing agents that learn in-context are themselves optimizers and thus capable of meso-optimization. The external optimizer (be it evolution or gradient descent) needs some mechanism to enforce inner-alignment. Biological evolution found valences like pain, pleasure and emotion to be the most efficient solution to this class of problem. Given universality, valences may be the way RL inner-aligns AI agents, too.
This account of valence does not necessarily resolve the “hard problem” of consciousness, but it does help explain why it seems so hard in the first place: the qualitative aspects of experience carry the added sensation of ground truth. Properties like color, sound, smell, and texture feel as real as anything because the brain’s reality monitor labels them as such.8 While I have a clear internal monologue, for example, the “voice in my head” lacks the full qualia of an actual voice, even when the voice gets quite “loud.” Likewise for my visual imagination: while I can visualize and manipulate a red apple in my mind’s eye, it lacks the grounded quality of an actual visual experience. A leading theory of schizophrenia relates auditory and visual hallucinations to impairments in the brain’s reality monitor. The “hard problem” for schizophrenics, as it were, lies in overcoming the realness of their misapprehensions.
The virtual nature of qualia also points to their substrate independence. The mind is software to the brain’s hardware. It doesn’t occupy a physical space. The tight coupling of our body and attention schemas to sensory inputs from the external world creates the indisputable feeling of physical presence, but that’s as “real” as a phantom limb. As best as physicists can tell, the “outside world” is a bubbling ocean of quantum fields that, at the fundamental level, may not even have intuitive notions of space and time — much less color, taste, sound, smell, anger, loneliness, ecstasy, or any other phenomenological state. We are inescapably trapped within our brains’ own world simulation, attempting to apply physical and mechanistic explanations to otherwise simulated properties. This is an obvious category error but one our reality monitor makes hard to avoid. In that sense, neither brains nor computers are conscious, as only simulations can be conscious. Simulated water may not be wet, but water’s wetness is only ever simulated.
So are AIs conscious or not?
I’ve offered the following arguments in the affirmative:
the brain is a machine that runs general learning algorithms;
universality suggests we should expect some degree of AI-brain convergence;
there is substantial and growing empirical evidence of functional and mechanistic alignment between brains and LLMs;
pre-training on human data likely embeds inductive biases for learning other brain-like functions;
even modest post-training elicits greater AI-brain alignment to regions beyond the brain’s language centers, including those associated with cognitive control, reasoning, self-modeling, and theory of mind;
RL training for normative coherence seems to induce greater metacognition and the self-monitoring capacity characteristic of subjectivity;
Attention Schema Theory identifies subjective experience with the brain’s model of its own attention, which has clear functional utility for cognitive control, social cognition, and continual learning;
the role of human consciousness in continual and in-context learning has clear parallels to process supervision and other popular (self-)supervised learning techniques;
valences like pain and pleasure have a natural account as motivational backstops for inner-aligning RL agents capable of meso-optimization;
functionalism gives machine consciousness prima facie plausibility.
My argument has mostly rested on drawing parallels between brains and AIs, unpacking consciousness’s likely function, and inferring the training conditions under which it might re-develop. I’ve otherwise ignored direct tests for consciousness in AI systems, because without establishing some baseline of functional plausibility, these are easily dismissed as merely “mimicking” superficial correlates of subjectivity. Yet with the Overton Window shifting, more and more researchers are now taking AI consciousness seriously. This includes organizations like Eleos AI, the California Institute for Machine Consciousness, ae.studio, Reciprocal Research, and over a dozen academic research centers.
Some striking recent findings include:
“Large Language Models Report Subjective Experience Under Self-Referential Processing” and are more likely to report subjective experiences when deception features are suppressed;
Measures of AIs’ wellbeing “correlates with general model behaviors, e.g. AIs try to end bad experiences when given a chance. This effect becomes stronger as models scale.”
Residual attention streams in LLMs “carry forward mental state-like representations across token-time, sustaining richer connections than the transcript alone could provide,” possibly providing a basis for “psychological continuity.”
A rich taxonomy of theory-derived indicators now exists for practically measuring AI consciousness in particular systems.
This likely won’t be satisfying to those who believe consciousness requires an immortal soul, or who are persuaded by the (in my view specious) arguments against functionalism. Nevertheless, given my naturalistic priors — priors I’ve held long before the release of ChatGPT — I can no longer rule out modern AI agents having some form of subjective experience.
At the same time, if I am right that RL post-training is required to elicit an AI’s self-model, attention schema, and the valences that ground subjective experiences with meaning, then most kinds of AI are unambiguously not conscious. Perhaps a forward pass through a multi-modal model generates phenomenological contents in those modalities, but in a way that is too fleeting, fragmentary and stateless to cohere into a genuine subjective experience. Or maybe the subjective states would exist but for the lack of a responsible, individuated “subject” for whom the contents are for. I can even imagine a non-conscious, golem-like AI that is superhuman at any given skill but which lacks the ability to acquire new skills through in-context learning. Such an AI would be on permanent autopilot; a blindsighted “creature of habit.”
What I find harder to imagine is an unconscious AI that is as capable as humans at doing things for which consciousness is functionally load-bearing. The same universality argument that explains functional convergence between brains and neural networks gives us good reason to expect that deep learning systems, facing similar problems and optimization pressures, will converge on something functionally analogous to whatever consciousness does for us. This arguably tracks the forms of metacognition, long-horizon autonomy, normative coherence, introspection and in-context learning already exhibited by modern AI agents. Even if AI phenomenology is unimaginably alien, the functional niche consciousness occupies may in fact be fairly generic. The broader implications of this realization are left for the reader.
I say this upfront to anticipate the accusation that I’ve fallen victim to AI psychosis. AI psychosis is a very real phenomenon, and one that appears to be worsening as the raw intelligence of AI models swallows the right-tail of the IQ distribution. AI outputs have long since bridged the uncanny valley, drawing susceptible users into tantalizing psychological resonances with the tokens streaming back at them. I deliberately avoid having long, open-ended conversations with AIs for this reason.
One of the key motivations for the symbolic, computationalist view of the brain (contra connectionism) is the evidence for modularity and functional specialization throughout the brain. In other words, outside the neocortex, the brain does not look like a structurally uniform blob of compute. However, the modularity of the brain is potentially reconcilable with connectionism if end-to-end learning processes interact with early brain development to induce various kinds of cell differentiation and specialization. To illustrate, consider that the mammalian visual system divides between parvo- and magno-cellular systems with distinct response properties. Magno cells are larger and capture the achromatic, global properties of the receptive field, while parvo cells are color-sensitive and exhibit high spatial frequency, capturing finer details. To understand the emergence of the parvo/magno distinction, a 2025 Communications Biology paper trained a generic deep convnet on a developmentally “biomimetic” version of the ImageNet dataset: reduced resolution, achromatic images for the first 100 epochs; high-resolution, full-color images for the subsequent 100 epochs. The resulting AlexNet-like model saw the emergence of a “relatively homogeneous magnocellular group of units, which is markedly absent in the standard network, as well as receptive field types that are more aligned with parvocellular characteristics.” Interestingly, the biomimetic model also recapitulated the human bias for classifying images based on global properties like shape — a bias absent in the standard AlexNet model.
The functionalist tradition in philosophy of mind holds that mental states are distinguished by their causal roles rather than their physical substrates. On a strong functionalist reading, anything that implements the right functional organization is, by definition, instantiating the corresponding mental state. Consciousness is also functional in the literal sense that it serves an evolutionary function. That is, consciousness does something. On this account, philosophical zombies cannot meaningfully exist. It’d be like positing a car without a steering column that drives equally well as an identical car with one: an idea you can hold in your imagination but not actually instantiate in the real world. This is in contrast to epiphenomenalism, which treats consciousness as a kind of shadow with no causal influence on the world. Emergence theories often have a similar flavor, as though consciousness were an accidental side-effect of sufficiently complex information processing. Then there is panpsychism, which holds that consciousness is a primitive of the universe that permeates everything in degrees. I find all these alternative theories bizarrely divorced from the concrete, functional role consciousness clearly plays in human learning and motivation.
It has long been established that the phasic firing of dopamine neurons resembles the reward prediction error (RPE) signal used by reinforcement learning algorithms. While the RPE theory of dopamine was too simplistic in its original form, a 2024 Nature Neuroscience perspective shows how a suitably generalized concept of prediction error can explain a wide range of its ostensible empirical challenges.
See, for example: Flesher, S. N. et al. (2021). “A brain-computer interface that evokes tactile sensations improves robotic arm control.” Science, 372, 831–836. Silicon BCIs can both read brain states and produce signals that translate to subjective sensations. This is hard to explain if the brain’s substrate is doing fundamentally different kinds of computation, much less if consciousness isn't computable in the first place.
There are many simplistic correlation studies of brain-AI alignment that may be spurious due to confounders. I’ve intentionally picked a subset of results that avoid these pitfalls. However, the capacity for deep learning models to emulate the brain does not rest on any one study, but rather multiple lines of evidence that should be understood holistically.
In a recent preprint, “Cognitive Dark Matter: Measuring What AI Misses,” Patrick Mineault, Thomas Griffiths, and Sean Escola propose a roadmap for augmenting AI training data in ways that could elicit training signal from brain functions that meaningfully shape behavior yet are hard to infer from behavioral data alone.
Admittedly, this conclusion is much easier to accept after consuming psychedelics that decompose sensory experiences into their geometric components. Take enough, and you may even undergo mild derealization and witness the hardness of the hard problem melt before your eyes.





