We’re Summoning Ghosts, Not Building Animals: Andrej Karpathy on AI Agents and the Future

December 6, 2025

Andrej Karpathy has this way of reframing things that makes you see the whole field differently. When most people talk about AI, they reach for animal metaphors - training models like you’d train a dog, building systems that “think” or “reason” or “understand.” But Karpathy flips it: we’re not building animals at all. We’re summoning ghosts.

That distinction might sound like philosophy, but it’s actually the key to understanding where AI is going next. In a recent conversation with Dwarkesh Patel on The Dwarkesh Podcast, Karpathy laid out his vision for the “decade of agents” ahead of us - and why the ghost metaphor captures something essential about what these systems actually are and how they’ll evolve.

I’ve been thinking about this conversation for days. Not just because Karpathy is one of the few people in AI who can explain complex technical concepts without making your eyes glaze over, but because he’s seeing patterns most people are missing. The decade ahead won’t be about bigger models or better benchmarks. It’ll be about reliability, infrastructure, and fundamentally rethinking how we interact with intelligence itself.

The Decade of Agents (And Why We’re Just Getting Started)

Karpathy thinks we’re entering a “decade of agents” - systems that can actually go off and do tasks autonomously rather than just generating text when you prompt them. This isn’t hype. His reasoning is grounded in what’s already working.

He points to examples like Cursor for coding and ChatGPT for web browsing - agents that can already accomplish real tasks in constrained environments. The infrastructure is coming together: reliable function calling, better tool usage, execution environments that don’t break everything. But we’re nowhere near the end state.

What strikes me about his framing is the emphasis on environments over capabilities. It’s not “can the model code?” - it’s “does the environment let it safely execute that code, see the results, iterate, and fix mistakes?” That’s a completely different problem than just making language models bigger.
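
To make that concrete, here’s roughly what the plumbing looks like: a tool definition the model can see, and a dispatcher that executes whatever call it emits. This is a minimal sketch - the schema fields are representative of what chat APIs accept for function calling (exact names vary by provider), and the run_python “sandbox” is just a subprocess with a timeout.

```python
# A minimal sketch of the tool-calling plumbing an agent environment provides.
# The schema below is representative of function-calling formats (field names
# vary by provider); run_python is a stand-in for a real sandboxed executor.

import json
import subprocess
import sys

TOOLS = [
    {
        "name": "run_python",
        "description": "Execute a Python snippet in a sandbox and return stdout/stderr.",
        "parameters": {
            "type": "object",
            "properties": {"code": {"type": "string"}},
            "required": ["code"],
        },
    }
]

def run_python(code: str) -> str:
    # A subprocess with a timeout is the crudest possible "sandbox";
    # real environments add resource limits and filesystem isolation.
    result = subprocess.run(
        [sys.executable, "-c", code], capture_output=True, text=True, timeout=10
    )
    return result.stdout + result.stderr

DISPATCH = {"run_python": run_python}

def handle_tool_call(call: dict) -> str:
    """Route a model-emitted tool call shaped like {"name": ..., "arguments": "..."}."""
    args = json.loads(call["arguments"])
    return DISPATCH[call["name"]](**args)

if __name__ == "__main__":
    fake_call = {"name": "run_python", "arguments": json.dumps({"code": "print(2 + 2)"})}
    print(handle_tool_call(fake_call))  # -> 4
```

The interesting engineering is everything this sketch leaves out: resource limits, state isolation, and recovering gracefully when a tool call fails.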

The comparison to self-driving cars is perfect. Everyone thought level 5 autonomy was five years away back in 2016. Turns out the problem was way harder than anyone expected - not because the perception wasn’t good enough, but because reliability in the real world requires solving a thousand edge cases you didn’t know existed.

Agents are the same. ChatGPT can browse the web and click around, but it still fails regularly. Cursor can write code, but it needs humans to verify everything. We’ve got the core technology. Now comes the unglamorous work of making it actually reliable.

Ghosts vs Animals: A Better Way to Think About AI

Here’s where Karpathy’s ghost metaphor really clicks. When you train an animal, you’re working with an embodied creature that has needs, instincts, emotions. Training happens through reinforcement - food, pain, pleasure. The animal is fundamentally separate from you, with its own agency.

But LLMs aren’t like that at all. They’re simulations of processes that used to happen in human brains. When you prompt GPT-4, you’re not talking to an entity - you’re summoning a ghost of human text-writing processes captured in training data. The model samples from the distribution of “things humans would write given this context.”

This isn’t just semantic wordplay. It changes how you think about what’s happening. Karpathy uses the example of asking an LLM to summarize a book. You’re not asking an intelligent agent to read and compress information. You’re summoning the ghost of “people writing summaries of this book” from the training distribution. If that pattern exists in the training data, the ghost appears. If it doesn’t, you get confabulation.
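
If you want to see what “summoning” means mechanically, it bottoms out in next-token sampling. Here’s a toy sketch with invented numbers - a real model produces these scores from billions of parameters, but the procedure (scale by temperature, softmax, sample) is the standard one:

```python
# Toy illustration: given a context, the model assigns a score (logit) to every
# possible next token and we sample from the resulting distribution.
# The logits here are made up for illustration.

import math
import random

def sample_next_token(logits: dict, temperature: float = 0.8) -> str:
    scaled = {tok: logit / temperature for tok, logit in logits.items()}
    max_logit = max(scaled.values())  # subtract max for numerical stability
    weights = {tok: math.exp(v - max_logit) for tok, v in scaled.items()}
    total = sum(weights.values())
    r = random.uniform(0, total)
    cumulative = 0.0
    for tok, w in weights.items():
        cumulative += w
        if r <= cumulative:
            return tok
    return tok  # fallback for floating-point edge cases

# Hypothetical next-token scores after the context "The book is about"
logits = {" a": 2.1, " the": 1.7, " how": 1.2, " quantum": 0.3}
print(sample_next_token(logits))
```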

The animal metaphor misleads us into thinking we’re building autonomous minds. The ghost metaphor reminds us we’re actually doing something weirder and more limited: reconstructing processes that happened in human brains, frozen in text form, and replaying them through neural network weights.

This explains so much about why LLMs work the way they do. Why they’re so good at common patterns and so bad at novel reasoning. Why scaling data helps more than scaling parameters past a certain point. Why they hallucinate when you push them outside their training distribution.

We’re not building general intelligence. We’re building increasingly sophisticated ways to summon and orchestrate ghosts of human cognitive processes.

The Real Power: In-Context Learning vs Pre-Training

One of the most underrated insights in the conversation was Karpathy’s distinction between pre-training and in-context learning. Most of the public attention goes to pre-training - those massive training runs on billions of tokens that cost millions of dollars. But Karpathy thinks in-context learning might actually be more important long-term.

Pre-training creates the ghost. In-context learning is how you direct it.

When you put examples in your prompt showing GPT-4 how to format output or what style to use, that’s in-context learning. The model adapts its behavior based on the immediate context without any parameter updates. No gradients, no backprop, just pattern matching against what you showed it.

This is wild when you think about it. The same frozen weights can do completely different tasks depending on what you put in the context window. It’s as if the same ghost can be summoned into different roles depending on the ritual you perform.
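
Here’s a small sketch of what that looks like in code. The example notes and the send_to_model stub are hypothetical - the point is just that the “learning” lives entirely in the prompt string, not in the weights:

```python
# A minimal sketch of in-context learning: the "training examples" live in
# the prompt, not in the weights. send_to_model is a placeholder for whatever
# chat API you actually call.

EXAMPLES = [
    ("The meeting moved to 3pm Thursday.", '{"event": "meeting", "time": "Thu 15:00"}'),
    ("Dinner with Sam next Friday at 7.", '{"event": "dinner", "time": "Fri 19:00"}'),
]

def build_prompt(new_input: str) -> str:
    parts = ["Convert each note into JSON with 'event' and 'time' fields.\n"]
    for note, expected in EXAMPLES:
        parts.append(f"Note: {note}\nJSON: {expected}\n")
    parts.append(f"Note: {new_input}\nJSON:")
    return "\n".join(parts)

def send_to_model(prompt: str) -> str:
    raise NotImplementedError("plug in your chat API of choice here")

print(build_prompt("Call the dentist Monday morning."))
```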

Karpathy’s been exploring this territory in his Nanochat project and related experiments - once you have a trained model, a surprising amount of application behavior comes from in-context learning alone: no extra fine-tuning, no RL, just clever prompt engineering and giving the model the right context.

The advantage is speed and flexibility. You can iterate on behavior in seconds instead of running expensive training jobs. The disadvantage is you’re limited by context window size and the model’s base capabilities.

But as context windows grow (we’re already at 200K+ with some models), in-context learning gets more powerful. You can effectively build entire applications by just composing the right prompts and examples.

This feels like where the real innovation will happen over the next few years. Not bigger pre-training runs, but better ways to orchestrate models through in-context learning. Better tools for managing context, better patterns for chain-of-thought reasoning, better ways to give models persistent memory and state.
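
As one hypothetical example of that orchestration layer, here’s a toy conversation buffer with a context-window budget. The class name and the word-count “tokenizer” are stand-ins, not any real library’s API:

```python
# Sketch of the unglamorous orchestration layer: keep persistent state across
# calls and trim the running context so it fits a window budget. Token
# counting is faked with a word count; real systems use the model's tokenizer.

class ContextManager:
    def __init__(self, budget_tokens: int = 2000, system_prompt: str = "You are a helpful assistant."):
        self.budget = budget_tokens
        self.system_prompt = system_prompt
        self.turns = []  # list of (role, text) pairs

    def _estimate_tokens(self, text: str) -> int:
        return len(text.split())  # crude stand-in for a real tokenizer

    def add(self, role: str, text: str) -> None:
        self.turns.append((role, text))

    def render(self) -> str:
        # Always keep the system prompt; drop the oldest turns until we fit.
        kept = list(self.turns)
        while kept and sum(self._estimate_tokens(t) for _, t in kept) > self.budget:
            kept.pop(0)
        lines = [f"system: {self.system_prompt}"]
        lines += [f"{role}: {text}" for role, text in kept]
        return "\n".join(lines)

ctx = ContextManager(budget_tokens=50)
ctx.add("user", "Summarize our plan so far.")
ctx.add("assistant", "We agreed to ship the prototype Friday.")
print(ctx.render())
```

Real systems do smarter things than dropping the oldest turns - summarizing them, or stashing them in retrieval - but the budget problem is the same.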

RL and Judges: The Reliability Problem

If we want agents that actually work, we need reinforcement learning. Karpathy is clear about this: pre-trained models get you to maybe 80% on any given task, but that last 20% toward reliability requires RL.

The challenge isn’t running RL algorithms - we know how to do that. The challenge is having good judges.

For a model to improve through RL, you need accurate feedback about whether its outputs are correct. In some domains that’s easy (code either runs or it doesn’t), but in others it’s nearly impossible (is this summary good? is this creative writing compelling?).
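
For the easy domains, the judge can literally be a script: run the model’s code against its tests and hand back a binary reward. A minimal sketch, with the candidate and tests hardcoded as stand-ins for model output:

```python
# Sketch of an automated judge for the easy case: generated code either
# passes its tests or it doesn't. In an RL loop the candidate would come
# from the model being trained.

import subprocess
import sys

def code_reward(candidate_src: str, test_src: str, timeout_s: float = 5.0) -> float:
    """Return 1.0 if the candidate plus its tests run cleanly, else 0.0."""
    program = candidate_src + "\n\n" + test_src
    try:
        result = subprocess.run(
            [sys.executable, "-c", program],
            capture_output=True, text=True, timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return 0.0
    return 1.0 if result.returncode == 0 else 0.0

candidate = "def add(a, b):\n    return a + b"
tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0"
print(code_reward(candidate, tests))  # -> 1.0
```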

Right now we’re using a mix of human labelers and other LLMs as judges, and both have problems. Humans are expensive and slow. LLMs inherit biases from their training data and can be confidently wrong.

Karpathy pointed to an interesting pattern: domains where we have good automated judges see rapid progress (coding, math), while domains without good judges stagnate (creative writing, general reasoning). This suggests the bottleneck isn’t model capability - it’s our ability to provide reliable training signal.

The self-driving car parallel holds here too. We knew how to train capable perception models years before the overall system was reliable, in part because evaluating the full driving stack is so hard. You can’t just run RL on “did the car crash?” - that signal is too sparse and delayed. You need dense, immediate feedback about thousands of micro-decisions.

Agents face the same challenge. If an agent spends 20 minutes browsing the web and fails to complete a task, what exactly went wrong? Was it the search queries? The page selection? The information extraction? The reasoning about what to do next?

Without good judges, you can’t improve. With good judges, progress accelerates dramatically. That’s why Karpathy thinks so much effort in the agent era will go into building better evaluation infrastructure.
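
Concretely, that evaluation infrastructure probably looks something like per-step instrumentation. Here’s a toy sketch - the step names and checks are invented, but the idea is that every decision in a trajectory gets its own cheap judge so failures can be localized:

```python
# Sketch of evaluation infrastructure for agents: record every step of a
# trajectory and attach a per-step check, so a failed task can be blamed on
# a specific decision instead of "it didn't work".

from dataclasses import dataclass, field
from typing import Callable, Optional

@dataclass
class Step:
    name: str
    output: str
    check: Callable[[str], bool]  # a cheap per-step judge

@dataclass
class Trajectory:
    task: str
    steps: list = field(default_factory=list)

    def first_failure(self) -> Optional[str]:
        for step in self.steps:
            if not step.check(step.output):
                return step.name
        return None

traj = Trajectory(task="find the cheapest flight to Lisbon")
traj.steps.append(Step("search_query", "flights to Lisbon in March", lambda s: "Lisbon" in s))
traj.steps.append(Step("extract_price", "", lambda s: s.strip() != ""))  # empty -> extraction failed
print(traj.first_failure())  # -> extract_price
```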

Model Collapse and the Synthetic Data Problem

Here’s something that kept me up at night after listening to this: model collapse is real, and it might limit how much further we can scale pre-training.

Model collapse happens when you train models on synthetic data generated by other models. The distribution narrows with each generation - like making copies of copies, the signal degrades. Diversity decreases, artifacts compound, the models become progressively worse at capturing the full range of human text.

This matters because we’re running out of high-quality human-generated text to train on. The internet has a finite amount of written content, and we’ve basically scraped it all. Future improvements can’t come from just adding more data - there isn’t more data to add.

Some people think synthetic data is the answer. Generate millions of math problems, coding exercises, reasoning chains using current models, then train on that. But Karpathy is skeptical. The evidence suggests this leads to collapse pretty quickly unless you’re very careful about maintaining diversity and quality.

The comparison to evolutionary bottlenecks is apt. When populations get too small, genetic diversity crashes and you get inbreeding depression. Same thing happens with training distributions - if you keep sampling from narrower and narrower distributions of model-generated text, you lose the richness that made the original models work.
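
Here’s a toy numerical analogue of that copies-of-copies effect, with tail-dropping standing in for the way generative models favor typical outputs over rare ones. Everything about it is invented for illustration, but the narrowing dynamic is the point:

```python
# Toy illustration of distribution narrowing: fit a Gaussian to data, sample
# "synthetic data" from the fit while underweighting the tails (a crude
# stand-in for generative models favoring typical outputs), then refit on
# the synthetic data and repeat. The spread shrinks generation by generation.

import random
import statistics

def next_generation(data, n_samples=1000, tail_cutoff=2.0):
    mu = statistics.mean(data)
    sigma = statistics.stdev(data)
    samples = []
    while len(samples) < n_samples:
        x = random.gauss(mu, sigma)
        if abs(x - mu) <= tail_cutoff * sigma:  # rare events get dropped
            samples.append(x)
    return samples

data = [random.gauss(0.0, 1.0) for _ in range(1000)]  # stand-in for human data
print(f"generation 0: std = {statistics.stdev(data):.3f}")
for gen in range(1, 11):
    data = next_generation(data)
    print(f"generation {gen}: std = {statistics.stdev(data):.3f}")
```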

This doesn’t mean we’re stuck. It means future progress will come from somewhere other than naive scaling of pre-training on ever-larger synthetic datasets. Maybe better RL with good judges. Maybe better in-context learning. Maybe entirely new architectures.

But the era of “just add more data and compute” is probably ending. Which honestly might be a good thing - it forces us to think more carefully about what these systems actually need to be useful.

Coding and Nanochat: Building With Ghosts

Karpathy’s Nanochat project is basically a laboratory for exploring what you can build when you take LLMs seriously as tools rather than products. It’s a deliberately minimal, end-to-end ChatGPT-style implementation - a compact, hackable codebase - that shows how real applications get built on top of language models.

The philosophy is interesting: rather than trying to bake every capability into the model weights, much of the useful behavior comes from in-context learning and tool use. The model gets access to a Python interpreter, web search, the file system - whatever tools it needs. Then you just prompt it to accomplish tasks.

This mirrors how Cursor and other coding tools work. The model writes code, executes it in a sandbox, sees the results, fixes errors, iterates. Over time you build up a library of working solutions that you can reference in context for future tasks.
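
The loop itself fits in a few lines. A hedged sketch - propose_code is a placeholder for the actual model call, and the subprocess “sandbox” is the bare minimum you’d want in practice:

```python
# Sketch of the write / run / fix loop that coding tools build around.
# propose_code is a stub for a model call; everything here is illustrative.

import subprocess
import sys
from typing import Optional

def propose_code(task: str, previous_error: Optional[str] = None) -> str:
    raise NotImplementedError("call your model here, feeding back previous_error")

def execute(code: str, timeout_s: float = 10.0):
    result = subprocess.run(
        [sys.executable, "-c", code], capture_output=True, text=True, timeout=timeout_s
    )
    return result.returncode == 0, result.stdout + result.stderr

def solve(task: str, max_attempts: int = 5) -> Optional[str]:
    error = None
    for _ in range(max_attempts):
        code = propose_code(task, previous_error=error)
        ok, output = execute(code)
        if ok:
            return code   # keep the working solution as context for future tasks
        error = output    # otherwise, show the ghost its own traceback and retry
    return None
```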

Karpathy compares building with LLMs to working with junior engineers - you give them high-level direction, they try something, you review it, they iterate. The main difference is LLMs iterate way faster and never get tired or offended by feedback.

What strikes me about Nanochat is how it makes the ghost metaphor concrete. You’re not trying to build an artificial mind. You’re summoning specialized ghosts - “the ghost of a programmer writing Python,” “the ghost of someone debugging this specific error,” “the ghost of a technical writer documenting this function.”

Each ghost is ephemeral, summoned by the right prompt and context, then dismissed when the task is done. The skill is knowing which ghost to summon for which task, and how to chain them together into useful workflows.

This feels like where practical AI work is heading. Not building AGI, but orchestrating specialized ghosts through clever prompting and tool use.

Self-Driving and the March of Nines

Karpathy spent years working on Tesla Autopilot, and his perspective on self-driving illuminates the agent challenge perfectly. The problem isn’t getting to 90% or even 95% reliability. It’s the march of nines - getting from 99% to 99.9% to 99.99%.

Everyone underestimates how hard those last few percentage points are. You fix one edge case, ten more appear. Each improvement reveals new failure modes you didn’t know existed. The long tail of rare events is basically infinite.

This is why self-driving took way longer than anyone expected. In 2016, the consensus was maybe five years to level 5 autonomy; it’s 2025 and we’re still not there - not because the perception isn’t good enough, but because reliability in the real world requires handling millions of edge cases.

Agents face the exact same problem. Getting ChatGPT to successfully browse the web 90% of the time? That’s probably already done. Getting it to 99.9%? That’s going to take years of grinding through failure cases, improving robustness, building better recovery mechanisms.
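
A quick back-of-the-envelope calculation shows why. If a task takes n sequential steps and each succeeds independently with probability p, the whole thing succeeds with probability p^n - and that compounds brutally:

```python
# Back-of-the-envelope arithmetic for the march of nines: per-step reliability
# compounds across a multi-step task. (Real failures aren't independent, but
# the compounding effect is the point.)

for p in (0.90, 0.99, 0.999, 0.9999):
    for n in (10, 20, 50):
        print(f"per-step {p:.4f}, {n:2d} steps -> task success {p ** n:.3f}")

# 0.99 per step over 50 steps still fails roughly 40% of the time;
# long tasks only feel reliable with three or four nines per step.
```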

Karpathy’s insight is that this isn’t about model capability. It’s about infrastructure, evaluation, and systematic improvement processes. The march of nines requires discipline, not breakthroughs.

This is unsexy work. Not paper-worthy. Not demo-worthy. Just grinding through logs of failures, categorizing them, fixing them one by one. Building test suites. Measuring regression. Slowly pushing the reliability numbers up.

But it’s what separates toys from tools. Demos from products. The agent era won’t arrive through a sudden breakthrough. It’ll arrive through patient, methodical improvement of reliability over years.

Eureka Labs and the Future of Education

Karpathy’s newest project is Eureka Labs, which is trying to reimagine education for the AI age. The premise is simple: if AI can effectively tutor people one-on-one in any subject, what does education look like?

The traditional model is broadcasting - one teacher to many students, usually teaching material that’s completely standardized and could easily be delivered by an AI. This doesn’t leverage what humans are uniquely good at: mentorship, motivation, real-time adaptation to individual needs.

Karpathy’s vision is inverting this. Let AI handle the broadcasting part - delivering content, answering basic questions, providing practice problems. Let human experts focus on the high-value interactions: inspiring curiosity, providing feedback on creative work, helping students navigate uncertainty.

Think of it like gym culture after AI gets really good at physical tasks. You don’t go to the gym because you need to lift things for work - you go because physical fitness is intrinsically valuable and challenging yourself is satisfying. Post-AGI education might be similar: you learn not because you need the knowledge for economic survival, but because intellectual growth is worthwhile in itself.

This resonates with me. The value of education has never been primarily about knowledge transfer - that’s just the justification we use. The real value is developing taste, building mental models, learning how to think. Those things happen through struggle, feedback, and social learning. AI can enhance all of that if we use it right.

The Eureka Labs model is still early, but the direction is compelling. Build AI tutors that can explain anything in any way. Free up human experts to do what they’re uniquely good at. Make learning more personalized, more adaptive, more engaging.

This could be genuinely transformative. Not because AI replaces teachers, but because it lets teachers focus on what actually matters.

What This All Means

Coming back to the ghost metaphor - I think Karpathy is right that we need better language for talking about what AI systems are. The animal metaphor leads us astray. We’re not building autonomous minds. We’re building increasingly sophisticated ways to summon and orchestrate cognitive processes captured in training data.

That’s not less impressive, it’s just different. And understanding the difference helps us build better systems and avoid dead ends.

The decade of agents will be about:

  • Better environments for safe execution and iteration
  • Better judges for providing reliable training signal
  • Better ways to orchestrate models through in-context learning
  • Systematic improvement of reliability through the march of nines
  • Reimagining human-AI collaboration in domains like education

None of this requires AGI. None of it requires consciousness or understanding in the human sense. It just requires taking what LLMs can already do and making it reliable enough to build on.

The hype cycle has moved on from agents to whatever is next, but Karpathy thinks agents are still the most important near-term frontier. Not because they’re exciting, but because they’re practical. Because the infrastructure is almost there. Because we’re starting to understand how to make them work.

I find this perspective refreshing. It’s not about chasing the next big thing or predicting when AGI arrives. It’s about building useful tools that solve real problems, one reliability improvement at a time.

If you want to understand where AI is actually going - not where the hype is pointing, but where the substance is building - listen to people like Karpathy. Check out his work on Nanochat and Eureka Labs. Pay attention to the boring infrastructure improvements, not the flashy demos.

The future is being built quietly, by people who understand that we’re summoning ghosts, not building animals. And that’s exactly what makes it so interesting.