The Bitter Lesson of LLM Extensions
Three years ago, “using an LLM” meant pasting a wall of text into a chat box and hoping for something useful back. Today, we point agents at our codebases and our browsers and let them act on our behalf. A question has been brewing under the surface the whole time: how do we let end users actually customize these systems?
As models have become more capable, the mechanisms end users have for customizing them have expanded too. We've gone from simple system prompts to complex client-server protocols and back again.
I wanted to take a moment to reflect on the history of LLM extension over the last three years and where I see it going in the future.
ChatGPT Plugins (March 2023)
Just four months after launch, OpenAI announced ChatGPT Plugins. Looking back, these were wildly ahead of their time.
The idea was ambitious: give the LLM a link to an OpenAPI spec and let it "run wild" calling REST endpoints. It was a direct line to AGI-style thinking: universal tool use via standard APIs.
{
"schema_version": "v1",
"name_for_human": "TODO Manager",
"name_for_model": "todo_manager",
"description_for_human": "Manages your TODOs!",
"description_for_model": "An app for managing a user's TODOs",
"api": { "url": "/openapi.json" },
"auth": { "type": "none" },
"logo_url": "https://example.com/logo.png",
"legal_info_url": "http://example.com",
"contact_email": "hello@example.com"
}
The problem? The models weren't ready. GPT-3.5 (and even early GPT-4) struggled to navigate massive API specs without hallucinating or getting lost in context. Plus, the UX was clunky. You had to manually toggle plugins for every chat!
But it gave us a glimpse of the future: The Code Interpreter plugin (later Advanced Data Analysis) became indispensable, foreshadowing the powerful sandboxed execution environments we use today.
Custom Instructions (July 2023)
Custom instructions were the "smooth brain" counter-reaction to the complexity of plugins. I did a double take when writing this because I thought for sure this feature was released before plugins.
It was just a user-defined prompt appended to every chat. Simple. Obvious. Yet it solved a huge problem: repetitive context setting.
This was the spiritual ancestor to every .cursorrules and CLAUDE.md file that followed.
Custom GPTs (Nov 2023)
OpenAI repackaged instructions and tools into Custom GPTs. This was an attempt to "productize" prompt engineering. You could bundle a persona, some files, and a few actions into a shareable link.
It was a retreat from the open-ended promise of plugins toward curated, single-purpose "apps."
Memory in ChatGPT (February 2024)
So far, we've discussed manual ways to extend LLMs. Memory represented a shift toward automatic personalization.
ChatGPT Memory records details from your conversations and quietly inserts them into future context. It's like a system prompt that writes itself. If you mention you're a vegetarian, it remembers that weeks later. It’s a small feature, but it marked the beginning of agents that maintain long-term state without user intervention.
Cursor Rules (April 2024)
Cursor changed the game by putting custom instructions where they belonged: in the repo.
The .cursorrules file was a revelation. Instead of pasting context into a chat window, you committed it to git.
- "We use tabs, not spaces."
- "No semicolons."
- "Always use TypeScript."
It started as a single file, then evolved into a .cursor/rules folder with sophisticated scoping: you could organize multiple rule files and define when each applied, for example only for certain file types or subdirectories. It was the first time extension felt "native" to the code.
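To make the scoping idea concrete, here is a rough sketch of how glob-scoped rules behave (my approximation, not Cursor's actual implementation): only the rules whose patterns match the file being worked on get injected into the model's context.

```python
# A rough sketch of glob-scoped rules; not Cursor's implementation.
# Each rule declares which paths it applies to, and only matching rules
# are injected into the model's context for the file being edited.
from fnmatch import fnmatch

# Hypothetical rules, mirroring files in a .cursor/rules folder.
RULES = [
    {"globs": ["*.ts", "*.tsx"], "text": "Always use TypeScript. No semicolons."},
    {"globs": ["api/*"], "text": "Every API handler must validate its input."},
]

def rules_for(path: str) -> list[str]:
    """Return the rule texts whose glob patterns match the given path."""
    return [r["text"] for r in RULES if any(fnmatch(path, g) for g in r["globs"])]

print(rules_for("api/users.ts"))  # both rules match this file
```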
Later Cursor introduced the ability to let the LLM decide when to apply a rule, which is a pattern we will see again.
Model Context Protocol (Nov 2024)
By late 2024, models were finally smart enough to handle real tools reliably. Anthropic's Model Context Protocol (MCP) was the answer.
MCP is a heavyweight solution. An MCP client keeps a persistent connection to an MCP server. The server exposes tool definitions, resources (documents, logs), and prompts to the client (in most cases, an agent); when the model decides to call a tool, the client sends a request to the server, and the server responds with the result.
Unlike Custom Instructions (which just add context), MCP gives the model actual capabilities: it can read your repo, query your Postgres database, or deploy to Vercel.
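For a sense of what the server side involves, here is a minimal sketch using the FastMCP helper from the official Python SDK; the todo-manager server and its single tool are hypothetical, and a real server would also declare resources, prompts, and auth.

```python
# A minimal sketch of an MCP server using the FastMCP helper from the
# official Python SDK ("mcp" on PyPI). The server name and tool are
# hypothetical stand-ins for a real integration.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("todo-manager")

@mcp.tool()
def list_todos() -> list[str]:
    """Return the user's open TODO items."""
    return ["Buy groceries", "Write blog post"]

if __name__ == "__main__":
    # Speaks the MCP protocol over stdio; the client (the agent) keeps this
    # process alive for the session and calls the tool on demand.
    mcp.run()
```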
It's powerful, and perhaps a bit of overkill. The complexity might be worth it for agent developers, but asking an end user to set up and connect an MCP server is a lot of friction, and there is an entire ecosystem of startups, like Smithery, built around making MCP easier to use.
It is worth noting that ChatGPT apps, announced in October 2025, are built on top of MCP as a base layer. This is an attempt to let end users benefit from MCP without having to think about it.
Claude Code: New Agent, New Extensions (Feb 2025)
Early 2025 brought us Claude Code, which essentially added every extension mechanism under the sun to an agent.
- CLAUDE.md: The standard for repo-level instructions.
- MCP: For heavy-duty tool integration.
- Slash Commands: Like Cursor's notebooks, for reusable prompts.
- Hooks: The ability to intercept and modify the agent's loop (e.g., "Stop if the tests fail").
- Sub-agents: Spawning specialized workers to handle sub-tasks.
- Output Styles: (Deprecated) Configuring tone and format.
Time will tell how many of these features will stick around in the long term. Anthropic has already tried to deprecate output styles.
Agent Skills (Oct 2025)
The next extension mechanism added to Claude Code is significant enough to warrant a deeper dive. Agent Skills are the rebirth of ChatGPT Plugins.
While MCP has a whole client-server protocol, Agent Skills are just folders of markdown files and scripts (in whatever language you choose).
The agent simply scans a skills/ directory, reads the frontmatter of every SKILL.md, and builds a lightweight index. It then chooses to read the full contents of a skill only if it's appropriate for the current task. This solves one of the major problems with MCP: the context bloat that comes from having to load all of the tool definitions into the context window at once.
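As a rough sketch of what that indexing step could look like (my approximation, not Claude Code's actual implementation), assuming each skill lives in its own folder with YAML frontmatter at the top of its SKILL.md:

```python
# A rough sketch of building a lightweight skill index: read only the YAML
# frontmatter of each SKILL.md, never the body, until a skill is needed.
# Not Claude Code's actual implementation; assumes PyYAML is installed.
from pathlib import Path
import yaml

def index_skills(skills_dir: str = "skills") -> dict[str, str]:
    index = {}
    for skill_file in Path(skills_dir).glob("*/SKILL.md"):
        text = skill_file.read_text()
        # Frontmatter sits between the first two "---" markers.
        _, frontmatter, _body = text.split("---", 2)
        meta = yaml.safe_load(frontmatter)
        index[meta["name"]] = meta["description"]
    return index

# Only these name -> description pairs go into the context window up front;
# the full SKILL.md body is read later, only if the task calls for it.
print(index_skills())
```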
Here is the directory structure of a skill for doing e2e testing with Playwright, taken from Anthropic's Skills examples repository:
webapp-testing/
├── examples/
│ ├── console_logging.py
│ ├── element_discovery.py
│ └── static_html_automation.py
├── scripts/
│ └── with_server.py
└── SKILL.md
There is a mix of scripts, examples, and plain text instructions. The only required file is the SKILL.md file. Let's take a look at that file:
---
name: webapp-testing
description: Toolkit for interacting with and testing local web applications using Playwright. Supports verifying frontend functionality, debugging UI behavior, capturing browser screenshots, and viewing browser logs.
license: Complete terms in LICENSE.txt
---
# Web Application Testing
To test local web applications, write native Python Playwright scripts.
**Helper Scripts Available**:
- `scripts/with_server.py` - Manages server lifecycle (supports multiple servers)
**Always run scripts with `--help` first** to see usage. DO NOT read the source until you try running the script first and find that a customized solution is absolutely necessary. These scripts can be very large and thus pollute your context window. They exist to be called directly as black-box scripts rather than ingested into your context window.
... skill continues ...
This is just a plain markdown file with some metadata and a description of the skill. The agent reads the file, which freely references other files the agent can read. By contrast, a Playwright MCP server exposes dozens of tool definitions for controlling a browser; this skill just says "you have bash, this is how you write a Playwright script".
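For instance, after reading the skill, the agent might write and run a throwaway script like this (a hypothetical example of what the agent would produce, not a file that ships with the skill):

```python
# The kind of throwaway script an agent might write after reading the skill:
# plain Playwright, run once via bash, never loaded into the context window.
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("http://localhost:3000")  # hypothetical local app under test
    page.screenshot(path="homepage.png")
    print(page.title())
    browser.close()
```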
Granted, to use a skill the agent needs general-purpose access to a computer, but this is the bitter lesson in action. Giving an agent general-purpose tools and trusting it to use them to accomplish a task might very well be the winning strategy over building specialized tools for every task.
What the future holds
Skills are the actualization of the dream set out by ChatGPT Plugins: just give the model instructions and some generic tools, and trust it to do the glue work in between. My hypothesis is that this time it might actually work, because the models are finally smart enough.
Agent Skills work because they assume the agent can write its own tools (via bash commands). You can just give it code snippets and trust it to figure out how to run and adapt them for the task at hand.
Importantly, I think skills signal a new definition of what an agent really is. An agent isn't just an LLM in a while loop. It's an LLM in a while loop with a computer strapped to it.
Claude Code is the piece of software that first made this click for me, but it is far too developer-focused to be the final form. Other applications, like Zo Computer, try to package the LLM and the computer together into a single product, but I think they still don't abstract the computer away enough from the end user. If I ask a coworker to do something, I don't need to see their entire file system; I just need to know that they have a computer.
Looking ahead to 2026, I expect more and more of the LLM applications we use to have a computer strapped to them in new and interesting ways, whether we know it or not.
If I could short MCP, I would, and I expect us to go back to extending our agents with the most accessible programming language: natural language.