Building Agentic AI: What Actually Works (And What Doesn't)
I've spent the last few months trying to get AI agents to do real work — not toy demos, real tasks on real codebases. Here's the full journey, every wrong turn included.
Starting raw: LLM + Python
The first attempt was dead simple. I loaded up Llama, served it with ollama serve, and wired up Python functions as tools. Just me and the model. Write a function, let the model call it, get results back.
It worked for basic tasks. Then the hallucinations started. The model would confidently call functions that didn't exist, pass arguments in the wrong format, or just make up results. There was no structure, no guardrails — just a language model freestyling with my codebase.
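In sketch form, the whole harness was little more than this. The tool names and the dispatcher here are illustrative, not the original code, but they capture the shape: plain functions, a dict, and nothing stopping the model from naming a tool that doesn't exist.

```python
# Minimal sketch of the first setup: plain Python functions exposed as
# "tools", with a dict lookup as the entire dispatch layer.
# Function names are hypothetical stand-ins.

def read_file(path: str) -> str:
    """Hypothetical tool: return a file's contents."""
    with open(path) as f:
        return f.read()

def word_count(text: str) -> int:
    """Hypothetical tool: count the words in a string."""
    return len(text.split())

TOOLS = {"read_file": read_file, "word_count": word_count}

def dispatch(tool_call: dict):
    """Execute a tool call the model emitted as JSON."""
    fn = TOOLS.get(tool_call.get("name"))
    if fn is None:
        # This is exactly where hallucinated tool names blew up:
        raise ValueError(f"unknown tool: {tool_call.get('name')!r}")
    return fn(**tool_call.get("arguments", {}))

# A well-formed call works fine...
result = dispatch({"name": "word_count",
                   "arguments": {"text": "agentic AI in practice"}})
# ...but a made-up tool name raises, and there was no guardrail to recover.
```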
Frameworks enter the picture
I heard about agentic AI frameworks and started exploring. Tried smolagents, read about LangChain and the others. Smolagents was clean and I liked the approach, but with Llama 3B as the brain, we were struggling. The model just couldn't plan reliably at that size.
So we went bigger — Qwen 7B. And we got overconfident with it.
More freedom, worse results

We gave Qwen 7B everything: shell access, the ability to write its own Python code, about five different tools, and free rein to get tasks done however it saw fit.

More often than not, the way it saw fit was a mess.
It would chain together bizarre sequences of tool calls, use shell commands when a simple function call would do, or write Python scripts that reimplemented tools it already had access to. More capability didn't mean better results — it meant more creative ways to fail.
Constraining the toolbox
The fix was counterintuitive: give the model less freedom, not more. Instead of letting it write arbitrary code, I wrote specific tool functions ahead of time and told the model to pick from those. No shell access. No code generation. Just: here are your tools, pick the right one.
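The constrained version boils down to two things: enumerate the fixed tool set in the prompt, and reject any pick outside it before executing. A rough sketch, with hypothetical tool names standing in for ours:

```python
# Sketch of the constrained toolbox: the model picks from a fixed, enumerated
# set of predefined functions. No shell, no code generation.
# Tool names here are illustrative.

import inspect

def fetch_page(url: str) -> str:
    """Download a page's HTML."""
    ...

def extract_links(html: str) -> list:
    """Pull all hrefs out of an HTML string."""
    ...

def save_result(data: str, path: str) -> None:
    """Write results to disk."""
    ...

TOOLBOX = [fetch_page, extract_links, save_result]

def toolbox_prompt(tools) -> str:
    """Build the 'here are your tools, pick one' section of the system prompt."""
    lines = ["You may ONLY call one of these tools. Do not write code."]
    for fn in tools:
        sig = inspect.signature(fn)
        lines.append(f"- {fn.__name__}{sig}: {fn.__doc__}")
    return "\n".join(lines)

def validate_choice(name: str, tools) -> bool:
    """Refuse any tool name outside the fixed set before dispatching."""
    return name in {fn.__name__ for fn in tools}
```

The validation step matters as much as the prompt: even with the menu spelled out, small models occasionally invented tool names, and it's cheaper to reject those up front than to debug a bad execution.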
Results got substantially better. The model was good at selecting the right tool for the job — it just couldn't be trusted to build the tool on the fly.
But we hit a new problem: malformed JSON. The model would pick the right tool but format the arguments wrong. Broken brackets, trailing commas, unescaped strings.
Self-correction loops
I had what I thought was a brilliant idea: let the model fix its own bad JSON. When a tool call failed to parse, I'd re-prompt the model with the malformed output and ask it to fix it.
The small models could actually fix their own JSON — that part worked. The problem was everything else. The model would make wrong tool calls, produce bad results, and couldn't keep up when tasks required multiple steps. It could handle one-shot corrections but fell apart on anything that needed sustained reasoning across a chain of actions.
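The correction loop itself is simple: try to parse, and on failure hand the broken output plus the parser's error back to the model for another attempt. A sketch, with `fix_fn` standing in for the model-specific re-prompt call:

```python
# Sketch of the JSON self-correction loop. `fix_fn` is a stand-in for the
# re-prompt ("your last output was invalid JSON: <error>, fix it"), which
# is model-specific.

import json

def parse_with_repair(raw: str, fix_fn, max_retries: int = 2) -> dict:
    attempt = raw
    for _ in range(max_retries + 1):
        try:
            return json.loads(attempt)
        except json.JSONDecodeError as err:
            # Feed the malformed output and the parse error back to the model.
            attempt = fix_fn(attempt, str(err))
    raise ValueError("model could not produce valid JSON")

# Demo with a toy fixer that strips a trailing comma, a classic
# small-model mistake:
fixed = parse_with_repair(
    '{"tool": "read_file",}',
    fix_fn=lambda bad, err: bad.replace(",}", "}"),
)
```

Capping the retries is important: a model that can't fix its output in two attempts usually won't fix it in ten, and the loop should fail loudly instead of burning tokens.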
So we upgraded the brain — GPT-4o-mini. The difference was night and day. Multi-step tasks actually worked. The model could hold context across a sequence of tool calls, self-correct when needed, and produce results that were consistently usable.
The context overload problem
But then we hit the next wall. The model had four roles: inspect, diagnose, analyze, and fix. One system prompt tried to explain all four roles, when to use each one, and what tools were available for each. The prompt was massive.
The model would get confused about which role it was in. It would try to fix code when it should have been diagnosing. It would analyze when it should have been inspecting. Too many responsibilities in one context window.
Cassettes: loading the right context at the right time
This is where things clicked.
Think about it like the Matrix — when Neo needs to learn karate, they don't dump every martial art into his brain at once. They load the right module at the right time. Same principle applies to AI agents.
I broke the one mega system prompt into four individual role definitions, each in its own markdown file. The agent would start by analyzing the user's query, determine what kind of task it was, and load only the context it needed for that specific role.
Need to inspect a webpage? Load the inspector role with its tools and instructions. Need to diagnose why a scraper broke? Load the diagnostics role with the relevant error context. The model never sees the fix instructions when it's supposed to be diagnosing.
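The orchestration is deliberately dumb: classify the query, then read exactly one role file into the context. A sketch, with illustrative file paths and a keyword stub where the real system used an LLM call to classify:

```python
# Sketch of the cassette pattern: one markdown role definition per file,
# loaded only after the query is classified. Paths are illustrative, and
# the keyword classifier is a stub; the real classifier was a model call.

from pathlib import Path

ROLE_FILES = {
    "inspect": "roles/inspector.md",
    "diagnose": "roles/diagnostics.md",
    "analyze": "roles/analysis.md",
    "fix": "roles/fixer.md",
}

def classify(query: str) -> str:
    """Stand-in classifier; the real one asked the model itself."""
    for role in ROLE_FILES:
        if role in query.lower():
            return role
    return "analyze"  # default cassette

def load_context(query: str) -> str:
    role = classify(query)
    # Only this one role definition enters the context window; the other
    # three files are never read.
    return Path(ROLE_FILES[role]).read_text()
```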
This worked well for most tasks, especially analysis. The model was focused, the context was clean, and the results were consistent.
Where agents still struggle
The model also had a fix role — specifically for repairing scrapers when HTML layouts changed on target sites. This is where we hit the ceiling.
The model struggled with:
- Large codebases — too much code to fit in context, and the model couldn't reason about what was relevant
- Inheritance chains — following class hierarchies across multiple files was unreliable
- Multi-file fixes — when a fix required changes in three different files, the model would fix one and break the others
The root cause wasn't the model's reasoning. It was that the code wasn't written for agents. Human-readable code and agent-operable code aren't the same thing.
Coding for agents, not humans
This is an idea I've been seeing come up more and more, and it matches what we experienced firsthand: you need to write code that agents can actually work with.
For our specific case — a fleet of 30+ web scrapers — this means pulling CSS selectors out of the scraper logic and into separate config files. Instead of the agent needing to understand the full scraper codebase to fix a broken selector, it just:
- Loads the target webpage
- Inspects the current HTML structure
- Reads the selector config file
- Compares what's in the config to what's on the page
- Updates the config
That's a task an agent can actually do reliably. No inheritance to follow, no multi-file reasoning, no understanding of the full scraper architecture. Just: does the config match the page? If not, update it.
We're in the middle of refactoring now. With 30+ scrapers, it's not a quick job. But the pattern is clear — the more declarative and modular your code is, the better agents can work with it.
What's next: MCP
The next step in this journey is the Model Context Protocol. We've started researching how MCP can help us build more capable agents — specifically around standardizing how agents discover and use tools, and how context gets passed between different parts of a system.
The pattern we built manually (role files, selective context loading, constrained tool sets) is essentially what MCP is trying to standardize. Instead of hand-rolling the orchestration, there's a protocol for it.
We're early in the investigation, but it feels like the natural next step from where we are.
If you're building with agents, here's the short version of everything I learned:
- Small models can select tools. They can't build them. Constrain the toolbox.
- One mega-prompt doesn't scale. Break roles apart and load context selectively.
- Small models can self-correct, but they can't reason across steps. Multi-step tasks need a bigger brain.
- Your code isn't agent-ready. If an agent can't operate on your codebase, the problem might be the codebase, not the agent.
- Give the model the right context at the right time. Not everything at once. Think cassettes, not encyclopedias.