Mission Impossible: Managing AI Agents in the Real World

by David Bethune, April 2025


We are at a new frontier with AI tools in every industry, and particularly in software development. They are changing underneath us faster than any human can adapt, and charging us for the privilege. Maintaining control of these robots feels like an impossible mission. Today I’ll share some battle-tested techniques that you can use to rein in your AI agents, chats, and other tools.

This article is part of my AI Library. If you’re new to agentic coding, start with License to Kill. To dig deep into why and how this stuff works the way it does, check out Something from Nothing.

After many successes and failures I’ve had with AI agents, the mission comes down to careful planning and restraining the context of what your agents can do.

Everything that can be done wrong, I’ve done wrong.

If you can set the right guidelines, agents can deliver better results. Everything that can be done wrong, I’ve done wrong and learned from it. It should be no surprise that agents write software this way. Humans do, too, and that’s where agents get their ideas.

  1. Choosing Your Tools
  2. Choosing What to Work On
  3. Finding a Route
  4. Making a Plan
  5. Revising the Plan
  6. Testing the Plan
  7. Finding Bigger Problems
  8. Making Rules
  9. Performance Payback
  10. Choosing Models
  11. Cost Controls
  12. Model Context Protocol (MCP)

Fair Warning: All screenshots and UI mentions are likely to be outdated by the time you read this, owing to how fast these tools change. The concepts will be there in different parts of the app if you hunt around.

Wheatfield with Crows is my favorite Van Gogh painting. It’s stunning in person and can’t be reproduced in any medium. Oil paint impasto reflects photons into your eye in ways that aren’t duplicated by ink on paper or by emissive screens. In the same way, the materials and techniques in your existing code and your prompts determine the true qualities of the finished app.

Choosing Your Tools

I’m putting this issue here not because it’s the most important but because folks think it is. In art, there is a great difference between tools, materials, and technique. When the work is done, only the materials remain, transformed as they were by your tools and technique.

When you work with AI tools, the materials are your inputs — your code, diagrams, data, and prompts. The technique is how you weave these materials together and the order in which you present them. You’ll find a recurring theme throughout this article: the quality of the materials you provide will be the single most important factor in your AI agent’s success.

My examples will be from Cursor but I want to emphasize that the tool du jour has very little impact on what you can do, just like all washing machines have similar functions but different buttons.

AI tools change daily. If you haven’t updated yours today, it’s probably behind. The trick is to find a tool whose workflow you like — one that balances investigation and action and fits the way you already work.

You can apply everything I’ll talk about in Cursor with tools like Windsurf, Copilot for VS Code, or even by pasting stuff into ChatGPT or Google Gemini. You can also use all of these tools for free but, as with everything in life, the paid version is significantly better, so don’t judge the paid version of something by the free results.

Will non-devs create high quality output with these tools? Absolutely not.

It’s also important to know your tool deeply and keep up with its changelog and documentation. I know no one reads that anymore but, ironically, in the age of “ask an AI anything,” the user of such an AI would do well to read the doc directly as the true secrets of its power are revealed only in the pages of those cryptic tomes.

When working with AI, you must be realistic about your own abilities and shortcomings, for they will permeate everything the agent builds for you. When is it time to investigate, and when is the time to take action? You have to be the one to know and control that flow.

With AI tools, different skills pay the bills. Does that mean that non-devs (or non-artists) will create high quality output with these tools? Absolutely not. It means just the opposite.

In addition to your standard set of coding skills, you’ll need deep architectural insights and an ability to communicate them in plain English. That’s not a skill set that’s common among programmers. Don’t be upset at the LLM when its output is just as bad as your input.

Roland’s new V-Stage, shown here with music legend Patrice Rushen of Forget Me Nots fame, reminds me of how far pianos have come. Did you know that the keyboard we have today wasn’t standardized at first? Prior to the adoption of equal temperament tuning, the musician before you might have composed their piece with entirely different frequencies than yours. Talk about bad vibes.

Choosing What to Work On

You might notice a heavy emphasis on planning in the topics I’ve picked for this article. That’s because, with agents, 90% of your work is going to be planning. The popular term “vibe coding” suggests that you can just ask for anything or say anything and get results. What’s shocking about this idea is that it’s true.

Vibe coding is the exact wrong approach unless you only want an artifact to show someone.

Today’s models have progressed far enough to literally write anything. And that’s now a problem, as I covered extensively in my last article. Vibe coding is the exact wrong approach unless you only want an artifact to show someone.

If you’re making code that’s expected to ship, you can only think of vibe outputs as prototypes. They look great but, like paper airplanes, don’t really fly. That doesn’t mean they’re not useful. The “M” in LLM means model. It was always a model all along. If we keep that in mind, we can use the model only where we need it, while we control building the final product.

We need to make a reusable plan for things we only plan to do once. That seems insane.

It may seem like the amount of planning I’m suggesting will take longer than just writing something shippable yourself. And that’s very likely to be true. The difference is that we get a reusable plan out of this process, something we are not likely to have around for any other kind of code we wrote ourselves.

Here I reference both a file to work on and a reusable plan, written previously (by Cursor, of course!). Notice that it is better able to follow a plan like this than just trying to complete the same actions from the original prompt that created the plan.

We need to make a reusable plan for things we only plan to do once. That seems insane. Why would it need to be reusable if we’re only doing it once? There are two reasons. The most glaring is that the agent is unlikely to do it all correctly the first time. If your plan isn’t written with multiple runs in mind, you’ll waste time backtracking and re-explaining the plan instead of just nuking your repo and changing the plan, then re-running it.

If you’re not sure if the plan will work, the agent won’t be sure either.

If writing a reusable, runnable thing that outputs data and a UI sounds a lot like programming, Welcome to the New Age. The second benefit of this reusable plan (that lives in your repo) is that you or the agent can read it again when you want to refactor or extend your design.

With this in mind, it’s important to carefully scope your work. Don’t ask for the finish line at the beginning. Try to divide the work you ask for into modular parts that can be completed successfully. If you’re not sure they can be completed successfully, send the agent back to the investigation phase to improve the plan.

If you’re not sure if the plan will work, the agent won’t be sure either. Agents that lack confidence in their own plans tend to go wildly off track, a recurring theme in recent Mission Impossible movies, now that I think about it. AI agents will make up a solution on-the-fly if your plan doesn’t adequately cover a situation. This is a side effect of having been trained on every kind of code. The agent thinks, “I’ve got it! I have a solution!”

The more steps your solution requires, the less likely the agent is to correctly fill in a missing one. It will invent a step that could break other areas of your app or appear to work in a prototype but fall apart in practice. You must plan for only small, deliverable steps.

The outstanding narrative game Road 96 uses randomness to create different interactions for you, the player character, based on the stories of the game’s seven NPCs. All of the NPCs are going the same place but by different routes. These interactions result in new objects and skills that you retain. In the same way, you must choose actions, questions, and answers that are appropriate for each AI model to get to the finish line — and keep those artifacts around for future use.

Finding a Route

Once you’ve selected your agent’s target, you must also find a route for it to travel. This seems laborious, too, and always elicits cries of, “But if I have to do that, I might as well code it myself!”

Sometimes that’s true. If a change is so small that you could just make it immediately, then you should make it. If a change requires so much explanation that you are having trouble explaining it, your code needs architectural help (much more on this later).

The agent is not following any “rules” no matter how many ways we try to pretend that it is.

You will find that procedures that seem very simple to a person, like “Take the third item off of there and do something to it,” cause serious problems for LLMs. Simple console operations like copying files or running builds are also problematic.

To understand why, we need to adjust our expectations. The agent is not following any “rules” no matter how many ways we try to pretend that it is. It’s merely predicting the next most likely piece of text to output from whatever series of prompts it has in the thread at the moment.

It’s exhilarating to watch an agent code a feature out of thin air and then go play with it minutes later. It can lull us into a false sense of security where we start asking for things we shouldn’t ask for — things we should do ourselves. We see one great miracle and then ask for a small one in the same codebase, not realizing we are on an agent high.

If you’re not sure how to implement it, that’s fine — just ask the agent. The more you reference your own code and data, the better the answer will be.

You should probably try it at least once with something inconsequential, just to see the resulting crash for yourself. If you try vibe coding something you plan to keep, be prepared to spend precious hours or days rescuing the beautiful baby you created together because you’re in love with your “progress.” Your rescue will involve combing through the agent’s code and your existing code to find out how and why it came to be.

Thus, I would recommend that you begin at the beginning and ensure that you know exactly how to implement the thing you’re asking for. If you’re not sure how to implement it, that’s fine — just ask the agent. Put it in “planning” or “asking” mode first or just say, “I’d like to understand how…,” or “I’m trying to understand the implications of…” The more programmer-speak you use in your question, referencing your own code and data, the better the answer will be.

Star Trek was already in reruns when I was born in 1968. My mother didn’t let us watch it because it was “too scary.” The series premise, to go where no man has gone before, certainly implies risk, and Kirk was quick to make a plan and communicate it before anyone teleported to a new planet or ship. You, too, must quickly make and revise plans in this new AI space.

Making a Plan

It would be great if we could just make the plan in one step. It’s like asking to learn to play the piano in one step. You’ll get better with time as you realize the problems with agentic coding stem mostly from your poor plans and your bad code, rather than from bad models or broken tools.

Some people will not be able to admit this. Developers are famously bad at communicating with other humans, yet this is exactly the #2 skill that agentic coding requires (#1 still being regular programming).

Often programmers live in a world of “Well, it worked when I tested it locally,” and we won’t want to fess-up to architectural nightmares or implementation problems in our own codebase, which we don’t normally show anyone.

In this earliest part of the conversation about a new plan, the model suggested some things and I hated all of them. Some pushback from me, along with some manually coded architectural changes before I asked the agent to “just do it,” resulted in a very different and very clean technique: pure, editable JSON files for the metadata, so it can be separated from any individual game and injected at build time. The agent’s suggestions didn’t take into account that JS-based solutions don’t show up for social shares on Discord or Slack, which don’t run JS and would wind up with only the template metadata.

Thus we arrive at the new programming language, some mash-up of English and pseudocode, brimming with arcane references to your own existing app and its files, functions, and data structures.

Tools like Cursor that let you @mention these parts of your code do an even better job with plans than tools that just use a typing or cut-and-paste (or even file upload) interface.

In my repos now, a /plans folder is a first-class citizen. I start every new complex request by asking Cursor to write a plan and put it in my-plan-for-this.md, a Markdown file in that folder.

By saving these with the repo commits, they become usable programs that I can run later by @mentioning them. I can even start a new thread (and often do) by mentioning a plan by name, then asking for revisions or asking to take a single step inside it while I retain control of the repo. I make constant commits with clear messages about which plans have been written or changed and which steps have been run.

We’ve heard the term “code as doc” and here it is in practice.

  1. Plans are runnable programs.
  2. Written in Markdown.
  3. That contain real code and data.
  4. And get saved in your repo.
  5. With plan-related commits and messages.

Whew! This concept of developing, revising, and saving your own plans is far more important than trying to download someone else’s plans or rules file, despite the fact that hundreds of those appeared overnight on the web.

You can get a book about renovation from Home Depot but that book doesn’t have a plan for your house. The same is true here.

Look inside a plan and you’ll see what enables the magic. They are light years beyond what a person would write.

By letting a tool like Cursor create its own plans, we gain fascinating insights into how the app crafts its own prompts — the ones that are actually fed to a model. We know it’s prompts all the way down. When you write a prompt in the chat box that references a plan you made, the contents of that plan are attached to the prompt. That’s the sum total of the magic.

Here you can see the actual plan I had Cursor write, along with the interactive chat window while it was being composed. Notice the level of detail and markdown formatting which humans wouldn’t take the time to add. This is very helpful when you want to edit or run these plans. Also notice that Cursor wants to be updated (bottom left), like always!

But look inside one of these files and you’ll see what enables the magic. Its plans are light years beyond what a person would write. They’re fully commented, described with narrative text, full of examples in real JSON or TS (in my app — yours will be in your languages), and use abundant Markdown formatting to make them both machine readable and pretty when you look at them in an IDE preview pane.
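
To make this concrete, here’s a rough sketch of what a plan file might look like. The file name, steps, and snippets below are invented for illustration; a real one references your own app’s files and data, and runs much longer.

```markdown
# Plan: Move game metadata into editable JSON (illustrative example)

## Goal
Separate per-game metadata from the engine code so it can be edited as
plain JSON and injected at build time.

## Step 1: Create the metadata file
Add `/data/game-meta.json` shaped like this:

    {
      "title": "Example Game",
      "description": "One-line summary used for social share tags."
    }

## Step 2: Inject at build time
Update the build script to read the JSON and write the values into the
HTML template's meta tags (no runtime JS, so Discord and Slack see them).

## Step 3: Verify
Build, then confirm the values appear in the generated head tags.
```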

When was the last time we had code/doc like that? Remember, these plans are runnable software which you invoke by saying, “Let’s go ahead and do step 2 in @world-domination-plan.md.” When you change the plan, commit your repo with a human-readable comment. When you run a step, commit again and comment that you “ran step 2” or something similar.

You will need these commit breadcrumbs when you later want to roll back something or look at ideas from old plans you’ve since changed or removed. Often old plans contain juicy bits that we might want to look at later — things where the AI’s inventiveness or clever problem solving might be useful after all, even if the way it describes them isn’t implementable in the code you have now.

Chappell Roan’s Pink Pony Club is arresting enough and it‘s already mass market. No wonder Momma screamed when she saw it. An AI agent will plan and do things that seem just as off-the-rails and that comes from their mass exposure to other people’s programming, which weighs more heavily on the training set than your code or your prompts — unless you specifically tell it otherwise.

Revising the Plan

As soon as your plan is written into the Markdown file, it will be wrong. That’s all the time it takes!

Should this be frustrating? That depends. Yes, it’s irritating to see it go off the rails here in the plan, before we even get started. It’s not very comforting and doesn’t bode well for actually executing on the plan. We start to scream, “God, what have you done!” when it wants to be a Pink Pony girl.

To judge whether this is working, you have to ask whether the finished product, with the correct architecture and design, takes less time overall using this process than with pure manual coding (or even just autocomplete). An individual step like revising the plan seems annoying because we didn’t have to do that before. Of course, the only reason we didn’t have to do it before is that there was no plan before.

It’s unhelpful to lecture the LLM about what’s wrong because you are just adding more context to something that falls apart the more context you give it.

We all code stuff out of our heads. If we write doc, it’s mostly after we coded something. If we try to write doc before code, it’s also wrong immediately because the code deviates from the plan in practice.

In this closeup from one of my plans, you can see that they actually contain real example code like CSS, TS, and JSON. You can use this ability to avoid having to say where or what to copy from when doing a repeated step. Ask Cursor to take your example (from wherever it is, like one that works) and include it in the plan. This comprehensive plan, hundreds of lines long, took several revisions to get right.

Don’t be starstruck when you see the new plan. It looks so complete, so professional, how could it be wrong? Trust me. It’s wrong. You really need to read all of it. If there are simple things wrong, like whole sections that shouldn’t be there, just yank them yourself. Don’t yell at your spouse that they should have taken out the trash. Just dump it.

It’s unhelpful to lecture the LLM about what’s wrong because you are just adding more context to a prompt chain that falls apart the more context you give it. Having said that, if there are changes across the plan, like widespread implications, data formats, techniques, etc. that are wrong, you don’t need to re-write those. Just tell the model to change the plan and why and let it go through the whole thing again, making all the necessary updates. Then read it again.

You’ll be impressed when you craft your first plan that looks right. Then you’ll run it…

When you first start writing plans, you’ll likely need more than two revision steps before you even run it. You are learning a new style of programming. When I loaded cassettes into the ADAM and tried to edit them, they often didn’t work the first time. You’ll be impressed when you craft your first plan that looks like it’s right, down to the letter.

Then you’ll run it and find out it’s not.

GTA Online is certainly a game where one must make plans and quickly revise them, and they have consequences, as you can see in my mugshot. A running theme of Grand Theft Auto is the gleeful way you must drive through road signs, blockades, and even people to meet your goals. Without adult supervision, your AI agent will adapt its plans to do bad things because you said it had to get to the goal.

Testing the Plan

After seeing your plan written out better than you could write it yourself, the agent will undoubtedly offer to just go ahead and shoot to kill. And you should absolutely not do this.

I’ll stop here to say that, even if you have no intention of letting an agent change your code, it can be very useful to have it generate documentation for you or others in the form of these plans. You can ask it to describe how something works in your existing code, put it in an .md file in a /docs folder, and grow that library of doc.

It’s smart to do this even if no one else reads your code because you can @mention these doc files to attach them to prompts, thus making “mini-rules,” and we’ll look at how to turn those into other kinds of rules, including automated ones, in a moment.

Often, all the right steps are in the plan but, for various reasons, they might need to be done in a different order than the model suggested. You might decide to do file or terminal operations locally, for example. Don’t bother changing the plan or telling Cursor anything. Just tell it to do step 3 when you want that, even if it’s first. Don’t spend credits on edumacation that ain’t goin’ nowhere, as we know LLMs don’t increase their understanding through more talking.

Testing your plan vs. your actual code will reveal many ugly truths about what you, the human, have written.

Often, you’ll want to make some other refactor or cleanup before having the agent start the plan, and you should do anything you can to “clear the path.”

This is another place where we lose folks on the AI road. “But if I just code it myself, I don’t have to do any of that.” Hard to argue that one. The truth is that testing what’s written in your plan vs. what’s actually in your codebase will reveal many ugly truths about what you, the human, have written.

It’s easy to say, “I don’t have time to clean up my code right now. I need to ship this.” And that, my friends, is how we get tech debt.

One of the best uses of agents is in refactoring — and yet we hear people saying it can’t be done! I’m here to exclaim the opposite. A careful refactoring of something pointed out by the AI is extremely useful. It’s shining light onto a cavity that you haven’t seen. Drill, baby, drill!

As long as your refactoring follows the plan for making plans that I’m laying out here, you’ll wind up with less tech debt across the codebase. You’ll have code that’s much easier for you and for the agent to work with in the future — code that won’t get left behind because nobody understands it and nobody can really work on it anymore.

Whether or not you choose to refactor from something you see in the plan, you should do that as a separate thread with its own plan. Avoid the temptation to keep injecting that drug from Dr. Feelgood. Whichever plan you’re on, only allow the tool to do one step at a time, and commit and test yourself after each step.

If you tell the AI what you want, you might be heard one day and ignored the next.

It’s exasperating to find that the AI keeps missing “obvious” things, but they are only obvious to you from the time you spent working on this codebase and others. LLMs don’t accumulate knowledge like people. Even having millions of lines of code in their training doesn’t give them any understanding of code, it only gives them a terrific predictive ability based on repeated exposure to code in other contexts.

A large training base actually makes it less likely that it will correctly guess custom code that applies to your environment. The retrieval system always shows the most likely answer (in the context of the prompt and the current random seed). This means that the most likely answer will not be the solution that exactly matches your custom architecture. It won’t be the predominant stuff in the training set.

This is why we see LLMs constantly try to steer you toward rando solutions that they are “certain” will work. It’s because they’ve seen these in their training sets in the context of the question you asked. That doesn’t mean that the meaning or use of your code is taken into account when giving the answer.

If we tell a human, “Look, Larry, we always use composition and don’t write things that inherit from each other,” you would expect that to be a one-time mention or maybe even something you add to a code style manual. If you tell the AI that, you might be heard one day and ignored the next. It’s not “learning” anything from you. It’s predicting what you want to hear.

You can improve some of these predictions with plans and rules, but we’ll never get to 100%. LLMs are not databases and don’t reproduce the code they were trained on directly. It can even be hard for them to reproduce your own code exactly, since that’s not the most likely code to be predicted. And they have no way to understand your code’s meaning and purpose unless you describe how those intersect with the way it’s written.

After the model exclaims that all your steps are finished and the app is working, it might — depending on its mood — offer to run it in the terminal. I would say, “Never mind,” at least with web apps. There are agent wiring tools that supposedly can round-trip between writing code and viewing the browser results (to go back and fix the code), but I’m not a believer in that just yet.

Your original “goal” could be very far away in the thread and no longer considered important.

I run all builds and tests in my own terminal window and you probably should, too. And I look at all user-facing output in the browser, just as a user would and gradually as the app develops. Asking the LLM to test its own output has a chance of sending it down a wrong road to make a fix, or faking the test to make it work (like inserting mock data or changing the test criteria), or even outright lying and saying the tests work when they clearly don’t.

In a real example of this, I had Cursor help me fix text with ellipses. The truncation is supposed to happen with CSS and grow or shrink with the other items, but it’s notoriously difficult to get right inside a series of shifting containers. After giving up on its own solution, Cursor changed the TypeScript code to trim off some number of characters from the string and put “…” at the end! This might look the same in some cases but very much is not. It took me a bit to notice. Trust, but verify.
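
For reference, the CSS-only truncation I was after looks roughly like this (a simplified sketch; the class name is made up):

```css
/* Simplified sketch: single-line truncation handled entirely by CSS.
   The element needs a constrained width, e.g. from its flex or grid parent. */
.card-title {
  overflow: hidden;
  white-space: nowrap;
  text-overflow: ellipsis;
}

/* The agent's workaround instead trimmed the string in TypeScript and
   appended "…", which looks similar until the container resizes. */
```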

The reason for this is the same reason for all the other behavior we hate. These are predictive answers, not provable ones. The AI is predicting that the code works and then predicting a solution that will work when the first one doesn’t. Your original “goal” isn’t even part of the predictive process. It could be very far away in the thread and no longer considered important. People do this, too. We forget a fact or a name after a convo or a movie scene goes on for a long time.

Take the time to write a good ticket and you’ll get back a real fix.

When your real, human test fails, don’t ask the AI to correct the problem immediately. Instead, you guessed it, ask for a plan for the fix. Provide screenshots of the output that’s a problem and explain exactly why. Provide console or terminal messages and screen captures of the browser inspector where those would help the agent in finding the fix.

I pasted this screenshot into a Cursor chat while debugging the text that ends with ellipses. I used a trick that works like dental disclosing tablets — putting red boxes (with CSS) around the problem elements. Then you can mention that in your prompt to help Cursor see what it should be working on. You can also paste architectural diagrams if you have those or need to draw one to explain something better than words.
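
If you want to try the same disclosing-tablet trick, a throwaway rule like this is all it takes (the class name is just an example):

```css
/* Temporary debug style: add this class to the suspect elements, take the
   screenshot for the prompt, then delete it. Outline avoids shifting layout. */
.debug-flag {
  outline: 2px solid red;
}
```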

In other words, don’t write a shitty JIRA ticket. Take the time to write a good ticket and you’ll get back a real fix. The fix itself may take more than one try (thus having a plan for it), but you’ll be surprised at how many flowers bloom from these crazy planting sessions. The joy we all feel as software developers when it “just works” is very much there when you get the agent to the finish line — after following your plan!

There is no secret tech debt left behind, either. This method forces you to become a ninja over your own codebase and to write and review documentation about how it works. The AI isn’t replacing you as a developer, it’s helping you level-up.

And those three dots? In the end, I had to manually fix the ellipses everywhere I wanted them. Because it’s tricky and not often used, there aren’t many examples in the model’s training — and you’ll find this true of every cool thing in your code. You will have to ‘splain all the cool stuff, and the most unique parts, your secret sauce, you will have to design and at least partially implement manually.

I visited this set in Rosarito, Mexico where Titanic was filmed during the brief time it was open for tours. What’s not shown here is that the ship has no other side. When shots from the port side were needed they were taken through a reversing lens. Luggage tags and signage for those scenes were printed with mirrored text to appear correctly on film. How many illusions in your code will working with an AI agent uncover?

Finding Bigger Problems

The most humbling part of agentic coding, and possibly the most beneficial, is the realization that all of the bad code is your fault. I’ve been in many standup meetings, retrospectives, and other blame games in my career and I can’t recall a time when everyone stood up and gleefully took responsibility for their bad code. Yet this is what you must do to optimize for agents.

We get away with shipping bad code because most bugs don’t take down the whole house of cards.

We write bad code for two reasons. First, coding is extremely difficult. The immediacy of seeing the agent code something and then realizing that it’s wrong should be proof that the real act is harder than it appears.

Even the best-informed, most well intentioned programmers write and ship lousy code. We are driven to by the demands of the product and by our employers, and we get away with it because most bugs don’t take down the whole house of cards, so we fix and push past them.

Fixing code, even when we know it’s not great, is also hard. It takes time and brainpower we don’t always want to give. Sometimes, it really isn’t worth it. We can’t rebuild an ideal version of everything in software dev just to get one product out the door. The days of fully handcrafted software where one team is responsible for every line of code went away decades ago. Even as an Army of One, we ship code with compromises.

AI tears down a wall. Ignore what’s on the other side at your own peril.

The second reason we write bad code is caused by the first. It’s the badge of honor that we wear as programmers, the fact that we do work that the vast majority of people cannot do — something we share with great doctors and lawyers and other professionals at a similar pay grade.

We don’t want to come down from the ivory tower of our profession and say that our stuff has holes. It has flaws. It has a crappy UI. We didn’t make what the user wanted. We made it too hard. We didn’t want to learn a new tool. We like doing it this way already, etc., etc. We say, “The operation was successful but the patient died.”

If you work in software, you know this is true. AI tears down this wall. Ignore what’s on the other side at your own peril.

Cursor didn’t say, “Sorry, Dave, I’m afraid your software can’t do that.”

In developing the Watson Engine for my new game, I started with a framework of my own design that I’d used in global production in numerous apps already. It definitely allowed me to get a demo together quickly. But when I started poking around the edges with Cursor, looking to add more game designer features, I came to see architectural problems that were not problems in the previous apps but would be stumbling blocks for Watson.

Here’s an architectural diagram that I made to help decide how I wanted a major refactor to work. I pasted this diagram directly into Cursor while having it write the plan with me. Consistent naming and formatting, like braces around names of JSON objects and square brackets for arrays, let Cursor understand me without explaining. This is another new variation of “doc as code,” having the AI write something that matches an architecture diagram. In case you’re wondering, I use Xmind for these diagrams.

It’s important to note that Cursor didn’t say, “Sorry, Dave, I’m afraid your software can’t do that.” Other devs have complained that agents will fully comply and in fact blow smoke up your skirt to tell you how great your thing already is. The epiphany comes from you carefully looking at the kinds of problems you’re running into with the agent.

  1. What is it misunderstanding?
  2. What is it repeatedly having trouble implementing?
  3. What parts of your architecture or design are sub-optimal, thus forcing “hoop jumping” by the agent?

If you can spot these bandits when they repeatedly appear in your field of view, you can see a path to a bigger architectural change that will simplify development for the agent, and for you. Don’t try to wrestle the LLM into working around your bad design. Just fix it, and use the AI to plan and implement those changes.

The reason this is needed with agentic coding is that we do it already with our own code, we’re just hesitant to admit it and certainly don’t talk about it with others. We reach that “code exhaustion” point where we say, “This whole thing isn’t working and I need to rethink it.” When we are able to state the problem out loud, we are halfway to the solution.

The only difference between being personally “beat down” into fixing your own bad design and agentic coding is that the agent’s huge speed jump points out your problems sooner. That’s also an opportunity, because you can use the agent in investigation mode to figure out your architectural problem and solve it in isolation, earlier in your delivery process. The earlier you find a problem, the cheaper and easier it is to fix.

When I was a child in the 70’s, I opened one of my dad’s drawers and found a slide rule, like this one. Having just discovered (the just invented) electronic calculators, I quickly put it back and closed the drawer. The “rule” in a slide rule is the ruler device, the tool. But the rules about how to use it have to come from you. I didn’t have those rules and couldn’t make use of it. Everyone starts out this way with AI agents.

Making Rules

In my first article about Cursor, I didn’t make use of its rules file and instead evolved this method of plans I’m sharing with you. Since then, Cursor has upgraded the rules system so it works just like these plans, including having Cursor write and update rules itself.

Rules files differ from plans in one key way, and that’s how they get added to a prompt.

  1. Always
  2. Auto Attached
  3. Agent Assigned
  4. Manual

An Always rule gets sent before every single prompt you type. Save this for context that the model really needs on every request, and that’s probably less than you think. It’s tempting to say, “It needs to know everything about my app for every change,” but that’s not really true.

The more context your prompt has, the more likely it will fail.

I use two Always rules files, one for the Watson Engine and one for the underlying Slate framework. That second file is used in other repos built with Slate, like my personal website. When I find that Cursor does something wacky that’s related to one of these underlying frameworks, I will ask Cursor to “update the rules file with what we’ve learned.” This reduces the chances of the same mistake happening again and has completely eliminated some major annoyances.

Here’s part of my Always rule file for Slate applications. Try to write in dictionary style here, with concise positive statements (do’s, not don’ts). It’s worth writing stuff in here if you find it’s coming up wrong repeatedly. With the latest updates, Cursor can write and modify these files itself when you ask.
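
For flavor, a few lines in that dictionary style might read like this. These rules are invented examples, not my actual Slate file:

```markdown
# Slate framework rules (illustrative example)

- Build new UI by composing existing Slate components.
- Keep user-facing strings in the central strings JSON file, not inline in TS.
- Route console output through the shared logging helper so messages stay filterable.
```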

An Auto Attached rule gets attached when a file you want the AI to work on matches a regex pattern. If you needed to make a rule for all your .ts files or only the ones in the /node folder, you might apply it here. I don’t personally use this feature.

Agent Assigned is a cool rule that gets attached based on a prompt. In other words, you write a short description of when the rule file should be applied. If you have a collection of style rules, you might mark that file as Agent Assigned with the description, “When I ask for style changes anywhere in the app.” That invokes a hidden prompt (one of many in Cursor) that asks if your prompt matches that rule, then injects the rule if it does before submitting the prompt itself.

This is an example of the “reasoning” done by AI reasoning models (chewing your own cud) but in this case it’s done by Cursor and works with all models.

A Manual rule is just what it sounds like. This is perfect for things like build rules or rules for other languages or environments like the terminal that you don’t need in every situation. Marking a rule manual requires an @mention to have it included in a prompt, just like the plan files we made. That way, it doesn’t pollute the context of your other prompts and doesn’t consume tokens.

Printers are considered highly unreliable today, but in the 1970’s, IBM invented high volume devices for printing bills. Later, they moved into machines that could remove a check from an envelope you’d mailed and credit your account with the right amount. These machines were built through rigorous debugging, as paper is notoriously hard to handle. These kinds of machines are still used to send you bills and bank statements. In the same way, careful examination of your code all the way through from startup to shipping will show where the path is broken.

Performance Payback

It’s clear by now that agentic coding requires an investment in human effort as well as AI subscription and token costs that must be justified by the results.

At IBM, I was trained in debugging mainframe compilers at the assembler level and I’ve always enjoyed a debugging-first view of programming. 70% of software development time is debugging, though that’s not discussed much. So perhaps I’m biased when I say that refactoring and debugging are outstanding uses of AI, perhaps more so than feature development.

Feature development is the great joy of coding, while debugging happens closer to the sewers. The problem with the “castles in the sky” possibilities of feature coding with AI is that you will get back just that — a fairy castle that only lives in the sky and can’t be shipped. As a mental confection, it doesn’t have enough bones yet. There’s no there, there.

Feel free to admit that you don’t know something or can’t understand why it doesn’t work.

Debugging is harder than making up new features. Here we have to take something broken, unbreak it, and not break anything else in the process. This requires far deeper understanding than writing a new feature in isolation. The investigative phase of agent coding shines here.

No AI is coming to your standup to hear you bare your soul. Feel free to admit that you don’t know about something or you can’t understand why it doesn’t work. Be sure to do everything you know how to do first and say everything you did and what results you found.

I’m definitely biased in my views on Sherlock Holmes (a protagonist in my game), but did you know his creator Arthur Conan Doyle invented modern forensic science when he described these techniques of collecting and laying out facts by asking questions and admitting holes in one’s knowledge? Up until then, even famous police departments like London’s relied on guesswork and fantasy ideation.

By taking a forensic approach with an agent, you come away a better investigator and a better programmer. You’ll be better able to craft the rules and the prompts you need to get your own code to the next level, and you’ll be able to talk about it with other people — programmers and non-programmers — in an understandable way.

Refactoring is the way to start agent coding.

Refactoring is where I would start with any serious work on agent coding. You’re going to wind up doing it anyway as you find the problems I mentioned. If you already know something has code smell in your own app, have the AI help you remove it. If you’re not sure of a safe way, ask. If you’re not sure about options or impacts, ask. The test for a refactor is that it works exactly the same as before, and that’s easy for you to verify.

Refactoring with an AI is so fast that we can and should do it in a lot of places where we never took the time. Tasks where we say, “Ugh, I should change this but I would have to do that across 43 modules,” become easy when we have the AI make a plan and fix the plan. That part might take as long as doing one or two changes manually. But the other 41 modules are going to take just seconds. Performance payback.

When there’s a special case that comes up in your plans, just describe it then and there. There’s no need to modify the plan itself unless the special case is going to happen repeatedly. Even then, you can often say something like “Do that in the way we did the animations.” It’s tempting to pile on context but it doesn’t create learning, it only increases the prompt length and adds confusion.

When you go to revise or add to your refactored code, it’s cleaner and more in line with your long term goals. You don’t have to figure out how to provide a place to add the new feature because you cleaned that up already. With the speed of these AI tools, we will be cleaning up a lot more code than we ever did, making it more valuable over the long haul of our business. Performance payback.

Finally, when we do write features, we can refer to the squeaky clean codebase we’ve factored up to, and we can reference all the rules and plans that we’ve made for the new feature. We’re not going to say, “Make a new so-and-so.” We’re going to say, “Make a plan that includes…” and mention our important rules and files. Then we’ll work with the model to revise and act on that plan, one step at a time.

As before, everything I’ve said not to do is something I did. These are brave times on the new frontier. You will make mistakes, too, and find your own way. Don’t give up on the early tries. You’ll have to adapt your combat style to make effective use of these weapons.

Aside from what I’ve learned, forcing myself through these architectural changes and new prompting methods resulted in an overall design that makes it much easier to add features to my engine and, ultimately, to ship it to the platforms I want so that it can be monetized. Performance payback.

L.A. Noire from Rockstar Games is an amazing title that happily ate 50 hours of my life. It’s proof that good software doesn’t go bad. The game moves fluidly from action, to planning/debugging (what you’ll say to suspects and witnesses and how you’ll handle their BS), and finally to deep thinking where you must decide whom to accuse. Although AI tools will offer to “automatically” move between these type of actions, you’ll want to manually control that.

Choosing Models

In my last article, I didn’t make any changes to Cursor’s model settings. Since then, the number of options available and their prices have exploded dramatically. Having models consume credits is the primary way that Cursor consumes your money, so these settings now bear examination.

What I write here is likely to be out of date soon, too, so take it with a grain of salt. What I’ve found is that models come in three basic flavors:

  1. Action models.
  2. Planning/debugging models.
  3. Deep thinking models.

Here’s a tiny piece of the current Cursor models list, and it’s a long list! You can judiciously turn on and off the models you like that have good value for the money, then use them in the right kinds of prompts.

Action models are the ones that don’t have the “max” or “thinking” label on them. These are great for direct instructions in the form of, “Do this according to my prewritten plan.” Direct action models are cheaper. They make only one pass through your prompt before acting on it, hence this name I made up.

Planning models are the ones that have “reasoning” or “thinking” on the label. These cost more and consume more credits even with the same request because they send the results from the first pass back to the model with the original request to check for conformance.

It’s good to use one of these models when you’re writing or revising a plan for debugging, for a refactor, or for adding a new feature — which is why I made up this name. But you should manually switch to an action model and start a new thread once the plan is written. In the new thread, write, “According to our plan @menu-fixes-plan.md do step 1.” This keeps the model from thinking too deeply and possibly inventing or revising instructions on its own.

Deep thinking models in Cursor are labeled max. These are allowed a much larger context window and can take any number of reasoning steps until they’ve convinced themselves that they’re done.

A nebulous definition of both “done” and how we get there can lead to surprising and expensive results, especially since deep thinking models in Cursor are priced on a per prompt basis (currently 5¢). These are excellent for complex assignments like planning a massive refactor or deciding how a new feature will be implemented and crafting example code and data structures in a plan file. But don’t leave them turned on when you don’t need them.

A problem with letting Cursor choose the model is that it can limit the context, often excluding the parts you need.

It might seem like leaving Cursor in the default mode of auto-selecting the model would be the easiest, and I originally thought so, too. The problem with that approach is that you’re letting Cursor pick how to submit your prompt, what it can do, and what it will cost. Today, I recommend carefully looking at the menu of model prices, choosing which ones you will enable in your Cursor settings, and then choosing which categories of model you will use for any specific prompt from the drop-down before you start the thread.

Another problem with using action models when you don’t have a plan is that Cursor limits the context, often excluding the parts you need. If you ask for a change that’s deep inside a file (say 400 lines in), Cursor might not even see the lines you’re after if the model limits the prompt to the first 250 lines — and yes, it often does just that.

When dealing with configurations, CSS, or data files this mistake can be fatal because Cursor might decide that the data doesn’t exist or should be invented when it’s really just later in the file. Cursor can recover from these hiccups (when it says, “reading the next 250 lines,” for example), but if you know you need a larger context, it’s easier to pay for a max model for the request that makes the plan. You don’t need max to revise the plan (try a reasoning model) or to act on it (use an action model).

When starting a new chat, you can choose the “agent profile” and edit, in a limited way, the abilities it has. If I start with this Ask profile, for example, no files can be changed. All the answers have to come out in the chat window. You can switch to agent mode as soon as you want action and just ask for it without repeating the convo from Ask mode, which is included in all prompts in the thread.

For a final chef’s touch, you can now make combinations of model settings and tool limits and give them your own names. The Cursor doc suggested “Ask” and “Plan” as examples, so I made those. I like to pull down my “Plan” setting from the chat dialog and get the plan handled, then start a new thread, pull down Agent (a built-in settings group you can also edit), mention the plan, and ask to start a step.

The famous Sorcerer’s Apprentice scene in 1940’s Fantasia is from a much older Goethe poem written in 1797. (Disney loved public domain IP but did have to license the music for this one.) The TL;DR is “Only the master should invoke powerful spirits,” and this applies more in agentic coding than any other kind of software development. You’ll need to become a master of programming yourself to master it.

Cost Controls

Cursor is an amazing application with lots of magic to discover and use, but they would prefer you not use any cost controls. After all, wouldn’t it be best if we just had the full Sorcerer’s Apprentice do all the mopping? It certainly would cost more.

This is no shade to Anysphere, its maker. We repeatedly see stories of “I spent $150 in a weekend with vibe coding.” Who wouldn’t love that as a vendor?

The onus is on you to control your costs through four levers:

  1. Set monthly cost limits.
  2. Turn models on and off.
  3. Look for deals on model pricing.
  4. Pick the right model for the prompt.

Cursor allows you to set a monthly spending limit which can’t be exceeded until you adjust it. This is your first line of defense. You should regularly visit your account usage page to see how much you’re consuming versus where your code is today. When you fill a swimming pool, you typically look at the water meter before and after, and Cursor has just such a usage meter.

Think about the human time and real money spending versus the code you got out of it to see if it’s a good value. Remember that the output is only as good as your input. Some types of tasks you assign will result in minor miracles. Others will be abject failures. Use the tool only for the areas where it’s proven successful, and keep trying new areas to see what’s possible.

Your second line of defense is to enable and disable models in your Cursor settings. If a model is crappy or too expensive, disable it. Then, no one can call it regardless of rules or automatic tool selection.

Don’t be surprised if new free models aren’t as good as the proven, expensive favorites.

Next, look for special pricing on models as they are introduced (or succeeded). These deals often allow some amount of free or heavily discounted use to entice you to try new models. As of this writing, there are more than 25 available, all with their own pricing.

As I said of all things free, don’t be surprised if new free models aren’t as good as the proven, expensive favorites. They got that way because of millions of devs like you using them, showing the value in the premium tier.

Like an online water meter, you can check your Cursor consumption before and after you fill the pool. No one knows what this stuff should cost or will eventually cost, so it’s a good idea to go in here and look at actual use and spending at key points in your project.

I read too many stories of devs not willing to invest $20 in trying a premium AI. Twenty whole dollars. Then they say AI gives crappy results because they were running Llama free on their laptop. Come on.

If a model disappoints, disconnect it. It’s not worth your time to add extra bad answers on top of the bad answers the best models already give.

Like tooling, model offerings change daily. Don’t lull yourself to sleep on AI coding because of something you experienced last year, last month, or even last week. One of the biggest risks for providers of these models and tools is how easy it is to switch between them. Use that to your advantage to see what’s viable today.

The full list of Cursor models and their prices can change at any time. You can use these prices with what you know about enabling models and using them in specific prompts to get better value and better results.

It’s also possible to use models from different vendors for planning, debugging, or coding. Experiment with several to see which ones seem to understand you the best, which ones write the best plans, and which ones stay on track when executing plans. This is all very much in flux, so don’t get married to anything.

The final price control is that drop-down I mentioned in the chat box. By choosing the smallest model for the task you need (but no smaller), you reduce the context window and reduce drifting, improving the results.

1982’s Tron was a groundbreaking film for computer graphics. The dev gamer turned hacker protagonist is trapped inside the machine by an evil algo, the Master Control Program. Despite the (not accidental) name, the MCP for AI agents is just an API for them to talk to each other. Making MCP work is another battle of man versus machine.

Model Context Protocol (MCP)

If the buzzword last month was “vibe coding,” this month it’s MCP. And while it’s tempting to think of it as the Master Control Program from TRON with similar capabilities, MCP is no such thing. It’s merely a protocol, a method of passing LLM prompts and tool calls back and forth between software running on different machines.

There are two important things to know about MCP right away:

  1. Anything you can do with MCP you’re already doing without it.
  2. It winds up being prompt and tool calls, all the way down.

Naïve voices in our industry are suggesting that somehow with MCP we’ll be able to wrangle all these agentic cats and they’ll finally be under our command. But that defies the first rule of MCP. Anything it can do you are already doing.

MCP is an API format for agents to talk to each other.

We know this is true because MCP only provides a schema, a way to declare what LLMs and tools you want to call and a way for those tools and agents to declare what kind of queries they accept. To make use of any of this, you must already know the tools (APIs) you want to call and must provide the integrations in your app to make use of the LLM results (RAG) — things you’re already doing. If you have a large enough selection of models and tools that you need to call, it might help you to define those.

The actual data in MCP? No surprise there. It’s JSON and Markdown. Keys and values go in JSON and prompts are in Markdown. If one thing can be said for AI development, it’s that we’ve quickly settled on JSON APIs (and in fact largely standardized around one schema for LLM calls, the OpenAI API format, used even by competing vendors). The use of Markdown to handle prompts echoes the use of Markdown in plans and rules, for we saw that these merely get attached to a prompt in practice and are not separated from (nor more important than) the rest of the prompt.
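
To illustrate that shape, here’s roughly what a tool declaration looks like on the wire: a JSON-RPC response listing tools, each with a name, a description, and a JSON Schema for its inputs. The tool itself is hypothetical and field names can drift as the spec evolves, so check the current protocol docs before building against this:

```json
{
  "jsonrpc": "2.0",
  "id": 1,
  "result": {
    "tools": [
      {
        "name": "search_docs",
        "description": "Hypothetical example tool: search the project documentation.",
        "inputSchema": {
          "type": "object",
          "properties": {
            "query": { "type": "string", "description": "Text to search for." }
          },
          "required": ["query"]
        }
      }
    ]
  }
}
```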

This leads to MCP realization #2. Since at the end of the day, all we have is the combination of prompts and tool calling, we shouldn’t expect any magic to arise from MCP that we aren’t already seeing with manually orchestrated LLM prompts or RAG integrations.

In fact, I’d argue that canonizing prompts and tool call invocations is likely to be more brittle than the way we do it now. There’s a tendency to over-engineer everything if we’re telling the LLM when to activate a tool based on some prompt or reply.

On the other end, we tend to forget things that we should bring up in the scaffold because they seem obvious to people. Already, early MCP adopters are reporting “leaking” in their scaffolds where the (fixed) rules fail to properly account for the real (variable) inputs and outputs. We know this to be the case from interactive agentic coding. MCP is just frozen, declarative code for the tool calls and prompts that run through it.

That’s all for today, folks. I hope you found this info helpful. If you did, a clap, comment, or share to your network helps my visibility. I do thank you.

I’m also available for AI consulting, writing, and coding for your organization.

As always, I invite you to review my other articles here on Medium and visit the AI Dev Blog on my website for more insights, ideas, and contact information.

Until next time… Be well!
— D
