The trouble with vibe coding: when AI hype meets real-world software

Ben Wilkes

Technical Lead
Data & AI

March 18, 2025

I still remember the first time I watched an AI agent churn out code—it was equal parts exhilarating and unnerving. Fast forward a few months, and here I am—a professional software engineer building commercial software entirely with agentic AI.

I’m a firm believer in what we call “agentic engineering”—a disciplined, systematic approach to AI-assisted software development. Meanwhile, “vibe coding,” a term Andrej Karpathy coined, has gained serious traction online, with people using it in ways he never intended. It might sound catchy, but its connotations and potential misinterpretations are concerning.

When vibes meet the real world

My concerns focus on the production of commercial software. Software that other companies pay for to help operate their business, that must be robust enough for production environments, and that will need to be maintained over time by multiple people, often in different teams. Software that needs to be readable, extensible, and testable. If you’re “hacking” out a personal project, vibe away! But in the professional sphere, we need more.

The code wall

It’s hard not to be tempted to let the agents take on larger tasks, gradually ceding control. In my experience, using these agents correctly can produce high quality code and dramatically boost productivity, but they aren’t yet reliable enough to build commercial software autonomously. At some point, for one reason or another, you—or someone else working on the codebase—will run into an issue, and you’ll need to understand the underlying code to fix it. We’ve seen cases where “something” breaks or, worse, behaves in a way that wasn’t expected when it was “fine before.” Trying to unravel such issues is as challenging as navigating a legacy codebase—essentially, hitting the code wall.

When codebases grow, so must understanding

The challenges multiply when codebases grow larger and more complex. You need to be more targeted with your prompting in such cases, which is impossible if you don’t understand the code! When teams of people work on a codebase simultaneously, it’s crucial that everyone is on the same page.

Same question, different answer

We’ve also seen many instances of the non-deterministic nature of these models in code generation. It’s the nature of our industry that there are always multiple ways to do something, and we’ve seen agents implement similar functionality in different patterns within the same codebase if left unguided. We’ve also seen agents make unnecessary refactors into new patterns as part of changes for other features.

The testing blindspot

Commercial software must be testable and must be tested. Yet, testing has often been misunderstood and, at times, sidelined in our profession. “Vibe coding” makes it an afterthought. Asking the agent to simply “include tests” doesn’t work. The tests will be brittle and they will break when you refactor.

At present, the agents' strength lies in generating, not verifying. I expect we'll see a lot of improvement here in the coming months, possibly including new testing and bug-identification agents. For now, we spend more time and effort prompting our tests than prompting the source code.

The discipline behind the magic

So, how do we address these concerns and use agentic engineering to deliver good-quality commercial software? In a word: discipline. We apply the same discipline, and very similar practices, to agentic engineering that we applied to manual engineering.

Put a fine point on it

To start with, you must give the agent clear, detailed instructions. This is probably where we spend most of our time. You must have detailed requirements; the more detailed, the better. We prompt a reasoning model (o3-mini, Claude Sonnet Thinking, etc.) to generate user stories, although the actual type of document doesn't matter; the detail and clarity do.
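
To make that concrete, here's the level of detail we aim for in a single story. This is a simplified, hypothetical example; the feature, roles, and acceptance criteria are made up for illustration:

```
Title: Export the monthly invoice report as CSV

As a finance administrator, I want to export the current month's invoices
as a CSV file so that I can reconcile them in our accounting system.

Acceptance criteria:
- An "Export CSV" button appears on the Invoices page, visible only to users
  with the finance-admin role.
- The export contains one row per invoice with columns: invoice ID, customer
  name, issue date (ISO 8601), amount, currency, and status.
- The export covers only the selected month; a month with no invoices
  produces a file containing headers only.
- Errors during export show a retryable message to the user and are logged.
```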

Small bites, not big gulps

Whether we’re creating a new application or extending an existing one, we break large problems down into small, clear, detailed problems. It’s not because the model can’t handle more; it’s so we can review and understand what it’s done. This point about small chunks is worth emphasising. As I said before, it’s hard not to be tempted to let the agents take on larger tasks. However, we’ve significantly improved our productivity and maintained quality by quickly tackling lots of small pieces of work and remaining in control of each one.

Using your house style

There are a number of agents out there that can generate software, and my current go-to is Cursor "chat" paired with Claude Sonnet. One of the features I love about it is "rules for AI," which let us define reusable code standards, conventions, and patterns. We have reverse-engineered our company standards into rules for AI. Again, these rules are detailed, clear instructions for the agent, leading to more consistent and deterministic outcomes.
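
As a sketch of what such a rules file might contain (the specific conventions below are illustrative, not our actual standards):

```
- Use TypeScript with strict mode enabled; avoid `any` unless justified in a comment.
- New API endpoints follow the existing controller/service/repository layering.
- Prefer small, pure functions; one exported component or class per file.
- Tests assert observable behaviour, not implementation details; no snapshot tests.
- Do not introduce new libraries or state-management patterns unless asked.
```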

Git has your back

We simply use Git to manage rapid changes, employing the same straightforward, disciplined processes we've used for years. We start by creating a new feature branch, review all the generated code with git diff, and revert any problematic changes. When we've finished, we create an MR and have a human peer review it.
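
In practice that loop looks much like any other feature branch; the branch and file names below are hypothetical:

```
git checkout -b feature/invoice-csv-export   # one branch per small piece of work
git diff                                     # review every line the agent generated
git restore src/reports/legacy-export.ts     # revert anything it shouldn't have touched
git commit -am "Add CSV export for monthly invoices"
# push and open an MR for human peer review, as usual
```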

Three’s not a crowd

I enjoy pair programming, and I now love pairing (or "swarming") with both a human and AI. I treat AI as a collaborative partner; it will surprise and delight me, but I'm guiding and reviewing its work. I know what good looks like.

Gen test refactor

Once on our branch, prompting is crucial. We start by prompting Cursor to generate the source code, then manually verify that it works. If errors arise (as they often do), we feed the logs back to the model for fixes. Should we get caught in a loop of constant changes, we know it’s time to rework our prompt or rewrite our requirements. So, we’ll revert and start again.

Next, we prompt the agent to generate tests—both unit tests and end-to-end tests. I’ve always been a massive fan of test-driven development (TDD) in traditional engineering, but in agentic engineering we don’t do TDD for unit tests; it’s not intuitive for us or the agent. We put the effort into crafting prompts and leveraging Cursor’s rules to generate tests that test functionality rather than implementation, that aren’t brittle, and that enable us to refactor or make changes with confidence. For end-to-end tests, such as with Playwright, it is possible to test first if you want to, as they are more decoupled from the underlying implementation.
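
As an example of what "testing functionality rather than implementation" can look like end to end, here is a minimal Playwright sketch. The route, role, and button label are hypothetical; the point is that the test asserts on what the user observes, not on how the feature is built:

```typescript
import { test, expect } from '@playwright/test';

// Hypothetical end-to-end test: the route, role, and labels are illustrative.
test('finance admin can export the monthly invoice report as CSV', async ({ page }) => {
  await page.goto('/invoices?month=2025-02');

  // Trigger the export and wait for the browser download to finish.
  const downloadPromise = page.waitForEvent('download');
  await page.getByRole('button', { name: 'Export CSV' }).click();
  const download = await downloadPromise;

  // Assert on the outcome the user cares about: a CSV file was produced.
  expect(download.suggestedFilename()).toMatch(/\.csv$/);
});
```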

The next step on our feature branch is refactoring ("Gen Test Refactor," if you like), similar to "Red, Green, Refactor" in TDD. When it comes to refactoring, we know what good looks like, so we'll ask the agent to make specifically targeted changes. Asking an agent simply to "refactor" leads to unexpected results because, again, you haven't been clear enough.
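
For instance, the difference might look like this (the file and function names are hypothetical):

```
Too vague:  "Refactor the invoice module."

Targeted:   "In src/invoices/export.ts, extract the CSV row-formatting logic
             into a pure function formatInvoiceRow(invoice), add unit tests
             for it, and change nothing else in that file."
```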

Demos dazzle, but production matters

Another area worth mentioning is the prototype-to-production pipeline. When it comes to initial prototypes and proofs of concept, we’ve found that agentic AI shines. We can whip up demo-able software at speeds that would have been unthinkable just months ago.

However, the critical step is to then be disciplined in transforming these concepts into production-ready code. It’s all too easy to be seduced by how quickly the agent can build something that “works,” while overlooking the engineering rigor needed to make it work reliably. This is precisely where vibe coding falls short—and where agentic engineering principles are needed.

Feel the vibe, do the engineering

So there you have it. While I’ve got no problem with “vibe coding” on personal projects, I’m not one for bringing that mentality to commercial software. Generative AI and agentic engineering are still in their infancy, and sloppy practices now could hold back their tremendous potential. We need clear guidelines (think along the lines of an agile manifesto or Bob Martin’s “Clean Code”) to steer us in the right direction. That’s why it’s frustrating when someone like Andrej Karpathy coins the term “vibe coding.”

Admittedly, “discipline” might not sound as fun as “vibe,” but it’s been our secret to producing quality commercial software at least five times faster than traditional methods. It also ensures that anyone on the team, including AI, can easily understand, test, and extend or change the code when needed.

I’m well aware that today is the worst these agents and models are going to be—the pace of innovation is truly impressive. However, we must deploy these technologies professionally, fully aware of both their capabilities and limitations, if we’re to see successful adoption in the commercial sector.

So next time you go to work, leave your vibes at the door and let “agentic engineering” drive innovation.
