Markus Oberlehner

AI-Enhanced Development: Building Successful Applications with the Support of LLMs


The ChatGPT-induced AI boom has led to a proliferation of AI coding tools, some more useful than others. One category of tools that caught my attention is AI-assisted test generation. However, many AI coding assistants have gotten it backwards: they focus on generating tests from code rather than generating code from tests.

People who expect superhuman capabilities from AI will almost inevitably be disappointed (at least as of writing this on May 27th, 2024). Yet there is hope for AI-assisted development workflows. Fortunately for us, LLMs share many human flaws, so the techniques and best practices we’ve developed to work around our own limitations apply to AI-assisted development as well.

Let’s explore how we can adapt best practices like Test-Driven Development (TDD) and writing user stories and acceptance criteria to the post-ChatGPT era.

LLMs Don’t Know What They Don’t Know (And Neither Do Humans)

When talking about the limitations of state-of-the-art LLMs, the topic of “hallucinations” often comes up. Hallucinations are instances where the AI generates information that seems plausible but is not grounded in reality. But is this really so different from human behavior?

Consider the classic ‘lost in the mall’ experiment. In this study, researchers were able to implant false memories in participants by suggesting that they had been lost in a shopping mall as a child, even though this event never actually occurred. Over time, many participants began to “remember” the false event vividly and even added details that the researchers never mentioned.

And there are more flaws LLMs and humans share, such as the tendency to overemphasize what’s most salient in our minds. When humans see code that needs tests, we happily produce tests focusing on what we can see: implementation details. LLMs are no different; they, too, will jump on the code in front of them and generate tests that focus on the code’s inner workings instead of its observable behavior.

Why Writing Tests After Writing Code Is a Bad Idea

But why is this even a bad thing? Writing tests after writing code leads to a focus on implementation details rather than on the actual behavior that matters to the end user.

Imagine you have a piece of software with 100% code coverage. But most of your tests focus on the internal workings of your functions and components. When you decide to refactor a particular function, you might find yourself in a situation where dozens of tests break, even though the end-user experience remains unchanged.

When we tie our tests too closely to the implementation, they become brittle and break when we want to refactor the code. This brittleness discourages us from making necessary changes and improvements to our codebase, ultimately leading to technical debt and reducing the overall quality of the applications we build.
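
To make the difference concrete, here is a small sketch. The Cart class and its tests are made up purely for illustration, and I’m assuming Vitest as the test runner (the Jest API reads the same). The first test is welded to an implementation detail, while the second only checks behavior the user cares about:

import { expect, it } from "vitest"; // assuming Vitest; the Jest API reads the same

// A tiny example class, made up purely for illustration.
class Cart {
  private items: string[] = [];
  add(item: string) {
    this.items.push(item);
  }
  contains(item: string) {
    return this.items.includes(item);
  }
}

// Brittle: reaches into the private `items` array, so a refactor
// (say, switching to a Set) breaks it even though behavior is unchanged.
it("stores added items in an array", () => {
  const cart = new Cart();
  cart.add("Bread");
  expect((cart as any).items).toEqual(["Bread"]);
});

// Robust: only checks observable behavior and survives the same refactor.
it("knows which items have been added", () => {
  const cart = new Cart();
  cart.add("Bread");
  expect(cart.contains("Bread")).toBe(true);
});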

Best Practices in the Age of AI

If we want to keep building high-quality applications in the age of AI, it’s more important than ever to adhere to the best practices that ensure our software’s quality, reliability, and adaptability. Let’s look at how some of these established practices help us now and become even more relevant once we let AI take over.

First, Write User Stories

After we’ve explored opportunities and done initial research, we want to document our findings in the form of user stories: our chance to provide context for what we do. Context helps humans and LLMs understand the “why” and prevents us from filling the gaps in our knowledge with incorrect assumptions. These narratives nudge us to think clearly about what outcomes we want to achieve for our users. Such clarity is vital for humans and LLMs alike because it reduces the likelihood of misinterpretation and errors later, when we start writing the actual code.

As a Grocery Shopper  
I want to keep a list of groceries I need  
so that I have them ready for my next shopping trip.

After we’ve created a clear user story, the next step is to define our acceptance criteria. Acceptance criteria outline the conditions under which we consider a user story complete, providing even more context for both humans and AI.

- It should be possible to add items to the shopping list.
- It should be possible to remove items from the shopping list.

At this phase, humans remain in charge—at least for now. Yet AI can assist in generating more detailed acceptance criteria or suggesting edge cases.
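
Acceptance criteria also translate almost mechanically into pending test stubs, which is a lightweight way to hand them over to the next phase. A minimal sketch, assuming Vitest (Jest provides the same it.todo helper):

import { it } from "vitest"; // assuming Vitest; Jest also provides it.todo

// Each acceptance criterion becomes a pending test that gets filled in next.
it.todo("should be possible to add items to the shopping list");
it.todo("should be possible to remove items from the shopping list");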

Now, we have enough context to move on to writing tests.

Next, Write Tests

Our user stories and acceptance criteria provide the necessary context for writing meaningful tests that focus on the desired outcomes.

At this step, AI can take over coding for us. Based on the user story and acceptance criteria from above, current LLMs are more than capable of writing tests like the one you can see here:

it("should be possible to add items to the list", async () => {
  await shoppingList.open();
  await shoppingList.addItem("Bread");
  await shoppingList.expectItem("Bread");
});

Of course, we must also provide our coding assistant with relevant context about how we expect it to write tests, the rest of our codebase, and the overall environment.
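
Note that the test above talks to the application exclusively through a small shoppingList driver instead of poking at implementation details. What that driver looks like depends on your stack; here is one possible sketch, assuming a Playwright-based end-to-end setup with made-up labels (“New item”, “Add”):

import { expect, type Page } from "@playwright/test";

// Hypothetical page-object-style driver behind the shoppingList helper used above.
export const createShoppingList = (page: Page) => ({
  async open() {
    await page.goto("/");
  },
  async addItem(item: string) {
    await page.getByLabel("New item").fill(item);
    await page.getByRole("button", { name: "Add" }).click();
  },
  async expectItem(item: string) {
    await expect(page.getByRole("list")).toContainText(item);
  },
});

Because all knowledge about the UI lives in this one place, a markup refactor only requires touching the driver, not the tests themselves.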

After letting the AI do the busy work, it’s time for humans to step in again. We need to validate the tests and ensure they meet all acceptance criteria. At this point, we might make some adaptations, add new tests or new acceptance criteria, and remove redundant or superfluous tests.

Ideally, our coding assistant learns from our human interventions and doesn’t make the same mistake twice. Doesn’t this sound a lot like the description of a perfect human junior developer?

Last, Write Code

As soon as we’re certain that the tests are sound and that passing them proves we’ve achieved our desired outcomes, it’s finally time to write code. Again, this is where AI can shine.

It is only because the AI now has all the context about our intent that it can act mostly autonomously with minimal human oversight. In this phase, the AI writes unit tests and code as it sees fit. Once it’s done (meaning all the previously written tests pass), it’s time for the next human intervention: code review.
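
What exactly the AI produces in this phase depends entirely on the codebase. As a stand-in, here is a minimal, framework-agnostic sketch of the kind of model that could sit behind the shopping list UI and satisfy the tests above; the class and method names are made up:

// Hypothetical minimal model behind the shopping list UI.
export class ShoppingList {
  private items: string[] = [];

  addItem(item: string): void {
    this.items.push(item);
  }

  removeItem(item: string): void {
    this.items = this.items.filter((existingItem) => existingItem !== item);
  }

  getItems(): readonly string[] {
    return this.items;
  }
}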

However, one crucial missing piece is still necessary to make this work.

Fast Feedback Loops

Imagine we don’t have any automated tests in place: the AI would take several minutes to hours to write code, and then we’d have to step in and manually check that everything works. If it doesn’t, we need to tell the LLM precisely what’s broken and wait several more minutes for the next result.

As the novelty of watching the AI do its thing wears off, nobody wants to work this way anymore. When humans work without practicing TDD, we’re at least busy writing code between the tedious manual testing phases.

Thanks to automated tests, we can drastically shorten feedback loops. Although shorter feedback loops are also a key metric for success in traditional development workflows, they’ll become essential with the rise of AI. Only with fast feedback through automated tests can we let AI do its magic fully autonomously.

CI/CD pipelines tend to take many minutes or even hours to complete. If that’s our only way to ensure the code works correctly, we must act. Our automated AI coding assistant needs to be able to run only the most relevant tests while it’s iterating on the code and gather feedback in seconds rather than minutes.
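
In practice, most test runners can already scope a run to the files affected by a change, so the assistant’s inner loop doesn’t have to wait for the full suite. Here is a rough sketch of such a feedback script; it assumes Jest’s --findRelatedTests flag (Vitest ships a comparable related command), and the file name is hypothetical:

// feedback.ts - a hypothetical helper the coding assistant calls after each edit.
import { execSync } from "node:child_process";

// The assistant passes in the source files it just touched.
const changedFiles = process.argv.slice(2);

// Run only the tests covering those files; --bail stops at the first failure
// to keep the feedback loop as short as possible.
execSync(`npx jest --bail --findRelatedTests ${changedFiles.join(" ")}`, {
  stdio: "inherit",
});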

So, we’ve probably got some work to do. But the beauty of this work is that it pays off even before we have the tools to fully automate our coding process with AI: humans, too, massively benefit from faster feedback loops.

Not Humans After All?

In this article, I touched on some surprising similarities between current LLMs and how human brains work.

But although we humans, too, suffer from hallucinations, LLMs seem more prone to them and almost incapable of knowing that they don’t know something. However, there is a more positive difference worth pointing out: near-perfect retrieval of information across vast amounts of context.

While humans struggle to remember something we just read in a one-page news article five minutes ago, the most recent LLMs can “remember” and retrieve information from data spanning multiple books.

I don’t know yet how this might affect the code produced by LLMs, but I can imagine that it might challenge established best practices. For example, why write DRY code when labor is cheap and the AI has perfect knowledge of every instance where the code is repeated?

Wrapping It Up

It turns out humans and LLMs are not so different after all. Although some of the flaws we share might seem like deal breakers initially, I argue they are not. On the bright side, we have plenty of experience dealing with our insufficiencies.

Just like today’s best practices suggest, AI-supported workflows should start by documenting desired outcomes in the form of user stories and acceptance criteria, then writing tests, and only then writing code.

That way, we can ensure all stakeholders, including our AI companions, have the necessary context to build successful applications.