This post is a bit of an exception (or maybe the start of a new genre) for this blog. While my usual content is strictly limited to presenting facts, this article is more of an op-ed presenting my personal take on “AI1 assisted” coding.
While I do not consider myself a machine learning (ML) specialist, I have a strong background in the subject, and machine learning has been a big part of my work throughout my whole career. As such, I have kept up to date with research trends and developments,2 and I have always had a positive attitude towards, and enthusiasm for, progress in a field which, to some extent, I consider to be my own as well.
However, despite the innate enthusiasm that comes with working on the subject, I have always been cautious about drawing conclusions, and very interested in understanding the broader implications of such technology.
My point of view is fairly nuanced but, in a nutshell, I see some moderate benefits in the current generation of “AI” assisted coding tools, but also a swath of caveats. I disagree with the more luddite or catastrophist takes, and I see some limited value in adopting and using these tools intelligently, but at the same time I do not share the over-enthusiasm of many in the broader software community, and I am wary of the tone used to oversell this kind of solution.
My personal journey with LLM assisted coding
Always enthusiastic about new tools and about finding new ways to optimize my own workflow, I tried out GitHub Copilot as soon as the year-long technical preview was announced back in 2021.
Testing it for the first time, the whole concept and its behaviour were far from surprising to me, having followed the development of sequence models and their applications to natural language processing (NLP) for many years, but it was of course interesting to see a large model trained on a vast dataset in action, which is something you cannot really do yourself outside of the most resourceful research contexts.
At the time my expectations were down to earth, and my reception was unsurprisingly lukewarm. I was not particularly impressed with how little value it brought to my personal projects, so I decided it was not worth subscribing once the tech preview ended.
Nonetheless, I have been extremely interested in following not just the developments of AI assisted coding, but also its reception by the software engineering community and its impact on the profession.
Local models at work
Rather than my small-scale personal projects, a more interesting application (and experimentation ground) would be my job. There I work on a mid-sized monorepo (a few million SLoC) of fairly high-quality code, subject to very strict standards, in a complex application domain (signal processing) where we develop non-trivial algorithms structured in a non-trivial software architecture.
Of course, the first and foremost concern is confidentiality, so I could not just slap a personal Copilot subscription onto my work environment (which is not allowed by IT rules anyway, as in any sane organization).
Therefore, I started experimenting with local models that I could run on my developer machine, avoiding confidentiality issues and any possible leak of IP.
One of the upsides of working on signal processing is that even my work laptop is CUDA-capable, being equipped with an RTX 2000 Ada GPU that can handle inference of some reasonably sized LLMs. For this purpose I used Ollama to run llama3.2 and nomic-embed-text (for embeddings).
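For the curious, querying such a local model programmatically is straightforward. Below is a minimal sketch in C++ (assuming a default Ollama installation listening on port 11434, using libcurl for the HTTP call, and handling the JSON as raw strings purely for brevity; a real tool would use a proper JSON library):
#include <curl/curl.h>

#include <cstddef>
#include <iostream>
#include <string>

namespace
{
// libcurl write callback: append each received chunk to a std::string.
std::size_t appendChunk(char* data, std::size_t size, std::size_t count, void* userData)
{
    static_cast<std::string*>(userData)->append(data, size * count);
    return size * count;
}
} // namespace

int main()
{
    // Request body for Ollama's /api/generate endpoint; "stream": false asks
    // for a single JSON object instead of a stream of partial responses.
    const std::string body =
        R"({"model": "llama3.2", "prompt": "Explain RAII in one sentence.", "stream": false})";

    CURL* curl = curl_easy_init();
    if (curl == nullptr) {
        return 1;
    }

    std::string response;
    curl_slist* headers = curl_slist_append(nullptr, "Content-Type: application/json");

    curl_easy_setopt(curl, CURLOPT_URL, "http://localhost:11434/api/generate");
    curl_easy_setopt(curl, CURLOPT_HTTPHEADER, headers);
    curl_easy_setopt(curl, CURLOPT_POSTFIELDS, body.c_str());
    curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, appendChunk);
    curl_easy_setopt(curl, CURLOPT_WRITEDATA, &response);

    const CURLcode result = curl_easy_perform(curl);
    curl_slist_free_all(headers);
    curl_easy_cleanup(curl);

    if (result != CURLE_OK) {
        std::cerr << "Request failed: " << curl_easy_strerror(result) << '\n';
        return 1;
    }

    // The reply is a JSON object whose "response" field holds the generated text.
    std::cout << response << '\n';
    return 0;
}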
This was fascinating, but the limited capability of these models made it clear that this was not going to be a particularly useful tool in my workflow, let alone a game-changing one.
Copilot Enterprise
Shortly afterwards, however, my employer purchased a Copilot Enterprise plan and dished out licenses to us developers. It felt like an exciting development, and I immediately jumped in, not just as a user.
I reached out to the manager owning Copilot in our company to propose a plan to develop training material and make sure developers can properly take advantage of the tool while being aware of its pitfalls. I also brought the topic up for discussion in our C++ steering board and community of practice, of which I am an active member, to propose the development of custom instructions to align Copilot with our internal code standards and make the tool more useful and productive.
Understanding the limitations
Copilot (like similar models) can surely impress new users with its ability to generate snippets of code from context in ways not seen before. However, on closer examination, I see it as being good at mainly two things:
- Mimicking its context. This promotes repetition and duplication, which in virtually all serious projects will have a negative effect on maintainability, and which should be addressed not by parroting code but through code reuse, reflection, or code generation (as sketched in the example after this list).
- Suggesting things learned from its training. This is potentially more interesting, but comes with a lot of dangers and pitfalls. The most obvious issue is the risk of incorporating into our codebase regurgitations of copyrighted training material that falls under an incompatible license (or that is flat-out not allowed to be reproduced at all). Even setting aside legal issues, these models tend to be reliable only for very simple things, for which they are not necessarily the best tool anyway. When it comes to anything nontrivial, they tend to fall flat pretty fast.
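To make the first point concrete, here is a small illustrative sketch (with hypothetical names, and using a plain template for brevity where reflection or code generation would serve heavier cases): the kind of near-duplicate code an assistant happily multiplies by mimicry, next to the generic alternative that removes the duplication at its root.
#include <vector>

// Hypothetical record with several numeric fields.
struct Sample
{
    double temperature;
    double pressure;
    double humidity;
};

// What context-mimicking suggestions tend to encourage: one near-identical
// function per field, each a copy of the previous one.
//   double meanTemperature(const std::vector<Sample>& samples);
//   double meanPressure(const std::vector<Sample>& samples);
//   double meanHumidity(const std::vector<Sample>& samples);

// What actually removes the duplication: one generic routine taking a
// projection, so adding a field does not mean adding another copy.
template <typename Projection>
double mean(const std::vector<Sample>& samples, Projection project)
{
    double sum = 0.0;
    for (const Sample& sample : samples) {
        sum += project(sample);
    }
    return samples.empty() ? 0.0 : sum / static_cast<double>(samples.size());
}

// Usage: mean(samples, [](const Sample& sample) -> double { return sample.pressure; });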
All of this is enough for me to put serious weight on the cons, and that is before we even factor in the problem of hallucinations and lack of semantic understanding.
When prompt-engineering with the most recent Copilot model to date, under our enterprise plan, I was surprised by how easily the model got confused by very simple things, and by how little understanding of the language it possessed overall.
Just as an example, I added a line to the custom instructions file instructing the agent to always specify a trailing return type when defining a C++ lambda.3 This caused it to start adding trailing return types to every function declaration (including free and member functions):
void myFunction() -> void
{
// ...
}
This is obviously not valid syntax, and a clear tell of how little syntactic understanding these models have (let alone semantic understanding). The surprising part is that I had been careful to craft a specific, clear, yet still succinct prompt. The only solution turned out to be explicitly specifying in the prompt that the instruction did not apply to free and member functions.
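For reference, the behaviour the instruction was meant to elicit is narrow and simple: the trailing return type belongs on lambdas (where the return type would otherwise be deduced, which the AUTOSAR rule forbids), not on ordinary function declarations. Roughly along these lines (illustrative names):
// Lambdas get an explicit trailing return type instead of deduction...
const auto square = [](double value) -> double { return value * value; };

// ...while free and member functions keep their usual declaration form.
double cube(double value)
{
    return value * value * value;
}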
Speaking of semantic understanding, there the problems are even graver. Again for the sake of example, I will recount one of the first silly moments I witnessed when trying Copilot at work. I was implementing some sorting routines, a particularly popular and mainstream topic covered everywhere from basic algorithms 101 undergraduate courses all the way up to large amounts of academic research, industrial practice, and a plethora of production libraries. These models have plenty of training experience with it, making it a very “easy” scenario for them.
I had a file where I had implemented a pretty standard hybrid heapsort routine:
namespace detail
{
template <typename Iterator, typename Comparator>
constexpr void doInsertionSort(Iterator begin, Iterator end, Comparator comparator) { /* ... */ }
template <typename Iterator, typename Comparator>
constexpr void doHeapSort(Iterator begin, Iterator end, Comparator comparator) { /* ... */ }
} // namespace detail
/// Lengthy documentation here...
///
template <typename Iterator, typename Comparator>
constexpr void sort(Iterator begin, Iterator end, Comparator comparator)
{
if (distance(begin, end) < THRESHOLD) {
detail::doInsertionSort(begin, end, comparator);
} else {
detail::doHeapSort(begin, end, comparator);
}
}
I wrote a reasonable prompt asking the model to implement a function performing quicksort, which led to the following hilarious suggestion:
/// Blurb of lengthy comment here...
///
template <typename Iterator, typename Comparator>
constexpr void quicksort(Iterator begin, Iterator end, Comparator comparator)
{
if (distance(begin, end) < THRESHOLD) {
detail::doInsertionSort(begin, end, comparator);
} else {
// Perform quicksort
detail::doHeapSort(begin, end, comparator);
}
}
The suggested snippet implements a “quicksort” function that actually uses heapsort under the hood, which obviously makes no sense whatsoever. Note the priceless comment stating Perform quicksort right before calling a heapsort routine. This says it all about the level of semantic understanding we are talking about. In all of this, the model obviously put a lot of effort into writing a lengthy Doxygen comment and into dutifully mimicking the structure I used in my own code, while completely failing at the core essence of the problem.
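For contrast, what was actually being asked for is not complicated. A plausible sketch (not necessarily what I ended up writing) reuses the same detail helpers and THRESHOLD from the snippets above and applies a textbook partition-based formulation:
#include <algorithm> // std::partition
#include <iterator>  // std::distance, std::next

template <typename Iterator, typename Comparator>
constexpr void quicksort(Iterator begin, Iterator end, Comparator comparator)
{
    if (std::distance(begin, end) < THRESHOLD) {
        detail::doInsertionSort(begin, end, comparator);
        return;
    }

    // Pick the middle element as pivot, partition the range around it, then
    // recurse on the two strict halves (elements equal to the pivot stay put).
    const auto pivot = *std::next(begin, std::distance(begin, end) / 2);
    const Iterator middle1 = std::partition(begin, end,
        [&](const auto& element) -> bool { return comparator(element, pivot); });
    const Iterator middle2 = std::partition(middle1, end,
        [&](const auto& element) -> bool { return !comparator(pivot, element); });

    quicksort(begin, middle1, comparator);
    quicksort(middle2, end, comparator);
}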
When explaining this to people in layman’s terms, what I usually say is this: these models do not have understanding of the language in any form that we might label as “understanding” in human terms (or in the terms of most other artificial tools, such as language servers or static analysis tools); they are simply trained to produce text that looks plausible, and that is pretty much all they do.
Not everything is bad
Not everything is bad, of course, and focusing on the downsides and shortcomings should not make us forget the upsides of these LLM-based tools.
They can still be pretty good at simple things; for instance, they are more likely to be right when suggesting completions for very simple snippets. Personally I still see little value in that, because in most cases those same simple completions can be provided much faster and more reliably by a good language server.4 We have also already discussed how they can be good at generating suggestions by mimicking pre-existing patterns in the surrounding context. This too is not particularly valuable to me, as those are low-value suggestions that would be better handled by code reuse, reflection, or code generation.
What I personally find more appealing is their ability to do some refactoring tasks that would be annoying to do by hand, but are simple enough to be understood by the LLM. As a practical example from work, I was writing a piece of tooling code using OpenCV, and I wanted to rewrite some snippets of nested loops
for (std::int32_t y{}; y < image.rows; ++y) {
for (std::int32_t x{}; x < image.cols; ++x) {
// something interesting ...
}
}
into Mat::forEach() calls:
image.forEach<Pixel>([](Pixel& pixel, const std::int32_t position[]) -> void {
// something interesting ...
});
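For context, a self-contained version of the target form looks roughly like the following (the pixel type and the operation are assumptions here: Pixel is taken to alias cv::Vec3b, and the body simply inverts each pixel; in OpenCV’s convention position[0] is the row index and position[1] the column index):
#include <opencv2/core.hpp>

#include <cstdint>

using Pixel = cv::Vec3b; // assumed alias for a 3-channel 8-bit pixel

void invertImage(cv::Mat& image)
{
    // Mat::forEach visits every pixel, potentially in parallel, so the body
    // must not rely on any particular traversal order.
    image.forEach<Pixel>([](Pixel& pixel, const std::int32_t position[]) -> void {
        static_cast<void>(position); // position[0] = row, position[1] = column
        pixel = Pixel(255 - pixel[0], 255 - pixel[1], 255 - pixel[2]);
    });
}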
The model handles these transformations fairly well. In this case it is true that I save time by avoiding tedious handwork, but there is still a downside: I need to review the model’s output to make sure it does what I want. And reviewing someone else’s (or something else’s) code is generally effort-intensive, so it does not come for free.
Another useful application of these models is assisted “brainstorming”, where polling a model or discussing with an LLM chatbot can be a way to gather or refine ideas. But arguably this happens at a higher level than implementation, and goes against the illusion of a tool that spits out ready-made production code for you.
Skill factor
One aspect that surely intrigues me is the amount of praise received by LLM-based coding assistants and tools. Part of it comes from the suppliers of said tools, which should of course be taken with a grain of salt, but a big part also comes from the software development community as a whole, as can easily be seen on the web and on social media. I have seen countless people bragging about how these tools have made them many times more productive and saved them countless hours of work.
This clashes with my practical experience, where the concrete usefulness of these tools is far more limited. And not just my own experience: I get the same feedback from coworkers and close friends at my skill level. The impression is further confirmed by Copilot usage statistics in my company, which show that most users have merely toyed with Copilot, and only a fraction of developers actually use it for real in their daily work.
I spent quite some time reflecting on this dichotomy and trying to understand it better. My take is that these tools provide significantly more value in two cases: in contexts where software quality and maintainability are less of a concern, and to junior or less skilled developers.
On the first point, the world of software development encompasses many different quality levels. I belong to a field where high quality is non-negotiable, and maintainability is a necessity. But I should not forget that there are many, many contexts where writing sub-optimal and poorly maintainable code is a viable business, all the way down to write-only code factories. I can see how AI automation can help there.5
On the second point, for simple tasks, other tools (such as regular expressions, a good language server, and good command of any decent IDE or text editor) are typically faster and more convenient than an LLM in the hands of a power user, which means LLMs have a more limited window of utility at my skill level. But I understand how they can feel so much more powerful and valuable to junior and inexperienced programmers, as we are reaching the point where these tools can effectively replace an unskilled coder.
A core issue I see is how often, unfortunately, people confuse programming/coding with software engineering/development. As any experienced software developer well knows, coding is the simplest part of product development, and by far. The real challenges are streamlining the requirements analysis process, communicating effectively (both internally and externally), managing data, structuring and maintaining an efficient and effective organization, designing a robust and maintainable architecture, and developing efficient concepts and algorithms. Translating all of that into code is a comparatively small step.
When someone tells me that an LLM can save them 90% of their work time, the only thing that tells me is that they spend 90% of their work time coding, not engineering, and probably not even on the most challenging parts of the coding process itself.
As a concrete example, I have seen several posts and articles arguing how great LLMs are for their authors because they can, for instance, automate the generation of Doxygen comments like this:
/// Open a socket
/// @param[in] address The address
/// @param[in] port The port
/// @return The operation result.
OperationResult openSocket(std::string address, std::uint32_t const port);
Now, the problem I see here is that documentation like that is not just completely useless, as it adds no value beyond what default-generated Doxygen would already convey if the function were left uncommented; it is also actively harmful, as it adds the maintenance cost of keeping such useless comments up to date as the code evolves. In my team I have a clear rule: Doxygen written purely for its own sake, on a self-documenting function signature, is actively discouraged.
If someone comes and tells me that automating tasks of this kind saves 90% of their time, my conclusion is that they spend 90% of their time doing things that are useless to begin with.
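By way of contrast, a comment earns its maintenance cost only when it records something the signature cannot express. Something along these lines (a purely hypothetical illustration reusing the same signature) would at least be worth keeping up to date:
/// Opens a TCP connection to the given endpoint.
///
/// Blocks for at most the globally configured connection timeout and never
/// retries on its own. On failure no partially opened handle is leaked, and
/// the returned result carries the failure reason (host unreachable,
/// connection refused, ...) so that callers can decide whether to retry.
OperationResult openSocket(std::string address, std::uint32_t const port);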
Zeitgeist
It is really interesting for me to reflect upon the direction of the industry as a whole. It is undeniable that AI assisted tools are a hot topic and are having an impact on the software industry that is impossible to ignore.
Companies producing and selling AI tools are making big promises, gathering large investments, and proposing big infrastructure projects (such as power-grid ramp-ups to keep up with the energy-hungry datacenters needed to train such models and run inference on them). Some companies are promising an imminent advent of artificial general intelligence that would open the door to automating vast portions of office work.
The truth is that we barely have workable definitions of intelligence, and sometimes the terminology in this context escapes definition altogether.6 Expected (or perhaps wished-for) timelines are also often presented in fuzzy and intentionally ambiguous ways. It does not help that facts are often blown out of proportion, in ways that can easily confuse a non-technical audience (or investors).7 The high energy consumption of model training and inference, with its environmental costs and related societal impact, is also hard not to take into consideration.
Some people are buying into these promises and either envision revolutionary societal developments that require a great deal of optimism, or foresee doomsday scenarios that call for a luddite reaction. Others are much more skeptical, ranging from mild disillusionment to clear-cut arguments that we are in a bubble that may only have a short time left before bursting.
Many promises around AI are likely to be far-fetched, as we might be reaching a performance plateau and there is no obvious evidence that LLMs will necessarily scale to human-level intelligence, let alone super-intelligence. Despite my technical insight into the subject, I do not want to argue for either side of this conversation.
When hearing forecasts and personal views, I take them with a generous amount of salt, even when they come from “prominent figures”. I trust their scientific work because I trust the methodology, but I do not blindly trust anything based purely on their personal opinion.8 Not that all forecasts and opinions come from prominent or even just experienced figures: most of the posts and discussions on social media and blogging platforms come from people with limited, if not minimal, technical and scientific knowledge. Knowing how to use Gemini’s API or how to plug off-the-shelf models into your scripts does not make you a machine learning expert.
However, regardless of whether and how the bubble bursts, it is very likely that some of the changes brought by LLMs are here to stay, even though their extent will surely be up for debate. And focusing again on the software development process and industry, it is likely that these tools will boost productivity to some degree.
An interesting question is what the short-term impact on the software engineering landscape will be. Regardless of whether the promises of massive automation are delivered or not, their mere existence is affecting investments, rewarding downsizing, and inhibiting the growth of engineering teams. And while senior and more experienced software engineering roles are unlikely to be made redundant at scale any time soon, junior roles (exactly the ones most enthusiastic about these new tools) are the most directly threatened in the short term, and with them the capability of growing and developing engineering teams in the long run.
Footnotes
-
Technically speaking, the expression “artificial intelligence” (AI) has a broad meaning, and many problems and sub-fields of computer science traditionally fall under it, well beyond large language models, deep learning, or even machine learning altogether. The expression “AI” has lately been “rebranded” in mainstream conversation to refer to LLMs and their applications.
Here, for consistency with mainstream conversations, I use the expression in that sense even though, if I wanted to adopt technically correct terminology, I would rather call this topic “LLM assisted coding”. ↩
-
My own work has mostly been related to image processing, computer vision, and adjacent topics, but a lot of the scientific background is shared, and concepts, algorithms, and tools are continuously borrowed across adjacent fields. Methods developed for natural language processing have routinely been adopted and adapted for computer vision, and vice versa. ↩
-
This is not just my personal preference: there is an AUTOSAR rule that puts a blanket ban on return type deduction, and compliance with it is not up for discussion. ↩
-
In my toolbox, a good language server is still orders of magnitude more useful and valuable than any LLM so far. ↩
-
A few weeks ago I stumbled upon a Reddit thread where some participants were arguing about how terrible Gerrit is compared to GitHub. Being a longtime Gerrit user myself, I had a hard time understanding their reasoning, until I realized what the underlying problem was: most participants in that thread had no concept of code review at all.
In that light, I can see why merging a change through Gerrit might look unnecessarily cumbersome to them compared to GitHub’s pull request model: they were completely ignoring the code review process itself, which is the whole point of using a tool like Gerrit, whereas code review is a totally optional (and rather late) addition to GitHub’s workflow (to the point of often feeling like an afterthought).
Reading that conversation hit me like a cold shower at first, but it was really just a reminder of how diverse the levels of quality in the programming community are, and of how large swaths of “programmers” or “software engineers” are not even familiar with practices commonly regarded as basic in any serious software engineering effort. ↩
-
For instance, as far as we know, OpenAI plans to declare artificial general intelligence as a reality upon reaching 100 billion USD in profits, which would be a weird definition of intelligence to say the least. ↩
-
A good example was Google’s CEO claiming that “today, more than a quarter of all new code at Google is generated by AI”, when the data behind the claim actually referred to the percentage of autocompletion characters accepted by developers. That paints a very different picture, and does not suggest that 25% of code is created without human insight or intervention, as “generated by AI” would make it sound. ↩
-
As a clear example, I cannot forget how in 2016 Hinton, amidst the hype around deep learning-based breakthroughs in computer vision and image understanding, memorably said “[We] should stop training radiologists now, it is just completely obvious that within five years deep learning is going to do better than radiologists”. In hindsight it is obvious how silly that sounds now, and even then it already sounded silly (at the time I was myself working in medical imaging research). And we are talking about the opinion of Geoffrey Hinton, the Turing- and Nobel-laureate “godfather” of deep learning.
There are probably broader analogies between the computer vision hype train of around 2016 (when self-driving cars were promised within a few years at most) and the natural language processing hype we are witnessing today. ↩