I recently vibe coded a dozen apps to check first-hand what the state of LLM-based coding looks like these days. It was honestly quite underwhelming. The best analogy I can come up with is a possessed kid projectile vomiting code at me. Like a broken clock that is right twice a day, you get hints of help, followed soon after by frustration. The whole process was, without a doubt, the fastest way I have found to build technical debt.

Not the usual app building: what I tried

I used the three most popular commercial LLMs (ChatGPT, Gemini and Claude) to vibe code personal, educational and work-related things. Here are the dozen ideas I tried:

  • Personal: news sentiment analyser, MIDI-based music composer, visual traffic jam detector, ecosystem simulator (something I coded in C# when I was 14)
  • Educational: mostly projects I supervised at Imperial: a 3D snooker reconstructor, a webcam-based heart-rate detector, first-order-logic-based LLM evaluation, and neuro-symbolic classification of birds
  • Research / work: a mixture of research at Imperial and work at Semafind: featureless tracking of insects in high-resolution, high-speed camera footage, neuro-symbolic argumentation for LLM-based agents, IoT sensor data time-series prediction and modelling, and finally improving the speed of SemaDB using ARM SIMD instruction sets in Go.

I admit these are perhaps not your usual “build me a calculator app” vibe coding situations. Most of them don’t have a known, ready-made solution and require some thought about how to approach them. But I was sold “PhD level” assistants, so my expectations were high. I didn’t choose these topics to purposefully challenge the LLMs or the vibe coding process, but because this is what coding mostly looks like for me.

For each topic, I gave a description of the problem and the requirements, and asked the tools to build / vibe code an appropriate solution.

The Bad, The Ugly and The Fugly

The start for all of these, with a single prompt, looks really promising. You see all this code being generated, and every tool I used has spinners. They make the whole process look like “we’re getting it done, don’t worry, you’ll have your cake and eat it in no time”. Well, that rarely happened. The common denominator was more along the lines of:

  1. Diminishing returns: It starts with a grand skeleton of the application and spits out a working, or almost working, version. What I mean is that the LLM usually creates something that looks convincing. But almost always, the difficult part, the core algorithm or the key challenge, is hallucinated away as an existing function: “I’ll just import this magic track_insect function” to solve the problem you’re asking about (see the sketch after this list). The problem was tracking insects; I don’t really care about the plumbing around it. I also noticed that the more you interact with the LLM to fix or update the code, the worse it gets, until at some point it usually stops running at all. The returns, if any, diminish very quickly.
  2. Biased towards popular libraries: All the web code generated had the same React and Tailwind CSS injected. Any computer-science-related code was in Python. I don’t think this is surprising at all if the goal of the LLM is to maximise the likelihood of the data it has seen: there is simply more Python, JavaScript and popular-library code out there. What is frustrating is that you can’t steer it away and say “here is the problem, I want you to use SIMD C code, and here is some documentation”, especially when you are being sold ludicrous subscriptions to “PhD level” assistants.
  3. Increased spaghetti code: Another interesting part was the tangled code all these tools produced. If a tool started going down a rabbit hole, it doubled down by building around the core it had already generated. I suspect this is a self-feeding problem, given the amount of spaghetti code in the data used to train these models in the first place. Whenever I asked for a refactor, they seemed to get feisty: throw in some abstract base classes, split things into many files, rename everything. These are all steps you might take to refactor, but perhaps not all at once, unless you want to satisfy your software engineering overlords. Or perhaps, whenever the word refactor appears, these are the things that follow it in the training data.
  4. Lack of documentation: You get a lot of code but little documentation. Sure, you can ask for documentation, but it just tells you what the code does, not really why. Maybe the LLMs don’t know why either, beyond it being statistically likely to be done this way. For example, in all my attempts they explained the code line by line: “iterate over the images to extract objects and check for cars”. Yeah, okay, I can see that you’re calling an object detector and extracting cars.
  5. Super friendly: Finally, this is my favourite. Even when the LLM’s implementation is correct, if I express the slightest doubt that it might be wrong, all of them, especially Gemini for some reason, go into an apologetic narrative and tarnish the correct bits by introducing plausible-looking “fixes” for mistakes that aren’t there. It’s sad to watch.
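
To make the first point concrete, here is a hypothetical sketch (written by me, not taken from any actual LLM output) of the kind of skeleton I kept getting back for the insect tracking task. The plumbing, reading frames in a loop with OpenCV, is all there and looks convincing, but the actual problem hides behind a track_insect call that in the real output was often a hallucinated import; here it is a placeholder stub so the sketch runs. The function name and video path are illustrative.

    # Hypothetical sketch of the pattern, not actual LLM output
    import cv2  # OpenCV handles the easy plumbing: grabbing frames

    def track_insect(frame):
        # The hard part of the whole project, waved away behind one call.
        # In the generated code this was often a hallucinated import;
        # here it is a placeholder so the sketch runs end to end.
        return None  # no tracking actually happens

    def main(video_path="insects.mp4"):  # illustrative path
        cap = cv2.VideoCapture(video_path)
        while cap.isOpened():
            ok, frame = cap.read()
            if not ok:
                break
            position = track_insect(frame)  # looks like progress, does nothing
            print(position)
        cap.release()

    if __name__ == "__main__":
        main()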

Be prepared: you must be the master

In conclusion, if you don’t know what you are doing, the LLMs don’t seem to be much wiser, and they will waste your time tomorrow if not today. My advice would be to do some research from reputable sources on what to code and how to code it, get an idea of the approach, and then maybe use vibe coding to get the initial structure. Question yourself, not the LLM, because the LLM will always apologise and serve you the soup cold while assuring you it definitely turned on the microwave and the soup should now be fixed and hot.