The Hidden Cost of AI: Tech Debt in the GenAI Era
Maintaining GenAI products comes with new challenges we're only just discovering
My team has been building internal tools with AI for a while now, and over the last couple of weeks we made an interesting discovery. It turns out that prompts written for one model (e.g., GPT-4o) don't work very well in another model, even from the same company (e.g., GPT-5).
This creates an unexpected source of tech debt,1 adding work every time a new model comes out. In this post, I'll talk about this problem and other tech debt that's unique to GenAI programs. I also have a digression into quantum computing that is mostly relevant to this topic.
Tech debt is a subject that executives hate thinking about. It sucks up developer time refactoring code, updating libraries, rebuilding data pipelines, etc., and once it's done, there's often very little to show for it. The product looks the same; it just doesn't crash anymore, or its security vulnerabilities are patched. Y2K remains the ultimate example of this.2 Executives would rather spend that valuable engineering capacity on building new things than on this unfortunately necessary maintenance.
So, even though this is a topic you probably don't want to read about, please spare a moment to think about it, because if you are building with LLMs, it will catch up with you soon.
Mo’ models, mo’ problems
Let’s start with the most obvious issue. What happens to your AI application when the model changes?
Every new model release (e.g., GPT-4o to GPT-5) brings improvements: better reasoning, fewer hallucinations, more nuanced understanding. GPT-5, for example, can now decide to "think" when it needs to. But each release also changes which prompt-writing techniques work best.
Prompts that once worked perfectly now underperform. For example, a bot we built to generate scoping questions now gives us step-by-step solutions instead of just the questions. GPT-5 is an especially tricky model: if the prompt accidentally triggers "thinking," it can take a long time to answer and eat up a lot of tokens.
The good news is that you can ask GenAI to rewrite the prompts for you. OpenAI even created a GPT-5 prompt optimizer to help make this easier, but you will still need to experiment and refine to get it working. Maybe some future super advanced model will be able to rewrite old prompts perfectly, but we are not there yet.
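If you want to catch these regressions before your users do, it helps to keep a small eval harness that runs your prompts against both the current model and the candidate upgrade. Below is a minimal sketch assuming the OpenAI Python SDK; the model names, the scoping prompt, and the "questions only" check are illustrative placeholders, not our actual tooling.

```python
# Minimal prompt-regression sketch: run the same prompt against two models
# and flag cases where the new model's output drifts from what we expect.
# Assumes the OpenAI Python SDK (`pip install openai`) and an OPENAI_API_KEY
# in the environment; model names and the check are illustrative only.
from openai import OpenAI

client = OpenAI()

SCOPING_PROMPT = (
    "You are helping scope a project. Ask clarifying questions only; "
    "do not propose solutions.\n\nRequest: {request}"
)

TEST_CASES = [
    "Build a dashboard that tracks weekly sales by region.",
    "Migrate our reporting pipeline to the new data warehouse.",
]

def looks_like_questions_only(text: str) -> bool:
    """Crude check for this use case: output should be questions, not a step-by-step plan."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    return bool(lines) and all(line.endswith("?") for line in lines)

def run(model: str, request: str) -> str:
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": SCOPING_PROMPT.format(request=request)}],
    )
    return resp.choices[0].message.content

for request in TEST_CASES:
    for model in ("gpt-4o", "gpt-5"):  # current model vs. candidate upgrade
        output = run(model, request)
        status = "OK" if looks_like_questions_only(output) else "REGRESSION"
        print(f"[{status}] {model}: {request}")
```

Even a crude check like this, run every time you swap models, turns "the bot feels off" into a concrete list of failing cases you can fix.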
You might think that if a program works, why bother upgrading to a newer model? There are two reasons:
Informed users will demand the latest model. If they see a product that uses GPT-4o, they will figure they might as well go to ChatGPT and use GPT-5 directly, because they'll get better results. This is even more true of LLMs like Claude and Llama that have more intuitive naming conventions; nobody wants to use 3.1 when 3.2 is out.
Over time, model providers stop supporting older versions. So even if you want to keep using the old model, at some point you'll be forced to migrate.
Given the pace of new releases, you should expect to do a model and prompt upgrade at least once a year for any LLM-based tool. So, if you are an investor or exec at a software company, make sure to bake this into your plans!
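One small habit that makes that annual upgrade cheaper is not scattering model names and prompts throughout the codebase. Here's a sketch of the idea, with illustrative names: keep the model and the prompt version that was tuned for it in one place, so an upgrade becomes a single reviewable change plus a re-run of your evals.

```python
# Sketch: keep the model and the prompt version tied to it in one config
# object, so a model upgrade is one reviewable change rather than a hunt
# through the codebase. Model names and version labels are illustrative.
from dataclasses import dataclass

@dataclass(frozen=True)
class ModelConfig:
    model: str
    prompt_version: str  # prompts get re-tuned per model, so version them together

CURRENT = ModelConfig(model="gpt-4o", prompt_version="scoping-v3")
# Upgrading later means changing one line (and re-running the eval suite):
# CURRENT = ModelConfig(model="gpt-5", prompt_version="scoping-v4")
```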
Fine-Tuning Forever
A more obvious form of tech debt comes from fine-tuning a model. This is when you take an existing model and train it further on labelled proprietary data to create a new custom model. It's mostly done in highly technical industries where there is a lot of relevant data that isn't in the base model.
The challenge with this approach is that when a new model comes out, you need to fine-tune the new model to take advantage of its performance benefits. As a simple example, imagine you fine-tune Llama 3.1, which costs $100K and takes three months. Then Llama 3.2 comes out, so now you need to spend another $100K and three months fine-tuning 3.2. With fine-tuning, you're always a model behind the state of the art because of the training time and cost.
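To make the recurring cost concrete, here is the back-of-envelope arithmetic using the illustrative numbers above; the release cadence of two major base-model versions a year is an assumption, so plug in whatever cadence you actually see.

```python
# Back-of-envelope: recurring cost of re-fine-tuning on every base-model release,
# using the illustrative numbers from the example above ($100K and 3 months per run)
# and an assumed cadence of two major base-model releases per year.
cost_per_finetune_usd = 100_000
months_per_finetune = 3
releases_per_year = 2  # assumption; adjust to the cadence you actually see

annual_cost = cost_per_finetune_usd * releases_per_year
annual_months = months_per_finetune * releases_per_year

print(f"Annual fine-tuning spend: ${annual_cost:,}")      # $200,000
print(f"Calendar months per year spent catching up: {annual_months}")  # 6
```

And during each of those catch-up periods, you're serving users the previous generation of model.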
This dynamic is the main reason why I don’t recommend fine-tuning as an approach unless there’s really a big difference in performance. The maintenance is just too painful.
Sidebar: Post-Quantum Encryption
Quantum computing is not directly related to AI, but it’s a topic that I imagine many of you are thinking about. I talked about QC and AI a few months ago here.
Quantum computing raises an important security concern: a sufficiently powerful quantum computer could factor large numbers very quickly (via Shor's algorithm), which would break much of the public-key encryption in use today. Machines that capable are still at least a decade away. But the concern right now is "harvest now, decrypt later": a bad actor could steal encrypted data, store it until quantum computers arrive, and then decrypt it. Obviously, most data aren't valuable in 10 years, but in some sectors, like healthcare and financial services, this is a real concern.
This is a classic Y2K-style problem: if companies update their security, nobody will ever pay attention to the issue, and if they don't, there will be horror stories of people's sensitive data leaking.
The good news is that 18% of Fortune 500 companies have already made this upgrade, so the transition is happening. It's another good example, though, of how new technology creates a lot of unsexy maintenance work.
The Bottom Line
Hopefully, I've convinced you that GenAI applications come with their own maintenance challenges. There are a few implications of this:
Be especially ruthless about killing GenAI applications that are obsolete. If you built an internal tool and ChatGPT can now do that activity, just send people to ChatGPT.
Budget some amount of engineering bandwidth for upgrading existing products. Note that there will be a bottleneck: the moment a new model comes out, you will simultaneously need to upgrade existing products and build the brand-new products that the new model makes possible.
There are also some other issues for GenAI apps that I won't go deep on for brevity but that can cause serious problems: changes in regulation, changes in the underlying data the model uses (although GenAI is probably more robust to this than traditional software), changes in the APIs, and changes in the underlying tools (e.g., LangChain).
Hopefully, the main takeaway here is that GenAI apps won’t magically upgrade when new models come out, and you should set expectations properly with your executive, engineering and product teams.
Have you seen things break when you changed models? Please drop me a note or leave a comment.
Note: The opinions expressed in this article are my own and do not represent the views of Bain & Company.
Tech debt refers to issues that accumulate in software and eventually need to be fixed to keep the software running properly. Often tech debt comes from cutting corners to get products launched more quickly or from deferring necessary maintenance.
For those of you too young to remember, Y2K refers to an issue from the 1990s. Software written in the 70s and early 80s stored the year as two digits. Engineers realized that when the year rolled over to 2000, those programs would think it was 1900 and go haywire. In the end, when January 1, 2000 arrived, there was no major impact, but that was because engineers had spent years and tens of billions of dollars fixing critical systems.



