3 Comments
ToxSec

Really interesting angle thanks

Daniel Popescu / ⧉ Pluralisk

This article comes at the perfect time, thank you for articulating so clearly the insidious prompt-level tech debt we are only now starting to truly grapple with in our GenAI development.

Lakshya Agarwal

The difference arises from post-training regimes that the model companies follow. The earlier generations (4o-era) were primarily RLHF-tuned, while the current ones (5-era) are RLVR-tuned.

RLHF requires human feedback, while RLVR uses “verifiable rewards” (code/math-adjacent), meaning it can scale much faster. It’s way easier to generate 20 calculus questions that are slightly different from each other than it is to generate 20 conversational scenarios and collect “ideal-state” feedback. RLVR is also better suited for “agentic” trajectories, where the model can explore its environment autonomously and learn from its errors to eventually complete the objective. This is the key unlock delivering the current performance gains in coding. Cursor IDE is an example of such an environment for code: given an issue and a codebase, the model receives a positive reward if its actions convert a failing test to passing, or a negative reward if it errors out.
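The "verifiable reward" idea can be sketched in a few lines: the reward signal comes from a programmatic check (did the tests go green?) rather than a human preference label. This is an illustrative toy, not any lab's actual training code; all names are made up for the example.

```python
from typing import Callable

def verifiable_reward(run_tests: Callable[[], bool]) -> float:
    """Reward the agent +1 if its edits make the check pass, -1 otherwise.

    The check is any deterministic verifier (a unit test, a compiler,
    an exact-match grader), so no human rater is needed in the loop.
    """
    return 1.0 if run_tests() else -1.0

# Toy example: the "test" checks the behavior of a patched function.
def patched_add(a: int, b: int) -> int:
    return a + b

reward = verifiable_reward(lambda: patched_add(2, 3) == 5)   # passes -> 1.0
penalty = verifiable_reward(lambda: patched_add(2, 3) == 6)  # fails  -> -1.0
```

Because the verifier is cheap and automatic, thousands of such trajectories can be scored in parallel, which is what lets RLVR scale past human-labeled feedback.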

While the objective of both regimes is to move the base model's capabilities toward “chat-style” scenarios, RLVR does so with much higher efficiency. As a result, the current models are highly prompt-sensitive relative to their earlier counterparts.

A promising solution to this “tech debt” is GEPA [1], which combines genetic evolution and Pareto scoring to iteratively improve prompts based on task feedback. Similar to how traditional ML ended up with online learning and continuous optimization, we may see something similar play out.

[1] http://lakshyaag.com/blogs/gepa
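The Pareto-scoring idea mentioned above can be sketched as follows. Instead of keeping only the prompt with the best average score, keep every prompt that is best on at least one task, so diverse “specialists” survive to be mutated in the next generation. This is a loose toy in the spirit of GEPA, not its actual implementation; the scores and prompt names are illustrative.

```python
def pareto_candidates(scores: dict[str, list[float]]) -> set[str]:
    """scores maps prompt name -> per-task scores.

    Return every prompt that achieves the maximum score on at least
    one task; prompts that are never best anywhere are dropped.
    """
    n_tasks = len(next(iter(scores.values())))
    winners: set[str] = set()
    for t in range(n_tasks):
        best = max(s[t] for s in scores.values())
        winners |= {p for p, s in scores.items() if s[t] == best}
    return winners

scores = {
    "prompt_a": [0.9, 0.2, 0.5],
    "prompt_b": [0.4, 0.8, 0.5],
    "prompt_c": [0.3, 0.3, 0.3],  # dominated: never best on any task
}
survivors = pareto_candidates(scores)  # keeps prompt_a and prompt_b
```

Keeping per-task winners rather than a single average-score champion preserves complementary strengths across the candidate pool, which is what makes the iterative mutation step productive.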
