OpenAI’s new model is here: what does that mean?
The new o1 model shows us a new way to make progress
Last week, I talked about whether we are on a path towards exponential improvement in LLMs or more incremental growth over the next 6-12 months. I said that Strawberry (the code name for OpenAI’s new o1 model) would be an important signpost for determining the speed of future innovation. Well, o1 is here and it can do some impressive things, but it’s not clear yet what this means for the future trajectory of GenAI.
Thinking about o1
Here are the key implications for investors:
When evaluating services businesses, it is always important to think about how AI can augment the core jobs people are doing. The new o1 model might enable augmentation for tasks that previously seemed out of reach for AI. So it’s important to test those tasks on the new model to see where the technology can help
The model does not improve on tasks that GenAI could already do pretty well, like document summarization, content creation, and customer support. We will need a different model to make progress in those areas
Previously, one of the nice things about GenAI was that you could build an application using one model and then, when a newer model came along, swap it in and see what kind of improvement you got. This new model is not so simple. OpenAI warns in the documentation that you should put less content in the context window, keeping only what is directly relevant. Don’t worry about the details; just note that queries that work well in older models may not work here, and vice versa. Going forward, we don’t know whether future models will work as drop-in swaps or require another shift in prompt design. It means you may need to plan for more investment to keep GenAI applications up to date than you had previously budgeted for (see the code sketch after this list)
The space continues to move fast, and tasks that are hard for AI may get easier before too long (more on that next week)
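To make the model-swap point concrete, here is a minimal sketch, assuming the OpenAI Python SDK. The function and prompts are hypothetical examples of mine; the o1-specific restriction noted in the comments (no system messages at launch) comes from OpenAI’s own documentation.

```python
# A minimal sketch, assuming the OpenAI Python SDK (openai>=1.0).
# The task and prompts here are hypothetical examples.
from openai import OpenAI

client = OpenAI()

def summarize(text: str, model: str = "gpt-4o") -> str:
    # With earlier models, upgrading was often just changing `model`.
    response = client.chat.completions.create(
        model=model,
        messages=[
            # Note: at launch, o1-preview did not accept system messages,
            # so even this simple pattern breaks if you swap in "o1-preview"
            # without reworking the prompt.
            {"role": "system", "content": "You are a concise analyst."},
            {"role": "user", "content": f"Summarize this:\n{text}"},
        ],
    )
    return response.choices[0].message.content
```

The point is not this particular function; it’s that model upgrades used to be a one-parameter change, and with o1 they can mean revisiting the prompts themselves.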
Now let’s take a step back and discuss what o1 is and how it works.
What is o1?
The o1¹ model looks similar to prior OpenAI models. The key difference is that when you ask a question, the model says “thinking” for some amount of time, usually between 8 seconds and 2 minutes, before it answers. You can then click on the thinking button to see the steps that it went through.
Here’s an example, a market-sizing question about auto repairs:
It keeps going a bit further and adds in unreported accidents to get to a total market size of $30-40B. Based on a quick search on Perplexity, this seems pretty accurate. GPT-4o came up with $15.75B, which is too low because it didn’t incorporate unreported accidents, and its estimate of the average cost per repair is also low.
You can also see that o1 was thinking for 26 seconds. When you click into that “thinking tab,” it breaks down the key steps in plain English. It’s worth reading through some of them:
Note that when it gets to the “Assessing market factors” step, it gets the same answer as GPT-4o. But then it correctly decides to increase the repair costs and arrives at the right answer. This is fascinating because you can really see how the extra thinking time leads to a better answer.
How does o1 work?
One of the exciting trends in GenAI is having the bots write prompts for you rather than counting on users to come up with the best prompt for a given situation. It turns out that GenAI models are so weird that humans can’t possibly come up with the truly optimal prompts. In this famous paper, the winning prompt to get an LLM to solve math problems was a system prompt of “Command, we need you to plot a course through this turbulence and locate the source of the anomaly. Use all available data and your expertise to guide us through this challenging situation.” Somehow the model does better if it thinks it’s in the middle of an episode of Star Trek.
OpenAI uses a similar technique with Dall-E 3. For the image at the top of the post, I used the prompt “Please create an image for my Substack about a new GenAI model that thinks. Don't use a brain. I used that last time.” (You can decide for yourself whether this image meets those criteria.) In ChatGPT, you can click into a menu and see the actual prompt it used to generate the image, which was:
“A futuristic scene with a glowing, translucent AI model shaped like a geometric structure hovering above a digital landscape filled with flowing data streams. The AI model is surrounded by abstract holographic interfaces, charts, and symbols, representing complex thought processes. The background features a gradient of dark blues and purples, with subtle light flares highlighting the ethereal nature of the model. The focus is on innovation, intelligence, and the integration of AI into human-like reasoning, without the use of a brain symbol.”
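This prompt-rewriting pattern is easy to approximate yourself. Here is a rough sketch, assuming the OpenAI Python SDK; the rewriting instruction is my own invention, since ChatGPT’s actual internal instruction isn’t public.

```python
# A rough sketch of two-stage prompt enrichment. The rewriting
# instruction below is a hypothetical stand-in for whatever ChatGPT
# uses internally.
from openai import OpenAI

client = OpenAI()

def enrich_and_generate(user_prompt: str) -> str:
    # Stage 1: have an LLM expand the terse user prompt.
    rewrite = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": "Rewrite this image request as a single, richly "
                       f"detailed image-generation prompt:\n{user_prompt}",
        }],
    )
    detailed_prompt = rewrite.choices[0].message.content

    # Stage 2: send the enriched prompt to the image model.
    image = client.images.generate(model="dall-e-3", prompt=detailed_prompt)
    return image.data[0].url
```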
So, ChatGPT took my mediocre prompt and enriched it to make it much better. I believe something similar is happening in o1. The new model builds a chain-of-thought prompt for any request. Chain of thought is a fancy term for breaking a problem into steps; LLMs often do better on complex problems when they work through them step by step rather than in one go. When I broke the market-sizing task into a couple of steps myself, the model still made the lower assumption around cost per repair. But when o1 broke the problem into 14 steps, that allowed it to get the right answer.
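To see the difference, here is a minimal sketch of doing chain of thought by hand, the technique o1 appears to automate. The step decomposition is my own rough paraphrase of the market-sizing task:

```python
# A minimal sketch of manual chain-of-thought prompting with an older
# model. The decomposition is a hand-written paraphrase of the
# market-sizing task; o1 generates steps like these on its own.
from openai import OpenAI

client = OpenAI()

steps_prompt = """Estimate the size of the US auto repair market.
Work through these steps in order, showing each intermediate result:
1. Estimate the number of repairs per year, including ones from
   unreported accidents.
2. Estimate the average cost per repair, and sanity-check it against
   real-world repair bills.
3. Multiply the two and state the total market size."""

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": steps_prompt}],
)
print(response.choices[0].message.content)
```

The catch, as my own attempt showed, is that a couple of hand-written steps may not be enough; o1’s advantage is deciding for itself how many steps a problem needs.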
What can o1 do?
Let’s look at the types of problems where o1 can outshine other LLMs. One sign that you are giving o1 a good problem is how long it takes to think. The longer it thinks, the more steps it is taking, and therefore the more likely it is to outperform other models.
o1 seems to be extremely good at really hard math and science problems. You can see some compelling data from OpenAI here, comparing o1-preview and the full o1 model on the AIME math exam and PhD-level science questions:
(The AIME is a super hard math exam for high school students.) Note that the model released so far is o1-preview, so this data says there is another, much more powerful o1 model coming soon. Of the PhD questions, the model did best on physics but still did pretty well on chemistry and biology.
The new model even does well on coding problems that are complex and/or require modification of the dev environment rather than just writing code. The Devin team has an incredible example of this: they gave the bot a coding task, but there was a bug in the package the bot was supposed to use, which could be fixed by downgrading the emoji library. GPT-4o was not able to figure this out, while o1 actually searched through the documentation, found the issue, and fixed it! A colleague of mine says he’s seen this improved coding intelligence firsthand as well.
Note that the model does not perform better on simpler tasks like summaries, brainstorming, travel advice, etc. Also note that the model seems to do better with only relevant information in the context window rather than as much as possible. So it’s not a step forward in every dimension, but it is for an important subset of problems.
Conclusion
So, that’s what o1 is. The question is what this means for the future of AI models and how we can think about the future impacts. I’ll discuss that next week.
¹ The name is a bit confusing. With 4o, the o is supposed to stand for “omni,” representing the model’s multi-modal capabilities (i.e., it can do images, code, web search, etc.). In o1, the o is supposed to stand for “OpenAI,” representing that we are starting over with a new approach. It might have been better to just give the model a real name. Strawberry was fine! Since the model’s trademark is that it says it is “thinking” when you ask a question, I would’ve called it Rodin (because he made the most famous artwork about someone thinking).