It’s been almost two weeks since Claude Sonnet 3.5 came out. And… drumroll 🥁… I switched Wanderly to Claude’s Sonnet 3.5 model! 👏 I wish I could have done it even faster, but as a solopreneur, I still have to be very judicious about how much context I switch[1]. I’m in the very final stages of launching apps for both iOS and Android and planning a new marketing strategy (more on both very soon).
I’ve run three scrappy evals between OpenAI’s models and Anthropic’s models, which I wrote about in October ‘23 and May ‘24. Each time, I’ve been excited by Claude’s potential, but it’s fallen short of GPT’s overall capabilities.
When I tested in May, I narrowed my evaluation criteria to a concept I’m calling “system-prompt[2] drift”. To test system-prompt drift, you ask the model to do something hard and measure whether it gets worse at that hard thing as the conversation gets deeper (the model usually regresses to its default behavior). In my case, that was trying to get the model to output a Flesch-Kincaid reading level between levels 2-3, equivalent to a 2nd-grade reader.
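For anyone curious what that measurement looks like in practice, here’s a minimal sketch: score each assistant turn with the standard Flesch-Kincaid grade formula and flag whether it stays in the target band. This is not my actual Wanderly harness, and the syllable counter is a rough vowel-group heuristic (a library like `textstat` would be more accurate), but it shows the shape of the eval.

```python
import re

def count_syllables(word: str) -> int:
    # Rough heuristic: count runs of consecutive vowels, then drop a
    # likely-silent trailing 'e'. Good enough for a scrappy eval,
    # not dictionary-accurate.
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and count > 1:
        count -= 1
    return max(count, 1)

def fk_grade(text: str) -> float:
    # Flesch-Kincaid grade level:
    # 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * (len(words) / max(len(sentences), 1))
            + 11.8 * (syllables / max(len(words), 1))
            - 15.59)

def score_drift(turns: list[str], low: float = 2.0, high: float = 3.0):
    # Score each assistant turn; a model that drifts will show grades
    # creeping out of the [low, high] target band as the index grows.
    return [(i, round(fk_grade(t), 2), low <= fk_grade(t) <= high)
            for i, t in enumerate(turns, 1)]
```

Run `score_drift` over the assistant turns of a deep conversation and you get a per-turn series you can plot, which is essentially what the graphs below show.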
Below is the graph from my last newsletter. You can see that Claude 3 drifted away from the desired reading level as the conversation got deeper. Anecdotally, some of the stories with Claude 3 went off the rails.
But that’s changed with Claude Sonnet 3.5! In the eval I completed this weekend, you can see that Claude Sonnet 3.5 gets into the green zone and stays in the green zone… even more consistently than OpenAI’s GPT-4o[3]!
So I rolled out Claude Sonnet 3.5 yesterday to Wanderly. And for anyone who reads this who is at Anthropic, I wanted to say kudos! I’m sure a lot of love went into this model, and it’s paying off.
As an aside, it’s a real luxury to be able to swap between models so easily. In a few years, it’ll be interesting to see how the companies differentiate and how their market share intersects with open-source models. Maybe the capabilities of Claude 4 and GPT-5 will be enough for most devs… and when open-source models catch up, where does that leave the industry? 🤔
A couple of quick thoughts before I get back to bug-squashing and polishing the mobile apps:
I love the texture of the Claude responses. Claude grasped my desired Wanderly style well (which I want to improve even more now that Claude respects it). It also added little flourishes, like all-caps onomatopoeia (“BOOM,” “HONK”), which I didn’t ask for but which work well for my target reader.
The evals are actually helping me debug my prompts now that my instructions are being followed more closely. For instance, you can see in the graphs above that my intro needs some tuning to get a little lower on reading level.
The dev tools for Anthropic / Claude are still pretty thin. Given OpenAI’s mindshare head start, there are fewer blog posts about Claude, the docs from Anthropic could use more examples, and Amazon Bedrock has been slow to support Sonnet 3.5 globally. If you want to be a developer on the bleeding edge of tech adoption, you’re going to have to work harder to use Claude for now.
A reader[4] of my last post asked me why I never really talk about Google’s Gemini models: I've done a couple of tests with Gemini, and maybe it'll get better, but it's never gone well. e.g., https://x.com/fearofpoets/status/1758908757516624098. I've never seen reviews that convinced me that it was worth putting in the coding effort to give the API a shot. When that changes, I'll definitely add it to the side-by-side matrix though :)
Alright, that’s all for now. I felt like I owed y’all an update because I revisited this topic several times, and Anthropic flipped OpenAI for the first time. Big news!
In closing, ping me if you’re interested in testing out either the iOS or Android app. I’d love your help getting it ready for public release. 🙂
[1] I sometimes wonder if a post about how I spend my time and the focus challenges of solopreneurship would be interesting. Let me know if you think I should write it.

[2] The system prompt is a set of overarching instructions beyond just the user-agent conversation with the model.

[3] Some of you may notice that GPT-4o is performing less well than last time. There are a couple of reasons why: 1) these models do update under the hood, so it's good to check back in regularly, and 2) when I say "scrappy eval", part of what I mean is that the sample size isn't as large as I would like it to be. I have decided, for now, that scaled evals aren't worth the cost to Wanderly's bottom line. Once I get to scale 🤞, I can invest more.

[4] I love reader questions btw. They always give me great alternative perspectives and a burst of motivation. Thanks to all of you who write back! 🙏