
$1/M input tokens and $5/M output tokens is good compared to Claude Sonnet 4.5, but thanks to the pace at which the industry is developing smaller/faster LLMs, you can now get comparable models priced much lower, which matters at the scale agentic coding requires.

Given that Sonnet is still a popular model for coding despite the much higher cost, I expect Haiku will get traction if the quality is as good as this post claims.



With caching that's 10 cents per million in. Most of the cheap open source models (which this claims to beat, except GLM 4.6) have caching that is more limited and less effective.

This could be massive.


The funny thing is that even in this area Anthropic is behind the other 3 labs (Google, OpenAI, xAI). It's the only one of the 4 that requires you to manually set cache breakpoints, and writing to the cache costs 25% more than regular input tokens. The other 3 have fully free implicit caching, although Google also offers paid, explicit caching.

https://docs.claude.com/en/docs/build-with-claude/prompt-cac...

https://ai.google.dev/gemini-api/docs/caching

https://platform.openai.com/docs/guides/prompt-caching

https://docs.x.ai/docs/models#cached-prompt-tokens
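For anyone who hasn't used it: Anthropic's manual breakpoints look roughly like this (a minimal sketch assuming the Python SDK, a placeholder model id and made-up prompt text; see the docs linked above for the authoritative version):

    import anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

    STATIC_INSTRUCTIONS = "(imagine several thousand tokens of stable system prompt here)"

    response = client.messages.create(
        model="claude-haiku-4-5",  # placeholder model id
        max_tokens=1024,
        system=[
            {
                "type": "text",
                "text": STATIC_INSTRUCTIONS,
                # the manual breakpoint: everything up to here becomes the cached prefix
                "cache_control": {"type": "ephemeral"},
            }
        ],
        messages=[{"role": "user", "content": "Summarise the instructions above."}],
    )

    # usage reports cache_creation_input_tokens (the 1.25x write) and
    # cache_read_input_tokens (the ~0.1x read) separately
    print(response.usage)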


I don't understand why we're paying for caching at all (except: model providers can charge for it). It's almost extortion - the provider stores some data for 5min on some disk, and gets to sell their highly limited GPU resources to someone else instead (because you are using the kv cache instead of GPU capacity for a good chunk of your input tokens). They charge you 10% of their GPU-level prices for effectively _not_ using their GPU at all for the tokens that hit the cache.

If I'm missing something about how inference works that explains why there is still a cost for cached tokens, please let me know!


It's not about storing data on disk, it's about keeping data resident in memory.


DeepSeek pioneered automatic prefix caching and caches on SSD. SSD reads are so fast compared to LLM inference that I can't think of a reason to waste RAM on it.


It’s not instantly fast though. Context is probably ~20 GB of VRAM at max context size. That’s gonna take some time to get from SSD no matter what.

TTFT (time to first token) will get slower if you export the KV cache to SSD.
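Back-of-envelope numbers (entirely hypothetical model dimensions, just to show the order of magnitude):

    # KV cache size per request for a hypothetical GQA model:
    # 60 layers, 8 KV heads, head_dim 128, fp16 values.
    layers, kv_heads, head_dim, bytes_per_value = 60, 8, 128, 2
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_value  # x2 for K and V
    context = 100_000
    total_gb = per_token * context / 1e9
    print(f"{per_token / 1024:.0f} KiB per token, ~{total_gb:.0f} GB at {context:,} tokens")
    # -> 240 KiB per token, ~25 GB, so the ~20 GB estimate above is the right ballpark

    ssd_gbps = 7  # optimistic sequential read from a single NVMe drive, GB/s
    print(f"~{total_gb / ssd_gbps:.1f} s just to stream that back from SSD")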


Fascinating, so I have to think more "pay for RAM/redis" than "pay for SSD"?


"pay for data on VRAM" RAM of GPU


But that doesn't make sense? Why would they keep the cache persistent in the VRAM of the GPU nodes, which is needed for the model weights? Shouldn't they be able to swap the KV cache of your prompt in and out when you actually use it?


Your intuition is correct and the sibling comments are wrong. Modern LLM inference servers support hierarchical caches (where data moves to slower storage tiers), often with pluggable backends. A popular open-source backend for the "slow" tier is Mooncake: https://github.com/kvcache-ai/Mooncake


OK that's pretty fascinating, turns out Mooncake includes a trick that can populate GPU VRAM directly from NVMe SSD without it having to go through the host's regular CPU and RAM first!

https://github.com/kvcache-ai/Mooncake/blob/main/doc/en/tran...

> Transfer Engine also leverages the NVMeof protocol to support direct data transfer from files on NVMe to DRAM/VRAM via PCIe, without going through the CPU and achieving zero-copy.


They are not caching to save network bandwidth. They are caching to increase inference speed and reduce (their own) costs.


That is slow.


I vastly prefer the manual caching. There are several aspects of automatic caching that are suboptimal, with only moderately less developer burden. I don't use Anthropic much, but I wish the others had manual cache options.


What's sub-optimal about the OpenAI approach, where you get 90% discount on tokens that you've previously sent within X minutes?


Lots of situations; here are 2 I've faced recently (cannot give too much detail for privacy reasons, but it should be clear enough):

1) Low latency desired, long user prompt.

2) A function runs many parallel requests, but is not fired with a common prefix very often. OpenAI was very inconsistent about properly caching the prefix for use across all requests, but with Anthropic it's very easy to pre-fire (see the sketch below).
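The pre-fire pattern is roughly: one cheap request whose only job is to pay the cache write for the shared prefix, then fan out the parallel calls so they all read it. A sketch assuming Anthropic's async Python SDK, with placeholder prompt contents and model id:

    import asyncio
    import anthropic

    client = anthropic.AsyncAnthropic()

    SHARED_PREFIX = [{
        "type": "text",
        "text": "(large shared context that every parallel request needs)",
        "cache_control": {"type": "ephemeral"},  # breakpoint after the shared part
    }]

    async def ask(question: str):
        return await client.messages.create(
            model="claude-haiku-4-5",  # placeholder model id
            max_tokens=512,
            system=SHARED_PREFIX,
            messages=[{"role": "user", "content": question}],
        )

    async def main(questions):
        await ask("ok")  # pre-fire: pays the cache write once, up front
        # fan out: these should all be cache reads on the shared prefix
        return await asyncio.gather(*(ask(q) for q in questions))

    asyncio.run(main(["question 1", "question 2", "question 3"]))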


Is it wherever the tokens appear, or only the first N tokens they've seen before? I.e. if my prompt is 99% the same except for the first token, will it be cached?


The prefix has to be stable. If your prompt is 99% the same but the first token is different, it won't cache at all. You end up having to design your prompts to accommodate this.
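In practice that means putting everything stable (system prompt, tool definitions, reference documents) first, and everything that varies (the question, timestamps, request IDs) last. A toy illustration with made-up prompt pieces:

    from datetime import datetime

    STATIC_SYSTEM_PROMPT = "(long, stable instructions and reference material)"
    user_question = "Why does the login test fail?"

    # cache-friendly: every request shares the longest possible identical prefix
    good = STATIC_SYSTEM_PROMPT + "\n\n" + user_question

    # cache-hostile: a timestamp up front means the first tokens differ on every
    # request, so nothing after them can be reused
    bad = f"Current time: {datetime.now():%Y-%m-%d %H:%M}\n" + STATIC_SYSTEM_PROMPT + "\n\n" + user_question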


which is important to bear in mind if people are introducing a "drop earliest messages" sliding window for context management in a "chat-like" experience. once you're at that context limit and start dropping the earliest messages, you're guaranteeing every message afterwards will be a cache miss.

a simple alternative approach is to introduce hysteresis by having both a high and low context limit. if you hit the higher limit, trim to the lower. this batches together the cache misses.

if users are able to edit, remove or re-generate earlier messages, you can further improve on that by keeping track of cache prefixes and their TTLs, so rather than blindly trimming to the lower limit, you instead trim to the longest active cache prefix. only if there are none, do you trim to the lower limit.
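a minimal sketch of the hysteresis idea (hypothetical limits; count_tokens stands in for whatever tokenizer you actually use):

    HIGH_LIMIT = 150_000  # start trimming only once the conversation exceeds this
    LOW_LIMIT = 100_000   # ...and then cut all the way down to this in one go

    def trim_history(messages, count_tokens):
        """Batch cache misses: trim rarely but deeply, so the prefix stays
        stable (and cached) for many turns in between."""
        total = sum(count_tokens(m) for m in messages)
        if total <= HIGH_LIMIT:
            return messages  # untouched prefix -> cache hit
        while messages and total > LOW_LIMIT:
            total -= count_tokens(messages[0])
            messages = messages[1:]  # drop the oldest message
        return messages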


That's what I thought, thanks Simon.


because you can have multiple breakpoints with Anthropic's approach, whereas with OpenAI, you only have breakpoints for what was sent.

for example if a user sends a large number of tokens, like a file, and a question, and then they change the question.


I thought OpenAI would still handle that case? Their cache would work up to the end of the file and you would then pay for uncached tokens for the user's question. Have I misunderstood how their caching works?


not if call #1 is the file + the question, call #2 is the file + a different question, no.

if call #1 is the file, call #2 is the file + the question, call #3 is the file + a different question, then yes.

and consider that "the file" can equally be a lengthy chat history, especially after the cache TTL has elapsed.


I vibe-coded up a quick UI for exploring this: https://tools.simonwillison.net/prompt-caching

As far as I can tell it will indeed reuse the cache up to the point where the prompts diverge, so this works:

Prompt A + B + C - uncached

Prompt A + B + D - uses cache for A + B

Prompt A + E - uses cache for A


$1/M is hardly a big improvement over GPT-5's $1.25/M (or Gemini Pro's $1.5/M), and given how much worse Haiku is than those at any kind of difficult problem (or problems with a large context size), I can't imagine it being a particularly competitive alternative for coding. Especially for anything math/logic related, I find GPT-5 and Gemini Pro to be significantly better even than Opus (which is reflected in their models having won Olympiad prizes while Anthropic's have not).


GPT-5 is $10/M for output tokens, twice the cost of Haiku 4.5 at $5/M, despite Haiku apparently being better at some tasks (SWE Bench).

I suppose it depends on how you are using it, but for coding isn't output cost more relevant than input - requirements in, code out ?


> I suppose it depends on how you are using it, but for coding isn't output cost more relevant than input - requirements in, code out ?

Depends on what you're doing, but for modifying an existing project (rather than greenfield), input tokens >> output tokens in my experience.


Unless you're working on a small greenfield project, you'll usually have tens to hundreds of thousands of words (~tokens) of relevant code in context for every query, vs a few hundred words of changes being output per query. Most changes to an existing project are relatively small in scope.
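Rough numbers for a hypothetical (but typical) request shaped like that, at Haiku 4.5's uncached prices:

    input_tokens, output_tokens = 50_000, 1_000  # lots of existing code in, a small diff out
    price_in, price_out = 1.00, 5.00             # $ per million tokens
    print(f"input ${input_tokens / 1e6 * price_in:.3f} vs output ${output_tokens / 1e6 * price_out:.3f}")
    # -> input $0.050 vs output $0.005: input dominates, which is why input price
    #    (and caching) matters more for this workload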


I am a professional developer so I don't care about the costs. I would be willing to pay more for 4.5 Haiku vs 4.5 Sonnet because the speed is so valuable.

I spend way too much time waiting for the cutting edge models to return a response. 73% on SWE Bench is plenty good enough for me.


How do you review code when the LLM can produce so much so fast?


Just read it when it is done writing it.

with an LLM


Yeah, I'm a bit disappointed by the price. Claude 3.5 Haiku was $0.8/$4, 4.5 Haiku is $1/$5.

I was hoping Anthropic would introduce something price-competitive with the cheaper models from OpenAI and Gemini, which get as low as $0.05/$0.40 (GPT-5-Nano) and $0.075/$0.30 (Gemini 2.0 Flash Lite).


I am a bit mind-boggled by the pricing lately, especially since the cost increased even further. Is this driven by choices in model deployment (unquantized etc.) or simply by perceived quality (as in 'hey, our model is crazy good and we are going to charge for it')?


There's probably less margin on the low end, so they don't want to focus on capturing it.


Margin? Hahahahaha


Inference is profitable.


If you completely ignore inference revenue needing to offset training costs. Is inference still profitable if you account for the amortized training cost?

Not for the big labs, who are engaged in an astonishingly competitive buildout right now.

There are a bunch of companies who offer inference against open weight models trained by other people. They get to skip the training costs.


> If you completely ignore inference revenue needing to offset training costs.

This is what people mean when they say margin. When you buy a pair of shoes, the margin is the price minus materials and labor; it doesn't include the cost of the factory or the store they were bought in.


This also means API usage through Claude Code got more expensive (but better if benchmarks are to be believed)



