Pretty true, and definitely a good exercise. But if we're going to actually use these things in practice, we need more. Things like prompt caching, capabilities/constraints, etc. It's pretty dangerous to let an agent go hog wild in an unprotected environment.
Oh sure! And if I were talking someone through building a barebones agent, I'd definitely tag on a warning along the lines of "but don't actually use this without XYZ!" That said, you can add prompt caching by just setting a couple of parameters in the API calls to the LLM. I agree constraints are a much more complex topic, although even in my 100-line example I am able to fit in a user approval step before any file write or bash action.
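For what it's worth, that approval step can be tiny. Here's a minimal sketch of the idea (the tool names and helper are illustrative, not from my actual example): before dispatching any dangerous tool call, ask the user first.

```python
# Illustrative sketch of a user-approval gate before risky tool calls.
# Tool names and the run_tool helper are made up for this example.

DANGEROUS_TOOLS = {"write_file", "bash"}

def run_tool(name, args, ask=input):
    """Dispatch a tool call, but require user approval for risky ones."""
    if name in DANGEROUS_TOOLS:
        answer = ask(f"Allow {name} with {args!r}? [y/N] ")
        if answer.strip().lower() != "y":
            # Tell the LLM the action was declined instead of running it
            return "(skipped: user declined)"
    # In a real agent this would actually execute the tool
    return f"(ran {name})"
```

Safe tools pass straight through; anything in the dangerous set blocks on a yes/no prompt.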
when you say prompt caching, does it mean cache the thing you send to the llm or the thing you get back?
sounds like prompt is what you send, and caching is important here because what you send is derived from previous responses from llm calls earlier?
sorry to sound dense, I struggle to understand where and how in the mental model the non-determinism of a response is dealt with. is it just that it's all cached?
Not dense to ask questions! There are two separate concepts in play:
1) Maintaining the state of the "conversation" history with the LLM. LLMs are stateless, so you have to store the entire series of interactions on the client side in your agent (every user prompt, every LLM response, every tool call, every tool call result). You then send the entire previous conversation history to the LLM every time you call it, so it can "see" what has already happened. In a basic agent, it's essentially just a big list of strings, and you pass it into the LLM api on every LLM call.
2) "Prompt caching", which is a clever optimization in the LLM infrastructure to take advantage of the fact that most LLM interactions involve processing a lot of unchanging past conversation history, plus a little bit of new text at the end. Understanding it fully requires understanding the internals of the transformer architecture, but the essence is that you can save a lot of GPU compute time by caching intermediate states from the previous call and reusing them on the next one. You cache the entire history: the base prompt, the user's messages, the LLM's responses, the tool calls, everything. As a user of an LLM API, you don't have to worry about how any of it works under the hood, you just have to enable it. The reason to turn it on is that it dramatically reduces both response time and cost.
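To make "just setting a couple of parameters" concrete, here's roughly what enabling it looks like in a request payload, loosely modeled on Anthropic's cache_control marker (check your provider's docs for the exact parameter names; the model id here is a placeholder):

```python
# Sketch of an LLM request payload with prompt caching enabled,
# loosely modeled on Anthropic-style cache_control markers.

system_prompt = "You are a coding agent with file and bash tools..."
history = [{"role": "user", "content": "list the files"}]

payload = {
    "model": "some-model-id",  # placeholder
    "system": [
        {
            "type": "text",
            "text": system_prompt,
            # This marker asks the server to cache everything up to this
            # point, so subsequent calls skip recomputing it on the GPU.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    "messages": history,  # the full conversation history, as in (1)
}
```

The key point: the client still sends the whole history every time; the caching happens server-side, and the marker just opts you in.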
Very helpful. It helps me better understand the specifics behind each call and response, the internal units, and whether those units are sent to and received "live" from the LLM or come from a traditional db or cache store.
I'm personally just curious how much cleverness and insight any given product really adds "on top of" the foundation models. I'm not in it deep enough to make claims one way or the other.