Pretty true, and definitely a good exercise. But if we're going to actually use these things in practice, we need more. Things like prompt caching, capabilities/constraints, etc. It's pretty dangerous to let an agent go hog wild in an unprotected environment.
Oh sure! And if I were talking someone through building a barebones agent, I'd definitely tag on a warning along the lines of "but don't actually use this without XYZ!" That said, you can add prompt caching by just setting a couple of parameters in the API calls to the LLM. I agree constraints are a much more complex topic, although even in my 100-line example I am able to fit in a user approval step before any file write or bash action.
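For what it's worth, that approval step can be tiny. Here's a minimal sketch of the idea (the tool names and helper are illustrative, not from my actual example): before dispatching any dangerous tool call, ask the user first.

```python
# Illustrative sketch of a user-approval gate before risky tool calls.
# Tool names and the run_tool helper are made up for this example.

DANGEROUS_TOOLS = {"write_file", "bash"}

def run_tool(name, args, ask=input):
    """Dispatch a tool call, but require user approval for risky ones."""
    if name in DANGEROUS_TOOLS:
        answer = ask(f"Allow {name} with {args!r}? [y/N] ")
        if answer.strip().lower() != "y":
            # Tell the LLM the action was declined instead of running it
            return "(skipped: user declined)"
    # In a real agent this would actually execute the tool
    return f"(ran {name})"
```

Safe tools pass straight through; anything in the dangerous set blocks on a yes/no prompt.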
when you say prompt caching, does it mean cache the thing you send to the llm or the thing you get back?
sounds like prompt is what you send, and caching is important here because what you send is derived from previous responses from llm calls earlier?
sorry to sound dense, I struggle to understand where and how in the mental model the non-determinism of a response is dealt with. is it just that it's all cached?
Not dense to ask questions! There are two separate concepts in play:
1) Maintaining the state of the "conversation" history with the LLM. LLMs are stateless, so you have to store the entire series of interactions on the client side in your agent (every user prompt, every LLM response, every tool call, every tool call result). You then send the entire previous conversation history to the LLM every time you call it, so it can "see" what has already happened. In a basic agent, it's essentially just a big list of strings, and you pass it into the LLM api on every LLM call.
2) "Prompt caching", which is a clever optimization in the LLM infrastructure to take advantage of the fact that most LLM interactions involve processing a lot of unchanging past conversation history, plus a little bit of new text at the end. Understanding it fully requires understanding the internals of the transformer architecture, but the essence is that you can save a lot of GPU compute time by caching intermediate states from the previous call and reusing them on the next one. You cache the entire history: the base prompt, the user's messages, the LLM's responses, the tool calls, everything. As a user of an LLM API, you don't have to worry about how any of it works under the hood, you just have to enable it. The reason to turn it on is that it dramatically reduces both response time and cost.
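To make "just setting a couple of parameters" concrete, here's roughly what enabling it looks like in a request payload, loosely modeled on Anthropic's cache_control marker (check your provider's docs for the exact parameter names; the model id here is a placeholder):

```python
# Sketch of an LLM request payload with prompt caching enabled,
# loosely modeled on Anthropic-style cache_control markers.

system_prompt = "You are a coding agent with file and bash tools..."
history = [{"role": "user", "content": "list the files"}]

payload = {
    "model": "some-model-id",  # placeholder
    "system": [
        {
            "type": "text",
            "text": system_prompt,
            # This marker asks the server to cache everything up to this
            # point, so subsequent calls skip recomputing it on the GPU.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    "messages": history,  # the full conversation history, as in (1)
}
```

The key point: the client still sends the whole history every time; the caching happens server-side, and the marker just opts you in.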
Very helpful. It helps me better understand the specifics behind each call and response, the internal units, and whether those units are sent to and received "live" from the LLM or come from a traditional db or cache store.
I'm personally just curious how much cleverness and insight any given product really adds "on top of" the foundation models. I'm not in it deep enough to make claims one way or the other.