Every call to the Claude API returns a stop_reason field. In a single-turn application, you probably ignore it — the response is done, move on. In an agentic loop, stop_reason is the control signal your orchestrator must act on correctly every single time. Get it wrong once and your loop either exits too early, runs forever, or corrupts its own context.
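Four values come up most often: end_turn, tool_use, max_tokens, and stop_sequence. The API documents others as well, such as pause_turn and refusal, so a loop should fail loudly on a value it does not recognize rather than assume the set is closed. Each one means something different about what the model did and what you need to do next. The most dangerous assumption in agentic development is treating any of them as equivalent to the others.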
This lesson covers what each value means, how to handle it correctly, and the specific failure mode that catches most developers off guard: receiving max_tokens while the model was mid-way through emitting a tool call.
Subtopic 1.1. Questions present a stopped loop with a specific stop_reason and ask what the orchestrator does next. The max_tokens + partial tool_use combination is the highest-frequency trap on the exam. Know what 'partial tool_use block' means before test day.
Each stop_reason describes why the model stopped generating, not what it produced. Your loop needs to branch on this value before touching anything else in the response.
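import anthropic

client = anthropic.Anthropic()

MAX_TOKENS_CEILING = 16_384  # illustrative upper bound for retries, not an API constant


def first_text(content: list, default: str = "") -> str:
    """Return the text of the first text block, or `default` if there is none."""
    return next((block.text for block in content if hasattr(block, "text")), default)


def run_agentic_loop(messages: list, tools: list) -> str:
    # execute_tools is assumed to be defined elsewhere: it runs each tool_use
    # block and returns the matching list of tool_result blocks.
    max_tokens = 4096
    while True:
        response = client.messages.create(
            model="claude-opus-4-7",
            max_tokens=max_tokens,
            tools=tools,
            messages=messages,
        )

        # The model finished its turn: return the answer.
        if response.stop_reason == "end_turn":
            return first_text(response.content)

        # Complete tool call(s): run them, feed the results back, keep looping.
        if response.stop_reason == "tool_use":
            messages.append({"role": "assistant", "content": response.content})
            tool_results = execute_tools(response.content)
            messages.append({"role": "user", "content": tool_results})
            continue

        # Output cap reached: the last block may be cut off.
        if response.stop_reason == "max_tokens":
            if response.content and response.content[-1].type == "tool_use":
                # A truncated tool call cannot be executed safely. Discard the
                # response and retry the same request with more room.
                if max_tokens >= MAX_TOKENS_CEILING:
                    raise RuntimeError("tool call still truncated at the max_tokens ceiling")
                max_tokens = min(max_tokens * 2, MAX_TOKENS_CEILING)
                continue
            return first_text(response.content, "[Response truncated]")

        # A custom stop sequence matched: return the text generated so far.
        if response.stop_reason == "stop_sequence":
            return first_text(response.content)

        # pause_turn, refusal, or any value added later: fail loudly
        # instead of looping forever on the same messages.
        raise RuntimeError(f"Unhandled stop_reason: {response.stop_reason}")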
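The loop above calls execute_tools without defining it. The following is a minimal sketch of one possible shape, assuming a hypothetical TOOL_REGISTRY dict that maps tool names to plain Python callables; neither the registry nor the get_weather stub comes from the lesson. It returns one tool_result block per tool_use block, matched by tool_use_id, and reports failures with is_error instead of raising, so the model can see what went wrong on the next turn.

```python
from types import SimpleNamespace


def get_weather(city: str) -> str:
    """Stub tool, for illustration only."""
    return f"Sunny in {city}"


# Hypothetical registry: tool name -> Python callable.
TOOL_REGISTRY = {"get_weather": get_weather}


def execute_tools(content: list) -> list:
    """Run every tool_use block in `content` and return tool_result blocks."""
    results = []
    for block in content:
        if getattr(block, "type", None) != "tool_use":
            continue  # skip text and any other block types
        try:
            output = TOOL_REGISTRY[block.name](**block.input)
            result = {
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": str(output),
            }
        except Exception as exc:  # unknown tool, bad arguments, or tool failure
            result = {
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": f"{type(exc).__name__}: {exc}",
                "is_error": True,
            }
        results.append(result)
    return results


# Stand-in blocks shaped like the SDK's content blocks.
blocks = [
    SimpleNamespace(type="text", text="Let me check."),
    SimpleNamespace(type="tool_use", id="toolu_01", name="get_weather", input={"city": "Oslo"}),
    SimpleNamespace(type="tool_use", id="toolu_02", name="missing_tool", input={}),
]
results = execute_tools(blocks)
```

Every tool_use block gets exactly one result, including the failing call. The API expects a tool_result for each tool_use in the preceding assistant message, so skipping a failed call would break the next request.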
When max_tokens fires, your instinct is to treat the response as complete. After all, the model stopped. But stop_reason tells you why it stopped, not whether it was done. If the model was in the middle of emitting a tool call when it hit the token limit, the response content will end with a tool_use block that may be incomplete, and no tool_result exists for it.
If you return this to the user or exit the loop, you have silently dropped work: the model was about to use a tool and the task is unfinished. If you execute the block as-is, you risk running the tool with missing or malformed arguments. Anthropic's tool-use documentation describes the fix for an incomplete tool_use block: retry the request with a higher max_tokens value so the model can emit the full call. Once the retried response ends with stop_reason tool_use, append it, execute the tools, append the results, and continue the loop.
The exam will offer 'Retry the same request with a higher max_tokens value' as an option when max_tokens fires with a partial tool_use block. Do not dismiss it on cost grounds: a retry does start a new generation and spends extra tokens, but a truncated tool call cannot be executed reliably, and Anthropic's documentation recommends the retry for an incomplete block. Append-and-continue is the correct response once the tool_use blocks in the response are complete.
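Whichever recovery step follows, the loop first has to recognize that a max_tokens response ends in a tool call. Blocks are generated in order, so only the final block can be cut off by the token limit. A minimal sketch of that check, using a hypothetical helper named classify_truncation and stand-in objects in place of real SDK responses:

```python
from types import SimpleNamespace


def classify_truncation(response) -> str:
    """Label a response by how (or whether) max_tokens cut it off."""
    if response.stop_reason != "max_tokens":
        return "not_truncated"
    content = response.content
    # Only the last block can be incomplete; earlier blocks finished
    # before the limit was reached.
    if content and content[-1].type == "tool_use":
        return "truncated_tool_call"
    return "truncated_text"


# Stand-ins for SDK response objects (illustrative values only).
cut_tool = SimpleNamespace(
    stop_reason="max_tokens",
    content=[
        SimpleNamespace(type="text", text="Checking."),
        SimpleNamespace(type="tool_use", id="toolu_01", name="search", input={}),
    ],
)
cut_text = SimpleNamespace(
    stop_reason="max_tokens",
    content=[SimpleNamespace(type="text", text="The answer is")],
)
finished = SimpleNamespace(
    stop_reason="end_turn",
    content=[SimpleNamespace(type="text", text="Done.")],
)

labels = [classify_truncation(r) for r in (cut_tool, cut_text, finished)]
```

Keeping the classification separate from the recovery logic makes the max_tokens branch easy to unit-test without calling the API.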
In a real loop, truncation is often tied to a deeper problem: context accumulation. Each turn appends more content: the previous assistant response, tool results, new user input. The max_tokens parameter itself caps only the output of a single response, but as the prompt grows toward the context window there is less room left for that output, and a long-running loop starts hitting limits even on straightforward requests.
The right production pattern is to monitor your context size at each turn and trigger a distillation step before you hit the limit. The Anthropic SDK's token counting API lets you check size before sending. Build this check into your loop from day one.
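import anthropic

client = anthropic.Anthropic()

CONTEXT_LIMIT = 200_000      # context window size in tokens
DISTILL_THRESHOLD = 0.75     # distill once the prompt passes 75% of the window
MAX_TOKENS_CEILING = 16_384  # illustrative upper bound for retries


def run_loop_with_monitoring(messages: list, tools: list, system: str) -> str:
    # execute_tools and distill_context are assumed to be defined elsewhere.
    max_tokens = 4096
    while True:
        # Measure the prompt before sending it.
        token_count = client.messages.count_tokens(
            model="claude-opus-4-7",
            system=system,
            tools=tools,
            messages=messages,
        )
        if token_count.input_tokens > CONTEXT_LIMIT * DISTILL_THRESHOLD:
            messages = distill_context(messages, system)

        response = client.messages.create(
            model="claude-opus-4-7",
            max_tokens=max_tokens,
            system=system,
            tools=tools,
            messages=messages,
        )
        text = next((b.text for b in response.content if hasattr(b, "text")), "")

        if response.stop_reason in ("end_turn", "stop_sequence"):
            return text

        if response.stop_reason == "tool_use":
            messages.append({"role": "assistant", "content": response.content})
            tool_results = execute_tools(response.content)
            messages.append({"role": "user", "content": tool_results})
            continue

        if response.stop_reason == "max_tokens":
            if response.content and response.content[-1].type == "tool_use":
                # Truncated tool call: do not execute it; retry with more room.
                if max_tokens >= MAX_TOKENS_CEILING:
                    raise RuntimeError("tool call still truncated at the max_tokens ceiling")
                max_tokens = min(max_tokens * 2, MAX_TOKENS_CEILING)
                continue
            # Truncated text with no pending tool call: surface what we have.
            return text or "[Response truncated]"

        # Any other stop_reason: fail loudly instead of looping forever.
        raise RuntimeError(f"Unhandled stop_reason: {response.stop_reason}")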
The 'lost in the middle' effect is relevant here even before you hit the token limit. Research on long-context language models shows that they use information in the middle of a long context less reliably than information near the beginning or the end. In long loops, keep your system prompt at the top (it's always there) and the most recent tool results at the bottom. Summarize older turns rather than keeping them verbatim, so the material that matters most sits where the model uses it best.
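The summarize-older-turns advice can be sketched as a distill_context helper. This is an illustration under stated assumptions, not the lesson's implementation: the summarize callable and the keep_recent parameter are hypothetical additions, and the default summarizer is only a placeholder where a real one would likely call the model. The sketch starts the retained tail on an assistant message so that every tool_result kept verbatim still follows the tool_use it answers.

```python
from typing import Callable


def distill_context(
    messages: list,
    system: str,  # accepted to match the call site; unused in this sketch
    summarize: Callable[[list], str] = lambda older: f"[{len(older)} earlier messages omitted]",
    keep_recent: int = 6,
) -> list:
    """Replace older turns with one summary message; keep recent turns verbatim."""
    if len(messages) <= keep_recent:
        return messages
    start = len(messages) - keep_recent
    # Begin the kept tail on an assistant message so no tool_result is
    # separated from the tool_use block that precedes it.
    while start < len(messages) and messages[start]["role"] != "assistant":
        start += 1
    if start >= len(messages):
        return messages  # no safe cut point found; leave history untouched
    older, recent = messages[:start], messages[start:]
    summary = {
        "role": "user",
        "content": "Summary of earlier turns:\n" + summarize(older),
    }
    return [summary] + recent


# Nine alternating turns, ending on a user message as in a live loop.
history = [
    {"role": "user" if i % 2 == 0 else "assistant", "content": f"turn {i}"}
    for i in range(9)
]
compact = distill_context(history, system="You are a helpful agent.", keep_recent=4)
```

The summary goes in as a user message at the front, which keeps the conversation starting on a user turn while the latest tool results stay at the bottom of the context.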
max_tokens + partial tool_use block = retry with a higher max_tokens, then append and continue once the tool call is complete. max_tokens + no tool block = surface the truncated result. Always monitor context size, and distill before you reach 75% of capacity.