Agentic loop lifecycle & stop_reason values — Claude Certified Architect – Foundations (Free Preview) — Claude Cert Academy

A common reason agentic loops hang in production isn't a bug in your tool. It's misreading stop_reason.

Every call to the Claude API returns a stop_reason field. In a single-turn application, you probably ignore it — the response is done, move on. In an agentic loop, stop_reason is the control signal your orchestrator must act on correctly every single time. Get it wrong once and your loop either exits too early, runs forever, or corrupts its own context.

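This lesson covers four values: end_turn, tool_use, max_tokens, and stop_sequence. Each one means something different about what the model did and what you need to do next. The most dangerous assumption in agentic development is treating any of them as equivalent to the others. Current versions of the Messages API also define further values, such as pause_turn and refusal, so a production loop should fail loudly on any value it does not recognize rather than silently looping.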

This lesson covers what each value means, how to handle it correctly, and the specific failure mode that catches most developers off guard: receiving max_tokens while the model was mid-way through emitting a tool call.

Subtopic 1.1. Questions present a stopped loop with a specific stop_reason and ask what the orchestrator does next. The max_tokens + partial tool_use combination is the highest-frequency trap on the exam. Know what 'partial tool_use block' means before test day.

The four stop_reason values

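Each stop_reason describes why the model stopped generating, not what it produced. end_turn means the model finished its reply on its own. tool_use means the model is requesting one or more tool calls and expects their results in the next user message. max_tokens means generation was cut off at the max_tokens limit of the request, so the output may be incomplete. stop_sequence means the model emitted one of the custom stop sequences you supplied. Your loop needs to branch on this value before touching anything else in the response.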

import anthropic

client = anthropic.Anthropic()

def run_agentic_loop(messages: list, tools: list,
                     max_tokens: int = 4096, max_tokens_cap: int = 16_384) -> str:
    # execute_tools is assumed to be defined elsewhere: it runs each tool_use
    # block and returns the matching list of tool_result blocks.
    # max_tokens_cap is an illustrative ceiling, not an API constant.
    while True:
        response = client.messages.create(
            model="claude-opus-4-7",
            max_tokens=max_tokens,
            tools=tools,
            messages=messages,
        )
        # First text block, or "" when the reply contains no text.
        text = next(
            (block.text for block in response.content if block.type == "text"),
            "",
        )

        if response.stop_reason in ("end_turn", "stop_sequence"):
            return text

        if response.stop_reason == "tool_use":
            messages.append({"role": "assistant", "content": response.content})
            tool_results = execute_tools(response.content)
            messages.append({"role": "user", "content": tool_results})
            continue

        if response.stop_reason == "max_tokens":
            last_block = response.content[-1] if response.content else None
            truncated_tool_call = (
                last_block is not None and last_block.type == "tool_use"
            )
            if truncated_tool_call and max_tokens < max_tokens_cap:
                # The tool call may be incomplete: do not append or execute it.
                # Re-send the same messages with a larger output budget.
                max_tokens = min(max_tokens * 2, max_tokens_cap)
                continue
            if truncated_tool_call:
                raise RuntimeError("tool call still truncated at max_tokens_cap")
            return text or "[Response truncated]"

        # Fail loudly on unhandled values instead of looping forever.
        raise RuntimeError(f"Unhandled stop_reason: {response.stop_reason!r}")
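
The loop above calls execute_tools without defining it. The following is a minimal sketch of such a helper, not a definitive implementation. It assumes a hypothetical TOOL_REGISTRY dict that maps tool names to plain Python callables, and it accepts any objects exposing type, id, name, and input attributes in place of real SDK content blocks. Each tool_use block becomes a tool_result block with the matching tool_use_id, and failures are reported back through is_error rather than raised.

```python
from types import SimpleNamespace

# Hypothetical registry: tool name -> callable taking keyword arguments.
TOOL_REGISTRY = {
    "add": lambda a, b: a + b,
}

def execute_tools(content, registry=TOOL_REGISTRY):
    """Run every tool_use block in `content` and build tool_result blocks."""
    results = []
    for block in content:
        if getattr(block, "type", None) != "tool_use":
            continue  # skip text and any other block types
        try:
            output = registry[block.name](**block.input)
            result = {
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": str(output),
            }
        except Exception as exc:
            # Report the failure to the model instead of crashing the loop.
            result = {
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": f"{type(exc).__name__}: {exc}",
                "is_error": True,
            }
        results.append(result)
    return results

# Stand-ins for SDK content blocks, for illustration only.
demo_content = [
    SimpleNamespace(type="text", text="Adding the numbers."),
    SimpleNamespace(type="tool_use", id="toolu_01", name="add",
                    input={"a": 2, "b": 3}),
    SimpleNamespace(type="tool_use", id="toolu_02", name="missing", input={}),
]
demo_results = execute_tools(demo_content)
```

Returning one tool_result per tool_use block, including for failed calls, keeps every tool_use_id answered in the next user message.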

The max_tokens + partial tool_use trap

When max_tokens fires, your instinct is to treat the response as complete, since the model stopped. But stop_reason tells you why generation stopped, not whether the output is finished. If the model was in the middle of emitting a tool call when it hit the token limit, the response content can end in a tool_use block whose input is incomplete.

Two mistakes follow from this. Returning the truncated response to the user silently drops the work the model was about to request. Executing the truncated block is worse: the tool runs with missing or malformed arguments, and the result you append misleads every later turn. Anthropic's tool-use documentation describes a third path: do not append or execute the truncated block, and re-send the same request with a higher max_tokens value so the model can emit the complete tool call. The API is stateless, so nothing is left waiting on the model's side. The only cost is regenerating that one response.

Exam scenarios on this topic list 'Retry the same request with a higher max_tokens value' among the options when max_tokens fires with a partial tool_use block. That option matches Anthropic's documented handling. Appending the partial block and continuing looks cheaper, but it commits a tool call the model never finished.
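
Before deciding how to recover, the orchestrator has to detect which situation it is in. The sketch below labels a response from its stop_reason and content. classify_stop and its label strings are hypothetical names, and the rule it applies is a heuristic: a max_tokens response whose last content block is a tool_use block is treated as cut off at a tool call.

```python
from types import SimpleNamespace

def classify_stop(stop_reason, content):
    """Label a response so the orchestrator can branch on it.

    Returns "done", "run_tools", "truncated_tool_call", or
    "truncated_text". Raises ValueError for any other stop_reason.
    """
    if stop_reason in ("end_turn", "stop_sequence"):
        return "done"
    if stop_reason == "tool_use":
        return "run_tools"
    if stop_reason == "max_tokens":
        last = content[-1] if content else None
        if last is not None and getattr(last, "type", None) == "tool_use":
            return "truncated_tool_call"
        return "truncated_text"
    raise ValueError(f"unhandled stop_reason: {stop_reason!r}")

# Stand-ins for SDK content blocks, for illustration only.
text_block = SimpleNamespace(type="text", text="Let me look that up.")
tool_block = SimpleNamespace(type="tool_use", id="toolu_01",
                             name="search", input={})

label = classify_stop("max_tokens", [text_block, tool_block])
```

Raising on unrecognized values means a stop_reason the loop was never written for surfaces as an error instead of an endless loop.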

Production pattern: context limit monitoring

In a real loop, max_tokens is often a symptom of a deeper problem: context accumulation. Each turn appends more content: the previous assistant response, tool results, new user input. The context window has to hold the input and the output together, so after enough turns the prompt crowds out the room left for the response, and requests start truncating or failing even on straightforward tasks.

The right production pattern is to monitor your context size at each turn and trigger a distillation step before you hit the limit. The Anthropic SDK's token counting API lets you check size before sending. Build this check into your loop from day one.

import anthropic

client = anthropic.Anthropic()
CONTEXT_LIMIT = 200_000
DISTILL_THRESHOLD = 0.75

def run_loop_with_monitoring(messages: list, tools: list, system: str) -> str:
    while True:
        token_count = client.messages.count_tokens(
            model="claude-opus-4-7",
            system=system,
            tools=tools,
            messages=messages,
        )

        if token_count.input_tokens > CONTEXT_LIMIT * DISTILL_THRESHOLD:
            messages = distill_context(messages, system)

        response = client.messages.create(
            model="claude-opus-4-7",
            max_tokens=4096,
            system=system,
            tools=tools,
            messages=messages,
        )

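        # First text block, or "" when the reply contains no text.
        text = next(
            (b.text for b in response.content if b.type == "text"), ""
        )

        if response.stop_reason in ("end_turn", "stop_sequence"):
            return text

        if response.stop_reason == "tool_use":
            messages.append({"role": "assistant", "content": response.content})
            tool_results = execute_tools(response.content)
            messages.append({"role": "user", "content": tool_results})
            continue

        if response.stop_reason == "max_tokens":
            last = response.content[-1] if response.content else None
            if last is not None and last.type == "tool_use":
                # A tool call cut off by the token limit may be incomplete.
                raise RuntimeError(
                    "tool call truncated; re-send with a higher max_tokens"
                )
            return text or "[Response truncated]"

        # Fail loudly on unhandled values instead of looping forever.
        raise RuntimeError(f"Unhandled stop_reason: {response.stop_reason!r}")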
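
The threshold check in the loop compares input tokens against a fraction of the window, but the request also has to leave room for the output budget: input tokens plus max_tokens must fit inside the context window. A small sketch of a combined check follows, assuming the same 200,000-token limit and 0.75 threshold as the constants above. needs_distillation is a hypothetical helper name.

```python
def needs_distillation(input_tokens, max_output_tokens,
                       context_limit=200_000, threshold=0.75):
    """Return True when the prompt should be distilled before sending.

    Triggers when the prompt exceeds the threshold fraction of the window,
    or when prompt plus output budget would not fit in the window at all.
    """
    over_threshold = input_tokens > context_limit * threshold
    no_room_for_output = input_tokens + max_output_tokens > context_limit
    return over_threshold or no_room_for_output

# With the defaults, the threshold is crossed just past 150,000 tokens.
small_prompt = needs_distillation(100_000, 4096)
large_prompt = needs_distillation(150_001, 4096)
```

The second condition only matters when the threshold is set close to 1.0 or the output budget is large, but checking it costs nothing.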

The 'lost in the middle' effect is relevant here even before you hit the token limit. Research on long-context retrieval shows that language models use information in the middle of a long context less reliably than information near the beginning or the end. In long loops, anchor your system prompt at the top (it's always there) and keep the most recent tool results at the bottom. Summarize older turns rather than keeping them verbatim, so the details that matter sit where the model uses them best.
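
The monitoring loop calls distill_context without defining it. One possible shape is sketched below, under stated assumptions: messages are plain dicts, the summarize callable is a hypothetical stand-in for a real summarization step (for example a separate model call), and keep_recent is an illustrative parameter. The sketch moves the cut point back until the kept tail starts on a plain user message, so no tool_result is separated from its tool_use.

```python
def _has_tool_result(message):
    """True if a message's content list contains a tool_result block."""
    content = message["content"]
    if isinstance(content, str):
        return False
    return any(
        isinstance(b, dict) and b.get("type") == "tool_result" for b in content
    )

def distill_context(messages, system, keep_recent=4, summarize=None):
    """Replace older turns with one summary; keep recent turns verbatim.

    `system` is unused here; it is kept to match the call in the loop.
    """
    if summarize is None:
        # Placeholder: a real implementation would produce an actual summary.
        summarize = lambda old: f"[Summary of {len(old)} earlier messages]"
    cut = max(len(messages) - max(keep_recent, 1), 0)
    # Back up until the tail begins with a user message that is not a
    # tool_result, so tool_use/tool_result pairs stay together.
    while cut > 0 and (
        messages[cut]["role"] != "user" or _has_tool_result(messages[cut])
    ):
        cut -= 1
    if cut == 0:
        return messages  # nothing can be summarized safely
    old, recent = messages[:cut], messages[cut:]
    content = recent[0]["content"]
    blocks = ([{"type": "text", "text": content}]
              if isinstance(content, str) else list(content))
    # Prepend the summary to the first kept user message.
    summary_block = {"type": "text", "text": summarize(old)}
    merged = {"role": "user", "content": [summary_block] + blocks}
    return [merged] + recent[1:]

history = [
    {"role": "user", "content": "Find the file."},
    {"role": "assistant", "content": [
        {"type": "tool_use", "id": "t1", "name": "ls", "input": {}}]},
    {"role": "user", "content": [
        {"type": "tool_result", "tool_use_id": "t1", "content": "a.txt"}]},
    {"role": "assistant", "content": "Found a.txt."},
    {"role": "user", "content": "Now read it."},
    {"role": "assistant", "content": [
        {"type": "tool_use", "id": "t2", "name": "cat", "input": {}}]},
    {"role": "user", "content": [
        {"type": "tool_result", "tool_use_id": "t2", "content": "hello"}]},
]
distilled = distill_context(history, system="", keep_recent=2)
```

Merging the summary into the first kept user message keeps the message list starting with a user turn and avoids adding an extra turn.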

max_tokens + partial tool_use block = do not execute it; re-send with a higher max_tokens. max_tokens + no tool block = surface the truncated result. Always monitor context size and distill before you reach 75% of capacity.
