Predict before you read

Before you read — what is the most common production failure mode for agentic systems built on chain-of-LLM-calls without explicit state?

Think about partial failures, retries, and what happens when one step in a 10-step chain fails at step 8.

From Tokens to Embodied Minds  ·  Chapter 22 of 36

LangGraph and orchestration that survives

Graph state, checkpointing, and human-in-the-loop

Typed state: LangGraph's TypedDict graph state. Every node reads from and writes to an explicit, versioned state object.
Checkpoint: built-in Postgres/SQLite-backed checkpointing. Enables human-in-the-loop, time travel, and recovery from partial failures.
interrupt(): the primitive that separates irreversible agent actions from reversible ones. No production agent should skip it.

The chain abstraction (a sequence of LLM calls, each passing its output to the next) breaks in production for a predictable reason: it has no state and no recovery semantics. A failure at step 8 of a 10-step chain forces a restart from step 1, re-executing every prior step along with its external side effects: duplicate emails sent, duplicate CRM entries created, duplicate API calls billed. A chain implicitly assumes that every step is idempotent and cheap to retry. Production steps are neither.

LangGraph (LangChain, 2024) replaces the chain with an explicit directed graph over a typed state object. Nodes are ordinary Python functions that read from and write to a shared TypedDict state. Edges are either fixed or conditional; on a conditional edge, a routing function inspects the state a node produced and decides which node runs next. The checkpointer (Postgres-backed, SQLite-backed, or Redis-backed) persists the state after each node completes, enabling resume-from-checkpoint on failure, human-in-the-loop interrupt-and-resume, and time-travel debugging (replay from any prior checkpoint).

For DealLens, a multi-step diligence flow that writes to external systems, this is not an optional architecture upgrade. It is the minimum viable production pattern.
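
The restart cost described above can be made concrete with a little counting. The following is a stdlib-only sketch, not LangGraph code: `make_steps`, `run_chain`, and `run_with_checkpoints` are hypothetical names, and each step call stands in for one external side effect. Step 8 fails once, then succeeds on retry.

```python
# Stdlib-only sketch: count side effects when step 8 of 10 fails once.
# All names here are hypothetical; none of this is LangGraph API.

def make_steps(n=10, fail_at=8):
    """Build n steps; step `fail_at` raises on its first call only."""
    calls = {i: 0 for i in range(1, n + 1)}

    def make(i):
        def step():
            calls[i] += 1  # stands in for an email, CRM write, or billed API call
            if i == fail_at and calls[i] == 1:
                raise RuntimeError(f"step {i} failed")
        return step

    return [make(i) for i in range(1, n + 1)], calls


def run_chain(steps):
    """Stateless chain: any failure restarts from step 1."""
    while True:
        try:
            for step in steps:
                step()
            return
        except RuntimeError:
            continue  # no state was kept, so start over


def run_with_checkpoints(steps):
    """Keep a count of completed steps and retry only the failed one."""
    done = 0  # the "checkpoint": how many steps have completed
    while done < len(steps):
        try:
            steps[done]()
            done += 1
        except RuntimeError:
            continue


chain_steps, chain_calls = make_steps()
run_chain(chain_steps)
ckpt_steps, ckpt_calls = make_steps()
run_with_checkpoints(ckpt_steps)

print(sum(chain_calls.values()))  # 18: steps 1-7 each ran twice
print(sum(ckpt_calls.values()))   # 11: only step 8 ran twice
```

The seven extra calls in the chain run are exactly the duplicate emails and CRM entries the paragraph warns about.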

Graph state and node design

LangGraph state is a TypedDict whose keys can be annotated with reducer functions. Each key has a type and an optional reducer annotation (for example Annotated[list, operator.add]) that specifies how a node's update to that key is combined with the existing value, including the case where several nodes write the same key in one step. A key without a reducer is simply overwritten by the latest update. Nodes are Python functions with signature (state: State) -> dict: they receive the full current state and return a partial update dict. The graph runner merges the partial update into the state using the declared reducers. This explicit state model means you can inspect the full state at any checkpoint, serialize it to JSON (provided its values are serializable), and reason about exactly what each node saw and produced.
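
To show what "merge using the declared reducers" means mechanically, here is a stdlib-only sketch. The `State` schema uses the same `Annotated[list, operator.add]` convention the text describes, but `apply_update` is a hypothetical stand-in for the graph runner's merge step, not LangGraph's implementation.

```python
import operator
from typing import Annotated, TypedDict, get_type_hints


class State(TypedDict):
    deal_id: str                                     # no reducer: last write wins
    retrieved_chunks: Annotated[list, operator.add]  # reducer: concatenate lists


def apply_update(state: dict, update: dict, schema=State) -> dict:
    """Merge a node's partial update into state (hypothetical helper)."""
    hints = get_type_hints(schema, include_extras=True)
    merged = dict(state)
    for key, value in update.items():
        # Annotated types carry their extra arguments in __metadata__.
        metadata = getattr(hints[key], "__metadata__", ())
        if metadata and key in merged:
            merged[key] = metadata[0](merged[key], value)  # reducer(old, new)
        else:
            merged[key] = value                            # plain overwrite
    return merged


state = {"deal_id": "D-1", "retrieved_chunks": ["a"]}
state = apply_update(state, {"retrieved_chunks": ["b", "c"]})  # appended
state = apply_update(state, {"deal_id": "D-2"})                # overwritten
print(state)  # {'deal_id': 'D-2', 'retrieved_chunks': ['a', 'b', 'c']}
```

The node returned only the keys it changed; the merge step decided whether each key accumulates or overwrites.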

Node design principle: each node should do exactly one logical thing and produce a deterministic output for a given input state. A retrieve node retrieves — it does not score. A score node scores — it does not draft. A draft-memo node drafts — it does not send. Sending (the irreversible external write) is a separate node, always preceded by an interrupt. This separation is not over-engineering; it is the architecture that makes time-travel replay safe (you can re-run the draft node without re-sending), and it is what makes each node independently testable with mock state.
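
As an illustration of "independently testable with mock state", the sketch below defines two single-purpose nodes and exercises them with a hand-built state dict. The scoring rule and memo format are toy assumptions invented for the example; only the (state) -> partial update shape comes from the text.

```python
from typing import TypedDict


class State(TypedDict, total=False):
    retrieved_chunks: list
    score: float
    memo_draft: str


def score_node(state: State) -> dict:
    """Scores only. Toy rule (an assumption): share of chunks mentioning revenue."""
    chunks = state["retrieved_chunks"]
    hits = sum("revenue" in chunk.lower() for chunk in chunks)
    return {"score": hits / len(chunks) if chunks else 0.0}


def draft_memo_node(state: State) -> dict:
    """Drafts only. Builds text from state and never sends anything."""
    count = len(state["retrieved_chunks"])
    return {"memo_draft": f"Score {state['score']:.2f} from {count} chunks."}


# Each node is tested in isolation with mock state; no graph, no network.
mock = {"retrieved_chunks": ["Revenue grew 40%", "Team of 12"]}
assert score_node(mock) == {"score": 0.5}
assert score_node(mock) == score_node(mock)  # same input state, same output

mock_scored = {**mock, "score": 0.5}
assert draft_memo_node(mock_scored) == {"memo_draft": "Score 0.50 from 2 chunks."}
```

Because neither node performs the external write, both can be re-run during replay without sending anything.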

Checkpointing and human-in-the-loop

LangGraph's checkpointer persists the full state after every node completion. On failure or restart, graph.invoke(None, config={"configurable": {"thread_id": "run-123"}}) loads the last checkpoint and resumes from the next unexecuted node. For DealLens, this means a failure during deal scoring does not re-run the expensive retrieval node — it resumes from the last persisted state. The Postgres checkpointer is the production choice: it is durable, queryable, and supports concurrent runs. SQLite is adequate for single-user local use.
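
The resume behaviour can be sketched with nothing but `sqlite3` and `json`. This is a minimal illustration of the idea, assuming a linear two-node flow; the table layout and the `save`, `load_latest`, and `run` helpers are hypothetical and are not how LangGraph's checkpointers are implemented.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE checkpoints (thread_id TEXT, step INTEGER, state TEXT)")


def save(thread_id, step, state):
    """Persist the full state after a node completes."""
    conn.execute("INSERT INTO checkpoints VALUES (?, ?, ?)",
                 (thread_id, step, json.dumps(state)))
    conn.commit()


def load_latest(thread_id):
    """Return (completed_steps, state), or (0, None) for a new thread."""
    row = conn.execute(
        "SELECT step, state FROM checkpoints WHERE thread_id = ? "
        "ORDER BY step DESC LIMIT 1", (thread_id,)).fetchone()
    return (row[0], json.loads(row[1])) if row else (0, None)


retrieve_calls = []                # lets us count expensive retrievals
flaky = {"fail_next_score": True}  # makes scoring fail exactly once


def retrieve(state):
    retrieve_calls.append(state["deal_id"])
    return {"retrieved_chunks": ["chunk-1", "chunk-2"]}


def score(state):
    if flaky["fail_next_score"]:
        flaky["fail_next_score"] = False
        raise RuntimeError("scoring service timed out")
    return {"score": 0.8}


NODES = [retrieve, score]


def run(thread_id, initial_state=None):
    """Resume from the last checkpoint for this thread, if any."""
    step, state = load_latest(thread_id)
    if state is None:
        state = dict(initial_state or {})
    for i in range(step, len(NODES)):
        state = {**state, **NODES[i](state)}
        save(thread_id, i + 1, state)
    return state


try:
    run("run-123", {"deal_id": "D-1"})  # fails inside score
except RuntimeError:
    pass
final = run("run-123")                  # resumes at score, skips retrieve
print(len(retrieve_calls), final["score"])  # 1 0.8
```

The second call passes no input, mirroring the `graph.invoke(None, ...)` pattern: the thread identifier alone is enough to find where to continue.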

The interrupt() primitive (added in the later LangGraph 0.2.x releases) is called from inside a node: it pauses execution there and surfaces a payload, typically the relevant slice of state, to a human reviewer via the LangGraph API or the LangGraph Studio UI. It requires a checkpointer, and the run is resumed by invoking the graph with Command(resume=...). On resume the interrupted node re-executes from its start, so no irreversible work should precede the interrupt() call within that node. The human can approve, reject, or modify the state before execution resumes. For DealLens, interrupt-before-send-email is the pattern: the draft-memo node completes and persists its draft to state; execution pauses; the analyst reviews and edits the draft in the UI; the analyst approves; execution resumes and the send-email node fires. This is not a UX nicety; it provides the audit trail that investment process requirements call for.
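
The interrupt-before-send-email pattern can be sketched without any framework. The runner below is a hypothetical, stdlib-only stand-in that pauses before a gated node and resumes only with explicit approval; it illustrates the control flow, not LangGraph's interrupt() implementation.

```python
sent = []  # stands in for the external mail system


def draft_memo(state):
    return {"memo_draft": f"Draft memo for {state['deal_id']}"}


def send_email(state):
    sent.append(state["memo_draft"])  # the irreversible side effect
    return {"sent": True}


NODES = [("draft_memo", draft_memo), ("send_email", send_email)]
INTERRUPT_BEFORE = {"send_email"}  # nodes that need human approval


def run(state, start=0, approved=False):
    """Run nodes from `start`; pause before a gated node unless approved."""
    for i in range(start, len(NODES)):
        name, fn = NODES[i]
        if name in INTERRUPT_BEFORE and not approved:
            # Hand the state back to the caller instead of acting.
            return {"status": "paused", "next": i, "state": state}
        state = {**state, **fn(state)}
        approved = False  # approval is consumed by the first node run
    return {"status": "done", "next": len(NODES), "state": state}


paused = run({"deal_id": "D-1"})
assert paused["status"] == "paused" and sent == []  # nothing sent yet

# The analyst edits the draft, then approves.
edited = {**paused["state"], "memo_draft": "Edited memo for D-1"}
result = run(edited, start=paused["next"], approved=True)
print(result["status"], sent)  # done ['Edited memo for D-1']
```

What was sent is the analyst's edited text, and the paused state is the record of what the reviewer saw before approving.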

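Time-travel debugging: call graph.get_state_history(config) to iterate over every checkpointed state for a run, most recent first. To fork from an earlier point, pass that snapshot's config to graph.update_state(snapshot.config, {"override_key": "value"}, as_node="score"); this writes a new checkpoint containing the corrected values, as if the score node had produced them, and returns a new config. Calling graph.invoke(None, new_config) then re-runs from that point. This is the debugger for production agentic systems, the equivalent of a step-through debugger for deterministic code.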
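
The fork-and-replay idea is easy to see in miniature. In this stdlib-only sketch, `run` is a hypothetical runner that records a snapshot after each node; the fixed 0.3 score and the 0.5 cut-off in the memo are assumptions chosen to make the effect of the override visible.

```python
def retrieve(state):
    return {"retrieved_chunks": ["Revenue grew 40%"]}


def score(state):
    return {"score": 0.3}  # pretend the scorer under-rated the deal


def draft_memo(state):
    verdict = "advance" if state["score"] >= 0.5 else "pass"
    return {"memo_draft": f"Recommendation: {verdict}"}


NODES = [retrieve, score, draft_memo]


def run(state, start=0):
    """Run nodes from index `start`; return one snapshot per completed node."""
    history = []
    for i in range(start, len(NODES)):
        state = {**state, **NODES[i](state)}  # new dict, so snapshots never alias
        history.append({"after": NODES[i].__name__, "next": i + 1, "state": state})
    return history


original = run({"deal_id": "D-1"})

# Pick the checkpoint written after `score`, override the value, and replay.
after_score = next(h for h in original if h["after"] == "score")
forked_state = {**after_score["state"], "score": 0.9}
replay = run(forked_state, start=after_score["next"])

print(original[-1]["state"]["memo_draft"])  # Recommendation: pass
print(replay[-1]["state"]["memo_draft"])    # Recommendation: advance
```

Only the draft node re-ran in the replay; retrieval and scoring were not repeated, and the original history is left intact for comparison.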

LangGraph vs CrewAI vs hand-written asyncio

The three options: LangGraph (explicit typed state, checkpointing, interrupt semantics, production-grade); CrewAI (agent-role abstraction, task delegation, zero checkpointing, good for prototypes with < 5 sequential steps); hand-written asyncio state machine (maximum control, zero magic, requires implementing checkpointing yourself, justified when LangGraph's abstractions add more friction than value). For DealLens — multi-step, external writes, requires audit trail, requires human review — LangGraph is the correct choice. For a quick 3-step retrieval pipeline without external writes, CrewAI is adequate. For a system with unusual parallelism patterns or a team that finds LangGraph's graph API confusing, hand-written asyncio with a Redis state store is a legitimate option.

The common mistake: choosing LangGraph because it is popular, then fighting its state model and adding complexity by working around it. If your flow is acyclic, with no human-in-the-loop step and no external writes, a LangChain chain or a simple async function is sufficient. LangGraph's value is specifically in its loop support (nodes can route back to prior nodes), its interrupt semantics, and its checkpoint persistence. Use those features or use something simpler.

The humanoid task planner as a state machine

The JHU humanoid's high-level planner is structurally identical to DealLens's screening flow: a state machine where nodes execute perception, planning, tool-calling, and action steps, with interrupt points before any irreversible physical action. The state contains the current task goal, the scene understanding, the sequence of planned actions, and the action execution history. Each node reads the state, performs its function (query the LLM, call an MCP tool, issue a robot command), and writes its output back. On a safety violation (joint limit breach, unexpected contact force), the safety-filter node writes an interrupt flag and execution pauses for human review.

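The key difference from DealLens: the humanoid planner runs in a real-time loop at ~1–5 Hz for task planning (not the 50 Hz inner control loop, which is handled by GR00T's DiT or SmolVLA directly). LangGraph's asyncio backend supports this loop execution pattern: the graph can contain a cycle from the action node back to the perception node, enabling continuous replanning. Because LangGraph caps the number of steps per invocation with a recursion_limit (25 by default), a long-running loop needs either a raised limit or one graph invocation per planning cycle. The checkpoint frequency must also be tuned: checkpointing after each planning step adds roughly 10–50 ms of state-serialization overhead, which is acceptable against a planning period of 200 ms to 1 s; checkpointing after every 20 ms control step is not, because the checkpoint could cost more than the step itself.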
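
The checkpoint budget is simple arithmetic, worked through below using only the figures from the text (10–50 ms of overhead, 1–5 Hz planning, 50 Hz control). The function name is hypothetical.

```python
def checkpoint_overhead(step_ms, checkpoint_ms):
    """Checkpoint cost as a fraction of one loop period."""
    return checkpoint_ms / step_ms


loops = [("1 Hz planning", 1000), ("5 Hz planning", 200), ("50 Hz control", 20)]
for label, step_ms in loops:
    low = checkpoint_overhead(step_ms, 10)
    high = checkpoint_overhead(step_ms, 50)
    print(f"{label}: {low:.0%} to {high:.0%} of the step budget")

# 1 Hz planning: 1% to 5% of the step budget
# 5 Hz planning: 5% to 25% of the step budget
# 50 Hz control: 50% to 250% of the step budget
```

At planning rates the cost is a small fraction of the period; at the control rate a single checkpoint can take longer than the step it is recording.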

No tool use should be irreversible without an interrupt

This is not a LangGraph feature — it is an architectural law for production agentic systems. If your agent can send an email, post to Slack, or write to a database without a human-confirmable interrupt point, you will eventually send the wrong thing to the wrong person at the wrong time.

[Figure: DealLens LangGraph state machine. START (deal_id → state) → retrieve (→ retrieved_chunks) → score (→ score) → draft-memo (→ memo_draft) → interrupt() (human review) → send-email (irreversible action); a reject branch leaves score when score < threshold.]
Figure 22.1: DealLens deal-screening state machine: retrieve → score → conditional edge (reject if score below threshold, draft-memo otherwise) → interrupt() for human review → send-email. Each node writes to typed state; checkpointer persists after each node.
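
The conditional edge in the figure amounts to a function from state to the name of the next node. The sketch below encodes the figure's edges in a plain dict and walks them; the threshold value of 0.5 is an assumption (the chapter does not give one), and `EDGES` and `walk` are hypothetical helpers rather than LangGraph API.

```python
THRESHOLD = 0.5  # assumed value; the chapter does not specify the threshold


def route_after_score(state):
    """Conditional edge: reject if the score is below the threshold."""
    return "draft_memo" if state["score"] >= THRESHOLD else "reject"


# Each entry maps a node to a function that names the next node.
EDGES = {
    "retrieve": lambda state: "score",
    "score": route_after_score,
    "draft_memo": lambda state: "human_review",  # the interrupt point
    "human_review": lambda state: "send_email",
    "send_email": lambda state: "END",
    "reject": lambda state: "END",
}


def walk(state, start="retrieve"):
    """Follow edges for a fixed state and list the nodes visited."""
    node, visited = start, []
    while node != "END":
        visited.append(node)
        node = EDGES[node](state)
    return visited


print(walk({"score": 0.8}))
# ['retrieve', 'score', 'draft_memo', 'human_review', 'send_email']
print(walk({"score": 0.2}))
# ['retrieve', 'score', 'reject']
```

A low-scoring deal never reaches the drafting or sending nodes, so the irreversible action is only reachable through the human review step.
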
Retrieve before you continue

Three questions on what you just read

Q1 (Factual): What does LangGraph's checkpointer do and what are the production storage options?
Q2 (Conceptual): Why is the chain abstraction (linear sequence of LLM calls) insufficient for production agentic systems that write to external services?
Q3 (Synthetic): Design the LangGraph state machine for DealLens: what are the nodes, edges, interrupt points, and what state fields does the TypedDict contain?