Senior Architect Interview Series

LangGraph & Agentic AI
Complete Interview Prep Guide

10 chapters · From ReAct patterns to production agents · 2026 Edition

📅 March 28, 2026 · ⏱ ~90 min read · 🎯 Senior Engineer / Architect level
Chapter 6 of 10 · Memory: In-Context, Session & Long-Term

LangGraph Chapter 6 — Memory: In-Context, Session & Long-Term


6.0 What This Chapter Covers

Memory is what separates a truly useful agent from a stateless answering machine. This chapter covers:

  1. The four levels of agent memory with specific examples
  2. How your project implements each layer
  3. In-context memory: AgentState.messages and the context window
  4. Session memory: PostgreSQL ChatHistory via memory.py
  5. Long-term memory: cross-session storage and retrieval
  6. Semantic memory: RAG as a form of external agent memory
  7. LangGraph's built-in checkpointing system

6.1 The Four Levels of Agent Memory

┌─────────────────────────────────────────────────────────────────┐
│                       MEMORY HIERARCHY                          │
│                                                                 │
│  L1 — IN-CONTEXT     AgentState.messages (current execution)   │
│  ├─ Lifespan:        One graph invocation                       │
│  ├─ Capacity:        LLM context window (~128K tokens)          │
│  └─ Location:        RAM, in the AgentState dict                │
│                                                                 │
│  L2 — SESSION        ChatHistory table (PostgreSQL)            │
│  ├─ Lifespan:        One chat session (hours/days)              │
│  ├─ Capacity:        Unlimited (last N messages loaded)         │
│  └─ Location:        Your relational database                   │
│                                                                 │
│  L3 — LONG-TERM      Persistent user preferences/facts          │
│  ├─ Lifespan:        Cross-session, potentially forever         │
│  ├─ Capacity:        Unlimited                                  │
│  └─ Location:        Database with semantic search              │
│                                                                 │
│  L4 — SEMANTIC       ChromaDB vector store (RAG)               │
│  ├─ Lifespan:        Persistent, updated via ingestion          │
│  ├─ Capacity:        Scales to millions of documents            │
│  └─ Location:        chroma_db/ (your project)                  │
└─────────────────────────────────────────────────────────────────┘

6.2 L1 — In-Context Memory: AgentState.messages

In-context memory is the messages currently loaded in AgentState. This is what the LLM "sees" in its context window during one graph invocation.

What Goes In Here

# From run_agent() in agent/agent.py
history = load_history(session_id, db)      # L2 → L1: load from DB
history.append(HumanMessage(content=question))  # add this turn's question

initial_state = {"messages": history}
final_state   = agent.invoke(initial_state)  # L1 exists for this call
# After return: L1 is discarded

During the invocation:

  • The full history from the database is loaded into AgentState.messages
  • Tool results get appended (ToolMessages)
  • LLM responses get appended (AIMessages)
  • At the end, the last message is the answer

After return: The AgentState object ceases to exist — so does L1 memory. The answer is saved back to the database (L2).

Context Window Limits

Modern LLMs support large contexts (~128K tokens for gpt-4o-mini) but:

  • More tokens = more cost (charged per token)
  • More tokens = slower inference (attention is quadratic in tokens)
  • Quality can degrade with very long contexts ("lost in the middle" problem)
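To make the cost bullet concrete, here is a rough back-of-the-envelope sketch. It assumes every turn is about the same size and that the whole retained history is re-sent as input on each request; the function names and the price used in the usage note are placeholders for illustration, not published rates.

```python
def input_cost_usd(history_tokens: int, question_tokens: int,
                   price_per_million_usd: float) -> float:
    """Input-side cost of one request: the loaded history is billed again each turn."""
    return (history_tokens + question_tokens) * price_per_million_usd / 1_000_000


def conversation_input_tokens(turns: int, tokens_per_turn: int) -> int:
    """Total input tokens over a conversation with no windowing.

    Turn k sends the k-1 earlier turns plus the new one, so the total is
    tokens_per_turn * (1 + 2 + ... + turns), which grows quadratically.
    """
    return tokens_per_turn * turns * (turns + 1) // 2


def windowed_input_tokens(turns: int, tokens_per_turn: int, max_turns: int) -> int:
    """Same total when at most max_turns earlier turns are re-sent each time."""
    return sum(
        tokens_per_turn * (min(k - 1, max_turns) + 1)
        for k in range(1, turns + 1)
    )
```

Under these assumptions, a 100-turn conversation at 200 tokens per turn sends 1,010,000 input tokens in total without windowing and 209,000 with a 10-turn window. Multiply by your provider's current input price (for example via input_cost_usd) to turn that into a dollar figure.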

Production strategy: Load only the last N turns from the database:

def load_history(session_id: str, db: Session, max_turns: int = 10) -> list[BaseMessage]:
    rows = (
        db.query(ChatHistory)
        .filter(ChatHistory.session_id == session_id)
        .order_by(ChatHistory.id.desc())  # newest first
        .limit(max_turns)                 # one row per turn (user + assistant pair)
        .all()
    )
    rows.reverse()   # back to chronological order
    messages = []
    for row in rows:
        messages.append(HumanMessage(content=row.user_message))
        messages.append(AIMessage(content=row.assistant_message))
    return messages

6.3 L2 — Session Memory: Your PostgreSQL Implementation

Session memory persists the conversation across calls within the same session. Your project implements this with a ChatHistory SQLAlchemy model stored in PostgreSQL.

The Model (models.py)

# Typical ChatHistory model in the project
class ChatHistory(Base):
    __tablename__ = "chat_history"
    
    id         = Column(Integer, primary_key=True, index=True)
    session_id = Column(String, index=True)
    user_message      = Column(Text)
    assistant_message = Column(Text)
    created_at = Column(DateTime, default=datetime.utcnow)

Loading History (memory.py)

def load_history(session_id: str, db: Session) -> list[BaseMessage]:
    rows = (
        db.query(ChatHistory)
          .filter(ChatHistory.session_id == session_id)
          .order_by(ChatHistory.id)   # chronological order
          .all()
    )
    messages = []
    for row in rows:
        messages.append(HumanMessage(content=row.user_message))
        messages.append(AIMessage(content=row.assistant_message))
    return messages

This converts database rows into LangChain BaseMessage objects that LangGraph can include in AgentState.messages.

Saving History (memory.py)

def save_history(
    session_id: str,
    question: str,
    answer: str,
    db: Session
) -> None:
    row = ChatHistory(
        session_id=session_id,
        user_message=question,
        assistant_message=answer
    )
    db.add(row)
    db.commit()

After the agent returns, run_agent() calls save_history() to persist the new turn:

# From run_agent():
final_state = agent.invoke({"messages": history})
answer      = final_state["messages"][-1].content
save_history(session_id, question, answer, db)  # persist to L2
return answer

Session Lifecycle

First Request (new session_id):
  load_history() → [] (empty)
  append HumanMessage("hello")
  agent.invoke → AIMessage("Hi there!")
  save_history("hello", "Hi there!")
  DB: [{session_id, "hello", "Hi there!"}]

Second Request (same session_id):
  load_history() → [HumanMessage("hello"), AIMessage("Hi there!")]
  append HumanMessage("who are you?")
  agent.invoke → sees full history → contextual reply
  save_history("who are you?", "I'm the Agent Factory assistant...")
  DB: [{...}, {session_id, "who are you?", "I'm..."}]
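
The lifecycle above can be reproduced end to end with a small stand-in. This is a sketch under stated assumptions, not the project's code: it uses the standard-library sqlite3 module instead of PostgreSQL and SQLAlchemy, plain (role, content) tuples instead of LangChain messages, and a hypothetical fake_agent in place of agent.invoke.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """In-memory stand-in for the chat_history table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE chat_history ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " session_id TEXT, user_message TEXT, assistant_message TEXT)"
    )
    return conn


def load_history(conn, session_id, max_turns=10):
    """L2 -> L1: newest rows first, capped, then back to chronological order."""
    rows = conn.execute(
        "SELECT user_message, assistant_message FROM chat_history"
        " WHERE session_id = ? ORDER BY id DESC LIMIT ?",
        (session_id, max_turns),
    ).fetchall()
    rows.reverse()
    messages = []
    for user_msg, assistant_msg in rows:
        messages.append(("human", user_msg))
        messages.append(("ai", assistant_msg))
    return messages


def save_history(conn, session_id, question, answer):
    """L1 -> L2: persist one completed turn as one row."""
    conn.execute(
        "INSERT INTO chat_history (session_id, user_message, assistant_message)"
        " VALUES (?, ?, ?)",
        (session_id, question, answer),
    )
    conn.commit()


def fake_agent(messages):
    """Hypothetical stand-in for agent.invoke: reports how much it can see."""
    return f"I can see {len(messages)} message(s)."


def run_turn(conn, session_id, question):
    history = load_history(conn, session_id)   # L1 is built here
    history.append(("human", question))
    answer = fake_agent(history)               # L1 exists only for this call
    save_history(conn, session_id, question, answer)
    return answer
```

Calling run_turn twice with the same session_id shows the second call seeing three messages (the first turn plus the new question), while a different session_id starts from an empty history.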

6.4 L3 — Long-Term Memory: Cross-Session Persistence

Long-term memory stores facts about the user that persist indefinitely across sessions — their preferences, past decisions, important context.

Design Pattern — User Profile Store

class UserMemory(Base):
    __tablename__ = "user_memory"
    
    id         = Column(Integer, primary_key=True)
    user_id    = Column(String, index=True)
    key        = Column(String)      # e.g., "preferred_format", "team", "last_project"
    value      = Column(Text)
    updated_at = Column(DateTime, default=datetime.utcnow, onupdate=datetime.utcnow)

# Usage:
def get_user_context(user_id: str, db: Session) -> str:
    memories = db.query(UserMemory).filter(UserMemory.user_id == user_id).all()
    if not memories:
        return ""
    return "\n".join(f"- {m.key}: {m.value}" for m in memories)

Injecting Long-Term Memory Into AgentState

# In run_agent():
user_context = get_user_context(user_id, db)
if user_context:
    # Add as SystemMessage at the start of context
    history.insert(0, SystemMessage(content=f"User context:\n{user_context}"))

Auto-Extracting Memories

A memory extractor node can automatically save new facts:

def extract_and_save_memories(state: AgentState, user_id: str, db: Session) -> None:
    """After each turn, check if any new user facts were revealed."""
    extraction_prompt = f"""Extract key facts about the user from this conversation.
    Only extract concrete facts: preferences, team, role, projects, constraints.
    Return as JSON: {{"key": "value"}} pairs, or empty dict if nothing new.
    
    Conversation: {format_messages(state["messages"])}"""
    
    response = llm.invoke([HumanMessage(content=extraction_prompt)])
    try:
        facts = json.loads(response.content)
    except json.JSONDecodeError:
        return  # skip if LLM didn't return valid JSON
    if not isinstance(facts, dict):
        return  # valid JSON but not an object (e.g. a list), nothing to store
    for key, value in facts.items():
        # The value column is Text, so store non-string values as strings
        upsert_user_memory(user_id, key, str(value), db)
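
The upsert_user_memory helper called above is not shown in this chapter. One possible shape is sketched below, using the standard-library sqlite3 module as a stand-in for the project's SQLAlchemy session. It assumes a unique constraint on (user_id, key), which the UserMemory model above does not declare; without one, repeated facts would pile up as duplicate rows. The create_user_memory_table helper is hypothetical.

```python
import sqlite3


def create_user_memory_table(db: sqlite3.Connection) -> None:
    """Stand-in schema; UNIQUE (user_id, key) is an added assumption."""
    db.execute(
        "CREATE TABLE IF NOT EXISTS user_memory ("
        " id INTEGER PRIMARY KEY,"
        " user_id TEXT NOT NULL,"
        " key TEXT NOT NULL,"
        " value TEXT,"
        " updated_at TEXT DEFAULT CURRENT_TIMESTAMP,"
        " UNIQUE (user_id, key))"
    )


def upsert_user_memory(user_id: str, key: str, value: str,
                       db: sqlite3.Connection) -> None:
    """Insert a fact, or overwrite the existing value for the same user and key."""
    db.execute(
        "INSERT INTO user_memory (user_id, key, value) VALUES (?, ?, ?)"
        " ON CONFLICT (user_id, key) DO UPDATE SET"
        " value = excluded.value, updated_at = CURRENT_TIMESTAMP",
        (user_id, key, str(value)),
    )
    db.commit()


def get_user_context(user_id: str, db: sqlite3.Connection) -> str:
    """Format stored facts as bullet lines, mirroring the helper in this section."""
    rows = db.execute(
        "SELECT key, value FROM user_memory WHERE user_id = ? ORDER BY key",
        (user_id,),
    ).fetchall()
    return "\n".join(f"- {k}: {v}" for k, v in rows)
```

Overwriting on conflict means the newest extracted value wins. That is a design choice: if the extractor is noisy, keeping a history of values or requiring confirmation before overwriting may be safer.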

6.5 L4 — Semantic Memory: ChromaDB / RAG

Your ChromaDB vector store at chroma_db/ is a form of external semantic memory for the agent — it stores domain knowledge that the agent can retrieve on demand.

# From rag/retrieve.py
def retrieve(query: str) -> list[str]:
    """Retrieve relevant chunks from ChromaDB."""
    collection = get_collection()
    results = collection.query(
        query_texts=[query],
        n_results=5,
        include=["documents", "distances"]
    )
    return results["documents"][0]   # list of relevant text chunks

def build_prompt(query: str, results: list[str]) -> str:
    """Format retrieved chunks as context for the LLM."""
    context = "\n---\n".join(results)
    return f"""Context from knowledge base:
{context}

Question: {query}"""

This is called by the rag_search tool:

@tool
def rag_search(query: str) -> str:
    """Search the knowledge base for Agent Factory information."""
    results = retrieve(query)
    return build_prompt(query, results)

The mental model: ChromaDB is the agent's long-term semantic memory. When the agent needs to know something about Agent Factory, it "recalls" related passages by similarity search, loosely analogous to how people recall relevant knowledge rather than looking it up by exact key.
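
To illustrate what "recall by similarity" means mechanically, here is a toy sketch in pure Python. It is not how ChromaDB works internally: the embed function is a hypothetical word-count stand-in for a real embedding model, so it only captures word overlap, whereas real embeddings also match paraphrases. The ranking step (score every document against the query, return the top n) is the part that carries over.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (stand-in for a real model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def recall(query: str, documents: list[str], n_results: int = 2) -> list[str]:
    """Return the n_results documents most similar to the query."""
    query_vec = embed(query)
    ranked = sorted(
        documents,
        key=lambda doc: cosine(query_vec, embed(doc)),
        reverse=True,
    )
    return ranked[:n_results]
```

For example, given a few short documents, recall("how is billing charged", docs, n_results=1) returns the one that shares the most query words, with no exact-key lookup involved.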


6.6 LangGraph's Built-In Checkpointing

LangGraph has a built-in memory system via checkpointers — they automatically save the graph state at each step, enabling:

  • Resumable conversations without database code
  • Human-in-the-loop interruptions (Chapter 8)
  • Fault tolerance (resume after crash)

Setup

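import sqlite3

from psycopg import Connection
from psycopg.rows import dict_row
from langgraph.checkpoint.sqlite import SqliteSaver
from langgraph.checkpoint.postgres import PostgresSaver

# SQLite (development)
# In recent releases from_conn_string() is a context manager, so a long-lived
# saver is built from an explicit connection instead.
checkpointer = SqliteSaver(sqlite3.connect("checkpoints.db", check_same_thread=False))

# PostgreSQL (production)
conn = Connection.connect(DATABASE_URL, autocommit=True, prepare_threshold=0, row_factory=dict_row)
checkpointer = PostgresSaver(conn)
checkpointer.setup()  # creates the checkpoint tables on first run

# Attach to graph at compile time
agent = graph.compile(checkpointer=checkpointer)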

Using Checkpointed Memory

# Each thread_id is a separate conversation
config = {"configurable": {"thread_id": session_id}}

# First call — starts a new thread
result1 = agent.invoke(
    {"messages": [HumanMessage("hello")]},
    config=config
)

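# Second call: same thread_id, so LangGraph restores the saved state first
result2 = agent.invoke(
    {"messages": [HumanMessage("who are you?")]},
    config=config
)
# No manual load_history() needed: the checkpointer restores the thread's state,
# and (assuming `messages` uses the add_messages reducer) the new input is appended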
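
A toy model can make the thread_id behaviour concrete. The class below is not LangGraph's implementation, and ToyCheckpointedAgent is a hypothetical name. It only mimics the visible contract: state is stored per thread, restored on the next call with the same thread_id, and new messages are appended rather than replacing what was saved.

```python
from typing import Callable

Message = tuple[str, str]  # (role, content), a stand-in for LangChain messages


class ToyCheckpointedAgent:
    """Conceptual model of invoke() with a checkpointer attached."""

    def __init__(self, respond: Callable[[list[Message]], str]):
        self._respond = respond
        self._checkpoints: dict[str, list[Message]] = {}

    def invoke(self, new_messages: list[Message], thread_id: str) -> list[Message]:
        # Restore saved state for this thread (empty for a new thread)
        state = list(self._checkpoints.get(thread_id, []))
        # Append instead of overwrite: the job a reducer does for messages
        state.extend(new_messages)
        state.append(("ai", self._respond(state)))
        # Save the updated state under the same thread_id
        self._checkpoints[thread_id] = state
        return state
```

The real checkpointer saves the whole graph state (not just messages) and does so at each step of execution rather than once per call, which is what enables resuming after an interruption or crash.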

Checkpointer vs. Your Custom memory.py

Feature            LangGraph Checkpointer    Your memory.py
-----------------  ------------------------  --------------------------
Code required      Minimal                   Custom load/save functions
Storage format     Full serialized state     Human/AI message pairs
Works with HITL    Built-in                  Manual implementation
Portability        LangGraph-specific        Portable to any system
Customization      Limited                   Full control
Schema migrations  Auto                      Manual

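For most production use cases, switching to the PostgresSaver checkpointer removes the need for the custom load/save code in memory.py. The trade-off is that the checkpointer persists the full state for each thread, so history windowing (section 6.8) and a UI-friendly chat_history table still have to be handled separately.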


6.7 Migrating memory.py to LangGraph Checkpointer

Here's how to upgrade your project to use the built-in checkpointer:

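# config.py: add checkpointer setup
from psycopg import Connection
from psycopg.rows import dict_row
from langgraph.checkpoint.postgres import PostgresSaver
from database import DATABASE_URL

# from_conn_string() is a context manager in recent releases, so a long-lived
# module-level saver is built from an explicit connection instead.
# For concurrent servers, a psycopg_pool.ConnectionPool can be passed in its place.
conn = Connection.connect(DATABASE_URL, autocommit=True, prepare_threshold=0, row_factory=dict_row)
checkpointer = PostgresSaver(conn)
checkpointer.setup()  # creates checkpoint tables

# agent/agent.py: compile with checkpointer
from config import checkpointer

agent = graph.compile(checkpointer=checkpointer)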

# main.py / run_agent() — simpler memory management
def run_agent(question: str, session_id: str) -> str:
    # No need for load_history() or save_history()!
    config = {"configurable": {"thread_id": session_id}}
    
    # Guardrails still apply
    if not check_input(question):
        return "I can't assist with that request."
    
    # Checkpointer auto-loads and saves conversation history
    result = agent.invoke(
        {"messages": [HumanMessage(content=question)]},
        config=config
    )
    return result["messages"][-1].content

6.8 Memory Windowing — Managing Context Costs

For long conversations, you need to limit context window usage:

Strategy 1 — Load Last N Turns

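def load_history(session_id: str, db: Session, max_turns: int = 10):
    query = db.query(ChatHistory).filter(ChatHistory.session_id == session_id)
    # Newest rows first; each row already holds one full turn (user + assistant)
    rows = query.order_by(ChatHistory.id.desc()).limit(max_turns).all()
    rows.reverse()  # back to chronological order
    # ...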

Strategy 2 — Summarize Old Turns

def summarize_history(messages: list[BaseMessage]) -> list[BaseMessage]:
    """Compress old messages into a summary SystemMessage."""
    if len(messages) <= 10:
        return messages  # no compression needed
    
    old_messages = messages[:-10]    # messages to summarize
    recent_messages = messages[-10:]  # keep last 10 verbatim
    
    summary_prompt = f"""Summarize this conversation:\n{format_messages(old_messages)}"""
    summary = llm.invoke([HumanMessage(content=summary_prompt)]).content
    
    return [
        SystemMessage(content=f"Previous conversation summary: {summary}"),
        *recent_messages
    ]
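
As written, summarize_history re-summarizes all older messages on every request. One way to avoid that is to persist a running summary and fold in only the messages that have newly aged out of the recent window. The sketch below is an assumption-laden illustration: update_running_summary and its parameters are hypothetical names, messages are plain strings, the summarize callable stands in for an LLM call, and where the summary and counter are stored (for example, a column per session) is left to the caller.

```python
from typing import Callable


def update_running_summary(
    stored_summary: str,
    summarized_count: int,
    messages: list[str],
    summarize: Callable[[str, list[str]], str],
    keep_recent: int = 10,
) -> tuple[str, int, list[str]]:
    """Fold only newly aged-out messages into the stored summary.

    messages is the full chronological history; summarized_count is how many
    of its leading messages the stored summary already covers.
    Returns (summary, summarized_count, recent_messages).
    """
    # Everything before cutoff should be covered by the summary
    cutoff = max(len(messages) - keep_recent, 0)
    if cutoff > summarized_count:
        newly_old = messages[summarized_count:cutoff]
        stored_summary = summarize(stored_summary, newly_old)
        summarized_count = cutoff
    # Recent messages are passed through verbatim
    return stored_summary, summarized_count, messages[cutoff:]
```

With this shape the summarizer runs only when new messages cross the window boundary, and each call sees just the previous summary plus the few newly aged-out messages instead of the whole backlog.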

Strategy 3 — Token Budget

import tiktoken

def trim_to_token_budget(messages: list[BaseMessage], budget: int = 8000):
    """Keep the most recent messages within a token budget."""
    enc = tiktoken.encoding_for_model("gpt-4o-mini")
    total_tokens = 0
    kept = []
    
    for msg in reversed(messages):
        tokens = len(enc.encode(str(msg.content)))
        if total_tokens + tokens > budget:
            break
        kept.insert(0, msg)
        total_tokens += tokens
    
    return kept
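
One interaction worth noting: because trim_to_token_budget keeps only the most recent messages, a leading SystemMessage (such as the user-context message from 6.4 or the summary from Strategy 2) is the first thing dropped. A variant that reserves budget for it is sketched below. The names are hypothetical, messages are (role, content) tuples rather than LangChain objects, and the default whitespace counter is a crude stand-in for a real tokenizer such as tiktoken.

```python
from typing import Callable

Message = tuple[str, str]  # (role, content)


def trim_keep_system(
    messages: list[Message],
    budget: int,
    count_tokens: Callable[[str], int] = lambda text: len(text.split()),
) -> list[Message]:
    """Keep a leading system message, then as many recent messages as fit."""
    # Pin the first message only if it is a system message
    system = [m for m in messages[:1] if m[0] == "system"]
    rest = messages[len(system):]
    remaining = budget - sum(count_tokens(m[1]) for m in system)

    kept: list[Message] = []
    for msg in reversed(rest):        # walk from newest to oldest
        cost = count_tokens(msg[1])
        if cost > remaining:
            break                     # stop at the first message that does not fit
        kept.append(msg)
        remaining -= cost
    kept.reverse()                    # restore chronological order
    return system + kept
```

Appending and reversing once at the end also avoids the repeated list.insert(0, ...) calls, which matters little at chat-history sizes but keeps the loop linear.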

6.9 Interview Q&A

Q: Explain the memory architecture in your LangGraph agent. What are the different layers?

Our agent's memory design has four layers, three of which are implemented today. L1, in-context memory, is the AgentState.messages list during one graph invocation: the loaded history plus tool messages plus LLM responses. It exists only for the duration of the invocation. L2, session memory, is the ChatHistory table in PostgreSQL, accessed via load_history() and save_history() in memory.py. Messages from prior turns in the same session are loaded into L1 at the start of each invocation. L3 would be long-term cross-session memory (user preferences, etc.); it is not yet implemented but would use a similar DB table queried by user_id. L4, semantic memory, is ChromaDB, the vector store holding Agent Factory documentation, retrieved on demand via the rag_search tool.


Q: Why did you implement custom memory.py instead of using LangGraph's built-in checkpointer?

The custom memory.py gives us a clean human-readable chat_history table that's easy to query, audit, and display in a chat UI. The stored pairs (user_message, assistant_message) are directly usable by the REST API without deserialization. LangGraph's checkpointer stores the complete serialized state, which is harder to inspect or integrate with non-LangGraph clients. For future iterations, PostgresSaver would replace the custom code — but the custom approach gives us full schema control right now.


Q: What happens if the session has 1000 turns? How do you handle context window limits?

We'd implement windowing in load_history. The simplest approach is loading only the last N turns (e.g., 10). A more sophisticated approach is summarization: compress older turns into a SystemMessage summary, keep recent turns verbatim. Token counting via tiktoken gives us the most precise control. In production, we'd also store a running summary in the database so we don't need to re-summarize on every request.


Q: How does RAG relate to agent memory?

RAG is the agent's semantic long-term memory: external knowledge that the agent retrieves by similarity rather than exact lookup. Unlike session memory (a time-ordered conversation), semantic memory allows cross-document retrieval, so the agent can pull relevant passages from thousands of documents with a single query. In our system, ChromaDB stores Agent Factory documentation. When the agent needs domain knowledge, it calls rag_search, which is conceptually the same as "recalling" relevant information from semantic memory. The difference between RAG and an LLM's parametric memory (training knowledge) is that RAG is dynamic: you can add new documents without retraining.


Q: How would you implement cross-session long-term memory?

I'd add a UserMemory table with user_id, key, and value columns. After each agent turn, a memory extraction step would call the LLM to identify new facts about the user (role, preferences, project context). These facts would be upserted into UserMemory. At the start of each session, get_user_context(user_id) would query these facts and inject them as a SystemMessage into the initial context. An optional embedding-based retrieval layer would scale this to thousands of memory items per user by finding the most relevant ones for each new question.


6.10 Key One-Liners to Memorize

"Four memory layers: in-context (RAM), session (DB), long-term (persistent), semantic (vector store)."

"load_history() → L2 → L1 at start; save_history() → L1 → L2 at end."

"The LLM sees AgentState.messages — put relevant history there before invoking."

"Window the history: load last N turns to control cost and latency."

"LangGraph checkpointer = auto-save state at every step of graph execution, keyed by thread_id."

"RAG is the agent's semantic long-term memory — retrieve by meaning, not by lookup."

Next: Chapter 7 — Multi-Agent: Supervisor Pattern & Handoffs
