LangGraph + DSPy + GEPA: Agentic Researcher with multi-stage prompt optimization
Research is iterative, collaborative, and quality-driven. The best research workflows involve multiple passes: initial investigation, gap analysis, deeper dives, synthesis, review, and revision. Today’s AI agents can replicate this process, but most implementations treat prompts as static instructions rather than optimizable components.
In this post, I’ll walk through a production-grade agentic research system that combines three powerful frameworks:
- LangGraph for workflow orchestration and parallelism
- DSPy for structured prompt engineering
- GEPA (Genetic-Pareto) for automatic prompt optimization
The result is a system that can research a topic, write a comprehensive report with proper citations, and continuously improve its prompts based on output quality metrics. You can find the entire code sample on GitHub. Let’s dive into how it works.
Architecture Overview
The system implements a multi-agent research pipeline with the following key characteristics:
Research Infrastructure:
- Exa API for semantic web search and full-text retrieval (no custom scraping needed)
- Gemini models with flexible configuration (Flash by default, Pro optional)
- Temperature-tuned instances for different cognitive loads
- Global citation registry for consistent numbering across sections
Agent Workflow:
- Query planning (generate diverse search queries per section)
- Parallel web search and content retrieval
- Summarization and gap analysis
- Optional iterative research (if gaps detected and MAX_ROUNDS > 1)
- Section writing with citations
- Assembly and quality review
- Revision based on feedback
Optimization Layer:
- Module-specific GEPA optimization for each agent role
- Heuristic evaluation metrics
- Lightweight, non-LLM-based quality signals
Why This Combination?
Before diving into the code, let’s understand why this specific stack:
LangGraph provides the orchestration layer with:
- Built-in support for parallel execution (Fan-out/Fan-in patterns)
- Conditional routing based on state
- Clean separation of concerns across nodes
- Type-safe state management
DSPy transforms prompts from strings to signatures:
- Structured input/output fields
- Composable modules (ChainOfThought, Predict)
- Context management for model switching
- Makes prompts first-class objects that can be optimized
GEPA optimizes prompts automatically:
- Uses gradient-free optimization
- Works with custom metric functions
- Fast convergence (few iterations needed)
- No need for labeled training data
Configuration and Setup
The system starts with a flexible configuration that allows environment-based customization:
MAX_ROUNDS = int(os.environ.get("RR_MAX_ROUNDS", "1")) # writer<->research loop rounds
SEARCH_RESULTS_PER_QUERY = int(os.environ.get("RR_SEARCH_K", "6")) # per query
MAX_CONTENT_CHARS_PER_SOURCE = int(os.environ.get("RR_MAX_CHARS", "12000"))
WRITER_MODEL = os.environ.get("GEMINI_WRITER_MODEL", "gemini/gemini-flash-latest")
RESEARCH_MODEL = os.environ.get("GEMINI_RESEARCH_MODEL", "gemini/gemini-flash-latest")
REFLECTION_MODEL = os.environ.get("GEMINI_REFLECTION_MODEL", WRITER_MODEL)
Key design decision: Model flexibility with fallbacks. The system defaults to Gemini Flash for both writing and research, but allows environment variable configuration for different models. This provides a balance of speed and quality while allowing customization based on specific needs.
The system initializes DSPy language models with automatic fallback:
FALLBACK_WRITER = "gemini/gemini-flash-latest"
FALLBACK_RESEARCH = "gemini/gemini-flash-latest"
def _make_lm(model_name: str, api_key: str, temperature: float = 0.3,
model_type: str = "chat", max_tokens: int = 65536):
"""Create a DSPy LM via LiteLLM provider strings."""
try:
return dspy.LM(model_name, api_key=api_key, temperature=temperature,
model_type=model_type, max_tokens=max_tokens)
except Exception:
# Graceful fallback if specified model fails
if "pro" in model_name:
return dspy.LM(FALLBACK_WRITER, api_key=api_key, temperature=temperature,
model_type=model_type, max_tokens=max_tokens)
return dspy.LM(FALLBACK_RESEARCH, api_key=api_key, temperature=temperature,
model_type=model_type, max_tokens=max_tokens)
The three language model instances are configured with different temperatures:
WRITER_LM = _make_lm(WRITER_MODEL, GEMINI_API_KEY, temperature=0.2) # Lower temp for consistent writing
RESEARCH_LM = _make_lm(RESEARCH_MODEL, GEMINI_API_KEY, temperature=0.4) # Moderate temp for summarization
REFLECT_LM = _make_lm(REFLECTION_MODEL, GEMINI_API_KEY, temperature=0.8) # Higher temp for creative review
DSPy Signatures: Structured Prompts
Instead of raw prompt strings, DSPy uses Signatures that define clear input/output contracts. Here are the key signatures:
Query Generation
class QueryGenSig(dspy.Signature):
"""Produce 4–8 diverse Exa search queries for a section
(use quoted phrases, site:, intitle:, date ranges).
Return a JSON list of strings."""
section_title = dspy.InputField()
section_instructions = dspy.InputField()
queries_json = dspy.OutputField()
This signature instructs the model to generate diverse search queries using advanced operators like site:, intitle:, and quoted phrases. The output is structured JSON, making it easy to parse and validate.
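The safe_json_loads helper that parses these JSON outputs is used throughout the post but never shown. A minimal version (my assumption of its behavior: tolerate code fences, fall back to a default on bad JSON) might look like this:

import json
import re

def safe_json_loads(text: str, default):
    """Best-effort JSON parsing for LLM output; returns `default` on failure."""
    if not text:
        return default
    # Strip Markdown code fences the model may have wrapped around its JSON
    cleaned = re.sub(r"^```(?:json)?\s*|\s*```$", "", text.strip(), flags=re.M).strip()
    try:
        return json.loads(cleaned)
    except json.JSONDecodeError:
        # Fall back to the first object/array-looking span in the text
        m = re.search(r"(\{.*\}|\[.*\])", cleaned, flags=re.S)
        if m:
            try:
                return json.loads(m.group(1))
            except json.JSONDecodeError:
                pass
        return default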
Evidence Summarization
class SummarizeSig(dspy.Signature):
"""Summarize source texts into evidence bullets for the section.
OUTPUT JSON: {"bullets": ["...", "..."]}.
Cite as [S#] (matching the per-query ordering).
Keep bullets concise & factual."""
prompt = dspy.InputField()
sources_digest = dspy.InputField()
output_json = dspy.OutputField()
The summarizer extracts key facts from retrieved content and assigns temporary source citations [S1], [S2], etc.
Section Writing
class WriteSectionSig(dspy.Signature):
"""Write a polished Markdown section '# {section_title}'
using [n] numeric citations only. Avoid bare URLs.
Return ONLY the section Markdown."""
section_title = dspy.InputField()
section_instructions = dspy.InputField()
evidence_digest = dspy.InputField()
output_markdown = dspy.OutputField()
Clear constraints: numeric citations only, no bare URLs, pure Markdown output.
Gap Analysis
class GapAnalysisSig(dspy.Signature):
"""Given current bullets, decide if more research is needed.
OUTPUT JSON: {"need_more": bool, "followup_queries": ["..."]}"""
section_title = dspy.InputField()
bullets_digest = dspy.InputField()
output_json = dspy.OutputField()
This enables iterative research: if initial results are insufficient, the system generates follow-up queries automatically.
Review and Revision
class ReviewSig(dspy.Signature):
"""Review the full report for coverage, correctness, clarity,
neutrality, structure, citation hygiene.
OUTPUT JSON: {pass_checks, issues, suggestions, summary}"""
report_md = dspy.InputField()
output_json = dspy.OutputField()
class ReviseSig(dspy.Signature):
"""Apply review suggestions to the report without adding
new unsupported facts. Return the improved Markdown body."""
report_md = dspy.InputField()
suggestions = dspy.InputField()
improved_md = dspy.OutputField()
The review-revise loop mirrors human editorial workflows.
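The post calls these signatures through module instances (QUERY_GEN, SUMMARIZER, WRITE_SECTION, and so on) whose construction isn't shown. They are presumably wrapped like this; the split between Predict and ChainOfThought is my assumption, except for the writer, which the post later says uses ChainOfThought reasoning:

QUERY_GEN = dspy.ChainOfThought(QueryGenSig)
SUMMARIZER = dspy.Predict(SummarizeSig)
GAP_ANALYZER = dspy.Predict(GapAnalysisSig)
WRITE_SECTION = dspy.ChainOfThought(WriteSectionSig)  # "ChainOfThought reasoning" per the writing node
REVIEWER = dspy.ChainOfThought(ReviewSig)
REVISER = dspy.Predict(ReviseSig)
# CITE_FIXER wraps CiteFixSig, which is introduced later in the writing node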
GEPA: Automatic Prompt Optimization
GEPA optimizes prompts by iteratively improving them based on metric feedback. The key is defining lightweight, heuristic metrics that don’t require expensive LLM calls:
def heuristic_report_metric(gold, pred, trace=None) -> float:
"""LLM-free shaping signal for GEPA."""
text = ""
if hasattr(pred, "output_markdown"):
text = pred.output_markdown or ""
elif hasattr(pred, "queries_json"):
text = pred.queries_json or ""
score, notes = 0.0, []
# For query generation
if hasattr(pred, "queries_json"):
data = safe_json_loads(text, [])
uniq = len(set([q.strip().lower() for q in data if isinstance(q, str)]))
has_ops = any(("site:" in q or "intitle:" in q or '"' in q)
for q in data if isinstance(q, str))
# Composite score
score = (0.3 * clamp(uniq/8) + # diversity bonus
0.2 * (1 if 4 <= uniq <= 10 else 0) + # reasonable count
0.5 * (1 if has_ops else 0)) # operator usage
if uniq < 4:
notes.append("Add 6–8 diverse queries.")
if not has_ops:
notes.append("Use operators like site:, intitle:, \"quoted\".")
This function:
- Extracts the output (queries, markdown, etc.)
- Computes measurable quality signals (uniqueness, operator usage)
- Combines them into a single score
- Returns actionable feedback
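The clamp helper that keeps each ratio bounded isn't defined in the excerpts; it is presumably just:

def clamp(x: float, lo: float = 0.0, hi: float = 1.0) -> float:
    """Squash a ratio into [lo, hi] so no single signal dominates the score."""
    return max(lo, min(hi, x))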
For section writing:
# For section writing
if hasattr(pred, "output_markdown"):
has_h1 = 1.0 if re.search(r"^#\s+", text, flags=re.M) else 0.0
cites = len(re.findall(r"\[\d+\]", text))
no_urls = 1.0 if not re.search(r"https?://", text) else 0.0
score = (0.25 * has_h1 +
0.35 * clamp(cites/5) +
0.3 * no_urls +
0.1 * clamp(len(text)/1200))
The metric checks for:
- Proper heading structure
- Sufficient citations
- No raw URLs (only numeric citations)
- Reasonable length
Training Data and Optimization
GEPA requires training examples. The system uses synthetic data based on the actual research task:
def optimize_with_gepa():
"""Run GEPA optimization on key modules with module-specific training sets."""
# Training data for query generation
query_train = [
dspy.Example(
section_title="Market Analysis",
section_instructions="Analyze market size and growth trends"
).with_inputs("section_title", "section_instructions")
]
# Training data for section writing
writer_train = [
dspy.Example(
section_title="Key Findings",
section_instructions="Summarize top 3 findings with evidence",
evidence_digest="Point 1: Data shows X [Source: paper.pdf]..."
).with_inputs("section_title", "section_instructions", "evidence_digest")
]
GEPA optimization:
teleprompter = GEPA(
metric=heuristic_report_metric,
breadth=4, # candidates per generation
depth=3, # improvement rounds
max_bootstrapped_demos=2,
max_labeled_demos=2
)
optimized_query_gen = teleprompter.compile(
QUERY_GEN,
trainset=query_train,
max_demos=2
)
Parameters:
- breadth=4: Generate 4 candidate prompt variations
- depth=3: Perform 3 rounds of refinement
- max_bootstrapped_demos=2: Use up to 2 good examples from previous runs
This typically converges in under a minute.
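The post doesn't show what happens to the compiled programs afterwards. One reasonable pattern (a sketch; the helper name, optimized_writer, and the file names are my own) is to rebind the module-level names so downstream nodes pick up the improved prompts, and to persist the compiled state for later runs:

def _apply_optimized(optimized_query_gen, optimized_writer):
    """Rebind the shared modules and save their compiled state (sketch)."""
    global QUERY_GEN, WRITE_SECTION
    QUERY_GEN = optimized_query_gen
    WRITE_SECTION = optimized_writer
    # DSPy programs serialize their prompts and demos to JSON
    optimized_query_gen.save("optimized_query_gen.json")
    optimized_writer.save("optimized_writer.json")
    # Later runs can skip optimization: QUERY_GEN.load("optimized_query_gen.json")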
LangGraph Workflow: Orchestration
The research workflow is a state machine with conditional routing:
class GraphState(TypedDict):
topic: str
sections: List[SectionSpec]
round: int # current research iteration
queries: List[dict] # pending search queries
research: Annotated[List[ResearchSummary], operator.add]
drafts: Dict[str, str] # section_name -> markdown
cite_maps: Dict[str, Dict] # section_name -> {local_id: url}
used_urls: List[str] # deduplication
report_md: Optional[str]
references_md: Optional[str]
eval_result: Optional[EvalResult]
The state accumulates evidence across rounds and tracks citations.
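The supporting data types referenced here (SectionSpec, SourceDoc, ResearchSummary, EvalResult) aren't defined in the post. Based on how their fields are used, they are presumably simple containers along these lines:

from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class SectionSpec:
    name: str
    instructions: str

@dataclass
class SourceDoc:
    url: str
    title: Optional[str] = None
    site: Optional[str] = None
    published: Optional[str] = None   # ISO date string
    content: str = ""

@dataclass
class ResearchSummary:
    section: str
    query: str
    bullets: List[str] = field(default_factory=list)
    sources: List[SourceDoc] = field(default_factory=list)

@dataclass
class EvalResult:
    score: float
    breakdown: Dict[str, float] = field(default_factory=dict)
    notes: str = ""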
Node 1: Query Planning
async def plan_queries(state: GraphState) -> GraphState:
"""For each section, generate search queries."""
queries_per_section = []
for spec in state["sections"]:
with dspy.context(lm=RESEARCH_LM):
result = QUERY_GEN(
section_title=spec.name,
section_instructions=spec.instructions
)
qlist = safe_json_loads(result.queries_json, [])
for q in qlist[:8]: # cap at 8 queries
queries_per_section.append({
"section": spec.name,
"query": q.strip(),
"round": state["round"]
})
return {"queries": queries_per_section}
Each section gets 4-8 diverse queries using advanced search operators.
Node 2: Parallel Search (Fan-out)
LangGraph’s Send API enables parallel execution:
def route_queries(state: GraphState) -> List[Send]:
"""Fan out: send each query to search_node in parallel."""
return [Send("search_node", {"query_obj": q}) for q in state["queries"]]
Each query runs independently:
async def search_node(state: GraphState) -> GraphState:
"""Execute one Exa search + summarization."""
qobj = state["query_obj"]
section_name = qobj["section"]
query_text = qobj["query"]
# Exa search with full content
results = await asyncio.to_thread(
EXA.search_and_contents,
query_text,
type="neural",
use_autoprompt=True,
num_results=SEARCH_RESULTS_PER_QUERY,
text={"max_characters": MAX_CONTENT_CHARS_PER_SOURCE}
)
# Build source documents
sources = [
SourceDoc(
url=r.url,
title=r.title,
site=short_host(r.url),
published=dtparse.parse(r.published_date).date().isoformat()
if r.published_date else None,
content=(r.text or "")[:MAX_CONTENT_CHARS_PER_SOURCE]
)
for r in results.results
]
# Summarize into evidence bullets
with dspy.context(lm=RESEARCH_LM):
summ_result = SUMMARIZER(
prompt=f"Section: {section_name}\nQuery: {query_text}",
sources_digest=build_sources_digest(sources)
)
bullets_data = safe_json_loads(summ_result.output_json, {})
bullets = bullets_data.get("bullets", [])
return {
"research": [ResearchSummary(
section=section_name,
query=query_text,
bullets=bullets,
sources=sources
)]
}
Key points:
- Uses asyncio.to_thread to avoid blocking on Exa I/O
- Retrieves full article text (no scraping needed)
- Immediately summarizes into evidence bullets
- Returns partial state (LangGraph merges with operator.add)
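The build_sources_digest helper that feeds the summarizer isn't shown. A minimal sketch, assuming it tags each source so the model can cite [S#] (the 2,000-character cap is my own choice):

def build_sources_digest(sources: List[SourceDoc]) -> str:
    """Format retrieved sources as [S#]-tagged blocks for SummarizeSig."""
    parts = []
    for i, s in enumerate(sources, start=1):
        meta = " | ".join(x for x in [s.title, s.site, s.published] if x)
        parts.append(f"[S{i}] {meta}\nURL: {s.url}\n{(s.content or '')[:2000]}")
    return "\n\n".join(parts)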
Node 3: Merge and Gap Analysis
After all searches complete, analyze coverage:
async def merge_and_gap_analyze(state: GraphState) -> GraphState:
"""Aggregate evidence and decide if more research is needed."""
sections_to_evidence = {}
for rs in state["research"]:
sections_to_evidence.setdefault(rs.section, []).extend(rs.bullets)
need_more = {}
for sec_name in [s.name for s in state["sections"]]:
bullets = sections_to_evidence.get(sec_name, [])
with dspy.context(lm=RESEARCH_LM):
gap = GAP_ANALYZER(
section_title=sec_name,
bullets_digest="\n".join(f"- {b}" for b in bullets[:20])
)
gap_data = safe_json_loads(gap.output_json, {})
# Only trigger more research if we haven't hit MAX_ROUNDS
if gap_data.get("need_more") and state["round"] < MAX_ROUNDS:
need_more[sec_name] = gap_data.get("followup_queries", [])
# Generate follow-up queries if gaps detected and rounds remaining
if need_more:
followup_queries = []
for sec, queries in need_more.items():
for q in queries[:3]:
followup_queries.append({
"section": sec,
"query": q,
"round": state["round"] + 1
})
return {
"round": state["round"] + 1,
"queries": followup_queries
}
# Otherwise, proceed to writing
return {"queries": []}
This implements optional iterative deepening: if evidence is thin AND rounds remain, do another round of searches with more targeted queries. With the default MAX_ROUNDS=1, the system does one comprehensive pass and moves directly to writing.
Conditional Routing
def route_or_write(state: GraphState) -> str:
"""Route to search_node (more research) or write_section_node."""
if state.get("queries"):
return "search_node"
return "write_section_node"
Simple but powerful: continue researching if there are pending queries, otherwise move to writing.
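For completeness, here is roughly how these nodes and routers could be wired into the graph that build_graph() returns later in the post. This is a sketch against the current LangGraph API; the intermediate node names are my own, chosen to match the routing functions above:

from langgraph.graph import StateGraph, START, END

def build_graph():
    g = StateGraph(GraphState)
    g.add_node("plan_queries", plan_queries)
    g.add_node("search_node", search_node)
    g.add_node("merge_gap", merge_and_gap_analyze)
    g.add_node("write_section_node", write_section_node)
    g.add_node("assemble_review", assemble_and_review)

    g.add_edge(START, "plan_queries")
    # Fan-out: route_queries returns one Send per pending query
    g.add_conditional_edges("plan_queries", route_queries, ["search_node"])
    # Fan-in: all search results merge (operator.add) before gap analysis
    g.add_edge("search_node", "merge_gap")
    # Loop back for another round or proceed to writing
    g.add_conditional_edges("merge_gap", route_or_write,
                            ["search_node", "write_section_node"])
    g.add_edge("write_section_node", "assemble_review")
    g.add_edge("assemble_review", END)
    return g.compile()

Note that with MAX_ROUNDS > 1, the edge back into search_node would again need to fan out over the follow-up queries (for example by having route_or_write return Send objects), which is why the default single-round configuration keeps the routing this simple.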
Node 4: Section Writing
async def write_section_node(state: GraphState) -> GraphState:
"""Write each section in parallel, with citations."""
async def write_one(spec: SectionSpec) -> Tuple[str, str, Dict]:
# Gather all evidence for this section
all_bullets = []
all_sources = []
for rs in state["research"]:
if rs.section == spec.name:
all_bullets.extend(rs.bullets)
all_sources.extend(rs.sources)
# Build citation-aware evidence digest
evidence_text, source_map = build_evidence_with_cites(
all_bullets, all_sources
)
# Write section
with dspy.context(lm=WRITER_LM):
result = WRITE_SECTION(
section_title=spec.name,
section_instructions=spec.instructions,
evidence_digest=evidence_text
)
body = result.output_markdown or ""
# Fix citations: [S#] -> [1], [2], etc.
with dspy.context(lm=WRITER_LM):
fixed = CITE_FIXER(
markdown_body=body,
id_map_notes=build_cite_map_notes(source_map)
)
return spec.name, fixed.fixed_markdown, source_map
# Write all sections in parallel
tasks = [write_one(spec) for spec in state["sections"]]
results = await asyncio.gather(*tasks)
drafts = {name: md for name, md, _ in results}
cite_maps = {name: cmap for name, _, cmap in results}
return {"drafts": drafts, "cite_maps": cite_maps}
Each section:
- Gathers evidence from all research rounds
- Writes using ChainOfThought reasoning
- Fixes citations from [S#] to proper numeric format
- Tracks source URLs for the final reference list
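The build_evidence_with_cites helper used in write_one isn't shown either. A rough sketch of the idea: number each unique source for the section, hand the writer the bullets plus a numbered source list, and return the {local_id: url} map that the assembly step later uses for global renumbering:

def build_evidence_with_cites(bullets: List[str], sources: List[SourceDoc]):
    """Return (evidence_digest, {local_id: url}) for one section (sketch)."""
    url_to_local: Dict[str, int] = {}
    for s in sources:
        if s.url not in url_to_local:
            url_to_local[s.url] = len(url_to_local) + 1
    lines = ["Evidence bullets:"]
    lines += [f"- {b}" for b in bullets]
    lines.append("")
    lines.append("Sources (cite by number):")
    lines += [f"[{n}] {url}" for url, n in url_to_local.items()]
    return "\n".join(lines), {n: url for url, n in url_to_local.items()}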
The CITE_FIXER module ensures clean citation hygiene:
class CiteFixSig(dspy.Signature):
"""Fix citations: ensure only [n] numeric citations
(no [S#] or raw URLs). Return ONLY the corrected Markdown body."""
markdown_body = dspy.InputField()
id_map_notes = dspy.InputField()
fixed_markdown = dspy.OutputField()
Node 5: Assembly and Review
The final node assembles sections with a global citation registry:
class CitationRegistry:
"""Ensures consistent numbering across sections."""
def __init__(self):
self.url_to_id: Dict[str, int] = {}
self.ordered: List[str] = []
def assign(self, url: str) -> int:
"""Return citation number for URL (assign if new)."""
if url in self.url_to_id:
return self.url_to_id[url]
new_id = len(self.ordered) + 1
self.url_to_id[url] = new_id
self.ordered.append(url)
return new_id
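The references_markdown method called during assembly isn't shown. A plausible version renders the ordered URLs under a "## References" heading (the same heading the evaluator checks for), using whatever source metadata is available:

# Belongs on CitationRegistry; url_to_doc maps url -> SourceDoc (sketch)
def references_markdown(self, url_to_doc: Dict[str, SourceDoc]) -> str:
    lines = ["## References", ""]
    for i, url in enumerate(self.ordered, start=1):
        doc = url_to_doc.get(url)
        label = f"{doc.title} ({doc.site})" if doc and doc.title else url
        lines.append(f"[{i}] {label} - {url}")
    return "\n".join(lines)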
Assembly process:
async def assemble_and_review(state: GraphState) -> GraphState:
"""Combine sections, renumber citations globally, review, revise."""
global_reg = CitationRegistry()
url_to_doc: Dict[str, SourceDoc] = {}
# Build source metadata
for rs in state["research"]:
for d in rs.sources:
url_to_doc[d.url] = d
# Renumber citations globally
def renumber_section(md: str, local_map: Dict[int, str]) -> str:
def _repl(m):
old_num = int(m.group(1))
url = local_map.get(old_num)
if url:
new_num = global_reg.assign(url)
return f"[{new_num}]"
return m.group(0)
return re.sub(r"\[(\d+)\]", _repl, md)
# Assemble
parts = []
for spec in state["sections"]:
body = state["drafts"].get(spec.name, "")
local_map = state["cite_maps"].get(spec.name, {})
parts.append(renumber_section(body, local_map))
body_renumbered = "\n\n".join(parts)
# Generate References section
refs = global_reg.references_markdown(url_to_doc)
full_md = f"{body_renumbered}\n\n{refs}"
# Review
with dspy.context(lm=WRITER_LM):
review = REVIEWER(report_md=full_md)
review_data = safe_json_loads(review.output_json, {})
pass_checks = review_data.get("pass_checks", False)
suggestions = review_data.get("suggestions", [])
# Revise if needed
if not pass_checks and suggestions:
with dspy.context(lm=WRITER_LM):
revision = REVISER(
report_md=full_md,
suggestions="\n".join(f"- {s}" for s in suggestions)
)
full_md = revision.improved_md.strip() + "\n\n" + refs
return {"report_md": full_md, "references_md": refs}
The review-revise loop is automatic but conditional—only revises if quality checks fail.
Evaluation
The system uses a lightweight heuristic evaluator:
def eval_report_simple(md: str) -> EvalResult:
checks = {}
checks["has_h1"] = 1.0 if re.search(r"^#\s+", md, flags=re.M) else 0.0
cites = len(re.findall(r"\[\d+\]", md))
checks["enough_cites"] = clamp(cites/10)
checks["no_raw_urls"] = 1.0 if not re.search(
r"https?://",
md.split("## References")[0]
) else 0.0
checks["has_refs"] = 1.0 if "## References" in md else 0.0
checks["length_ok"] = 1.0 if len(md) >= 2000 else 0.4
score = sum(checks.values()) / len(checks)
return EvalResult(score=score, breakdown=checks, notes=f"{cites} citations")
This provides:
- Structure validation (headings, references)
- Citation quality (numeric only, sufficient count)
- Length adequacy
- Fast execution (no LLM calls)
Putting It All Together
The entry point orchestrates everything:
async def run_pipeline(
topic: str,
sections: List[SectionSpec],
optimization: bool = False
) -> Dict[str, Any]:
# Optional GEPA optimization
if optimization:
optimize_with_gepa()
# Build LangGraph
app = build_graph()
# Initialize state
initial_state: GraphState = {
"topic": topic,
"sections": sections,
"round": 0,
"queries": [],
"research": [],
"drafts": {},
"cite_maps": {},
"used_urls": [],
"report_md": None,
"references_md": None,
"eval_result": None,
}
# Execute
final_state = await app.ainvoke(initial_state)
# Evaluate
md = final_state["report_md"]
final_state["eval_result"] = eval_report_simple(md)
# Save
with open("report.md", "w") as f:
f.write(md)
return final_state
Example usage:
SECTIONS = [
SectionSpec(
name="Executive Summary",
instructions="180–250 words, decision-relevant takeaways"
),
SectionSpec(
name="Market Landscape",
instructions="2023–2025 trends; 4+ figures with sources"
),
SectionSpec(
name="Key Players & Differentiation",
instructions="Compare 5–7 players with objective benchmarks"
),
SectionSpec(
name="Risks & Open Questions",
instructions="Top risks, unknowns; cite evidence"
),
SectionSpec(
name="Outlook (12–24 months)",
instructions="3–5 grounded predictions with dates"
),
]
topic = "State of Edge AI Acceleration (2024–2025)"
final = await run_pipeline(topic=topic, sections=SECTIONS, optimization=True)
Key Design Decisions
1. Flexible Model Configuration
The system uses a flexible model configuration approach:
- Default to Flash: By default, all tasks use Gemini Flash for speed and cost-efficiency
- Temperature tuning: Different temperatures for different cognitive loads (0.2 for writing, 0.4 for research, 0.8 for review)
- Environment-based override: Production deployments can easily switch to Pro models via environment variables
- Automatic fallback: If a specified model fails, gracefully falls back to Flash
This approach optimizes for fast iteration during development while allowing production deployments to use higher-quality models where needed.
2. Configurable Iterative Research with Gap Analysis
The system supports iterative research through gap analysis:
- Does initial broad searches
- Analyzes coverage gaps
- Generates targeted follow-up queries if needed
- Repeats up to MAX_ROUNDS (default: 1, configurable via environment)
By default, the system performs one comprehensive research pass. For topics requiring deeper investigation, increase MAX_ROUNDS to enable multiple rounds of research:
export RR_MAX_ROUNDS=2 # Enable iterative deepening
This mimics how human researchers work: initial scan, with optional deeper dives into specific areas when needed.
3. Global Citation Management
Citations are handled at three levels:
- Source level: [S1], [S2] in summaries
- Section level: [1], [2] in section drafts
- Global level: renumbered across all sections
The CitationRegistry ensures a source cited in multiple sections gets one consistent number.
4. DSPy Signatures Over Raw Prompts
Every interaction with an LLM uses a typed signature. Benefits:
- Composability: Modules can be combined
- Optimization: GEPA can improve them automatically
- Validation: Input/output contracts are explicit
- Maintainability: Prompts are code, not strings
5. Heuristic Metrics for GEPA
Instead of expensive LLM-as-judge evaluation, the system uses:
- Regex patterns (citations, headings)
- JSON parsing (structured outputs)
- String operations (length, uniqueness)
- Set operations (diversity)
These metrics are:
- Fast: No API calls
- Deterministic: Same input = same score
- Interpretable: Clear what they measure
- Aligned: Correlate with actual quality
Performance Characteristics
Speed
- Parallel search: All queries for a section run concurrently
- Parallel writing: All sections drafted simultaneously
- Async I/O: Exa searches don’t block
- Flash-first: Default to faster Flash models for quick iteration
- Typical runtime: 2-4 minutes for a 5-section report (single round)
Cost
- Flash by default: all LLM calls use Flash in the default configuration
- Optional Pro upgrade: Set via environment variables for production
- Exa: ~20-40 searches per report (4-8 queries per section × 5 sections, 6 results each)
- Typical cost with Flash: $0.20-0.50 per report
- With Pro models: $0.80-1.50 per report
Quality
With optimization:
- Citation hygiene: >95% use numeric citations correctly
- Source quality: Exa’s neural search finds high-quality sources
- Coverage: Gap analysis ensures comprehensive research (when MAX_ROUNDS > 1)
- Coherence: ChainOfThought prompting produces well-structured prose
- Flexibility: Can upgrade to Pro models for higher-stakes reports
Extending the System
Custom Metrics
Add domain-specific evaluation:
def domain_metric(gold, pred, trace=None) -> float:
text = pred.output_markdown or ""
score = 0.0
# Check for specific terminology
if "market cap" in text.lower():
score += 0.2
# Require quantitative data
numbers = re.findall(r'\d+(?:\.\d+)?%?', text)
score += 0.3 * min(len(numbers) / 5, 1.0)
# Penalize vague language
vague_terms = ["might", "could", "possibly"]
if any(term in text.lower() for term in vague_terms):
score -= 0.2
return max(0, score)
Additional Agents
Add a fact-checking agent:
class FactCheckSig(dspy.Signature):
"""Verify factual claims against sources.
Return JSON: {verified: [], uncertain: [], incorrect: []}"""
claims = dspy.InputField()
sources = dspy.InputField()
result_json = dspy.OutputField()
async def fact_check_node(state: GraphState) -> GraphState:
md = state["report_md"]
sources = state["research"]
with dspy.context(lm=WRITER_LM):
result = FACT_CHECKER(
claims=extract_claims(md),
sources=format_sources(sources)
)
# Flag issues
data = json.loads(result.result_json)
if data.get("incorrect"):
print("⚠️ Found potentially incorrect claims")
return {"fact_check_report": data}
Multi-Language Support
Extend for non-English research:
class TranslateSig(dspy.Signature):
"""Translate content to English while preserving meaning."""
source_text = dspy.InputField()
source_language = dspy.InputField()
translated_text = dspy.OutputField()
async def search_node_multilingual(state: GraphState) -> GraphState:
# Search in target language
results = await exa_search(query, language="es")
# Translate results
for source in sources:
if detect_language(source.content) != "en":
with dspy.context(lm=RESEARCH_LM):
translated = TRANSLATOR(
source_text=source.content,
source_language="auto"
)
source.content = translated.translated_text
# Continue with normal summarization
return summarize_sources(sources)
Conclusion
This system demonstrates that agentic AI can tackle complex, multi-stage workflows when you:
- Structure the workflow: LangGraph provides clear orchestration
- Make prompts first-class: DSPy signatures are composable and optimizable
- Optimize automatically: GEPA improves prompts based on output quality
- Configure flexibly: Temperature tuning and environment-based model selection
- Implement feedback loops: Gap analysis and review-revise cycles
The result is a system that:
- Produces comprehensive, well-cited reports
- Continuously improves its prompts through GEPA
- Runs efficiently with parallel execution
- Maintains high quality through multi-stage review
- Adapts to different use cases (quick exploratory vs. deep research)
- Balances cost and quality through flexible model configuration
This architecture is applicable beyond research—any multi-agent workflow with optimization potential can benefit from the LangGraph + DSPy + GEPA stack.
Resources
- LangGraph docs: https://langchain-ai.github.io/langgraph/
- DSPy docs: https://dspy-docs.vercel.app/
- Exa API: https://docs.exa.ai/
- GEPA paper: https://arxiv.org/abs/2507.19457
The complete code for this system is available as a reference implementation. Try it with your own research topics and section templates!