LangGraph + DSPy + GEPA: Agentic Researcher with multi-stage prompt optimization
Research is iterative, collaborative, and quality-driven. The best research workflows involve multiple passes: initial investigation, gap analysis, deeper dives, synthesis, review, and revision. Today’s AI agents can replicate this process, but most implementations treat prompts as static instructions rather than optimizable components.
In this post, I’ll walk through a production-grade agentic research system that combines three powerful frameworks:
- LangGraph for workflow orchestration and parallelism
- DSPy for structured prompt engineering
- GEPA (Genetic-Pareto) for automatic prompt optimization
The result is a system that can research a topic, write a comprehensive report with proper citations, and continuously improve its prompts based on output quality metrics. You can find the entire code sample on GitHub. Let’s dive into how it works.
Architecture Overview
The system implements a multi-agent research pipeline with the following key characteristics:
Research Infrastructure:
- Exa API for semantic web search and full-text retrieval (no custom scraping needed)
- Gemini models with flexible configuration (Flash by default, Pro optional)
- Temperature-tuned instances for different cognitive loads
- Global citation registry for consistent numbering across sections
Agent Workflow:
- Query planning (generate diverse search queries per section)
- Parallel web search and content retrieval
- Summarization and gap analysis
- Optional iterative research (if gaps detected and MAX_ROUNDS > 1)
- Section writing with citations
- Assembly and quality review
- Revision based on feedback
Optimization Layer:
- Module-specific GEPA optimization for each agent role
- Heuristic evaluation metrics
- Lightweight, non-LLM-based quality signals
Why This Combination?
Before diving into the code, let’s understand why this specific stack:
LangGraph provides the orchestration layer with:
- Built-in support for parallel execution (Fan-out/Fan-in patterns)
- Conditional routing based on state
- Clean separation of concerns across nodes
- Type-safe state management
DSPy transforms prompts from strings to signatures:
- Structured input/output fields
- Composable modules (ChainOfThought, Predict)
- Context management for model switching
- Makes prompts first-class objects that can be optimized
GEPA optimizes prompts automatically:
- Uses gradient-free optimization
- Works with custom metric functions
- Fast convergence (few iterations needed)
- No need for labeled training data
Configuration and Setup
The system starts with a flexible configuration that allows environment-based customization:
MAX_ROUNDS = int(os.environ.get("RR_MAX_ROUNDS", "1")) # writer<->research loop rounds
SEARCH_RESULTS_PER_QUERY = int(os.environ.get("RR_SEARCH_K", "6")) # per query
MAX_CONTENT_CHARS_PER_SOURCE = int(os.environ.get("RR_MAX_CHARS", "12000"))
WRITER_MODEL = os.environ.get("GEMINI_WRITER_MODEL", "gemini/gemini-flash-latest")
RESEARCH_MODEL = os.environ.get("GEMINI_RESEARCH_MODEL", "gemini/gemini-flash-latest")
REFLECTION_MODEL = os.environ.get("GEMINI_REFLECTION_MODEL", WRITER_MODEL)
Key design decision: Model flexibility with fallbacks. The system defaults to Gemini Flash for both writing and research, but allows environment variable configuration for different models. This provides a balance of speed and quality while allowing customization based on specific needs.
The system initializes DSPy language models with automatic fallback:
FALLBACK_WRITER = "gemini/gemini-flash-latest"
FALLBACK_RESEARCH = "gemini/gemini-flash-latest"
def _make_lm(model_name: str, api_key: str, temperature: float = 0.3,
model_type: str = "chat", max_tokens: int = 65536):
"""Create a DSPy LM via LiteLLM provider strings."""
try:
return dspy.LM(model_name, api_key=api_key, temperature=temperature,
model_type=model_type, max_tokens=max_tokens)
except Exception:
# Graceful fallback if specified model fails
if "pro" in model_name:
return dspy.LM(FALLBACK_WRITER, api_key=api_key, temperature=temperature,
model_type=model_type, max_tokens=max_tokens)
return dspy.LM(FALLBACK_RESEARCH, api_key=api_key, temperature=temperature,
model_type=model_type, max_tokens=max_tokens)
The three language model instances are configured with different temperatures:
WRITER_LM = _make_lm(WRITER_MODEL, GEMINI_API_KEY, temperature=0.2) # Lower temp for consistent writing
RESEARCH_LM = _make_lm(RESEARCH_MODEL, GEMINI_API_KEY, temperature=0.4) # Moderate temp for summarization
REFLECT_LM = _make_lm(REFLECTION_MODEL, GEMINI_API_KEY, temperature=0.8) # Higher temp for creative review
DSPy Signatures: Structured Prompts
Instead of raw prompt strings, DSPy uses Signatures that define clear input/output contracts. Here are the key signatures:
Query Generation
class QueryGenSig(dspy.Signature):
"""Produce 4–8 diverse Exa search queries for a section
(use quoted phrases, site:, intitle:, date ranges).
Return a JSON list of strings."""
section_title = dspy.InputField()
section_instructions = dspy.InputField()
queries_json = dspy.OutputField()
This signature instructs the model to generate diverse search queries using advanced operators like site:, intitle:, and quoted phrases. The output is structured JSON, making it easy to parse and validate.
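The safe_json_loads helper that parses these JSON outputs is used throughout the post but never shown. A minimal version (my assumption of its behavior: tolerate code fences, fall back to a default on bad JSON) might look like this:

import json
import re

def safe_json_loads(text: str, default):
    """Best-effort JSON parsing for LLM output; returns `default` on failure."""
    if not text:
        return default
    # Strip Markdown code fences the model may have wrapped around its JSON
    cleaned = re.sub(r"^```(?:json)?\s*|\s*```$", "", text.strip(), flags=re.M).strip()
    try:
        return json.loads(cleaned)
    except json.JSONDecodeError:
        # Fall back to the first object/array-looking span in the text
        m = re.search(r"(\{.*\}|\[.*\])", cleaned, flags=re.S)
        if m:
            try:
                return json.loads(m.group(1))
            except json.JSONDecodeError:
                pass
        return default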
Evidence Summarization
class SummarizeSig(dspy.Signature):
"""Summarize source texts into evidence bullets for the section.
OUTPUT JSON: {"bullets": ["...", "..."]}.
Cite as [S#] (matching the per-query ordering).
Keep bullets concise & factual."""
prompt = dspy.InputField()
sources_digest = dspy.InputField()
output_json = dspy.OutputField()
The summarizer extracts key facts from retrieved content and assigns temporary source citations [S1], [S2], etc.
Section Writing
class WriteSectionSig(dspy.Signature):
"""Write a polished Markdown section '# {section_title}'
using [n] numeric citations only. Avoid bare URLs.
Return ONLY the section Markdown."""
section_title = dspy.InputField()
section_instructions = dspy.InputField()
evidence_digest = dspy.InputField()
output_markdown = dspy.OutputField()
Clear constraints: numeric citations only, no bare URLs, pure Markdown output.
Gap Analysis
class GapAnalysisSig(dspy.Signature):
"""Given current bullets, decide if more research is needed.
OUTPUT JSON: {"need_more": bool, "followup_queries": ["..."]}"""
section_title = dspy.InputField()
bullets_digest = dspy.InputField()
output_json = dspy.OutputField()
This enables iterative research: if initial results are insufficient, the system generates follow-up queries automatically.
Review and Revision
class ReviewSig(dspy.Signature):
"""Review the full report for coverage, correctness, clarity,
neutrality, structure, citation hygiene.
OUTPUT JSON: {pass_checks, issues, suggestions, summary}"""
report_md = dspy.InputField()
output_json = dspy.OutputField()
class ReviseSig(dspy.Signature):
"""Apply review suggestions to the report without adding
new unsupported facts. Return the improved Markdown body."""
report_md = dspy.InputField()
suggestions = dspy.InputField()
improved_md = dspy.OutputField()
The review-revise loop mirrors human editorial workflows.
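The post calls these signatures through module instances (QUERY_GEN, SUMMARIZER, WRITE_SECTION, and so on) whose construction isn't shown. They are presumably wrapped like this; the split between Predict and ChainOfThought is my assumption, except for the writer, which the post later says uses ChainOfThought reasoning:

QUERY_GEN = dspy.ChainOfThought(QueryGenSig)
SUMMARIZER = dspy.Predict(SummarizeSig)
GAP_ANALYZER = dspy.Predict(GapAnalysisSig)
WRITE_SECTION = dspy.ChainOfThought(WriteSectionSig)  # "ChainOfThought reasoning" per the writing node
REVIEWER = dspy.ChainOfThought(ReviewSig)
REVISER = dspy.Predict(ReviseSig)
# CITE_FIXER wraps CiteFixSig, which is introduced later in the writing node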
GEPA: Automatic Prompt Optimization
GEPA optimizes prompts by iteratively improving them based on metric feedback. The key is defining lightweight, heuristic metrics that don’t require expensive LLM calls:
def heuristic_report_metric(gold, pred, trace=None) -> float:
"""LLM-free shaping signal for GEPA."""
text = ""
if hasattr(pred, "output_markdown"):
text = pred.output_markdown or ""
elif hasattr(pred, "queries_json"):
text = pred.queries_json or ""
score, notes = 0.0, []
# For query generation
if hasattr(pred, "queries_json"):
data = safe_json_loads(text, [])
uniq = len(set([q.strip().lower() for q in data if isinstance(q, str)]))
has_ops = any(("site:" in q or "intitle:" in q or '"' in q)
for q in data if isinstance(q, str))
# Composite score
score = (0.3 * clamp(uniq/8) + # diversity bonus
0.2 * (1 if 4 <= uniq <= 10 else 0) + # reasonable count
0.5 * (1 if has_ops else 0)) # operator usage
if uniq < 4:
notes.append("Add 6–8 diverse queries.")
if not has_ops:
notes.append("Use operators like site:, intitle:, \"quoted\".")
This function:
- Extracts the output (queries, markdown, etc.)
- Computes measurable quality signals (uniqueness, operator usage)
- Combines them into a single score
- Returns actionable feedback
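The clamp helper that keeps each ratio bounded isn't defined in the excerpts; it is presumably just:

def clamp(x: float, lo: float = 0.0, hi: float = 1.0) -> float:
    """Squash a ratio into [lo, hi] so no single signal dominates the score."""
    return max(lo, min(hi, x))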
For section writing:
# For section writing
if hasattr(pred, "output_markdown"):
has_h1 = 1.0 if re.search(r"^#\s+", text, flags=re.M) else 0.0
cites = len(re.findall(r"\[\d+\]", text))
no_urls = 1.0 if not re.search(r"https?://", text) else 0.0
score = (0.25 * has_h1 +
0.35 * clamp(cites/5) +
0.3 * no_urls +
0.1 * clamp(len(text)/1200))
The metric checks for:
- Proper heading structure
- Sufficient citations
- No raw URLs (only numeric citations)
- Reasonable length
Training Data and Optimization
GEPA requires training examples. The system uses synthetic data based on the actual research task:
def optimize_with_gepa():
"""Run GEPA optimization on key modules with module-specific training sets."""
# Training data for query generation
query_train = [
dspy.Example(
section_title="Market Analysis",
section_instructions="Analyze market size and growth trends"
).with_inputs("section_title", "section_instructions")
]
# Training data for section writing
writer_train = [
dspy.Example(
section_title="Key Findings",
section_instructions="Summarize top 3 findings with evidence",
evidence_digest="Point 1: Data shows X [Source: paper.pdf]..."
).with_inputs("section_title", "section_instructions", "evidence_digest")
]
GEPA optimization:
teleprompter = GEPA(
metric=heuristic_report_metric,
breadth=4, # candidates per generation
depth=3, # improvement rounds
max_bootstrapped_demos=2,
max_labeled_demos=2
)
optimized_query_gen = teleprompter.compile(
QUERY_GEN,
trainset=query_train,
max_demos=2
)
Parameters:
- breadth=4: Generate 4 candidate prompt variations
- depth=3: Perform 3 rounds of refinement
- max_bootstrapped_demos=2: Use up to 2 good examples from previous runs
This typically converges in under a minute.
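The post doesn't show what happens to the compiled programs afterwards. One reasonable pattern (a sketch; the helper name, optimized_writer, and the file names are my own) is to rebind the module-level names so downstream nodes pick up the improved prompts, and to persist the compiled state for later runs:

def _apply_optimized(optimized_query_gen, optimized_writer):
    """Rebind the shared modules and save their compiled state (sketch)."""
    global QUERY_GEN, WRITE_SECTION
    QUERY_GEN = optimized_query_gen
    WRITE_SECTION = optimized_writer
    # DSPy programs serialize their prompts and demos to JSON
    optimized_query_gen.save("optimized_query_gen.json")
    optimized_writer.save("optimized_writer.json")
    # Later runs can skip optimization: QUERY_GEN.load("optimized_query_gen.json")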
LangGraph Workflow: Orchestration
The research workflow is a state machine with conditional routing:
class GraphState(TypedDict):
topic: str
sections: List[SectionSpec]
round: int # current research iteration
queries: List[dict] # pending search queries
research: Annotated[List[ResearchSummary], operator.add]
drafts: Dict[str, str] # section_name -> markdown
cite_maps: Dict[str, Dict] # section_name -> {local_id: url}
used_urls: List[str] # deduplication
report_md: Optional[str]
references_md: Optional[str]
eval_result: Optional[EvalResult]
The state accumulates evidence across rounds and tracks citations.
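The supporting data types referenced here (SectionSpec, SourceDoc, ResearchSummary, EvalResult) aren't defined in the post. Based on how their fields are used, they are presumably simple containers along these lines:

from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class SectionSpec:
    name: str
    instructions: str

@dataclass
class SourceDoc:
    url: str
    title: Optional[str] = None
    site: Optional[str] = None
    published: Optional[str] = None   # ISO date string
    content: str = ""

@dataclass
class ResearchSummary:
    section: str
    query: str
    bullets: List[str] = field(default_factory=list)
    sources: List[SourceDoc] = field(default_factory=list)

@dataclass
class EvalResult:
    score: float
    breakdown: Dict[str, float] = field(default_factory=dict)
    notes: str = ""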
Node 1: Query Planning
async def plan_queries(state: GraphState) -> GraphState:
"""For each section, generate search queries."""
queries_per_section = []
for spec in state["sections"]:
with dspy.context(lm=RESEARCH_LM):
result = QUERY_GEN(
section_title=spec.name,
section_instructions=spec.instructions
)
qlist = safe_json_loads(result.queries_json, [])
for q in qlist[:8]: # cap at 8 queries
queries_per_section.append({
"section": spec.name,
"query": q.strip(),
"round": state["round"]
})
return {"queries": queries_per_section}
Each section gets 4-8 diverse queries using advanced search operators.
Node 2: Parallel Search (Fan-out)
LangGraph’s Send API enables parallel execution:
def route_queries(state: GraphState) -> List[Send]:
"""Fan out: send each query to search_node in parallel."""
return [Send("search_node", {"query_obj": q}) for q in state["queries"]]
Each query runs independently:
async def search_node(state: GraphState) -> GraphState:
"""Execute one Exa search + summarization."""
qobj = state["query_obj"]
section_name = qobj["section"]
query_text = qobj["query"]
# Exa search with full content
results = await asyncio.to_thread(
EXA.search_and_contents,
query_text,
type="neural",
use_autoprompt=True,
num_results=SEARCH_RESULTS_PER_QUERY,
text={"max_characters": MAX_CONTENT_CHARS_PER_SOURCE}
)
# Build source documents
sources = [
SourceDoc(
url=r.url,
title=r.title,
site=short_host(r.url),
published=dtparse.parse(r.published_date).date().isoformat()
if r.published_date else None,
content=(r.text or "")[:MAX_CONTENT_CHARS_PER_SOURCE]
)
for r in results.results
]
# Summarize into evidence bullets
with dspy.context(lm=RESEARCH_LM):
summ_result = SUMMARIZER(
prompt=f"Section: {section_name}\nQuery: {query_text}",
sources_digest=build_sources_digest(sources)
)
bullets_data = safe_json_loads(summ_result.output_json, {})
bullets = bullets_data.get("bullets", [])
return {
"research": [ResearchSummary(
section=section_name,
query=query_text,
bullets=bullets,
sources=sources
)]
}
Key points:
- Uses asyncio.to_thread to avoid blocking on Exa I/O
- Retrieves full article text (no scraping needed)
- Immediately summarizes into evidence bullets
- Returns partial state (LangGraph merges with operator.add)
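The build_sources_digest helper that feeds the summarizer isn't shown. A minimal sketch, assuming it tags each source so the model can cite [S#] (the 2,000-character cap is my own choice):

def build_sources_digest(sources: List[SourceDoc]) -> str:
    """Format retrieved sources as [S#]-tagged blocks for SummarizeSig."""
    parts = []
    for i, s in enumerate(sources, start=1):
        meta = " | ".join(x for x in [s.title, s.site, s.published] if x)
        parts.append(f"[S{i}] {meta}\nURL: {s.url}\n{(s.content or '')[:2000]}")
    return "\n\n".join(parts)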
Node 3: Merge and Gap Analysis
After all searches complete, analyze coverage:
async def merge_and_gap_analyze(state: GraphState) -> GraphState:
"""Aggregate evidence and decide if more research is needed."""
sections_to_evidence = {}
for rs in state["research"]:
sections_to_evidence.setdefault(rs.section, []).extend(rs.bullets)
need_more = {}
for sec_name in [s.name for s in state["sections"]]:
bullets = sections_to_evidence.get(sec_name, [])
with dspy.context(lm=RESEARCH_LM):
gap = GAP_ANALYZER(
section_title=sec_name,
bullets_digest="\n".join(f"- {b}" for b in bullets[:20])
)
gap_data = safe_json_loads(gap.output_json, {})
# Only trigger more research if we haven't hit MAX_ROUNDS
if gap_data.get("need_more") and state["round"] < MAX_ROUNDS:
need_more[sec_name] = gap_data.get("followup_queries", [])
# Generate follow-up queries if gaps detected and rounds remaining
if need_more:
followup_queries = []
for sec, queries in need_more.items():
for q in queries[:3]:
followup_queries.append({
"section": sec,
"query": q,
"round": state["round"] + 1
})
return {
"round": state["round"] + 1,
"queries": followup_queries
}
# Otherwise, proceed to writing
return {"queries": []}
This implements optional iterative deepening: if evidence is thin AND rounds remain, do another round of searches with more targeted queries. With the default MAX_ROUNDS=1, the system does one comprehensive pass and moves directly to writing.
Conditional Routing
def route_or_write(state: GraphState) -> str:
"""Route to search_node (more research) or write_section_node."""
if state.get("queries"):
return "search_node"
return "write_section_node"
Simple but powerful: continue researching if there are pending queries, otherwise move to writing.
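For completeness, here is roughly how these nodes and routers could be wired into the graph that build_graph() returns later in the post. This is a sketch against the current LangGraph API; the intermediate node names are my own, chosen to match the routing functions above:

from langgraph.graph import StateGraph, START, END

def build_graph():
    g = StateGraph(GraphState)
    g.add_node("plan_queries", plan_queries)
    g.add_node("search_node", search_node)
    g.add_node("merge_gap", merge_and_gap_analyze)
    g.add_node("write_section_node", write_section_node)
    g.add_node("assemble_review", assemble_and_review)

    g.add_edge(START, "plan_queries")
    # Fan-out: route_queries returns one Send per pending query
    g.add_conditional_edges("plan_queries", route_queries, ["search_node"])
    # Fan-in: all search results merge (operator.add) before gap analysis
    g.add_edge("search_node", "merge_gap")
    # Loop back for another round or proceed to writing
    g.add_conditional_edges("merge_gap", route_or_write,
                            ["search_node", "write_section_node"])
    g.add_edge("write_section_node", "assemble_review")
    g.add_edge("assemble_review", END)
    return g.compile()

Note that with MAX_ROUNDS > 1, the edge back into search_node would again need to fan out over the follow-up queries (for example by having route_or_write return Send objects), which is why the default single-round configuration keeps the routing this simple.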
Node 4: Section Writing
async def write_section_node(state: GraphState) -> GraphState:
"""Write each section in parallel, with citations."""
async def write_one(spec: SectionSpec) -> Tuple[str, str, Dict]:
# Gather all evidence for this section
all_bullets = []
all_sources = []
for rs in state["research"]:
if rs.section == spec.name:
all_bullets.extend(rs.bullets)
all_sources.extend(rs.sources)
# Build citation-aware evidence digest
evidence_text, source_map = build_evidence_with_cites(
all_bullets, all_sources
)
# Write section
with dspy.context(lm=WRITER_LM):
result = WRITE_SECTION(
section_title=spec.name,
section_instructions=spec.instructions,
evidence_digest=evidence_text
)
body = result.output_markdown or ""
# Fix citations: [S#] -> [1], [2], etc.
with dspy.context(lm=WRITER_LM):
fixed = CITE_FIXER(
markdown_body=body,
id_map_notes=build_cite_map_notes(source_map)
)
return spec.name, fixed.fixed_markdown, source_map
# Write all sections in parallel
tasks = [write_one(spec) for spec in state["sections"]]
results = await asyncio.gather(*tasks)
drafts = {name: md for name, md, _ in results}
cite_maps = {name: cmap for name, _, cmap in results}
return {"drafts": drafts, "cite_maps": cite_maps}
Each section:
- Gathers evidence from all research rounds
- Writes using ChainOfThought reasoning
- Fixes citations from [S#] to proper numeric format
- Tracks source URLs for the final reference list
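The build_evidence_with_cites helper used in write_one isn't shown either. A rough sketch of the idea: number each unique source for the section, hand the writer the bullets plus a numbered source list, and return the {local_id: url} map that the assembly step later uses for global renumbering:

def build_evidence_with_cites(bullets: List[str], sources: List[SourceDoc]):
    """Return (evidence_digest, {local_id: url}) for one section (sketch)."""
    url_to_local: Dict[str, int] = {}
    for s in sources:
        if s.url not in url_to_local:
            url_to_local[s.url] = len(url_to_local) + 1
    lines = ["Evidence bullets:"]
    lines += [f"- {b}" for b in bullets]
    lines.append("")
    lines.append("Sources (cite by number):")
    lines += [f"[{n}] {url}" for url, n in url_to_local.items()]
    return "\n".join(lines), {n: url for url, n in url_to_local.items()}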
The CITE_FIXER module ensures clean citation hygiene:
class CiteFixSig(dspy.Signature):
"""Fix citations: ensure only [n] numeric citations
(no [S#] or raw URLs). Return ONLY the corrected Markdown body."""
markdown_body = dspy.InputField()
id_map_notes = dspy.InputField()
fixed_markdown = dspy.OutputField()
Node 5: Assembly and Review
The final node assembles sections with a global citation registry:
class CitationRegistry:
"""Ensures consistent numbering across sections."""
def __init__(self):
self.url_to_id: Dict[str, int] = {}
self.ordered: List[str] = []
def assign(self, url: str) -> int:
"""Return citation number for URL (assign if new)."""
if url in self.url_to_id:
return self.url_to_id[url]
new_id = len(self.ordered) + 1
self.url_to_id[url] = new_id
self.ordered.append(url)
return new_id
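The references_markdown method called during assembly isn't shown. A plausible version renders the ordered URLs under a "## References" heading (the same heading the evaluator checks for), using whatever source metadata is available:

# Belongs on CitationRegistry; url_to_doc maps url -> SourceDoc (sketch)
def references_markdown(self, url_to_doc: Dict[str, SourceDoc]) -> str:
    lines = ["## References", ""]
    for i, url in enumerate(self.ordered, start=1):
        doc = url_to_doc.get(url)
        label = f"{doc.title} ({doc.site})" if doc and doc.title else url
        lines.append(f"[{i}] {label} - {url}")
    return "\n".join(lines)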
Assembly process:
async def assemble_and_review(state: GraphState) -> GraphState:
"""Combine sections, renumber citations globally, review, revise."""
global_reg = CitationRegistry()
url_to_doc: Dict[str, SourceDoc] = {}
# Build source metadata
for rs in state["research"]:
for d in rs.sources:
url_to_doc[d.url] = d
# Renumber citations globally
def renumber_section(md: str, local_map: Dict[int, str]) -> str:
def _repl(m):
old_num = int(m.group(1))
url = local_map.get(old_num)
if url:
new_num = global_reg.assign(url)
return f"[{new_num}]"
return m.group(0)
return re.sub(r"\[(\d+)\]", _repl, md)
# Assemble
parts = []
for spec in state["sections"]:
body = state["drafts"].get(spec.name, "")
local_map = state["cite_maps"].get(spec.name, {})
parts.append(renumber_section(body, local_map))
body_renumbered = "\n\n".join(parts)
# Generate References section
refs = global_reg.references_markdown(url_to_doc)
full_md = f"{body_renumbered}\n\n{refs}"
# Review
with dspy.context(lm=WRITER_LM):
review = REVIEWER(report_md=full_md)
review_data = safe_json_loads(review.output_json, {})
pass_checks = review_data.get("pass_checks", False)
suggestions = review_data.get("suggestions", [])
# Revise if needed
if not pass_checks and suggestions:
with dspy.context(lm=WRITER_LM):
revision = REVISER(
report_md=full_md,
suggestions="\n".join(f"- {s}" for s in suggestions)
)
full_md = revision.improved_md.strip() + "\n\n" + refs
return {"report_md": full_md, "references_md": refs}
The review-revise loop is automatic but conditional—only revises if quality checks fail.
Evaluation
The system uses a lightweight heuristic evaluator:
def eval_report_simple(md: str) -> EvalResult:
checks = {}
checks["has_h1"] = 1.0 if re.search(r"^#\s+", md, flags=re.M) else 0.0
cites = len(re.findall(r"\[\d+\]", md))
checks["enough_cites"] = clamp(cites/10)
checks["no_raw_urls"] = 1.0 if not re.search(
r"https?://",
md.split("## References")[0]
) else 0.0
checks["has_refs"] = 1.0 if "## References" in md else 0.0
checks["length_ok"] = 1.0 if len(md) >= 2000 else 0.4
score = sum(checks.values()) / len(checks)
return EvalResult(score=score, breakdown=checks, notes=f"{cites} citations")
This provides:
- Structure validation (headings, references)
- Citation quality (numeric only, sufficient count)
- Length adequacy
- Fast execution (no LLM calls)
Putting It All Together
The entry point orchestrates everything:
async def run_pipeline(
topic: str,
sections: List[SectionSpec],
optimization: bool = False
) -> Dict[str, Any]:
# Optional GEPA optimization
if optimization:
optimize_with_gepa()
# Build LangGraph
app = build_graph()
# Initialize state
initial_state: GraphState = {
"topic": topic,
"sections": sections,
"round": 0,
"queries": [],
"research": [],
"drafts": {},
"cite_maps": {},
"used_urls": [],
"report_md": None,
"references_md": None,
"eval_result": None,
}
# Execute
final_state = await app.ainvoke(initial_state)
# Evaluate
md = final_state["report_md"]
final_state["eval_result"] = eval_report_simple(md)
# Save
with open("report.md", "w") as f:
f.write(md)
return final_state
Example usage:
SECTIONS = [
SectionSpec(
name="Executive Summary",
instructions="180–250 words, decision-relevant takeaways"
),
SectionSpec(
name="Market Landscape",
instructions="2023–2025 trends; 4+ figures with sources"
),
SectionSpec(
name="Key Players & Differentiation",
instructions="Compare 5–7 players with objective benchmarks"
),
SectionSpec(
name="Risks & Open Questions",
instructions="Top risks, unknowns; cite evidence"
),
SectionSpec(
name="Outlook (12–24 months)",
instructions="3–5 grounded predictions with dates"
),
]
topic = "State of Edge AI Acceleration (2024–2025)"
final = await run_pipeline(topic=topic, sections=SECTIONS, optimization=True)
Key Design Decisions
1. Flexible Model Configuration
The system uses a flexible model configuration approach:
- Default to Flash: By default, all tasks use Gemini Flash for speed and cost-efficiency
- Temperature tuning: Different temperatures for different cognitive loads (0.2 for writing, 0.4 for research, 0.8 for review)
- Environment-based override: Production deployments can easily switch to Pro models via environment variables
- Automatic fallback: If a specified model fails, gracefully falls back to Flash
This approach optimizes for fast iteration during development while allowing production deployments to use higher-quality models where needed.
2. Configurable Iterative Research with Gap Analysis
The system supports iterative research through gap analysis:
- Does initial broad searches
- Analyzes coverage gaps
- Generates targeted follow-up queries if needed
- Repeats up to MAX_ROUNDS (default: 1, configurable via environment)
By default, the system performs one comprehensive research pass. For topics requiring deeper investigation, increase MAX_ROUNDS to enable multiple rounds of research:
export RR_MAX_ROUNDS=2 # Enable iterative deepening
This mimics how human researchers work: initial scan, with optional deeper dives into specific areas when needed.
3. Global Citation Management
Citations are handled at three levels:
- Source level: [S1], [S2] in summaries
- Section level: [1], [2] in section drafts
- Global level: renumbered across all sections
The CitationRegistry ensures a source cited in multiple sections gets one consistent number.
4. DSPy Signatures Over Raw Prompts
Every interaction with an LLM uses a typed signature. Benefits:
- Composability: Modules can be combined
- Optimization: GEPA can improve them automatically
- Validation: Input/output contracts are explicit
- Maintainability: Prompts are code, not strings
5. Heuristic Metrics for GEPA
Instead of expensive LLM-as-judge evaluation, the system uses:
- Regex patterns (citations, headings)
- JSON parsing (structured outputs)
- String operations (length, uniqueness)
- Set operations (diversity)
These metrics are:
- Fast: No API calls
- Deterministic: Same input = same score
- Interpretable: Clear what they measure
- Aligned: Correlate with actual quality
Performance Characteristics
Speed
- Parallel search: All queries for a section run concurrently
- Parallel writing: All sections drafted simultaneously
- Async I/O: Exa searches don’t block
- Flash-first: Default to faster Flash models for quick iteration
- Typical runtime: 2-4 minutes for a 5-section report (single round)
Cost
- Flash by default: all LLM calls use Flash in the default configuration
- Optional Pro upgrade: Set via environment variables for production
- Exa: ~20-40 searches per report (4-8 queries per section × 5 sections, 6 results each)
- Typical cost with Flash: $0.20-0.50 per report
- With Pro models: $0.80-1.50 per report
Quality
With optimization:
- Citation hygiene: >95% use numeric citations correctly
- Source quality: Exa’s neural search finds high-quality sources
- Coverage: Gap analysis ensures comprehensive research (when MAX_ROUNDS > 1)
- Coherence: ChainOfThought prompting produces well-structured prose
- Flexibility: Can upgrade to Pro models for higher-stakes reports
Extending the System
Custom Metrics
Add domain-specific evaluation:
def domain_metric(gold, pred, trace=None) -> float:
text = pred.output_markdown or ""
score = 0.0
# Check for specific terminology
if "market cap" in text.lower():
score += 0.2
# Require quantitative data
numbers = re.findall(r'\d+(?:\.\d+)?%?', text)
score += 0.3 * min(len(numbers) / 5, 1.0)
# Penalize vague language
vague_terms = ["might", "could", "possibly"]
if any(term in text.lower() for term in vague_terms):
score -= 0.2
return max(0, score)
Additional Agents
Add a fact-checking agent:
class FactCheckSig(dspy.Signature):
"""Verify factual claims against sources.
Return JSON: {verified: [], uncertain: [], incorrect: []}"""
claims = dspy.InputField()
sources = dspy.InputField()
result_json = dspy.OutputField()
async def fact_check_node(state: GraphState) -> GraphState:
md = state["report_md"]
sources = state["research"]
with dspy.context(lm=WRITER_LM):
result = FACT_CHECKER(
claims=extract_claims(md),
sources=format_sources(sources)
)
# Flag issues
data = json.loads(result.result_json)
if data.get("incorrect"):
print("⚠️ Found potentially incorrect claims")
return {"fact_check_report": data}
Multi-Language Support
Extend for non-English research:
class TranslateSig(dspy.Signature):
"""Translate content to English while preserving meaning."""
source_text = dspy.InputField()
source_language = dspy.InputField()
translated_text = dspy.OutputField()
async def search_node_multilingual(state: GraphState) -> GraphState:
# Search in target language
results = await exa_search(query, language="es")
# Translate results
for source in sources:
if detect_language(source.content) != "en":
with dspy.context(lm=RESEARCH_LM):
translated = TRANSLATOR(
source_text=source.content,
source_language="auto"
)
source.content = translated.translated_text
# Continue with normal summarization
return summarize_sources(sources)
Conclusion
This system demonstrates that agentic AI can tackle complex, multi-stage workflows when you:
- Structure the workflow: LangGraph provides clear orchestration
- Make prompts first-class: DSPy signatures are composable and optimizable
- Optimize automatically: GEPA improves prompts based on output quality
- Configure flexibly: Temperature tuning and environment-based model selection
- Implement feedback loops: Gap analysis and review-revise cycles
The result is a system that:
- Produces comprehensive, well-cited reports
- Continuously improves its prompts through GEPA
- Runs efficiently with parallel execution
- Maintains high quality through multi-stage review
- Adapts to different use cases (quick exploratory vs. deep research)
- Balances cost and quality through flexible model configuration
This architecture is applicable beyond research—any multi-agent workflow with optimization potential can benefit from the LangGraph + DSPy + GEPA stack.
Resources
- LangGraph docs: https://langchain-ai.github.io/langgraph/
- DSPy docs: https://dspy-docs.vercel.app/
- Exa API: https://docs.exa.ai/
- GEPA paper: https://arxiv.org/abs/2507.19457
The complete code for this system is available as a reference implementation. Try it with your own research topics and section templates!