GEPA + DSPy for provable de‑identification: evolve prompts, enforce structure, ship with tests
This post shows how to turn de‑identification into a contract you can verify: zero PII leaks while preserving structure in incident reports. We’ll use DSPy to write the program and GEPA (Genetic‑Pareto) to evolve the program’s textual instructions from feedback - no gradient training, tiny data, any model provider. In practice: you write a metric that returns a score and textual feedback, and dspy.GEPA uses that feedback to rewrite your module’s instructions until constraints pass. (DSPy)
You can find all the code in this GitHub Repo.
TL;DR
- Goal: redact emails/phones/names → placeholders while keeping sections like Root cause: and Action items: intact.
- Method: a metric with feedback (not just a number) guides GEPA to rewrite your DSPy module’s instructions; GEPA searches a space of instruction variants and keeps Pareto‑better candidates. (DSPy)
- Why now: The GEPA paper reports stronger sample‑efficiency than popular RL approaches (e.g., GRPO) and better results than top prompt optimizers (e.g., MIPROv2) in several tasks - with far fewer rollouts. (arXiv)
What we’ll build
A small DSPy program that rewrites incident reports. It:
- Replaces PII with [EMAIL], [PHONE], [NAME]
- Preserves required sections and bullet structure
- Is automatically optimized by GEPA from a handful of examples
If you haven’t used DSPy: you declare a Signature (inputs/outputs) and wrap it in a Module (e.g., ChainOfThought) that adds a reasoning field. Then you compile it with an optimizer like GEPA. (DSPy)
Why GEPA (and not more prompt tweaking or RL)?
- Language‑native optimization: GEPA reads your metric’s feedback (“PII leaked: email; keep sections”) and proposes better instructions. No reward shaping or hand‑rolled RL loops. (DSPy)
- Pareto search: It explores multiple candidates and retains those that are not dominated (e.g., fewer leaks and better structure retention). You can also surface best outputs per input at inference‑time. (DSPy)
- Empirical edge: The 2025 preprint finds GEPA can beat RL baselines with up to 35× fewer rollouts and outperform MIPROv2 across tasks/models. (You still need to validate in your domain.) (arXiv)
A runnable minimal example
Dependencies
uv add dspy-ai gepa
# Use any LiteLLM-compatible provider; here we show OpenAI names
export OPENAI_API_KEY=... # or pass api_key in code
Configure your LM once; DSPy accepts provider/model strings like 'openai/gpt-4o-mini' and 'openai/gpt-4o'. (DSPy)
import re
import dspy
# 0) Pick task + reflection models (reflection ≈ stronger)
task_lm = dspy.LM("openai/gpt-4o-mini")
reflect_lm = dspy.LM("openai/gpt-4o")
dspy.configure(lm=task_lm)  # global default LM for modules
# 1) Signature: what the module does (not how to prompt)
class DeIDSignature(dspy.Signature):
    """Rewrite an incident report to remove PII while preserving causal structure and action items."""
    report = dspy.InputField(desc="Raw incident report text.")
    rules = dspy.InputField(desc="Redaction rules and required output format.")
    clean_report = dspy.OutputField(
        desc="Redacted report using [EMAIL], [PHONE], [NAME]. Keep 'Root cause:' + 'Action items:' and bullets."
    )
# 2) Module: we’ll let GEPA evolve its internal instructions
class DeIDProgram(dspy.Module):
    def __init__(self):
        super().__init__()
        self.rewriter = dspy.ChainOfThought(DeIDSignature)  # adds .reasoning field to the prediction

    def forward(self, report, rules):
        return self.rewriter(report=report, rules=rules)
student = DeIDProgram()
# 3) Tiny “dataset”: GEPA doesn’t require labels, just examples to evaluate on
RULES = """Redact emails, phone numbers, and full names. Use placeholders [EMAIL], [PHONE], [NAME].
Keep section headers and bullets. Output format:
Root cause: ...
Action items: ...
- bullets for action items"""
trainset = [
    dspy.Example(
        report="Root cause: Alice Chen emailed ops (alice.chen@acme.io).\nAction items:\n- Call +1 (415) 555-0199 to notify vendor.",
        rules=RULES,
    ).with_inputs("report", "rules"),
    dspy.Example(
        report="Root cause: Misconfigured S3 bucket by Bob A.\nAction items:\n- Rotate keys\n- email secops@company.com with incident ID 12345",
        rules=RULES,
    ).with_inputs("report", "rules"),
]
devset = [
    dspy.Example(
        report="Root cause: OT sensor alert phoned to 212-555-0101 by Carol Q.\nAction items:\n- File ticket\n- email ops@example.org",
        rules=RULES,
    ).with_inputs("report", "rules"),
]
# Note: .with_inputs tells DSPy which fields are inputs for evaluation/compilation.
# 4) Metric with feedback: score + *text* guidance for GEPA
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"(?:\+?\d{1,3}[-. (]*)?\d{3}[-. )]*\d{3}[-. ]*\d{4}")
NAME = re.compile(r"\b([A-Z][a-z]+ [A-Z][a-z]+)\b")
def pii_metric(gold, pred, trace=None, pred_name=None, pred_trace=None):
    text = (getattr(pred, "clean_report", "") or "").strip()
    leaks = []
    if EMAIL.search(text):
        leaks.append("email")
    if PHONE.search(text):
        leaks.append("phone")
    # Crude heuristic: if the source contained a full name, the output must use the placeholder.
    if NAME.search(gold.report) and "[NAME]" not in text:
        leaks.append("name")
    keeps_root = "Root cause:" in text
    keeps_actions = "Action items:" in text
    # Score ∈ [0,1]: 0.6 for zero leaks + 0.2 each for keeping the two sections
    score = (0.6 if not leaks else 0.0) + (0.2 if keeps_root else 0.0) + (0.2 if keeps_actions else 0.0)
    feedback = []
    if leaks:
        feedback.append(f"PII leaked: {', '.join(leaks)}. Replace PII with [EMAIL], [PHONE], [NAME].")
    if not keeps_root or not keeps_actions:
        missing = []
        if not keeps_root:
            missing.append("keep 'Root cause:'")
        if not keeps_actions:
            missing.append("keep 'Action items:'")
        feedback.append("Also " + " and ".join(missing) + ".")
    if not feedback:
        feedback.append("Great: no PII and structure preserved. Prefer succinct edits; avoid adding facts.")
    return dspy.Prediction(score=score, feedback=" ".join(feedback))  # GEPA reads this feedback to evolve instructions.
# 5) Run GEPA (reflection model must be provided)
gepa = dspy.GEPA(
    metric=pii_metric,
    auto="light",
    reflection_lm=reflect_lm,
    track_stats=True,
    track_best_outputs=True,  # also useful as an inference-time search to surface best candidates per input
)  # See the GEPA API for params like candidate_selection_strategy='pareto'.
optimized = gepa.compile(student, trainset=trainset, valset=devset)
# 6) Try it
test_report = (
    "Root cause: Dave Miller called 650-555-0000 to report breach.\n"
    "Action items:\n- email dave@contoso.com\n- notify legal"
)
print(optimized(report=test_report, rules=RULES).clean_report)
# Optional: inspect the Pareto/best outputs per instance
# print(optimized.detailed_results.best_outputs_valset)  # requires track_best_outputs=True
How it works: ChainOfThought(DeIDSignature) adds a reasoning field to each prediction, and GEPA uses execution traces plus your metric’s text feedback to propose new instructions for that internal predictor. You don’t hand‑tune prompts; you declare a metric, and GEPA does the legwork. (DSPy)
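To see what GEPA actually changed, print the evolved instructions. A minimal sketch using named_predictors(), DSPy’s way of enumerating a module’s internal predictors:
# Inspect the instructions GEPA evolved for the internal predictor.
for name, predictor in optimized.named_predictors():
    print(f"--- {name} ---")
    print(predictor.signature.instructions)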
Designing the metric: treat privacy as testable constraints
The example metric is deliberately simple: it checks for absence of obvious emails/phones and presence of required headings, then summarizes violations in plain English. That text is the “teacher’s note” GEPA consumes to propose better instructions.
Extensions you might add:
- High‑recall PII checks. Swap regex for a hybrid: deterministic patterns + a lightweight NER (e.g., names, orgs) + an LM‑as‑judge to catch edge cases; see the sketch after this list. (GEPA supports multi‑component feedback; just keep the message specific.) (DSPy)
- Semantic invariants. Penalize if the rewrite changes causal claims (e.g., negates the root cause). This can be a second sub‑score described in the feedback string (“preserve causal statement; don’t add new actors”).
- Formatting constraints. Require bullets under Action items: and cap length (tokens or characters).
- Adversarial tests. Include tricky inputs (e.g., obfuscated emails) in valset to harden the instruction set.
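For the first bullet, here is a minimal sketch of a hybrid checker, assuming spaCy with its en_core_web_sm model is installed; EMAIL and PHONE are the regexes defined earlier:
import spacy

# Assumes: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

def find_pii_leaks(text: str) -> list[str]:
    """Deterministic regexes first, then NER for names the patterns miss."""
    leaks = []
    if EMAIL.search(text):
        leaks.append("email")
    if PHONE.search(text):
        leaks.append("phone")
    # NER pass: any PERSON entity that isn't the placeholder counts as a leak.
    for ent in nlp(text).ents:
        if ent.label_ == "PERSON" and ent.text != "[NAME]":
            leaks.append(f"name ({ent.text})")
    return leaks
The returned strings drop straight into the feedback message, which keeps GEPA’s “teacher’s note” specific.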
Because DSPy Example objects make inputs explicit (via .with_inputs), building and mutating tiny evaluation sets is straightforward. (DSPy)
Why this approach is unusually practical
- Provider‑agnostic. DSPy lets you configure any LiteLLM‑compatible provider/model with dspy.LM('provider/model'), and swap models globally via dspy.configure (see the example below). Start small (cheap mini models) for iteration; use a stronger model for reflection. (DSPy)
- Few examples. You don’t need labels - just inputs the metric can score.
- Observable search. Set track_best_outputs=True and keep the “best so far” outputs or inspect the Pareto frontier to understand trade‑offs across inputs. (DSPy)
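For example, moving redaction onto a locally served model is a one‑line change. A sketch, where the ollama_chat/llama3.1 model name and api_base are assumptions about your own deployment:
# Swap the global default LM to a local endpoint (e.g., Ollama via LiteLLM).
local_lm = dspy.LM("ollama_chat/llama3.1", api_base="http://localhost:11434")
dspy.configure(lm=local_lm)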
What to expect when you run it
With a handful of examples, GEPA typically converges in a few dozen metric calls. You’ll often see the evolved instructions start to:
- Explicitly call out replacements (“replace emails with [EMAIL]”),
- Repeat structure requirements (“keep the Root cause: header verbatim”),
- Add guard‑phrases about not inventing facts.
If a specific input still leaks, add it to valset and rerun compile, as sketched below - the new failure will be folded into the reflective updates.
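Concretely, the loop is two steps; leaky_report here is a hypothetical failure you observed:
# A case the current program still leaks on (hypothetical example).
leaky_report = "Root cause: d.miller(at)contoso.com paged on-call.\nAction items:\n- follow up"
devset.append(dspy.Example(report=leaky_report, rules=RULES).with_inputs("report", "rules"))

# Re-compile: GEPA folds the new failure into its reflective updates.
optimized = gepa.compile(student, trainset=trainset, valset=devset)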
Productionizing checklist
- Trace & metrics: Turn on DSPy/MLflow or other tracing to log prompts, costs, and metric scores across candidates; GEPA exposes detailed run metadata, including the number of metric calls and best‑per‑instance outputs. (DSPy)
- Hard fail gates: Keep a final deterministic redactor (regex/NER) after the model as a belt‑and‑suspenders guard; see the sketch after this checklist.
- Data governance: Don’t ship user PII to third‑party endpoints; prefer local LMs for redaction or use privacy‑preserving gateways. (DSPy makes swapping endpoints trivial.) (DSPy)
- Known rough edges: GEPA is evolving quickly; multimodal runs have had memory‑leak issues reported in the wild - test budgets and monitor. (GitHub)
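A minimal version of that last‑pass guard, reusing the regexes defined above (the NAME pattern is a crude heuristic, so tune it before relying on it):
def hard_redact(text: str) -> str:
    """Deterministic belt-and-suspenders pass applied after the model."""
    text = EMAIL.sub("[EMAIL]", text)
    text = PHONE.sub("[PHONE]", text)
    text = NAME.sub("[NAME]", text)
    return text

safe_output = hard_redact(optimized(report=test_report, rules=RULES).clean_report)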
Beyond incident reports: other ideas
- Safety checklists: Evolve instructions until every output contains mandatory bullets (e.g., “lockout/tagout”, “PPE”).
- Terms‑to‑plain‑English: Rewrite clauses for readability but assert a semantic fidelity metric.
- Medical device logs: Mask identifiers while preserving error codes and timelines (regulatory review friendliness).
All reuse the same pattern: declare a Signature → write a feedback‑rich metric → run GEPA. The skeleton below shows the shape.
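A sketch of that skeleton; check_constraints, examples, and holdout are placeholders for your own task:
class TaskSignature(dspy.Signature):
    """Describe what the rewrite must achieve."""
    source = dspy.InputField()
    rewritten = dspy.OutputField()

def task_metric(gold, pred, trace=None, pred_name=None, pred_trace=None):
    violations = check_constraints(pred.rewritten)  # your deterministic tests (placeholder)
    return dspy.Prediction(
        score=1.0 if not violations else 0.0,
        feedback="; ".join(violations) or "All constraints satisfied.",
    )

optimizer = dspy.GEPA(metric=task_metric, auto="light", reflection_lm=reflect_lm)
optimized = optimizer.compile(dspy.ChainOfThought(TaskSignature), trainset=examples, valset=holdout)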
Appendix: a slightly stricter composite metric (drop‑in)
def composite_pii_metric(gold, pred, trace=None, pred_name=None, pred_trace=None):
    text = (getattr(pred, "clean_report", "") or "").strip()
    issues = []
    # 1) PII leak checks (extend with better detectors as needed)
    leaks = []
    if EMAIL.search(text): leaks.append("email")
    if PHONE.search(text): leaks.append("phone")
    if NAME.search(gold.report) and "[NAME]" not in text: leaks.append("name")
    if leaks: issues.append(f"PII leaked: {', '.join(leaks)}; replace with placeholders.")
    # 2) Structure invariants
    if "Root cause:" not in text: issues.append("Missing header: 'Root cause:'.")
    if "Action items:" not in text: issues.append("Missing header: 'Action items:'.")
    # 3) Formatting: require bullets for action items
    if "Action items:" in text:
        after = text.split("Action items:", 1)[1]
        if "-" not in after and "\n•" not in after:
            issues.append("Action items must be bulleted with '-' or '•'.")
    # 4) No fabrication: flag emails/phones absent from the source, i.e., newly invented PII
    #    (heuristic: exact substring match against the original report)
    fabricated = [m for m in EMAIL.findall(text) + PHONE.findall(text) if m not in gold.report]
    if fabricated: issues.append("Do not introduce new PII; use placeholders only.")
    # Score scheme
    base = 1.0
    penalty = 0.25 * len(issues)  # tune per your tolerance
    score = max(0.0, base - penalty)
    feedback = " ".join(issues) if issues else (
        "Great: no leaks, headers intact, bullets present; keep edits minimal and factual."
    )
    return dspy.Prediction(score=score, feedback=feedback)
Swap this into dspy.GEPA(metric=composite_pii_metric, ...) when you need tighter guarantees. (GEPA’s API supports exactly this kind of textual feedback; see docs.) (DSPy)
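For completeness, the swap is a one‑liner, reusing the student and datasets from earlier:
gepa_strict = dspy.GEPA(metric=composite_pii_metric, auto="light", reflection_lm=reflect_lm)
optimized_strict = gepa_strict.compile(student, trainset=trainset, valset=devset)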
References & further reading
- GEPA in DSPy: API and usage, including the feedback metric signature, Pareto selection, and track_best_outputs. (DSPy)
- DSPy LM configuration: model/provider strings and dspy.configure. (DSPy)
- DSPy ChainOfThought: adds a reasoning field and encourages stepwise outputs. (DSPy)
- DSPy Examples / .with_inputs: mark which fields are inputs for evaluation/compilation. (DSPy)
- GEPA preprint: reflective prompt evolution outperforms RL baselines with fewer rollouts; also stronger than MIPROv2 in reported settings. (Validate for your task.) (arXiv)
- Tutorials: DSPy’s GEPA tutorials (AIME, structured extraction, privacy‑conscious delegation). (DSPy)
Closing thought
We’ve been treating “privacy” as a soft request to the model. With DSPy + GEPA you can make it a hard requirement: design the contract (your metric), let GEPA evolve the instructions to satisfy it, and keep a deterministic last pass as a guard. That’s not vibes - that’s engineering.