GEPA + DSPy for provable de‑identification: evolve prompts, enforce structure, ship with tests
This post shows how to turn de‑identification into a contract you can verify: zero PII leaks while preserving structure in incident reports. We’ll use DSPy to write the program and GEPA (Genetic‑Pareto) to evolve the program’s textual instructions from feedback - no gradient training, tiny data, any model provider. In practice: you write a metric that returns a score and textual feedback, and dspy.GEPA uses that feedback to rewrite your module’s instructions until constraints pass. (DSPy)
You can find all the code in this GitHub Repo.
TL;DR
- Goal: redact emails/phones/names → placeholders while keeping sections like Root cause: and Action items: intact.
- Method: a metric with feedback (not just a number) guides GEPA to rewrite your DSPy module’s instructions; GEPA searches a space of instruction variants and keeps Pareto‑better candidates. (DSPy)
- Why now: The GEPA paper reports stronger sample‑efficiency than popular RL approaches (e.g., GRPO) and better results than top prompt optimizers (e.g., MIPROv2) in several tasks - with far fewer rollouts. (arXiv)
What we’ll build
A small DSPy program that rewrites incident reports. It:
- Replaces PII with [EMAIL], [PHONE], [NAME]
- Preserves required sections and bullet structure
- Is automatically optimized by GEPA from a handful of examples
If you haven’t used DSPy: you declare a Signature (inputs/outputs) and wrap it in a Module (e.g., ChainOfThought) that adds a reasoning field. Then you compile it with an optimizer like GEPA. (DSPy)
Why GEPA (and not more prompt tweaking or RL)?
- Language‑native optimization: GEPA reads your metric’s feedback (“PII leaked: email; keep sections”) and proposes better instructions. No reward shaping or hand‑rolled RL loops. (DSPy)
- Pareto search: It explores multiple candidates and retains those that are not dominated (e.g., fewer leaks and better structure retention). You can also surface best outputs per input at inference‑time. (DSPy)
- Empirical edge: The 2025 preprint finds GEPA can beat RL baselines with up to 35× fewer rollouts and outperform MIPROv2 across tasks/models. (You still need to validate in your domain.) (arXiv)
A runnable minimal example
Dependencies
uv add dspy-ai gepa
# Use any LiteLLM-compatible provider; here we show OpenAI names
export OPENAI_API_KEY=... # or pass api_key in code
Configure your LM once; DSPy accepts provider/model strings like 'openai/gpt-4o-mini' and 'openai/gpt-4o'. (DSPy)
import re
import dspy
# 0) Pick task + reflection models (reflection ≈ stronger)
task_lm = dspy.LM("openai/gpt-4o-mini")
reflect_lm = dspy.LM("openai/gpt-4o")
dspy.configure(lm=task_lm)  # global default LM for modules
# 1) Signature: what the module does (not how to prompt)
class DeIDSignature(dspy.Signature):
    """Rewrite an incident report to remove PII while preserving causal structure and action items."""
    report = dspy.InputField(desc="Raw incident report text.")
    rules = dspy.InputField(desc="Redaction rules and required output format.")
    clean_report = dspy.OutputField(
        desc="Redacted report using [EMAIL], [PHONE], [NAME]. Keep 'Root cause:' + 'Action items:' and bullets."
    )
# 2) Module: we’ll let GEPA evolve its internal instructions
class DeIDProgram(dspy.Module):
    def __init__(self):
        super().__init__()
        self.rewriter = dspy.ChainOfThought(DeIDSignature)  # adds .reasoning field to the prediction

    def forward(self, report, rules):
        return self.rewriter(report=report, rules=rules)
student = DeIDProgram()
# 3) Tiny “dataset”: GEPA doesn’t require labels, just examples to evaluate on
RULES = """Redact emails, phone numbers, and full names. Use placeholders [EMAIL], [PHONE], [NAME].
Keep section headers and bullets. Output format:
Root cause: ...
Action items: ...
- bullets for action items"""
trainset = [
    dspy.Example(
        report="Root cause: Alice Chen emailed ops (alice.chen@acme.io).\nAction items:\n- Call +1 (415) 555-0199 to notify vendor.",
        rules=RULES,
    ).with_inputs("report", "rules"),
    dspy.Example(
        report="Root cause: Misconfigured S3 bucket by Bob A.\nAction items:\n- Rotate keys\n- email secops@company.com with incident ID 12345",
        rules=RULES,
    ).with_inputs("report", "rules"),
]
devset = [
    dspy.Example(
        report="Root cause: OT sensor alert phoned to 212-555-0101 by Carol Q.\nAction items:\n- File ticket\n- email ops@example.org",
        rules=RULES,
    ).with_inputs("report", "rules"),
]
# Note: .with_inputs tells DSPy which fields are inputs for evaluation/compilation.
# 4) Metric with feedback: score + *text* guidance for GEPA
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"(?:\+?\d{1,3}[-. (]*)?\d{3}[-. )]*\d{3}[-. ]*\d{4}")
NAME = re.compile(r"\b([A-Z][a-z]+ [A-Z][a-z]+)\b")
def pii_metric(gold, pred, trace=None, pred_name=None, pred_trace=None):
    text = (getattr(pred, "clean_report", "") or "").strip()
    leaks = []
    if EMAIL.search(text):
        leaks.append("email")
    if PHONE.search(text):
        leaks.append("phone")
    # Crude heuristic: if the source contained a full name, the output must use the placeholder.
    if NAME.search(gold.report) and "[NAME]" not in text:
        leaks.append("name")
    keeps_root = "Root cause:" in text
    keeps_actions = "Action items:" in text
    # Score ∈ [0,1]: 0.6 for zero leaks + 0.2 each for keeping the two sections
    score = (0.6 if not leaks else 0.0) + (0.2 if keeps_root else 0.0) + (0.2 if keeps_actions else 0.0)
    feedback = []
    if leaks:
        feedback.append(f"PII leaked: {', '.join(leaks)}. Replace PII with [EMAIL], [PHONE], [NAME].")
    if not keeps_root or not keeps_actions:
        missing = []
        if not keeps_root:
            missing.append("keep 'Root cause:'")
        if not keeps_actions:
            missing.append("keep 'Action items:'")
        feedback.append("Also " + " and ".join(missing) + ".")
    if not feedback:
        feedback.append("Great: no PII and structure preserved. Prefer succinct edits; avoid adding facts.")
    return dspy.Prediction(score=score, feedback=" ".join(feedback))  # GEPA reads this feedback to evolve instructions.
# 5) Run GEPA (reflection model must be provided)
gepa = dspy.GEPA(
    metric=pii_metric,
    auto="light",
    reflection_lm=reflect_lm,
    track_stats=True,
    track_best_outputs=True,  # also useful as an inference-time search to surface best candidates per input
)  # See the GEPA API for params like candidate_selection_strategy='pareto'.
optimized = gepa.compile(student, trainset=trainset, valset=devset)
# 6) Try it
test_report = (
    "Root cause: Dave Miller called 650-555-0000 to report breach.\n"
    "Action items:\n- email dave@contoso.com\n- notify legal"
)
print(optimized(report=test_report, rules=RULES).clean_report)
# Optional: inspect the Pareto/best outputs per instance
# print(optimized.detailed_results.best_outputs_valset)  # requires track_best_outputs=True
How it works: ChainOfThought(DeIDSignature) adds a reasoning field to each prediction, and GEPA uses execution traces plus your metric’s text feedback to propose new instructions for that internal predictor. You don’t hand‑tune prompts; you declare a metric, and GEPA does the legwork. (DSPy)
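To see what GEPA actually changed, print the evolved instructions. A minimal sketch using named_predictors(), DSPy’s way of enumerating a module’s internal predictors:
# Inspect the instructions GEPA evolved for the internal predictor.
for name, predictor in optimized.named_predictors():
    print(f"--- {name} ---")
    print(predictor.signature.instructions)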
Designing the metric: treat privacy as testable constraints
The example metric is deliberately simple: it checks for absence of obvious emails/phones and presence of required headings, then summarizes violations in plain English. That text is the “teacher’s note” GEPA consumes to propose better instructions.
Extensions you might add:
- High‑recall PII checks. Swap regex for a hybrid: deterministic patterns + a lightweight NER (e.g., names, orgs) + an LM‑as‑judge to catch edge cases; see the sketch after this list. (GEPA supports multi‑component feedback; just keep the message specific.) (DSPy)
- Semantic invariants. Penalize if the rewrite changes causal claims (e.g., negates the root cause). This can be a second sub‑score described in the feedback string (“preserve causal statement; don’t add new actors”).
- Formatting constraints. Require bullets under Action items: and cap length (tokens or characters).
- Adversarial tests. Include tricky inputs (e.g., obfuscated emails) in valset to harden the instruction set.
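For the first bullet, here is a minimal sketch of a hybrid checker, assuming spaCy with its en_core_web_sm model is installed; EMAIL and PHONE are the regexes defined earlier:
import spacy

# Assumes: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

def find_pii_leaks(text: str) -> list[str]:
    """Deterministic regexes first, then NER for names the patterns miss."""
    leaks = []
    if EMAIL.search(text):
        leaks.append("email")
    if PHONE.search(text):
        leaks.append("phone")
    # NER pass: any PERSON entity that isn't the placeholder counts as a leak.
    for ent in nlp(text).ents:
        if ent.label_ == "PERSON" and ent.text != "[NAME]":
            leaks.append(f"name ({ent.text})")
    return leaks
The returned strings drop straight into the feedback message, which keeps GEPA’s “teacher’s note” specific.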
Because DSPy Example objects make inputs explicit (via .with_inputs), building and mutating tiny evaluation sets is straightforward. (DSPy)
Why this approach is unusually practical
- Provider‑agnostic. DSPy lets you configure any LiteLLM‑compatible provider/model with dspy.LM('provider/model'), and swap models globally via dspy.configure (see the example below). Start small (cheap mini models) for iteration; use a stronger model for reflection. (DSPy)
- Few examples. You don’t need labels - just inputs the metric can score.
- Observable search. Set track_best_outputs=True and keep the “best so far” outputs or inspect the Pareto frontier to understand trade‑offs across inputs. (DSPy)
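For example, moving redaction onto a locally served model is a one‑line change. A sketch, where the ollama_chat/llama3.1 model name and api_base are assumptions about your own deployment:
# Swap the global default LM to a local endpoint (e.g., Ollama via LiteLLM).
local_lm = dspy.LM("ollama_chat/llama3.1", api_base="http://localhost:11434")
dspy.configure(lm=local_lm)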
What to expect when you run it
With a handful of examples, GEPA typically converges in a few dozen metric calls. You’ll often see the evolved instructions start to:
- Explicitly call out replacements (“replace emails with [EMAIL]”),
- Repeat structure requirements (“keep the Root cause: header verbatim”),
- Add guard‑phrases about not inventing facts.
If a specific input still leaks, add it to valset and rerun compile, as sketched below - the new failure will be folded into the reflective updates.
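Concretely, the loop is two steps; leaky_report here is a hypothetical failure you observed:
# A case the current program still leaks on (hypothetical example).
leaky_report = "Root cause: d.miller(at)contoso.com paged on-call.\nAction items:\n- follow up"
devset.append(dspy.Example(report=leaky_report, rules=RULES).with_inputs("report", "rules"))

# Re-compile: GEPA folds the new failure into its reflective updates.
optimized = gepa.compile(student, trainset=trainset, valset=devset)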
Productionizing checklist
- Trace & metrics: Turn on DSPy/MLflow or other tracing to log prompts, costs, and metric scores across candidates; GEPA exposes detailed run metadata, including the number of metric calls and best‑per‑instance outputs. (DSPy)
- Hard fail gates: Keep a final deterministic redactor (regex/NER) after the model as a belt‑and‑suspenders guard; see the sketch after this checklist.
- Data governance: Don’t ship user PII to third‑party endpoints; prefer local LMs for redaction or use privacy‑preserving gateways. (DSPy makes swapping endpoints trivial.) (DSPy)
- Known rough edges: GEPA is evolving quickly; multimodal runs have had memory‑leak issues reported in the wild - test budgets and monitor. (GitHub)
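A minimal version of that last‑pass guard, reusing the regexes defined above (the NAME pattern is a crude heuristic, so tune it before relying on it):
def hard_redact(text: str) -> str:
    """Deterministic belt-and-suspenders pass applied after the model."""
    text = EMAIL.sub("[EMAIL]", text)
    text = PHONE.sub("[PHONE]", text)
    text = NAME.sub("[NAME]", text)
    return text

safe_output = hard_redact(optimized(report=test_report, rules=RULES).clean_report)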
Beyond incident reports: other ideas
- Safety checklists: Evolve instructions until every output contains mandatory bullets (e.g., “lockout/tagout”, “PPE”).
- Terms‑to‑plain‑English: Rewrite clauses for readability but assert a semantic fidelity metric.
- Medical device logs: Mask identifiers while preserving error codes and timelines (regulatory review friendliness).
All reuse the same pattern: declare a Signature → write a feedback‑rich metric → run GEPA. The skeleton below shows the shape.
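A sketch of that skeleton; check_constraints, examples, and holdout are placeholders for your own task:
class TaskSignature(dspy.Signature):
    """Describe what the rewrite must achieve."""
    source = dspy.InputField()
    rewritten = dspy.OutputField()

def task_metric(gold, pred, trace=None, pred_name=None, pred_trace=None):
    violations = check_constraints(pred.rewritten)  # your deterministic tests (placeholder)
    return dspy.Prediction(
        score=1.0 if not violations else 0.0,
        feedback="; ".join(violations) or "All constraints satisfied.",
    )

optimizer = dspy.GEPA(metric=task_metric, auto="light", reflection_lm=reflect_lm)
optimized = optimizer.compile(dspy.ChainOfThought(TaskSignature), trainset=examples, valset=holdout)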
Appendix: a slightly stricter composite metric (drop‑in)
def composite_pii_metric(gold, pred, trace=None, pred_name=None, pred_trace=None):
    text = (getattr(pred, "clean_report", "") or "").strip()
    issues = []
    # 1) PII leak checks (extend with better detectors as needed)
    leaks = []
    if EMAIL.search(text): leaks.append("email")
    if PHONE.search(text): leaks.append("phone")
    if NAME.search(gold.report) and "[NAME]" not in text: leaks.append("name")
    if leaks: issues.append(f"PII leaked: {', '.join(leaks)}; replace with placeholders.")
    # 2) Structure invariants
    if "Root cause:" not in text: issues.append("Missing header: 'Root cause:'.")
    if "Action items:" not in text: issues.append("Missing header: 'Action items:'.")
    # 3) Formatting: require bullets for action items
    if "Action items:" in text:
        after = text.split("Action items:", 1)[1]
        if "-" not in after and "\n•" not in after:
            issues.append("Action items must be bulleted with '-' or '•'.")
    # 4) No fabrication: flag emails/phones absent from the source, i.e., newly invented PII
    #    (heuristic: exact substring match against the original report)
    fabricated = [m for m in EMAIL.findall(text) + PHONE.findall(text) if m not in gold.report]
    if fabricated: issues.append("Do not introduce new PII; use placeholders only.")
    # Score scheme
    base = 1.0
    penalty = 0.25 * len(issues)  # tune per your tolerance
    score = max(0.0, base - penalty)
    feedback = " ".join(issues) if issues else (
        "Great: no leaks, headers intact, bullets present; keep edits minimal and factual."
    )
    return dspy.Prediction(score=score, feedback=feedback)
Swap this into dspy.GEPA(metric=composite_pii_metric, ...) when you need tighter guarantees. (GEPA’s API supports exactly this kind of textual feedback; see docs.) (DSPy)
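For completeness, the swap is a one‑liner, reusing the student and datasets from earlier:
gepa_strict = dspy.GEPA(metric=composite_pii_metric, auto="light", reflection_lm=reflect_lm)
optimized_strict = gepa_strict.compile(student, trainset=trainset, valset=devset)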
References & further reading
- GEPA in DSPy: API and usage, including the feedback metric signature, Pareto selection, and track_best_outputs. (DSPy)
- DSPy LM configuration: model/provider strings and dspy.configure. (DSPy)
- DSPy ChainOfThought: adds a reasoning field and encourages stepwise outputs. (DSPy)
- DSPy Examples / .with_inputs: mark which fields are inputs for evaluation/compilation. (DSPy)
- GEPA preprint: reflective prompt evolution outperforms RL baselines with fewer rollouts; also stronger than MIPROv2 in reported settings. (Validate for your task.) (arXiv)
- Tutorials: DSPy’s GEPA tutorials (AIME, structured extraction, privacy‑conscious delegation). (DSPy)
Closing thought
We’ve been treating “privacy” as a soft request to the model. With DSPy + GEPA you can make it a hard requirement: design the contract (your metric), let GEPA evolve the instructions to satisfy it, and keep a deterministic last pass as a guard. That’s not vibes - that’s engineering.