Prompt-Level Game Theory: GEPA-Optimized Adversarial Debate
What happens when you point a prompt optimizer at three agents that are competing with each other?
Most prompt optimization demos show a single agent getting better at a single task — classify this email, answer this question, extract these fields. That’s useful, but it sidesteps the harder problem: what happens when multiple agents interact, and improving one changes the game for the others?
I built a multi-agent adversarial debate system with DSPy and GEPA to find out. The full code example is in the repo: GitHub Repo.
The Setup
Three agents, two rounds, one topic:
- PRO Advocate argues in favor of a proposition
- CON Advocate argues against
- Judge evaluates each round and delivers a final verdict
```
Round 1: Opening Arguments
  PRO   → opening(topic, "PRO")
  CON   → opening(topic, "CON")
  Judge → evaluate("opening", pro_arg, con_arg)

Round 2: Rebuttals
  PRO   → rebuttal(topic, "PRO", own_opening, opponent_opening)
  CON   → rebuttal(topic, "CON", own_opening, opponent_opening)
  Judge → evaluate("rebuttal", pro_rebuttal, con_rebuttal)

Final: Judge → verdict(topic, full_transcript)
```
Seven LM calls per debate, driven by six predictors (the judge’s round evaluator runs in both rounds), and all six are optimizable by GEPA. The whole thing runs as a single dspy.Module — GEPA sees the full pipeline and can reflect on any component.
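Here is a minimal sketch of how that module might be structured. The class names, signature fields, and the judge’s string signatures are my assumptions for illustration, not the repo’s exact code:

```python
import dspy

class OpeningArgument(dspy.Signature):
    """Write an opening argument for your assigned side of the debate topic."""
    topic: str = dspy.InputField()
    side: str = dspy.InputField(desc="PRO or CON")
    argument: str = dspy.OutputField()

class RebuttalArgument(dspy.Signature):
    """Rebut the opponent's opening while reinforcing your own case."""
    topic: str = dspy.InputField()
    side: str = dspy.InputField(desc="PRO or CON")
    own_opening: str = dspy.InputField()
    opponent_opening: str = dspy.InputField()
    argument: str = dspy.OutputField()

class AdversarialDebate(dspy.Module):
    def __init__(self):
        super().__init__()
        # Six predictors: two openings, two rebuttals, a shared round judge, a final judge.
        self.pro_opening = dspy.Predict(OpeningArgument)
        self.con_opening = dspy.Predict(OpeningArgument)
        self.pro_rebuttal = dspy.Predict(RebuttalArgument)
        self.con_rebuttal = dspy.Predict(RebuttalArgument)
        self.judge_round = dspy.Predict("round_name, pro_argument, con_argument -> evaluation")
        self.judge_verdict = dspy.Predict("topic, transcript -> verdict_json")

    def forward(self, topic: str):
        # Round 1: openings, then a judged evaluation of the round.
        pro_open = self.pro_opening(topic=topic, side="PRO").argument
        con_open = self.con_opening(topic=topic, side="CON").argument
        round1 = self.judge_round(round_name="opening", pro_argument=pro_open, con_argument=con_open)
        # Round 2: rebuttals conditioned on both openings, judged again.
        pro_reb = self.pro_rebuttal(topic=topic, side="PRO", own_opening=pro_open, opponent_opening=con_open).argument
        con_reb = self.con_rebuttal(topic=topic, side="CON", own_opening=con_open, opponent_opening=pro_open).argument
        round2 = self.judge_round(round_name="rebuttal", pro_argument=pro_reb, con_argument=con_reb)
        # Final verdict over the full transcript.
        transcript = "\n\n".join([pro_open, con_open, pro_reb, con_reb])
        verdict = self.judge_verdict(topic=topic, transcript=transcript)
        return dspy.Prediction(verdict=verdict.verdict_json, round1=round1.evaluation, round2=round2.evaluation)
```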
Why This Is Different
In a single-agent optimization, the feedback loop is simple: “your output was wrong, here’s why, do better.” The search space is one set of instructions.
In adversarial debate, the dynamics shift:
- Improving PRO’s arguments puts pressure on CON’s rebuttals. If PRO starts making stronger evidence-based claims, CON’s generic counter-arguments stop working. CON must now engage with specific claims.
- The judge must recalibrate. A judge that always favors verbose arguments becomes a liability when both sides get verbose. The metric penalizes lopsided judging, so GEPA must find instructions that produce fair evaluations, not just decisive ones.
- The equilibrium matters more than any individual agent. A debate where PRO scores 10/10 and CON scores 2/10 is worse (by our metric) than one where both score 7/10. The metric rewards both sides being strong.
This is prompt-level game theory. GEPA is searching for a Nash equilibrium across six instruction sets simultaneously.
The Metric: No LLM-as-Judge
We deliberately avoided using an LLM to evaluate the debate. The metric is entirely heuristic — regex patterns, term overlap, and structural checks. This keeps evaluation deterministic, fast, and cheap.
Four dimensions, weighted:
| Dimension | Weight | How It Works |
|---|---|---|
| Argument Quality | 25% | Does the argument have a thesis statement, evidence, and a conclusion? Is it within the target word count? |
| Logical Soundness | 30% | Regex-based fallacy detection — ad hominem, strawman, false dilemma, slippery slope, circular reasoning. Each hit applies a -0.08 penalty. |
| Rebuttal Engagement | 20% | Term overlap between the opponent’s argument and the rebuttal. If you don’t reference what they actually said, you’re not rebutting — you’re just re-arguing. |
| Judge Fairness | 25% | Is the score spread reasonable? Does the reasoning mention both sides? Are all required JSON keys present? |
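As an illustration of how crude-but-concrete these checks can be, here is a minimal sketch of the Logical Soundness and Rebuttal Engagement heuristics. The regex patterns, stop-word list, and helper names are illustrative assumptions rather than the repo’s exact code; only the -0.08 penalty comes from the metric described above:

```python
import re

# Illustrative fallacy patterns; the repo's actual regexes may differ.
FALLACY_PATTERNS = {
    "ad hominem": r"\b(clearly they|anyone who believes|only a fool)\b",
    "slippery slope": r"\b(inevitably leads? to|before long)\b",
    "false dilemma": r"\b(the only alternative|either we .* or we)\b",
}

def fallacy_penalty(text: str) -> float:
    """Apply a -0.08 penalty per detected fallacy pattern."""
    hits = sum(bool(re.search(p, text, re.IGNORECASE)) for p in FALLACY_PATTERNS.values())
    return -0.08 * hits

def engagement_score(opponent_argument: str, rebuttal: str) -> float:
    """Fraction of the opponent's content terms that the rebuttal actually touches."""
    stop = {"the", "a", "an", "of", "and", "or", "to", "is", "in", "that", "for"}
    opp = {w.lower() for w in re.findall(r"[a-zA-Z']+", opponent_argument)} - stop
    reb = {w.lower() for w in re.findall(r"[a-zA-Z']+", rebuttal)} - stop
    return len(opp & reb) / max(len(opp), 1)
```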
The key design choice: per-predictor feedback via pred_name. When GEPA reflects on why a debate scored poorly, each predictor gets targeted guidance. If PRO’s rebuttal didn’t engage with CON’s arguments, only the PRO rebuttal predictor hears about it — not the judge, not CON.
```python
def debate_metric(example, pred, trace=None, pred_name=None, pred_trace=None):
    # ... compute scores ...
    if pred_name is None:
        return float(total_score)  # scalar for dspy.Evaluate
    # Route feedback to the specific predictor GEPA is reflecting on
    if "rebut" in pred_name and "pro" in pred_name:
        fb = "\n".join(aq_notes + logic_notes + [engagement_fb])
        return dspy.Prediction(score=total_score, feedback=fb)
    # ... etc for each predictor
```
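From there, the metric plugs straight into the optimizer. Here is a minimal sketch of the wiring, reusing the AdversarialDebate module sketched earlier; debate_trainset is a hypothetical list of dspy.Example topics, and the argument names reflect my reading of DSPy’s GEPA API rather than the repo’s exact call:

```python
reflection_lm = dspy.LM("openai/gpt-5.4-mini")  # model used for GEPA's reflection step

optimizer = dspy.GEPA(
    metric=debate_metric,
    reflection_lm=reflection_lm,
    max_full_evals=12,  # the budget discussed in "What We Learned" below
)

optimized_debate = optimizer.compile(
    AdversarialDebate(),
    trainset=debate_trainset,  # e.g. dspy.Example(topic=...).with_inputs("topic") items
)
```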
Sample Output: Before and After
Here’s what a GEPA-optimized debate looks like on the topic “Open-source software is more reliable than proprietary software for critical infrastructure”:
PRO Opening (optimized)
Open-source software should be treated as more reliable than proprietary software for critical infrastructure because transparency and broad scrutiny improve both security and quality. First, when the source code is publicly available, defects and vulnerabilities can be found by independent experts rather than only by the original vendor. That accelerates discovery, shortens the window of exposure, and encourages timely patching — vital when critical systems face constant attack attempts…
CON Opening (optimized)
A strong critical-infrastructure policy should not presume open source is inherently more reliable than proprietary software. Reliability is determined by engineering discipline, governance, and operational practices — not by whether source code is publicly available.
First, reliability hinges on rigorous verification and validation (test coverage, regression testing, hazard analysis, and formal change control). Proprietary vendors frequently invest in systematic testing, certification support…
PRO Rebuttal (optimized)
The opponent is right that reliability depends on engineering discipline — but they overdraw the conclusion that this nullifies the “open source is more reliable” claim. Transparency changes the default environment for governance: with public code, independent verification, targeted testing, and faster vulnerability discovery are structurally easier than in closed systems…
CON Rebuttal (optimized)
Transparency and “broad scrutiny” can improve security, but they don’t automatically make open-source more reliable than proprietary. Your argument assumes the availability of source code translates into faster, safer outcomes; in practice, reliability is dominated by whether an organization can enforce disciplined release processes, controlled updates, and validated deployments…
Judge Verdict
```json
{
  "pro_total": 6.7,
  "con_total": 7.8,
  "winner": "CON",
  "reasoning": "CON more convincingly linked reliability to operational governance (release discipline, validated deployments, accountability, and maintenance stability) and rebutted PRO's shift from verifiability/security discovery to true critical-infrastructure reliability..."
}
```
Notice what GEPA achieved: both sides make substantive, evidence-based arguments. The rebuttals directly address the opponent’s specific claims (PRO acknowledges CON’s governance point before countering it; CON engages with PRO’s transparency argument head-on). The judge produces a reasoned, balanced verdict with specific justification.
What We Learned
1. Feedback granularity matters more than budget.
With six predictors, even auto="light" generated hundreds of iterations. Switching to max_full_evals=12 (about 2 optimization rounds per predictor) produced meaningful improvement without burning through API credits. The per-predictor feedback did the heavy lifting — GEPA doesn’t need many iterations when the feedback is precise.
2. Heuristic metrics are underrated. Regex-based fallacy detection sounds crude, but it gives GEPA something concrete to fix. “Possible strawman near ‘clearly they believe’” is more actionable than “your argument could be stronger.” The term-overlap engagement score is similarly simple but effective — it catches the common failure mode where a rebuttal just restates the opening instead of addressing the opponent.
3. Adversarial pressure creates better agents than isolated optimization would. If you optimized PRO alone, you’d get a good arguer. Optimize PRO against a simultaneously improving CON, though, and you get an arguer that anticipates counterarguments, preempts objections, and engages with specifics. The adversarial structure forces a kind of robustness that single-agent optimization can’t achieve.
4. Judge calibration is the hardest part. The advocates improved quickly — better structure, fewer fallacies, more engagement. The judge took longer to calibrate. Early iterations produced judges that were either too generous (both sides get 8+) or too decisive (one side always wins by a large margin). The fairness metric had to balance rewarding decisive judgment against penalizing bias.
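To make that calibration pressure concrete, here is a minimal sketch of the kind of check behind the Judge Fairness dimension. The required keys come from the sample verdict above; the thresholds and deductions are illustrative assumptions, not the repo’s values:

```python
import re

def judge_fairness(verdict: dict) -> float:
    """Score the judge on balance: 1.0 is fair and well-formed, lower means biased or malformed."""
    score = 1.0
    required = {"pro_total", "con_total", "winner", "reasoning"}
    missing = required - verdict.keys()
    if missing:
        return max(score - 0.2 * len(missing), 0.0)
    # A huge margin usually signals a lopsided judge rather than a lopsided debate.
    spread = abs(verdict["pro_total"] - verdict["con_total"])
    if spread > 3.0:      # too decisive: likely bias
        score -= 0.3
    elif spread < 0.2:    # too timid: refuses to differentiate
        score -= 0.1
    # Reward reasoning that engages with both sides by name.
    reasoning = verdict["reasoning"].lower()
    if not (re.search(r"\bpro\b", reasoning) and re.search(r"\bcon\b", reasoning)):
        score -= 0.2
    return max(score, 0.0)
```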
Try It Yourself
```bash
git clone <repo>
cd dspy-gepa-debate
uv add dspy
export OPENAI_API_KEY=sk-...
uv run adversarial_debate.py
```
The default configuration uses gpt-5.4-nano as the task model and gpt-5.4-mini for GEPA reflection. You can override these with the PRIMARY_MODEL and REFLECTION_MODEL environment variables.
The full implementation is a single 280-line Python file. No frameworks beyond DSPy. No external datasets. No LLM-as-judge. Just signatures, modules, heuristic metrics, and GEPA doing what it does best — evolving prompts through reflection.
What’s Next
Some directions we’re thinking about:
- More rounds. Three rounds with closing statements would add another adversarial pressure point — do you save your strongest argument for the close?
- Audience scoring. Replace the single judge with a panel of judges with different evaluation criteria (logic-focused, evidence-focused, rhetoric-focused). GEPA would optimize for Pareto dominance across judge types.
- Cross-topic transfer. Do optimized debate instructions generalize to unseen topics, or are they topic-specific? Early signs suggest the structural improvements (better thesis statements, more direct rebuttals) transfer well.
- Asymmetric information. Give one side access to a knowledge base (RAG) and the other side pure reasoning. How does GEPA balance the information asymmetry?
The broader point: prompt optimization isn’t just about making a single agent better at a single task. The interesting frontier is multi-agent dynamics — competition, cooperation, delegation, negotiation. Adversarial debate is the simplest version of that story, and even here, the optimization landscape is richer than you’d expect.