
Intro to DSPy for Programming LLMs with Declarative Python

By Raja Patnaik

For the past few years, building with language models has meant endless prompt engineering: tweaking strings, adding examples, crossing fingers. Stanford’s DSPy framework flips this paradigm on its head with a radical idea—program your AI systems, don’t prompt them. In this deep dive, we’ll explore DSPy’s core concepts and build working examples that demonstrate its power.

What is DSPy?

DSPy (Declarative Self-improving Python) is an open-source framework created by Stanford NLP that fundamentally changes how we build AI systems. Instead of brittle prompt strings that break with every model update, DSPy lets you write compositional Python code that describes what you want your AI to do.

The magic happens through compilation: DSPy’s optimizers automatically synthesize effective prompts, generate few-shot examples, and even fine-tune model weights—all while you focus on system architecture rather than prompt archaeology.

Since its release in late 2023, DSPy has exploded to over 28,000 GitHub stars and 160,000+ monthly pip downloads. It supports any LLM through LiteLLM integration, from OpenAI and Anthropic to local models via Ollama.
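Because model access is routed through LiteLLM, switching providers is a one-line configuration change. The identifiers below follow LiteLLM's `provider/model` convention; treat the specific model names as illustrative:

```python
import dspy

# OpenAI (reads OPENAI_API_KEY from the environment by default)
lm = dspy.LM('openai/gpt-4o-mini')

# Anthropic
# lm = dspy.LM('anthropic/claude-3-5-sonnet-20240620')

# Local model served by Ollama
# lm = dspy.LM('ollama_chat/llama3', api_base='http://localhost:11434', api_key='')

dspy.configure(lm=lm)
```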

Why DSPy?

Traditional prompt engineering has fundamental problems:

  • Fragility: Change the model, change the prompts
  • Maintenance nightmare: Scattered prompt strings across codebases
  • No optimization: Manual trial and error instead of systematic improvement
  • Hard to compose: Chaining prompts becomes increasingly messy

DSPy solves these by separating interface (what you want) from implementation (how to prompt the model).
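To make the contrast concrete, here is the kind of hand-rolled prompt assembly DSPy replaces. This is a deliberately simplified, hypothetical example of the status quo, not DSPy code:

```python
def build_prompt(question: str, examples: list[tuple[str, str]]) -> str:
    """Hand-rolled prompt: every detail is a hard-coded string."""
    lines = ["Answer the question with a short factoid answer.", ""]
    for q, a in examples:
        lines.append(f"Q: {q}\nA: {a}\n")
    lines.append(f"Q: {question}\nA:")
    return "\n".join(lines)

prompt = build_prompt(
    "What is the capital of France?",
    [("What is 2 + 2?", "4")],
)
print(prompt)
```

Notice that the wording, the example selection, and the formatting all live inside this one string; change any of them and you are back to trial and error. DSPy lifts each of these into a component it can optimize.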


Core Concepts

DSPy’s programming model consists of three key abstractions:

Signatures

Signatures define the input/output behavior of a module. Think of them as type hints for AI tasks:

import dspy

# Simple string signature
"question -> answer"

# Structured signature with types
class GenerateAnswer(dspy.Signature):
    """Answer questions with short factoid answers."""

    question: str = dspy.InputField()
    answer: str = dspy.OutputField(desc="often between 1 and 5 words")

Signatures tell DSPy what transformation you want, without specifying how to prompt the model.

Modules

Modules are reusable components that implement different prompting strategies:

  • dspy.Predict: Basic prediction
  • dspy.ChainOfThought: Generates reasoning before answers
  • dspy.ReAct: Combines reasoning with tool use
  • dspy.ProgramOfThought: Generates and executes code

You compose modules into pipelines:

class QA(dspy.Module):
    def __init__(self):
        super().__init__()
        self.generate_answer = dspy.ChainOfThought("question -> answer")

    def forward(self, question):
        return self.generate_answer(question=question)

Optimizers

Optimizers (aka teleprompters) automatically improve your modules by:

  • Synthesizing few-shot examples
  • Optimizing instruction prompts
  • Fine-tuning model weights

Key optimizers include:

  • BootstrapFewShot: Generates examples from your data
  • MIPROv2: Iteratively improves prompts
  • BootstrapFinetune: Creates fine-tuning datasets

Now let’s see DSPy in action with practical examples.


Use Case 1: Text Classification with Optimization

Let’s build a sentiment classifier that improves itself through optimization. First, install DSPy:

pip install dspy

Basic Classification

import dspy
from typing import Literal

# Configure the language model
lm = dspy.LM('openai/gpt-4o-mini', api_key='YOUR_API_KEY')
dspy.configure(lm=lm)

# Define the signature
class ClassifySentiment(dspy.Signature):
    """Classify the sentiment of a given text."""

    text: str = dspy.InputField()
    sentiment: Literal["positive", "negative", "neutral"] = dspy.OutputField()

# Create a predictor
classifier = dspy.Predict(ClassifySentiment)

# Test it
result = classifier(text="This product exceeded my expectations!")
print(f"Sentiment: {result.sentiment}")
# Output: Sentiment: positive

Adding Optimization

Now let’s use DSPy’s optimizer to automatically improve our classifier with examples:

import dspy
from dspy.teleprompt import BootstrapFewShot
from typing import Literal

# Training data
train_examples = [
    dspy.Example(text="I love this product!", sentiment="positive").with_inputs("text"),
    dspy.Example(text="Terrible experience.", sentiment="negative").with_inputs("text"),
    dspy.Example(text="It's okay, nothing special.", sentiment="neutral").with_inputs("text"),
    dspy.Example(text="Absolutely amazing!", sentiment="positive").with_inputs("text"),
    dspy.Example(text="Waste of money.", sentiment="negative").with_inputs("text"),
    dspy.Example(text="Pretty standard.", sentiment="neutral").with_inputs("text"),
]

# Validation data
val_examples = [
    dspy.Example(text="Best purchase ever!", sentiment="positive").with_inputs("text"),
    dspy.Example(text="Not worth it.", sentiment="negative").with_inputs("text"),
]

# Define metric
def validate_sentiment(example, pred, trace=None):
    return example.sentiment == pred.sentiment

# Create module with Chain of Thought for better reasoning
class SentimentClassifier(dspy.Module):
    def __init__(self):
        super().__init__()
        self.classify = dspy.ChainOfThought(ClassifySentiment)

    def forward(self, text):
        return self.classify(text=text)

# Optimize with BootstrapFewShot
# (val_examples are held out for evaluating the compiled classifier)
optimizer = BootstrapFewShot(
    metric=validate_sentiment,
    max_bootstrapped_demos=3,
    max_labeled_demos=3
)

# Compile the optimized classifier
optimized_classifier = optimizer.compile(
    SentimentClassifier(),
    trainset=train_examples
)

# Test optimized version
test_text = "This exceeded all my expectations and I couldn't be happier!"
result = optimized_classifier(text=test_text)
print(f"Text: {test_text}")
print(f"Sentiment: {result.sentiment}")
print(f"Reasoning: {result.reasoning}")

The optimizer automatically selects the best examples to include in the prompt and generates effective instructions. The ChainOfThought module makes the model explain its reasoning, improving accuracy.


Use Case 2: Chain of Thought Reasoning

Chain of Thought (CoT) prompting improves performance on complex reasoning tasks. DSPy makes CoT easy:

import dspy

# Configure LM
lm = dspy.LM('openai/gpt-4o-mini')
dspy.configure(lm=lm)

# Define signature for math problems
class SolveMath(dspy.Signature):
    """Solve a math word problem step by step."""

    problem: str = dspy.InputField()
    answer: str = dspy.OutputField(desc="numeric answer")

# Create a CoT module
class MathSolver(dspy.Module):
    def __init__(self):
        super().__init__()
        self.solve = dspy.ChainOfThought(SolveMath)

    def forward(self, problem):
        return self.solve(problem=problem)

# Test it
solver = MathSolver()

problem = """A store has 48 apples in the morning. They sell 28 apples during the day.
In the evening, they receive a delivery of 35 more apples. How many apples does the store have now?"""

result = solver(problem=problem)
print(f"Problem: {problem}")
print(f"\nReasoning: {result.reasoning}")
print(f"\nAnswer: {result.answer}")

Output:

Problem: A store has 48 apples in the morning...

Reasoning: Starting with 48 apples, selling 28 leaves 48 - 28 = 20 apples.
Adding the delivery of 35 apples gives 20 + 35 = 55 apples.

Answer: 55

The reasoning field automatically contains the step-by-step chain of thought. You can inspect it for debugging or display it to users.

Comparing with Basic Prediction

# Without CoT
basic_solver = dspy.Predict(SolveMath)
basic_result = basic_solver(problem=problem)
print(f"Basic answer: {basic_result.answer}")
# May be less reliable on complex problems

# With CoT - more reliable through explicit reasoning
cot_result = solver(problem=problem)
print(f"CoT answer: {cot_result.answer}")
print(f"With reasoning: {cot_result.reasoning}")

Use Case 3: Retrieval-Augmented Generation (RAG)

RAG combines retrieval with generation for answering questions using external knowledge. Here’s a complete DSPy RAG implementation:

import dspy
from dspy.retrieve.chromadb_rm import ChromadbRM

# Setup language model
lm = dspy.LM('openai/gpt-4o-mini')
dspy.configure(lm=lm)

# Setup retriever (using ChromaDB as an example)
# In practice, you'd populate this with your documents
retriever = ChromadbRM(
    collection_name='my_documents',
    persist_directory='./chromadb',
    k=3  # retrieve top 3 passages
)
dspy.configure(rm=retriever)

# Define RAG signature
class GenerateAnswer(dspy.Signature):
    """Answer questions based on provided context."""

    context: list[str] = dspy.InputField(desc="relevant passages")
    question: str = dspy.InputField()
    answer: str = dspy.OutputField(desc="concise answer based on context")

# Build RAG module
class RAG(dspy.Module):
    def __init__(self, num_passages=3):
        super().__init__()
        self.retrieve = dspy.Retrieve(k=num_passages)
        self.generate_answer = dspy.ChainOfThought(GenerateAnswer)

    def forward(self, question):
        # Retrieve relevant passages
        context = self.retrieve(question).passages

        # Generate answer using context
        prediction = self.generate_answer(context=context, question=question)
        return dspy.Prediction(
            context=context,
            answer=prediction.answer,
rationale=prediction.reasoning
        )

# Create and test RAG system
rag = RAG(num_passages=3)
question = "What is the capital of France?"
result = rag(question=question)

print(f"Question: {question}")
print(f"\nRetrieved Context:")
for i, passage in enumerate(result.context, 1):
    print(f"{i}. {passage[:100]}...")
print(f"\nAnswer: {result.answer}")
print(f"Reasoning: {result.rationale}")

Optimizing RAG with BootstrapFewShot

from dspy.teleprompt import BootstrapFewShot

# Prepare training examples
train_qa = [
    dspy.Example(
        question="What is the capital of France?",
        answer="Paris"
    ).with_inputs("question"),
    dspy.Example(
        question="What is the largest planet in our solar system?",
        answer="Jupiter"
    ).with_inputs("question"),
    # Add more examples...
]

# Define evaluation metric
def validate_context_and_answer(example, pred, trace=None):
    # Check if answer matches
    answer_match = example.answer.lower() in pred.answer.lower()
    # Check if context is relevant (simple heuristic)
    has_context = len(pred.context) > 0
    return answer_match and has_context

# Optimize the RAG system
optimizer = BootstrapFewShot(
    metric=validate_context_and_answer,
    max_bootstrapped_demos=4
)

optimized_rag = optimizer.compile(
    RAG(),
    trainset=train_qa
)

# Test optimized system
result = optimized_rag(question="What is the capital of France?")
print(f"Optimized Answer: {result.answer}")

DSPy’s optimizer learns which retrieved passages are most useful and how to best combine them in the prompt.


Use Case 4: Multi-Step ReAct Agent

ReAct (Reasoning + Acting) combines reasoning with tool use. Let’s build an agent that can search Wikipedia and answer questions:

import dspy

# Configure LM
lm = dspy.LM('openai/gpt-4o-mini')
dspy.configure(lm=lm)

# Define a simple Wikipedia search tool
def search_wikipedia(query: str) -> str:
    """Search Wikipedia and return a summary."""
    # In practice, use the wikipedia library:
    # import wikipedia
    # return wikipedia.summary(query, sentences=3)

    # Simulated response for example
    results = {
        "Paris": "Paris is the capital and most populous city of France. Located in north-central France, it is situated on the Seine River.",
        "Python": "Python is a high-level, interpreted programming language with dynamic semantics and simple, easy-to-learn syntax.",
    }
    return results.get(query, "No information found.")

# Create ReAct agent
class WikiAgent(dspy.Module):
    def __init__(self):
        super().__init__()
        # ReAct combines reasoning with tool use
        self.react = dspy.ReAct(
            signature="question -> answer",
            tools=[search_wikipedia],
            max_iters=5
        )

    def forward(self, question):
        return self.react(question=question)

# Test the agent
agent = WikiAgent()
question = "What is the capital of France and what language do they speak there?"
result = agent(question=question)

print(f"Question: {question}")
print(f"Answer: {result.answer}")
print(f"\nThought process:")
# ReAct modules expose their reasoning trajectory
if hasattr(result, 'trajectory'):
    for key, value in result.trajectory.items():
        print(f"  {key}: {value}")

The ReAct module automatically:

  1. Reasons about what information it needs
  2. Acts by calling the search tool
  3. Observes the results
  4. Repeats until it can answer the question
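Under the hood, the pattern is a loop in which the model emits either a tool call or a final answer at each step. A framework-free sketch of that loop, with a stubbed-out "model" standing in for the LM, looks roughly like this (illustrative only; DSPy's actual implementation differs):

```python
def react_loop(question, tools, decide, max_iters=5):
    """Minimal reason-act-observe loop.

    `decide` plays the role of the LM: given the question and the
    trajectory so far, it returns either ("call", tool_name, arg)
    or ("finish", answer).
    """
    trajectory = []
    for _ in range(max_iters):
        action = decide(question, trajectory)
        if action[0] == "finish":
            return action[1], trajectory
        _, tool_name, arg = action
        observation = tools[tool_name](arg)               # act
        trajectory.append((tool_name, arg, observation))  # observe
    return "Ran out of iterations.", trajectory

# Stub "model" policy: look up the topic once, then answer.
def decide(question, trajectory):
    if not trajectory:
        return ("call", "search", "Paris")
    return ("finish", trajectory[-1][2])

tools = {"search": lambda q: f"{q} is the capital of France."}
answer, steps = react_loop("What is the capital of France?", tools, decide)
print(answer)  # Paris is the capital of France.
```

In DSPy, the `decide` step is a learned predictor, which means the optimizers from earlier can improve the agent's tool-use behavior just like any other module.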

Advanced: Multi-Tool ReAct

import dspy
import math

# Define multiple tools
def search_wikipedia(query: str) -> str:
    """Search Wikipedia."""
    return f"Wikipedia info about {query}..."

def calculator(expression: str) -> str:
    """Evaluate a math expression (demo only; use a real parser in production)."""
    try:
        return str(eval(expression, {"__builtins__": {}}, {"math": math}))
    except Exception:
        return "Invalid expression"

def get_current_date() -> str:
    """Get the current date."""
    from datetime import datetime
    return datetime.now().strftime("%Y-%m-%d")

# Multi-tool agent
class SmartAgent(dspy.Module):
    def __init__(self):
        super().__init__()
        self.react = dspy.ReAct(
            signature="question -> answer",
            tools=[search_wikipedia, calculator, get_current_date],
            max_iters=7
        )

    def forward(self, question):
        return self.react(question=question)

# Test with a question requiring multiple tools
agent = SmartAgent()
question = "What is 15% of 240, and what is today's date?"
result = agent(question=question)

print(f"Question: {question}")
print(f"Answer: {result.answer}")

The agent automatically determines which tools to use and in what order, all while maintaining a coherent reasoning process.


Best Practices

1. Start Simple, Then Optimize

# Start with basic Predict
basic = dspy.Predict("question -> answer")

# Move to ChainOfThought for complex tasks
cot = dspy.ChainOfThought("question -> answer")

# Optimize when you have data
optimized = optimizer.compile(MyModule(), trainset=train_data)

2. Use Descriptive Signatures

# Bad: Vague signature
class Bad(dspy.Signature):
    input: str = dspy.InputField()
    output: str = dspy.OutputField()

# Good: Clear intent and constraints
class Good(dspy.Signature):
    """Extract named entities from text and classify them."""

    text: str = dspy.InputField(desc="input text to analyze")
    entities: list[str] = dspy.OutputField(desc="list of named entities found")
    categories: list[str] = dspy.OutputField(desc="entity categories: PERSON, ORG, LOCATION")

3. Design Meaningful Metrics

def robust_metric(example, pred, trace=None):
    # Multiple criteria
    correct_answer = example.answer.lower() in pred.answer.lower()
    reasonable_length = 10 < len(pred.answer) < 200
    has_reasoning = len(getattr(pred, 'reasoning', '')) > 20

    # Weighted score
    score = (
        0.6 * correct_answer +
        0.2 * reasonable_length +
        0.2 * has_reasoning
    )
    return score
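To sanity-check a metric like this before handing it to an optimizer, run it on a dummy prediction. `SimpleNamespace` stands in for DSPy's `Example` and `Prediction` objects here, since the metric only reads attributes:

```python
from types import SimpleNamespace

example = SimpleNamespace(answer="Paris")
pred = SimpleNamespace(
    answer="The capital of France is Paris.",
    reasoning="France's capital has been Paris for centuries.",
)

correct_answer = example.answer.lower() in pred.answer.lower()   # True
reasonable_length = 10 < len(pred.answer) < 200                  # True
has_reasoning = len(getattr(pred, "reasoning", "")) > 20         # True

score = 0.6 * correct_answer + 0.2 * reasonable_length + 0.2 * has_reasoning
print(f"{score:.2f}")  # 1.00
```

A graded score like this gives the optimizer more signal than a binary pass/fail, since partially correct outputs still earn partial credit.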

4. Save and Load Optimized Models

# Save after optimization
optimized_module.save('models/my_optimized_rag.json')

# Load later
loaded_module = RAG()
loaded_module.load('models/my_optimized_rag.json')

5. Use Different LMs for Different Tasks

# Fast, cheap model for simple classification
cheap_lm = dspy.LM('openai/gpt-4o-mini')

# Powerful model for complex reasoning
smart_lm = dspy.LM('openai/gpt-4')

# Switch context for specific operations
with dspy.context(lm=smart_lm):
    complex_result = complex_module(hard_question)

with dspy.context(lm=cheap_lm):
    simple_result = simple_module(easy_question)

6. Monitor and Debug

# Inspect the most recent LM calls (prompts and completions)
dspy.inspect_history(n=3)

# Raw request/response history is also kept on the LM object
print(len(lm.history))

# Examine module internals
for name, predictor in module.named_predictors():
    print(name, predictor)

7. Evaluation is Critical

from dspy.evaluate import Evaluate

# Create evaluator
evaluator = Evaluate(
    devset=test_examples,
    metric=my_metric,
    num_threads=4,
    display_progress=True
)

# Evaluate model
score = evaluator(my_module)
print(f"Score: {score}")

Conclusion

DSPy represents a paradigm shift in how we build with language models. Instead of fragile prompt strings, we write composable, testable, optimizable Python code. The framework’s three core abstractions—signatures, modules, and optimizers—provide a solid foundation for building production-grade AI systems.

Key takeaways:

  1. Signatures separate interface from implementation
  2. Modules provide reusable, composable components
  3. Optimizers automatically improve your systems with data
  4. Evaluation is built-in and systematic

Whether you’re building classifiers, RAG systems, or complex multi-agent workflows, DSPy provides the tools to move from prompt hacking to systematic AI engineering.

The examples in this post are just the beginning. DSPy supports advanced patterns like:

  • Multi-hop reasoning chains
  • Self-refinement loops
  • Ensemble methods
  • Fine-tuning with synthetic data
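To give a flavor of the first of these, here is a framework-free sketch of the multi-hop pattern: each hop turns the question plus the accumulated context into a new search query. In DSPy you would express the query generator and answerer as modules; the stubs below are purely illustrative:

```python
def multi_hop(question, generate_query, retrieve, answer, hops=2):
    """Each hop generates a new query from the question plus the
    context gathered so far, then retrieves and accumulates passages."""
    context = []
    for _ in range(hops):
        query = generate_query(question, context)
        context.extend(retrieve(query))
    return answer(question, context)

# Stubs standing in for LM-backed modules and a real retriever.
facts = {
    "capital of France": ["Paris is the capital of France."],
    "Paris population": ["Paris has roughly 2.1 million inhabitants."],
}

def gen(question, context):
    # First hop resolves the entity; second hop asks about it.
    return "capital of France" if not context else "Paris population"

def ret(query):
    return facts.get(query, [])

def ans(question, context):
    return " ".join(context)

print(multi_hop("How many people live in the capital of France?", gen, ret, ans))
```

The key property is that later hops condition on what earlier hops retrieved, which is exactly what a single-shot RAG pipeline cannot do.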

Start with simple modules, add optimization when you have data, and iterate based on metrics. Your future self (and your codebase) will thank you for choosing programming over prompting.


Happy programming (not prompting)!