Prompt Engineering and Evaluation: Building Reliable LLM Systems
Principles of Effective Prompting
Be deliberate and specific when prompting:
- Be explicit about role, format, and constraints
- Provide examples (few-shot) and counter-examples
- Use structured outputs (JSON schemas) when integrating with code
- Separate system, developer, and user messages
Reducing Hallucinations
Ground prompts with retrieved context, force citations, and penalize unsupported claims. Prefer closed-book QA with verifiable sources.
Evaluation Harness
Build an evaluation loop: define tasks, gold references, and automatic metrics (exact match, BLEU/ROUGE/BERTScore) plus human rubrics (accuracy, helpfulness, safety).
Guardrails and Safety
Enforce safety policies with content filters, regex/JSON schema validators, and rejection sampling. Log decisions for audits.
Continuous Improvement
Collect real prompts, label failure modes, and iterate. Use A/B testing to ship improvements with confidence.
Prompt Patterns That Work
Role + Objective + Constraints: "You are a …; Your goal is …; Obey: …"
- Step-by-Step (CoT) with Verify: Ask for reasoning then require a brief final answer and a self-check.
- Decomposition: Break big tasks into subtasks (extract → plan → solve → validate).
- Delimiters: Fence inputs with triple backticks to reduce prompt injection and parsing errors.
- Schema-first: Ask for JSON that conforms to a provided schema; reject if invalid.
Structured Outputs with JSON Schema
Constrain outputs for reliable downstream use. Validate and retry on parse errors.
{
"type": "object",
"required": ["title", "summary", "facts"],
"properties": {
"title": {"type": "string", "minLength": 5},
"summary": {"type": "string"},
"facts": {"type": "array", "items": {"type": "string"}}
}
}
import json
from jsonschema import validate, ValidationError
def parse_or_retry(raw, schema):
try:
data = json.loads(raw)
validate(data, schema)
return data
except (json.JSONDecodeError, ValidationError):
return None # trigger retry with a stricter instruction
Evaluation Pipeline (Offline + Online)
- Dataset: Curate prompts with gold answers or acceptable ranges.
- Offline metrics: EM/F1 for QA, ROUGE/BLEU for summarization, judge LLMs for style and safety.
- Canary set: Small fast suite to catch regressions on each change.
- Online: A/B test variants; log prompts, outputs, feedback, and guardrail events.
- Scorecards: Track accuracy, refusal rate, toxicity, latency, cost, and context usage.
Building the Harness (Example)
from collections import defaultdict
def evaluate(model, dataset, metric_fn):
results = []
for item in dataset:
out = model(item['prompt'])
score = metric_fn(out, item['gold'])
results.append({'id': item['id'], 'score': score})
by_bucket = defaultdict(list)
for r in results:
by_bucket[item['bucket']].append(r['score'])
return {
'mean': sum(r['score'] for r in results)/len(results),
'buckets': {k: sum(v)/len(v) for k, v in by_bucket.items()}
}
Failure Mode Taxonomy
- Hallucination (unsupported claims, fabricated entities)
- Policy/safety violations (PII leak, harmful content)
- Schema violations (malformed JSON, missing fields)
- Tool misuse (wrong function/arguments)
- Context issues (truncation, wrong doc retrieved)
- Reasoning errors (math, logic, multi-hop)
Prompt Debugging Checklist
- Reduce temperature; ensure deterministic seed where available
- Add role + constraints; ask for chain-of-thought but return only final answer
- Provide counter-examples and negative constraints (what not to do)
- Use smaller context windows with summaries; pin citations to source spans
- Turn on function calling / tools for deterministic actions
- Validate with schema and retry with error messages
RAG + Prompting
Fuse retrieval and prompts: craft a retrieval-augmented template with a citation requirement and a statement to avoid answering when evidence is insufficient.
System: You are a factual assistant. Use only the context below. If the answer is not present, say "I don’t have enough evidence".
Context:
{retrieved_chunks}
User: {question}
Output format:
{
"answer": string,
"citations": [ {"chunk_id": string, "quote": string} ]
}
Production Tips
- Log everything: prompt hash, model, temperature, tokens, latency, guardrail outcomes
- Keep a replayable corpus for regression tests
- Use feature flags to roll out prompt/model variants
- Backpressure and timeouts to protect upstreams
- Create per-feature canary tests that run in CI
Further Reading
- OpenAI evals & guidance on evaluations
- HELM / BIG-bench style benchmarks
- Anthropic prompt engineering patterns
- Guardrails.ai / JSON schema validation approaches