Testing AI Systems
Non-deterministic doesn't mean untestable
🔑 Key Concepts
- Separate concerns — Deterministic code = exact assertions. LLM outputs = semantic assertions (is_similar, not is_equal).
- Mock LLM calls — Use respx (for httpx-based clients) or pytest-httpserver (a local stub HTTP server) in unit tests. Make real API calls only in scheduled integration tests.
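- Semantic assertions — assert is_similar(response, expected, threshold=0.85), where is_similar is a helper you write yourself (for example, cosine similarity between embeddings). Use LLM-as-judge when a similarity score is too coarse for the evaluation.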
- Regression suite — 50-100 representative queries with expected behaviour. Re-run on every pipeline change.
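One way to picture the first two concepts is the minimal sketch below. It swaps in `unittest.mock` from the standard library instead of respx or pytest-httpserver, so it stubs the client object rather than the HTTP layer. The names `build_prompt`, `answer`, and the client's `complete` method are hypothetical, made up for illustration and not taken from any real SDK.

```python
from unittest.mock import Mock


def build_prompt(question: str, context: list[str]) -> str:
    """Deterministic code: the same inputs always give the same prompt."""
    joined = "\n".join(f"- {c}" for c in context)
    return f"Context:\n{joined}\n\nQuestion: {question}"


def answer(question: str, context: list[str], llm_client) -> str:
    """Send the prompt to a (hypothetical) client and tidy its reply."""
    prompt = build_prompt(question, context)
    return llm_client.complete(prompt).strip()


def test_build_prompt_is_exact():
    # Deterministic code gets an exact assertion.
    expected = "Context:\n- a\n- b\n\nQuestion: Who?"
    assert build_prompt("Who?", ["a", "b"]) == expected


def test_answer_uses_mocked_llm():
    # The LLM call is replaced by a stub, so no network access is needed.
    fake_client = Mock()
    fake_client.complete.return_value = "  Paris  "

    context = ["France is in Europe."]
    result = answer("Capital of France?", context, fake_client)

    assert result == "Paris"
    fake_client.complete.assert_called_once_with(
        build_prompt("Capital of France?", context)
    )
```

Saved as a test module, both functions would be collected by pytest. Because the stub returns a fixed string, the code around the LLM call can still be checked with exact assertions.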
💡 Practice: Try implementing each concept yourself before moving on. Reading about RAG and building RAG are very different things.
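As a possible starting point for that practice, here is a rough sketch of a semantic assertion and a tiny regression suite. Everything in it is an assumption for illustration: `toy_embed` is a bag-of-words stand-in that only measures word overlap, where a real suite would plug in an embedding model (or an LLM judge) through the `embed` parameter, and the two regression cases are invented examples, not a recommended dataset.

```python
import math
import re
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def is_similar(response: str, expected: str,
               threshold: float = 0.85, embed=toy_embed) -> bool:
    """Semantic-style assertion helper: similar enough, not equal."""
    return cosine(embed(response), embed(expected)) >= threshold


# Hypothetical regression cases; a real suite would hold 50-100 of these.
REGRESSION_CASES = [
    {"query": "What is the capital of France?",
     "expected": "The capital of France is Paris."},
    {"query": "What is 2 + 2?",
     "expected": "2 + 2 equals 4."},
]


def run_regression(pipeline, cases, threshold: float = 0.85) -> list[str]:
    """Return the queries whose answers fall below the threshold."""
    return [
        case["query"]
        for case in cases
        if not is_similar(pipeline(case["query"]), case["expected"], threshold)
    ]


def fake_pipeline(query: str) -> str:
    """Stand-in for the real pipeline, with canned answers."""
    canned = {
        "What is the capital of France?": "Paris is the capital of France.",
        "What is 2 + 2?": "2 + 2 equals 4.",
    }
    return canned.get(query, "I do not know.")


failures = run_regression(fake_pipeline, REGRESSION_CASES)
assert failures == [], f"Regressions detected: {failures}"
```

Returning the failing queries, rather than stopping at the first one, makes it easier to see how much a pipeline change broke in a single run.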