Testing AI Systems

Non-deterministic doesn't mean untestable

🔑 Key Concepts

Separate concerns — Deterministic code = exact assertions. LLM outputs = semantic assertions (is_similar, not is_equal).
Mock LLM calls — Use respx or pytest-httpserver in unit tests. Real API calls only in scheduled integration tests.
Semantic assertions — assert is_similar(response, expected, threshold=0.85). Use LLM-as-judge for evaluation.
Regression suite — 50-100 representative queries with expected behaviour. Re-run on every pipeline change.

💡 Practice: Try implementing each concept yourself before moving on. Reading about RAG and building RAG are very different things.