Chunking Strategies
How you split documents determines RAG quality
🔑 Key Concepts
- Why chunk? — LLMs have context limits. You can't feed a 100-page PDF. Chunking creates retrievable pieces that fit the window.
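- Recursive splitting — RecursiveCharacterTextSplitter (LangChain) tries separators in order: double newline, then single newline, then space, and finally individual characters, recursing only on pieces that are still too large. A good default for most prose.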
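- Chunk size — Start with 500-1000 tokens and 10-20% overlap (for example, 50-100 tokens of overlap on a 500-token chunk), then tune against your own retrieval results. Note that RecursiveCharacterTextSplitter counts characters by default, not tokens. Too small = lost context. Too big = diluted relevance.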
- Metadata — Always attach: source file, page number, section header, chunk index. Critical for citation and filtering.
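The concepts above can be sketched in plain Python with no dependencies. This is a simplified illustration of the recursive idea, not LangChain's actual implementation. The names recursive_split, merge_with_overlap, chunk_document, and Chunk are hypothetical, as are the file name and section label in the usage lines. Sizes are counted in characters to keep the sketch self-contained, so a tokenizer-based length function would be needed to match the token figures above. Pieces are rejoined with a single space, which drops the original paragraph breaks.

```python
from dataclasses import dataclass, field

# Separators tried in order, coarsest first; "" means "cut at any character".
SEPARATORS = ["\n\n", "\n", " ", ""]


def recursive_split(text, max_len, separators=SEPARATORS):
    """Break text into pieces of at most max_len characters,
    preferring the coarsest separator that works."""
    if len(text) <= max_len:
        return [text] if text.strip() else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every max_len characters.
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]
    sep, rest = separators[0], separators[1:]
    pieces = []
    for part in text.split(sep):
        if len(part) <= max_len:
            if part.strip():
                pieces.append(part)
        else:
            # Still too large: retry this part with the next, finer separator.
            pieces.extend(recursive_split(part, max_len, rest))
    return pieces


def merge_with_overlap(pieces, chunk_size, overlap):
    """Greedily pack pieces into chunks of at most chunk_size characters,
    repeating trailing pieces (up to `overlap` characters) in the next chunk."""
    chunks, current = [], []
    for piece in pieces:
        if current and len(" ".join(current + [piece])) > chunk_size:
            chunks.append(" ".join(current))
            # Carry over trailing pieces that fit in the overlap budget.
            tail = []
            for p in reversed(current):
                if len(" ".join([p] + tail)) > overlap:
                    break
                tail.insert(0, p)
            # Trim the carried tail if it would push the next chunk over size.
            while tail and len(" ".join(tail + [piece])) > chunk_size:
                tail.pop(0)
            current = tail
        current.append(piece)
    if current:
        chunks.append(" ".join(current))
    return chunks


@dataclass
class Chunk:
    text: str
    metadata: dict = field(default_factory=dict)


def chunk_document(text, source, page=None, section=None,
                   chunk_size=800, overlap=120):
    """Split text and attach citation metadata to every chunk.
    Defaults are in characters; 120 / 800 is a 15% overlap."""
    pieces = recursive_split(text, chunk_size)
    return [
        Chunk(text=c, metadata={"source": source, "page": page,
                                "section": section, "chunk_index": i})
        for i, c in enumerate(merge_with_overlap(pieces, chunk_size, overlap))
    ]


doc = ("Intro paragraph about chunking.\n\n"
       "Second paragraph with more detail on overlap.\n\n"
       "Third paragraph on metadata.")
for chunk in chunk_document(doc, source="guide.pdf", page=1,
                            section="Chunking", chunk_size=60, overlap=15):
    print(chunk.metadata["chunk_index"], repr(chunk.text))
```

Overlap here is carried in whole pieces, so it only appears when the trailing pieces are smaller than the overlap budget. In the usage lines every paragraph is longer than 15 characters, so the chunks do not overlap. Splitting a long single-line string down to word level shows the repetition more clearly.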
💡 Practice: Try implementing each concept yourself before moving on. Reading about RAG and building RAG are very different things.