RAG App Cost Breakdown — Document Q&A at 5k Queries/Month
A RAG app serving 5k queries per month over a moderate document corpus runs $35–$240/month.
Scenario
A knowledge-base Q&A application. Users ask questions over a company document corpus (handbooks, product docs, past tickets). Each query retrieves the top 5-10 chunks (~10k tokens) and passes them with the user question to an LLM. The retrieval system prompt is 500 tokens. The same documents get retrieved repeatedly for similar queries, so 70% cache hit on the retrieval system + frequently-accessed chunks is realistic.
| Assumption | Value |
|---|---|
| Queries / month | 5,000 |
| Retrieved context | ~10,000 tokens / query |
| User question | ~500 tokens |
| Response length | ~800 tokens |
| Cache hit rate | 70% (retrieval system + hot chunks) |
Embedding costs (for indexing the corpus once) are a one-time $5-20 and not included here. Vector database costs (Pinecone, Qdrant) are also separate — budget $0-25/month for self-hosted, $70+ for managed.
Monthly cost across recommended models
Calculated at 53M input tokens + 4.0M output tokens, with 70% prompt cache hit rate.
| Model | Input cost | Output cost | Cache savings | Total / mo |
|---|---|---|---|---|
| Deepseek Chat Cheapest | $14.70 | $1.68 | −$9.26 | $7.12 |
| Gpt 5 Mini | $13.13 | $8.00 | −$8.27 | $12.86 |
| Gemini 2.5 Flash | $15.75 | $10.00 | −$9.92 | $15.83 |
| Claude Haiku 4 5 | $52.50 | $20.00 | −$33.08 | $39.42 |
💡 Switching from Claude Haiku 4 5 to Deepseek Chat saves $32.31/month (82% reduction).
Why these models
RAG needs long input context and benefits massively from caching. Gemini 2.5 Flash wins on context length (1M tokens) — useful when retrieved chunks are large. Claude Haiku 4.5 wins on caching efficiency (90% off cached input). GPT-5 Mini balances quality and price. DeepSeek wins on raw cost but has shorter context (~64k typical).
Key insights
- 1. Caching is the single biggest cost lever for RAG. At 70% hit rate the bill drops by 50-60%.
- 2. Consider hybrid setups: use Gemini Flash for long-context queries, Claude Haiku for cached system prompts, route per query.
- 3. Token usage is dominated by retrieved chunks (10× the user question). Improving retrieval quality (smaller, more relevant chunks) cuts costs faster than switching models.
- 4. For high-volume RAG (>100k queries/month), evaluate self-hosting an open model on Fireworks or Together — can be 2-5× cheaper.
Cost at different scales
| Scale | Deepseek Chat | Gpt 5 Mini | Gemini 2.5 Flash | Claude Haiku 4 5 |
|---|---|---|---|---|
| Small team (500 queries) | $0.71 | $1.29 | $1.58 | $3.94 |
| Baseline (5k queries) | $7.12 | $12.86 | $15.83 | $39.42 |
| Growing product (50k queries) | $71.19 | $129 | $158 | $394 |
| Enterprise (500k queries) | $712 | $1286 | $1583 | $3943 |
Try your own scenario
The numbers above use our best-guess assumptions. For your actual workflow, use the interactive calculator to plug in your real token volumes and quality requirements.
All cost figures are estimates based on publicly-listed pricing as of the data refresh date. Verify with the provider's official pricing page before making business decisions. Embedding costs, vector database costs, and infrastructure costs are not included unless explicitly noted.