RAG App Cost Breakdown — Document Q&A at 5k Queries/Month

A RAG app serving 5k queries per month over a moderate document corpus runs $35–$240/month.

Cost range: $35–$240/mo

Scenario

A knowledge-base Q&A application. Users ask questions over a company document corpus (handbooks, product docs, past tickets). Each query retrieves the top 5-10 chunks (~10k tokens) and passes them to an LLM together with the user question. The retrieval system prompt is 500 tokens. Similar queries retrieve the same documents repeatedly, so a 70% cache hit rate across the system prompt and frequently accessed chunks is realistic.

Assumption	Value
Queries / month	5,000
Retrieved context	~10,000 tokens / query
User question	~500 tokens
Response length	~800 tokens
Cache hit rate	70% (system prompt + hot chunks)
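
The monthly token volumes follow directly from these assumptions. A minimal sketch in Python, assuming input per query is the retrieved context plus the user question (10,500 tokens, which is what the 53M input figure used later appears to correspond to); the constant and function names are illustrative only:

```python
# Workload assumptions copied from the table above.
QUERIES_PER_MONTH = 5_000
CONTEXT_TOKENS = 10_000   # retrieved chunks per query
QUESTION_TOKENS = 500     # user question per query
RESPONSE_TOKENS = 800     # model output per query


def monthly_tokens(queries=QUERIES_PER_MONTH):
    """Return (input_tokens, output_tokens) for one month of traffic."""
    input_tokens = queries * (CONTEXT_TOKENS + QUESTION_TOKENS)
    output_tokens = queries * RESPONSE_TOKENS
    return input_tokens, output_tokens


input_tokens, output_tokens = monthly_tokens()
print(f"input:  {input_tokens / 1e6:.1f}M tokens")   # input:  52.5M tokens
print(f"output: {output_tokens / 1e6:.1f}M tokens")  # output: 4.0M tokens
```

Passing a different `queries` value gives the volumes for other traffic levels.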

Embedding costs (for indexing the corpus once) are a one-time $5-20 and not included here. Vector database costs (Pinecone, Qdrant) are also separate — budget $0-25/month for self-hosted, $70+ for managed.

Monthly cost across recommended models

Calculated at 52.5M input tokens (5,000 queries × 10,500 tokens) and 4.0M output tokens (5,000 queries × 800 tokens), with a 70% prompt cache hit rate.

Model	Input cost	Output cost	Cache savings	Total / mo
DeepSeek Chat (cheapest)	$14.70	$1.68	−$9.26	$7.12
GPT-5 Mini	$13.13	$8.00	−$8.27	$12.86
Gemini 2.5 Flash	$15.75	$10.00	−$9.92	$15.83
Claude Haiku 4.5	$52.50	$20.00	−$33.08	$39.42

💡 Switching from Claude Haiku 4.5 to DeepSeek Chat saves $32.31/month (82% reduction).
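
The table's columns appear to follow one simple formula: input cost and output cost are token volume times price, and cache savings are the cached share of input times the cache discount. The sketch below reproduces that arithmetic. The per-million-token prices are back-derived by dividing the table's costs by the token volumes, and the 90% cache discount is inferred from the savings column; both are assumptions, not quotes from any provider's price list.

```python
INPUT_TOKENS = 52_500_000   # 5,000 queries x 10,500 input tokens
OUTPUT_TOKENS = 4_000_000   # 5,000 queries x 800 output tokens
CACHE_HIT_RATE = 0.70       # share of input tokens served from cache
CACHE_DISCOUNT = 0.90       # assumption: cached input costs 10% of list price

# (input $/1M tokens, output $/1M tokens), back-derived from the table above.
PRICES = {
    "DeepSeek Chat": (0.28, 0.42),
    "GPT-5 Mini": (0.25, 2.00),
    "Gemini 2.5 Flash": (0.30, 2.50),
    "Claude Haiku 4.5": (1.00, 5.00),
}


def cost_breakdown(input_price, output_price):
    """Return (input_cost, output_cost, cache_savings, total) in USD per month."""
    input_cost = INPUT_TOKENS / 1e6 * input_price
    output_cost = OUTPUT_TOKENS / 1e6 * output_price
    cache_savings = input_cost * CACHE_HIT_RATE * CACHE_DISCOUNT
    total = input_cost + output_cost - cache_savings
    return input_cost, output_cost, cache_savings, total


for model, (p_in, p_out) in PRICES.items():
    input_cost, output_cost, savings, total = cost_breakdown(p_in, p_out)
    print(f"{model:<18} {input_cost:>7.2f} {output_cost:>6.2f} "
          f"{-savings:>8.2f} {total:>7.2f}")
```

Under these assumptions the totals agree with the table to within a cent of rounding. Swapping in current list prices is a matter of editing `PRICES`.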

Why these models

RAG needs long input context and benefits massively from caching. Gemini 2.5 Flash wins on context length (1M tokens) — useful when retrieved chunks are large. Claude Haiku 4.5 wins on caching efficiency (90% off cached input). GPT-5 Mini balances quality and price. DeepSeek wins on raw cost but has shorter context (~64k typical).

Key insights

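1. Caching is the single biggest cost lever for RAG. At a 70% hit rate, with cached input billed at 10% of the normal price as the table above assumes, input cost drops by 63% and the total bill by roughly 39-57% depending on the model.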
2. Consider hybrid setups that route per query: Gemini 2.5 Flash for long-context queries, Claude Haiku 4.5 where cached system prompts dominate.
3. Token usage is dominated by retrieved chunks (~10,000 tokens, 20× the ~500-token user question). Improving retrieval quality (smaller, more relevant chunks) cuts costs faster than switching models.
4. For high-volume RAG (>100k queries/month), evaluate running an open model on a hosted inference provider such as Fireworks or Together, or self-hosting one; this can be 2-5× cheaper.
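
To see how sensitive the bill is to the cache hit rate and to the amount of retrieved context, the baseline calculation can be parameterised. This is a rough sketch, not a pricing tool: the prices are back-derived from the monthly table, the 90% cache discount is an assumption applied uniformly to every model, and the function names are hypothetical.

```python
# (input $/1M tokens, output $/1M tokens), back-derived from the monthly table;
# not official provider quotes.
PRICES = {
    "DeepSeek Chat": (0.28, 0.42),
    "GPT-5 Mini": (0.25, 2.00),
    "Gemini 2.5 Flash": (0.30, 2.50),
    "Claude Haiku 4.5": (1.00, 5.00),
}
CACHE_DISCOUNT = 0.90  # assumption: a cache hit costs 10% of the input price


def monthly_total(model, queries=5_000, context_tokens=10_000,
                  question_tokens=500, response_tokens=800, hit_rate=0.70):
    """Monthly cost in USD for one model under the scenario's assumptions."""
    input_price, output_price = PRICES[model]
    input_cost = queries * (context_tokens + question_tokens) / 1e6 * input_price
    output_cost = queries * response_tokens / 1e6 * output_price
    # Only the cached share of input tokens receives the discount.
    return input_cost * (1 - hit_rate * CACHE_DISCOUNT) + output_cost


def cache_reduction(model, hit_rate=0.70):
    """Fraction by which caching lowers the bill versus no caching at all."""
    uncached = monthly_total(model, hit_rate=0.0)
    cached = monthly_total(model, hit_rate=hit_rate)
    return 1 - cached / uncached


for model in PRICES:
    print(f"{model}: {cache_reduction(model):.0%} lower bill at a 70% hit rate")

halved = monthly_total("Claude Haiku 4.5", context_tokens=5_000)
print(f"Claude Haiku 4.5 with 5k-token context: ${halved:.2f}/mo")
```

Under these assumptions the reduction from caching ranges from about 39% (Gemini 2.5 Flash, GPT-5 Mini) to about 57% (DeepSeek Chat), because models with relatively expensive output benefit less from cheaper input. Halving the retrieved context on Claude Haiku 4.5 brings the baseline bill from about $39 to roughly $30.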

Cost at different scales

Scale	DeepSeek Chat	GPT-5 Mini	Gemini 2.5 Flash	Claude Haiku 4.5
Small team (500 queries)	$0.71	$1.29	$1.58	$3.94
Baseline (5k queries)	$7.12	$12.86	$15.83	$39.42
Growing product (50k queries)	$71.19	$129	$158	$394
Enterprise (500k queries)	$712	$1,286	$1,583	$3,943
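
The scale table looks like a linear extrapolation of the baseline row. A small sketch of that scaling, assuming token counts per query and the cache hit rate stay constant and no volume discounts apply (the names below are illustrative):

```python
BASELINE_QUERIES = 5_000

# Baseline monthly totals in USD, copied from the 5k-query row above.
BASELINE_TOTALS = {
    "DeepSeek Chat": 7.12,
    "GPT-5 Mini": 12.86,
    "Gemini 2.5 Flash": 15.83,
    "Claude Haiku 4.5": 39.42,
}


def cost_at_scale(model, queries):
    """Scale the baseline monthly total linearly with query volume."""
    return BASELINE_TOTALS[model] * queries / BASELINE_QUERIES


def cost_per_query(model):
    """Average cost of a single query in USD."""
    return BASELINE_TOTALS[model] / BASELINE_QUERIES


for queries in (500, 5_000, 50_000, 500_000):
    row = ", ".join(f"{m}: ${cost_at_scale(m, queries):,.2f}" for m in BASELINE_TOTALS)
    print(f"{queries:>7,} queries -> {row}")
```

Because the baseline totals are already rounded to the cent, the scaled values can differ from the table by a cent at 50k queries and by about a dollar at 500k.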

Try your own scenario

The numbers above use our best-guess assumptions. For your actual workflow, use the interactive calculator to plug in your real token volumes and quality requirements.

All cost figures are estimates based on publicly listed pricing as of the data refresh date. Verify with the provider's official pricing page before making business decisions. Embedding costs, vector database costs, and infrastructure costs are not included unless explicitly noted.