The Real Cost of LLMs in Production: A Model Comparison for Finished Products
Adam Dugan • February 09, 2026
Building an AI product is easy. Making it profitable is hard.
I've shipped multiple AI products: BalancingIQ (financial advisory platform), Handyman AI (image-based repair planning), and AI Administrative Assistant (voice-enabled phone system). Every single one hit the same wall: LLM costs can destroy your margins if you're not careful.
Here's what AI actually costs in production, broken down by real usage patterns, and how to make it economically viable.
The Models: Pricing Breakdown (January 2026)
First, let's establish the baseline costs. All prices are per 1 million tokens.
| Model | Input | Output | Best For |
|---|---|---|---|
| GPT-4o | $2.50 | $10.00 | Complex reasoning, multi-step tasks |
| GPT-4o-mini | $0.15 | $0.60 | Simple tasks, high volume |
| GPT-4 Turbo | $10.00 | $30.00 | Highest quality, high-stakes tasks |
| Claude 3.5 Sonnet | $3.00 | $15.00 | Long context, code generation |
| Claude 3.5 Haiku | $0.25 | $1.25 | Fast responses, simple queries |
| GPT-3.5 Turbo | $0.50 | $1.50 | Legacy, being phased out |
Key insight: GPT-4o-mini is roughly 67x cheaper than GPT-4 Turbo for input and 50x cheaper for output (and about 17x cheaper than GPT-4o on both). That difference compounds fast.
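It helps to pin the table down as data so cost estimates stay consistent across the examples below. A minimal Python sketch (prices copied from the table above; verify against current provider pricing before using them in billing logic):

```python
# Per-1M-token prices from the table above (USD).
# Verify against current provider pricing before trusting any output.
PRICES = {
    "gpt-4o":            {"in": 2.50,  "out": 10.00},
    "gpt-4o-mini":       {"in": 0.15,  "out": 0.60},
    "gpt-4-turbo":       {"in": 10.00, "out": 30.00},
    "claude-3.5-sonnet": {"in": 3.00,  "out": 15.00},
    "claude-3.5-haiku":  {"in": 0.25,  "out": 1.25},
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """USD cost of a single request."""
    p = PRICES[model]
    return (input_tokens * p["in"] + output_tokens * p["out"]) / 1_000_000

# Example: one 8K-in / 1.5K-out analysis plus three 3K/800 follow-ups
# (the BalancingIQ workload in the next section).
session = request_cost("gpt-4-turbo", 8_000, 1_500) \
        + 3 * request_cost("gpt-4-turbo", 3_000, 800)
print(f"${session:.3f}")  # $0.287
```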
Real-World Cost Examples from Production Systems
Example 1: BalancingIQ Financial Advisory
BalancingIQ analyzes financial data from Xero/QuickBooks and generates actionable insights for SMBs.
Typical workflow:
- User connects their accounting software (one-time)
- System syncs financial data (monthly, automated)
- User requests analysis: "How's my cash flow?" or "What should I focus on?"
- LLM processes 3-6 months of transactions, generates insights
- User asks follow-up questions
Token usage per analysis:
- Input: 8,000 tokens (financial data + context + prompt)
- Output: 1,500 tokens (insights + recommendations)
- Follow-ups: Average 3 per session, ~3K input / 800 output each
Cost Per User Session:
| Model | Initial | Follow-ups | Total |
|---|---|---|---|
| GPT-4 Turbo | $0.125 | $0.162 | $0.287 |
| GPT-4o | $0.035 | $0.046 | $0.081 |
| GPT-4o-mini | $0.002 | $0.003 | $0.005 |
Reality check: If a user pays $50/month and uses the product 10 times, GPT-4 Turbo costs $2.87 (5.7% of revenue). GPT-4o-mini costs $0.05 (0.1% of revenue).
What I actually use: GPT-4o for complex analyses, GPT-4o-mini for simple queries and follow-ups. Average cost per session: ~$0.03.
Example 2: Handyman AI (Image Analysis)
Handyman AI analyzes photos of home repairs and generates detailed repair plans, material lists, and cost estimates.
Typical workflow:
- User uploads 2-5 photos of the repair
- LLM with vision analyzes images
- Generates repair plan, material list, cost breakdown
- User asks clarification questions
Token usage per request:
- Input: 12,000 tokens (3 images ~3K tokens each, plus prompt)
- Output: 2,000 tokens (detailed repair plan)
Cost Per Image Analysis:
- GPT-4 Turbo (Vision): $0.18 per request
- GPT-4o (Vision): $0.05 per request
- Claude 3.5 Sonnet (Vision): $0.066 per request
Reality check: If users pay $5 per analysis, GPT-4 Turbo takes 3.6% of revenue. GPT-4o takes 1%. At scale (1,000 requests/month), that's $180 vs $50.
What I actually use: GPT-4o for most analyses. The quality difference from GPT-4 Turbo is minimal for this use case, and the cost savings are significant.
Example 3: AI Administrative Assistant (Voice)
Voice AI is expensive. You're paying for TTS, STT, and LLM inference, all in real time.
Typical 5-minute call:
- Twilio: $0.065 (phone infrastructure)
- Azure STT: $0.083 (speech-to-text)
- Azure TTS: $0.080 (text-to-speech)
- LLM (10K tokens): Variable, see below
Total Cost Per 5-Minute Call:
- Base (Twilio + TTS/STT): $0.228
- + GPT-4 Turbo: $0.528 total ($6.34/hour)
- + GPT-4o: $0.309 total ($3.71/hour)
- + GPT-4o-mini: $0.236 total ($2.83/hour)
Reality check: If you're running a customer support line with 500 calls/day, that's $118/day with GPT-4o-mini or $264/day with GPT-4 Turbo. Over a month: $3,540 vs $7,920.
What I actually use: GPT-4o-mini for most calls, GPT-4o for complex routing decisions or multi-step workflows. I also cache common responses (FAQs) to avoid LLM calls entirely.
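To see where the money goes per call, here's a quick calculator built from the figures above. The per-call LLM amounts are what the totals above imply for my token mix; yours will shift them:

```python
# Per-call economics for a 5-minute voice call, from the figures above.
BASE = 0.065 + 0.083 + 0.080  # Twilio + Azure STT + Azure TTS = $0.228/call

# LLM spend per 5-minute call implied by the totals above (varies with token mix).
LLM_PER_CALL = {"gpt-4-turbo": 0.300, "gpt-4o": 0.081, "gpt-4o-mini": 0.008}

for model, llm in LLM_PER_CALL.items():
    per_call = BASE + llm
    print(f"{model}: ${per_call:.3f}/call, ${per_call * 12:.2f}/hour")
    # gpt-4-turbo: $0.528/call, $6.34/hour
    # gpt-4o:      $0.309/call, $3.71/hour
    # gpt-4o-mini: $0.236/call, $2.83/hour
```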
Hidden Costs: Embeddings, Fine-Tuning, and Storage
LLM inference isn't the only cost. Here are the hidden expenses:
Embeddings
If you're using RAG (retrieval-augmented generation), you need embeddings for semantic search.
- OpenAI text-embedding-3-small: $0.02 per 1M tokens
- OpenAI text-embedding-3-large: $0.13 per 1M tokens
Example: Embedding a 100-page document (≈75K tokens) costs $0.0015 (small) or $0.01 (large). If you're indexing thousands of documents, this adds up.
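Indexing cost is easy to estimate before you commit to a corpus. A quick sketch, assuming ~750 tokens per page:

```python
# One-time cost to embed a corpus, assuming ~750 tokens per page.
EMBED_PRICE = {"text-embedding-3-small": 0.02,
               "text-embedding-3-large": 0.13}  # USD per 1M tokens

def index_cost(pages: int, model: str = "text-embedding-3-small") -> float:
    return pages * 750 * EMBED_PRICE[model] / 1_000_000

print(f"${index_cost(100):.4f}")      # one 100-page doc: $0.0015
print(f"${index_cost(100_000):.2f}")  # 1,000 such docs:  $1.50
```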
Vector Database Storage
Storing embeddings requires a vector database (Pinecone, Weaviate, pgvector).
- Pinecone: $70/month for 100K vectors (1536 dimensions)
- pgvector (self-hosted): EC2/RDS costs, ~$20-50/month for small scale
Fine-Tuning
Fine-tuning models for domain-specific tasks has upfront costs:
- GPT-4o-mini training: $3.00 per 1M tokens
- GPT-4o-mini inference (fine-tuned): $0.30 input / $1.20 output (2x base cost)
When it's worth it: If you need very specific behavior and prompt engineering isn't enough. But usually, better prompts + RAG is cheaper than fine-tuning.
Cost Optimization Strategies That Actually Work
1. Aggressive Caching
The cheapest LLM call is the one you don't make.
What I cache:
- FAQ responses: "What are your hours?" → cached answer, zero cost
- Common analyses: "Show me cash flow" for similar businesses
- Embeddings: Never re-embed the same document
- Voice audio: Pre-generated TTS for greetings and common phrases
Impact: Caching reduces LLM calls by 40-60% in production. Use Redis with TTL for dynamic content, S3 for static responses.
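A minimal sketch of that response cache, assuming Redis and the openai Python client. The key is a hash of model plus messages, so identical requests never hit the API twice (the helper name cached_completion is mine, not part of either library):

```python
import hashlib
import json

import redis
from openai import OpenAI

r = redis.Redis()
client = OpenAI()

def cached_completion(model: str, messages: list[dict], ttl: int = 3600) -> str:
    # Key on model + exact messages: identical requests hit Redis, not the API.
    key = "llm:" + hashlib.sha256(
        json.dumps({"model": model, "messages": messages}, sort_keys=True).encode()
    ).hexdigest()
    if (hit := r.get(key)) is not None:
        return hit.decode()
    resp = client.chat.completions.create(model=model, messages=messages)
    answer = resp.choices[0].message.content
    r.set(key, answer, ex=ttl)  # TTL so dynamic content eventually refreshes
    return answer
```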
2. Model Tiering
Route queries to the cheapest model that can handle them.
My Routing Strategy:
- GPT-4o-mini: Simple queries, follow-up questions, clarifications
- GPT-4o: Complex analysis, multi-step reasoning, vision tasks
- GPT-4 Turbo: Rare, only for highest-stakes decisions
How to decide: Start with a classifier (even a simple keyword check). Route 80% of queries to the cheap model, 20% to the expensive one. Monitor quality and adjust.
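A starting-point router in that spirit. The keyword heuristic is a placeholder; swap in a trained classifier once you have labeled traffic:

```python
# Route each query to the cheapest model that can handle it.
# Keyword heuristic only; replace with a real classifier once you have data.
COMPLEX_MARKERS = ("analyze", "compare", "why", "forecast", "plan", "strategy")

def pick_model(query: str, has_image: bool = False) -> str:
    if has_image:
        return "gpt-4o"        # vision goes to the bigger model
    if any(m in query.lower() for m in COMPLEX_MARKERS):
        return "gpt-4o"        # complex reasoning
    return "gpt-4o-mini"       # default: cheap and good enough

assert pick_model("What are your hours?") == "gpt-4o-mini"
assert pick_model("Analyze my cash flow trend") == "gpt-4o"
```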
3. Shorter Context Windows
Input tokens cost money. Don't send more context than you need.
Bad: Sending 50K tokens of conversation history on every request.
Good: Summarize older messages, keep only the last 5-10 turns, use embeddings to retrieve relevant context.
Impact: In BalancingIQ, I reduced input tokens by 60% by summarizing financial data instead of sending raw transactions. Quality stayed the same, costs dropped dramatically.
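The trimming itself is only a few lines. A sketch, assuming you keep a rolling summary of older turns (the summary can be produced by the cheap model):

```python
# Send the system prompt, a rolling summary of older turns, and only
# the most recent exchanges; everything older lives in the summary.
MAX_RECENT = 8  # user + assistant messages kept verbatim

def trim_history(system: dict, summary: str, history: list[dict]) -> list[dict]:
    messages = [system]
    if summary:
        messages.append({"role": "system",
                         "content": f"Summary of earlier conversation: {summary}"})
    return messages + history[-MAX_RECENT:]
```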
4. Batch Processing
For non-urgent tasks, use OpenAI's Batch API: a 50% discount in exchange for up to 24-hour turnaround. A sketch follows the list below.
Good use cases:
- Nightly reports
- Bulk data analysis
- Email summaries
- Content moderation backlogs
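A minimal Batch API sketch: write one JSON line per request, upload the file, and create the batch. The discount is applied automatically; poll for results within 24 hours:

```python
import json

from openai import OpenAI

client = OpenAI()

# One JSON line per request; custom_id lets you match results back later.
with open("nightly.jsonl", "w") as f:
    for report_id, text in [("r1", "Summarize report 1..."),
                            ("r2", "Summarize report 2...")]:
        f.write(json.dumps({
            "custom_id": report_id,
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {"model": "gpt-4o-mini",
                     "messages": [{"role": "user", "content": text}]},
        }) + "\n")

batch_file = client.files.create(file=open("nightly.jsonl", "rb"), purpose="batch")
batch = client.batches.create(input_file_id=batch_file.id,
                              endpoint="/v1/chat/completions",
                              completion_window="24h")
print(batch.id)  # poll client.batches.retrieve(batch.id) until status == "completed"
```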
5. Prompt Optimization
Shorter, clearer prompts = fewer tokens = lower costs.
Example: I rewrote a 1,500-token system prompt down to 600 tokens by removing redundancy and being more concise. Quality improved (clearer instructions) and the prompt's token cost dropped 60%.
6. Usage Limits and Guardrails
Prevent abuse and runaway costs (a sketch of the first two follows the list):
- Rate limits: 10 queries per user per day on free tier
- Max token limits: Cap output at 2K tokens to prevent infinite generation
- Cost alerts: CloudWatch alarms when daily spend exceeds threshold
- Per-user tracking: Flag users making 100+ requests/day for review
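Here's a sketch of the first two guardrails, assuming Redis for the per-user counter; max_tokens is a real request parameter and is the simplest hard cap on output spend:

```python
import redis

r = redis.Redis()

DAILY_LIMIT = 10          # free-tier queries per user per day
MAX_OUTPUT_TOKENS = 2000  # pass as max_tokens on every completion call

def allow_request(user_id: str) -> bool:
    """Increment the user's daily counter; False once they're over quota."""
    key = f"quota:{user_id}"
    count = r.incr(key)
    if count == 1:
        r.expire(key, 86_400)  # window resets 24h after the first request
    return count <= DAILY_LIMIT
```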
When to Use Which Model: Decision Framework
Use GPT-4o-mini When:
- Simple queries with clear inputs
- High-volume, low-complexity tasks
- Following instructions (not reasoning)
- Summarization, formatting, extraction
- You're cost-sensitive and quality is "good enough"
Use GPT-4o When:
- Complex reasoning and multi-step logic
- Vision tasks (image analysis)
- Technical content generation
- Code generation and debugging
- Balance between cost and quality matters
Use Claude 3.5 Sonnet When:
- Very long context windows (200K tokens)
- Code generation (often better than GPT-4)
- Research and analysis tasks
- You need detailed, thoughtful responses
- Reliable structured output (stronger JSON adherence)
Use GPT-4 Turbo When:
- You need the absolute best quality
- Latency isn't critical (it's slower than GPT-4o)
- High-stakes decisions (legal, medical, financial)
- Cost is not a primary concern
Real Cost Analysis: A $50/Month SaaS Product
Let's say you're building a SaaS product with a $50/month subscription. How much can you afford to spend on LLM costs?
Typical SaaS Margins:
- Gross revenue: $50/month
- Payment processing (3%): -$1.50
- Cloud infrastructure: -$5 (AWS, hosting, databases)
- Support and ops: -$3
- Target gross margin: 60% = $30 profit
Available for LLM costs: ~$10/month (20% of revenue)
If users make 20 requests per month at $0.05 each (GPT-4o), that's $1/month. Comfortable margin.
If users make 100 requests per month at $0.30 each (GPT-4 Turbo), that's $30/month. You're losing money on every customer.
The fix: Tier pricing based on usage, use cheaper models, or implement aggressive caching and rate limits.
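That back-of-envelope check is worth automating so every pricing or model change gets sanity-tested. A sketch using the numbers above:

```python
# Does a subscriber stay inside the LLM budget at a given usage level?
def within_llm_budget(price: float, requests_per_month: int,
                      cost_per_request: float, budget_pct: float = 0.20) -> bool:
    return requests_per_month * cost_per_request <= price * budget_pct

print(within_llm_budget(50, 20, 0.05))   # True:  $1 spend vs $10 budget
print(within_llm_budget(50, 100, 0.30))  # False: $30 spend vs $10 budget
```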
Monitoring and Cost Tracking
You can't optimize what you don't measure. Here's what I track:
- Cost per user per month: Total LLM spend / active users
- Cost per request: By model, by feature, by user type
- Cache hit rate: % of requests served from cache vs LLM
- Model distribution: % of requests to mini vs standard vs turbo
- Token usage: Input vs output, per endpoint
Tools I use:
- OpenAI Dashboard: Real-time usage and costs
- CloudWatch: Custom metrics, alarms for cost spikes
- Custom logging: Log every LLM call with user_id, feature, tokens, and cost (sketch after this list)
- Weekly reports: Automated summary of costs, trends, outliers
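The custom logging is just a thin wrapper around every completion call. A sketch, assuming the OpenAI client and a price table like the one at the top of this post (tracked_completion is my name, not a library function):

```python
import json
import logging
import time

from openai import OpenAI

client = OpenAI()
log = logging.getLogger("llm")

PRICES = {"gpt-4o-mini": (0.15, 0.60), "gpt-4o": (2.50, 10.00)}  # $/1M in, out

def tracked_completion(user_id: str, feature: str, model: str, messages: list[dict]):
    start = time.time()
    resp = client.chat.completions.create(model=model, messages=messages)
    in_price, out_price = PRICES[model]
    cost = (resp.usage.prompt_tokens * in_price
            + resp.usage.completion_tokens * out_price) / 1_000_000
    # One structured line per call; the weekly report aggregates these.
    log.info(json.dumps({
        "user_id": user_id, "feature": feature, "model": model,
        "input_tokens": resp.usage.prompt_tokens,
        "output_tokens": resp.usage.completion_tokens,
        "cost_usd": round(cost, 6),
        "latency_s": round(time.time() - start, 2),
    }))
    return resp
```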
Key Takeaways
- GPT-4o-mini is 50-67x cheaper than GPT-4 Turbo (and ~17x cheaper than GPT-4o). Use it as your default, upgrade only when necessary.
- Model tiering is essential: Route simple queries to cheap models, complex ones to expensive models. This alone can cut costs 60%.
- Caching is your best friend: 40-60% of requests can be cached. The cheapest LLM call is the one you don't make.
- Voice AI is expensive: $0.23-0.53 per 5-minute call adds up fast. Cache FAQs, use cheaper models, set time limits.
- Hidden costs matter: Embeddings, vector storage, and fine-tuning add to your bill. Budget for the full stack, not just inference.
- Monitor everything: Track cost per user, per request, per feature. Set alarms for spikes. Optimize continuously.
- For a $50/month SaaS, aim to spend <$10/month per user on LLM costs (20% of revenue). Adjust pricing or usage limits accordingly.
Building an AI product and worried about costs? I'd love to hear about your cost optimization challenges, model selection, or pricing strategy. Reach out at adamdugan6@gmail.com or connect with me on LinkedIn.
LLMs were used to help with research and article structure.