🤖 Model Cost Advisor

Pick the most cost-effective LLM for any task — before you start spending.

Why pay Claude Opus prices for a task DeepSeek can handle? This skill analyzes your task, maps it to a capability tier, and finds the cheapest model that gets the job done well.


Quick Start

# 1. Fetch live pricing (one-time, auto-cached for 48h)
python scripts/fetch_pricing.py

# 2. Get a recommendation
echo "Write a REST API with FastAPI, handle auth and rate limiting" | python scripts/advise.py

# 3. Or pass task directly
python scripts/advise.py --task "Refactor a 2000-line Python class into smaller modules"

# 4. Compare all models side-by-side
python scripts/advise.py --compare

# 5. JSON output for scripting
python scripts/advise.py --task "Debug a race condition" --json
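The --json flag is meant for piping into other tooling. Below is a minimal sketch of a consumer; the output is assumed to be a JSON object with a list of recommendations, and the field names are hypothetical, so check advise.py for the actual schema.

# Hypothetical consumer of --json output; the "recommendations", "model",
# and "projected_cost" keys are assumptions, not the documented schema.
import json
import subprocess

result = subprocess.run(
    ["python", "scripts/advise.py", "--task", "Debug a race condition", "--json"],
    capture_output=True, text=True, check=True,
)
report = json.loads(result.stdout)
top = report["recommendations"][0]
print(f"Use {top['model']} (~${top['projected_cost']})")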

What It Does

  1. Analyzes your task description for complexity signals (reasoning depth, code needs, context length, agentic loops, domain expertise)
  2. Maps to one of 4 capability tiers: Budget → Standard → Advanced → Premium
  3. Estimates token usage based on task complexity
  4. Scores 30+ models using live pricing from litellm's community DB
  5. Recommends the top 3 models with projected cost, rationale, and pitfalls
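Under the hood, the projected cost in step 5 is plain arithmetic: estimated tokens times per-million prices. A minimal sketch of that projection, as my own illustration rather than the advise.py internals:

# Illustrative cost projection; the real logic lives in scripts/advise.py.
def projected_cost(input_tokens: int, output_tokens: int,
                   in_price_per_m: float, out_price_per_m: float) -> float:
    """Project USD cost from token estimates and $-per-million-token prices."""
    return (input_tokens * in_price_per_m + output_tokens * out_price_per_m) / 1_000_000

# A hypothetical model at $0.50/M input and $1.50/M output, 10K in / 4K out:
print(f"${projected_cost(10_000, 4_000, 0.50, 1.50):.4f}")  # $0.0110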

The Four Tiers

| Tier | When to Use | Example Tasks | Typical Cost |
|------|-------------|---------------|--------------|
| 💰 Budget | Simple Q&A, classification, formatting, basic scripts | "Summarize this text", "Format JSON" | <$0.01 |
| 📦 Standard | Multi-step reasoning, medium code, structured output | "Write a web scraper", "Explain a concept" | $0.01–$0.10 |
| 🚀 Advanced | Complex code, architecture design, agentic loops | "Build a full-stack app", "Debug concurrency" | $0.10–$1.00 |
| 👑 Premium | Frontier reasoning, research, >128K context | "Research paper analysis", "Safety-critical code" | $1.00+ |

Models Tracked

30+ models across 6 providers, updated from litellm's community DB:

| Provider | Models |
|----------|--------|
| Anthropic | Claude Opus 4 / 4.1 / 4.5 / 4.6 / 4.7, Sonnet 4 / 4.5 / 4.6, Haiku 3.5 |
| OpenAI | GPT-4o, GPT-4o-mini, GPT-4.1 / 4.1-mini / 4.1-nano, o3 / o3-mini / o4-mini |
| Google | Gemini 2.0 Flash, 2.5 Flash / Pro |
| DeepSeek | V3 / V3.1 / V3.2, R1 (with reasoning token warning) |
| Alibaba | Qwen Turbo / Plus / Max / Coder-Plus / 3-235B |
| Mistral | Ministral 3B / 8B / 14B |

Example Output

╔══════════════════════════════════════════════════╗
║        🤖 Model Cost Advisor                      ║
╚══════════════════════════════════════════════════╝

🎯 Task Analysis
   Complexity Tier: 3 (Advanced)
   Est. Input:  ~24K tokens
   Est. Output: ~10K tokens
   Signals: multi_step_logic, complex_code, multi_turn_tools

💰 Top Recommendations
   Rank  Model                  Cost     Input $/M Output $/M
   ───── ────────────────────── ────────  ──────── ─────────
   🥇    deepseek-v3            $0.0175     0.28     0.42
   🥈    deepseek-v3.1          $0.0216     0.27     1.10
   🥉    gemini-2.5-flash       $0.0322     0.30     2.50

📋 Why deepseek-v3?
   Tier 3 task → best value in tier 1
   Estimated total cost: $0.0175

How the Agent Uses This Skill

When loaded by Hermes, the agent follows these steps:

Step 1: Analyze Task Requirements

Classify the task along these dimensions to determine the minimum capability tier needed:

| Dimension | Weight | What to Assess |
|-----------|--------|----------------|
| Reasoning Depth | High | Simple lookup → multi-step logic → deep chain-of-thought |
| Code Generation | Medium | None → simple scripts → multi-file complex → architecture design |
| Context Length | Medium | <4K → 4K–32K → 32K–128K → >128K tokens |
| Tool Use / Agentic | High | Single shot → multi-turn tools → autonomous agent loop |
| Domain Expertise | Low | General → specialized (math, legal, medical, Chinese content) |
| Output Quality | Medium | Draft OK → production → customer-facing critical |
| Latency | Low | Batch OK → real-time interactive |
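One way to collapse these weighted dimensions into a single tier is a weighted average of per-dimension scores. The sketch below illustrates that idea, assuming 0–3 scores per dimension and the High/Medium/Low weights above; it is not the actual advise.py scoring code.

# Illustrative tier scoring; weights, score scale, and rounding are assumptions.
WEIGHTS = {
    "reasoning_depth": 3, "tool_use": 3,                              # High
    "code_generation": 2, "context_length": 2, "output_quality": 2,   # Medium
    "domain_expertise": 1, "latency": 1,                              # Low
}

def capability_tier(scores):
    """Map per-dimension scores (0-3) to a tier 1-4 via a weighted average."""
    total = sum(WEIGHTS[d] * s for d, s in scores.items())
    avg = total / sum(WEIGHTS[d] for d in scores)
    return min(4, max(1, round(avg) + 1))  # 0-3 average -> tier 1-4

# e.g. a coding task with multi-step logic and tool loops:
print(capability_tier({
    "reasoning_depth": 2, "tool_use": 2, "code_generation": 3,
    "context_length": 1, "output_quality": 2,
    "domain_expertise": 0, "latency": 0,
}))  # -> 3 (Advanced)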

Step 2: Estimate Token Usage

| Task Complexity | Input Tokens | Output Tokens |
|-----------------|--------------|---------------|
| Trivial (single Q&A) | 500 – 2K | 200 – 1K |
| Simple (few exchanges) | 2K – 8K | 1K – 4K |
| Medium (multi-turn agent, 5–10 tools) | 8K – 40K | 4K – 16K |
| Complex (deep agent, 10–30 tools) | 40K – 150K | 16K – 50K |
| Heavy (autonomous loop, 30+ tools) | 150K – 500K+ | 50K – 200K+ |
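These ranges translate directly into dollar estimates via the cost projection above. As a sanity check: at deepseek-v3's $0.28/M input and $0.42/M output, the upper end of a Medium task works out to 40,000 × $0.28/M + 16,000 × $0.42/M ≈ $0.0112 + $0.0067 ≈ $0.018, squarely in the Standard cost band.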

Step 3: Run Scripts

# Ensure pricing is fresh
python scripts/fetch_pricing.py

# Get recommendation
python scripts/advise.py --task "<user's task description>"

Step 4: Present Recommendation

Format the output with:

  1. Task complexity analysis
  2. Top 3 model picks with cost
  3. Comparison vs user's current model (if known)
  4. Any pitfalls (R1 reasoning tokens, context window limits, etc.)

Pitfalls to Warn Users About

Script internals (for maintenance):

  • Tier keys in pricing JSON are strings, not ints: the pricing_cache dict uses "1", not 1. The advise script casts them internally, but direct lookups must match the string form.
  • Keyword matching order matters: put longer, more specific keywords (e.g. 'production') before shorter, ambiguous ones ('pr') to avoid substring false positives, and split on word boundaries rather than matching raw substrings (see the sketch below).
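A minimal sketch of the word-boundary approach, to show why it avoids the substring trap (illustrative only; the real matcher lives in advise.py):

# Illustrative word-boundary keyword matching; not the actual advise.py matcher.
import re

def match_signals(task, keywords):
    """Return keywords that appear as whole words in the task description."""
    words = set(re.findall(r"[a-z0-9_']+", task.lower()))
    return [kw for kw in keywords if kw in words]

# Naive substring matching would flag 'pr' inside 'production';
# whole-word matching does not.
print(match_signals("Ship production-ready code", ["production", "pr"]))
# -> ['production']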

User-facing pitfalls:

  1. R1/o3 reasoning tokens are hidden: The sticker price hides massive output consumption; real cost can be 3–5× higher for reasoning models.
  2. Context is not free: Models with 1M context windows (Gemini) charge for every token in the window, used or not.
  3. Tool calls compound cost: Every agentic round-trip re-sends the system prompt plus tool definitions and results, so an agentic task can easily cost 5× the naive estimate.
  4. Cached prefixes save money: System prompts and repeated prefixes bill at 10–25% of the normal input rate, so factor this in for repetitive tasks.
  5. Chinese-language tasks: DeepSeek and Qwen outperform their price tier on Chinese content. Western models cost more for equivalent quality.
  6. Pricing changes frequently: Run fetch_pricing.py before important decisions. Cache TTL is 48 hours.
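To make pitfall 1 concrete with made-up but representative numbers: if a reasoning model's output price applies to hidden reasoning tokens too, a 1K-token visible answer that silently burns 4K reasoning tokens bills as 5K output tokens, a 5× multiplier on what you actually see.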

Scripts

  • scripts/fetch_pricing.py — Fetches live pricing from litellm DB, normalizes to canonical model names, caches for 48h.
  • scripts/advise.py — Task complexity analysis + model recommendation engine with colorized terminal output.
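If you maintain these scripts, the 48h cache is the piece most worth understanding. A minimal sketch of how such a TTL check can work; the cache path and file layout here are assumptions, so treat fetch_pricing.py as the source of truth:

# Illustrative 48h TTL cache; the cache path is an assumption, not the real layout.
import json
import time
import urllib.request
from pathlib import Path

CACHE = Path("cache/pricing.json")  # assumed location
TTL_SECONDS = 48 * 3600
LITELLM_URL = ("https://raw.githubusercontent.com/BerriAI/litellm/main/"
               "model_prices_and_context_window.json")

def load_pricing():
    """Return cached pricing if younger than 48h, else re-fetch from litellm."""
    if CACHE.exists() and time.time() - CACHE.stat().st_mtime < TTL_SECONDS:
        return json.loads(CACHE.read_text())
    with urllib.request.urlopen(LITELLM_URL) as resp:
        data = json.load(resp)
    CACHE.parent.mkdir(parents=True, exist_ok=True)
    CACHE.write_text(json.dumps(data))
    return data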
