Install
openclaw skills install @minirr890112-byte/model-watch

Benchmark AI API models over time and detect quality degradation. 7 standardized tests (reasoning, coding, writing, instruction-following, hallucination). Alerts when scores drop >10% vs historical average. Because models silently get dumber.
The problem: AI companies silently degrade their models. "Opus 4.7 was hallucinating a lot today... shocking to see such degradation" (r/ClaudeAI, 49↑). "Anthropic admits to have made hosted models more stupid" (r/LocalLLaMA, 281↑). You're paying the same price for a dumber model and you don't even know it.
The solution: Standardized benchmark suite you run yourself. 7 tests across 5 categories. Scores stored locally. Alerts when recent scores drop >10% vs your historical average. Hard data, not vibes.
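The alert rule described above can be sketched in a few lines. This is a minimal illustration, not model-watch's actual implementation: the function name `is_degraded`, the three-run "recent" window, and the choice to treat every earlier run as the historical baseline are all assumptions made for the example. Only the 10% threshold comes from the description.

```python
from statistics import mean


def is_degraded(scores, recent_window=3, threshold=0.10):
    """Return True when the recent mean is more than `threshold` below the historical mean.

    Assumptions (not taken from model-watch itself): `scores` is a list of
    overall scores in chronological order, "recent" means the last
    `recent_window` runs, and "historical" means every run before those.
    """
    if len(scores) <= recent_window:
        return False  # not enough history to compare against
    historical = mean(scores[:-recent_window])
    recent = mean(scores[-recent_window:])
    if historical <= 0:
        return False  # a relative drop is undefined without a positive baseline
    drop = (historical - recent) / historical
    return drop > threshold


# Hypothetical score series, invented for illustration only.
stable = [82, 80, 81, 83, 80, 81, 82]
slipping = [82, 80, 81, 83, 70, 68, 71]
print(is_degraded(stable))    # False: recent mean 81 vs historical 81.5
print(is_degraded(slipping))  # True: recent mean is about 14.5% below 81.5
```

Comparing window means rather than single runs keeps one noisy run from triggering an alert, at the cost of reacting a few runs later.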
pip install git+https://github.com/minirr890112-byte/model-watch.git
model-watch demo # View benchmark questions
model-watch submit '{"reasoning_1":"...","coding_1":"...",...}' # Submit outputs
model-watch history # View score history
model-watch alert # Check for degradation
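Model outputs often contain quotes and newlines, which makes the JSON argument to model-watch submit awkward to type by hand. One way to build it safely is sketched below, assuming the answers are collected in a Python dict. Only the keys `reasoning_1` and `coding_1` appear in the usage line; the rest are elided there, so the sketch leaves them out. The answer strings are placeholders, and whether the tool accepts a partial submission is not stated.

```python
import json
import shlex

# Only these two keys appear in the usage line above; the remaining test keys
# are elided there, so they are deliberately left out of this sketch.
answers = {
    "reasoning_1": "placeholder answer containing 'single quotes'",
    "coding_1": "def add(a, b):\n    return a + b",
}

payload = json.dumps(answers)  # the JSON encoder escapes newlines and double quotes
command = "model-watch submit " + shlex.quote(payload)  # quote for a POSIX shell
print(command)
```

The printed line can be pasted into a POSIX shell as-is; `shlex.quote` takes care of any single quotes inside the answers.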
| Category | Tests | What it measures |
|---|---|---|
| Reasoning | 2 | Logic, multi-step deduction |
| Coding | 2 | Code generation, debugging |
| Writing | 1 | Quality, coherence |
| Instruction-following | 1 | Precision, constraint adherence |
| Hallucination detection | 2 | Factual accuracy |
Scores submitted with model-watch submit are stored locally in ~/.hermes/model-watch-history.json. model-watch history displays them, and model-watch alert flags when recent scores drop >10% vs historical average.

⭐ Star this repo if you've noticed your favorite model getting dumber: github.com/minirr890112-byte/model-watch