Dingo: A Comprehensive AI Data, Model and Application Quality Evaluation Tool.

Evaluate AI training and RAG data quality using rule-based or LLM-based metrics with Dingo's flexible, multi-format assessment framework and CLI/SDK support.

Install

openclaw skills install dingo
pip install dingo-python
pip install "dingo-python[agent]" # Agent-based evaluation (fact-checking)
pip install "dingo-python[hhem]" # HHEM hallucination detection
pip install "dingo-python[all]" # Everything
Verify the installation:

python -c "from dingo.config import InputArgs; print('Dingo OK')"
| | Rule-based | LLM-based |
|---|---|---|
| API key required | No | Yes (any OpenAI-compatible API) |
| Speed | Fast | Slower (API calls) |
| Cost | Zero | Per-token cost |
| Metrics | 50+ deterministic rules | Text quality, RAG, 3H, security |
| Best for | Format checks, PII, completeness | Semantic quality, faithfulness |
Output: summary.json + per-item JSONL reports in the output directory.

Dingo CLI takes a JSON config file as input:
dingo eval --input config.json
{
  "input_path": "data.jsonl",
  "dataset": {"source": "local", "format": "jsonl"},
  "evaluator": [
    {
      "fields": {"content": "content"},
      "evals": [
        {"name": "RuleColonEnd"},
        {"name": "RuleSpecialCharacter"},
        {"name": "RuleContentNull"}
      ]
    }
  ]
}
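For illustration, a hedged sketch that writes a matching data.jsonl; the rows are invented, and only the content column matters to the config above:

import json

# Hypothetical rows — only the "content" column is read by the config above.
rows = [
    {"content": "A clean paragraph of training text."},
    {"content": ""},  # empty content — RuleContentNull would flag this row
]
with open("data.jsonl", "w", encoding="utf-8") as f:
    for row in rows:
        f.write(json.dumps(row, ensure_ascii=False) + "\n")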
For LLM-based evaluation, point the evaluator at any OpenAI-compatible API:

{
  "input_path": "data.jsonl",
  "dataset": {"source": "local", "format": "jsonl"},
  "evaluator": [
    {
      "fields": {"content": "content"},
      "evals": [
        {
          "name": "LLMTextRepeat",
          "config": {
            "model": "deepseek-chat",
            "key": "${OPENAI_API_KEY}",
            "api_url": "https://api.deepseek.com/v1"
          }
        }
      ]
    }
  ]
}
RAG evaluation requires specific fields mapped from the dataset:
{
  "input_path": "rag_output.jsonl",
  "dataset": {"source": "local", "format": "jsonl"},
  "evaluator": [
    {
      "fields": {
        "user_input": "user_input",
        "response": "response",
        "retrieved_contexts": "retrieved_contexts",
        "reference": "reference"
      },
      "evals": [
        {"name": "Faithfulness", "config": {"model": "deepseek-chat", "key": "${OPENAI_API_KEY}", "api_url": "https://api.deepseek.com/v1"}},
        {"name": "ContextPrecision", "config": {"model": "deepseek-chat", "key": "${OPENAI_API_KEY}", "api_url": "https://api.deepseek.com/v1"}}
      ]
    }
  ]
}
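For reference, a hypothetical script that writes one rag_output.jsonl record with the four mapped columns; the values are invented:

import json

# One invented RAG trace; column names match the "fields" mapping above.
record = {
    "user_input": "When was the transformer architecture introduced?",
    "response": "The transformer was introduced in 2017.",
    "retrieved_contexts": ["'Attention Is All You Need' (2017) introduced the transformer."],
    "reference": "2017",
}
with open("rag_output.jsonl", "w", encoding="utf-8") as f:
    f.write(json.dumps(record, ensure_ascii=False) + "\n")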
Evaluate different columns with different rules:
{
  "input_path": "qa_data.jsonl",
  "dataset": {"source": "local", "format": "jsonl"},
  "evaluator": [
    {
      "fields": {"content": "answer"},
      "evals": [{"name": "RuleColonEnd"}, {"name": "RuleSpecialCharacter"}]
    },
    {
      "fields": {"content": "question"},
      "evals": [{"name": "RuleContentNull"}]
    }
  ]
}
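A hedged sample of matching qa_data.jsonl rows, showing the two separately mapped columns (data invented):

import json

# Hypothetical QA rows with the two columns mapped above.
rows = [
    {"question": "What is Dingo?", "answer": "A data quality evaluation tool."},
    {"question": "", "answer": "Ends with a colon:"},  # would trip RuleContentNull and RuleColonEnd
]
with open("qa_data.jsonl", "w", encoding="utf-8") as f:
    for r in rows:
        f.write(json.dumps(r, ensure_ascii=False) + "\n")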
For programmatic use inside Python scripts:
from dingo.config import InputArgs
from dingo.exec import Executor

if __name__ == '__main__':
    input_data = {
        "input_path": "data.jsonl",
        "dataset": {"source": "local", "format": "jsonl"},
        "evaluator": [
            {
                "fields": {"content": "content"},
                "evals": [
                    {"name": "RuleColonEnd"},
                    {"name": "RuleSpecialCharacter"}
                ]
            }
        ]
    }
    input_args = InputArgs(**input_data)
    executor = Executor.exec_map["local"](input_args)
    result = executor.execute()
    print(result)
| Field | Values | Description |
|---|---|---|
| source | local, huggingface, s3, sql | Data source type |
| format | jsonl, json, csv, plaintext, parquet | File format |
| Field | Default | Description |
|---|---|---|
| max_workers | 1 | Parallel evaluation workers |
| batch_size | 10 | Items per batch |
| result_save.bad | true | Save items that fail evaluation |
| result_save.good | false | Save items that pass evaluation |
| result_save.merge | false | Merge all results into single file |
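As a sketch, these options sit in an executor block alongside dataset and evaluator; the executor key also appears in the fact-checking SDK example later in this document, while the nesting of batch_size and result_save under it is assumed from the dotted names above, with illustrative values:

config = {
    "input_path": "data.jsonl",
    "dataset": {"source": "local", "format": "jsonl"},
    "executor": {
        "max_workers": 4,   # parallel evaluation workers
        "batch_size": 10,   # items per batch
        "result_save": {"bad": True, "good": True, "merge": False},  # assumed nesting
    },
    "evaluator": [
        {"fields": {"content": "content"}, "evals": [{"name": "RuleContentNull"}]}
    ],
}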
Each evaluator group has:
| Field | Required | Description |
|---|---|---|
| fields | Yes | Maps Dingo fields to dataset columns |
| evals | Yes | List of evaluators to apply |
| evals[].name | Yes | Evaluator class name |
| evals[].config | For LLM | LLM config: model, key, api_url |
The fields object maps Dingo's internal field names to your dataset's column names:
| Dingo field | Description | Used by |
|---|---|---|
| content | Main text content to evaluate | Most rule/LLM evaluators |
| prompt | Instruction/question field | Instruction quality evaluators |
| image | Image path or URL | VLM evaluators |
| user_input | User query | RAG evaluators |
| response | Model response | RAG evaluators |
| retrieved_contexts | Retrieved context list | RAG evaluators |
| reference | Ground truth reference | RAG evaluators |
| Category | Examples |
|---|---|
| Content checks | RuleContentNull, RuleContentShort, RuleDocRepeat |
| Format checks | RuleColonEnd, RuleSpecialCharacter, RuleAbnormalChar |
| Quality checks | RuleLongWord, RuleHighPPL, RulePunctuation |
| PII detection | RulePII, RuleUrl, RuleEmail |
| Language | RuleChineseChaos, RuleChineseTraditional |
| Category | Evaluators |
|---|---|
| Text quality | LLMTextRepeat, LLMTextQualityV5 |
| RAG metrics | Faithfulness, ContextPrecision, ContextRecall, AnswerRelevancy, ContextRelevancy |
| Safety | LLMSecurityProhibition |
| 3H evaluation | LLMText3HHelpful, LLMText3HHarmless, LLMText3HHonest |
pip install "dingo-python[agent]")| Evaluator | Description |
|---|---|
ArticleFactChecker | Autonomous fact-checking with ArXiv/web search tools |
Dingo writes results to an output directory:
outputs/<timestamp>/
├── summary.json                  # Overall statistics
└── <field_group>/
    ├── QUALITY_BAD/
    │   ├── RULE_COLON_END.jsonl  # Failed items by metric
    │   └── ...
    └── QUALITY_GOOD/
        └── ...                   # Passed items (if result_save.good=true)
Example summary.json:

{
  "task_name": "...",
  "total_count": 100,
  "good_count": 85,
  "bad_count": 15,
  "good_ratio": 0.85,
  "metric_detail": {
    "RuleColonEnd": {"count": 5, "ratio": 0.05},
    "RuleSpecialCharacter": {"count": 10, "ratio": 0.1}
  }
}
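A minimal reporting sketch, assuming the timestamped layout shown above (reads the newest run under outputs/):

import json
from pathlib import Path

# Pick the newest run directory and read its summary.json.
latest = max(Path("outputs").iterdir(), key=lambda p: p.stat().st_mtime)
summary = json.loads((latest / "summary.json").read_text(encoding="utf-8"))

print(f"total={summary['total_count']} good={summary['good_count']} "
      f"bad={summary['bad_count']} good_ratio={summary['good_ratio']:.0%}")
for metric, detail in summary.get("metric_detail", {}).items():
    print(f"  {metric}: {detail['count']} ({detail['ratio']:.1%})")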
| Variable | Description |
|---|---|
| OPENAI_API_KEY | API key for LLM-based evaluation |
| OPENAI_BASE_URL | Custom API endpoint (default: https://api.openai.com/v1) |
| OPENAI_MODEL | Model name (default: gpt-4) |
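For example, a hedged way to assemble a per-evaluator LLM config from these variables, using the table's defaults as fallbacks:

import os

# Falls back to the documented defaults when the optional variables are unset.
llm_config = {
    "model": os.getenv("OPENAI_MODEL", "gpt-4"),
    "key": os.environ["OPENAI_API_KEY"],  # required for LLM-based evaluation
    "api_url": os.getenv("OPENAI_BASE_URL", "https://api.openai.com/v1"),
}
# Usable as {"name": "LLMTextRepeat", "config": llm_config} inside an evals list.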
| Format | Extension | Description |
|---|---|---|
| JSONL | .jsonl | One JSON object per line (recommended) |
| JSON | .json | Array of objects or single object |
| CSV | .csv | Comma-separated values |
| Plaintext | .txt | One item per line |
| Parquet | .parquet | Apache Parquet columnar format |
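A small illustrative helper for deriving the dataset format value from a file extension; the mapping follows this table, and the helper itself is hypothetical, not part of Dingo:

from pathlib import Path

EXT_TO_FORMAT = {".jsonl": "jsonl", ".json": "json", ".csv": "csv",
                 ".txt": "plaintext", ".parquet": "parquet"}

def dataset_format(path: str) -> str:
    """Map a file extension to Dingo's dataset format name."""
    return EXT_TO_FORMAT[Path(path).suffix.lower()]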
When using this skill on behalf of the user:
- Quote config paths that contain spaces: dingo eval --input "my config.json"
- Always use if __name__ == '__main__': when writing Python scripts — Dingo uses multiprocessing internally, which fails on macOS without this guard.
- Match the dataset format to the file extension: .jsonl → jsonl, .json → json, .csv → csv, .txt → plaintext.
- Never hardcode API keys; use the ${OPENAI_API_KEY} placeholder or environment variables.
- The fields mapping must match actual column names in the dataset.

Choose evaluators by task:

- Format and completeness checks: rule-based evaluators (RuleColonEnd, RuleContentNull, RuleSpecialCharacter)
- Semantic text quality: LLM evaluators (LLMTextQualityV5, LLMTextRepeat)
- RAG pipelines: RAG metrics (Faithfulness, ContextPrecision, ContextRecall, AnswerRelevancy). Requires user_input, response, retrieved_contexts, reference fields.
- Fact-checking: ArticleFactChecker (requires the dingo-python[agent] extra)
- Safety screening: LLMSecurityProhibition

After evaluation completes, the agent should:
- Read summary.json and report the key metrics: total items, good/bad counts, good ratio

Dingo includes a built-in MCP (Model Context Protocol) server, allowing AI agents (Cursor, Claude Desktop, etc.) to invoke Dingo's evaluation tools directly.
# SSE transport (default, for Cursor / remote agents)
dingo serve
# Custom port
dingo serve --port 9000
# stdio transport (for Claude Desktop / local agent spawn)
dingo serve --transport stdio
Cursor (~/.cursor/mcp.json):
{
  "mcpServers": {
    "dingo": {
      "url": "http://localhost:8000/sse"
    }
  }
}
Claude Desktop (claude_desktop_config.json):
{
  "mcpServers": {
    "dingo": {
      "command": "dingo",
      "args": ["serve", "--transport", "stdio"],
      "env": {
        "OPENAI_API_KEY": "your-key",
        "OPENAI_MODEL": "gpt-4o"
      }
    }
  }
}
| Tool | Description |
|---|---|
| run_dingo_evaluation | Run rule or LLM evaluation on a file |
| list_dingo_components | List rule groups, LLM models, prompts |
| get_rule_details | Get details about a specific rule |
| get_llm_details | Get details about a specific LLM evaluator |
| get_prompt_details | Get embedded prompt for an LLM |
| run_quick_evaluation | Goal-based evaluation (auto-infer settings) |
For detailed MCP documentation, see: https://github.com/MigoXLab/dingo/blob/main/README_mcp.md
Common issues and notes:

- ModuleNotFoundError: No module named 'dingo': Run pip install dingo-python (note: the package name is dingo-python, not dingo)
- RuntimeError: An attempt has been made to start a new process...: Wrap your code in if __name__ == '__main__': — required on macOS due to multiprocessing
- LLM evaluation fails: check that OPENAI_API_KEY is set and api_url is correct
- Empty or missing results: verify the fields mapping matches your dataset's actual column names
- RAG evaluators need all four fields: user_input, response, retrieved_contexts, reference
- Results are written to the outputs/ directory by default (timestamped subdirectories)
- The content field is the most commonly mapped field — it's the main text that most evaluators check

ArticleFactChecker extracts all verifiable claims from an article and verifies each one using ArXiv academic search and web search. It runs as an autonomous agent and produces a structured verification report.
pip install "dingo-python[agent]"
python3 -c "from dingo.config import InputArgs; print('Dingo OK')"
Required: OPENAI_API_KEY
Optional (recommended for web search): TAVILY_API_KEY
The skill includes scripts/fact_check.py, which handles all input preparation and configuration automatically:
python3 {baseDir}/scripts/fact_check.py path/to/article.md
Supported input formats: .md, .txt (auto-wrapped), .jsonl, .json
Optional arguments:
- --model MODEL — LLM model (default: env OPENAI_MODEL or gpt-5.4-mini)
- --max-claims N — claims to extract, 1–200 (default: 50)
- --max-concurrent N — parallel verification slots, 1–20 (default: 5)

The script outputs structured JSON to stdout; parse it and present the key results to the user.
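For example, an invocation overriding all three documented options (the file name and values are illustrative):

python3 {baseDir}/scripts/fact_check.py article.md --model gpt-5.4-mini --max-claims 20 --max-concurrent 3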
For direct SDK integration without the script:
import json, os, tempfile
from dingo.config import InputArgs
from dingo.exec import Executor

# IMPORTANT: wrap article into JSONL — plaintext is read line-by-line otherwise
article_text = open("article.md", encoding="utf-8").read()
tmp = tempfile.NamedTemporaryFile(mode="w", suffix=".jsonl", delete=False, encoding="utf-8")
tmp.write(json.dumps({"content": article_text}, ensure_ascii=False) + "\n")
tmp.close()

config = {
    "input_path": tmp.name,
    "dataset": {"source": "local", "format": "jsonl"},
    "executor": {"max_workers": 1},
    "evaluator": [{
        "fields": {"content": "content"},
        "evals": [{
            "name": "ArticleFactChecker",
            "config": {
                "key": os.environ["OPENAI_API_KEY"],
                "model": os.getenv("OPENAI_MODEL", "gpt-5.4-mini"),
                "api_url": os.getenv("OPENAI_BASE_URL", "https://api.openai.com/v1"),
                "parameters": {
                    "temperature": 0,
                    "agent_config": {
                        "max_concurrent_claims": 5,
                        "max_iterations": 50,
                        "tools": {
                            "claims_extractor": {
                                "api_key": os.environ["OPENAI_API_KEY"],
                                "model": os.getenv("OPENAI_MODEL", "gpt-5.4-mini"),
                                "base_url": os.getenv("OPENAI_BASE_URL", "https://api.openai.com/v1"),
                                "max_claims": 50
                            },
                            "arxiv_search": {"max_results": 5},
                            **({"tavily_search": {"api_key": os.environ["TAVILY_API_KEY"]}}
                               if os.getenv("TAVILY_API_KEY") else {})
                        }
                    }
                }
            }
        }]
    }]
}

if __name__ == "__main__":
    result = Executor.exec_map["local"](InputArgs(**config)).execute()
    print(f"Score: {result.score:.1f}% | Output: {result.output_path}")
    os.unlink(tmp.name)
Key requirement: always wrap execution in if __name__ == "__main__": when running Dingo with multiprocessing — required on macOS, recommended everywhere.
The summary.json in the output directory contains overall stats. Detailed per-claim results are in content/QUALITY_BAD_*.jsonl (for articles with false claims).
Each result item's eval_details.content[0] has:
- score: accuracy_score (0.0–1.0, ratio of verified-true claims)
- reason[0]: human-readable text summary
- reason[1]: full structured report dict with detailed_findings and false_claims_comparison

For advanced configuration (model selection, claim types, tuning), see references/advanced-config.md.
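As a closing sketch, parsing one failed item under the assumptions above (the glob path is hypothetical; the score/reason layout follows the list):

import json
from pathlib import Path

# Hypothetical path: first QUALITY_BAD report produced by a fact-check run.
report_file = next(Path("outputs").glob("*/content/QUALITY_BAD_*.jsonl"))
item = json.loads(report_file.read_text(encoding="utf-8").splitlines()[0])

detail = item["eval_details"]["content"][0]
print("accuracy_score:", detail["score"])  # 0.0–1.0, ratio of verified-true claims
print("summary:", detail["reason"][0])     # human-readable text summary
report = detail["reason"][1]               # full structured report dict
print("false claims:", report.get("false_claims_comparison"))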