dingo data quality
Evaluate AI training and RAG data quality using rule-based or LLM-based metrics with Dingo's flexible, multi-format assessment framework and CLI/SDK support.
Like a lobster shell, security has layers — review code before you run it.
License
SKILL.md
Data Quality Evaluation with Dingo
Dingo: A Comprehensive AI Data, Model and Application Quality Evaluation Tool.
- GitHub: https://github.com/MigoXLab/dingo
- SaaS Platform: https://dingo.openxlab.org.cn/ (free, no install needed)
- PyPI: https://pypi.org/project/dingo-python/
Installation
pip install dingo-python
Optional extras
pip install "dingo-python[agent]" # Agent-based evaluation (fact-checking)
pip install "dingo-python[hhem]" # HHEM hallucination detection
pip install "dingo-python[all]" # Everything
Verify installation
python -c "from dingo.config import InputArgs; print('Dingo OK')"
Two evaluation modes
| Rule-based | LLM-based | |
|---|---|---|
| API key required | No | Yes (any OpenAI-compatible API) |
| Speed | Fast | Slower (API calls) |
| Cost | Zero | Per-token cost |
| Metrics | 50+ deterministic rules | Text quality, RAG, 3H, security |
| Best for | Format checks, PII, completeness | Semantic quality, faithfulness |
Core workflow
- Prepare data: JSONL, JSON, CSV, plaintext, or Parquet file
- Choose evaluators: Rule-based (free, fast) or LLM-based (semantic understanding)
- Run evaluation: CLI with config file or Python SDK
- Review results:
summary.json+ per-item JSONL reports in output directory
CLI Usage
Dingo CLI takes a JSON config file as input:
dingo eval --input config.json
Minimal rule-based config
{
"input_path": "data.jsonl",
"dataset": {"source": "local", "format": "jsonl"},
"evaluator": [
{
"fields": {"content": "content"},
"evals": [
{"name": "RuleColonEnd"},
{"name": "RuleSpecialCharacter"},
{"name": "RuleContentNull"}
]
}
]
}
LLM-based config
{
"input_path": "data.jsonl",
"dataset": {"source": "local", "format": "jsonl"},
"evaluator": [
{
"fields": {"content": "content"},
"evals": [
{
"name": "LLMTextRepeat",
"config": {
"model": "deepseek-chat",
"key": "${OPENAI_API_KEY}",
"api_url": "https://api.deepseek.com/v1"
}
}
]
}
]
}
RAG evaluation config
RAG evaluation requires specific fields mapped from the dataset:
{
"input_path": "rag_output.jsonl",
"dataset": {"source": "local", "format": "jsonl"},
"evaluator": [
{
"fields": {
"user_input": "user_input",
"response": "response",
"retrieved_contexts": "retrieved_contexts",
"reference": "reference"
},
"evals": [
{"name": "Faithfulness", "config": {"model": "deepseek-chat", "key": "${OPENAI_API_KEY}", "api_url": "https://api.deepseek.com/v1"}},
{"name": "ContextPrecision", "config": {"model": "deepseek-chat", "key": "${OPENAI_API_KEY}", "api_url": "https://api.deepseek.com/v1"}}
]
}
]
}
Multi-field evaluation config
Evaluate different columns with different rules:
{
"input_path": "qa_data.jsonl",
"dataset": {"source": "local", "format": "jsonl"},
"evaluator": [
{
"fields": {"content": "answer"},
"evals": [{"name": "RuleColonEnd"}, {"name": "RuleSpecialCharacter"}]
},
{
"fields": {"content": "question"},
"evals": [{"name": "RuleContentNull"}]
}
]
}
SDK Usage
For programmatic use inside Python scripts:
from dingo.config import InputArgs
from dingo.exec import Executor
if __name__ == '__main__':
input_data = {
"input_path": "data.jsonl",
"dataset": {"source": "local", "format": "jsonl"},
"evaluator": [
{
"fields": {"content": "content"},
"evals": [
{"name": "RuleColonEnd"},
{"name": "RuleSpecialCharacter"}
]
}
]
}
input_args = InputArgs(**input_data)
executor = Executor.exec_map["local"](input_args)
result = executor.execute()
print(result)
Config reference
Dataset configuration
| Field | Values | Description |
|---|---|---|
source | local, huggingface, s3, sql | Data source type |
format | jsonl, json, csv, plaintext, parquet | File format |
Executor configuration
| Field | Default | Description |
|---|---|---|
max_workers | 1 | Parallel evaluation workers |
batch_size | 10 | Items per batch |
result_save.bad | true | Save items that fail evaluation |
result_save.good | false | Save items that pass evaluation |
result_save.merge | false | Merge all results into single file |
Evaluator configuration
Each evaluator group has:
| Field | Required | Description |
|---|---|---|
fields | Yes | Maps Dingo fields to dataset columns |
evals | Yes | List of evaluators to apply |
evals[].name | Yes | Evaluator class name |
evals[].config | For LLM | LLM config: model, key, api_url |
Field mapping
The fields object maps Dingo's internal field names to your dataset's column names:
| Dingo field | Description | Used by |
|---|---|---|
content | Main text content to evaluate | Most rule/LLM evaluators |
prompt | Instruction/question field | Instruction quality evaluators |
image | Image path or URL | VLM evaluators |
user_input | User query | RAG evaluators |
response | Model response | RAG evaluators |
retrieved_contexts | Retrieved context list | RAG evaluators |
reference | Ground truth reference | RAG evaluators |
Available evaluators
Rule-based (no API key needed)
| Category | Examples |
|---|---|
| Content checks | RuleContentNull, RuleContentShort, RuleDocRepeat |
| Format checks | RuleColonEnd, RuleSpecialCharacter, RuleAbnormalChar |
| Quality checks | RuleLongWord, RuleHighPPL, RulePunctuation |
| PII detection | RulePII, RuleUrl, RuleEmail |
| Language | RuleChineseChaos, RuleChineseTraditional |
LLM-based (requires API key)
| Category | Evaluators |
|---|---|
| Text quality | LLMTextRepeat, LLMTextQualityV5 |
| RAG metrics | Faithfulness, ContextPrecision, ContextRecall, AnswerRelevancy, ContextRelevancy |
| Safety | LLMSecurityProhibition |
| 3H evaluation | LLMText3HHelpful, LLMText3HHarmless, LLMText3HHonest |
Agent-based (requires pip install "dingo-python[agent]")
| Evaluator | Description |
|---|---|
ArticleFactChecker | Autonomous fact-checking with ArXiv/web search tools |
Output structure
Dingo writes results to an output directory:
outputs/<timestamp>/
├── summary.json # Overall statistics
└── <field_group>/
├── QUALITY_BAD/
│ ├── RULE_COLON_END.jsonl # Failed items by metric
│ └── ...
└── QUALITY_GOOD/
└── ... # Passed items (if result_save.good=true)
summary.json format
{
"task_name": "...",
"total_count": 100,
"good_count": 85,
"bad_count": 15,
"good_ratio": 0.85,
"metric_detail": {
"RuleColonEnd": {"count": 5, "ratio": 0.05},
"RuleSpecialCharacter": {"count": 10, "ratio": 0.1}
}
}
Environment variables
| Variable | Description |
|---|---|
OPENAI_API_KEY | API key for LLM-based evaluation |
OPENAI_BASE_URL | Custom API endpoint (default: https://api.openai.com/v1) |
OPENAI_MODEL | Model name (default: gpt-4) |
Supported input formats
| Format | Extension | Description |
|---|---|---|
| JSONL | .jsonl | One JSON object per line (recommended) |
| JSON | .json | Array of objects or single object |
| CSV | .csv | Comma-separated values |
| Plaintext | .txt | One item per line |
| Parquet | .parquet | Apache Parquet columnar format |
General rules
When using this skill on behalf of the user:
- Always write a config file before running CLI evaluation. Don't try to pass complex JSON inline.
- Quote file paths with spaces in commands:
dingo eval --input "my config.json" - Wrap main code in
if __name__ == '__main__':when writing Python scripts — Dingo uses multiprocessing internally, which fails on macOS without this guard. - Infer format from extension:
.jsonl→jsonl,.json→json,.csv→csv,.txt→plaintext. - Default to rule-based when the user doesn't specify evaluation type — it's free, fast, and needs no API key.
- Ask for API key before using LLM-based evaluators. Never hardcode keys in config files; use
${OPENAI_API_KEY}placeholder or environment variables. - Check field names in the user's data before writing config. The
fieldsmapping must match actual column names in the dataset.
Choosing evaluators
- User wants basic quality checks → Use rule-based evaluators (e.g.,
RuleColonEnd,RuleContentNull,RuleSpecialCharacter) - User wants semantic quality assessment → Use LLM-based evaluators (e.g.,
LLMTextQualityV5,LLMTextRepeat) - User wants RAG pipeline evaluation → Use RAG metrics (
Faithfulness,ContextPrecision,ContextRecall,AnswerRelevancy). Requiresuser_input,response,retrieved_contexts,referencefields. - User wants fact-checking → Use
ArticleFactChecker(requiresdingo-python[agent]extra) - User wants safety/content moderation → Use
LLMSecurityProhibition - User doesn't know what to check → Start with common rule checks, show the summary, then suggest LLM-based evaluators if needed.
Post-evaluation guidance
After evaluation completes, the agent should:
- Read
summary.jsonand report the key metrics: total items, good/bad counts, good ratio - If there are failures, briefly explain what each failing metric means
- Suggest next steps (e.g., "15% of items have colon-ending issues — you may want to clean those")
MCP Server (AI Agent Integration)
Dingo includes a built-in MCP (Model Context Protocol) server, allowing AI agents (Cursor, Claude Desktop, etc.) to invoke Dingo's evaluation tools directly.
Start the server
# SSE transport (default, for Cursor / remote agents)
dingo serve
# Custom port
dingo serve --port 9000
# stdio transport (for Claude Desktop / local agent spawn)
dingo serve --transport stdio
Configure your AI agent
Cursor (~/.cursor/mcp.json):
{
"mcpServers": {
"dingo": {
"url": "http://localhost:8000/sse"
}
}
}
Claude Desktop (claude_desktop_config.json):
{
"mcpServers": {
"dingo": {
"command": "dingo",
"args": ["serve", "--transport", "stdio"],
"env": {
"OPENAI_API_KEY": "your-key",
"OPENAI_MODEL": "gpt-4o"
}
}
}
}
Available MCP tools
| Tool | Description |
|---|---|
run_dingo_evaluation | Run rule or LLM evaluation on a file |
list_dingo_components | List rule groups, LLM models, prompts |
get_rule_details | Get details about a specific rule |
get_llm_details | Get details about a specific LLM evaluator |
get_prompt_details | Get embedded prompt for an LLM |
run_quick_evaluation | Goal-based evaluation (auto-infer settings) |
For detailed MCP documentation, see: https://github.com/MigoXLab/dingo/blob/main/README_mcp.md
Troubleshooting
ModuleNotFoundError: No module named 'dingo': Runpip install dingo-python(note: the package name isdingo-python, notdingo)RuntimeError: An attempt has been made to start a new process...: Wrap your code inif __name__ == '__main__':— required on macOS due to multiprocessing- LLM evaluation returns errors: Check that
OPENAI_API_KEYis set andapi_urlis correct - Empty results: Verify
fieldsmapping matches your dataset's actual column names - RAG metrics all fail: Ensure your data has all required fields:
user_input,response,retrieved_contexts,reference
Notes
- Dingo supports any OpenAI-compatible API (OpenAI, DeepSeek, Anthropic via proxy, local vLLM, etc.)
- Rule-based evaluators run locally with zero API cost
- Results are written to the
outputs/directory by default (timestamped subdirectories) - The
contentfield is the most commonly mapped field — it's the main text that most evaluators check
Resources
- GitHub: https://github.com/MigoXLab/dingo
- SaaS Platform: https://dingo.openxlab.org.cn/
- PyPI: https://pypi.org/project/dingo-python/
- Metrics Documentation: https://github.com/MigoXLab/dingo/blob/main/docs/metrics.md
- RAG Evaluation Guide: https://github.com/MigoXLab/dingo/blob/main/docs/rag_evaluation_en.md
- Discord: https://discord.gg/Jhgb2eKWh8
Fact-Checking Articles with ArticleFactChecker
ArticleFactChecker extracts all verifiable claims from an article and verifies each one using ArXiv academic search and web search. It runs as an autonomous agent and produces a structured verification report.
Prerequisites
pip install "dingo-python[agent]"
python3 -c "from dingo.config import InputArgs; print('Dingo OK')"
Required: OPENAI_API_KEY
Optional (recommended for web search): TAVILY_API_KEY
Quick start — use the bundled script
The skill includes scripts/fact_check.py which handles all input preparation and configuration automatically:
python3 {baseDir}/scripts/fact_check.py path/to/article.md
Supported input formats: .md, .txt (auto-wrapped), .jsonl, .json
Optional arguments:
--model MODEL— LLM model (default: envOPENAI_MODELorgpt-5.4-mini)--max-claims N— claims to extract, 1–200 (default: 50)--max-concurrent N— parallel verification slots, 1–20 (default: 5)
The script outputs structured JSON to stdout. Parse and present:
- accuracy_score (0.0–1.0): fraction of claims verified true
- false_claims: list of contradicted claims with evidence
- all_claims: full breakdown with TRUE/FALSE/UNVERIFIABLE verdicts
Manual SDK usage
For direct SDK integration without the script:
import json, os, tempfile
from dingo.config import InputArgs
from dingo.exec import Executor
# IMPORTANT: wrap article into JSONL — plaintext is read line-by-line otherwise
article_text = open("article.md", encoding="utf-8").read()
tmp = tempfile.NamedTemporaryFile(mode="w", suffix=".jsonl", delete=False, encoding="utf-8")
tmp.write(json.dumps({"content": article_text}, ensure_ascii=False) + "\n")
tmp.close()
config = {
"input_path": tmp.name,
"dataset": {"source": "local", "format": "jsonl"},
"executor": {"max_workers": 1},
"evaluator": [{
"fields": {"content": "content"},
"evals": [{
"name": "ArticleFactChecker",
"config": {
"key": os.environ["OPENAI_API_KEY"],
"model": os.getenv("OPENAI_MODEL", "gpt-5.4-mini"),
"api_url": os.getenv("OPENAI_BASE_URL", "https://api.openai.com/v1"),
"parameters": {
"temperature": 0,
"agent_config": {
"max_concurrent_claims": 5,
"max_iterations": 50,
"tools": {
"claims_extractor": {
"api_key": os.environ["OPENAI_API_KEY"],
"model": os.getenv("OPENAI_MODEL", "gpt-5.4-mini"),
"base_url": os.getenv("OPENAI_BASE_URL", "https://api.openai.com/v1"),
"max_claims": 50
},
"arxiv_search": {"max_results": 5},
**({"tavily_search": {"api_key": os.environ["TAVILY_API_KEY"]}}
if os.getenv("TAVILY_API_KEY") else {})
}
}
}
}
}]
}]
}
if __name__ == "__main__":
result = Executor.exec_map["local"](InputArgs(**config)).execute()
print(f"Score: {result.score:.1f}% | Output: {result.output_path}")
os.unlink(tmp.name)
Key requirement: Always use
if __name__ == "__main__":when running Dingo with multiprocessing — required on macOS, recommended everywhere.
Interpreting the output
The summary.json in the output directory contains overall stats. Detailed per-claim results are in content/QUALITY_BAD_*.jsonl (for articles with false claims).
Each result item's eval_details.content[0] has:
score: accuracy_score (0.0–1.0, ratio of verified-true claims)reason[0]: human-readable text summaryreason[1]: full structured report dict withdetailed_findingsandfalse_claims_comparison
For advanced configuration (model selection, claim types, tuning), see references/advanced-config.md.
Files
5 totalComments
Loading comments…
