Install
openclaw skills install semantic-consistency-auditor
Use the semantic consistency auditor for academic writing workflows that need structured execution, explicit assumptions, and clear output boundaries.
ID: 212
Name: semantic-consistency-auditor
Description: Uses BERTScore and COMET to evaluate the semantic consistency between AI-generated clinical notes and expert gold standards at the level of semantic entailment.
scripts/main.py.
references/ for task-specific guidance.
See ## Prerequisites above for related details.
Python: 3.10+. Repository baseline for current packaged skills.
bert_score: unspecified. Declared in requirements.txt.
comet: unspecified. Declared in requirements.txt.
dataclasses: unspecified. Declared in requirements.txt.
numpy: unspecified. Declared in requirements.txt.
torch: unspecified. Declared in requirements.txt.
yaml: unspecified. Declared in requirements.txt.
See ## Usage above for related details.
cd "20260318/scientific-skills/Academic Writing/semantic-consistency-auditor"
python -m py_compile scripts/main.py
python scripts/main.py --help
Example run plan:
Check the CONFIG block or documented parameters if the script uses fixed settings.
Run python scripts/main.py with the validated inputs.
See ## Workflow above for related details.
scripts/main.py.
references/ contains supporting rules, prompts, or checklists.
Use this command to verify that the packaged script entry point can be parsed before deeper execution.
python -m py_compile scripts/main.py
Use these concrete commands for validation. They are intentionally self-contained and avoid placeholder paths.
python -m py_compile scripts/main.py
python scripts/main.py --help
Semantic Consistency Auditor is a medical AI evaluation tool that assesses the semantic consistency between AI-generated clinical notes and expert-written gold standards. Rather than relying on string matching or bag-of-words overlap, it uses deep learning models to capture semantic entailment relationships, so it can recognize statements that are worded differently but mean the same thing.
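BERTScore uses contextual embeddings from a pre-trained BERT-family model to compute the similarity between candidate text and reference text. Each token is greedily matched to its most similar token on the other side by cosine similarity, which yields precision, recall, and F1.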
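The greedy matching behind BERTScore can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the code in scripts/main.py: the embeddings are toy 2-D vectors rather than model outputs, the function names are hypothetical, and the optional IDF weighting and baseline rescaling of the real metric are left out.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def greedy_bertscore(cand_emb, ref_emb):
    """Return (precision, recall, f1) from per-token embeddings.

    Hypothetical helper: cand_emb and ref_emb are lists of token vectors.
    """
    sim = [[cosine(c, r) for r in ref_emb] for c in cand_emb]
    # Precision: each candidate token takes its best match in the reference.
    precision = sum(max(row) for row in sim) / len(cand_emb)
    # Recall: each reference token takes its best match in the candidate.
    recall = sum(
        max(sim[i][j] for i in range(len(cand_emb)))
        for j in range(len(ref_emb))
    ) / len(ref_emb)
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy embeddings (illustrative only, not produced by a real model).
cand = [[1.0, 0.0], [0.0, 1.0]]
ref = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
p, r, f = greedy_bertscore(cand, ref)
print(f"precision={p:.4f} recall={r:.4f} f1={f:.4f}")
```

In this toy case every candidate token has an exact match, so precision is 1.0, while the third reference token is only partly covered and pulls recall below 1.0. That asymmetry is why the output reports precision, recall, and F1 separately.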
COMET is a neural evaluation metric originally developed for machine translation evaluation and applied here to semantic entailment tasks.
# Create virtual environment (recommended)
python -m venv venv
source venv/bin/activate # Linux/Mac
# Or venv\Scripts\activate # Windows
# Install dependencies
pip install bert-score unbabel-comet transformers torch
Configure in ~/.openclaw/skills/semantic-consistency-auditor/config.yaml:
# BERTScore Configuration
bertscore:
  model: "microsoft/deberta-xlarge-mnli"  # Or "bert-base-chinese" for Chinese
  lang: "zh"  # Language code: zh, en, etc.
  rescale_with_baseline: true
  device: "auto"  # auto, cpu, cuda
# COMET Configuration
comet:
  model: "Unbabel/wmt22-comet-da"  # COMET model
  batch_size: 8
  device: "auto"
# Evaluation Thresholds
thresholds:
  bertscore_f1: 0.85
  comet_score: 0.75
  semantic_consistency: 0.80  # Comprehensive score threshold
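How the three thresholds might be applied can be sketched as follows. Two things here are assumptions rather than documented behavior: the comprehensive score is taken as the unweighted mean of BERTScore F1 and COMET (this matches the sample output in this document, where 0.9028 and 0.8234 give 0.8631), and a case is treated as passing only when all three thresholds are met. The names `Thresholds`, `combine`, and `passes` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    """Default values copied from the thresholds block in config.yaml."""
    bertscore_f1: float = 0.85
    comet_score: float = 0.75
    semantic_consistency: float = 0.80


def combine(bertscore_f1: float, comet_score: float) -> float:
    # Assumption: unweighted mean, inferred from the sample output only.
    return (bertscore_f1 + comet_score) / 2


def passes(bertscore_f1: float, comet_score: float,
           thresholds: Thresholds = Thresholds()) -> bool:
    # Assumption: every threshold must be met for a case to pass.
    consistency = combine(bertscore_f1, comet_score)
    return (
        bertscore_f1 >= thresholds.bertscore_f1
        and comet_score >= thresholds.comet_score
        and consistency >= thresholds.semantic_consistency
    )


score = combine(0.9028, 0.8234)
print(round(score, 4))          # 0.8631
print(passes(0.9028, 0.8234))   # True
```

If scripts/main.py weights the two metrics differently, only `combine` would need to change.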
# Evaluate single case pair
python scripts/main.py \
--ai-generated "Patient presented with fever for 3 days, highest temperature 39°C, accompanied by cough." \
--gold-standard "Patient chief complaint of fever for 3 days, highest temperature 39°C, accompanied by cough symptoms." \
--output results.json
# Batch evaluation from JSON file
python scripts/main.py \
--input-file batch_cases.json \
--output results.json \
--format detailed
# Use specific model
python scripts/main.py \
--ai-generated "..." \
--gold-standard "..." \
--bert-model "bert-base-chinese" \
--comet-model "Unbabel/wmt20-comet-da"
from semantic_consistency_auditor import SemanticConsistencyAuditor

# Initialize evaluator
auditor = SemanticConsistencyAuditor(
    bert_model="microsoft/deberta-xlarge-mnli",
    comet_model="Unbabel/wmt22-comet-da",
    lang="zh",
)

# Evaluate single case
result = auditor.evaluate(
    ai_text="Patient presented with fever for 3 days...",
    gold_text="Patient chief complaint of fever for 3 days...",
)
print(f"BERTScore F1: {result['bertscore']['f1']:.4f}")
print(f"COMET Score: {result['comet']['score']:.4f}")
print(f"Consistency: {result['consistency']:.4f}")
print(f"Passed: {result['passed']}")

# Batch evaluation
results = auditor.evaluate_batch([
    {"ai": "...", "gold": "..."},
    {"ai": "...", "gold": "..."},
])
Pass text directly through --ai-generated and --gold-standard parameters.
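A command-line interface with the flags used in the examples could be declared roughly as below. This is a sketch, not the parser in scripts/main.py: the flag names come from the commands in this document, while the defaults, the "summary" format value, and the rule that direct text and --input-file are mutually exclusive are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        prog="main.py",
        description="Audit semantic consistency between two clinical notes.",
    )
    parser.add_argument("--ai-generated", help="AI-generated note text")
    parser.add_argument("--gold-standard", help="expert gold-standard text")
    parser.add_argument("--input-file", help="JSON file with a list of cases")
    parser.add_argument("--output", help="path for the JSON results")
    # Assumption: 'summary' is the default; only 'detailed' appears in the docs.
    parser.add_argument("--format", default="summary")
    parser.add_argument("--bert-model", default="microsoft/deberta-xlarge-mnli")
    parser.add_argument("--comet-model", default="Unbabel/wmt22-comet-da")
    return parser


def check_mode(args: argparse.Namespace) -> str:
    """Return 'single' or 'batch'; raise ValueError on an ambiguous request."""
    has_text = args.ai_generated is not None or args.gold_standard is not None
    if args.input_file is not None:
        if has_text:
            raise ValueError("use either --input-file or the text pair, not both")
        return "batch"
    if args.ai_generated is None or args.gold_standard is None:
        raise ValueError("single mode needs --ai-generated and --gold-standard")
    return "single"


args = build_parser().parse_args(
    ["--ai-generated", "a", "--gold-standard", "b", "--output", "results.json"]
)
mode = check_mode(args)
print(mode)  # single
```

Run `python scripts/main.py --help` to see the flags the packaged script actually accepts.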
[
  {
    "case_id": "CASE001",
    "ai_generated": "Patient presented with fever for 3 days, highest temperature 39°C, accompanied by cough.",
    "gold_standard": "Patient chief complaint of fever for 3 days, highest temperature 39°C, accompanied by cough symptoms.",
    "metadata": {
      "department": "Respiratory",
      "disease_type": "Upper respiratory infection"
    }
  },
  {
    "case_id": "CASE002",
    "ai_generated": "...",
    "gold_standard": "..."
  }
]
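Before a long batch run it can help to check the input file against the shape shown above. The sketch below assumes that `case_id`, `ai_generated`, and `gold_standard` are required non-empty strings and that `metadata` is optional (the second case above omits it). `validate_cases` is a hypothetical helper, and the sample data is illustrative.

```python
import json

REQUIRED_FIELDS = ("case_id", "ai_generated", "gold_standard")


def validate_cases(cases):
    """Return a list of error strings; an empty list means the batch is usable."""
    if not isinstance(cases, list):
        return ["top-level JSON value must be a list of cases"]
    errors = []
    seen_ids = set()
    for index, case in enumerate(cases):
        if not isinstance(case, dict):
            errors.append(f"case {index}: expected an object")
            continue
        for field in REQUIRED_FIELDS:
            value = case.get(field)
            if not isinstance(value, str) or not value.strip():
                errors.append(f"case {index}: missing or empty '{field}'")
        case_id = case.get("case_id")
        if isinstance(case_id, str):
            if case_id in seen_ids:
                errors.append(f"case {index}: duplicate case_id '{case_id}'")
            seen_ids.add(case_id)
    return errors


# Illustrative input: the second case is incomplete on purpose.
sample = json.loads("""
[
  {"case_id": "CASE001", "ai_generated": "fever for 3 days", "gold_standard": "fever for 3 days"},
  {"case_id": "CASE002", "ai_generated": ""}
]
""")
errors = validate_cases(sample)
for message in errors:
    print(message)
```

Reporting every problem in one pass, rather than stopping at the first, makes it easier to repair a large batch file before the models are loaded.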
{
  "overall": {
    "total_cases": 100,
    "passed_cases": 85,
    "pass_rate": 0.85,
    "avg_bertscore_f1": 0.8923,
    "avg_comet_score": 0.8234,
    "avg_consistency": 0.8579
  },
  "thresholds": {
    "bertscore_f1": 0.85,
    "comet_score": 0.75,
    "semantic_consistency": 0.80
  }
}
{
  "cases": [
    {
      "case_id": "CASE001",
      "ai_generated": "Patient presented with fever for 3 days...",
      "gold_standard": "Patient chief complaint of fever for 3 days...",
      "metrics": {
        "bertscore": {
          "precision": 0.9123,
          "recall": 0.8934,
          "f1": 0.9028
        },
        "comet": {
          "score": 0.8234,
          "system_score": 0.8156
        },
        "semantic_consistency": 0.8631
      },
      "passed": true,
      "details": {
        "semantic_gaps": [],
        "matched_concepts": ["fever for 3 days", "temperature 39°C", "cough"]
      }
    }
  ],
  "summary": { ... }
}
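The fields of the "overall" block can be derived from per-case entries shaped like the one above. The following sketch assumes plain arithmetic means rounded to four decimals. `summarize` is a hypothetical name, and the two input cases are made-up values for illustration, not real evaluation results.

```python
def summarize(cases):
    """Aggregate per-case results into the fields of the 'overall' block."""
    total = len(cases)
    if total == 0:
        return {"total_cases": 0, "passed_cases": 0, "pass_rate": 0.0}

    def average(getter):
        return round(sum(getter(case) for case in cases) / total, 4)

    passed = sum(1 for case in cases if case["passed"])
    return {
        "total_cases": total,
        "passed_cases": passed,
        "pass_rate": round(passed / total, 4),
        "avg_bertscore_f1": average(lambda c: c["metrics"]["bertscore"]["f1"]),
        "avg_comet_score": average(lambda c: c["metrics"]["comet"]["score"]),
        "avg_consistency": average(lambda c: c["metrics"]["semantic_consistency"]),
    }


# Made-up cases that only carry the fields the aggregation reads.
example_cases = [
    {"passed": True,
     "metrics": {"bertscore": {"f1": 0.9}, "comet": {"score": 0.8},
                 "semantic_consistency": 0.85}},
    {"passed": False,
     "metrics": {"bertscore": {"f1": 0.8}, "comet": {"score": 0.7},
                 "semantic_consistency": 0.75}},
]
overall = summarize(example_cases)
print(overall)
```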
If scripts/main.py fails, report the failure point, summarize what can still be completed safely, and provide a manual fallback.
# Python dependencies
pip install -r requirements.txt
Every final response should make these items explicit when they are relevant:
This skill accepts requests that match the documented purpose of semantic-consistency-auditor and include enough context to complete the workflow safely.
Do not continue the workflow when the request is out of scope, missing a critical input, or would require unsupported assumptions. Instead respond:
semantic-consistency-auditor only handles its documented workflow. Please provide the missing required inputs or switch to a more suitable skill.
Use the following fixed structure for non-trivial requests:
If the request is simple, you may compress the structure, but still keep assumptions and limits explicit when they affect correctness.