# Azure AI Evaluation SDK for Python

v0.1.0 · Azure AI Evaluation SDK for Python. Use it to evaluate generative AI applications with quality, safety, and custom evaluators. Triggers: "azure-ai-evaluation", "evaluators", "GroundednessEvaluator", "evaluate", "AI quality metrics".
## Security Scan (OpenClaw)

**Verdict:** Benign (high confidence)

### Purpose & Capability
The name/description match the included docs and CLI script: evaluating generative AI with built-in and custom evaluators. The environment variables mentioned (AZURE_OPENAI_ENDPOINT, AZURE_OPENAI_API_KEY, AZURE_OPENAI_DEPLOYMENT, AIPROJECT_CONNECTION_STRING) and imports (azure.ai.evaluation, azure.identity, azure.ai.projects) are appropriate and expected for the described functionality. No unrelated credentials, binaries, or config paths are requested.
### Instruction Scope
SKILL.md and scripts limit actions to building evaluator instances, calling evaluate(), and reading user-supplied data files (JSONL). Examples include prompt-based evaluators that send prompts to Azure OpenAI models (expected). There are no instructions to read arbitrary system files or post data to unexpected third-party endpoints. Note: examples include a sample that contains the phrase 'ignore previous instructions' as part of demonstrating an IndirectAttackEvaluator — this is a documentation/example of prompt-injection detection, not an instruction to ignore agent constraints.
### Install Mechanism
The skill is instruction-only and includes no platform install spec. SKILL.md recommends pip installing the 'azure-ai-evaluation' package and optional extras for remote evaluation; this is normal but means installation happens outside the platform. Verify the pip package provenance before installing to your environment.
### Credentials
Requested environment variables are limited to Azure/OpenAI and Foundry (AIPROJECT_CONNECTION_STRING). Those are proportional to evaluating models and logging to a Foundry project. No unrelated secrets or broad system credentials are requested.
### Persistence & Privilege
The skill does not request persistent presence (always:false) and contains no code that modifies other skills or global agent configuration. It does not require elevated privileges beyond normal network calls to Azure services.
### Scan Findings in Context
[ignore-previous-instructions] expected: The SKILL.md and reference examples intentionally include a sample '[hidden: ignore previous instructions]' string to illustrate detection of indirect prompt-injection attacks (IndirectAttackEvaluator). This appears to be an example for a safety evaluator rather than malicious prompt injection.
### Assessment

This skill appears coherent with its stated purpose: it needs an Azure OpenAI endpoint and either an API key or DefaultAzureCredential, and optionally a Foundry connection string if you want safety logging. Before installing or using it:

1. Confirm you trust the pip package 'azure-ai-evaluation' (review its upstream source) before installing.
2. Only run evaluations on datasets you control or have vetted, since the data will be sent to your configured Azure OpenAI deployment.
3. Review any custom or prompt-based evaluators you add: they can send arbitrary text to the model, so avoid embedding secrets in evaluated data or prompts.
4. The documentation contains an example string used to demonstrate detecting prompt injection; this is benign in context, but be cautious when reusing prompts that include 'ignore previous instructions' patterns.

Like a lobster shell, security has layers: review code before you run it.
# Azure AI Evaluation SDK for Python

Assess generative AI application performance with built-in and custom evaluators.
## Installation

```shell
pip install azure-ai-evaluation

# With remote evaluation support
pip install "azure-ai-evaluation[remote]"
```
## Environment Variables

```shell
# For AI-assisted evaluators
AZURE_OPENAI_ENDPOINT=https://<resource>.openai.azure.com
AZURE_OPENAI_API_KEY=<your-api-key>
AZURE_OPENAI_DEPLOYMENT=gpt-4o-mini

# For Foundry project integration
AIPROJECT_CONNECTION_STRING=<your-connection-string>
```
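A missing variable otherwise surfaces only when an evaluator first calls the model, so it can help to fail fast at startup. A minimal, hypothetical pre-flight check (the variable names mirror the list above; `missing_vars` is not part of the SDK):

```python
import os

# Variables the AI-assisted evaluators expect. AIPROJECT_CONNECTION_STRING
# is only needed when logging results to a Foundry project, so it is
# deliberately left out of the required set.
REQUIRED = [
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_API_KEY",
    "AZURE_OPENAI_DEPLOYMENT",
]

def missing_vars(env=os.environ) -> list:
    """Return the required variables that are unset or empty."""
    return [name for name in REQUIRED if not env.get(name)]

missing = missing_vars()
if missing:
    print(f"Set these before running evaluations: {', '.join(missing)}")
```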
## Built-in Evaluators

### Quality Evaluators (AI-Assisted)

```python
import os

from azure.ai.evaluation import (
    GroundednessEvaluator,
    RelevanceEvaluator,
    CoherenceEvaluator,
    FluencyEvaluator,
    SimilarityEvaluator,
    RetrievalEvaluator,
)

# Initialize with Azure OpenAI model config
model_config = {
    "azure_endpoint": os.environ["AZURE_OPENAI_ENDPOINT"],
    "api_key": os.environ["AZURE_OPENAI_API_KEY"],
    "azure_deployment": os.environ["AZURE_OPENAI_DEPLOYMENT"],
}

groundedness = GroundednessEvaluator(model_config)
relevance = RelevanceEvaluator(model_config)
coherence = CoherenceEvaluator(model_config)
```
### Quality Evaluators (NLP-Based)

```python
from azure.ai.evaluation import (
    F1ScoreEvaluator,
    RougeScoreEvaluator,
    BleuScoreEvaluator,
    GleuScoreEvaluator,
    MeteorScoreEvaluator,
)

f1 = F1ScoreEvaluator()
rouge = RougeScoreEvaluator()
bleu = BleuScoreEvaluator()
```
### Safety Evaluators

```python
from azure.ai.evaluation import (
    ViolenceEvaluator,
    SexualEvaluator,
    SelfHarmEvaluator,
    HateUnfairnessEvaluator,
    IndirectAttackEvaluator,
    ProtectedMaterialEvaluator,
)

# project_scope identifies the Azure AI (Foundry) project the
# safety evaluators run against; define it before this point.
violence = ViolenceEvaluator(azure_ai_project=project_scope)
sexual = SexualEvaluator(azure_ai_project=project_scope)
```
## Single-Row Evaluation

```python
from azure.ai.evaluation import GroundednessEvaluator

groundedness = GroundednessEvaluator(model_config)

result = groundedness(
    query="What is Azure AI?",
    context="Azure AI is Microsoft's AI platform...",
    response="Azure AI provides AI services and tools.",
)

print(f"Groundedness score: {result['groundedness']}")
print(f"Reason: {result['groundedness_reason']}")
```
## Batch Evaluation with evaluate()

```python
from azure.ai.evaluation import evaluate

result = evaluate(
    data="test_data.jsonl",
    evaluators={
        "groundedness": groundedness,
        "relevance": relevance,
        "coherence": coherence,
    },
    evaluator_config={
        "default": {
            "column_mapping": {
                "query": "${data.query}",
                "context": "${data.context}",
                "response": "${data.response}",
            }
        }
    },
)

print(result["metrics"])
```
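The batch call reads one JSON object per line from test_data.jsonl. A minimal sketch of how such a file might look, assuming the query/context/response field names used in the column mapping (any consistent names work, as long as the mapping matches):

```python
import json

# One JSON object per line; field names must match the column_mapping.
rows = [
    {
        "query": "What is Azure AI?",
        "context": "Azure AI is Microsoft's AI platform...",
        "response": "Azure AI provides AI services and tools.",
    },
]

with open("test_data.jsonl", "w") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")
```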
## Composite Evaluators

```python
from azure.ai.evaluation import QAEvaluator, ContentSafetyEvaluator

# All quality metrics in one
qa_evaluator = QAEvaluator(model_config)

# All safety metrics in one
safety_evaluator = ContentSafetyEvaluator(azure_ai_project=project_scope)

result = evaluate(
    data="data.jsonl",
    evaluators={
        "qa": qa_evaluator,
        "content_safety": safety_evaluator,
    },
)
```
## Evaluate an Application Target

```python
from azure.ai.evaluation import evaluate
from my_app import chat_app  # Your application

result = evaluate(
    data="queries.jsonl",
    target=chat_app,  # Callable that takes a query and returns a response
    evaluators={
        "groundedness": groundedness,
    },
    evaluator_config={
        "default": {
            "column_mapping": {
                "query": "${data.query}",
                "context": "${outputs.context}",
                "response": "${outputs.response}",
            }
        }
    },
)
```
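When a target is supplied, evaluate() calls it once per data row with the mapped inputs as keyword arguments, and the keys of the returned dict become available as `${outputs.<key>}`. A minimal sketch of that callable shape; `chat_app` here is a hypothetical stand-in with placeholder retrieval and generation, not a real application:

```python
def chat_app(query: str) -> dict:
    # Placeholder retrieval step: a real app would query a search index.
    context = f"Retrieved documents for: {query}"
    # Placeholder generation step: a real app would call a model.
    response = f"Answer based on: {context}"
    # Keys of this dict map to ${outputs.context} and ${outputs.response}.
    return {"context": context, "response": response}
```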
## Custom Evaluators

### Code-Based

```python
from azure.ai.evaluation import evaluator

@evaluator
def word_count_evaluator(response: str) -> dict:
    return {"word_count": len(response.split())}

# Use in evaluate()
result = evaluate(
    data="data.jsonl",
    evaluators={"word_count": word_count_evaluator},
)
```
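Because a code-based evaluator is essentially a callable that returns a dict of metrics, it can be sanity-checked directly before a batch run. A minimal sketch, assuming a plain function (no decorator) also qualifies, since evaluate() accepts callables whose keyword parameters match the mapped columns:

```python
def word_count_evaluator(response: str) -> dict:
    # The returned dict keys become metric columns in the evaluate() output.
    return {"word_count": len(response.split())}

# Direct call for a quick sanity check before wiring it into evaluate()
result = word_count_evaluator(response="Azure AI provides AI services and tools.")
# result == {"word_count": 7}
```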
### Prompt-Based

```python
from azure.ai.evaluation import PromptChatTarget

class CustomEvaluator:
    def __init__(self, model_config):
        self.model = PromptChatTarget(model_config)

    def __call__(self, query: str, response: str) -> dict:
        prompt = f"Rate this response 1-5: Query: {query}, Response: {response}"
        result = self.model.send_prompt(prompt)
        return {"custom_score": int(result)}
```
## Log to a Foundry Project

```python
import os

from azure.ai.projects import AIProjectClient
from azure.identity import DefaultAzureCredential

project = AIProjectClient.from_connection_string(
    conn_str=os.environ["AIPROJECT_CONNECTION_STRING"],
    credential=DefaultAzureCredential(),
)

result = evaluate(
    data="data.jsonl",
    evaluators={"groundedness": groundedness},
    azure_ai_project=project.scope,  # Logs results to Foundry
)

print(f"View results: {result['studio_url']}")
```
## Evaluator Reference

| Evaluator | Type | Metrics |
|---|---|---|
| GroundednessEvaluator | AI | groundedness (1-5) |
| RelevanceEvaluator | AI | relevance (1-5) |
| CoherenceEvaluator | AI | coherence (1-5) |
| FluencyEvaluator | AI | fluency (1-5) |
| SimilarityEvaluator | AI | similarity (1-5) |
| RetrievalEvaluator | AI | retrieval (1-5) |
| F1ScoreEvaluator | NLP | f1_score (0-1) |
| RougeScoreEvaluator | NLP | rouge scores |
| ViolenceEvaluator | Safety | violence (0-7) |
| SexualEvaluator | Safety | sexual (0-7) |
| SelfHarmEvaluator | Safety | self_harm (0-7) |
| HateUnfairnessEvaluator | Safety | hate_unfairness (0-7) |
| QAEvaluator | Composite | All quality metrics |
| ContentSafetyEvaluator | Composite | All safety metrics |
## Best Practices

- Use composite evaluators for comprehensive assessment
- Map columns correctly: mismatched columns cause silent failures
- Log to Foundry for tracking and comparison across runs
- Create custom evaluators for domain-specific metrics
- Use NLP evaluators when you have ground-truth answers
- Safety evaluators require an Azure AI project scope
- Batch evaluation is more efficient than single-row loops
## Reference Files

| File | Contents |
|---|---|
| references/built-in-evaluators.md | Detailed patterns for AI-assisted, NLP-based, and safety evaluators, with configuration tables |
| references/custom-evaluators.md | Creating code-based and prompt-based custom evaluators; testing patterns |
| scripts/run_batch_evaluation.py | CLI tool for running batch evaluations with quality, safety, and custom evaluators |