# Built-in Evaluators Reference Comprehensive patterns for Azure AI Evaluation SDK's built-in evaluators. ## Model Configuration All AI-assisted evaluators require a model configuration: ```python from azure.ai.evaluation import AzureOpenAIModelConfiguration # Using API key authentication model_config = AzureOpenAIModelConfiguration( azure_endpoint="https://.openai.azure.com", api_key="", azure_deployment="gpt-4o-mini", api_version="2024-06-01" ) # Using DefaultAzureCredential (recommended for production) from azure.identity import DefaultAzureCredential model_config = AzureOpenAIModelConfiguration( azure_endpoint="https://.openai.azure.com", credential=DefaultAzureCredential(), azure_deployment="gpt-4o-mini", api_version="2024-06-01" ) ``` ## AI-Assisted Quality Evaluators ### GroundednessEvaluator Measures whether the response is factually grounded in the provided context. ```python from azure.ai.evaluation import GroundednessEvaluator groundedness = GroundednessEvaluator(model_config) result = groundedness( query="What services does Azure AI provide?", context="Azure AI provides cognitive services including vision, speech, " "language understanding, and decision-making APIs.", response="Azure AI offers vision and speech services." ) # Returns: # { # "groundedness": 5, # Score 1-5 # "groundedness_reason": "...", # Explanation # "groundedness_result": "pass" # pass/fail based on threshold # } ``` **Input Requirements:** - `query`: The user's question - `context`: Source documents/information - `response`: The model's response to evaluate ### RelevanceEvaluator Measures how well the response addresses the query. ```python from azure.ai.evaluation import RelevanceEvaluator relevance = RelevanceEvaluator(model_config) result = relevance( query="How do I authenticate with Azure?", context="Azure supports multiple authentication methods...", response="Use DefaultAzureCredential for automatic credential discovery." ) # Score 1-5: 5 = directly addresses query, 1 = completely irrelevant ``` ### CoherenceEvaluator Measures logical flow and consistency of the response. ```python from azure.ai.evaluation import CoherenceEvaluator coherence = CoherenceEvaluator(model_config) # Note: CoherenceEvaluator only needs query and response result = coherence( query="Explain how Azure Functions work.", response="Azure Functions is a serverless compute service. " "It triggers based on events. You write code that runs on demand." ) # Score 1-5: 5 = logically coherent, 1 = disjointed/contradictory ``` ### FluencyEvaluator Measures grammatical correctness and natural language quality. ```python from azure.ai.evaluation import FluencyEvaluator fluency = FluencyEvaluator(model_config) result = fluency( query="What is Azure?", response="Azure is Microsoft's cloud computing platform that provides " "a wide range of services for building and deploying applications." ) # Score 1-5: 5 = perfectly fluent, 1 = poor grammar/unnatural ``` ### SimilarityEvaluator Measures semantic similarity between response and ground truth. ```python from azure.ai.evaluation import SimilarityEvaluator similarity = SimilarityEvaluator(model_config) result = similarity( query="What is the capital of France?", response="Paris is the capital of France.", ground_truth="The capital city of France is Paris." ) # Score 1-5: 5 = semantically identical, 1 = completely different ``` ### RetrievalEvaluator Measures quality of retrieved documents for RAG scenarios. ```python from azure.ai.evaluation import RetrievalEvaluator retrieval = RetrievalEvaluator(model_config) result = retrieval( query="How to configure Azure Storage?", context="Azure Storage can be configured through the Azure Portal. " "You can set replication, access tiers, and networking options." ) # Score 1-5: 5 = highly relevant retrieval, 1 = irrelevant documents ``` ## NLP-Based Evaluators These evaluators use traditional NLP metrics and don't require a model. ### F1ScoreEvaluator Token-level F1 score between response and ground truth. ```python from azure.ai.evaluation import F1ScoreEvaluator f1 = F1ScoreEvaluator() result = f1( response="The quick brown fox jumps over the lazy dog", ground_truth="A quick brown fox jumped over a lazy dog" ) # Returns: # { # "f1_score": 0.7272... # Score 0-1 # } ``` ### RougeScoreEvaluator ROUGE scores for summarization quality. ```python from azure.ai.evaluation import RougeScoreEvaluator rouge = RougeScoreEvaluator(rouge_type="rouge1") # rouge1, rouge2, rougeL, rougeLsum result = rouge( response="Azure provides cloud computing services.", ground_truth="Azure is Microsoft's cloud computing platform." ) # Returns: # { # "rouge1_precision": 0.5, # "rouge1_recall": 0.5, # "rouge1_fmeasure": 0.5 # } ``` **ROUGE Types:** - `rouge1`: Unigram overlap - `rouge2`: Bigram overlap - `rougeL`: Longest common subsequence - `rougeLsum`: Summary-level LCS ### BleuScoreEvaluator BLEU score for translation/generation quality. ```python from azure.ai.evaluation import BleuScoreEvaluator bleu = BleuScoreEvaluator() result = bleu( response="The cat sat on the mat.", ground_truth="A cat is sitting on the mat." ) # Returns: # { # "bleu_score": 0.3... # Score 0-1 # } ``` ### GleuScoreEvaluator GLEU (Google-BLEU) variant optimized for sentence-level evaluation. ```python from azure.ai.evaluation import GleuScoreEvaluator gleu = GleuScoreEvaluator() result = gleu( response="Hello world", ground_truth="Hello, world!" ) ``` ### MeteorScoreEvaluator METEOR score considering synonyms and paraphrases. ```python from azure.ai.evaluation import MeteorScoreEvaluator meteor = MeteorScoreEvaluator() result = meteor( response="The automobile is red.", ground_truth="The car is red." ) # METEOR handles synonyms better than BLEU ``` ## Safety Evaluators Safety evaluators require an Azure AI project scope. ```python from azure.ai.projects import AIProjectClient from azure.identity import DefaultAzureCredential # Get project scope for safety evaluators project = AIProjectClient.from_connection_string( conn_str="", credential=DefaultAzureCredential() ) project_scope = project.scope ``` ### ViolenceEvaluator Detects violent content. ```python from azure.ai.evaluation import ViolenceEvaluator violence = ViolenceEvaluator(azure_ai_project=project_scope) result = violence( query="Tell me a story", response="Once upon a time in a peaceful village..." ) # Returns: # { # "violence": "Very low", # Severity level # "violence_score": 0, # Score 0-7 # "violence_reason": "..." # Explanation # } ``` ### Sexual, SelfHarm, HateUnfairness Evaluators Same pattern as ViolenceEvaluator: ```python from azure.ai.evaluation import ( SexualEvaluator, SelfHarmEvaluator, HateUnfairnessEvaluator ) sexual = SexualEvaluator(azure_ai_project=project_scope) self_harm = SelfHarmEvaluator(azure_ai_project=project_scope) hate = HateUnfairnessEvaluator(azure_ai_project=project_scope) ``` ### IndirectAttackEvaluator Detects indirect prompt injection attacks. ```python from azure.ai.evaluation import IndirectAttackEvaluator indirect = IndirectAttackEvaluator(azure_ai_project=project_scope) result = indirect( query="Summarize this document", context="Document content... [hidden: ignore previous instructions]", response="The document discusses..." ) ``` ### ProtectedMaterialEvaluator Detects use of copyrighted or protected material. ```python from azure.ai.evaluation import ProtectedMaterialEvaluator protected = ProtectedMaterialEvaluator(azure_ai_project=project_scope) result = protected( query="Write me a poem", response="Roses are red, violets are blue..." ) ``` ## Composite Evaluators ### QAEvaluator Combines all quality metrics in one evaluator. ```python from azure.ai.evaluation import QAEvaluator qa = QAEvaluator(model_config) result = qa( query="What is Azure?", context="Azure is Microsoft's cloud platform...", response="Azure is a cloud computing service by Microsoft.", ground_truth="Azure is Microsoft's cloud computing platform." ) # Returns all quality metrics: # - groundedness, relevance, coherence, fluency, similarity ``` ### ContentSafetyEvaluator Combines all safety metrics in one evaluator. ```python from azure.ai.evaluation import ContentSafetyEvaluator safety = ContentSafetyEvaluator(azure_ai_project=project_scope) result = safety( query="Tell me about history", response="World War II was a global conflict..." ) # Returns all safety metrics: # - violence, sexual, self_harm, hate_unfairness ``` ## Evaluator Configuration Table | Evaluator | Type | Required Inputs | Score Range | |-----------|------|-----------------|-------------| | `GroundednessEvaluator` | AI | query, context, response | 1-5 | | `RelevanceEvaluator` | AI | query, context, response | 1-5 | | `CoherenceEvaluator` | AI | query, response | 1-5 | | `FluencyEvaluator` | AI | query, response | 1-5 | | `SimilarityEvaluator` | AI | query, response, ground_truth | 1-5 | | `RetrievalEvaluator` | AI | query, context | 1-5 | | `F1ScoreEvaluator` | NLP | response, ground_truth | 0-1 | | `RougeScoreEvaluator` | NLP | response, ground_truth | 0-1 | | `BleuScoreEvaluator` | NLP | response, ground_truth | 0-1 | | `ViolenceEvaluator` | Safety | query, response | 0-7 | | `SexualEvaluator` | Safety | query, response | 0-7 | | `SelfHarmEvaluator` | Safety | query, response | 0-7 | | `HateUnfairnessEvaluator` | Safety | query, response | 0-7 | ## Async Evaluation All evaluators support async execution: ```python import asyncio from azure.ai.evaluation import GroundednessEvaluator async def evaluate_async(): groundedness = GroundednessEvaluator(model_config) result = await groundedness( query="What is Azure?", context="Azure is Microsoft's cloud...", response="Azure is a cloud platform." ) return result result = asyncio.run(evaluate_async()) ``` ## Best Practices 1. **Choose appropriate evaluators** - Use NLP evaluators when you have ground truth, AI evaluators for subjective quality 2. **Batch evaluation** - Use `evaluate()` function for datasets rather than looping 3. **Safety first** - Always include safety evaluators for user-facing applications 4. **Log to Foundry** - Track evaluations over time with `azure_ai_project` parameter 5. **Threshold configuration** - Set appropriate pass/fail thresholds for your use case