# Azure AI Voice Live SDK - Models Reference

## Table of Contents
- [Enums](#enums)
- [Client Events](#client-events)
- [Server Events](#server-events)
- [Session Models](#session-models)
- [Conversation Items](#conversation-items)
- [Content Parts](#content-parts)
- [Tools](#tools)
- [Voice Models](#voice-models)
- [Turn Detection](#turn-detection)
- [Response Models](#response-models)
- [Avatar Models](#avatar-models)

---

## Enums

### Modality
```python
class Modality(str, Enum):
    TEXT = "text"
    AUDIO = "audio"
    ANIMATION = "animation"
    AVATAR = "avatar"
```
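
Because the enums in this section mix in `str`, their members compare equal to the raw wire strings and serialize as plain JSON strings. The following is a minimal sketch of that behavior; `Modality` is re-declared locally (an assumption made only so the snippet runs without the SDK installed).

```python
import json
from enum import Enum


class Modality(str, Enum):
    # Local re-declaration for illustration; the SDK ships its own class.
    TEXT = "text"
    AUDIO = "audio"
    ANIMATION = "animation"
    AVATAR = "avatar"


# Members compare equal to the raw string used on the wire.
is_same = Modality.AUDIO == "audio"

# Look up a member from a raw string received in a payload.
parsed = Modality("text")

# str-based members serialize as plain JSON strings.
encoded = json.dumps({"modalities": [Modality.TEXT, Modality.AUDIO]})
```

An unknown string such as `Modality("video")` raises `ValueError`, so it may be worth guarding lookups on values that come from the service.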

### OpenAIVoiceName
```python
class OpenAIVoiceName(str, Enum):
    ALLOY = "alloy"
    ASH = "ash"
    BALLAD = "ballad"
    CORAL = "coral"
    ECHO = "echo"
    SAGE = "sage"
    SHIMMER = "shimmer"
    VERSE = "verse"
    MARIN = "marin"
    CEDAR = "cedar"
```

### InputAudioFormat
```python
class InputAudioFormat(str, Enum):
    PCM16 = "pcm16"           # 24kHz default
    G711_ULAW = "g711_ulaw"   # 8kHz
    G711_ALAW = "g711_alaw"   # 8kHz
```
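
The sample rates in the comments above determine how many bytes a chunk of a given duration occupies. The sketch below assumes mono audio; the bytes-per-sample values follow from the codecs themselves (16-bit linear PCM, 8-bit G.711). `FORMAT_PARAMS` and `chunk_size_bytes` are hypothetical helper names, not part of the SDK.

```python
# (sample_rate_hz, bytes_per_sample) per input format; mono is assumed.
FORMAT_PARAMS = {
    "pcm16": (24_000, 2),
    "g711_ulaw": (8_000, 1),
    "g711_alaw": (8_000, 1),
}


def chunk_size_bytes(audio_format: str, duration_ms: int) -> int:
    """Return the number of raw bytes covering duration_ms of mono audio."""
    sample_rate, bytes_per_sample = FORMAT_PARAMS[audio_format]
    return sample_rate * bytes_per_sample * duration_ms // 1000


pcm_20ms = chunk_size_bytes("pcm16", 20)
ulaw_20ms = chunk_size_bytes("g711_ulaw", 20)
```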

### OutputAudioFormat
```python
class OutputAudioFormat(str, Enum):
    PCM16 = "pcm16"               # 24kHz
    PCM16_8000_HZ = "pcm16-8000hz"
    PCM16_16000_HZ = "pcm16-16000hz"
    G711_ULAW = "g711_ulaw"       # 8kHz
    G711_ALAW = "g711_alaw"       # 8kHz
```

### TurnDetectionType
```python
class TurnDetectionType(str, Enum):
    SERVER_VAD = "server_vad"
    AZURE_SEMANTIC_VAD = "azure_semantic_vad"
    AZURE_SEMANTIC_VAD_EN = "azure_semantic_vad_en"
    AZURE_SEMANTIC_VAD_MULTILINGUAL = "azure_semantic_vad_multilingual"
```

### MessageRole
```python
class MessageRole(str, Enum):
    SYSTEM = "system"
    USER = "user"
    ASSISTANT = "assistant"
```

### ItemType
```python
class ItemType(str, Enum):
    MESSAGE = "message"
    FUNCTION_CALL = "function_call"
    FUNCTION_CALL_OUTPUT = "function_call_output"
    MCP_LIST_TOOLS = "mcp_list_tools"
    MCP_CALL = "mcp_call"
    MCP_APPROVAL_REQUEST = "mcp_approval_request"
    MCP_APPROVAL_RESPONSE = "mcp_approval_response"
```

### ContentPartType
```python
class ContentPartType(str, Enum):
    INPUT_TEXT = "input_text"
    INPUT_AUDIO = "input_audio"
    INPUT_IMAGE = "input_image"
    TEXT = "text"
    AUDIO = "audio"
```

### ToolType
```python
class ToolType(str, Enum):
    FUNCTION = "function"
    MCP = "mcp"
```

### ToolChoiceLiteral
```python
class ToolChoiceLiteral(str, Enum):
    AUTO = "auto"
    NONE = "none"
    REQUIRED = "required"
```

### ResponseStatus
```python
class ResponseStatus(str, Enum):
    COMPLETED = "completed"
    CANCELLED = "cancelled"
    FAILED = "failed"
    INCOMPLETE = "incomplete"
    IN_PROGRESS = "in_progress"
```

### ClientEventType
```python
class ClientEventType(str, Enum):
    SESSION_UPDATE = "session.update"
    INPUT_AUDIO_BUFFER_APPEND = "input_audio_buffer.append"
    INPUT_AUDIO_BUFFER_COMMIT = "input_audio_buffer.commit"
    INPUT_AUDIO_BUFFER_CLEAR = "input_audio_buffer.clear"
    INPUT_AUDIO_TURN_START = "input_audio.turn.start"
    INPUT_AUDIO_TURN_APPEND = "input_audio.turn.append"
    INPUT_AUDIO_TURN_END = "input_audio.turn.end"
    INPUT_AUDIO_TURN_CANCEL = "input_audio.turn.cancel"
    INPUT_AUDIO_CLEAR = "input_audio.clear"
    CONVERSATION_ITEM_CREATE = "conversation.item.create"
    CONVERSATION_ITEM_RETRIEVE = "conversation.item.retrieve"
    CONVERSATION_ITEM_TRUNCATE = "conversation.item.truncate"
    CONVERSATION_ITEM_DELETE = "conversation.item.delete"
    RESPONSE_CREATE = "response.create"
    RESPONSE_CANCEL = "response.cancel"
    SESSION_AVATAR_CONNECT = "session.avatar.connect"
    MCP_APPROVAL_RESPONSE = "mcp_approval_response"
```

### ServerEventType
```python
class ServerEventType(str, Enum):
    ERROR = "error"
    SESSION_AVATAR_CONNECTING = "session.avatar.connecting"
    SESSION_CREATED = "session.created"
    SESSION_UPDATED = "session.updated"
    CONVERSATION_ITEM_INPUT_AUDIO_TRANSCRIPTION_COMPLETED = "conversation.item.input_audio_transcription.completed"
    CONVERSATION_ITEM_INPUT_AUDIO_TRANSCRIPTION_DELTA = "conversation.item.input_audio_transcription.delta"
    CONVERSATION_ITEM_INPUT_AUDIO_TRANSCRIPTION_FAILED = "conversation.item.input_audio_transcription.failed"
    CONVERSATION_ITEM_CREATED = "conversation.item.created"
    CONVERSATION_ITEM_RETRIEVED = "conversation.item.retrieved"
    CONVERSATION_ITEM_TRUNCATED = "conversation.item.truncated"
    CONVERSATION_ITEM_DELETED = "conversation.item.deleted"
    INPUT_AUDIO_BUFFER_COMMITTED = "input_audio_buffer.committed"
    INPUT_AUDIO_BUFFER_CLEARED = "input_audio_buffer.cleared"
    INPUT_AUDIO_BUFFER_SPEECH_STARTED = "input_audio_buffer.speech_started"
    INPUT_AUDIO_BUFFER_SPEECH_STOPPED = "input_audio_buffer.speech_stopped"
    RESPONSE_CREATED = "response.created"
    RESPONSE_DONE = "response.done"
    RESPONSE_OUTPUT_ITEM_ADDED = "response.output_item.added"
    RESPONSE_OUTPUT_ITEM_DONE = "response.output_item.done"
    RESPONSE_CONTENT_PART_ADDED = "response.content_part.added"
    RESPONSE_CONTENT_PART_DONE = "response.content_part.done"
    RESPONSE_TEXT_DELTA = "response.text.delta"
    RESPONSE_TEXT_DONE = "response.text.done"
    RESPONSE_AUDIO_TRANSCRIPT_DELTA = "response.audio_transcript.delta"
    RESPONSE_AUDIO_TRANSCRIPT_DONE = "response.audio_transcript.done"
    RESPONSE_AUDIO_DELTA = "response.audio.delta"
    RESPONSE_AUDIO_DONE = "response.audio.done"
    RESPONSE_FUNCTION_CALL_ARGUMENTS_DELTA = "response.function_call_arguments.delta"
    RESPONSE_FUNCTION_CALL_ARGUMENTS_DONE = "response.function_call_arguments.done"
    # MCP events
    MCP_LIST_TOOLS_IN_PROGRESS = "mcp_list_tools.in_progress"
    MCP_LIST_TOOLS_COMPLETED = "mcp_list_tools.completed"
    MCP_LIST_TOOLS_FAILED = "mcp_list_tools.failed"
    RESPONSE_MCP_CALL_ARGUMENTS_DELTA = "response.mcp_call_arguments.delta"
    RESPONSE_MCP_CALL_ARGUMENTS_DONE = "response.mcp_call_arguments.done"
    RESPONSE_MCP_CALL_IN_PROGRESS = "response.mcp_call.in_progress"
    RESPONSE_MCP_CALL_COMPLETED = "response.mcp_call.completed"
    RESPONSE_MCP_CALL_FAILED = "response.mcp_call.failed"
    # Animation events
    RESPONSE_ANIMATION_BLENDSHAPES_DELTA = "response.animation_blendshapes.delta"
    RESPONSE_ANIMATION_BLENDSHAPES_DONE = "response.animation_blendshapes.done"
    RESPONSE_ANIMATION_VISEME_DELTA = "response.animation_viseme.delta"
    RESPONSE_ANIMATION_VISEME_DONE = "response.animation_viseme.done"
    RESPONSE_AUDIO_TIMESTAMP_DELTA = "response.audio_timestamp.delta"
    RESPONSE_AUDIO_TIMESTAMP_DONE = "response.audio_timestamp.done"
```
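
One common way to consume these event types is a dispatch table keyed by the `type` string. The sketch below assumes each server event arrives as a JSON object whose `type` field matches one of the values above; the handler names and the `handled` list are hypothetical.

```python
import json
from typing import Callable, Dict, List

handled: List[str] = []


def on_session_created(event: dict) -> None:
    handled.append(f"session {event['session']['id']} created")


def on_error(event: dict) -> None:
    handled.append(f"error: {event['error']['message']}")


# Map wire-level type strings to handlers; extend as needed.
HANDLERS: Dict[str, Callable[[dict], None]] = {
    "session.created": on_session_created,
    "error": on_error,
}


def dispatch(raw_message: str) -> bool:
    """Route one JSON-encoded server event; return False if no handler exists."""
    event = json.loads(raw_message)
    handler = HANDLERS.get(event.get("type"))
    if handler is None:
        return False
    handler(event)
    return True
```

Returning `False` for unrecognized types, rather than raising, keeps a client tolerant of event types added in later service versions.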

---

## Client Events

### ClientEventSessionUpdate
```python
class ClientEventSessionUpdate(Model):
    type: Literal["session.update"]
    event_id: Optional[str]
    session: RequestSession
```

### ClientEventInputAudioBufferAppend
```python
class ClientEventInputAudioBufferAppend(Model):
    type: Literal["input_audio_buffer.append"]
    event_id: Optional[str]
    audio: str  # Base64-encoded audio
```
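
Since `audio` is a Base64 string, raw audio bytes need encoding before they are placed in the event. A minimal sketch using plain dictionaries in place of the SDK model; `build_append_event` is a hypothetical helper name.

```python
import base64
from typing import Optional


def build_append_event(audio_chunk: bytes, event_id: Optional[str] = None) -> dict:
    """Wrap raw audio bytes in an input_audio_buffer.append payload."""
    event = {
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(audio_chunk).decode("ascii"),
    }
    # event_id is optional, so it is only included when supplied.
    if event_id is not None:
        event["event_id"] = event_id
    return event


# 480 zero-valued 16-bit samples, i.e. 20 ms of silence at 24 kHz.
silence = b"\x00\x00" * 480
event = build_append_event(silence)
```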

### ClientEventInputAudioBufferCommit
```python
class ClientEventInputAudioBufferCommit(Model):
    type: Literal["input_audio_buffer.commit"]
    event_id: Optional[str]
```

### ClientEventInputAudioBufferClear
```python
class ClientEventInputAudioBufferClear(Model):
    type: Literal["input_audio_buffer.clear"]
    event_id: Optional[str]
```

### ClientEventConversationItemCreate
```python
class ClientEventConversationItemCreate(Model):
    type: Literal["conversation.item.create"]
    event_id: Optional[str]
    previous_item_id: Optional[str]
    item: ConversationRequestItem
```

### ClientEventConversationItemDelete
```python
class ClientEventConversationItemDelete(Model):
    type: Literal["conversation.item.delete"]
    event_id: Optional[str]
    item_id: str
```

### ClientEventConversationItemTruncate
```python
class ClientEventConversationItemTruncate(Model):
    type: Literal["conversation.item.truncate"]
    event_id: Optional[str]
    item_id: str
    content_index: int
    audio_end_ms: int
```

### ClientEventResponseCreate
```python
class ClientEventResponseCreate(Model):
    type: Literal["response.create"]
    event_id: Optional[str]
    response: Optional[ResponseCreateParams]
    additional_instructions: Optional[str]
```

### ClientEventResponseCancel
```python
class ClientEventResponseCancel(Model):
    type: Literal["response.cancel"]
    event_id: Optional[str]
    response_id: Optional[str]
```

---

## Server Events

### ServerEventSessionCreated
```python
class ServerEventSessionCreated(Model):
    type: Literal["session.created"]
    event_id: str
    session: ResponseSession
```

### ServerEventSessionUpdated
```python
class ServerEventSessionUpdated(Model):
    type: Literal["session.updated"]
    event_id: str
    session: ResponseSession
```

### ServerEventError
```python
class ServerEventError(Model):
    type: Literal["error"]
    event_id: str
    error: ServerEventErrorDetails

class ServerEventErrorDetails(Model):
    type: str
    code: Optional[str]
    message: str
    param: Optional[str]
```

### ServerEventInputAudioBufferSpeechStarted
```python
class ServerEventInputAudioBufferSpeechStarted(Model):
    type: Literal["input_audio_buffer.speech_started"]
    event_id: str
    audio_start_ms: int
    item_id: str
```

### ServerEventInputAudioBufferSpeechStopped
```python
class ServerEventInputAudioBufferSpeechStopped(Model):
    type: Literal["input_audio_buffer.speech_stopped"]
    event_id: str
    audio_end_ms: int
    item_id: str
```

### ServerEventConversationItemInputAudioTranscriptionCompleted
```python
class ServerEventConversationItemInputAudioTranscriptionCompleted(Model):
    type: Literal["conversation.item.input_audio_transcription.completed"]
    event_id: str
    item_id: str
    content_index: int
    transcript: str
```

### ServerEventConversationItemInputAudioTranscriptionDelta
```python
class ServerEventConversationItemInputAudioTranscriptionDelta(Model):
    type: Literal["conversation.item.input_audio_transcription.delta"]
    event_id: str
    item_id: str
    content_index: int
    delta: str
```

### ServerEventResponseCreated
```python
class ServerEventResponseCreated(Model):
    type: Literal["response.created"]
    event_id: str
    response: Response
```

### ServerEventResponseDone
```python
class ServerEventResponseDone(Model):
    type: Literal["response.done"]
    event_id: str
    response: Response
```

### ServerEventResponseAudioDelta
```python
class ServerEventResponseAudioDelta(Model):
    type: Literal["response.audio.delta"]
    event_id: str
    response_id: str
    item_id: str
    output_index: int
    content_index: int
    delta: str  # Base64-encoded audio
```
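
Each `delta` carries one Base64-encoded slice of the output audio, so a client typically decodes and concatenates the slices per item. A minimal sketch, assuming events have already been parsed into dictionaries; `audio_by_item` and `on_audio_delta` are hypothetical names.

```python
import base64
from collections import defaultdict
from typing import DefaultDict

# Decoded audio accumulated per output item.
audio_by_item: DefaultDict[str, bytearray] = defaultdict(bytearray)


def on_audio_delta(event: dict) -> None:
    """Decode one response.audio.delta event and append it to its item's buffer."""
    audio_by_item[event["item_id"]].extend(base64.b64decode(event["delta"]))


for chunk in (b"\x01\x02", b"\x03\x04"):
    on_audio_delta({
        "type": "response.audio.delta",
        "item_id": "item_1",
        "delta": base64.b64encode(chunk).decode("ascii"),
    })
```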

### ServerEventResponseAudioTranscriptDelta
```python
class ServerEventResponseAudioTranscriptDelta(Model):
    type: Literal["response.audio_transcript.delta"]
    event_id: str
    response_id: str
    item_id: str
    output_index: int
    content_index: int
    delta: str
```

### ServerEventResponseAudioTranscriptDone
```python
class ServerEventResponseAudioTranscriptDone(Model):
    type: Literal["response.audio_transcript.done"]
    event_id: str
    response_id: str
    item_id: str
    output_index: int
    content_index: int
    transcript: str
```
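
The delta and done events above can be combined to show a live partial transcript and then replace it with the final text. The sketch below assumes events are already parsed into dictionaries; `TranscriptAssembler` is a hypothetical helper, not an SDK class.

```python
from typing import Dict, List


class TranscriptAssembler:
    """Collect transcript deltas per item until the matching done event arrives."""

    def __init__(self) -> None:
        self._parts: Dict[str, List[str]] = {}
        self.final: Dict[str, str] = {}

    def handle(self, event: dict) -> None:
        item_id = event["item_id"]
        if event["type"] == "response.audio_transcript.delta":
            self._parts.setdefault(item_id, []).append(event["delta"])
        elif event["type"] == "response.audio_transcript.done":
            # The done event carries the full transcript, so partials are dropped.
            self._parts.pop(item_id, None)
            self.final[item_id] = event["transcript"]

    def partial(self, item_id: str) -> str:
        """Return the text received so far for an item that is still streaming."""
        return "".join(self._parts.get(item_id, []))
```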

### ServerEventResponseFunctionCallArgumentsDelta
```python
class ServerEventResponseFunctionCallArgumentsDelta(Model):
    type: Literal["response.function_call_arguments.delta"]
    event_id: str
    response_id: str
    item_id: str
    output_index: int
    call_id: str
    delta: str
```

### ServerEventResponseFunctionCallArgumentsDone
```python
class ServerEventResponseFunctionCallArgumentsDone(Model):
    type: Literal["response.function_call_arguments.done"]
    event_id: str
    response_id: str
    item_id: str
    output_index: int
    call_id: str
    name: str
    arguments: str
```
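
When the done event arrives, `name` identifies the function and `arguments` holds its parameters as a string. The sketch below assumes `arguments` is a JSON-encoded object (as in the OpenAI Realtime API); `get_weather` and `LOCAL_FUNCTIONS` are hypothetical stand-ins for application code.

```python
import json


def get_weather(city: str) -> dict:
    # Hypothetical local function; a real one would call a weather service.
    return {"city": city, "forecast": "sunny"}


LOCAL_FUNCTIONS = {"get_weather": get_weather}


def run_function_call(event: dict) -> str:
    """Execute the function named in the event and return its result as JSON."""
    func = LOCAL_FUNCTIONS.get(event["name"])
    if func is None:
        return json.dumps({"error": f"unknown function: {event['name']}"})
    try:
        # An empty arguments string is treated as "no arguments".
        kwargs = json.loads(event["arguments"] or "{}")
    except json.JSONDecodeError:
        return json.dumps({"error": "arguments were not valid JSON"})
    return json.dumps(func(**kwargs))
```

Returning errors as JSON strings, instead of raising, lets the result be passed back to the model in the same way as a successful output.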

---

## Session Models

### RequestSession
```python
class RequestSession(Model):
    instructions: Optional[str]
    modalities: Optional[List[Modality]]
    voice: Optional[Voice]  # str, OpenAIVoiceName, OpenAIVoice, or AzureVoice
    input_audio_format: Optional[InputAudioFormat]
    output_audio_format: Optional[OutputAudioFormat]
    turn_detection: Optional[TurnDetection]
    tools: Optional[List[Tool]]
    tool_choice: Optional[ToolChoice]
    temperature: Optional[float]
    max_response_output_tokens: Optional[Union[int, Literal["inf"]]]
    input_audio_transcription: Optional[AudioInputTranscriptionOptions]
```
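
Every `RequestSession` field is optional, so a `session.update` only needs the fields being changed. The sketch below builds the payload from plain dictionaries and assumes the JSON keys mirror the field names above; `build_session_update` is a hypothetical helper, and the instruction text and threshold value are illustrative only.

```python
import json
from typing import Any, Dict


def build_session_update(**session_fields: Any) -> Dict[str, Any]:
    """Build a session.update event, omitting fields left as None."""
    session = {key: value for key, value in session_fields.items() if value is not None}
    return {"type": "session.update", "session": session}


event = build_session_update(
    instructions="You are a concise voice assistant.",
    modalities=["text", "audio"],
    voice={"type": "azure-standard", "name": "en-US-JennyNeural"},
    input_audio_format="pcm16",
    output_audio_format="pcm16",
    turn_detection={"type": "server_vad", "threshold": 0.5},
    temperature=None,  # left unset, so it is dropped from the payload
    max_response_output_tokens="inf",
    input_audio_transcription={"model": "whisper-1"},
)
payload = json.dumps(event)
```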

### ResponseSession
```python
class ResponseSession(Model):
    id: str
    object: str
    model: str
    expires_at: int
    modalities: List[Modality]
    instructions: Optional[str]
    voice: Optional[Voice]
    input_audio_format: InputAudioFormat
    output_audio_format: OutputAudioFormat
    turn_detection: Optional[TurnDetection]
    tools: List[Tool]
    tool_choice: ToolChoice
    temperature: float
    max_response_output_tokens: Optional[int]
```

### AudioInputTranscriptionOptions
```python
class AudioInputTranscriptionOptions(Model):
    model: str  # e.g., "whisper-1"
```

---

## Conversation Items

### ConversationRequestItem (Union Type)
```python
# Can be one of:
ConversationRequestItem = Union[
    SystemMessageItem,
    UserMessageItem,
    AssistantMessageItem,
    FunctionCallItem,
    FunctionCallOutputItem,
]
```

### MessageItem Base
```python
class MessageItem(Model):
    type: Literal["message"]
    id: Optional[str]
    role: MessageRole
    content: List[ContentPart]
    status: Optional[ItemParamStatus]
```

### FunctionCallItem
```python
class FunctionCallItem(Model):
    type: Literal["function_call"]
    id: Optional[str]
    call_id: str
    name: str
    arguments: str
    status: Optional[ItemParamStatus]
```

### FunctionCallOutputItem
```python
class FunctionCallOutputItem(Model):
    type: Literal["function_call_output"]
    id: Optional[str]
    call_id: str
    output: str
```
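
A function result is returned to the conversation as a `function_call_output` item whose `call_id` matches the originating call. The sketch below uses plain dictionaries; following the item with a `response.create` event is an assumed pattern borrowed from realtime APIs in general, and `build_function_output_events` is a hypothetical helper name.

```python
from typing import List


def build_function_output_events(call_id: str, output: str) -> List[dict]:
    """Return the events that submit a function result and request a new response."""
    return [
        {
            "type": "conversation.item.create",
            "item": {
                "type": "function_call_output",
                "call_id": call_id,  # must match the call being answered
                "output": output,    # string payload, typically JSON text
            },
        },
        # Assumption: the model is asked to continue once the output is added.
        {"type": "response.create"},
    ]


events = build_function_output_events("call_123", '{"forecast": "sunny"}')
```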

---

## Content Parts

### InputTextContentPart
```python
class InputTextContentPart(Model):
    type: Literal["input_text"]
    text: str
```

### InputAudioContentPart
```python
class InputAudioContentPart(Model):
    type: Literal["input_audio"]
    audio: str  # Base64
    transcript: Optional[str]
```

### RequestTextContentPart
```python
class RequestTextContentPart(Model):
    type: Literal["text"]
    text: str
```

### RequestAudioContentPart
```python
class RequestAudioContentPart(Model):
    type: Literal["audio"]
    audio: str  # Base64
    transcript: Optional[str]
```

### RequestImageContentPart
```python
class RequestImageContentPart(Model):
    type: Literal["input_image"]
    url: Optional[str]
    base64: Optional[str]
    detail: Optional[RequestImageContentPartDetail]  # "auto", "low", "high"
```

---

## Tools

### FunctionTool
```python
class FunctionTool(Model):
    type: Literal["function"]
    name: str
    description: Optional[str]
    parameters: Optional[dict]  # JSON Schema
```
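
The `parameters` field takes a JSON Schema object describing the function's arguments. Below is a sketch of one such definition as a plain dictionary, with a small check for missing required arguments; the `get_weather` tool and the `missing_required` helper are hypothetical.

```python
from typing import List

weather_tool = {
    "type": "function",
    "name": "get_weather",
    "description": "Look up the current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {
            "city": {"type": "string", "description": "City name, e.g. Oslo"},
            "units": {"type": "string", "enum": ["metric", "imperial"]},
        },
        "required": ["city"],
    },
}


def missing_required(tool: dict, arguments: dict) -> List[str]:
    """List required parameter names that are absent from arguments."""
    # parameters is optional, so fall back to an empty schema.
    schema = tool.get("parameters") or {}
    return [name for name in schema.get("required", []) if name not in arguments]
```

This only checks for presence; full validation against the schema would need a JSON Schema validator.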

### MCPTool
```python
class MCPTool(Model):
    type: Literal["mcp"]
    server_label: str
    require_approval: Optional[MCPApprovalType]  # "never" or "always"
```

### MCPServer
```python
class MCPServer(Model):
    type: Literal["url"]
    url: str
    name: str
    tool_configuration: Optional[dict]
```

### ToolChoiceSelection
```python
class ToolChoiceSelection(Model):
    type: Literal["function"]
    name: str
```

---

## Voice Models

### OpenAIVoice
```python
class OpenAIVoice(Model):
    type: Literal["openai"]
    name: OpenAIVoiceName
```

### AzureStandardVoice
```python
class AzureStandardVoice(Model):
    type: Literal["azure-standard"]
    name: str  # e.g., "en-US-JennyNeural"
```

### AzureCustomVoice
```python
class AzureCustomVoice(Model):
    type: Literal["azure-custom"]
    endpoint_id: str
    name: str
```

### AzurePersonalVoice
```python
class AzurePersonalVoice(Model):
    type: Literal["azure-personal"]
    speaker_profile_id: str
    model: Optional[PersonalVoiceModels]
```

---

## Turn Detection

### ServerVad
```python
class ServerVad(Model):
    type: Literal["server_vad"]
    threshold: Optional[float]  # 0.0-1.0
    prefix_padding_ms: Optional[int]
    silence_duration_ms: Optional[int]
    create_response: Optional[bool]
```
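
A small builder can enforce the documented `threshold` range before the configuration is sent. The sketch below uses a plain dictionary and omits unset fields; `build_server_vad` is a hypothetical helper, and the values in the usage line are illustrative only.

```python
from typing import Optional


def build_server_vad(
    threshold: Optional[float] = None,
    prefix_padding_ms: Optional[int] = None,
    silence_duration_ms: Optional[int] = None,
    create_response: Optional[bool] = None,
) -> dict:
    """Build a server_vad turn-detection config, validating the threshold range."""
    if threshold is not None and not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0.0 and 1.0")
    optional_fields = {
        "threshold": threshold,
        "prefix_padding_ms": prefix_padding_ms,
        "silence_duration_ms": silence_duration_ms,
        "create_response": create_response,
    }
    config = {"type": "server_vad"}
    # Only include fields the caller actually set.
    config.update({k: v for k, v in optional_fields.items() if v is not None})
    return config


vad = build_server_vad(threshold=0.6, silence_duration_ms=400)
```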

### AzureSemanticVad
```python
class AzureSemanticVad(Model):
    type: Literal["azure_semantic_vad"]
    # Uses semantic understanding for better turn detection
```

### AzureSemanticVadEn
```python
class AzureSemanticVadEn(Model):
    type: Literal["azure_semantic_vad_en"]
    eou_detection: Optional[EouDetection]
```

### EouDetection
```python
class EouDetection(Model):
    threshold_level: Optional[EouThresholdLevel]  # "low", "medium", "high", "default"
```

---

## Response Models

### Response
```python
class Response(Model):
    id: str
    object: Literal["realtime.response"]
    status: ResponseStatus
    status_details: Optional[ResponseStatusDetails]
    output: List[ResponseItem]
    usage: Optional[TokenUsage]
```

### ResponseCreateParams
```python
class ResponseCreateParams(Model):
    modalities: Optional[List[Modality]]
    instructions: Optional[str]
    voice: Optional[Voice]
    output_audio_format: Optional[OutputAudioFormat]
    tools: Optional[List[Tool]]
    tool_choice: Optional[ToolChoice]
    temperature: Optional[float]
    max_response_output_tokens: Optional[Union[int, Literal["inf"]]]
    conversation: Optional[Literal["auto", "none"]]
    input: Optional[List[ConversationRequestItem]]
```

### TokenUsage
```python
class TokenUsage(Model):
    total_tokens: int
    input_tokens: int
    output_tokens: int
    input_token_details: Optional[InputTokenDetails]
    output_token_details: Optional[OutputTokenDetails]
```

---

## Avatar Models

### AvatarConfig
```python
class AvatarConfig(Model):
    type: AvatarConfigTypes  # "video-avatar" or "photo-avatar"
    character: str
    style: Optional[str]
    output_protocol: Optional[AvatarOutputProtocol]  # "webrtc" or "websocket"
    background: Optional[Background]
    video_params: Optional[VideoParams]
```

### IceServer
```python
class IceServer(Model):
    urls: List[str]
    username: Optional[str]
    credential: Optional[str]
```

### Background
```python
class Background(Model):
    color: Optional[str]  # Hex color
    image_url: Optional[str]
```

### VideoParams
```python
class VideoParams(Model):
    resolution: Optional[VideoResolution]
    crop: Optional[VideoCrop]
```