# Open Semantic Interchange (OSI) - Field Specification

## 1. Introduction

This document specifies every field of the Open Semantic Interchange (OSI) YAML configuration file. The file describes a semantic model, which defines the structure for domain-specific data queries and analysis across business contexts.

The YAML file serves as a metadata layer that enables AI-powered query generation and data interpretation for structured datasets.

## 2. Document Structure

This specification is organized hierarchically to mirror the YAML file structure:

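- **Top-Level Fields**: Global configuration fields
- **Semantic Model**: Theme-level definitions
- **Datasets**: Individual data source definitions
- **Fields**: Detailed field specifications within each dataset
- **Relationships**: Join definitions between tables
- **Numeric Format**: Optional display hints for numeric fields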

## 3. Top-Level Fields

### 3.1 `yaml-language-server`

**Data Type**: Comment directive

**Description**: Specifies the JSON schema path for YAML language server validation and IDE auto-completion support.

**Format**: `# yaml-language-server: $schema=<path-to-schema>`

**Constraints**: Must reference a valid schema file path relative to the YAML file location.
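
As an illustration, the directive is written as a YAML comment, typically near the top of the file. The schema file name `osi-schema.json` below is a hypothetical placeholder, not a path defined by this specification.

```yaml
# Hypothetical schema path, resolved relative to this YAML file.
# yaml-language-server: $schema=./osi-schema.json
version: "0.0.1"
```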

---

### 3.2 `version`

**Data Type**: String

**Description**: Defines the semantic model version number using the semantic versioning convention.

**Format**: `MAJOR.MINOR.PATCH`

**Constraints**: 
- Must follow semantic versioning format
- Each component must be a non-negative integer

**Default Value**: `0.0.1`
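
A minimal sketch of the field as it might appear in the file. Quoting the value is an assumption made here for safety rather than a requirement of this specification: an unquoted `0.0.1` is still read as a string, but quoting makes the intended type explicit.

```yaml
# Quoted so the value is always read as a string.
version: "0.0.1"
```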

---

### 3.3 `semantic_model`

**Data Type**: Object

**Description**: The semantic model definition. In the current generator implementation, this file contains exactly one semantic model object.

**Required Sub-fields**:
- `name`
- `description`
- `ai_context`
- `datasets`
- `relationships`
- `metrics`
- `terms`
- `rules`
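
The required sub-fields above can be sketched as the following skeleton. The theme name, description, and instruction text are hypothetical values chosen for illustration, and the dataset entries are left out.

```yaml
semantic_model:
  name: sales_analysis              # hypothetical theme name
  description: Sales orders and customers for regional reporting.
  ai_context:
    instructions: Use this model to answer questions about order volume and revenue.
  datasets: []                      # dataset entries omitted in this sketch
  relationships: []
  metrics: []                       # reserved, not populated by the current generator
  terms: []                         # reserved
  rules: []                         # reserved
```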

---

## 4. Semantic Model Object

### 4.1 `name`

**Data Type**: String

**Description**: The name of the theme that the semantic model covers.

**Constraints**: 
- Must be a non-empty string
- Should be unique within the system

---

### 4.2 `description`

**Data Type**: String

**Description**: A brief description explaining the semantic model's purpose and scope.

**Constraints**: 
- Must be a non-empty string
- Should clearly describe the theme's domain

---

### 4.3 `ai_context`

**Data Type**: Object

**Description**: Model-level AI context configuration. It holds the instructions that guide AI processing.

#### 4.3.1 `instructions`

**Data Type**: String

**Description**: AI processing instructions that guide the AI on how to utilize this semantic model for data queries and analysis.

**Constraints**: 
- Must be a non-empty string
- Should provide clear guidance on the model's usage
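
Longer instructions may be easier to maintain as a folded block scalar. The sketch below is one possible layout, not a format this specification mandates. The instruction wording and the names `sales_orders` and `order_date` are hypothetical, and the other required sub-fields of `semantic_model` are omitted.

```yaml
semantic_model:
  ai_context:
    # ">-" folds the lines into a single string without a trailing newline.
    instructions: >-
      Use this model to answer questions about sales orders.
      Prefer the sales_orders dataset for revenue totals and
      group by order_date unless the user names another time field.
```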

---

### 4.4 `datasets`

**Data Type**: Array of objects

**Description**: List of datasets, where each dataset corresponds to a database table or view.

**Required Sub-fields** (for each dataset):
- `name`
- `source`
- `description`
- `ai_context`
- `fields`

---

### 4.5 `relationships`

**Data Type**: Array of objects

**Description**: Relationship definitions derived from foreign keys (table-to-table joins).

**Default Value**: `[]`

**Required Sub-fields** (for each relationship):
- `name`
- `from_table`
- `to_table`
- `from_columns`
- `to_columns`

---

### 4.6 `metrics`

**Data Type**: Array of objects

**Description**: Reserved for metric definitions (not populated by the current generator, but present in the output schema).

**Default Value**: `[]`

---

### 4.7 `terms`

**Data Type**: Array of objects

**Description**: Reserved for terminology definitions (not populated by the current generator, but present in the output schema).

**Default Value**: `[]`

---

### 4.8 `rules`

**Data Type**: Array of objects

**Description**: Reserved for rule definitions (not populated by the current generator, but present in the output schema).

**Default Value**: `[]`

---

## 5. Dataset Object

### 5.1 `name`

**Data Type**: String

**Description**: Dataset identifier that should correspond to the database table name.

**Constraints**: 
- Must use snake_case naming convention
- Must be unique within the semantic model
- Should match the actual database table name

---

### 5.2 `source`

**Data Type**: String

**Description**: Complete data source path specifying the database and table location.

**Format**: `database_name.table_name`

**Constraints**: 
- Must follow the format `<database>.<table>`
- Both database and table names must be valid identifiers

---

### 5.3 `description`

**Data Type**: String

**Description**: Dataset description explaining the dataset's purpose or content.

**Constraints**: 
- Must be a non-empty string
- Should briefly describe what the dataset contains

---

### 5.4 `ai_context`

**Data Type**: Object

**Description**: AI context configuration for the dataset.

#### 5.4.1 `ai_name`

**Data Type**: String

**Description**: AI-recognized name for the dataset, used for natural language query processing.

**Constraints**: 
- Must be a non-empty string
- Should be consistent with the dataset name or description

---

#### 5.4.2 `weight`

**Data Type**: Integer

**Description**: Dataset weight for ranking/recall bias during matching. A higher value makes the dataset more likely to be selected.

**Default Value**: `3`

---

#### 5.4.3 `synonyms`

**Data Type**: Array of strings

**Description**: Alternative dataset names used for matching.

**Default Value**: `[]`

---

#### 5.4.4 `default_datetime_field`

**Data Type**: String

**Description**: The default time field name for the dataset, selected by the generator by scoring date/time columns. This replaces the need for a field-level `is_default` flag in the current YAML output.

**Default Value**: `""` (empty string)

**Constraints**:
- If provided, it should match the `name` of one of the dataset's `fields`

---

### 5.5 `fields`

**Data Type**: Array of objects

**Description**: List of field definitions for the dataset. Each field represents a column in the database table.
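
Putting the dataset sub-fields together, a single entry might look like the sketch below. The database, table, and column names and the synonyms are hypothetical.

```yaml
datasets:
  - name: sales_orders                      # hypothetical table name
    source: sales_db.sales_orders           # <database>.<table>
    description: One row per customer order.
    ai_context:
      ai_name: sales orders
      weight: 3                             # documented default
      synonyms: [orders, purchase orders]
      default_datetime_field: order_date    # matches a field name below
    fields:
      - name: order_date
        type: DATE
        description: Date the order was placed.
        ai_context:
          ai_name: Date the order was placed.
          property: dimension
```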

---

## 6. Field Object

### 6.1 `name`

**Data Type**: String

**Description**: Field identifier that must match the database column name.

**Constraints**: 
- Must use snake_case naming convention
- Must exactly match the database column name
- Must be unique within the dataset

---

### 6.2 `type`

**Data Type**: String

**Description**: Field data type that determines query methods and presentation format.

**Allowed Values**:

| Value | Description | Usage |
|-------|-------------|-------|
| `TEXT` | Text/string data | Names, codes, descriptions |
| `NUMBER` | Numeric data | Counts, measurements, amounts |
| `DATE` | Date data | Date-only values |
| `TIME` | Time/date-time data | Timestamps, datetime |
| `ID` | Identifier | UUID/GUID-like identifiers |
| `OTHER` | Other types | Fallback when type is not recognized |

**Constraints**: 
- Must be one of the allowed values
- Must correspond to the actual database column data type

---

### 6.3 `description`

**Data Type**: String

**Description**: Human-readable field description used for UI display and documentation.

**Constraints**: 
- Must be a non-empty string
- Should clearly describe the field's meaning
- Typically written in the local language for the target domain

---

### 6.4 `ai_context`

**Data Type**: Object

**Description**: Field-level AI context configuration containing metadata for AI processing.

**Common Sub-fields**:
- `ai_name`
- `property`
- `ai_type` (optional)
- `value_list` (optional, for enum)

#### 6.4.1 `ai_name`

**Data Type**: String

**Description**: AI-recognized field name, typically identical to the `description` field.

**Constraints**: 
- Must be a non-empty string
- Should match or be similar to the `description` value

---

#### 6.4.2 `property`

**Data Type**: String

**Description**: Field property type that indicates special field attributes.

**Allowed Values**:

| Value | Description | Usage Scenario |
|-------|-------------|----------------|
| `normal` | Normal field | Standard data fields |
| `dimension` | Dimension field | Used for grouping/filtering, e.g. date/time dimensions or enum dimensions |
| `detail_only` | Detail query only | Fields displayed only in detail queries, not used in aggregations |

**Default Value**: `normal`

**Constraints**: 
- Must be one of the allowed values

---

#### 6.4.3 `ai_type`

**Data Type**: String

**Description**: AI field type indicator for special processing requirements.

**Allowed Values**:

| Value | Description | Usage Scenario |
|-------|-------------|----------------|
| `date` | Date type | Reserved for date-specific processing (not always emitted by the current generator) |
| `enum` | Enumeration type | Fields with fixed allowable values (emitted when enum is detected) |

**Default Value**: `""` (empty string)

**Constraints**: 
- If non-empty, must be one of the allowed values
- If `enum`, then `value_list` must be provided

---

#### 6.4.4 `value_list`

**Data Type**: Array of scalars (strings and/or numbers)

**Description**: List of allowable values for enumeration-type fields.

**Usage**: 
- Required when `ai_type` is `enum`
- Should be omitted or empty for non-enumeration fields

**Constraints**: 
- Must be an array (can be empty)
- For enumeration fields, must contain all valid values (as extracted from samples / metadata)
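
To illustrate how `ai_type` and `value_list` interact, the sketch below shows one enumeration field and one ordinary field. The column names and enum values are hypothetical.

```yaml
fields:
  - name: order_status
    type: TEXT
    description: Order status
    ai_context:
      ai_name: Order status
      property: dimension
      ai_type: enum                               # value_list is required
      value_list: [pending, shipped, cancelled]
  - name: customer_note
    type: TEXT
    description: Free-text note
    ai_context:
      ai_name: Free-text note
      property: detail_only                       # shown only in detail queries
      ai_type: ""                                 # documented default, no value_list
```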

---

## 7. Relationship Object

Relationship entries live under `semantic_model.relationships`.

### 7.1 `name`

**Data Type**: String

**Description**: Human-readable relationship name. The generator uses the form `<from_table> to <to_table>`.

---

### 7.2 `from_table`

**Data Type**: String

**Description**: Source table for the relationship.

---

### 7.3 `to_table`

**Data Type**: String

**Description**: Target table for the relationship.

---

### 7.4 `from_columns`

**Data Type**: Array of strings

**Description**: Columns on `from_table` that participate in the join.

---

### 7.5 `to_columns`

**Data Type**: Array of strings

**Description**: Column list on `to_table` participating in the join. It should align positionally with `from_columns`.
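
The sketch below shows a single-column join and a composite join in which the two column lists align by position. All table and column names are hypothetical.

```yaml
relationships:
  - name: sales_orders to customers
    from_table: sales_orders
    to_table: customers
    from_columns: [customer_id]
    to_columns: [id]
  - name: order_items to sales_orders
    from_table: order_items
    to_table: sales_orders
    # order_region pairs with region, order_no pairs with order_no
    from_columns: [order_region, order_no]
    to_columns: [region, order_no]
```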

---

## 8. Numeric Format (`num_format`) Object

`num_format` is an optional object on a field, emitted when the generator detects unit/level/decimal hints.

### 8.1 `unit`

**Data Type**: String

**Description**: Display unit hint, e.g. 元 (yuan), 万 (ten thousand), or %.

---

### 8.2 `num_level`

**Data Type**: String

**Description**: Numeric magnitude/level hint used for formatting, e.g. 千 (thousand), 万 (ten thousand), or 亿 (hundred million).

---

### 8.3 `num_decimal`

**Data Type**: String or Integer

**Description**: Decimal precision hint for formatting.
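
A possible field entry with numeric formatting hints is sketched below. It assumes that `num_format` sits directly on the field object, as a sibling of `ai_context`. The text above says only that it is "an optional object on a field", so treat this placement as an assumption. The field name and values are hypothetical.

```yaml
fields:
  - name: total_amount
    type: NUMBER
    description: Order total
    ai_context:
      ai_name: Order total
      property: normal
    num_format:            # assumed placement: directly on the field
      unit: 元             # yuan
      num_level: 万        # ten thousand
      num_decimal: 2       # may also be given as a string
```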