# Chai-1 Multi-Modal Molecular Structure Prediction

## Overview

Chai-1 is a multi-modal foundation model for molecular structure prediction. It performs at state-of-the-art levels across various benchmarks and enables unified prediction of proteins, small molecules, DNA, RNA, glycosylations using diffusion-based modeling.

### When to Use Chai
- Multi-modal structure prediction (proteins + ligands + nucleic acids + glycans)
- Protein-ligand complex structures
- Protein-RNA/DNA complexes
- Glycoprotein structures
- Modified residue handling

### When NOT to Use
- Binding affinity needed → Use Boltz-2 (supports affinity)
- Simple protein only → Consider SimpleFold (faster)
- Just need sequence design → Use ProteinMPNN

## Decision Tree

```
What do you need to predict?
│
├─ Protein + small molecule?
│   └─ submit_chai_prediction
│       → Format: >ligand|name=X followed by SMILES
│
├─ Protein + RNA/DNA?
│   └─ submit_chai_prediction
│       → Format: >rna|name=X or >dna|name=X
│
├─ Protein + glycan?
│   └─ submit_chai_prediction
│       → Format: >glycan|name=X
│
└─ Protein with modified residues?
    └─ submit_chai_prediction
        → Format: Sequence with (MOD) notation
```

## Input Format

### FASTA with Entity Type Headers

```
>entity_type|name=entity_name
SEQUENCE
```

### Supported Entity Types

**1. Protein Sequences**
```
>protein|name=example-protein
AGSHSMRYFSTSVSRPGRGEPRFIAVGYVDDTQFVRFDSDAASPRGEPRAPWVEQEGPEYWDRETQK
```

**2. Ligands (SMILES)**
```
>ligand|name=aspirin
CC(=O)OC1=CC=CC=C1C(=O)O
```

**3. RNA Sequences**
```
>rna|name=example-rna
AUGGCCAUUGUAAUGGGCCGC
```

**4. DNA Sequences**
```
>dna|name=example-dna
ATGGCCATTGTAATGGGCCGC
```

**5. Glycans**
```
>glycan|name=example-glycan
NAG(4-1 NAG)
```

**6. Modified Residues**
```
>protein|name=modified-protein
RKDES(SEP)EES
```

### Multi-Entity Complex Example
```
>protein|name=receptor
MQIFVKTLTGKTITLEVEPSDTIENVKAKIQDKEGIPPDQQRLIFAGKQLEDGRTLSDYNIQKESTLHLVLRLRGG
>ligand|name=small-molecule
CCCCCCCCCCCCCC(=O)O
>glycan|name=sugar
NAG(4-1 NAG)
```

## Parameters

| Parameter | Type | Default | Range | Description |
|-----------|------|---------|-------|-------------|
| `num_trunk_recycles` | int | 3 | 1-10 | Trunk model recycles |
| `num_diffn_timesteps` | int | 200 | 50-500 | Diffusion steps |
| `num_diffn_samples` | int | 5 | 1-10 | Structures to generate |
| `use_esm_embeddings` | bool | True | - | ESM embeddings for accuracy |
| `seed` | int | None | 0-999999 | Random seed |

## Quality Thresholds

### Confidence Metrics

| Metric | Range | Good | Interpret |
|--------|-------|------|-----------|
| pLDDT | 0-100 | > 70 | Per-residue confidence |
| PAE | Variable | Lower better | Aligned error |
| PDE | Variable | Lower better | Distance error |

### Interpreting Results

- **pLDDT > 70**: Generally good structure
- **pLDDT 50-70**: Low confidence, use with caution
- **pLDDT < 50**: Likely disordered or unreliable

## Output Files

### Directory Structure
```
output_folder/
├── pred_0.cif          # Best predicted structure
├── pred_1.cif          # Second best structure
├── ...
├── scores.model_idx_0.npz   # Confidence metrics
├── scores.model_idx_1.npz
├── ranking_data.json        # Model rankings
└── msa_coverage.pdf         # If MSAs used
```

### Using ranking_data.json

Rank structures by quality:
1. Load `ranking_data.json`
2. Sort by aggregate score
3. Use top-ranked structure for analysis

## Performance Settings

| Goal | Parameters | Notes |
|------|------------|-------|
| Fast | recycles: 3, timesteps: 200, samples: 1-3, ESM: false | Quick screening |
| Balanced | defaults | Recommended |
| High accuracy | recycles: 5-7, timesteps: 300-400, samples: 5-10, ESM: true | Critical predictions |

## Rate Limits

- **Per minute**: 2 jobs
- **Per day**: 10 jobs
- **File size**: 50MB
- **Runtime**: 5-30 minutes depending on complexity

## API Usage

### Submit Prediction
```bash
curl -X POST "https://api.openbio.tech/api/v1/tools" \
  -H "X-API-Key: $OPENBIO_API_KEY" \
  -F "tool_name=submit_chai_prediction" \
  -F 'params={
    "input_file_path": "inputs/complex.fasta",
    "num_trunk_recycles": 5,
    "num_diffn_timesteps": 300,
    "num_diffn_samples": 5,
    "use_esm_embeddings": true,
    "seed": 42
  }'
```

### Poll Job Status
```bash
curl -X GET "https://api.openbio.tech/api/v1/jobs/{job_id}/status" \
  -H "X-API-Key: $OPENBIO_API_KEY"
```

### Get Results
```bash
curl -X GET "https://api.openbio.tech/api/v1/jobs/{job_id}" \
  -H "X-API-Key: $OPENBIO_API_KEY"
```

## Troubleshooting

| Issue | Cause | Fix |
|-------|-------|-----|
| Invalid header | Wrong entity type format | Use `>entity_type\|name=X` |
| Too many tokens | Sequence too long | Reduce sequence length |
| Invalid entity type | Typo in header | Check: protein, ligand, rna, dna, glycan |
| Processing timeout | Large complex | Reduce parameters |
| Low confidence | Poor input | Increase recycles/timesteps |

## Best Practices

1. **Input Preparation**
   - Use clear entity type headers
   - Validate SMILES strings for ligands
   - Keep sequences to reasonable lengths

2. **Parameter Selection**
   - Start with defaults for initial testing
   - Increase parameters for production runs
   - Use reproducible seeds for consistent results

3. **Result Analysis**
   - Check confidence scores (pLDDT > 70)
   - Compare multiple samples
   - Use `ranking_data.json` to identify best structures

## Sample Output

### Successful Job Response
```json
{
  "success": true,
  "job_id": "chai_xyz123abc456",
  "message": "Job submitted successfully",
  "estimated_runtime": "10-20 minutes"
}
```

### Output Directory
```
predictions/
├── pred_0.cif          # Best predicted structure
├── pred_1.cif          # Second best structure
├── scores.model_idx_0.npz
├── ranking_data.json   # Model rankings
└── msa_coverage.pdf    # If MSAs used
```

### What Good Output Looks Like
- **pTM > 0.7**: Confident global structure
- **ipTM > 0.5**: Confident interface (> 0.7 for high confidence)
- **pLDDT > 70**: Per-residue confidence
- **CIF files**: With reasonable atom positions

## Typical Performance

| Campaign Size | Time | Notes |
|---------------|------|-------|
| 1 complex | 10-20 min | Single validation |
| 10 complexes | 1-2 hours | Small batch |
| 100 complexes | 8-16 hours | Standard campaign |

**Per-complex**: ~10-20 min for typical binder-target complex.

## Verify Success

```bash
# Check job completed
curl -s "https://api.openbio.tech/api/v1/jobs/{job_id}/status" \
  -H "X-API-Key: $OPENBIO_API_KEY" | jq '.status'

# After downloading, verify CIF files exist
ls *.cif | wc -l  # Should match num_diffn_samples
```

### Failure Recovery

```
Low pLDDT across predictions?
├── Increase recycles
│   └── num_trunk_recycles: 5-7
├── Check sequence quality
│   └── Validate amino acids, SMILES strings
└── Try with more samples
    └── num_diffn_samples: 8-10

Low ipTM (interface quality)?
├── Check chain order in FASTA
│   └── Ensure multiple chains present
├── Interface region may be disordered
│   └── Check if binding region is well-defined
└── Try Boltz-2 for comparison
    └── Different models may capture different features
```

## Chai vs Boltz Comparison

| Feature | Chai-1 | Boltz-2 |
|---------|--------|---------|
| Binding affinity | No | **Yes** |
| Glycans | **Yes** | Limited |
| Multi-modal | **Strong** | Strong |
| Speed | Moderate | Moderate |
| MSA-free option | No | **Yes** |
| Best for | Multi-modal complexes | Affinity prediction |

---

**Next**: After validation → Use `ProteinMPNN/LigandMPNN` for sequence optimization or `ThermoMPNN` for stability.