# ProteinMPNN Sequence Design

## Overview

ProteinMPNN (Protein Message Passing Neural Network) is a deep learning framework for protein sequence design. It takes protein backbone structures as input and generates amino acid sequences predicted to be compatible with those structures.

### When to Use ProteinMPNN
- Design new sequences for existing backbone structures
- Redesign protein surface residues
- Stabilize proteins via conservative design
- Design multi-chain complex interfaces
- Generate diverse sequence variants

### When NOT to Use
- Ligand in binding site → Use LigandMPNN
- Need thermostability focus → Use ThermoMPNN
- Need de novo backbone → Use BoltzGen or Pinal
- Have only sequence → Need structure first (Boltz/Chai)

## Decision Tree

```
What do you need to design?
│
├─ Protein with ligand bound?
│   └─ NO: submit_proteinmpnn_prediction ✓
│   └─ YES: Use LigandMPNN instead
│
├─ Priority is soluble expression?
│   └─ submit_proteinmpnn_prediction with use_soluble_model: true
│
├─ Need fast processing?
│   └─ submit_proteinmpnn_prediction with ca_only: true
│
├─ Multi-chain complex?
│   └─ submit_proteinmpnn_prediction with pdb_path_chains: "A B"
│
└─ Need reproducible results?
    └─ submit_proteinmpnn_prediction with seed: 42
```

## Parameters

### Core Parameters

| Parameter | Type | Range | Default | Description |
|-----------|------|-------|---------|-------------|
| `num_seq_per_target` | int | 1-50 | 1 | Sequences to generate |
| `sampling_temp` | string | "0.1"-"1.0" | "0.1" | Temperature (string!) |
| `use_soluble_model` | bool | - | false | Enhanced solubility |
| `ca_only` | bool | - | false | CA-only for speed |
| `seed` | int | 0-2147483647 | None | Random seed |
| `pdb_path_chains` | string | - | None | Chains to design (e.g., "A B") |
| `backbone_noise` | float | 0.0-0.5 | None | Noise for diversity |

### Temperature Guide

| Temperature | Diversity | Quality | Use For |
|-------------|-----------|---------|---------|
| 0.1 | Low | High | Production, conservative |
| 0.2 | Moderate | Good | Default exploration |
| 0.3 | Higher | Moderate | Initial screening |
| 0.5+ | Very high | Lower | Maximum diversity |

**IMPORTANT**: Temperature must be passed as a **string**, not float!

## Quality Metrics

### Output Header Format
```
>protein_0001, score=1.234, global_score=1.189, seq_recovery=0.82
MKTAYIAKQRQISFVKSHFSRQLEERGLTKE...
```

### Interpreting Scores

| Metric | Good | Acceptable | Investigate |
|--------|------|------------|-------------|
| score | < 1.5 | 1.5-2.5 | > 2.5 |
| seq_recovery (de novo) | 0.3-0.5 | 0.5-0.6 | > 0.7 (too conservative) |
| seq_recovery (redesign) | 0.6-0.8 | 0.5-0.6 | < 0.4 |

- **Lower scores** = better sequence-structure compatibility
- **Higher seq_recovery** = more conservative design

## Model Selection

| Model | Use Case | Speed |
|-------|----------|-------|
| Standard (default) | General protein design | Normal |
| Soluble (`use_soluble_model: true`) | E. coli expression | Normal |
| CA-only (`ca_only: true`) | Large structures, fast | Fast |

## Common Mistakes

### Wrong: Float temperature
```
❌ sampling_temp: 0.1  # May cause errors
```
```
✅ sampling_temp: "0.1"  # String with quotes
```

### Wrong: Space in chain specification
```
❌ pdb_path_chains: "A, B"  # Space after comma
```
```
✅ pdb_path_chains: "A B"  # Space-separated, no commas
```

### Wrong: Not fixing critical residues
```
❌ Redesigning catalytic residues in enzyme
```
```
✅ Use fixed_positions in LigandMPNN or redesign specific chains only
```

## API Usage

### Basic Design
```bash
curl -X POST "https://api.openbio.tech/api/v1/tools" \
  -H "X-API-Key: $OPENBIO_API_KEY" \
  -F "tool_name=submit_proteinmpnn_prediction" \
  -F 'params={
    "input_file_path": "structures/my_protein.pdb",
    "num_seq_per_target": 8,
    "sampling_temp": "0.1",
    "seed": 42
  }'
```

### Soluble Design
```bash
curl -X POST "https://api.openbio.tech/api/v1/tools" \
  -H "X-API-Key: $OPENBIO_API_KEY" \
  -F "tool_name=submit_proteinmpnn_prediction" \
  -F 'params={
    "input_file_path": "structures/my_protein.pdb",
    "use_soluble_model": true,
    "num_seq_per_target": 5,
    "sampling_temp": "0.2"
  }'
```

### Multi-Chain Design
```bash
curl -X POST "https://api.openbio.tech/api/v1/tools" \
  -H "X-API-Key: $OPENBIO_API_KEY" \
  -F "tool_name=submit_proteinmpnn_prediction" \
  -F 'params={
    "input_file_path": "structures/complex.pdb",
    "pdb_path_chains": "A B",
    "num_seq_per_target": 4,
    "sampling_temp": "0.15"
  }'
```

### High Diversity
```bash
curl -X POST "https://api.openbio.tech/api/v1/tools" \
  -H "X-API-Key: $OPENBIO_API_KEY" \
  -F "tool_name=submit_proteinmpnn_prediction" \
  -F 'params={
    "input_file_path": "structures/scaffold.pdb",
    "sampling_temp": "0.5",
    "backbone_noise": 0.1,
    "num_seq_per_target": 10
  }'
```

## Sample Output

### Successful Job Response
```json
{
  "success": true,
  "job_id": "proteinmpnn_xyz789",
  "message": "Job submitted successfully",
  "estimated_runtime": "2-5 minutes"
}
```

### Output FASTA Header
```
>protein_0001, score=1.234, global_score=1.189, seq_recovery=0.82
MKTAYIAKQRQISFVKSHFSRQLEERGLTKE...
>protein_0002, score=1.198, global_score=1.156, seq_recovery=0.79
MKTAYIAKQRQISFVKSQFSRQLDERGLTKE...
```

### What Good Output Looks Like
- **Score**: 1.0-2.0 (lower = more confident)
- **Seq recovery**: 0.3-0.6 for de novo, 0.7-0.9 for redesign
- **Diverse sequences** (not all identical) when temp > 0.1

## Expected Runtime

| Protein Size | Sequences | Time |
|--------------|-----------|------|
| Small (<100 aa) | 1-3 | 1-2 min |
| Medium (100-500 aa) | 3-10 | 2-5 min |
| Large (>500 aa) | 5-10 | 5-15 min |

## Typical Performance

| Campaign Size | Time | Notes |
|---------------|------|-------|
| 10 backbones × 8 seq | 5-10 min | Quick test |
| 100 backbones × 8 seq | 30-60 min | Standard |
| 500 backbones × 16 seq | 2-4 hours | Large campaign |

**Throughput**: ~50-100 sequences/minute for typical proteins.

## Verify Success

```bash
# Check job status
curl -s "https://api.openbio.tech/api/v1/jobs/{job_id}/status" \
  -H "X-API-Key: $OPENBIO_API_KEY" | jq '.status'

# Count generated sequences (should match num_seq_per_target)
# Download FASTA and count headers
grep -c "^>" output.fa
```

## Tool Comparison

| Variant | Use Case | Key Difference |
|---------|----------|----------------|
| ProteinMPNN | General protein design | Standard model |
| SolubleMPNN (`use_soluble_model: true`) | Bacterial expression | Trained on soluble proteins |
| LigandMPNN | Small molecules/metals | Ligand-aware context |
| CA-only (`ca_only: true`) | Large structures | Faster, backbone only |

## Troubleshooting

| Error | Cause | Fix |
|-------|-------|-----|
| `Invalid PDB format` | Bad format | Check ATOM records, proper formatting |
| `Chain not found` | Wrong chain ID | Verify chain exists in PDB (`grep "^ATOM" file.pdb \| cut -c22 \| sort -u`) |
| `Temperature error` | Float instead of string | Use `"0.1"` not `0.1` |
| `File too large` | >50MB | Reduce or use CA-only |
| `All identical sequences` | Temperature too low | Increase to 0.2-0.3 |
| `Low quality sequences` | Temperature too high | Decrease to 0.1-0.15 |
| `IndexError: list index` | Empty chain or residue list | Check PDB has atoms, not just HEADER |

### Failure Recovery

```
Low sequence diversity?
├── Increase temperature
│   └── sampling_temp: "0.2" or "0.3"
├── Add backbone noise
│   └── backbone_noise: 0.1
└── Generate more sequences
    └── num_seq_per_target: 16-32

High scores (> 2.5)?
├── Backbone may be strained
│   └── Check input structure geometry
├── Try lower temperature
│   └── sampling_temp: "0.1"
└── Use different model
    └── use_soluble_model: true

Sequences don't fold correctly (high scRMSD)?
├── Lower temperature for more conservative design
│   └── sampling_temp: "0.1"
├── Increase sequences per target
│   └── num_seq_per_target: 32
└── Check backbone quality
    └── Regenerate backbone with different parameters
```

## Best Practices

1. **Start conservative**: Use temp 0.1-0.2 initially
2. **Use seeds**: Set seed for reproducible results
3. **Keep file size small**: <50MB for optimal performance
4. **Choose right model**:
   - Standard for general use
   - Soluble for bacterial expression
   - CA-only for large structures
5. **Validate with structure prediction**: Run Boltz/Chai on designed sequences

---

**Next**: Validate designed sequences with `Boltz` or `Chai` → Use `ThermoMPNN` to check stability.