# Structure Prediction & Protein Design Tools

This guide helps you choose among OpenBio's ML-based prediction and design tools. See the individual tool files for detailed documentation.

## Quick Reference

| Tool | Input | Output | Best For |
|------|-------|--------|----------|
| [Boltz](boltz.md) | Sequences/YAML | Structure + affinity | General prediction, binding affinity |
| [Chai](chai.md) | FASTA w/ entity types | Structure | Multi-modal (protein+ligand+RNA+glycan) |
| [SimpleFold](simplefold.md) | Sequence | Structure | Quick single-protein prediction |
| [ProteinMPNN](proteinmpnn.md) | Backbone PDB | Sequences | Fixed-backbone design |
| [LigandMPNN](ligandmpnn.md) | Complex PDB | Sequences | Ligand-aware design |
| [ThermoMPNN](thermompnn.md) | Structure PDB | ΔΔG values | Stability prediction |
| [GeoDock](geodock.md) | Two PDBs | Docked complex | Protein-protein docking |
| [Pinal](pinal.md) | Text description | Sequences | De novo from language |
| [BoltzGen](boltzgen.md) | YAML spec | Full pipeline | Comprehensive binder design |

## Master Decision Tree

```
What do you need?
│
├─ PREDICT STRUCTURE FROM SEQUENCE?
│   │
│   ├─ Single protein only?
│   │   ├─ Need speed → SimpleFold
│   │   └─ Need best quality → Boltz
│   │
│   ├─ Protein complex (multi-chain)?
│   │   └─ Boltz or Chai
│   │
│   ├─ Protein + small molecule?
│   │   ├─ Need binding affinity → Boltz-2 (YAML format)
│   │   └─ Just structure → Chai or Boltz
│   │
│   ├─ Protein + RNA/DNA/glycan?
│   │   └─ Chai (best multi-modal)
│   │
│   └─ Protein-protein docking?
│       ├─ Quick result → GeoDock
│       └─ Better accuracy → Boltz with multi-chain
│
├─ DESIGN SEQUENCES FOR BACKBONE?
│   │
│   ├─ No ligand present?
│   │   └─ ProteinMPNN
│   │
│   ├─ Ligand/cofactor in binding site?
│   │   └─ LigandMPNN
│   │
│   └─ Need thermostability focus?
│       └─ ThermoMPNN (for analysis) + ProteinMPNN (for design)
│
├─ PREDICT MUTATION EFFECTS?
│   └─ ThermoMPNN
│
├─ DE NOVO PROTEIN DESIGN?
│   │
│   ├─ From text description?
│   │   └─ Pinal
│   │
│   └─ Need full design pipeline with filtering?
│       └─ BoltzGen
│
└─ DESIGN PROTEIN BINDERS?
    │
    ├─ Protein binder to protein target?
    │   └─ BoltzGen (protein-anything protocol)
    │
    ├─ Peptide binder (including cyclic)?
    │   └─ BoltzGen (peptide-anything protocol)
    │
    ├─ Nanobody design?
    │   └─ BoltzGen (nanobody-anything protocol)
    │
    └─ Protein binding small molecule?
        └─ BoltzGen (protein-small_molecule protocol)
```

## Comparison Tables

### Structure Prediction

| Feature | Boltz-2 | Chai-1 | SimpleFold |
|---------|---------|--------|------------|
| Single protein | ✓ | ✓ | ✓ |
| Multi-chain complex | ✓ | ✓ | ✗ |
| Small molecules | ✓ | ✓ | ✗ |
| RNA/DNA | ✓ | ✓ | ✗ |
| Glycans | Limited | ✓ | ✗ |
| Binding affinity | ✓ | ✗ | ✗ |
| MSA-free option | ✓ | ✗ | ✓ |
| Speed | Moderate | Moderate | Fast |

### Sequence Design

| Feature | ProteinMPNN | LigandMPNN | ThermoMPNN |
|---------|-------------|------------|------------|
| Fixed backbone | ✓ | ✓ | ✗ (analysis) |
| Ligand awareness | ✗ | ✓ | ✗ |
| Side chain packing | ✗ | ✓ | ✗ |
| Scoring mode | ✗ | ✓ | ✗ |
| Stability prediction | ✗ | ✗ | ✓ |
| Soluble model | ✓ | ✗ | ✗ |

### De Novo Design

| Feature | Pinal | BoltzGen |
|---------|-------|----------|
| Text input | ✓ | ✗ |
| Backbone design | ✗ | ✓ |
| Inverse folding | ✗ | ✓ |
| Structure validation | ✗ | ✓ |
| Filtering/ranking | ✗ | ✓ |
| Complexity | Low | High |

## Common Workflows

### Workflow 1: Validate Designed Binder

```
1. Design with BoltzGen
   → Get sequences from final_designs/

2. Predict complex structure
   → submit_boltz_prediction with binder + target

3. Check confidence
   → Keep ipTM > 0.6 and pLDDT > 0.7 (pLDDT on a 0-1 scale, i.e. 70 on the 0-100 scale)

4. Analyze interface
   → Use structure tools for contacts
```

### Workflow 2: Engineer Enzyme

```
1. Analyze stability
   → submit_thermompnn_prediction
   → Identify stabilizing mutations

2. Design with ligand awareness
   → submit_ligandmpnn_prediction
   → Fix catalytic residues
   → Keep substrate in context

3. Validate design
   → submit_boltz_prediction
   → Check fold maintained (pTM > 0.8)
```

### Workflow 3: Quick Screening

```
1. Predict structures rapidly
   → submit_simplefold_prediction for each sequence

2. Filter by confidence
   → Keep pLDDT > 0.7 (0-1 scale, i.e. 70 on the 0-100 scale)

3. Detailed analysis for top candidates
   → submit_boltz_prediction for best ones
```

## Quality Thresholds Summary

### Structure Prediction

| Metric | Excellent | Good | Poor |
|--------|-----------|------|------|
| pLDDT (0-100 scale; multiply 0-1 values by 100) | > 90 | 70-90 | < 70 |
| pTM | > 0.8 | 0.5-0.8 | < 0.5 |
| ipTM (interface) | > 0.7 | 0.5-0.7 | < 0.5 |
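
The table above quotes pLDDT on a 0-100 scale, while the workflow and filtering examples in this guide quote it as a fraction (for example 0.7 or 0.85). The sketch below normalizes a value before classifying it. It assumes any value at or below 1.0 is a fraction; that heuristic and both function names are illustrative and not part of any OpenBio API.

```python
def normalize_plddt(value: float) -> float:
    """Return pLDDT on a 0-100 scale.

    Assumption: values <= 1.0 are fractions (0-1 scale) and get multiplied by 100.
    """
    if value < 0:
        raise ValueError("pLDDT cannot be negative")
    return value * 100.0 if value <= 1.0 else value


def classify_plddt(value: float) -> str:
    """Map a pLDDT value onto the Excellent / Good / Poor bands from the table."""
    score = normalize_plddt(value)
    if score > 90:
        return "Excellent"
    if score >= 70:
        return "Good"
    return "Poor"
```

Check which scale a tool reports before applying any cutoff; a threshold of 0.7 applied to 0-100 values passes almost everything.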

### Sequence Design

| Metric | Good | Investigate |
|--------|------|-------------|
| Score (ProteinMPNN, lower is better) | < 1.5 | > 2.5 |
| Sampling temperature (a setting, not a score) | 0.1-0.2 (conservative) | > 0.3 (diverse) |

### Stability

| ΔΔG (kcal/mol) | Effect |
|-----|--------|
| < -1.0 | Stabilizing |
| -1.0 to +1.0 | Neutral |
| > +1.0 | Destabilizing |
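
The bands above translate directly into a small classifier. This is a minimal sketch: the function names, the mutation labels, and the example values are all hypothetical, and it assumes the sign convention of the table (negative ΔΔG is stabilizing).

```python
from typing import Dict, List, Tuple


def classify_ddg(ddg: float, cutoff: float = 1.0) -> str:
    """Classify a predicted ΔΔG using the ±1.0 bands from the table above."""
    if ddg < -cutoff:
        return "Stabilizing"
    if ddg > cutoff:
        return "Destabilizing"
    return "Neutral"


def stabilizing_mutations(predictions: Dict[str, float]) -> List[Tuple[str, float]]:
    """Return (mutation, ΔΔG) pairs classified as stabilizing, most stabilizing first."""
    hits = [(m, d) for m, d in predictions.items() if classify_ddg(d) == "Stabilizing"]
    return sorted(hits, key=lambda pair: pair[1])


# Hypothetical predictions keyed by mutation label; the values are made up for illustration.
example = {"A45V": -1.8, "G12D": 2.3, "S77T": -0.2, "L90F": -1.1}
print(stabilizing_mutations(example))  # prints [('A45V', -1.8), ('L90F', -1.1)]
```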

## Quality Control Guidelines

### Critical Limitation

**Individual metrics have weak predictive power for binding.** Reported results:
- Individual metric ROC AUC: 0.64-0.66 (only slightly better than the 0.5 expected from random guessing)
- Metrics are **pre-screening filters**, not affinity predictors
- **Composite scoring is essential** for meaningful ranking
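
One simple way to build a composite score is to min-max normalize each metric across the design set and average the results. The sketch below does that with the metric names used in this guide. Equal weighting and min-max normalization are assumptions chosen for illustration, not a validated scoring scheme.

```python
from typing import Dict, List

# True means higher is better. Metric names follow the filtering pipeline in this guide;
# treating all four metrics as equally important is an assumption.
HIGHER_IS_BETTER = {"pLDDT": True, "ipTM": True, "PAE_interaction": False, "scRMSD": False}


def composite_scores(designs: List[Dict[str, float]]) -> List[float]:
    """Return one score in [0, 1] per design: the mean of min-max normalized metrics."""
    if not designs:
        return []
    totals = [0.0] * len(designs)
    for metric, higher_better in HIGHER_IS_BETTER.items():
        values = [d[metric] for d in designs]
        lo, hi = min(values), max(values)
        span = hi - lo
        for i, v in enumerate(values):
            # If every design ties on a metric, it carries no ranking information.
            norm = 0.5 if span == 0 else (v - lo) / span
            totals[i] += norm if higher_better else 1.0 - norm
    return [t / len(HIGHER_IS_BETTER) for t in totals]
```

Scores from this function are relative to the batch they were computed on, so they are useful for ranking within a campaign but not for comparing across campaigns.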

### Sequential Filtering Pipeline

```python
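# `designs` is a pandas DataFrame with one row per design.
# pLDDT here is on a 0-1 scale (0.85 corresponds to 85 on the 0-100 scale).

# Stage 1: Structural confidence
designs = designs[designs['pLDDT'] > 0.85]

# Stage 2: Self-consistency (scRMSD, in Å)
designs = designs[designs['scRMSD'] < 2.0]

# Stage 3: Binding quality (both conditions must hold)
designs = designs[(designs['ipTM'] > 0.5) & (designs['PAE_interaction'] < 10)]

# Stage 4: Expression checks
# Keep an even cysteine count so no cysteine is left unpaired
designs = designs[designs['cysteine_count'] % 2 == 0]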
```

### Campaign Health Assessment

| Pass Rate | Status | Action |
|-----------|--------|--------|
| > 15% | Excellent | Proceed to experimental testing |
| 10-15% | Good | Normal, proceed |
| 5-10% | Marginal | Review parameters, increase designs |
| < 5% | Poor | Diagnose issues before scaling |
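
The pass rate is the share of designs that survive the filtering pipeline. The sketch below maps it onto the table above. The function name is hypothetical, and because the table does not say which band owns the boundary values, assigning exactly 15% and 10% to "Good" and exactly 5% to "Marginal" is an assumption.

```python
from typing import Tuple


def campaign_health(n_passed: int, n_total: int) -> Tuple[str, str]:
    """Return (status, action) for a campaign given how many designs passed filtering."""
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    rate = 100.0 * n_passed / n_total  # pass rate in percent
    if rate > 15:
        return "Excellent", "Proceed to experimental testing"
    if rate >= 10:
        return "Good", "Normal, proceed"
    if rate >= 5:
        return "Marginal", "Review parameters, increase designs"
    return "Poor", "Diagnose issues before scaling"
```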

### Failure Recovery Trees

**Low pLDDT across predictions?**
```
├── Check scRMSD distribution
│   ├── High scRMSD (>2.5Å) → Backbone issue, regenerate
│   └── Low scRMSD but low pLDDT → Disordered regions
├── Increase sequence diversity
│   └── num_seq_per_target: 16-32, temp: 0.2
└── Try different design approach
    └── Use SolubleMPNN or a different tool
```

**Low ipTM (interface quality)?**
```
├── Review hotspot selection
│   └── Are hotspots surface-exposed?
├── Increase binder length
│   └── More contact area helps
└── Check interface geometry
    └── Flat vs concave targets need different approaches
```

## Rate Limits (All Tools)

- **Per minute**: 2 jobs
- **Per day**: 10 jobs
- **Timeout**: 30 min (most), 4 hours (BoltzGen)
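
A client-side throttle can keep batch submissions inside these limits. The sketch below is a sliding-window limiter using the 2-per-minute and 10-per-day figures above. The class is hypothetical, and whether the server counts jobs over sliding or fixed windows is not stated here, so treat the sliding window as an assumption.

```python
import time
from collections import deque


class JobThrottle:
    """Sliding-window limiter: at most `per_minute` jobs per 60 s and `per_day` per 24 h."""

    def __init__(self, per_minute=2, per_day=10, clock=time.monotonic):
        self.per_minute = per_minute
        self.per_day = per_day
        self.clock = clock
        self.stamps = deque()  # submission times within the last 24 h

    def wait_time(self) -> float:
        """Seconds until the next submission is allowed (0.0 if allowed now)."""
        now = self.clock()
        # Forget submissions older than 24 hours.
        while self.stamps and now - self.stamps[0] >= 86400:
            self.stamps.popleft()
        wait = 0.0
        if len(self.stamps) >= self.per_day:
            wait = max(wait, self.stamps[-self.per_day] + 86400 - now)
        recent = [t for t in self.stamps if now - t < 60]
        if len(recent) >= self.per_minute:
            wait = max(wait, recent[-self.per_minute] + 60 - now)
        return wait

    def record(self) -> None:
        """Call once after each successful job submission."""
        self.stamps.append(self.clock())
```

Before each submission, sleep for `wait_time()` seconds, submit, then call `record()`.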

## Job Management

Every prediction tool returns a `job_id`. Poll the status endpoint until the job finishes, then fetch the job record to get the download URLs:

```bash
# Check status
curl -X GET "https://api.openbio.tech/api/v1/jobs/{job_id}/status" \
  -H "X-API-Key: $OPENBIO_API_KEY"

# Get results with download URLs
curl -X GET "https://api.openbio.tech/api/v1/jobs/{job_id}" \
  -H "X-API-Key: $OPENBIO_API_KEY"
```
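
The same two calls can be wrapped in a polling loop. The sketch below uses only the Python standard library and the endpoints shown above. The `status` field name and its `"completed"` and `"failed"` values are assumptions, as are the function names, so check an actual response before relying on them.

```python
import json
import os
import time
import urllib.request

BASE_URL = "https://api.openbio.tech/api/v1"


def fetch_json(path: str) -> dict:
    """GET a path under BASE_URL with the API key header and decode the JSON body."""
    request = urllib.request.Request(
        f"{BASE_URL}{path}",
        headers={"X-API-Key": os.environ["OPENBIO_API_KEY"]},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def wait_for_job(job_id, fetch=fetch_json, interval=15.0, timeout=1800.0, sleep=time.sleep):
    """Poll the status endpoint until the job ends, then return the full job record.

    Assumption: the status response carries a "status" field whose terminal
    values are "completed" and "failed". The real names may differ.
    """
    waited = 0.0
    while True:
        status = fetch(f"/jobs/{job_id}/status").get("status")
        if status == "completed":
            return fetch(f"/jobs/{job_id}")
        if status == "failed":
            raise RuntimeError(f"Job {job_id} failed")
        if waited >= timeout:
            raise TimeoutError(f"Job {job_id} still '{status}' after {timeout:.0f} s")
        sleep(interval)
        waited += interval
```

The default `timeout` of 1800 seconds matches the 30-minute limit for most tools; pass `timeout=4 * 3600` for BoltzGen jobs.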

---

**See individual tool files for detailed parameters, examples, and troubleshooting.**