# Boltz-2 Biomolecular Structure Prediction

## Overview

Boltz is an open-source family of biomolecular interaction prediction models. **Boltz-2** predicts 3D structures and binding affinities for complexes of proteins, DNA, RNA, and small molecules; its authors report affinity accuracy approaching that of physics-based free-energy methods.

### When to Use Boltz

- Protein structure prediction (single or multi-chain)
- Protein-ligand complex structures
- Binding affinity prediction
- DNA/RNA structure modeling
- Modified residues and covalent bonds

### When NOT to Use

- Inverse folding only → Use ProteinMPNN/LigandMPNN
- Protein-protein docking only → Use GeoDock
- Simple single protein → Consider SimpleFold (faster)

## Decision Tree

```
What structure do you need?
│
├─ Single protein sequence?
│  └─ submit_boltz_prediction with FASTA
│
├─ Protein + small molecule?
│  └─ submit_boltz_prediction with YAML (recommended)
│     → Include "affinity" property for binding prediction
│
├─ Protein complex (multi-chain)?
│  └─ submit_boltz_prediction with multi-chain FASTA
│
└─ Need binding pocket constraints?
   └─ submit_boltz_prediction with YAML + constraints
```

## Input Formats

### FASTA Format (Simple Sequences)

Best for basic predictions that need neither constraints nor affinity:

```fasta
>A|protein
MKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF
>B|smiles
N[C@@H](Cc1ccc(O)cc1)C(=O)O
```

**Header Format**: `>CHAIN_ID|ENTITY_TYPE|ADDITIONAL_INFO` (the third field is optional)

**Entity Types**:

- `protein` - Amino acid sequences
- `dna` - DNA nucleotide sequences
- `rna` - RNA nucleotide sequences
- `smiles` - Chemical notation for small molecules
- `ccd` - Chemical Components Dictionary codes

### YAML Format (Binding Affinity)

**Required for protein-ligand affinity prediction**:

```yaml
version: 1
sequences:
  - protein:
      id: A
      sequence: MVTPEGNVSLVDESLLVGVTDEDRAVRSAHQFYERLIGLWAPAVMEAAHELGVFAALAEAPAD
  - ligand:
      id: B
      smiles: 'N[C@@H](Cc1ccc(O)cc1)C(=O)O'
properties:
  - affinity:
      binder: B
```

### YAML Format (Binding Pocket)

For constrained pocket prediction:

```yaml
sequences:
  - protein:
      id: [A1]
      sequence: MYNMRRLSLSPTFSMGFHLLVTVSLLFSHVDHVIAETEMEGEGNETGECTGSYYCKKGV
  - ligand:
      ccd: EKY
      id: [B1]
constraints:
  - pocket:
      binder: B1
      # Each contact is [chain, residue index]; indices must exist in that chain
      # (the sequence above appears abbreviated relative to 829 and 138).
      contacts: [ [ A1, 829 ], [ A1, 138 ] ]
```

## Parameters

### Core Parameters

| Parameter | Type | Range | Default | Description |
|-----------|------|-------|---------|-------------|
| `model` | string | boltz1, boltz2 | boltz2 | Model version (boltz2 recommended) |
| `recycling_steps` | integer | 1-10 | 3 | Refinement iterations |
| `sampling_steps` | integer | 50-500 | 200 | Diffusion steps |
| `diffusion_samples` | integer | 1-5 | 1 | Number of structure samples |

### Binding Affinity (Boltz-2 only)

| Parameter | Type | Range | Description |
|-----------|------|-------|-------------|
| `diffusion_samples_affinity` | integer | 1-10 | Additional samples for binding affinity |

### Output Options

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `output_format` | string | mmcif | Structure format (mmcif/pdb) |
| `write_full_pae` | boolean | true | Save PAE matrices |
| `write_full_pde` | boolean | true | Save PDE matrices |
| `use_msa_server` | boolean | true | Use MSA server |

## Quality Thresholds

### Confidence Scores

| Metric | Excellent | Good | Poor |
|--------|-----------|------|------|
| confidence_score | > 0.9 | 0.7-0.9 | < 0.7 |
| ptm | > 0.8 | 0.5-0.8 | < 0.5 |
| iptm (interface) | > 0.78 | 0.5-0.78 | < 0.5 |
| complex_plddt | > 0.87 | 0.7-0.87 | < 0.7 |

### Interpreting Scores

- **pTM > 0.5**: Generally reliable fold
- **ipTM > 0.5**: Reliable interface for complexes
- **PDE < 2 Å**: Excellent distance error; 2-4 Å is good
- **Affinity**: More negative = stronger binding; treat predictions with `affinity_confidence` > 0.8 as reliable

## Output Files

### Structure Files

- **Format**: mmCIF (default) or PDB
- **B-factor**: Contains per-residue confidence (pLDDT)

### Confidence JSON

```json
{
  "confidence_score": 0.85,
  "ptm": 0.82,
  "iptm": 0.78,
  "complex_plddt": 0.87,
  "complex_pde": 2.1
}
```

### Binding Affinity (Boltz-2)

```json
{
  "affinity": -8.5,
  "affinity_confidence": 0.92
}
```

## Performance Settings

| Goal | Parameters | Runtime |
|------|------------|---------|
| Fast | `sampling_steps: 100`, `recycling_steps: 1` | 5-10 min |
| Balanced | defaults | 10-20 min |
| High accuracy | `sampling_steps: 400`, `recycling_steps: 5` | 20-45 min |
| Protein-ligand | `diffusion_samples_affinity: 5`, `sampling_steps: 300` | 15-30 min |

## Rate Limits

- **Per minute**: 2 jobs maximum
- **Per day**: 10 jobs maximum
- **Burst**: 3 jobs in 5 minutes
- **File size**: 10 MB maximum
- **Timeout**: 30 minutes

## API Usage

### Get Tool Info

```bash
curl -X POST "https://api.openbio.tech/api/v1/tools" \
  -H "X-API-Key: $OPENBIO_API_KEY" \
  -F "tool_name=get_boltz_tool_info" \
  -F 'params={}'
```

### Submit Prediction

```bash
curl -X POST "https://api.openbio.tech/api/v1/tools" \
  -H "X-API-Key: $OPENBIO_API_KEY" \
  -F "tool_name=submit_boltz_prediction" \
  -F 'params={
    "input_file_path": "inputs/protein.yaml",
    "model": "boltz2",
    "recycling_steps": 3,
    "sampling_steps": 200
  }'
```
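
Out-of-range parameters cost a job against the rate limits, so it can help to check them client-side before submitting. The following is a minimal sketch, assuming the ranges in the Core Parameters and Binding Affinity tables are what the server enforces; `build_params` is a hypothetical helper name, not part of the API, and it only produces the JSON string for the `params` form field.

```python
import json

# Ranges copied from the Core Parameters and Binding Affinity tables.
INT_RANGES = {
    "recycling_steps": (1, 10),
    "sampling_steps": (50, 500),
    "diffusion_samples": (1, 5),
    "diffusion_samples_affinity": (1, 10),
}
MODELS = ("boltz1", "boltz2")


def build_params(input_file_path, model="boltz2", **options):
    """Return the JSON string for the `params` form field, or raise ValueError.

    Hypothetical helper: it mirrors the documented ranges only and never
    contacts the API. Options without a documented numeric range
    (e.g. output_format) are passed through unchecked.
    """
    if model not in MODELS:
        raise ValueError(f"model must be one of {MODELS}, got {model!r}")
    if "diffusion_samples_affinity" in options and model != "boltz2":
        raise ValueError("diffusion_samples_affinity is Boltz-2 only")
    for name, value in options.items():
        if name not in INT_RANGES:
            continue
        low, high = INT_RANGES[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        is_int = isinstance(value, int) and not isinstance(value, bool)
        if not is_int or not low <= value <= high:
            raise ValueError(
                f"{name} must be an integer in [{low}, {high}], got {value!r}"
            )
    return json.dumps(
        {"input_file_path": input_file_path, "model": model, **options}
    )


if __name__ == "__main__":
    # Same values as the curl example in "Submit Prediction".
    print(build_params("inputs/protein.yaml", recycling_steps=3, sampling_steps=200))
```

The returned string can be passed as the value of `-F 'params=...'` in the curl call, or as the `params` field of a multipart request from any HTTP client.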
### Poll and Download

```bash
# Check status
curl -X GET "https://api.openbio.tech/api/v1/jobs/{job_id}/status" \
  -H "X-API-Key: $OPENBIO_API_KEY"

# Get results with download URLs
curl -X GET "https://api.openbio.tech/api/v1/jobs/{job_id}" \
  -H "X-API-Key: $OPENBIO_API_KEY"
```

## Sample Output

### Successful Job Submission

```json
{
  "success": true,
  "job_id": "boltz_abc123def456",
  "message": "Job submitted successfully",
  "estimated_runtime": "15-20 minutes"
}
```

### Completed Job Response

```json
{
  "success": true,
  "job": {
    "job_id": "boltz_abc123def456",
    "status": "completed",
    "created_at": "2025-01-30T10:00:00Z",
    "completed_at": "2025-01-30T10:18:32Z"
  },
  "output_files_signed_urls": {
    "structure.cif": "https://s3.../structure.cif?...",
    "confidence.json": "https://s3.../confidence.json?..."
  }
}
```

### What Good Output Looks Like

- **pTM > 0.7**: Confident global structure
- **ipTM > 0.5**: Confident interface (> 0.7 for high confidence)
- **pLDDT > 0.7**: Confident per-residue predictions
- **CIF file**: ~100-500 KB for a typical complex

## Typical Performance

| Campaign Size | Time | Notes |
|---------------|------|-------|
| 1 complex | 10-20 min | Single validation |
| 10 complexes | 1-2 hours | Small batch |
| 50 complexes | 4-8 hours | Standard campaign |
| 100 complexes | 8-16 hours | Large campaign |

**Per-complex**: ~10-20 min for a typical binder-target complex.
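
Because a typical complex takes around 10-20 minutes, clients usually poll the status endpoint rather than block on a single request. The sketch below shows one way to do that in Python. It assumes the status response carries a `status` field that becomes `"completed"` (as in the sample output), uses the 30-minute timeout from the rate limits as the default deadline, and treats `"failed"` as a terminal state; that failure name, like the helper names `wait_for_job` and `fetch_status`, is an assumption and not something this document specifies.

```python
import json
import os
import time
import urllib.request

API_ROOT = "https://api.openbio.tech/api/v1"
# Assumption: the name(s) of terminal failure states are not documented here.
FAILURE_STATUSES = ("failed",)


def fetch_status(job_id):
    """Return the `status` string from GET /jobs/{job_id}/status."""
    request = urllib.request.Request(
        f"{API_ROOT}/jobs/{job_id}/status",
        headers={"X-API-Key": os.environ["OPENBIO_API_KEY"]},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)["status"]


def wait_for_job(fetch, job_id, interval_s=30.0, timeout_s=1800.0,
                 sleep=time.sleep, clock=time.monotonic):
    """Poll until the job reports "completed"; return the number of polls.

    `fetch`, `sleep`, and `clock` are injectable so the loop can be tested
    without network access or real waiting.
    """
    deadline = clock() + timeout_s
    polls = 0
    while True:
        status = fetch(job_id)
        polls += 1
        if status == "completed":
            return polls
        if status in FAILURE_STATUSES:
            raise RuntimeError(f"job {job_id} ended with status {status!r}")
        if clock() >= deadline:
            raise TimeoutError(f"job {job_id} not completed after {timeout_s} s")
        sleep(interval_s)


if __name__ == "__main__":
    # Placeholder job id taken from the sample output above.
    wait_for_job(fetch_status, "boltz_abc123def456")
```

Once `wait_for_job` returns, the results endpoint (`/jobs/{job_id}`) can be queried for the signed download URLs.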
## Verify Success

```bash
# Check job completed
curl -s "https://api.openbio.tech/api/v1/jobs/{job_id}/status" \
  -H "X-API-Key: $OPENBIO_API_KEY" | jq '.status'
# Should return: "completed"

# Verify output files exist
curl -s "https://api.openbio.tech/api/v1/jobs/{job_id}" \
  -H "X-API-Key: $OPENBIO_API_KEY" | jq '.output_files_signed_urls | keys'
# Should list: structure files, confidence.json
```

## Tool Comparison

| Feature | Boltz-2 | Chai-1 | SimpleFold |
|---------|---------|--------|------------|
| Single protein | ✓ | ✓ | ✓ |
| Multi-chain complex | ✓ | ✓ | ✗ |
| Small molecules | ✓ | ✓ | ✗ |
| RNA/DNA | ✓ | ✓ | ✗ |
| Glycans | Limited | ✓ | ✗ |
| **Binding affinity** | **✓** | ✗ | ✗ |
| MSA-free option | ✓ | ✗ | ✓ |
| Speed | Moderate | Moderate | **Fast** |
| Best for | Affinity, complexes | Multi-modal | Quick single protein |

## Troubleshooting

| Symptom | Cause | Fix |
|---------|-------|-----|
| Timeout | Complex too large | Reduce `sampling_steps` and `recycling_steps` |
| Low confidence (< 0.5) | Poor sequence quality, no MSA, or a disordered target | Check sequence quality, enable MSA |
| `CUDA out of memory` | Sequence too long | Keep under ~2000 residues, reduce `diffusion_samples` |
| Invalid YAML | Syntax error | Run the file through a YAML validator, check quotes on SMILES |
| No affinity output | Missing property | Add a `properties:` block with an `affinity` entry to the YAML |
| `KeyError: 'iptm'` | Single chain only | Ensure input has 2+ chains for interface metrics |
| File too large | Input > 10 MB | Compress or split the input file |

### Failure Recovery

```
Low confidence across predictions?
├── Check sequence quality
│   └── Validate amino acid sequence (standard 20 AAs only)
├── Enable MSA server
│   └── use_msa_server: true (provides evolutionary context)
├── Increase sampling
│   └── sampling_steps: 300-400, recycling_steps: 5
└── Check if target is difficult
    └── Some proteins are intrinsically disordered
```

## Best Practices

1. **Start with defaults** for initial predictions
2. **Use YAML format** for protein-ligand complexes
3. **Increase sampling_steps** (200→400) for critical predictions
4. **Check confidence scores** - aim for > 0.7
5. **Use MSA server** unless you have specific reasons not to
6. **For binding affinity**: Set `diffusion_samples_affinity: 5`

---

**Next**: After structure prediction → Use `ThermoMPNN` for stability analysis or `ProteinMPNN`/`LigandMPNN` for sequence optimization.
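
Best practice 4 (checking confidence scores) can be scripted against the Confidence Scores table. The sketch below is illustrative only: `THRESHOLDS`, `grade`, and `grade_confidence` are hypothetical names, and the cut-offs are copied from that table, with boundary values such as exactly 0.78 for `iptm` falling into "good" because the table defines "excellent" as strictly greater.

```python
# (excellent_above, good_from) per metric, from the Confidence Scores table.
THRESHOLDS = {
    "confidence_score": (0.9, 0.7),
    "ptm": (0.8, 0.5),
    "iptm": (0.78, 0.5),
    "complex_plddt": (0.87, 0.7),
}


def grade(metric, value):
    """Classify one metric value as "excellent", "good", or "poor"."""
    excellent_above, good_from = THRESHOLDS[metric]
    if value > excellent_above:
        return "excellent"
    if value >= good_from:
        return "good"
    return "poor"


def grade_confidence(confidence):
    """Grade every known metric present in a parsed confidence.json dict.

    Metrics missing from the input (e.g. iptm for a single-chain job)
    are skipped rather than raising KeyError.
    """
    return {
        metric: grade(metric, confidence[metric])
        for metric in THRESHOLDS
        if metric in confidence
    }


if __name__ == "__main__":
    # Values from the Confidence JSON example in "Output Files".
    sample = {
        "confidence_score": 0.85,
        "ptm": 0.82,
        "iptm": 0.78,
        "complex_plddt": 0.87,
        "complex_pde": 2.1,
    }
    print(grade_confidence(sample))
```

For the sample confidence JSON shown under Output Files, this grades `ptm` as excellent and the other three metrics as good.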