# SimpleFold Protein Structure Prediction

## Overview

SimpleFold is a protein structure prediction model developed by Apple that uses flow-matching generative modeling. It employs general-purpose transformer layers and achieves competitive performance on standard folding benchmarks.

### When to Use SimpleFold
- Quick single-protein structure prediction
- Sequence-only input (no MSA needed)
- Initial screening before detailed analysis
- Proteins where Boltz/Chai might be overkill

### When NOT to Use
- Protein complexes → Use Boltz or Chai
- Protein-ligand complexes → Use Boltz or Chai
- Need binding affinity → Use Boltz-2
- Need highest accuracy → Use Boltz/Chai

## Parameters

### Required

| Parameter | Type | Description |
|-----------|------|-------------|
| `sequence` | string | Amino acid sequence |

### Optional

| Parameter | Type | Range | Default | Description |
|-----------|------|-------|---------|-------------|
| `output_name` | string | - | "predicted_protein" | Output filename |
| `output_dir` | string | - | auto | Output directory |
| `num_steps` | int | 50-1000 | 500 | Inference steps |
| `tau` | float | 0.01-1.0 | 0.05 | Stochasticity scale |
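
The ranges above can be checked on the client before a job is submitted. The following is a minimal sketch, not part of the API: `build_params` is a hypothetical helper that applies the documented defaults, rejects out-of-range values, and prints a `params` JSON object. It assumes the sequence is already a plain string of letters, so no JSON escaping is done.

```bash
# Hypothetical helper: print a params JSON object for SimpleFold, or fail
# if num_steps or tau falls outside the documented ranges.
build_params() {
  _seq=$1
  _steps=${2:-500}    # documented default
  _tau=${3:-0.05}     # documented default

  # num_steps must be an integer in 50-1000.
  case $_steps in
    ''|*[!0-9]*) echo "num_steps must be an integer" >&2; return 1 ;;
  esac
  if [ "$_steps" -lt 50 ] || [ "$_steps" -gt 1000 ]; then
    echo "num_steps must be between 50 and 1000" >&2; return 1
  fi

  # tau is a float, so compare it with awk rather than the shell.
  if ! awk -v t="$_tau" 'BEGIN { exit !(t ~ /^[0-9]+(\.[0-9]+)?$/ && t >= 0.01 && t <= 1.0) }'; then
    echo "tau must be between 0.01 and 1.0" >&2; return 1
  fi

  printf '{"sequence": "%s", "num_steps": %s, "tau": %s}\n' "$_seq" "$_steps" "$_tau"
}
```

The output can be passed to curl as `-F "params=$(build_params "$SEQ" 750 0.03)"`.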

## Performance Settings

| Goal | num_steps | tau | Runtime |
|------|-----------|-----|---------|
| Fast | 200 | 0.1 | 3-5 min |
| Balanced | 500 | 0.05 | 5-15 min |
| High accuracy | 750 | 0.03 | 15-30 min |

## Quality Thresholds

### pLDDT Scores

| pLDDT | Confidence | Interpretation |
|-------|------------|----------------|
| > 90 | Very high | Excellent prediction |
| 70-90 | High | Reliable structure |
| 50-70 | Low | Use with caution |
| < 50 | Very low | Likely disordered |

### Interpreting Results

- **Well-folded regions**: High pLDDT (>70)
- **Flexible loops**: Lower pLDDT (50-70)
- **Disordered regions**: Very low pLDDT (<50)
- **Domain boundaries**: Often show confidence transitions
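
The thresholds above translate directly into a small lookup. This sketch uses a hypothetical function `plddt_band`; the table lists 70 and 90 in two ranges each, so the choice to place 70 in "high" and 90 in "high" (not "very high") is an assumption.

```bash
# Hypothetical helper: map a pLDDT value to the confidence bands in the table.
plddt_band() {
  awk -v p="$1" 'BEGIN {
    if (p > 90)       print "very high"
    else if (p >= 70) print "high"
    else if (p >= 50) print "low"
    else              print "very low"
  }'
}
```

For example, `plddt_band 62.4` prints `low`.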

## Output

### Files Generated
- **mmCIF structure**: Atomic coordinates
- **B-factor column**: Contains pLDDT scores per residue

### Using Results
1. Load in PyMOL/ChimeraX
2. Color by B-factor to visualize confidence
3. Read high B-factor values as high confidence: the column holds pLDDT scores, not experimental B-factors
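
Confidence can also be summarized without a viewer. The sketch below reads the `_atom_site` loop of an mmCIF file and reports the mean pLDDT over C-alpha atoms. `plddt_summary` is a hypothetical helper, and it assumes the file uses the standard `_atom_site.B_iso_or_equiv` and `_atom_site.label_atom_id` columns with whitespace-separated values and no quoted fields; check this against an actual output file.

```bash
# Hypothetical helper: summarize per-residue pLDDT from an mmCIF file.
# Takes one value per residue by keeping only CA atoms.
plddt_summary() {
  awk '
    /^loop_/ { ncol = 0; in_atom = 0; next }
    /^_atom_site\./ {
      # Record the position of each column we need in the atom_site loop.
      ncol++; in_atom = 1
      if ($1 == "_atom_site.B_iso_or_equiv") b = ncol
      if ($1 == "_atom_site.label_atom_id")  a = ncol
      next
    }
    /^_/ { in_atom = 0; next }
    in_atom && a && b && $1 == "ATOM" && $a == "CA" {
      n++; sum += $b
      if ($b > 70) ok++
    }
    END {
      if (n) printf "residues=%d mean_plddt=%.1f frac_above_70=%.2f\n", n, sum / n, ok / n
    }
  ' "$1"
}
```

Run it as `plddt_summary my_protein.cif`, where the filename is whatever the job produced.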

## API Usage

### Basic Prediction
```bash
curl -X POST "https://api.openbio.tech/api/v1/tools" \
  -H "X-API-Key: $OPENBIO_API_KEY" \
  -F "tool_name=submit_simplefold_prediction" \
  -F 'params={
    "sequence": "MKLLHVPLRRGTRYKLLKKKLSLPNPSLRTLGCISVIIVMSLGDPTNAGMHT"
  }'
```

### High Accuracy
```bash
curl -X POST "https://api.openbio.tech/api/v1/tools" \
  -H "X-API-Key: $OPENBIO_API_KEY" \
  -F "tool_name=submit_simplefold_prediction" \
  -F 'params={
    "sequence": "MKLLHVPLRRGTRYKLLKKKLSLPNPSLRTLGCISVIIVMSLGDPTNAGMHT",
    "output_name": "my_protein",
    "num_steps": 750,
    "tau": 0.03
  }'
```

### Fast Prediction
```bash
curl -X POST "https://api.openbio.tech/api/v1/tools" \
  -H "X-API-Key: $OPENBIO_API_KEY" \
  -F "tool_name=submit_simplefold_prediction" \
  -F 'params={
    "sequence": "MKLLHVPLRRGTRYKLLKKKLSLPNPSLRTLGCISVIIVMSLGDPTNAGMHT",
    "num_steps": 200,
    "tau": 0.1
  }'
```

### Get Tool Info
```bash
curl -X POST "https://api.openbio.tech/api/v1/tools" \
  -H "X-API-Key: $OPENBIO_API_KEY" \
  -F "tool_name=get_simplefold_tool_info" \
  -F 'params={}'
```

## Expected Runtime

| Sequence Length | Default | Fast | High Accuracy |
|-----------------|---------|------|---------------|
| Short (<100 aa) | 5 min | 3 min | 10 min |
| Medium (100-300 aa) | 10 min | 5 min | 20 min |
| Long (>300 aa) | 15 min | 8 min | 30 min |

## Rate Limits

- **Per minute**: 2 jobs
- **Per day**: 10 jobs
- **Max sequence**: ~2000 residues
- **Timeout**: 30 minutes
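
When submitting several sequences, a client-side pause keeps a script inside these limits. This is a sketch under assumptions: `submit_one` and `submit_batch` are hypothetical helpers, a 30-second gap is one conservative reading of "2 jobs per minute", and the daily count only covers jobs sent by this one invocation.

```bash
# Hypothetical helper: submit one sequence with default settings (requires curl).
submit_one() {
  curl -s -X POST "https://api.openbio.tech/api/v1/tools" \
    -H "X-API-Key: $OPENBIO_API_KEY" \
    -F "tool_name=submit_simplefold_prediction" \
    -F "params={\"sequence\": \"$1\"}"
}

# Read one sequence per line from stdin, pause between submissions,
# and stop submitting once the daily quota is used up.
submit_batch() {
  _gap=${1:-30}      # seconds between jobs (2 per minute)
  _quota=${2:-10}    # jobs per day
  _sent=0
  while IFS= read -r _line; do
    [ -n "$_line" ] || continue          # skip blank lines
    if [ "$_sent" -ge "$_quota" ]; then
      echo "daily quota reached; skipped: $_line" >&2
      continue
    fi
    [ "$_sent" -eq 0 ] || sleep "$_gap"  # no pause before the first job
    submit_one "$_line" </dev/null
    _sent=$((_sent + 1))
  done
}
```

Usage: `submit_batch < sequences.txt`, with one plain sequence per line.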

## Common Mistakes

### Wrong: Non-standard amino acids
```
❌ sequence: "MKLLHXPLRR"  # X is non-standard
```
```
✅ sequence: "MKLLHAPLRR"  # Standard 20 AAs only
```

### Wrong: FASTA format
```
❌ sequence: ">protein\nMKLLH..."  # FASTA header
```
```
✅ sequence: "MKLLH..."  # Plain sequence only
```

### Wrong: Too long for resources
```
❌ sequence: (3000+ amino acids)
```
```
✅ Keep under ~2000 residues
   Split into domains if needed
```

## Troubleshooting

| Issue | Cause | Fix |
|-------|-------|-----|
| Timeout | Long sequence | Reduce num_steps |
| Low confidence | Difficult target | Increase num_steps |
| Memory error | Sequence too long | Keep under 1500 aa |
| Invalid sequence | Non-standard AA | Use only standard 20 |

## Best Practices

1. **Validate sequence**: Standard 20 AAs only
2. **Start with defaults**: 500 steps, 0.05 tau
3. **Check confidence**: Focus on pLDDT > 70 regions
4. **Consider alternatives**: For complexes, use Boltz/Chai
5. **Compare methods**: Run SimpleFold + Boltz for important proteins

## Sample Output

### Job Response
```json
{
  "success": true,
  "job_id": "simplefold_xyz789",
  "message": "Job submitted successfully",
  "estimated_runtime": "5-10 minutes"
}
```

### What Good Output Looks Like
- **pLDDT > 70**: Reliable per-residue confidence
- **B-factor column**: Contains pLDDT scores
- **mmCIF file**: ~50-200 KB for typical protein

## Typical Performance

| Sequence Length | Time |
|-----------------|------|
| Short (<100 aa) | 3-5 min |
| Medium (100-300 aa) | 5-10 min |
| Long (>300 aa) | 10-15 min |

## Verify Success

```bash
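# Set JOB_ID to the job_id returned when the job was submitted
JOB_ID="simplefold_xyz789"

# Check that the job completed
curl -s "https://api.openbio.tech/api/v1/jobs/${JOB_ID}/status" \
  -H "X-API-Key: $OPENBIO_API_KEY" | jq '.status'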

# Verify structure file exists
ls *.cif
```
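
A one-off status check can be wrapped in a polling loop that gives up after the 30-minute timeout. This is a sketch only: `fetch_status` and `wait_for_job` are hypothetical helpers, and the terminal status names `completed` and `failed` are assumptions, since the actual values returned by the status endpoint are not listed here.

```bash
# Hypothetical helper: print the status string for a job (requires curl and jq).
fetch_status() {
  curl -s "https://api.openbio.tech/api/v1/jobs/$1/status" \
    -H "X-API-Key: $OPENBIO_API_KEY" | jq -r '.status'
}

# Poll until the job reaches a terminal state or the timeout elapses.
# Returns 0 on completion, 1 on failure, 2 on timeout.
wait_for_job() {
  _job=$1
  _timeout=${2:-1800}    # seconds; matches the 30-minute limit
  _interval=${3:-30}     # seconds between checks
  _waited=0
  while [ "$_waited" -le "$_timeout" ]; do
    _status=$(fetch_status "$_job")
    case $_status in
      completed) echo "completed"; return 0 ;;   # assumed status name
      failed)    echo "failed";    return 1 ;;   # assumed status name
    esac
    sleep "$_interval"
    _waited=$((_waited + _interval))
  done
  echo "timed out"
  return 2
}
```

Usage: `wait_for_job "$JOB_ID" && echo "results are ready"`.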

## SimpleFold vs Boltz/Chai

| Feature | SimpleFold | Boltz | Chai |
|---------|------------|-------|------|
| Speed | **Fast** | Moderate | Moderate |
| Single protein | Good | Excellent | Excellent |
| Complexes | No | **Yes** | **Yes** |
| Ligands | No | **Yes** | **Yes** |
| Binding affinity | No | **Yes (v2)** | No |
| MSA-free | **Yes** | Optional | No |

- **Use SimpleFold for**: Quick single-protein predictions, initial screening
- **Use Boltz/Chai for**: Complexes, ligands, production-quality structures

---

**Next**: For complexes → Use `Boltz` or `Chai`. For sequence optimization → Use `ProteinMPNN`.