# Hyperparameters

## Recommended Settings for GPT-124M on A100

| Parameter | Recommended Value | Notes |
|-----------|-------------------|-------|
| `max_lr` | 6e-4 | Search 3e-4 to 1.2e-3 (±2x); lower it if loss spikes |
| `min_lr` | 6e-5 | Keep 10x below max |
| `warmup_steps` | 200 | Increase for large batches |
| `max_steps` | 2000-4000 | Longer runs improve quality |
| `batch_size` | 32 | Reduce for memory-heavy variants |
| `weight_decay` | 0.1 | Standard GPT-2 pretraining |
| `block_size` | 1024 | Adjust for context length |
| `grad_clip` | 1.0 | Limits the effect of rare gradient spikes |
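
The table above can be captured in a small config object so that the defaults live in one place. This is a minimal sketch, not the project's actual training config: the class name `TrainConfig` and the sanity checks in `__post_init__` are assumptions added for illustration, and `max_steps` defaults to the low end of the recommended range.

```python
from dataclasses import dataclass


@dataclass
class TrainConfig:  # hypothetical name, not from the original code
    max_lr: float = 6e-4        # Peak learning rate
    min_lr: float = 6e-5        # Floor of the schedule (10x below max_lr)
    warmup_steps: int = 200     # Linear warmup length
    max_steps: int = 2000       # Low end of the 2000-4000 range
    batch_size: int = 32        # Sequences per step
    weight_decay: float = 0.1   # Standard GPT-2 pretraining value
    block_size: int = 1024      # Context length in tokens
    grad_clip: float = 1.0      # Gradient clipping threshold

    def __post_init__(self):
        # Illustrative sanity checks; adjust or drop them as needed.
        if not 0 < self.min_lr <= self.max_lr:
            raise ValueError("expected 0 < min_lr <= max_lr")
        if not 0 <= self.warmup_steps < self.max_steps:
            raise ValueError("expected 0 <= warmup_steps < max_steps")


cfg = TrainConfig(max_steps=4000)  # override a single field for a longer run
```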

## Model Configuration

```python
from dataclasses import dataclass


@dataclass
class GPTConfig:
    block_size: int = 1024      # Context length
    vocab_size: int = 50257     # GPT-2 tokenizer vocab size
    n_layer: int = 12           # Number of transformer blocks
    n_head: int = 12            # Number of attention heads
    n_embd: int = 768           # Embedding dimension
    dropout: float = 0.0        # Dropout (0 for pretraining)
    bias: bool = False          # Use bias in linear layers
```
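
As a rough check that these defaults describe a roughly 124M-parameter model, the count can be worked out by hand. The sketch below assumes the standard GPT-2 layout, which the config alone does not confirm: learned position embeddings, a 4x-wide MLP, an output head tied to the token embedding, and LayerNorms that keep both weight and bias regardless of the `bias` flag. The function name `gpt2_param_count` is hypothetical.

```python
def gpt2_param_count(n_layer: int = 12, n_embd: int = 768,
                     vocab_size: int = 50257, block_size: int = 1024,
                     bias: bool = False) -> int:
    """Estimate parameters for a GPT-2 style model (assumed layout, see text)."""
    b = 1 if bias else 0
    # Token and position embeddings; the output head is assumed tied to the
    # token embedding, so it adds no parameters.
    embeddings = vocab_size * n_embd + block_size * n_embd
    # LayerNorm: weight and bias (assumed unaffected by the `bias` flag).
    layer_norm = 2 * n_embd
    # Attention: fused QKV projection plus output projection.
    attn = (n_embd * 3 * n_embd + b * 3 * n_embd) + (n_embd * n_embd + b * n_embd)
    # MLP: expand to 4 * n_embd, then project back.
    mlp = (n_embd * 4 * n_embd + b * 4 * n_embd) + (4 * n_embd * n_embd + b * n_embd)
    block = 2 * layer_norm + attn + mlp
    return embeddings + n_layer * block + layer_norm  # final LayerNorm


print(f"{gpt2_param_count(bias=False) / 1e6:.1f}M")
print(f"{gpt2_param_count(bias=True) / 1e6:.1f}M")
```

Under these assumptions both settings of `bias` land at about 124M, because the linear-layer biases account for fewer than 0.1M parameters.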

## Optional Variant Parameters

If you introduce alternative residual schemes or routing layers, document their extra knobs. Common examples include:

- `num_streams`: number of parallel residual streams
- `mixing_iters`: iterations for normalization or mixing
- `mixing_temperature`: softness of routing weights

Start with conservative defaults and validate stability early.

## Learning Rate Schedule

```
LR
 ^
 |    /\
 |   /  \
 |  /    \______
 | /
 +---------------> Steps
   warmup  decay
```

- **Warmup**: Linear increase from 0 to `max_lr` over the first `warmup_steps` steps
- **Decay**: Cosine decay from `max_lr` to `min_lr` after warmup
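
The schedule described above can be sketched as a single function of the step index. This is one possible reading rather than the project's exact code: it assumes the cosine phase ends at `max_steps` and that the rate is held at `min_lr` afterwards. The function name `get_lr` is hypothetical.

```python
import math


def get_lr(step: int, max_lr: float = 6e-4, min_lr: float = 6e-5,
           warmup_steps: int = 200, max_steps: int = 2000) -> float:
    """Linear warmup from 0 to max_lr, then cosine decay to min_lr."""
    # Warmup: ramp linearly from 0 at step 0 to max_lr at warmup_steps.
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    # Past the end of the schedule: hold at min_lr (assumption).
    if step >= max_steps:
        return min_lr
    # Cosine decay: progress runs from 0 to 1 between warmup_steps and max_steps.
    progress = (step - warmup_steps) / (max_steps - warmup_steps)
    coeff = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes from 1 down to 0
    return min_lr + coeff * (max_lr - min_lr)


for step in (0, 100, 200, 1100, 2000):
    print(step, f"{get_lr(step):.2e}")
```

In a training loop, the returned value would typically be written into each optimizer parameter group before the optimizer step.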

## Batch Size Guidelines

| GPU | VRAM | Suggested Batch (GPT-124M) |
|-----|------|----------------------------|
| T4 | 16GB | 8 |
| A10G | 24GB | 16 |
| A100-40GB | 40GB | 32 |
| A100-80GB | 80GB | 64 |

Reduce batch size for memory-heavy variants or longer context.

## Training Duration

| Steps | Tokens | Time (A100) | Quality |
|-------|--------|-------------|---------|
| 1000 | ~32M | ~8 min | Initial convergence |
| 2000 | ~64M | ~15 min | Good quality |
| 4000 | ~128M | ~30 min | Better quality |
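
The token counts in the table follow from steps × batch size × context length. A minimal sketch of that arithmetic, assuming `batch_size=32`, `block_size=1024`, and no gradient accumulation (the helper name `tokens_seen` is hypothetical):

```python
def tokens_seen(steps: int, batch_size: int = 32, block_size: int = 1024) -> int:
    """Tokens processed after `steps` optimizer steps, one batch per step."""
    return steps * batch_size * block_size


for steps in (1000, 2000, 4000):
    print(steps, f"{tokens_seen(steps) / 1e6:.1f}M")
# 1000 32.8M
# 2000 65.5M
# 4000 131.1M
```

These values are slightly above the rounded figures in the table. A different batch size or context length scales the token count proportionally, so adjust `max_steps` if you want to keep the token budget fixed.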

## Training Health Signals

| Metric | Healthy Signal | Warning Signs |
|--------|----------------|---------------|
| Validation loss | Consistent downward trend | Plateaus early or increases |
| Grad norm | Stable with occasional bumps | Repeated spikes or NaNs |
| Grad norm variability | Low over short windows | Large oscillations between steps |
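
The gradient-norm signals above can be checked automatically over a short window of logged values. The sketch below is illustrative only: the function name `grad_norm_report` and the thresholds (`window`, `spike_factor`, `max_spikes`, `max_cv`) are assumptions chosen for the example, not tuned recommendations.

```python
import math
from statistics import mean, median, pstdev


def grad_norm_report(norms, window=50, spike_factor=3.0,
                     max_spikes=2, max_cv=0.5):
    """Summarize recent grad norms; all thresholds are illustrative."""
    recent = list(norms)[-window:]
    if not recent:
        raise ValueError("need at least one grad norm")
    # NaN or inf in the window is always a warning sign.
    if any(not math.isfinite(x) for x in recent):
        return {"has_nonfinite": True, "spikes": None, "cv": None, "healthy": False}
    # A spike is a value far above the window's median.
    spikes = sum(1 for x in recent if x > spike_factor * median(recent))
    # Coefficient of variation measures short-window variability.
    avg = mean(recent)
    cv = pstdev(recent) / avg if avg > 0 else 0.0
    healthy = spikes <= max_spikes and cv <= max_cv
    return {"has_nonfinite": False, "spikes": spikes, "cv": cv, "healthy": healthy}


stable = [1.0, 1.1, 0.9] * 20
spiky = [1.0] * 40 + [10.0] * 5 + [1.0] * 5
print(grad_norm_report(stable)["healthy"], grad_norm_report(spiky)["healthy"])
```

Using the median as the baseline keeps a few large values from raising the spike threshold itself.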

## Tuning Tips

1. **Loss Spikes**: Reduce `max_lr` or increase `warmup_steps`
2. **Slow Convergence**: Increase `max_lr` carefully, watching grad norm for new spikes
3. **OOM Errors**: Reduce `batch_size`
4. **Variant Instability**: Simplify the variant or reduce `mixing_temperature`
