# Common Issues

## CUDA Out of Memory

**Symptom**: `RuntimeError: CUDA out of memory`

**Solutions**:
1. Reduce the batch size
2. Use a smaller model
3. Upgrade to a larger GPU

```python
# Example: reduce batch size for memory-heavy variants
batch_size = 16  # Smaller batch to fit VRAM
```

## Timeout

**Symptom**: Function terminates before completion

**Solution**: Increase the `timeout` parameter

```python
@app.function(
    gpu="A100",
    image=image,
    timeout=3600,  # 1 hour (default is 300s)
)
def train():
    ...
```

| Training Type | Recommended Timeout |
|---------------|---------------------|
| Quick test | 300s (default) |
| Large-model training | 3600s (1 hour) |
| Extended training | 86400s (max, 24 hours) |

## Import Errors

**Symptom**: `ModuleNotFoundError`

**Solution**: Ensure every package the function imports is listed in the image's `pip_install()` call

```python
# WRONG - package not installed
image = modal.Image.debian_slim(python_version="3.11").pip_install("torch")

@app.function(gpu="A100", image=image)
def train():
    from einops import rearrange  # ModuleNotFoundError!

# CORRECT - all packages installed
image = modal.Image.debian_slim(python_version="3.11").pip_install(
    "torch",
    "einops",  # Now available
    "numpy",
)
```

## Data Download Fails

**Symptom**: Network errors downloading from HuggingFace

**Solutions**:
1. Add retry logic
2. Check if the dataset is public
3. Add authentication for private datasets

```python
import time

from huggingface_hub import hf_hub_download


def get_file_with_retry(fname, max_retries=3):
    # data_dir is assumed to be defined elsewhere (the local download directory)
    for attempt in range(max_retries):
        try:
            hf_hub_download(
                repo_id="your-org/your-dataset",
                filename=fname,
                repo_type="dataset",
                local_dir=data_dir,
            )
            return
        except Exception as e:
            if attempt < max_retries - 1:
                print(f"Retry {attempt + 1}/{max_retries}: {e}")
                time.sleep(5)  # Wait before the next attempt
            else:
                raise  # Out of retries: surface the last error
```

## Token Authentication

**Symptom**: Authentication errors

**Solution**: Set up the Modal token correctly

```bash
# Check if token is set
modal token list

# Set token (replace the placeholders with your own credentials)
modal token set --token-id <token-id> --token-secret <token-secret>
```

## Container Image Issues

**Symptom**: Old code running despite changes

**Solution**: Force an image rebuild

```bash
# Force rebuild via Modal's MODAL_FORCE_BUILD environment variable
MODAL_FORCE_BUILD=1 modal run train_modal.py
```

## Debugging Tips

1. **Print statements**: Appear in Modal logs in real time
2. **Background mode**: Use `modal run --detach` for long jobs
3. **Logs**: Check the Modal dashboard for full logs
4. **Local testing**: Test code locally before running on Modal

```python
@app.function(gpu="A100", image=image, timeout=3600)
def train():
    import torch  # Imported inside the function so it resolves in the container

    print("Starting...")  # Appears in logs
    print(f"GPU: {torch.cuda.get_device_name(0)}")

    # max_steps and loss come from your own training loop
    for step in range(max_steps):
        if step % 100 == 0:
            print(f"Step {step}, Loss: {loss:.4f}")  # Progress updates
```
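
One way to apply the local-testing tip is to keep the training loop in a plain Python function that takes the per-step function as an argument. A few steps can then be smoke-tested on any machine before the same loop is called from the Modal-decorated `train()`. The names `run_training`, `fake_step`, and `log_every` are hypothetical and are not part of Modal or the code above; this is a minimal sketch, assuming the real step function returns a float loss.

```python
from typing import Callable, List


def run_training(
    step_fn: Callable[[int], float],
    max_steps: int,
    log_every: int = 100,
) -> List[float]:
    """Run step_fn for max_steps steps, printing progress every log_every steps."""
    losses = []
    for step in range(max_steps):
        loss = step_fn(step)
        losses.append(loss)
        if step % log_every == 0:
            print(f"Step {step}, Loss: {loss:.4f}")  # Same format as the remote logs
    return losses


def fake_step(step: int) -> float:
    """Hypothetical stand-in for a real training step: a loss that shrinks each step."""
    return 1.0 / (step + 1)


# Local smoke test: a handful of steps, no GPU or Modal account needed.
losses = run_training(fake_step, max_steps=5, log_every=2)
```

Inside Modal, `train()` would call `run_training` with the real step function, so the loop logic checked locally is the same code that runs remotely.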