# Data Download in Modal

## Why Download Inside Functions?

Modal functions run in isolated containers. Data must be downloaded inside the function or stored in Modal volumes.

## HuggingFace Hub Download

Use `huggingface_hub` to download datasets:

```python
# Add huggingface_hub to image
image = modal.Image.debian_slim(python_version="3.11").pip_install(
    "torch",
    "einops",
    "numpy",
    "huggingface_hub",
)

@app.function(gpu="A100", image=image, timeout=3600)
def train():
    import os
    from huggingface_hub import hf_hub_download

    # Download tokenized dataset shards (publicly accessible, no auth required)
    data_dir = "/tmp/data/dataset"
    os.makedirs(data_dir, exist_ok=True)

    def get_file(fname):
        if not os.path.exists(os.path.join(data_dir, fname)):
            print(f"Downloading {fname}...")
            hf_hub_download(
                repo_id="your-org/your-dataset",
                filename=fname,
                repo_type="dataset",
                local_dir=data_dir,
            )

    # Download validation shard
    get_file("val_000000.bin")
    # Download first training shard
    get_file("train_000001.bin")

    # Load and use the data from data_dir
    ...
```

## Example Dataset Layout

Tokenized datasets often ship in shard files like:

| File | Purpose |
|------|---------|
| `val_000000.bin` | Validation shard |
| `train_000001.bin` | Training shard |

## Private Datasets

For private HuggingFace datasets:

```python
@app.function(gpu="A100", image=image, timeout=3600, secrets=[modal.Secret.from_name("huggingface")])
def train():
    import os
    from huggingface_hub import hf_hub_download

    # Token is automatically available from secret
    hf_hub_download(
        repo_id="your-private-repo",
        filename="data.bin",
        repo_type="dataset",
        local_dir="/tmp/data",
        token=os.environ.get("HF_TOKEN"),
    )
```

## Modal Volumes (Persistent Storage)

For large datasets you want to cache:

```python
volume = modal.Volume.from_name("training-data", create_if_missing=True)

@app.function(gpu="A100", image=image, volumes={"/data": volume})
def train():
    # Data persists across function calls
    if not os.path.exists("/data/train_000001.bin"):
        download_data("/data")

    # Use cached data
    ...
```
