# MinerU API Reference

Official docs: https://mineru.net/apiManage/docs · Token: https://mineru.net/apiManage/token

MinerU exposes **two** document-parsing APIs. This skill auto-routes between them.

| | 🎯 Standard API | ⚡ Agent API (lightweight) |
|---|---|---|
| Base URL | `https://mineru.net/api/v4` | `https://mineru.net/api/v1/agent` |
| Token | **required** (`Bearer`) | **none** (IP rate-limited) |
| Models | `pipeline` / `vlm` / `MinerU-HTML` | fixed lightweight `pipeline` |
| File size | ≤ 200 MB | ≤ 10 MB |
| Pages | ≤ 200 | ≤ 20 |
| Batch | ≤ 50 per request | single file only |
| Output | zip (Markdown + JSON, optional DOCX/HTML/LaTeX) | Markdown only (CDN link) |
| Designed for | high-accuracy / complex / batch | AI-agent / quick / no-login |

Free Standard-API quota: **1000 pages/day at highest priority** (overflow is lower priority).

---

## Authentication (Standard API)

```
Authorization: Bearer YOUR_API_TOKEN
```

Get a token at https://mineru.net/apiManage/token.

> **Response envelopes.** Business endpoints return `{"code":0,"data":{…},"msg":"ok"}`.
> The auth/gateway layer returns a *different* shape on failure:
> `{"success":false,"msgCode":"A0202","msg":"user authenticate failed"}`.
> Clients must handle both — this skill maps `msgCode` to the same error hints.

---

## Standard API endpoints (`/api/v4`)

### Single URL — `POST /extract/task`

```json
{
  "url": "https://example.com/doc.pdf",
  "model_version": "vlm",
  "is_ocr": false,
  "enable_formula": true,
  "enable_table": true,
  "language": "ch",
  "page_ranges": "1-10",
  "extra_formats": ["docx", "html"],
  "data_id": "my-document"
}
```
Response → `{ "code": 0, "data": { "task_id": "…" } }`. HTML inputs require `model_version: "MinerU-HTML"`.

### Get task result — `GET /extract/task/{task_id}`

```json
{ "code": 0, "data": { "task_id": "…", "state": "done", "full_zip_url": "https://…", "err_msg": "" } }
```

### Batch local upload — `POST /file-urls/batch`

Returns signed upload URLs; PUT each file (no `Content-Type`). Up to **50** files / request.

```json
{ "files": [ { "name": "doc.pdf", "data_id": "doc" } ], "model_version": "vlm" }
```
Response → `{ "code": 0, "data": { "batch_id": "…", "file_urls": ["https://…"] } }`.

### Batch URL — `POST /extract/task/batch`

```json
{ "files": [ { "url": "https://…/doc.pdf", "data_id": "doc" } ], "model_version": "vlm" }
```

### Batch results — `GET /extract-results/batch/{batch_id}`

```json
{ "code": 0, "data": { "batch_id": "…", "extract_result": [
  { "file_name": "doc.pdf", "state": "done", "full_zip_url": "https://…" }
] } }
```

---

## Agent API endpoints (`/api/v1/agent`) — no token

### URL — `POST /parse/url`

```json
{ "url": "https://…/doc.pdf", "language": "ch", "enable_table": true, "is_ocr": false, "enable_formula": true, "page_range": "1-10" }
```
`page_range` accepts `from-to` or a single page only (no commas). Returns `{ "code": 0, "data": { "task_id": "…" } }`.

### File — `POST /parse/file`

```json
{ "file_name": "doc.pdf", "language": "ch" }
```
Response → `{ "data": { "task_id": "…", "file_url": "https://oss…" } }`; PUT the file to `file_url`.

### Result — `GET /parse/{task_id}`

```json
{ "code": 0, "data": { "task_id": "…", "state": "done", "markdown_url": "https://cdn…/full.md" } }
```

---

## Task states

`pending` (queued) · `running` (parsing) · `converting` (format conversion) ·
`uploading` (downloading source, Agent) · `waiting-file` (awaiting upload) ·
`done` (complete) · `failed` (error).

---

## Parameters

| Parameter | Type | Default | Notes |
|-----------|------|---------|-------|
| `model_version` | string | `pipeline` | `pipeline`, `vlm` (recommended), `MinerU-HTML` (HTML only) |
| `is_ocr` | bool | `false` | OCR for scanned docs (pipeline/vlm) |
| `enable_formula` | bool | `true` | Formula recognition |
| `enable_table` | bool | `true` | Table recognition |
| `language` | string | `ch` | OCR language (see official `language` table) |
| `page_ranges` | string | all | Standard: `"2,4-6"`; Agent `page_range`: `"1-10"` only |
| `extra_formats` | array | `[]` | `docx` / `html` / `latex` (Standard only) |
| `data_id` | string | – | `[A-Za-z0-9_.-]`, ≤ 128 chars |
| `no_cache` | bool | `false` | Bypass URL cache (Standard) |
| `cache_tolerance` | int | `900` | Cache TTL seconds (Standard) |

---

## Limits

| | Standard | Agent |
|---|---|---|
| File size | 200 MB | 10 MB |
| Pages | 200 | 20 |
| Batch | 50 / request | 1 |
| Quota | 1000 pages/day priority | IP rate-limited (HTTP 429) |

Supported types: PDF, images (png/jpg/jpeg/jp2/webp/gif/bmp), Doc(x), Ppt(x), Xls(x); HTML is Standard-only.

---

## Error codes

| Code | Meaning |
|------|---------|
| `A0202` | Invalid token |
| `A0211` | Token expired |
| `-500` | Parameter error |
| `-10001` / `-10002` | Service error / invalid params |
| `-60002` | Unsupported file format |
| `-60003` / `-60004` | File read failed / empty file |
| `-60005` | File too large (> 200 MB) |
| `-60006` | Too many pages (> 200) |
| `-60008` | File read timeout (URL unreachable) |
| `-60010` | Parse failed |
| `-60015` / `-60016` | File / format conversion failed |
| `-60018` | Daily quota reached |
| `-60022` | Web page read failed (rate-limited) |
| **Agent API** | |
| `-30001` | Exceeds Agent 10 MB limit → use Standard API |
| `-30002` | Unsupported file type for Agent |
| `-30003` | Exceeds Agent 20-page limit → use Standard API or `--pages` |
| `-30004` | Invalid request parameters |
