arXiv Paper Reader
Use the bundled Python scripts before reasoning about arXiv content. They handle:
- searching arXiv by keyword
- filtering keyword results by submitted date range
- downloading arXiv metadata and paper content
- converting papers to Markdown and PDF in the workspace
- syncing configured topics into daily archive folders
Inputs
- Accept raw arXiv IDs like 1706.03762 or URLs such as https://arxiv.org/abs/1706.03762.
- Only accept raw IDs or HTTPS arXiv URLs on arxiv.org, www.arxiv.org, or export.arxiv.org.
- Accept keyword searches such as transformer, diffusion, or computer vision.
- Accept optional submitted-date windows using YYYY-MM-DD.
- Do not use category filters or alias-based domain shortcuts; search is intentionally keyword-only.
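The input rules above can be sketched as a small validator. This is an illustration only, not the bundled scripts' actual logic: the function name `normalize_arxiv_input`, the ID regular expression, and the assumption that URLs use `/abs/` or `/pdf/` paths are all hypothetical.

```python
import re
from urllib.parse import urlparse

# Hosts listed in the Inputs section.
ALLOWED_HOSTS = {"arxiv.org", "www.arxiv.org", "export.arxiv.org"}

# Assumed ID shapes: new-style (1706.03762, optional vN suffix) and
# old-style (hep-th/9901001). The real scripts may accept a different set.
ID_PATTERN = re.compile(r"(\d{4}\.\d{4,5}|[a-z\-]+(\.[A-Z]{2})?/\d{7})(v\d+)?")


def normalize_arxiv_input(value: str) -> str:
    """Return the bare arXiv ID, or raise ValueError for unsupported input."""
    value = value.strip()
    if ID_PATTERN.fullmatch(value):
        return value
    parsed = urlparse(value)
    if parsed.scheme != "https" or parsed.hostname not in ALLOWED_HOSTS:
        raise ValueError(f"not a raw arXiv ID or HTTPS arXiv URL: {value!r}")
    # Assumed path layout: /abs/<id> or /pdf/<id>[.pdf]
    match = re.fullmatch(r"/(abs|pdf)/(.+?)(\.pdf)?", parsed.path)
    if not match or not ID_PATTERN.fullmatch(match.group(2)):
        raise ValueError(f"unrecognized arXiv URL path: {parsed.path!r}")
    return match.group(2)
```

Anything that is neither a raw ID nor an allowed HTTPS URL, including plain keywords, raises `ValueError`, so keyword queries would need to be routed to the search workflow before this check.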
Search workflow
- Pick a Python command:
  - Prefer python.
  - Fall back to python3.
- If the user wants search results or the latest papers for a topic, run:
python {baseDir}/scripts/search_arxiv.py --query "<keywords>" --limit <n>
- Read search_results.md and search_results.json.
- Use {baseDir}/references/search-usage.md to present the results.
- If the user asks for the latest papers matching a keyword, pass --sort submittedDate.
- If the user wants the default best-match ranking, omit --sort and let the script use relevance order.
- If the user gives a date window, add --start-date YYYY-MM-DD --end-date YYYY-MM-DD.
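The search steps above can be sketched as an argument-list builder. The flags are the ones named in this section; the helper names `pick_python` and `build_search_command`, and the choice to require both dates together, are assumptions made for illustration.

```python
import shutil
from typing import List, Optional


def pick_python() -> str:
    """Prefer `python`, fall back to `python3` (the order given above)."""
    for candidate in ("python", "python3"):
        if shutil.which(candidate):
            return candidate
    raise RuntimeError("no Python interpreter found on PATH")


def build_search_command(
    python_cmd: str,
    base_dir: str,
    query: str,
    limit: int,
    latest: bool = False,
    start_date: Optional[str] = None,
    end_date: Optional[str] = None,
) -> List[str]:
    """Build argv for search_arxiv.py; each value stays a single argument."""
    cmd = [
        python_cmd,
        f"{base_dir}/scripts/search_arxiv.py",
        "--query", query,
        "--limit", str(limit),
    ]
    if latest:
        # Omitting --sort leaves the script on its default relevance order.
        cmd += ["--sort", "submittedDate"]
    if (start_date is None) != (end_date is None):
        # Assumption: the window is only passed when both ends are known.
        raise ValueError("provide both start_date and end_date, or neither")
    if start_date and end_date:
        cmd += ["--start-date", start_date, "--end-date", end_date]
    return cmd

# Usage (not executed here):
#   import subprocess
#   subprocess.run(build_search_command(pick_python(), base_dir, "diffusion", 10), check=True)
```

Passing the list straight to `subprocess.run` without `shell=True` keeps a multi-word query such as "computer vision" as one argument and avoids shell interpretation of user text.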
Topic sync workflow
- Tell the user to maintain {rootDir}/topics.json, or seed it from {baseDir}/references/topics.example.json.
- For recurring daily updates, run:
python {baseDir}/scripts/sync_arxiv_topics.py --daily --root-dir <root-dir>
- For manual backfill, run:
python {baseDir}/scripts/sync_arxiv_topics.py --start-date YYYY-MM-DD --end-date YYYY-MM-DD --root-dir <root-dir>
- Read <root-dir>/runs/<capture-date>/run_manifest.md first.
- Each captured paper lives at topics/<topic-slug>/<capture-date>/<paper-id>__<title-slug>/.
- Expect each paper directory to contain paper.pdf, paper.md, metadata.json, and summary.md.
- The batch summary is template-based and grounded in the abstract plus converted Markdown; treat it as a review aid, not a substitute for reading the paper.
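As a rough sketch of how the archive layout above could be checked after a sync, the helpers below list paper directories for one capture date and report missing files. The function names are hypothetical, and the code assumes that topics/ sits directly under the root directory.

```python
from pathlib import Path
from typing import List

# File names listed in the topic sync workflow.
EXPECTED_FILES = ("paper.pdf", "paper.md", "metadata.json", "summary.md")


def paper_dirs(root_dir: Path, capture_date: str) -> List[Path]:
    """Find topics/<topic-slug>/<capture-date>/<paper-id>__<title-slug>/ dirs."""
    pattern = f"topics/*/{capture_date}/*__*"
    return sorted(p for p in root_dir.glob(pattern) if p.is_dir())


def missing_files(paper_dir: Path) -> List[str]:
    """Return the expected files that are absent from one paper directory."""
    return [name for name in EXPECTED_FILES if not (paper_dir / name).is_file()]
```

A directory that reports missing files is a cue to reread run_manifest.md for that capture date rather than to summarize from partial output.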
Fetch workflow
- Choose an output directory:
- If the user gives one, use it.
- Otherwise write to ./artifacts/arxiv/<paper-id>/ in the current workspace.
- Run the converter:
python {baseDir}/scripts/arxiv_to_md.py <paper-id-or-url> --output-dir <target-dir>
- Read the generated paper.pdf, paper.md, and metadata.json.
- Summarize the paper in Markdown.
- Save the summary to <target-dir>/summary.md if the user asked for files. Otherwise return the summary directly in chat.
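The output-directory choice and converter call above might look like the following sketch. `resolve_output_dir` and `build_fetch_command` are hypothetical helper names; only the script path and the --output-dir flag come from this section.

```python
from pathlib import Path
from typing import List, Optional


def resolve_output_dir(paper_id: str, user_dir: Optional[str] = None) -> Path:
    """Use the user's directory if given, else ./artifacts/arxiv/<paper-id>/."""
    if user_dir:
        return Path(user_dir)
    # Note: an old-style ID containing "/" would create a nested directory here.
    return Path("artifacts") / "arxiv" / paper_id


def build_fetch_command(
    python_cmd: str, base_dir: str, paper: str, target_dir: Path
) -> List[str]:
    """Build argv for arxiv_to_md.py with the ID or URL as a single argument."""
    return [
        python_cmd,
        f"{base_dir}/scripts/arxiv_to_md.py",
        paper,
        "--output-dir", str(target_dir),
    ]
```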
Summary format
Use the headings in {baseDir}/references/summary-format.md.
Keep the summary grounded in the generated Markdown. If the conversion falls back to abstract-only mode, say so explicitly in the summary.
Safety
- Pass IDs, URLs, and keywords as single CLI arguments. Do not splice untrusted text into shell pipelines.
- Only pass raw arXiv IDs or HTTPS arXiv URLs; reject arbitrary third-party URLs.
- TLS verification is strict. If requests fail because your machine lacks a valid CA bundle, install certifi or fix the system trust store.
- arXiv source archives are processed in memory, only .tex members are read, and suspicious paths plus oversized payloads are rejected before parsing.
- Date windows use arXiv submittedDate and inclusive YYYY-MM-DD boundaries.
- Do not invent claims that are not supported by paper.md or search_results.md.
- Do not reintroduce hardcoded category or alias mappings; keep search behavior keyword-only.
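To illustrate the inclusive date-window rule above, here is a minimal sketch of strict YYYY-MM-DD parsing and a boundary-inclusive check. The helper names are hypothetical, and the bundled scripts may validate dates differently.

```python
import re
from datetime import date, datetime

# Strict shape check: strptime alone would also accept "2024-1-5".
DATE_RE = re.compile(r"\d{4}-\d{2}-\d{2}")


def parse_day(value: str) -> date:
    """Parse a strict YYYY-MM-DD string; raise ValueError otherwise."""
    if not DATE_RE.fullmatch(value):
        raise ValueError(f"expected YYYY-MM-DD, got {value!r}")
    return datetime.strptime(value, "%Y-%m-%d").date()


def in_window(submitted: date, start: date, end: date) -> bool:
    """Inclusive on both ends: start <= submitted <= end."""
    if start > end:
        raise ValueError("start date is after end date")
    return start <= submitted <= end
```

With inclusive boundaries, a paper submitted on the end date itself counts as inside the window.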