Agent Survey Corpus

v1.0.0

Download a small corpus of open-access arXiv survey/review PDFs about LLM agents and extract text for style learning. **Trigger**: agent survey corpus, ref c...


Install

OpenClaw Prompt Flow

Install with OpenClaw

Best for remote or guided setup. Copy the exact prompt, then paste it into OpenClaw for willoscar/agent-survey-corpus.

Prompt Preview: Install & Setup
Install the skill "Agent Survey Corpus" (willoscar/agent-survey-corpus) from ClawHub.
Skill page: https://clawhub.ai/willoscar/agent-survey-corpus
Keep the work scoped to this skill only.
After install, inspect the skill metadata and help me finish setup.
Use only the metadata you can verify from ClawHub; do not invent missing requirements.
Ask before making any broader environment changes.

Command Line

CLI Commands

Use the direct CLI path if you want to install manually and keep every step visible.

OpenClaw CLI

Bare skill slug

openclaw skills install agent-survey-corpus

ClawHub CLI


npx clawhub@latest install agent-survey-corpus
Security Scan

VirusTotal: Benign
OpenClaw: Benign (high confidence)
Purpose & Capability
The name and description (download arXiv survey PDFs and extract text for style analysis) align with the files and code. The downloader targets only the arxiv.org export API and arxiv.org/pdf URLs, and writes outputs under ref/agent-surveys/, which is consistent with the stated purpose.
Instruction Scope
SKILL.md instructs the agent to edit ref/agent-surveys/arxiv_ids.txt and run scripts/run.py with workspace and max-pages options; the script reads that id list, fetches metadata, downloads the PDFs, and extracts the first N pages to text. The instructions do not ask the agent to read unrelated files or secrets.
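Based on the scan's description, the per-id fetch logic needs two URLs per entry: one for Atom metadata from arXiv's export API and one for the PDF. A minimal sketch (the helper names are hypothetical; the actual scripts/run.py may be structured differently, though the endpoints are arXiv's real ones):

```python
def read_ids(path: str) -> list[str]:
    """Read one arXiv id per line, skipping blanks and # comments."""
    with open(path) as f:
        return [ln.strip() for ln in f if ln.strip() and not ln.startswith("#")]

def arxiv_urls(arxiv_id: str) -> tuple[str, str]:
    """Build the metadata and PDF URLs for one arXiv id."""
    # Atom metadata from the export API, the PDF from arxiv.org/pdf.
    meta_url = f"https://export.arxiv.org/api/query?id_list={arxiv_id}"
    pdf_url = f"https://arxiv.org/pdf/{arxiv_id}"
    return meta_url, pdf_url
```

Both hosts match the allowlist the scan reports, so a reviewer can confirm the script's network surface by grepping for these two URL patterns.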
Install Mechanism
There is no install spec (instruction-only install), which is low risk. However, the Python script imports third-party packages (PyMuPDF, imported as fitz; other tooling files reference yaml and similar libraries), yet the skill declares no Python dependencies or installation steps, so users must install the required packages manually before running.
Credentials
The skill requires no environment variables, no credentials, and no privileged config paths. Network access is necessary and explicitly documented; all network calls go to export.arxiv.org and arxiv.org only.
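A host allowlist like the one the scan describes can be verified in a few lines. This sketch is not taken from the skill's code; it simply shows how to reject any URL outside the two documented hosts before fetching:

```python
from urllib.parse import urlparse

# The two hosts the security scan says the skill contacts.
ALLOWED_HOSTS = {"arxiv.org", "export.arxiv.org"}

def is_allowed(url: str) -> bool:
    """True only if the URL's exact hostname is on the allowlist."""
    # Exact-match on hostname, so "arxiv.org.evil.com" is rejected.
    return urlparse(url).hostname in ALLOWED_HOSTS
```

Checking the exact hostname (rather than a substring) matters: a substring test would also accept lookalike domains.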
Persistence & Privilege
The always flag is false, and the skill does not request persistent platform privileges. It writes files only to the provided workspace and does not modify other skills or system-level agent settings.
Assessment
This skill is coherent and limited in scope, but take these practical steps before running it:

  1. Run it in an explicit workspace directory (e.g., a temp folder) so PDFs and text are confined and not accidentally committed to a repo; SKILL.md notes .gitignore, but verify your repo ignores ref/**/pdfs/ and ref/**/text/.
  2. Install the required Python packages (PyMuPDF/pymupdf, plus any YAML libraries if you use the other tooling files); the skill does not provide an install step.
  3. Confirm network access is acceptable and that downloading the listed arXiv IDs is permitted for your use case (arXiv PDFs are generally open-access, but check licenses for reuse).
  4. Inspect and control ref/agent-surveys/arxiv_ids.txt before running so you only download expected papers.

If you want stricter isolation, run the script inside a disposable container or VM.

Like a lobster shell, security has layers — review code before you run it.

Runtime requirements

Any bin: python3, python
latest: vk971s5amd03fc3xf6nkm2fhcr98376zj
141 downloads · 0 stars · 1 version
Updated 1 mo ago
v1.0.0 · MIT-0

Agent Survey Corpus (arXiv PDFs → text extracts)

Goal: create a small, local reference library so you can learn from real agent surveys when refining:

  • C2 outline structure (paper-like sectioning)
  • C4 tables/claims organization
  • C5 writing style and density

This is intentionally not part of the pipeline; it is an optional, repo-level toolkit.

Inputs

  • ref/agent-surveys/arxiv_ids.txt

Outputs

  • ref/agent-surveys/pdfs/
  • ref/agent-surveys/text/
  • ref/agent-surveys/STYLE_REPORT.md (tracked; auto-generated summary)

Workflow

  1. Edit ref/agent-surveys/arxiv_ids.txt (one arXiv id per line).
  2. Run the downloader to fetch PDFs and extract the first N pages to text.
  3. Skim the extracted text under ref/agent-surveys/text/:
    • look at section counts (H2), subsection granularity (H3), and how the sections transition between chapters.
    • identify repeated rhetorical patterns you want the pipeline writer to imitate.

Script

Quick Start

  • python scripts/run.py --help
  • python scripts/run.py --workspace . --max-pages 20

All Options

  • --workspace <dir> (use . to write into repo root)
  • --inputs <semicolon-separated> (default: ref/agent-surveys/arxiv_ids.txt)
  • --max-pages <N> (default: 20)
  • --sleep <seconds> (default: 1.0)
  • --overwrite (re-download + re-extract)
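The option list maps naturally onto argparse. This is a sketch of how the documented flags might be declared, with defaults copied from the list above; it is an assumption about run.py's interface, not its actual code:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """Declare the flags documented in All Options (defaults as listed)."""
    p = argparse.ArgumentParser(
        description="Download arXiv survey PDFs and extract text.")
    p.add_argument("--workspace", default=".",
                   help="root dir for ref/ outputs (use . for repo root)")
    p.add_argument("--inputs", default="ref/agent-surveys/arxiv_ids.txt",
                   help="semicolon-separated list of id files")
    p.add_argument("--max-pages", type=int, default=20)
    p.add_argument("--sleep", type=float, default=1.0)
    p.add_argument("--overwrite", action="store_true",
                   help="re-download and re-extract existing files")
    return p
```

argparse turns --max-pages into args.max_pages automatically, and action="store_true" makes --overwrite a simple boolean toggle.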

Examples

  • Download/extract into repo root ref/:
    • python scripts/run.py --workspace . --max-pages 20
  • Download/extract into a specific folder (treated as workspace root):
    • python scripts/run.py --workspace /tmp/surveys --max-pages 30

Troubleshooting

  • Download fails / timeout: rerun with a larger --sleep, or try fewer ids.
  • Text extract is empty: the PDF may be scanned; try another survey or increase --max-pages.
  • Files showing up in git status: PDFs/text are ignored via .gitignore (ref/**/pdfs/, ref/**/text/).
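The first troubleshooting item (rerun with a larger --sleep) can be folded into a small retry helper. fetch_with_retry and its linear backoff are illustrative, not the skill's actual logic; the fetch callable is injected so the pattern is testable without network access:

```python
import time

def fetch_with_retry(fetch, url, retries=3, sleep=1.0):
    """Call fetch(url), sleeping a growing delay between failed attempts."""
    for attempt in range(retries):
        try:
            return fetch(url)
        except OSError:
            if attempt == retries - 1:
                raise  # out of attempts: surface the original error
            time.sleep(sleep * (attempt + 1))  # linear backoff, mirrors --sleep
```

Catching only OSError keeps genuine bugs (bad ids, parse errors) loud while smoothing over transient timeouts.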
