Install
openclaw skills install arxiv-search-collector
Model-driven arXiv retrieval workflow for building a paper set with a manual language parameter: initialize a run, fetch metadata for each model-designed query, let the model filter irrelevant items per query by keep indexes, then merge and dedupe into per-paper metadata directories. Use when query planning and relevance filtering should be done by the model, not rule-based heuristics.
Use this skill when you want model-led query planning and model-led relevance filtering.
Scripts are tools. The model performs the reasoning and decisions:
python3 scripts/init_collection_run.py \
--output-root /path/to/data \
--topic "LLM applications in Lean 4 formalization" \
--keywords "Lean 4,LLM,formalization" \
--categories "cs.AI,cs.LO" \
--target-range 5-10 \
--lookback 30d \
--language English
This creates a run directory with task_meta.json, task_meta.md, query_results/, and query_selection/.
--language must be set manually for each collection run. If --language is non-English (for example Chinese), generated markdown files are written in that language:
- task_meta.md
- query_results/<label>.md
- <arxiv_id>/metadata.md
- papers_index.md
Follow these rules before running per-query fetch:
- Use 3 queries for small/medium targets (2-5, 5-10).
- Use 4 queries for larger targets (10-50 or above).
- Let target_max be the upper bound of the target range.
- target_per_query = ceil(target_max / query_count).
- max_results = target_per_query * 2 (or * 3 when recall is more important).
- Example: target range 5-10, query count 3 -> target_per_query=4 -> each query fetches 8-12.
- Use OR inside the same semantic group (synonyms), and AND across groups.
- Within a group, use OR to increase recall:
  - LLM OR "large language model" OR AI
  - "Lean 4" OR Lean OR "formal language"
- Across groups, use AND to keep relevance:
  - (LLM-group) AND (Lean-group)
- Template: (<domain terms with OR>) AND (<method/model terms with OR>) [AND <optional constraint terms>]
Theme A: LLM applications in Lean 4 formalization
- all:"LLM applications in Lean 4 formalization"
- (all:"Lean 4" OR all:"Lean" OR all:"formal language") AND (all:"LLM" OR all:"large language model" OR all:"AI")
- (all:"Lean" OR all:"formalization") AND (all:"LLM" OR all:"large language model") AND all:"theorem proving"
- (all:"Lean" OR all:"proof assistant") AND (all:"AI" OR all:"LLM")
Theme B: agentic tool use for code generation
- all:"agentic tool use code generation"
- (all:"agentic" OR all:"autonomous agent") AND (all:"LLM" OR all:"large language model")
- (all:"tool use" OR all:"function calling") AND (all:"coding assistant" OR all:"code generation")
Theme C: multimodal reasoning with retrieval
- all:"multimodal reasoning retrieval"
- (all:"multimodal" OR all:"vision language") AND (all:"retrieval" OR all:"RAG")
- (all:"multimodal model" OR all:"vision language model") AND (all:"reasoning" OR all:"tool use")
Model defines queries manually, for example:
- all:"Lean 4"
- all:"LLM formalization"
- all:"AI formal verification"
Recommended batch mode (safe defaults, serial execution):
python3 scripts/fetch_queries_batch.py \
--run-dir /path/to/run-dir \
--plan-json /path/to/query_plan.json
In batch mode, the script auto-applies:
- --min-interval-sec 5
- --retry-max 4
- --retry-base-sec 5
- --retry-max-sec 120
- --retry-jitter-sec 1
- a rate-state file (<run_dir>/.runtime/arxiv_api_state.json) for throttling
- max_results derived from target_range and query count (default oversample x2, cap 60)
- settings read from task_meta.json
Minimal query_plan.json only needs label and query.
See references/query-plan-format.md.
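The batch-mode sizing rule (upper bound of the target range, divided across queries, oversampled x2, capped at 60) can be sketched as below. This is an illustration rather than the script's own code: the function name auto_max_results is hypothetical, and the plan layout (a JSON list of objects carrying only label and query) is an assumption; references/query-plan-format.md remains the authority on the real format.

```python
import json
import math


def auto_max_results(target_range: str, query_count: int,
                     oversample: int = 2, cap: int = 60) -> int:
    """Per-query fetch size: ceil(target_max / query_count) * oversample, capped."""
    target_max = int(target_range.split("-")[1])  # upper bound of e.g. "5-10"
    target_per_query = math.ceil(target_max / query_count)
    return min(target_per_query * oversample, cap)


# Hypothetical minimal plan: only label and query per entry (layout assumed).
plan = [
    {"label": "lean4", "query": 'all:"Lean 4"'},
    {"label": "llm-formalization", "query": 'all:"LLM formalization"'},
    {"label": "ai-formal-verification", "query": 'all:"AI formal verification"'},
]

plan_json = json.dumps(plan, indent=2)
per_query = auto_max_results("5-10", query_count=len(plan))
print(per_query)  # 8
```

With oversample=3 the same inputs give 12, which matches the 8-12 range quoted for a 5-10 target split over 3 queries.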
You normally do not need to set fetch-control args manually.
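To give a feel for what the retry defaults imply, here is one common way such parameters are combined: exponential backoff from a base delay, capped at a maximum, with uniform random jitter added. The function retry_delays is hypothetical and the formula is an assumption; the script may compute its waits differently.

```python
import random
from typing import List, Optional


def retry_delays(retry_max: int = 4, base_sec: float = 5.0,
                 max_sec: float = 120.0, jitter_sec: float = 1.0,
                 rng: Optional[random.Random] = None) -> List[float]:
    """Delay before each retry: base * 2**attempt, capped, plus uniform jitter."""
    rng = rng or random.Random()
    delays = []
    for attempt in range(retry_max):
        backoff = min(base_sec * (2 ** attempt), max_sec)
        delays.append(backoff + rng.uniform(0.0, jitter_sec))
    return delays


# With jitter disabled the schedule is deterministic.
print(retry_delays(jitter_sec=0.0))  # [5.0, 10.0, 20.0, 40.0]
```

Under this reading, the cap of 120 seconds only comes into play from the sixth retry onward, so the default of 4 retries never reaches it.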
If you need one-by-one manual fetch, run each query:
python3 scripts/fetch_query_metadata.py \
--run-dir /path/to/run-dir \
--label lean4 \
--query 'all:"Lean 4"' \
--max-results 30 \
--min-interval-sec 5 \
--retry-max 4 \
--language English
Output files:
- query_results/<label>.json (indexed full metadata list)
- query_results/<label>.md (human-readable preview)
Date range is applied directly in arXiv API search_query via submittedDate:[... TO ...].
No second local date-filter pass is performed.
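A query that follows the OR-within-group, AND-across-groups pattern and carries its own date window could be assembled roughly as follows. The helper names or_group and build_search_query are hypothetical, and how the script actually derives the window from the lookback setting is an assumption; the only external fact relied on is that the arXiv API accepts submittedDate bounds in YYYYMMDDHHMM form (GMT).

```python
from datetime import datetime, timedelta, timezone
from typing import List


def or_group(terms: List[str], field: str = "all") -> str:
    """Join synonyms of one semantic group with OR, quoting each term."""
    return "(" + " OR ".join(f'{field}:"{t}"' for t in terms) + ")"


def build_search_query(groups: List[List[str]], lookback_days: int,
                       now: datetime) -> str:
    """AND the OR-groups together, then AND a submittedDate window."""
    topic = " AND ".join(or_group(g) for g in groups)
    start = now - timedelta(days=lookback_days)
    fmt = "%Y%m%d%H%M"  # YYYYMMDDHHMM, interpreted by arXiv as GMT
    window = f"submittedDate:[{start.strftime(fmt)} TO {now.strftime(fmt)}]"
    return f"{topic} AND {window}"


query = build_search_query(
    [["Lean 4", "Lean", "formal language"], ["LLM", "large language model", "AI"]],
    lookback_days=30,
    now=datetime(2025, 1, 31, 0, 0, tzinfo=timezone.utc),  # fixed for the example
)
print(query)
```

Because the window is part of the query string itself, the API returns only in-range papers and nothing needs to be filtered by date afterwards.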
Rate-limit controls in fetch_query_metadata.py:
- --min-interval-sec (default 5.0)
- --retry-max (default 4)
- --retry-base-sec (default 5.0)
- --retry-max-sec (default 120.0)
- --retry-jitter-sec (default 1.0)
- --rate-state-path (optional override; default is <run_dir>/.runtime/arxiv_api_state.json)
- --force to bypass the cache and re-fetch
For each query list, the model reads indexed results and decides what to keep.
Use keep specs by index and/or arXiv ID when merging.
To explicitly drop one weak query in later iterations, set that label to an empty keep list in selection-json.
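A keep spec of the form label:item,item, where each item is an index or an arXiv ID, could be interpreted along these lines. This is a sketch only: parse_keep_spec and select are hypothetical helpers, the arxiv_id field name is assumed, indexes are assumed to be 0-based list positions, and the IDs are placeholders rather than real papers.

```python
from typing import Dict, List, Tuple


def parse_keep_spec(spec: str) -> Tuple[str, List[str]]:
    """Split 'label:0,2,4' into ('label', ['0', '2', '4'])."""
    label, _, raw = spec.partition(":")
    return label, [tok.strip() for tok in raw.split(",") if tok.strip()]


def select(results: List[Dict], keep: List[str]) -> List[Dict]:
    """Pick results by list position or by arXiv ID, preserving keep order."""
    by_id = {r["arxiv_id"]: r for r in results}  # field name is an assumption
    chosen = []
    for tok in keep:
        # All-digit tokens are treated as indexes, anything else as an ID.
        chosen.append(results[int(tok)] if tok.isdigit() else by_id[tok])
    return chosen


# Placeholder records standing in for one query's indexed result list.
results = [{"arxiv_id": f"0000.0000{i}"} for i in range(5)]
label, keep = parse_keep_spec("lean4:0,2,0000.00004")
picked = select(results, keep)
print(label, [p["arxiv_id"] for p in picked])
# lean4 ['0000.00000', '0000.00002', '0000.00004']
```

An empty keep list simply selects nothing, which is how a dropped label behaves in this sketch.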
python3 scripts/merge_selected_papers.py \
--run-dir /path/to/run-dir \
--keep lean4:0,2,4 \
--keep llm-formalization:1,3 \
--language English
or with selection-json:
{
"lean4-round1": [0, 2, 4],
"lean4-round2": [],
"formalization-round2": [1, 3, 5]
}
An empty list means this query label is intentionally dropped (keep 0).
This writes final outputs:
- <arxiv_id>/metadata.json
- <arxiv_id>/metadata.md
- papers_index.json
- papers_index.md
If relevance is weak or the final count is insufficient after Step 4, iterate:
- Review papers_index.md and per-paper metadata quality.
- Revise the queries (adjust OR terms, keep cross-group AND constraints).
- Merge the updated selection incrementally:
python3 scripts/merge_selected_papers.py \
--run-dir /path/to/run-dir \
--incremental \
--selection-json /path/to/updated_selection.json \
--language English
Incremental behavior:
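- Previous selections are loaded from query_selection/selected_by_query.json.
- Labels present in selection-json override previous selections for those labels.
- A label can be dropped by setting it to [].
Stop retrying when: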
If relevant papers are genuinely scarce, it is valid to finish below the original minimum target range.
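The override-then-dedupe behavior of an incremental merge can be pictured with the sketch below. It is an illustration, not the script's logic: merge_selections and dedupe_by_id are hypothetical names, plain dicts stand in for the selection and result files, the arxiv_id field name is assumed, and the IDs are placeholders.

```python
from typing import Dict, List


def merge_selections(previous: Dict[str, List[int]],
                     updates: Dict[str, List[int]]) -> Dict[str, List[int]]:
    """Labels in `updates` replace earlier entries; other labels are kept."""
    merged = dict(previous)
    merged.update(updates)
    return merged


def dedupe_by_id(selected: Dict[str, List[int]],
                 results_by_label: Dict[str, List[dict]]) -> List[dict]:
    """Flatten kept indexes into one list with each arXiv ID appearing once."""
    seen, papers = set(), []
    for label, indexes in selected.items():
        for i in indexes:  # an empty list contributes nothing
            paper = results_by_label[label][i]
            if paper["arxiv_id"] not in seen:
                seen.add(paper["arxiv_id"])
                papers.append(paper)
    return papers


previous = {"lean4-round1": [0, 2, 4], "lean4-round2": [1]}
updates = {"lean4-round2": [], "formalization-round2": [1, 3, 5]}
merged = merge_selections(previous, updates)

# Placeholder result lists; the two non-empty queries overlap on one ID.
results_by_label = {
    "lean4-round1": [{"arxiv_id": f"0000.0000{i}"} for i in range(5)],
    "lean4-round2": [],
    "formalization-round2": [{"arxiv_id": f"0000.0000{i}"} for i in range(3, 9)],
}
papers = dedupe_by_id(merged, results_by_label)
print(len(papers))  # 5
```

Six indexes are kept across the two active labels, but one paper appears in both result lists, so five unique papers remain.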
- --max-results.
- Use --force when necessary.
- On 429 Too Many Requests, retry later and/or increase --min-interval-sec.
- See references/io-contract.md for exact files and schema.
This skill is a sub-skill of arxiv-summarizer-orchestrator.
Pipeline position:
- Stage A: arxiv-search-collector (this skill)
- Stage B: arxiv-paper-processor
- Stage C: arxiv-batch-reporter
This skill produces the initial paper-set structure and metadata that Stage B and Stage C depend on.