Skill flagged — suspicious patterns detected

ClawHub Security flagged this skill as suspicious. Review the scan results before using.

arxivkb

v1.0.1

Local arXiv paper manager with semantic search. Crawls arXiv categories, downloads PDFs, chunks content, and indexes with FAISS + Ollama embeddings. No cloud...

Install

OpenClaw Prompt Flow

Install with OpenClaw

Best for remote or guided setup. Copy the exact prompt, then paste it into OpenClaw for camopel/arxivkb.

Prompt Preview: Install & Setup
Install the skill "arxivkb" (camopel/arxivkb) from ClawHub.
Skill page: https://clawhub.ai/camopel/arxivkb
Keep the work scoped to this skill only.
After install, inspect the skill metadata and help me finish setup.
Required binaries: python3, ollama
Use only the metadata you can verify from ClawHub; do not invent missing requirements.
Ask before making any broader environment changes.

CLI Commands

Use the direct CLI path if you want to install manually and keep every step visible.

OpenClaw CLI

openclaw skills install arxivkb

ClawHub CLI

npx clawhub@latest install arxivkb
Security Scan
VirusTotal: Suspicious
OpenClaw: Suspicious (medium confidence)
Purpose & Capability
The skill name/description align with the included scripts: it crawls arXiv, downloads PDFs, extracts and chunks text, embeds via Ollama (nomic-embed-text) and indexes with FAISS/SQLite. Required binaries (python3, ollama) match the design.
Instruction Scope
Runtime instructions and code operate within the declared purpose (arXiv API + local embedding). However, SKILL.md/README claim defaults and behaviors that do not fully match the code: SKILL.md says the default data dir is `~/workspace/arxivkb`, while install.py/cli/db default to `~/Downloads/ArXivKB`. SKILL.md and the README mention a `config.json` and an `akb` CLI wrapper, but the installer writes service/plist files that reference `--config {config.json}` without creating that config file or placing an `akb` executable on PATH. These mismatches can cause unexpected file placement and failing background jobs.
Install Mechanism
The registry entry has no formal install spec, but the bundled scripts/install.py runs pip installs and calls `ollama pull`. If executed, it will pip-install packages (possibly with --user), pull a model from Ollama (a network download), create data directories, and write systemd/launchd files. No unusual remote or obfuscated download URLs are used, but the install script performs network operations and writes persistent service files to the user's profile.
Credentials
No secrets or cloud API keys are requested. The only external endpoints contacted are arXiv (public) and a local Ollama server (http://localhost:11434). An optional env var ARXIVKB_DATA_DIR is supported for data directory override. No unrelated credentials or config paths are requested.
Persistence & Privilege
The installer writes user-level service definitions (systemd timer in ~/.config/systemd/user and launchd plist in ~/Library/LaunchAgents) to schedule daily crawls. This creates persistent background network activity (periodic arXiv downloads and embedding). While expected for a crawler, users should be aware this grants the skill ongoing presence on the host. always:false mitigates global forced inclusion, but the installer still modifies user startup/service configuration.
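
For readers auditing the persistence step, here is a minimal sketch of what writing a user-level systemd service plus daily timer typically looks like. The unit names, descriptions, and ExecStart path below are hypothetical illustrations, not the skill's actual installer output; inspect scripts/install.py for the real content.

from pathlib import Path

unit_dir = Path("~/.config/systemd/user").expanduser()
unit_dir.mkdir(parents=True, exist_ok=True)

# ExecStart path below is hypothetical; the real unit would point at the
# installed skill's cli.py
service = """[Unit]
Description=ArXivKB daily ingest (illustrative)

[Service]
Type=oneshot
ExecStart=/usr/bin/python3 %h/skills/arxivkb/scripts/cli.py ingest
"""

timer = """[Unit]
Description=Run ArXivKB ingest once a day (illustrative)

[Timer]
OnCalendar=daily
Persistent=true

[Install]
WantedBy=timers.target
"""

(unit_dir / "arxivkb-ingest.service").write_text(service)
(unit_dir / "arxivkb-ingest.timer").write_text(timer)
# Activation would then be: systemctl --user enable --now arxivkb-ingest.timer

On macOS the equivalent is a launchd plist in ~/Library/LaunchAgents; either way the job fires on its schedule, independent of any chat session.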
What to consider before installing
This package appears to be what it says: a local arXiv crawler with FAISS search. But it has a few sloppy/inconsistent implementation details and will install persistent background jobs. Before running the installer or giving it shell access:
1) Inspect scripts/install.py and the generated systemd/launchd files (it writes to ~/.config/systemd/user and ~/Library/LaunchAgents) and confirm you want a daily background ingest.
2) Note the data-directory mismatch: SKILL.md/README mention ~/workspace/arxivkb but the scripts use ~/Downloads/ArXivKB; set ARXIVKB_DATA_DIR or edit the defaults to control where PDFs/DB/index are stored.
3) The systemd/launchd service references a --config {config.json} that the installer does not create; background runs may fail unless you create and populate that config or adapt the service.
4) The installer will pip-install packages and run `ollama pull nomic-embed-text` (a model download); expect network activity and non-trivial disk usage.
5) Run the installer inside a virtual environment if you want to avoid global/user pip changes.
6) Ensure Ollama is installed and run intentionally, since it accepts local HTTP requests; embedding calls target localhost only.
If you want higher assurance, run the tool manually (invoke scripts/cli.py directly) instead of activating the installer's automatic timer, and verify paths and config behavior first.

Runtime requirements

Bins: python3, ollama
Latest: vk97bwy48av0qcaff398dj5mgd581m1wg
661 downloads · 0 stars · 2 versions
Updated 14h ago
v1.0.1 · MIT-0

ArXivKB — Science Knowledge Base

Why This Skill?

🏠 100% local — crawls arXiv's free API, embeds with Ollama (nomic-embed-text), indexes in FAISS + SQLite. No cloud cost.

🔍 Semantic search on paper content — FAISS indexes PDF chunks (not just abstracts), so you find papers by what they contain.

📂 arXiv category-based — tracks official arXiv categories (155 available, 8 groups). No free-text queries.

🧹 Auto-cleanup — configurable expiry deletes old papers, PDFs, and chunks.

Install

python3 scripts/install.py

Works on macOS and Linux. Installs Python deps (faiss-cpu, pdfplumber, tiktoken, arxiv, numpy), pulls nomic-embed-text via Ollama, creates data directories and DB.
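
As a rough illustration of those steps (this is not the actual scripts/install.py; the package list comes from the description above, while the --user flag and directory layout are assumptions):

import subprocess, sys
from pathlib import Path

deps = ["faiss-cpu", "pdfplumber", "tiktoken", "arxiv", "numpy"]
subprocess.run([sys.executable, "-m", "pip", "install", "--user", *deps], check=True)

# Pull the embedding model through the local Ollama daemon (a network download)
subprocess.run(["ollama", "pull", "nomic-embed-text"], check=True)

# Create the data layout (default directory per the Configuration table below)
data_dir = Path("~/workspace/arxivkb").expanduser()
for sub in ("pdfs", "faiss"):
    (data_dir / sub).mkdir(parents=True, exist_ok=True)

Running the real installer inside a virtual environment avoids the user-level pip changes noted in the security scan above.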

Prerequisites

  • Ollama — must be installed and running (ollama serve)
  • Python 3.10+

Quick Start

# 1. Add arXiv categories to track
akb categories add cs.AI cs.CV cs.LG

# 2. Browse all available categories
akb categories browse

# 3. Ingest recent papers (last 7 days)
akb ingest

# 4. Check stats
akb stats

Categories

akb categories list                    # Show enabled categories
akb categories browse                  # Browse all 155 arXiv categories
akb categories browse robotics         # Filter by keyword
akb categories add cs.AI cs.RO         # Enable categories
akb categories delete cs.AI            # Disable a category

Categories are official arXiv codes (e.g. cs.AI, eess.IV, q-fin.ST). The full taxonomy is built in.

Ingestion

akb ingest                    # Crawl, download PDFs, chunk, embed
akb ingest --days 14          # Look back 14 days
akb ingest --dry-run          # Preview only
akb ingest --no-pdf           # Index abstracts only (faster)

Pipeline: arXiv API → PDF download → text extraction (pdfplumber) → chunking (tiktoken, 500 tokens, 50 overlap) → embedding (Ollama nomic-embed-text) → FAISS + SQLite.
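
To make the stages concrete, here is an illustrative end-to-end sketch built only from the public APIs of the listed dependencies (arxiv, pdfplumber, tiktoken, requests, faiss). It is not the skill's own code: the function structure, the hard-coded cs.AI query, and the file names are assumptions.

import arxiv, pdfplumber, tiktoken, requests, faiss
import numpy as np
from pathlib import Path

DATA = Path("~/workspace/arxivkb").expanduser()
(DATA / "pdfs").mkdir(parents=True, exist_ok=True)
(DATA / "faiss").mkdir(parents=True, exist_ok=True)

# 1) arXiv API: newest papers in one enabled category
client = arxiv.Client()
search = arxiv.Search(query="cat:cs.AI", max_results=3,
                      sort_by=arxiv.SortCriterion.SubmittedDate)
papers = list(client.results(search))

# 2) PDF download and text extraction, then 3) token chunking
enc = tiktoken.get_encoding("cl100k_base")
chunks, owners = [], []                       # chunk texts and their paper ids
for paper in papers:
    pdf_name = f"{paper.get_short_id()}.pdf"
    paper.download_pdf(dirpath=str(DATA / "pdfs"), filename=pdf_name)
    with pdfplumber.open(DATA / "pdfs" / pdf_name) as pdf:
        text = "\n".join(page.extract_text() or "" for page in pdf.pages)
    tokens = enc.encode(text)
    for start in range(0, len(tokens), 500 - 50):   # 500-token chunks, 50 overlap
        chunks.append(enc.decode(tokens[start:start + 500]))
        owners.append(paper.get_short_id())

# 4) Embeddings from the local Ollama daemon (nomic-embed-text, 768 dims)
def embed(text: str) -> list[float]:
    r = requests.post("http://localhost:11434/api/embeddings",
                      json={"model": "nomic-embed-text", "prompt": text})
    r.raise_for_status()
    return r.json()["embedding"]

vectors = np.array([embed(c) for c in chunks], dtype="float32")
faiss.normalize_L2(vectors)                   # inner product == cosine similarity

# 5) FAISS IndexFlatIP over the vectors; paper/chunk metadata belongs in SQLite
index = faiss.IndexFlatIP(vectors.shape[1])
index.add(vectors)
faiss.write_index(index, str(DATA / "faiss" / "arxivkb.faiss"))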

Paper Details

akb paper 2401.12345    # Show title, abstract, categories, PDF status

Statistics

akb stats   # Papers, chunks, categories, DB size

Expiry & Cleanup

akb expire               # Delete papers older than 90 days (default)
akb expire --days 30     # Override: delete papers older than 30 days
akb expire --days 30 -y  # Skip confirmation

Configuration

No config file needed. Defaults:

Setting            Default                    Override
Data directory     ~/workspace/arxivkb        ARXIVKB_DATA_DIR env or --data-dir
Ollama endpoint    http://localhost:11434     — (hardcoded)
Embedding model    nomic-embed-text (768d)    — (hardcoded)
Chunk size         500 tokens, 50 overlap
Expiry             90 days                    --days flag
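
A small sketch of how such an override order is usually resolved; the precedence shown (flag first, then env var, then the documented default) is an assumption, so check scripts/cli.py for the actual behavior:

import os
from pathlib import Path

def resolve_data_dir(cli_value: str | None = None) -> Path:
    # Hypothetical precedence: --data-dir flag, then ARXIVKB_DATA_DIR, then default
    if cli_value:
        return Path(cli_value).expanduser()
    env_value = os.environ.get("ARXIVKB_DATA_DIR")
    if env_value:
        return Path(env_value).expanduser()
    return Path("~/workspace/arxivkb").expanduser()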

Data Layout

~/workspace/arxivkb/
├── arxivkb.db           # SQLite: papers, chunks, translations, categories
├── pdfs/                  # Downloaded PDF files ({arxiv_id}.pdf)
└── faiss/
    └── arxivkb.faiss    # FAISS IndexFlatIP (chunk embeddings)

DB Schema

  • papers: id, arxiv_id, title, abstract, categories, published, status, created_at
  • chunks: id, paper_id, section, chunk_index, text, faiss_id, created_at
  • translations: paper_id, language, abstract, created_at (PK: paper_id+language)
  • categories: code, description, group_name, enabled, added_at (155 entries)
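
The column lists above translate roughly into the following sqlite3 DDL; the types and constraints are guesses from the field names, not copied from the skill's db.py:

import sqlite3

conn = sqlite3.connect("arxivkb.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS papers (
    id INTEGER PRIMARY KEY, arxiv_id TEXT UNIQUE, title TEXT, abstract TEXT,
    categories TEXT, published TEXT, status TEXT, created_at TEXT
);
CREATE TABLE IF NOT EXISTS chunks (
    id INTEGER PRIMARY KEY, paper_id INTEGER REFERENCES papers(id),
    section TEXT, chunk_index INTEGER, text TEXT, faiss_id INTEGER, created_at TEXT
);
CREATE TABLE IF NOT EXISTS translations (
    paper_id INTEGER, language TEXT, abstract TEXT, created_at TEXT,
    PRIMARY KEY (paper_id, language)
);
CREATE TABLE IF NOT EXISTS categories (
    code TEXT PRIMARY KEY, description TEXT, group_name TEXT,
    enabled INTEGER, added_at TEXT
);
""")
conn.commit()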

💬 Chat Commands (OpenClaw Agent)

When this skill is installed, the agent recognizes /akb as a shortcut:

Command                  Action
/akb list                Show enabled categories
/akb add cs.AI cs.RO     Enable categories for crawling
/akb remove cs.AI        Disable a category
/akb browse              Browse all 155 arXiv categories
/akb browse robotics     Filter categories by keyword
/akb stats               Show paper/chunk/category counts
/akb help                Show available commands

The agent runs these via the akb CLI internally.

📱 PrivateApp Dashboard

A companion PWA dashboard is available. It provides:

  • Semantic search across paper content
  • Paper detail with abstract translation (on-demand via LLM)
  • Inline PDF viewing
  • Category browser
  • Stats (papers, chunks, categories)

Architecture

scripts/
├── cli.py             # CLI — categories, ingest, paper, stats, expire
├── db.py              # SQLite schema + CRUD
├── arxiv_crawler.py   # arXiv API search + PDF download
├── arxiv_taxonomy.py  # Full arXiv category taxonomy (155 categories)
├── pdf_processor.py   # PDF text extraction + tiktoken chunking
├── embed.py           # Ollama nomic-embed-text (768d, normalized)
├── faiss_index.py     # FAISS IndexFlatIP manager
├── search.py          # Semantic search: query → FAISS → group by paper
└── install.py         # One-command installer
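
For orientation, here is a sketch of the search path described for search.py above (query → FAISS → group by paper). The embed() helper and storage locations mirror the ingestion sketch earlier on this page and are assumptions rather than the skill's code.

import faiss, requests
import numpy as np
from pathlib import Path

DATA = Path("~/workspace/arxivkb").expanduser()

def embed(text: str) -> list[float]:
    r = requests.post("http://localhost:11434/api/embeddings",
                      json={"model": "nomic-embed-text", "prompt": text})
    r.raise_for_status()
    return r.json()["embedding"]

index = faiss.read_index(str(DATA / "faiss" / "arxivkb.faiss"))
query = np.array([embed("diffusion models for robot manipulation")], dtype="float32")
faiss.normalize_L2(query)

scores, faiss_ids = index.search(query, 20)   # top-20 chunk hits
# To group by paper, map each faiss_id back to its chunk row in SQLite
# (chunks.faiss_id -> paper_id) and keep the best-scoring chunk per paper.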
