RSS Fetcher

Unified RSS feed fetcher and manager. Incremental fetching, automatic deduplication, auto-tagging, source health monitoring, and HTML report generation.

MIT-0 · Free to use, modify, and redistribute. No attribution required.
0 · 38 · 1 current install · 1 all-time install
by noah@noah-1106
Security Scan
VirusTotal: Benign
OpenClaw: Benign (high confidence)
Purpose & Capability
The name and description describe an RSS fetcher, and the package provides Python scripts to initialize a local SQLite database, fetch feeds, auto-tag, deduplicate, and generate HTML. The requested binary (python3) and config files align with that purpose.
Instruction Scope
Runtime instructions tell the agent/user to run the included scripts (init_db.py, fetch.py, generate_html.py, source.py, list.py). The fetch script performs network I/O (HTTP(S) requests) to every URL listed in config/sources.json and writes results into data/rss_fetcher.db and data/index.html. This is expected behaviour, but be aware that the default config contains many (100+) feeds and that the default concurrency is high (20 workers, up to 50), which will generate substantial outbound traffic and many requests to third-party sites.
Install Mechanism
There is no automated install step (instruction-only install), so nothing is downloaded or executed outside the shipped Python scripts. The code optionally imports feedparser if available but falls back to built-in urllib/regex parsing; no external URLs are used to install code.
Credentials
The skill requests no environment variables or credentials. All configuration is local (config/sources.json) and data is stored under the skill's data/ directory — no unexplained secrets or unrelated credentials are required.
Persistence & Privilege
The skill does not request persistent platform privileges (always:false) and does not modify other skills or system-wide settings. It writes its own database and static HTML files under the skill directory, which is consistent with its purpose.
Assessment
This package appears to do what it says: fetch RSS feeds, store them locally, and generate static HTML. Before installing/running:
  1. Review config/sources.json and remove any feeds you don't want scraped (the default contains 100+ feeds).
  2. Run in an environment where outbound HTTP(S) traffic and disk writes are acceptable (it will create data/rss_fetcher.db and data/index.html).
  3. Consider installing feedparser in your Python environment for more robust parsing; otherwise the fallback uses regex-based parsing.
  4. Reduce --workers if you want fewer concurrent connections.
  5. Inspect any truncated/omitted files if you need deeper assurance.
If you want, I can scan the remaining truncated files or point out the exact lines where feeds are requested and where files are written.

Like a lobster shell, security has layers — review code before you run it.

Current version: v1.1.0
Download zip
latest: vk971d4an7cy1v0vt8d1jvfs0n982yc00

License

MIT-0
Free to use, modify, and redistribute. No attribution required.

Runtime requirements

📰 Clawdis
Bins: python3

SKILL.md

RSS Fetcher - Unified RSS Fetching System

Core Features

  • Incremental Fetching - Only fetch new articles; deduplicate automatically based on URL hash
  • Auto-tagging - Prefer the feed's own RSS category; extract keywords from the title when it is absent
  • HTML Reports - Generate filterable static HTML pages with date/category/tag filters
  • Source Health Monitoring - Check RSS source availability, with batch checking
  • Category Management - Articles automatically inherit their source's category; multi-dimensional filtering
  • Timeout Setting - 30-second timeout per source to avoid long blocking
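The URL-hash deduplication described above can be sketched as follows. This is an illustrative sketch only: the hash function and the in-memory `seen_hashes` set are assumptions, and the shipped fetch.py may instead rely on the database's unique URL index.

```python
import hashlib

def url_hash(url: str) -> str:
    # SHA-256 of the trimmed URL; the real script may normalize URLs differently.
    return hashlib.sha256(url.strip().encode("utf-8")).hexdigest()

def filter_new(entries, seen_hashes):
    """Yield only entries whose URL hash has not been seen before."""
    for entry in entries:
        h = url_hash(entry["url"])
        if h not in seen_hashes:
            seen_hashes.add(h)
            yield entry

batch = [{"url": "https://example.com/a"}, {"url": "https://example.com/a"}]
print(len(list(filter_new(batch, set()))))  # → 1 (the duplicate is dropped)
```

Because the hash set persists across batches, re-fetching a feed only yields articles not seen in earlier runs, which is what makes the fetching incremental.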

Database Design

Table Structure

Table         Purpose               Core Fields
articles      Article data          id, source_id, category, title, url, published_at
tags          Tag definitions       id, name
article_tags  Article-tag relation  article_id, tag_id
fetch_logs    Fetch logs            source_id, started_at, found, new, status

Note: RSS sources are managed via the config/sources.json file and are not stored in the database.

Key Design

  • INTEGER Unix timestamps - faster queries and simpler comparisons
  • published_at NOT NULL - required; marked as UNRELIABLE_TIME (1970-01-01) when missing
  • URL unique index - ensures deduplication
  • Multi-tag support - one article can carry multiple tags
  • category field - inherited from the category configured in sources.json
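The timestamp convention above can be sketched as a small converter. The helper name `to_unix` is an assumption for illustration; the constant value 0 (i.e. 1970-01-01) follows the UNRELIABLE_TIME rule stated above.

```python
from email.utils import mktime_tz, parsedate_tz
from typing import Optional

UNRELIABLE_TIME = 0  # 1970-01-01; flags articles whose publish time is unknown

def to_unix(pub_date: Optional[str]) -> int:
    """Convert an RFC 2822 pubDate string to an INTEGER Unix timestamp."""
    if not pub_date:
        return UNRELIABLE_TIME
    parsed = parsedate_tz(pub_date)
    if parsed is None:  # unparseable date string
        return UNRELIABLE_TIME
    return mktime_tz(parsed)

print(to_unix("Thu, 01 Jan 1970 00:00:10 GMT"))  # → 10
print(to_unix(None))                             # → 0
```

Storing integers rather than date strings lets the published_at column be compared and indexed cheaply, which is the rationale given for the INTEGER choice.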

Quick Start

1. Initialize the Database

cd skills/rss_fetcher
python3 scripts/init_db.py

2. Configure RSS Sources

Edit config/sources.json to add your RSS sources:

{
  "sources": [
    {
      "id": "openai",
      "name": "OpenAI Blog",
      "url": "https://openai.com/blog/rss.xml",
      "category": "tech",
      "enabled": true
    }
  ]
}

3. Run a Fetch

# Fetch all sources (last 24 hours)
python3 scripts/fetch.py

# Fetch specific sources
python3 scripts/fetch.py --sources openai huggingface

# Fetch the last 48 hours
python3 scripts/fetch.py --hours 48

# Use more workers (default 20, max 50)
python3 scripts/fetch.py --workers 50

⚠️ Remember to regenerate the HTML report after fetching - newly fetched articles only appear in the browser after the page is regenerated:

python3 scripts/fetch.py && python3 scripts/generate_html.py

4. Generate the HTML Report

Note: You must regenerate the HTML page after fetching new articles to see the latest content.

# Fetch and immediately update the HTML (recommended workflow)
python3 scripts/fetch.py && python3 scripts/generate_html.py

# Generate the HTML only (when new data already exists)
python3 scripts/generate_html.py

# Open to view
open data/index.html  # macOS
# Or in a browser: file:///.../rss_fetcher/data/index.html

HTML report features:

  • 📅 Date Filter - start/end date selection
  • 🏷️ Category Filter - filter by article category
  • 🔍 Keyword Search - real-time title search
  • ☑️ Multi-tag Selection - combine multiple tags (AND logic)
  • 📊 Real-time Stats - show the filtered result count

5. Source Management

# Check the health of all sources
python3 scripts/source.py check

# View source statistics
python3 scripts/source.py stats

# Add a new source
python3 scripts/source.py add myblog "My Blog" "https://example.com/feed.xml" tech

# Disable/enable/remove a source
python3 scripts/source.py disable myblog
python3 scripts/source.py enable myblog
python3 scripts/source.py remove myblog

6. View the Article List

# View recent articles in a terminal table
python3 scripts/list.py

# View the last 48 hours
python3 scripts/list.py --hours 48

# View by category
python3 scripts/list.py --category tech

# JSON output
python3 scripts/list.py --json

Configuration Files

sources.json

{
  "_description": "RSS源配置文件 | RSS source config file",
  "_updated": "2026-03-15",
  "_total_sources": 111,
  "sources": [
    {
      "id": "openai",
      "name": "OpenAI Blog",
      "url": "https://openai.com/blog/rss.xml",
      "category": "tech",
      "enabled": true
    }
  ]
}

Field descriptions:

  • id - unique source identifier
  • name - display name
  • url - RSS feed URL
  • category - article category
  • enabled - whether the source is enabled

Categories can be freely defined; use any category name in sources.json.
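Given those fields, a loader might validate entries before fetching. This is a sketch under assumptions, not the shipped loading code: the helper name `load_sources` and the skip-invalid-entries policy are illustrative.

```python
import json

REQUIRED_FIELDS = ("id", "name", "url", "category", "enabled")

def load_sources(path="config/sources.json"):
    """Return the enabled sources, skipping entries missing required fields."""
    with open(path, encoding="utf-8") as f:
        cfg = json.load(f)
    return [
        src for src in cfg.get("sources", [])
        if all(field in src for field in REQUIRED_FIELDS) and src["enabled"]
    ]
```

Keys beginning with an underscore (_description, _updated, _total_sources) live outside the "sources" array, so a loader like this ignores them automatically.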


Auto-tagging System

Tag Generation Logic

  1. Prefer the feed's own RSS category - extract the <category> tag content
  2. Fallback keyword extraction - extract from the title when no category is present:
    • Rule matching (predefined rules such as AI/blockchain/stocks)
    • Noun extraction (capitalized English words, Chinese phrases)
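The two-step logic above can be sketched as follows. The rule subset, the function name `tags_for`, and the capitalized-word fallback regex are illustrative assumptions; the real TAG_RULES table in fetch.py is larger and also handles Chinese phrases.

```python
import re

# Illustrative subset; the real TAG_RULES table in fetch.py is larger.
TAG_RULES = {
    "AI": ("ai", "gpt", "machine learning"),
    "Blockchain": ("blockchain", "bitcoin", "crypto"),
}

def tags_for(title: str, rss_categories=None):
    """Prefer the feed's <category> values; otherwise fall back to title keywords."""
    if rss_categories:
        return list(rss_categories)
    lowered = title.lower()
    tags = [tag for tag, keywords in TAG_RULES.items()
            if any(kw in lowered for kw in keywords)]
    # Last resort: treat capitalized English words as candidate noun tags.
    return tags or re.findall(r"\b[A-Z][A-Za-z0-9]+\b", title)

print(tags_for("GPT-5 hits the crypto market"))  # → ['AI', 'Blockchain']
```

Note that substring matching is deliberately loose here; a production rule table would want word-boundary matching to avoid false hits.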

Predefined Tag Rules

Keywords                   Tag
AI, GPT, 大模型, 机器学习    AI
区块链, 比特币, crypto       Blockchain
股票, 股市, equity           Stocks
游戏, gaming, esports        Gaming
...

The rules are defined in TAG_RULES in fetch.py and can be freely extended.


Data Query Examples

Get Today's Articles

SELECT title, url, source_id
FROM articles
WHERE date(fetched_at, 'unixepoch') = date('now')
ORDER BY published_at DESC;

Get Articles by Category

SELECT * FROM articles
WHERE category = 'tech'
AND published_at > strftime('%s', 'now', '-24 hours');

Get Articles with Tags

SELECT a.title, a.url, GROUP_CONCAT(t.name) as tags
FROM articles a
LEFT JOIN article_tags at ON a.id = at.article_id
LEFT JOIN tags t ON at.tag_id = t.id
WHERE a.category = 'tech'
GROUP BY a.id;

Get Popular Tags

SELECT t.name, COUNT(*) as count
FROM tags t
JOIN article_tags at ON t.id = at.tag_id
GROUP BY t.id
ORDER BY count DESC;
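From Python, these queries can be run directly with the standard sqlite3 module. Here the popular-tags query is wrapped in a helper; the function name `popular_tags` and the `limit` parameter are illustrative, not part of the shipped query.py.

```python
import sqlite3

def popular_tags(db_path="data/rss_fetcher.db", limit=10):
    """Return (tag_name, article_count) rows, most-used tags first."""
    con = sqlite3.connect(db_path)
    try:
        return con.execute(
            """SELECT t.name, COUNT(*) AS count
               FROM tags t
               JOIN article_tags at ON t.id = at.tag_id
               GROUP BY t.id
               ORDER BY count DESC
               LIMIT ?""",
            (limit,),
        ).fetchall()
    finally:
        con.close()
```

The parameterized LIMIT keeps the query safe to expose to user input, and closing the connection promptly matters given the single-process access note below.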

File Locations

rss_fetcher/
├── SKILL.md                    # This document
├── config/
│   └── sources.json           # RSS source config
├── scripts/
│   ├── init_db.py             # DB initialization
│   ├── fetch.py               # Core fetch script (includes auto-tagging)
│   ├── generate_html.py       # HTML report generation
│   ├── source.py              # Source health check and management
│   ├── list.py                # Terminal article list
│   └── query.py               # Data query tool
├── data/
│   ├── rss_fetcher.db         # SQLite database
│   └── index.html             # Generated HTML report
└── references/
    └── schema.sql             # DB schema reference

Database Location

rss_fetcher/data/rss_fetcher.db

Notes

  1. The first fetch is slow - it needs to pull all historical articles
  2. SQLite concurrency - use single-process access and avoid concurrent writes
  3. Unreliable-time articles - articles with published_at = 0 need manual review
  4. Tags accumulate automatically - the tag set grows richer as more articles are fetched
  5. Regenerate HTML regularly - rerun generate_html.py after fetching new articles
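For note 2, one defensive pattern is to open the database with a generous busy timeout, so an accidental second writer waits for the lock instead of failing immediately. This helper is a sketch of that pattern, not code shipped with the skill.

```python
import sqlite3

def open_db(path="data/rss_fetcher.db"):
    """Open the skill database for the single-writer pattern described above."""
    con = sqlite3.connect(path, timeout=30)     # wait up to 30 s if the DB is locked
    con.execute("PRAGMA busy_timeout = 30000")  # same wait inside long statements
    return con
```

This does not make concurrent writes safe in general; it only smooths over brief overlaps, so the single-process recommendation still stands.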

Part of OpenClaw Daily Research System

Files

10 total