Install
openclaw skills install news-digest-v1

Automated pipeline for Chinese news aggregation and digest generation. Automatically scrapes, processes, and generates daily news digests from Chinese news sources, covering industry dynamics, policy updates, economy, tech, energy, and pricing information. Use when the user asks for a daily news summary, news digest, 每日新闻摘要, 新闻汇总, 新闻摘要, or wants to set up automated news monitoring of Chinese news websites. Outputs formatted summaries with source attribution and original links.
# Step 1: install dependencies
pip install requests beautifulsoup4
# Step 2: one-click initialization (create tables + seed sample websites and keywords)
python scripts/news_digest_v2/init_db.py
# Step 3: generate the digest
python scripts/news_digest_v2/run_all_stages.py
Or do everything with a single command:
python scripts/news_digest_v2/quick_start.py
Output: .news-digest-out.md (workspace) + 新闻摘要_YYYYMMDD_HHMMSS.txt (desktop)
Stage 1: Fetch → Scrape websites → Filter → Save to SQLite DB
Stage 2: Process → Deduplicate (≥90% similarity) → Tag keywords
Stage 2.5: LLM → Batch LLM summarization (optional, requires API key)
Stage 3: Output → Read LLM summaries (fallback to rule summaries) → Save to files
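The stages can also be run individually through their entry scripts (listed in the directory layout below); assuming they take no extra arguments:

python scripts/news_digest_v2/stage1_fetch.py          # Stage 1: scrape and store
python scripts/news_digest_v2/stage2_process.py        # Stage 2: dedup + keyword tagging
python scripts/news_digest_v2/stage2_5_llm_summary.py  # Stage 2.5: optional LLM summaries
python scripts/news_digest_v2/stage3_output.py         # Stage 3: write the digest files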
Dependencies: requests, beautifulsoup4

Run the init script to create tables and seed the database with sample data:
python scripts/news_digest_v2/init_db.py
This creates the database tables (see the schema below) and seeds them with sample websites and keywords.
Default database path: news.db (in the skill directory).
Override with environment variable: NEWS_DIGEST_DB=/your/path/news.db
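For example, to keep the database outside the skill directory:

export NEWS_DIGEST_DB=/data/news-digest/news.db   # example path
python scripts/news_digest_v2/run_all_stages.py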
After initialization, add or remove websites and keywords via SQL:
-- Add a website
INSERT INTO monitor_websites (name, url, selector, category, priority)
VALUES ('示例网站', 'https://example.com', 'a', '财经', 1);
-- Add a keyword
INSERT INTO system_keywords (keyword, category, weight)
VALUES ('新能源', 'core', 5);
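These statements can be applied with the standard sqlite3 command-line tool, assuming the default news.db in the skill directory:

sqlite3 news.db "INSERT INTO monitor_websites (name, url, selector, category, priority) VALUES ('示例网站', 'https://example.com', 'a', '财经', 1);"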
| Table | Purpose |
|---|---|
| articles | Scraped news articles (title, content, URL, date, keywords, duplicate flag) |
| monitor_websites | Monitored websites (name, URL, CSS selector, category, enabled) |
| system_keywords | Keywords for relevance scoring (core vs. auxiliary, with weight) |
| digest_output | LLM-generated summaries (optional) |
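To inspect what has been collected, you can query the articles table directly; a minimal sketch, assuming the column names title, url, and is_duplicate from the schema above:

import os
import sqlite3

db_path = os.environ.get("NEWS_DIGEST_DB", "news.db")
conn = sqlite3.connect(db_path)
# Most recent non-duplicate articles; column names assumed from the schema table above
rows = conn.execute(
    "SELECT title, url FROM articles "
    "WHERE is_duplicate = 0 ORDER BY rowid DESC LIMIT 10"
).fetchall()
for title, url in rows:
    print(f"{title} -> {url}")
conn.close()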
Run the full pipeline:

python scripts/news_digest_v2/run_all_stages.py

Takes ~5 minutes (network-bound).
Alternatively:

python scripts/news_digest_v2/quick_start.py

Runs init + fetch + process + output in one shot.
schedule: "0 20 * * *" # Daily 20:00
payload:
run: python scripts/news_digest_v2/run_all_stages.py
then: read .news-digest-out.md and send to messaging
timeout: 600 # 10 minutes
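If you prefer plain cron instead of the scheduler config above, an equivalent crontab entry (the install path is just an example) would be:

0 20 * * * cd /path/to/news-digest && python scripts/news_digest_v2/run_all_stages.py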
Each digest entry follows this template (the generated text is in Chinese):

【来源:标题】
摘要内容(智能选段,300字以内,包含关键数据和核心事实)
发布时间:YYYY-MM-DD
原文链接:http://...

That is: a [Source: Title] header, a summary body (smart excerpts of at most 300 characters, including key figures and core facts), the publication date, and a link to the original article.
Incomplete sentences are automatically filtered out.
Tutorial and guide content is filtered out entirely, driven by the tutorial keyword list under the social category in rules_config.py.

Summarization is not simple truncation: each paragraph is scored and the best passages are selected. The result is then filtered to remove image captions, journalist bylines, ads, subtitles, and boilerplate.
Excluded topics: entertainment, social news, violence, crime cases, health/wellness, education, automotive consumer news, science popularization (科普类), animal/archaeology news.
Tutorial keywords (all filtered): 教程 (tutorial), 指南 (guide), 攻略 (walkthrough), 入门 (getting started), 自学 (self-study), 从零开始 (from scratch), 手把手 (hands-on), 保姆级教程 (step-by-step tutorial), 怎么做 (how to do), 如何使用 (how to use), 操作步骤 (operation steps), 图文教程 (illustrated tutorial), 视频教程 (video tutorial), 科研绘图 (scientific figure drawing), PS教程 (Photoshop tutorial), Illustrator, etc.
Invalid keywords: clickbait patterns, advertising, webpage navigation elements.
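The gist of the keyword-based exclusion, as a hypothetical sketch (the real keyword lists and logic live in rules_config.py and filters.py):

# Hypothetical sketch; the actual lists are defined in rules_config.py
TUTORIAL_KEYWORDS = ["教程", "指南", "攻略", "入门", "手把手", "保姆级教程"]

def is_tutorial(title: str, content: str) -> bool:
    # Drop an article if any tutorial keyword appears in its title or body
    text = title + content
    return any(kw in text for kw in TUTORIAL_KEYWORDS)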
| Variable | Default | Description |
|---|---|---|
| NEWS_DIGEST_DB | news.db | SQLite database path |
| NEWS_DIGEST_LLM_API_KEY | (empty) | LLM API key for Stage 2.5 summarization |
| NEWS_DIGEST_LLM_BASE_URL | (empty) | LLM API base URL |
| NEWS_DIGEST_LLM_MODEL | qwen-plus | LLM model name |
If LLM env vars are not set, Stage 2.5 is silently skipped and rule-based summaries are used instead.
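To enable Stage 2.5, export the variables before running the pipeline (the key and URL below are placeholders):

export NEWS_DIGEST_LLM_API_KEY=sk-xxxx
export NEWS_DIGEST_LLM_BASE_URL=https://your-llm-provider.example.com/v1
export NEWS_DIGEST_LLM_MODEL=qwen-plus
python scripts/news_digest_v2/run_all_stages.py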
news-digest/
├── SKILL.md
└── scripts/
└── news_digest_v2/
├── __init__.py
├── config.py # DB path, websites, keywords, holidays, LLM config
├── database.py # SQLite operations
├── fetcher.py # Web scraping + smart summary extraction
├── filters.py # Content filtering logic
├── formatter.py # Output formatting + incomplete sentence handling
├── init_db.py # One-click database initialization (NEW in v1.0.1)
├── quick_start.py # One-command full pipeline (NEW in v1.0.1)
├── rules_config.py # Exclusion rules, keywords, dateline patterns
├── similarity.py # Jaccard deduplication
├── stage1_fetch.py # Stage 1 entry (fetch)
├── stage2_process.py # Stage 2 entry (dedup + keywords)
├── stage2_5_llm_summary.py # Stage 2.5 (LLM batch summarization)
├── stage3_output.py # Stage 3 entry (read + format + save)
└── run_all_stages.py # Full pipeline entry
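Stage 2's deduplication (similarity.py) is based on Jaccard similarity between articles; a minimal, hypothetical sketch of the idea with the ≥90% threshold mentioned above (the real implementation may tokenize differently):

def jaccard(a: str, b: str) -> float:
    # Character-level token sets; intersection over union
    sa, sb = set(a), set(b)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)

def is_near_duplicate(text_a: str, text_b: str, threshold: float = 0.9) -> bool:
    # Articles at or above the threshold are flagged, not deleted
    return jaccard(text_a, text_b) >= threshold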
Q: The skill won't run after installation?
A: Make sure you ran init_db.py first to initialize the database. Without the database and sample data, the later stages will fail.
Q: pip install fails?
A: Try pip install --upgrade pip, then install again. If it is a network issue, use a mirror: pip install -i https://pypi.tuna.tsinghua.edu.cn/simple requests beautifulsoup4.
Q: Some websites fail to scrape?
A: This is expected. Some sites have anti-scraping measures or SSL issues; the script continues with the other websites, and the final output is not affected.
Q: The output is empty?
A: Check whether the database contains any data. Run python scripts/news_digest_v2/init_db.py to reinitialize.
Q: How do I customize the monitored websites?
A: Insert rows into the monitor_websites table via SQL; fields: name, url, selector, category, priority.
Q: Will the database keep growing?
A: It grows by roughly 30-50 articles per day. Periodically clean out old data (an example cleanup query is shown below), or delete news.db and reinitialize.
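For example, a periodic cleanup could look like this (the date column name is an assumption based on the schema table above):

-- Remove articles older than 90 days; adjust the column name to the actual schema
DELETE FROM articles WHERE date < date('now', '-90 days');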
Note: Articles are written with INSERT OR IGNORE, so re-running keeps previously stored news; detected duplicates are flagged with is_duplicate = 1 rather than deleted.

New in v1.0.1: init_db.py for one-click database initialization with sample data; quick_start.py for one-command full pipeline.