Install
openclaw skills install news-digest-v1
Automatically scrape, process, and generate daily news digests from Chinese news sources. Covers industry dynamics, policy updates, economy, tech, energy, and pricing information. Use when: user asks for daily news summary, news digest, 每日新闻摘要, 新闻汇总, 新闻摘要, or wants to set up automated news monitoring from Chinese news websites. Outputs formatted summaries with source attribution and original links.
Automated pipeline for Chinese news aggregation and digest generation.
# Step 1: install dependencies
pip install requests beautifulsoup4
# Step 2: one-click initialization (create tables + insert sample websites + keywords)
python scripts/news_digest_v2/init_db.py
# Step 3: run the digest
python scripts/news_digest_v2/run_all_stages.py
Or do everything with a single command:
python scripts/news_digest_v2/quick_start.py
Output: .news-digest-out.md (workspace) + 新闻摘要_YYYYMMDD_HHMMSS.txt (desktop)
Stage 1: Fetch → Scrape websites → Filter → Save to SQLite DB
Stage 2: Process → Deduplicate (≥90% similarity) → Tag keywords
Stage 2.5: LLM → Batch LLM summarization (optional, requires API key)
Stage 3: Output → Read LLM summaries (fallback to rule summaries) → Save to files
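The Stage 2 deduplication step can be pictured with a short sketch. The skill describes similarity.py as Jaccard deduplication with a 90% threshold, but how it tokenizes text is not documented here, so the character-bigram tokenization and the function names below (`bigrams`, `jaccard`, `mark_duplicates`) are assumptions for illustration only.

```python
def bigrams(text: str) -> set:
    """Character bigrams: an assumed tokenization that needs no Chinese word segmenter."""
    cleaned = "".join(text.split())  # drop all whitespace
    if len(cleaned) < 2:
        return {cleaned} if cleaned else set()
    return {cleaned[i:i + 2] for i in range(len(cleaned) - 1)}


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity |A & B| / |A | B| over the two bigram sets."""
    set_a, set_b = bigrams(a), bigrams(b)
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def mark_duplicates(titles: list, threshold: float = 0.9) -> list:
    """Flag each title that is at least `threshold` similar to an earlier kept title."""
    kept = []
    flags = []
    for title in titles:
        is_dup = any(jaccard(title, seen) >= threshold for seen in kept)
        flags.append(is_dup)
        if not is_dup:
            kept.append(title)
    return flags
```

Flagging rather than dropping mirrors the duplicate flag stored on the articles table: the first occurrence is kept and later near-identical items are only marked.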
Dependencies: requests, beautifulsoup4
Run the init script to create tables and seed with sample data:
python scripts/news_digest_v2/init_db.py
This creates the database tables and inserts sample websites and keywords.
Default database path: news.db (in the skill directory).
Override with environment variable: NEWS_DIGEST_DB=/your/path/news.db
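A minimal sketch of how such an override could be resolved. The helper name `resolve_db_path` and its parameters are hypothetical; the skill's own config.py may do this differently.

```python
import os
from pathlib import Path


def resolve_db_path(skill_dir: Path, env=None) -> Path:
    """Return the NEWS_DIGEST_DB override if set, else news.db in the skill directory."""
    env = os.environ if env is None else env
    override = env.get("NEWS_DIGEST_DB", "").strip()
    return Path(override) if override else skill_dir / "news.db"
```

Passing `env` explicitly keeps the function easy to check without touching the real process environment.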
After initialization, add or remove websites and keywords via SQL:
-- Add a website
INSERT INTO monitor_websites (name, url, selector, category, priority)
VALUES ('示例网站', 'https://example.com', 'a', '财经', 1);
-- Add a keyword
INSERT INTO system_keywords (keyword, category, weight)
VALUES ('新能源', 'core', 5);
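The same inserts can be issued from Python with bound parameters, which avoids quoting problems in names and URLs. This is a sketch only: the schema below is an assumption reduced to the columns named in the SQL above (the UNIQUE constraints in particular are guesses), and it runs against an in-memory database so it cannot damage a real news.db.

```python
import sqlite3

# Assumed minimal schema for illustration; the real tables come from init_db.py
# and may have more columns or different constraints.
SCHEMA = """
CREATE TABLE IF NOT EXISTS monitor_websites (
    id INTEGER PRIMARY KEY,
    name TEXT, url TEXT UNIQUE, selector TEXT, category TEXT, priority INTEGER
);
CREATE TABLE IF NOT EXISTS system_keywords (
    id INTEGER PRIMARY KEY,
    keyword TEXT UNIQUE, category TEXT, weight INTEGER
);
"""


def add_website(conn, name, url, selector, category, priority=1):
    """Insert a monitored website using bound parameters (no string formatting)."""
    conn.execute(
        "INSERT OR IGNORE INTO monitor_websites (name, url, selector, category, priority) "
        "VALUES (?, ?, ?, ?, ?)",
        (name, url, selector, category, priority),
    )


def add_keyword(conn, keyword, category="core", weight=5):
    """Insert a scoring keyword; repeated keywords are ignored under the assumed schema."""
    conn.execute(
        "INSERT OR IGNORE INTO system_keywords (keyword, category, weight) VALUES (?, ?, ?)",
        (keyword, category, weight),
    )


conn = sqlite3.connect(":memory:")  # swap in the real news.db path to apply changes
conn.executescript(SCHEMA)
add_website(conn, "示例网站", "https://example.com", "a", "财经", 1)
add_website(conn, "示例网站", "https://example.com", "a", "财经", 1)  # ignored: same URL
add_keyword(conn, "新能源", "core", 5)
conn.commit()
website_count = conn.execute("SELECT COUNT(*) FROM monitor_websites").fetchone()[0]
keyword_count = conn.execute("SELECT COUNT(*) FROM system_keywords").fetchone()[0]
```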
| Table | Purpose |
|---|---|
| articles | Scraped news articles (title, content, URL, date, keywords, duplicate flag) |
| monitor_websites | Monitored websites (name, URL, CSS selector, category, enabled) |
| system_keywords | Keywords for relevance scoring (core vs auxiliary, with weight) |
| digest_output | LLM-generated summaries (optional) |
python scripts/news_digest_v2/run_all_stages.py
Takes ~13 minutes (network + LLM bound).
python scripts/news_digest_v2/quick_start.py
Runs init + fetch + process + output in one shot.
schedule: "0 20 * * *" # Daily 20:00
payload:
  run: python scripts/news_digest_v2/run_all_stages.py
  then: read .news-digest-out.md and send to messaging
timeout: 900 # 15 minutes
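【Source: Title】
Summary text (smart paragraph selection, up to 300 characters, including key figures and core facts)
发布时间 (publication date): YYYY-MM-DD
原文链接 (original link): http://...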
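As a rough illustration of this layout, one entry could be rendered like this. The function name `format_entry` and its parameters are hypothetical, not taken from formatter.py, and the summary is assumed to have been selected and length-limited already.

```python
def format_entry(source, title, summary, published, url):
    """Render one digest entry in the four-line layout shown above."""
    return "\n".join([
        f"【{source}:{title}】",     # source and title in full-width brackets
        summary,                    # summary text, assumed to be 300 characters or fewer
        f"发布时间:{published}",    # publication date, YYYY-MM-DD
        f"原文链接:{url}",          # link to the original article
    ])
```

The Chinese labels are kept literally because they are what appears in the output files.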
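Incomplete sentences are filtered out automatically.
Tutorial and guide content is filtered out entirely, based on the tutorial keyword list in the social category of rules_config.py.
Summaries are not simple truncation. Each paragraph is scored, then filtered to remove image captions, journalist bylines, ads, subtitles, and boilerplate.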
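One plausible reading of the incomplete-sentence handling is to cut a summary back to its last sentence terminator. The sketch below is an assumption about the behavior, not the logic in formatter.py; the helper name and the terminator set are made up for illustration.

```python
def drop_incomplete_tail(text, terminators="。!?!?"):
    """Keep text up to the last sentence terminator; return '' if there is none."""
    # str.rfind returns -1 when a character is absent, so cut is -1 if none match.
    cut = max(text.rfind(ch) for ch in terminators)
    return text[: cut + 1]
```

The ASCII period is deliberately left out of the terminator set here, since it also appears inside decimal numbers such as prices and percentages.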
Excluded topics: entertainment, social news, violence, crime cases, health/wellness, education, automotive consumer news, science popularization (科普类), animal/archaeology news.
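Tutorials (all filtered): 教程 (tutorial), 指南 (guide), 攻略 (walkthrough), 入门 (getting started), 自学 (self-study), 从零开始 (from scratch), 手把手 (step by step), 保姆级教程 (hand-holding tutorial), 怎么做 (how to do it), 如何使用 (how to use), 操作步骤 (operating steps), 图文教程 (illustrated tutorial), 视频教程 (video tutorial), 科研绘图 (scientific figure drawing), PS教程 (Photoshop tutorial), Illustrator, etc.
Corporate PR and advertorials (all filtered): 产能突破 (capacity breakthrough), 全线投产 (full production launch), 技术溢出 (technology spillover), 供应链底气 (supply-chain confidence), 跨界营销 (crossover marketing), 负面舆情 (negative public sentiment), 品鉴官 (tasting ambassador), 品牌定位 (brand positioning), etc.
Education, social events, and awards (all filtered): 十佳 (top ten), 颁奖仪式 (award ceremony), 表彰 (commendation), 职校生 (vocational school students), 职业院校 (vocational colleges), 评选 (selection), 杰出代表 (outstanding representative), 工匠精神 (craftsman spirit), etc.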
Invalid keywords: clickbait patterns, advertising, webpage navigation elements.
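The keyword rules above amount to substring matching against per-category lists. A minimal sketch, using a handful of the listed keywords; the dictionary layout and the function name `excluded_category` are assumptions, and the full lists live in rules_config.py.

```python
# A few keywords copied from the lists above, grouped by the category names
# the skill mentions (social, corporate_pr, education_social).
EXCLUDE_KEYWORDS = {
    "social": ["教程", "指南", "攻略", "手把手"],
    "corporate_pr": ["产能突破", "全线投产", "品牌定位"],
    "education_social": ["十佳", "颁奖仪式", "工匠精神"],
}


def excluded_category(title, content=""):
    """Return the first exclusion category whose keyword appears in the text, else None."""
    text = f"{title}\n{content}"
    for category, keywords in EXCLUDE_KEYWORDS.items():
        if any(keyword in text for keyword in keywords):
            return category
    return None
```

An article is dropped when this returns a category and kept when it returns None.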
| Variable | Default | Description |
|---|---|---|
| NEWS_DIGEST_DB | news.db | SQLite database path |
| NEWS_DIGEST_LLM_API_KEY | (empty) | LLM API key for Stage 2.5 summarization |
| NEWS_DIGEST_LLM_BASE_URL | (empty) | LLM API base URL |
| NEWS_DIGEST_LLM_MODEL | qwen3.6-plus | LLM model name |
If LLM env vars are not set, Stage 2.5 is silently skipped and rule-based summaries are used instead.
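The skip-or-run decision could look roughly like this. It is a sketch under one assumption that the text does not confirm: that both the API key and the base URL must be present for Stage 2.5 to run. The `LLMConfig` class and `load_llm_config` function are illustrative names.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass
class LLMConfig:
    api_key: str
    base_url: str
    model: str


def load_llm_config(env: Optional[Mapping[str, str]] = None) -> Optional[LLMConfig]:
    """Return LLM settings, or None when Stage 2.5 should be skipped."""
    env = os.environ if env is None else env
    api_key = env.get("NEWS_DIGEST_LLM_API_KEY", "").strip()
    base_url = env.get("NEWS_DIGEST_LLM_BASE_URL", "").strip()
    if not api_key or not base_url:
        return None  # caller falls back to rule-based summaries
    model = env.get("NEWS_DIGEST_LLM_MODEL", "").strip() or "qwen3.6-plus"
    return LLMConfig(api_key, base_url, model)
```

Returning None instead of raising matches the documented behavior of skipping the stage silently.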
news-digest/
├── SKILL.md
└── scripts/
    └── news_digest_v2/
        ├── __init__.py
        ├── config.py               # DB path, websites, keywords, holidays, LLM config
        ├── database.py             # SQLite operations
        ├── fetcher.py              # Web scraping + smart summary extraction
        ├── filters.py              # Content filtering logic
        ├── formatter.py            # Output formatting + incomplete sentence handling
        ├── init_db.py              # One-click database initialization (NEW in v1.0.1)
        ├── quick_start.py          # One-command full pipeline (NEW in v1.0.1)
        ├── rules_config.py         # Exclusion rules, keywords, dateline patterns
        ├── similarity.py           # Jaccard deduplication
        ├── stage1_fetch.py         # Stage 1 entry (fetch)
        ├── stage2_process.py       # Stage 2 entry (dedup + keywords)
        ├── stage2_5_llm_summary.py # Stage 2.5 (LLM batch summarization)
        ├── stage3_output.py        # Stage 3 entry (read + format + save)
        └── run_all_stages.py       # Full pipeline entry
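Q: It won't run after installation?
A: Make sure you have run init_db.py first to initialize the database. Without the database and sample data, later steps will fail.
Q: pip install fails?
A: Try pip install --upgrade pip, then install again. If the network is the problem, install from a mirror with pip install -i https://pypi.tuna.tsinghua.edu.cn/simple requests beautifulsoup4
Q: Some websites fail to scrape?
A: This is expected. Some sites have anti-scraping measures or SSL problems. The script continues with the remaining sites, and the final output is unaffected.
Q: The output is empty?
A: Check whether the database contains data. Run python scripts/news_digest_v2/init_db.py to reinitialize.
Q: How do I customize the monitored websites?
A: Insert rows into the monitor_websites table via SQL. Fields: name, url, selector, category, priority.
Q: Will the database keep growing?
A: About 30-50 articles per day. Clean up old data periodically, or delete news.db and reinitialize.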
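A periodic cleanup could be as small as the sketch below. The column name `publish_date` (ISO `YYYY-MM-DD` text) is an assumption, since the articles table is only described as storing a date; check the real schema before running anything like this against news.db.

```python
import sqlite3
from datetime import date, timedelta


def purge_old_articles(conn, keep_days=30, today=None):
    """Delete articles older than keep_days; returns the number of rows removed."""
    today = today or date.today()
    cutoff = (today - timedelta(days=keep_days)).isoformat()
    # publish_date is an assumed column name holding YYYY-MM-DD text,
    # so plain string comparison orders dates correctly.
    cur = conn.execute("DELETE FROM articles WHERE publish_date < ?", (cutoff,))
    conn.commit()
    return cur.rowcount


# Demo against an in-memory table with the assumed column.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (title TEXT, publish_date TEXT)")
conn.executemany(
    "INSERT INTO articles VALUES (?, ?)",
    [("old", "2024-01-01"), ("recent", "2024-03-01")],
)
removed = purge_old_articles(conn, keep_days=30, today=date(2024, 3, 10))
```

After a large delete, running SQLite's `VACUUM` statement reclaims the freed space in the database file.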
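Writes use INSERT OR IGNORE; old news is retained.
Duplicates are flagged with is_duplicate = 1, not deleted.
rules_config.py):
parse_rmrbhwb routing now supports paper.people.com.cn/rmrb/ (new database entry id=47, priority=1)
cross_day_dedup.py adds hard rule 4 (detects titles that contain one another) and raises the Jaccard weight from 0.5 to 0.6
formatter.py: the "included news" count now reports the number of items actually output instead of the pre-filter count
filtered_news length (after excluding duplicates), without subtracting entries removed by the content-type blacklist
corporate_scandal category)
cross_day_dedup.py, which compares against the last 3 days of digests and automatically blocks news repeated across days
stage1_fetch.py changed from a hard-coded 42 to reading len(WEBSITES) dynamically
decode_response() merged into the main flow; supports all known GBK sources
fetcher.py adds a decode_response() function that forces GBK decoding for known GBK-encoded sources (People's Daily Overseas Edition), fixing the Cyrillic mojibake at its root
stage2_5_llm_summary.py adds garbled-title detection; when a garbled title is found, the LLM is prompted to generate an accurate title from the article body
formatter.py adds a fallback filter for garbled titles
TITLE_EXCLUDE_KEYWORDS (评论丨, 时评, 社评, 深度观察, 记者观察, 招聘, 面试, 递补, 人事任免, 讣告, 专访, etc.: commentary, editorials, in-depth and reporter observation pieces, recruitment, interviews, replacement appointments, personnel changes, obituaries, exclusive interviews)
URL_EXCLUDE_PATTERNS (People's Daily Online commentary channel, etc.)
formatter.py automatically skips non-hard-news types on output
qwen-plus (does not exist, 400 error) changed to qwen3.6-plus
corporate_pr (corporate PR and advertorials) + education_social (education, awards, selections)
·, spaces)
parse_cnr): on CNR (央广网) pages the title and body sit inside the same <a> tag; a dedicated parser now takes only <strong> as the title, so the title is no longer merged with the body and filtered for being too long
<meta> tag encoding detection, improving the scrape success rate
init_db.py for one-click database initialization with sample data
quick_start.py for one-command full pipeline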