hn-crawler

v1.0.0

Crawls news from https://hn.aimaker.dev/ and runs the full crawl -> extract -> organize -> summarize pipeline. Invoke when the user wants to crawl news from hn.aimaker.dev or process web content through the full pipeline.

by proanimer@drowning-in-codes

Install

Install with OpenClaw

Best for remote or guided setup. Copy the exact prompt, then paste it into OpenClaw for drowning-in-codes/hn-crawler-cn.

Install the skill "hn-crawler" (drowning-in-codes/hn-crawler-cn) from ClawHub.
Skill page: https://clawhub.ai/drowning-in-codes/hn-crawler-cn
Keep the work scoped to this skill only.
After install, inspect the skill metadata and help me finish setup.
Use only the metadata you can verify from ClawHub; do not invent missing requirements.
Ask before making any broader environment changes.

Command Line


Use the direct CLI path if you want to install manually and keep every step visible.

OpenClaw CLI

Bare skill slug

openclaw skills install hn-crawler-cn

ClawHub CLI


npx clawhub@latest install hn-crawler-cn
Security Scan

VirusTotal: Benign
OpenClaw: Benign (medium confidence)
Purpose & Capability
Name/description match the provided scripts and SKILL.md. The package contains crawl/extract/organize/summarize scripts and a run_pipeline orchestrator which all operate on the stated site (default TARGET_URL is https://hn.aimaker.dev/). There are no unrelated required binaries or environment variables.
Instruction Scope
SKILL.md and the scripts limit actions to HTTP GET requests to the target site, parsing HTML, local file read/write under data/, and generating summaries. Declared environment variables (TARGET_URL, OUTPUT_DIR, TIMEOUT) are used. The code does not reference other system credentials, config paths, or external endpoints beyond normal HTTP requests to the target URL. Note: some source files (organize.py) contain syntax/typing errors that will prevent successful execution until fixed; this is a functionality issue rather than a security misdirection.
Install Mechanism
There is no automated install spec; SKILL.md instructs the user to run pip install -r requirements.txt. Installing packages from PyPI is normal but carries the usual supply-chain risk (verify package versions and trust). No downloads from arbitrary URLs or archive extraction steps are present in the skill itself.
Credentials
The skill does not request credentials or secrets. The only environment variables used (TARGET_URL, OUTPUT_DIR, TIMEOUT) are proportional and documented. Scripts operate on local output directories and do not exfiltrate data to unlisted remote endpoints.
Persistence & Privilege
The skill is not marked always:true and does not attempt to modify other skills or system-level agent configuration. It does not request permanent presence or elevated privileges.
Assessment
This skill appears internally consistent for crawling and processing hn.aimaker.dev content. Before installing or running:

  1. Inspect the code locally (you already have the files); there are syntax/typing bugs (e.g., in organize.py) that must be fixed before the pipeline will run.
  2. Follow robots.txt and rate-limit requests to avoid abusive crawling.
  3. When running pip install -r requirements.txt, review which packages and versions will be installed (PyPI packages are common but carry supply-chain risk).
  4. Run the skill in a sandbox or non-critical environment first (it writes files to data/).
  5. If you need higher assurance, request the full, untruncated source for final review, or ask the author to provide a fixed release with tests and an explicit provenance/homepage.


latest: vk9787tfjy4qhv1wtrpefxr3a5183nzsf
105 downloads · 0 stars · 1 version · updated 1 mo ago
v1.0.0 · MIT-0

HN News Crawler Skill

This skill crawls news content from https://hn.aimaker.dev/ and turns the raw data into a structured summary report through a complete processing pipeline.

Workflow

The pipeline runs in four stages:

┌─────────┐    ┌──────────┐    ┌──────────┐    ┌───────────┐
│  Crawl  │ -> │ Extract  │ -> │ Organize │ -> │ Summarize │
└─────────┘    └──────────┘    └──────────┘    └───────────┘

1. Crawl

  • Script: scripts/crawl.py
  • Function: fetches the page's raw HTML via HTTP requests
  • Output: data/raw/hn_aimaker_<timestamp>.html
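
The crawl step's source isn't shown here, so this is only a minimal stdlib-only sketch of the behavior described above, assuming the timestamped path format in the output bullet (the real crawl.py may use requests and different helpers):

```python
import os
import time
import urllib.request

def raw_path(output_dir="data", ts=None):
    """Build the timestamped output path described above."""
    ts = ts or time.strftime("%Y%m%d_%H%M%S")
    return os.path.join(output_dir, "raw", f"hn_aimaker_{ts}.html")

def crawl(url="https://hn.aimaker.dev/", output_dir="data", timeout=30):
    """Fetch the page HTML and write it under data/raw/."""
    path = raw_path(output_dir)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        html = resp.read().decode("utf-8", errors="replace")
    with open(path, "w", encoding="utf-8") as f:
        f.write(html)
    return path
```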

2. Extract

  • Script: scripts/extract.py
  • Function: parses the HTML and extracts each article's title, link, summary, publication time, and other fields
  • Output: data/extracted/articles_<timestamp>.json
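
The real extract.py's selectors are unknown; as an illustration, a stdlib html.parser sketch that collects (title, url) pairs from anchor tags:

```python
from html.parser import HTMLParser

class ArticleLinkParser(HTMLParser):
    """Collect {"title", "url"} dicts from <a> tags -- a simplified
    stand-in for the real extractor, whose exact selectors are unknown."""
    def __init__(self):
        super().__init__()
        self.articles = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href:
            title = "".join(self._text).strip()
            if title:
                self.articles.append({"title": title, "url": self._href})
            self._href = None

parser = ArticleLinkParser()
parser.feed('<a href="https://example.com/article">Sample title</a>')
```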

3. Organize

  • Script: scripts/organize.py
  • Function: cleans, deduplicates, categorizes, and formats the extracted data
  • Output: data/organized/articles_organized_<timestamp>.json
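
One plausible shape for the deduplication step (an assumption, not the actual organize.py, which the security scan notes currently has syntax errors):

```python
def organize(articles):
    """Deduplicate by URL, keeping the highest-scored copy, then sort by score."""
    best = {}
    for article in articles:
        url = article.get("url")
        if url not in best or article.get("score", 0) > best[url].get("score", 0):
            best[url] = article
    return sorted(best.values(), key=lambda a: a.get("score", 0), reverse=True)
```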

4. Summarize

  • Script: scripts/summarize.py
  • Function: generates a summary report, including hot-topic statistics, trend analysis, and more
  • Output: data/summary/summary_<timestamp>.md
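
A minimal sketch of what the Markdown report generation might look like, assuming articles carry an optional "category" field as in the data format below:

```python
from collections import Counter

def summarize(articles):
    """Render a minimal Markdown report: total count plus per-category tallies."""
    counts = Counter(a.get("category", "Uncategorized") for a in articles)
    lines = ["# Summary", "", f"Total articles: {len(articles)}", ""]
    lines += [f"- {cat}: {n}" for cat, n in counts.most_common()]
    return "\n".join(lines)
```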

Quick Start

Install dependencies

cd .trae/skills/hn-crawler/scripts
pip install -r requirements.txt

Run the full pipeline

# Option 1: run the steps one by one
python scripts/crawl.py
python scripts/extract.py
python scripts/organize.py
python scripts/summarize.py

# Option 2: run the whole pipeline in one command
python scripts/run_pipeline.py
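
Conceptually, run_pipeline.py just chains the four scripts; a sketch of that orchestration (the actual runner's error handling is unknown):

```python
import subprocess
import sys

# Pipeline stages, in the order SKILL.md describes.
STEPS = ["crawl.py", "extract.py", "organize.py", "summarize.py"]

def run_pipeline(script_dir="scripts"):
    """Run each stage as a subprocess and stop on the first failure."""
    for step in STEPS:
        result = subprocess.run([sys.executable, f"{script_dir}/{step}"])
        if result.returncode != 0:
            raise RuntimeError(f"{step} exited with code {result.returncode}")
```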

Directory structure

.trae/skills/hn-crawler/
├── SKILL.md                    # this file
├── scripts/
│   ├── requirements.txt        # Python dependencies
│   ├── crawl.py                # crawl script
│   ├── extract.py              # extract script
│   ├── organize.py             # organize script
│   ├── summarize.py            # summarize script
│   └── run_pipeline.py         # one-command pipeline runner
└── data/                       # data output directory (created automatically)
    ├── raw/                    # raw HTML
    ├── extracted/              # extracted JSON data
    ├── organized/              # organized data
    └── summary/                # summary reports

Data format

Extracted article format (JSON)

{
  "articles": [
    {
      "title": "Article title",
      "url": "https://example.com/article",
      "summary": "Article summary",
      "published_at": "2024-01-15T10:30:00",
      "source": "hn.aimaker.dev",
      "category": "AI",
      "score": 150
    }
  ],
  "metadata": {
    "crawled_at": "2024-01-15T12:00:00",
    "total_count": 30
  }
}
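
A quick sanity check of this shape before feeding a file into later stages can look like the following (a hypothetical helper, not part of the skill):

```python
def validate_articles(doc):
    """Return True if the document matches the basic shape shown above:
    an "articles" list whose entries each carry a title and a URL."""
    if not isinstance(doc.get("articles"), list):
        return False
    return all("title" in a and "url" in a for a in doc["articles"])

sample = {
    "articles": [{"title": "Article title", "url": "https://example.com/article"}],
    "metadata": {"total_count": 1},
}
```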

Configuration

Each script supports the following environment variables or command-line arguments:

  • TARGET_URL: target URL (default: https://hn.aimaker.dev/)
  • OUTPUT_DIR: output directory (default: data/)
  • TIMEOUT: request timeout (default: 30 seconds)
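
Reading these variables with their documented defaults is straightforward; a sketch (the scripts' actual config code may differ):

```python
import os

def load_config():
    """Read the documented variables, falling back to the stated defaults."""
    return {
        "target_url": os.environ.get("TARGET_URL", "https://hn.aimaker.dev/"),
        "output_dir": os.environ.get("OUTPUT_DIR", "data/"),
        "timeout": int(os.environ.get("TIMEOUT", "30")),
    }
```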

Notes

  1. Respect the site's robots.txt and crawling policies
  2. Set a reasonable delay between requests to avoid putting pressure on the server
  3. Crawled data is for personal study and research only
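
Checking robots.txt programmatically is easy with the standard library; a sketch using a fetched robots.txt body (the user-agent name here is a placeholder, not something the skill defines):

```python
import urllib.robotparser

def allowed(robots_text, url, user_agent="hn-crawler"):
    """Check a robots.txt body before crawling a URL."""
    rp = urllib.robotparser.RobotFileParser()
    rp.parse(robots_text.splitlines())
    return rp.can_fetch(user_agent, url)

# Example rules: everything allowed except /private/.
rules = "User-agent: *\nDisallow: /private/"
```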
