Xiaohongshu Note Scraper & Topic-Selection Assistant

v1.0.1

Scrapes the public page of a Xiaohongshu note and structures its information (title, body excerpt, author, publish time, interaction stats, tags, cover image, etc.) as JSON or Markdown. Intended for scenarios such as "extract content from a note link", "batch-collect basic note info", and "generate note-summary material"; triggered when the user provides a Xiaohongshu note URL, a list of URLs, or needs a machine-readable export.

Security Scan
Capability signals
Requires OAuth token
These labels describe what authority the skill may exercise. They are separate from suspicious or malicious moderation verdicts.
VirusTotal
Benign
OpenClaw
Benign (medium confidence)
Purpose & Capability
The name/description ("scrape Xiaohongshu notes and produce structured output") aligns with the included files: multiple fetchers (HTTP and Playwright), TikHub API callers, result processors, and export tools. The code and SKILL.md consistently implement search-by-keyword, single/batch URL fetch, and output generation.
Instruction Scope
Runtime instructions are narrowly scoped to fetching pages, parsing metadata, and exporting results. They do request user-provided cookie strings/files and API tokens (TikHub or generic endpoints), and they produce optional screenshots/HTML. These are relevant to the stated task but are sensitive inputs (browser cookies/session tokens). The generic adapter (search_notes_generic.py) can call any base_url you supply — that is powerful and can be misused if pointed at a malicious endpoint.
Install Mechanism
There is no formal install spec (instruction-only skill), which is low-risk. The repo contains package.json/package-lock and instructs installing playwright when using the browser-mode scraper — installing Playwright downloads browser binaries (normal for this task but more heavyweight). All downloads referenced are standard registries (npm) and known endpoints (no obscure download URLs in SKILL.md).
Credentials
The skill does not declare required env vars and does not require unrelated credentials. It does legitimately require API tokens or cookie files provided by the user to access TikHub or to fetch authenticated pages. Because the generic search script accepts arbitrary base_url + token, a misconfigured usage could leak a token to an unintended endpoint — the files and instructions make this possible but it is user-driven rather than hidden.
Persistence & Privilege
The skill is not always-enabled and does not request elevated persistence. It does not modify other skills or system-wide configs. It runs as invoked and writes output files in the working directory (expected behavior).
Assessment
This package appears coherent for scraping and structuring public Xiaohongshu notes, but pay attention to secrets and external endpoints before running:

  1. Do not paste full browser cookie strings or session tokens into untrusted environments; they are equivalent to logging in as you.
  2. TikHub API calls will send your TikHub token to api.tikhub.io (expected); only use tokens you control, and only if you trust the third party.
  3. The generic API script (search_notes_generic.py) will call whatever base_url you provide with whatever token/header you give it; avoid pointing it at unknown hosts to prevent accidental secret exfiltration.
  4. generate_wow_pack encodes content into a mermaid.ink URL; generating or sharing that URL uploads the encoded content to a third-party service when the URL is resolved.
  5. If you run the Playwright scraper, installing Playwright downloads browser binaries; run it in a sandbox or an environment where you can vet the network activity.
  6. Ensure your usage complies with Xiaohongshu's terms, robots.txt, and privacy rules.

If you need to proceed: run locally in an isolated environment, avoid reusing high-privilege session cookies, and inspect outputs before sharing.
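Point 4 can be made concrete. A minimal sketch of how note content ends up inside a mermaid.ink URL, assuming the plain-base64 JSON-state scheme also used by mermaid.live; the skill's actual encoder may differ:

```python
import base64
import json


def mermaid_ink_url(mermaid_code: str) -> str:
    """Build a mermaid.ink image URL (assumed base64 JSON-state scheme).

    Whoever resolves this URL sends the encoded content to the
    third-party mermaid.ink service.
    """
    state = {"code": mermaid_code, "mermaid": {"theme": "default"}}
    payload = base64.urlsafe_b64encode(
        json.dumps(state).encode("utf-8")
    ).decode("ascii")
    return "https://mermaid.ink/img/" + payload
```

The content is only base64-encoded, not encrypted, so anything placed in the diagram is readable by anyone holding the URL.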


Tags: automation · content · latest · mindmap · xiaohongshu
72 downloads · 0 stars · 2 versions
Updated 6d ago · v1.0.1 · MIT-0

Fetching Xiaohongshu Notes

Minimal Rules

Provide a TikHub API Key plus your request (keyword, page number) to fetch Xiaohongshu search data.

Minimal example:

make tikhub-fetch \
  KEYWORD="女性力量" \
  TIKHUB_PAGE=2 \
  TIKHUB_TOKEN="<YOUR_TIKHUB_KEY>" \
  TIKHUB_ENDPOINT=web \
  TIKHUB_AUTH_MODE=bearer

Output file:

  • workspace/xiaohongshu-note-fetcher-skill-data/tikhub_search_page2.json
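The exported JSON can then be read back programmatically. A minimal sketch, assuming the export is either a bare list of note objects or a TikHub-style {"data": {"items": [...]}} envelope; the exact envelope may vary by endpoint:

```python
import json


def load_notes(path: str) -> list:
    """Load exported search results.

    Tolerates both a bare list of note objects and an assumed
    TikHub-style {"data": {"items": [...]}} envelope.
    """
    with open(path, encoding="utf-8") as fh:
        doc = json.load(fh)
    if isinstance(doc, list):
        return doc
    return doc.get("data", {}).get("items", [])
```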

Quick Start

  1. Gather input sources:
  • Single link: provide the URL directly.
  • Batch links: prepare a text file with one URL per line.
  • If the target page requires a logged-in session: provide a cookie string or a cookie file.
  2. Run the fetch script (page-parsing mode):
python3 scripts/fetch_xiaohongshu_notes.py \
  --url "https://www.xiaohongshu.com/explore/<note_id>" \
  --format both \
  --output result.json
  3. Batch fetch example:
python3 scripts/fetch_xiaohongshu_notes.py \
  --url-file ./urls.txt \
  --format json \
  --output notes.json
  4. Run a TikHub API search (recommended for finding notes by keyword):
python3 scripts/search_notes_tikhub.py \
  --token "<YOUR_TIKHUB_TOKEN>" \
  --keyword "女性主义" \
  --page 1 \
  --output search_page1.json

You can also invoke it through the Makefile (recommended for regular use):

make tikhub-fetch \
  KEYWORD="美食" \
  TIKHUB_TOKEN="<YOUR_TIKHUB_TOKEN>" \
  TIKHUB_ENDPOINT=web \
  TIKHUB_AUTH_MODE=bearer

If you don't want to type the token each time, put it in a file (e.g. ./.tikhub_token):

make tikhub-fetch \
  KEYWORD="美食" \
  TIKHUB_TOKEN_FILE=./.tikhub_token \
  TIKHUB_ENDPOINT=web \
  TIKHUB_AUTH_MODE=bearer
  5. Use another provider's API (generic adapter):
python3 scripts/search_notes_generic.py \
  --base-url "https://your-api.example.com/search_notes" \
  --auth-mode bearer \
  --auth-header Authorization \
  --token "<YOUR_API_TOKEN>" \
  --keyword "美食推荐" \
  --page 1 \
  --param sort_type=general \
  --output generic_search.json
  6. Build an article list from a TikHub response (filtered by likes):
python3 scripts/build_article_list_from_tikhub.py \
  --input ./tikhub_search.json \
  --min-likes 1000 \
  --rank-by hot \
  --md-output ./xhs_article_list.md \
  --csv-output ./xhs_article_list.csv \
  --json-output ./xhs_article_list.json \
  --template-output ./xhs_publish_templates.md
  7. Auto-paginate, fetch, and merge (calls TikHub directly):
python3 scripts/build_article_list_from_tikhub.py \
  --token "<YOUR_TIKHUB_TOKEN>" \
  --keyword "美食" \
  --pages 5 \
  --sort general \
  --note-type _0 \
  --min-likes 1000 \
  --rank-by hot \
  --top 50
  8. Interactive filtering and viewing (asks for your filter choices first, with a default scheme):
python3 scripts/interactive_filter_view.py \
  --input ./tikhub_search.json

Pressing Enter accepts the default scheme:

  • Like threshold: 1000
  • Sort: hot (overall popularity)
  • Count: 20
  • View: summary
  • Also export md/csv
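The default scheme boils down to a filter-sort-truncate pass. A minimal sketch, with liked_count as an assumed field name in the parsed results:

```python
def apply_default_filter(notes, min_likes=1000, top=20):
    """Default scheme: keep notes with at least `min_likes` likes,
    sort by like count descending ("hot"), return the top `top`.

    `liked_count` is an assumed field name; adjust to match the
    actual parsed output.
    """
    kept = [n for n in notes if n.get("liked_count", 0) >= min_likes]
    kept.sort(key=lambda n: n.get("liked_count", 0), reverse=True)
    return kept[:top]
```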

To skip the interaction and apply the defaults directly:

python3 scripts/interactive_filter_view.py \
  --input ./tikhub_search.json \
  --non-interactive
  9. Mind-map output (retained):
python3 scripts/generate_wow_pack.py \
  --input ./xhs_article_list.json \
  --keyword 美食 \
  --url-output ./xhs_topic_mindmap_url.txt
  10. Local browser scraping (no third-party API; recommended):
node scripts/fetch_xiaohongshu_note_playwright.js \
  --url "https://www.xiaohongshu.com/explore/<note_id>" \
  --cookie-file ./cookie.txt \
  --output note_browser.json \
  --screenshot note_browser.png \
  --html-out note_browser.html

If the first run reports playwright_not_installed, install it first:

cd scripts
npm i playwright
npx playwright install chromium

Parameters

  • --url: a single Xiaohongshu note URL.
  • --url-file: batch URL file (one URL per line; # comment lines supported).
  • --cookie: raw Cookie request-header string.
  • --cookie-file: cookie file (plain-text cookie string).
  • --format: json, md, or both; default json.
  • --output: output path; json/both write to this file, and md writes a .md file in the same directory.
  • --timeout: request timeout in seconds; default 20.
  • Key search_notes_tikhub.py parameters:
    • --keyword and --page are required.
    • --sort-type options: general, time_descending, popularity_descending, etc.
    • --note-type accepts Chinese or English: 不限/all, 视频笔记/video, 普通笔记/image, 直播笔记/live.
    • --time-filter accepts Chinese or English: 不限/all, 一天内/day, 一周内/week, 半年内/half_year.
    • --ai-mode takes an integer, 0 or 1.
  • Key search_notes_generic.py parameters:
    • --base-url: the new API's search endpoint.
    • --auth-mode: none, bearer, or apikey.
    • --auth-header: name of the auth header; default Authorization.
    • --keyword-param / --page-param: change these when the remote fields are not keyword/page.
    • --param key=value: extra query parameters; repeatable.
    • --header key=value: extra request headers; repeatable.
  • Key fetch_xiaohongshu_note_playwright.js parameters:
    • --url: the note URL.
    • --cookie-file: browser cookie text (recommended; improves field completeness).
    • --headed: show the browser window for debugging.
    • --screenshot / --html-out: write debug artifacts, useful for diagnosing anti-bot and login pages.

At least one of --url and --url-file is required.

Workflow

  1. Read the URL list.
  2. Send HTTP requests to pull the page HTML (browser UA, optional Cookie).
  3. Extract from the page:
  • OpenGraph metadata (og:title, og:description, og:image)
  • JSON-LD (publish time, author, keywords, interaction stats)
  • noteId from page scripts (if present)
  4. Emit the unified field structure; see references/output-schema.md.
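The extraction in step 3 can be sketched with the standard library alone. The tag names (og:*, application/ld+json) come from the workflow above; the parser class itself is illustrative, not the skill's actual implementation:

```python
import json
from html.parser import HTMLParser


class NoteMetaParser(HTMLParser):
    """Collect OpenGraph <meta> tags and JSON-LD <script> blocks."""

    def __init__(self):
        super().__init__()
        self.og = {}        # e.g. {"og:title": "..."}
        self.json_ld = []   # parsed JSON-LD documents
        self._in_ld = False
        self._buf = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "meta" and (a.get("property") or "").startswith("og:"):
            self.og[a["property"]] = a.get("content", "")
        elif tag == "script" and a.get("type") == "application/ld+json":
            self._in_ld = True
            self._buf = []

    def handle_data(self, data):
        if self._in_ld:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_ld:
            self._in_ld = False
            try:
                self.json_ld.append(json.loads("".join(self._buf)))
            except json.JSONDecodeError:
                pass  # malformed JSON-LD block: skip it


def extract_note_meta(html: str) -> dict:
    parser = NoteMetaParser()
    parser.feed(html)
    return {"og": parser.og, "json_ld": parser.json_ld}
```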

Troubleshooting

  1. Only basic fields returned, or interaction counts missing:
  • Xiaohongshu's front-end structure changes over time, and some interaction fields require a fuller session; add --cookie and retry first.
  2. 403 or an anti-bot page returned:
  • Switch network environments or lower the fetch rate; avoid high concurrency.
  • Only use this on data you are authorized to access; do not bypass platform security mechanisms.
  3. You need deeper fields (e.g. comment details):
  • By default this skill only extracts page-level basics; if you need deeper data, confirm compliance first, then extend it with a separate script.
  4. TikHub 400 Request failed:
  • For the first request, keep only keyword and page; add filters once that succeeds.
  • Don't pass numeric parameters as string-style values: page is an integer, ai_mode is 0/1.
  • When paginating, prefer the search_id or search_session_id returned by the previous page.
  • Get the official default demo parameters working first, then add custom parameters one at a time.
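The pagination tip can be sketched as a small driver loop; fetch_page here is a hypothetical wrapper around the real TikHub call, and the items/search_id response fields are assumptions based on the tip above:

```python
def fetch_all_pages(fetch_page, keyword, pages):
    """Fetch `pages` result pages, threading each response's
    search_id into the next request.

    `fetch_page(keyword, page, search_id)` is a caller-supplied
    wrapper around the real API call; the request/response shape
    here is an assumption.
    """
    items, search_id = [], None
    for page in range(1, pages + 1):
        resp = fetch_page(keyword, page, search_id)
        items.extend(resp.get("items", []))
        # Reuse the session id the API returned for subsequent pages.
        search_id = resp.get("search_id", search_id)
    return items
```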

Compliance Boundaries

  • Only scrape public or authorized content you are entitled to access.
  • Comply with Xiaohongshu's terms of service, robots.txt, and privacy and data-protection requirements.
  • Do not hijack accounts, bypass CAPTCHAs, circumvent anti-scraping measures, or commit other violations.

Resources
