职业教育政策信息抓取工具 | Vocational Education Policy Scraper

Automation

Automatically scrapes and filters vocational education policy documents and project announcements from Chinese government education websites with keyword and...

Install

openclaw skills install zhinao-vocational-policy

概述 | Overview

自动抓取教育部、人社部及各省教育厅官网的职业教育政策文件、课题申报信息,支持按关键词筛选和定期汇总。

Automatically scrapes vocational education policy documents and project application announcements from the official websites of the Ministry of Education, the Ministry of Human Resources and Social Security, and provincial education departments. Supports keyword filtering and periodic summaries.

核心功能 | Core Features

1. 多源数据抓取 | Multi-Source Data Scraping

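支持的数据源 | Supported Sources:

  • 教育部 (Ministry of Education)
  • 人社部 (Ministry of Human Resources and Social Security)
  • 各省教育厅 (Provincial education departments)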

抓取内容 | Content Types:

  • 政策文件 (Policy Documents)
  • 课题申报信息 (Project Applications)
  • 教学成果奖申报 (Teaching Achievement Awards)
  • 产教融合文件 (Industry-Education Integration)
  • 1+X证书政策 (1+X Certificate Policies)
  • 双高计划通知 (Double High Plan Notifications)

2. 智能分类筛选 | Intelligent Classification and Filtering

分类体系 | Classification System:

  • policy: 政策文件 (Policy Documents)
  • project: 课题申报 (Project Applications)
  • achievement: 教学成果奖 (Teaching Achievement Awards)
  • integration: 产教融合 (Industry-Education Integration)
  • certificate: 1+X证书 (1+X Certificates)
  • double_high: 双高计划 (Double High Plan)

筛选功能 | Filtering Capabilities:

  • 关键词筛选 (Keyword Filtering)
  • 时间范围筛选 (Date Range Filtering)
  • 类别筛选 (Category Filtering)
  • 来源筛选 (Source Filtering)
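The four filters can be pictured as independent predicates over one record list. The following is a minimal sketch, not the script's actual implementation: the function name `filter_records`, the title-only keyword matching, and the sample records are assumptions made for illustration, and the record fields follow the JSON output format documented in the workflow section.

```python
from datetime import date, timedelta


def filter_records(records, keywords=None, days=None, category=None,
                   source=None, today=None):
    """Return the records that pass every filter that is set (None = no filter)."""
    today = today or date.today()
    cutoff = today - timedelta(days=days) if days is not None else None
    kept = []
    for rec in records:
        # Keyword filter: keep the record if ANY keyword occurs in the title.
        if keywords and not any(kw in rec["title"] for kw in keywords):
            continue
        # Date filter: "date" is assumed to be an ISO string (YYYY-MM-DD).
        if cutoff is not None and date.fromisoformat(rec["date"]) < cutoff:
            continue
        if category and rec["category"] != category:
            continue
        if source and rec["source"] != source:
            continue
        kept.append(rec)
    return kept


# Made-up sample records, for illustration only.
records = [
    {"title": "关于双高计划绩效评价的通知", "date": "2024-01-15",
     "source": "教育部", "category": "double_high"},
    {"title": "产教融合试点工作通知", "date": "2023-06-01",
     "source": "教育部", "category": "integration"},
]
recent = filter_records(records, keywords=["双高计划"], days=30,
                        today=date(2024, 1, 20))
print([r["title"] for r in recent])
```

Passing `today` explicitly keeps the date filter reproducible in tests; in normal use it can be left out.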

3. 定期汇总 | Periodic Summaries

汇总功能 | Summary Features:

  • 按时间周期汇总 (Daily/Weekly/Monthly summaries)
  • 按主题分类汇总 (Themed summaries)
  • 按地区分类汇总 (Regional summaries)
  • 自动生成报告 (Automatic report generation)
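One way to build such summaries is to count records per period, category, and source. This is a sketch under assumptions: `period_key` and `summarize` are hypothetical helper names, regional grouping is approximated by the `source` field, and the sample records are invented.

```python
from collections import Counter
from datetime import date


def period_key(iso_date, period):
    """Map an ISO date string to a daily, weekly, or monthly bucket label."""
    if period == "daily":
        return iso_date
    if period == "weekly":
        iso = date.fromisoformat(iso_date).isocalendar()
        return f"{iso[0]}-W{iso[1]:02d}"  # ISO year and ISO week number
    if period == "monthly":
        return iso_date[:7]  # "YYYY-MM"
    raise ValueError(f"unknown period: {period}")


def summarize(records, period="weekly"):
    """Count records per time bucket, per category, and per source."""
    return {
        "by_period": Counter(period_key(r["date"], period) for r in records),
        "by_category": Counter(r["category"] for r in records),
        "by_source": Counter(r["source"] for r in records),
    }


# Made-up sample records, for illustration only.
sample = [
    {"date": "2024-01-15", "category": "policy", "source": "教育部"},
    {"date": "2024-01-18", "category": "project", "source": "教育部"},
    {"date": "2024-02-02", "category": "policy", "source": "人社部"},
]
summary = summarize(sample, period="weekly")
for bucket, count in sorted(summary["by_period"].items()):
    print(bucket, count)
```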

快速开始 | Quick Start

基本用法 | Basic Usage

中文示例 | Chinese Examples:

# 抓取最近30天的所有政策文件
python scripts/scrape_voc_ed_policy.py --days 30

# 按关键词筛选(双高计划、产教融合)
python scripts/scrape_voc_ed_policy.py --keywords "双高计划" "产教融合" --days 30

# 按类别筛选(仅政策文件)
python scripts/scrape_voc_ed_policy.py --category policy --days 7

# 综合筛选(关键词 + 类别 + 时间)
python scripts/scrape_voc_ed_policy.py --keywords "1+X证书" --category certificate --days 14

# 保存到指定文件
python scripts/scrape_voc_ed_policy.py --keywords "教学成果奖" --output results.json

English Examples:

# Scrape all policy documents from the last 30 days
python scripts/scrape_voc_ed_policy.py --days 30 --lang en

# Filter by keywords
python scripts/scrape_voc_ed_policy.py --keywords "双高计划" "产教融合" --days 30 --lang en

# Filter by category
python scripts/scrape_voc_ed_policy.py --category policy --days 7 --lang en

# Combined filtering (keyword + category + time range)
python scripts/scrape_voc_ed_policy.py --keywords "1+X证书" --category certificate --days 14 --lang en

# Save to specified file
python scripts/scrape_voc_ed_policy.py --keywords "教学成果奖" --output results.json --lang en

命令行参数 | Command Line Arguments

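参数 (Parameter)   说明 (Description)                              示例 (Example)
--keywords         关键词列表 (List of keywords)                    --keywords "双高计划" "产教融合"
--days             回溯天数,默认30 (Days to look back, default 30)  --days 7
--category         筛选类别 (Category to filter by)                 --category policy
--output           输出文件路径 (Output file path)                  --output results.json
--lang             语言,zh 或 en (Language, zh or en)               --lang zh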
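For readers who want to see how these flags fit together, here is a sketch of an equivalent `argparse` definition. It is not taken from `scripts/scrape_voc_ed_policy.py`: the `build_parser` name is hypothetical, the category choices are copied from the classification list above, and the `zh` default for `--lang` is an assumption (the table only documents the 30-day default for `--days`).

```python
import argparse

CATEGORIES = ["policy", "project", "achievement",
              "integration", "certificate", "double_high"]


def build_parser():
    """Build a parser mirroring the documented command-line flags."""
    parser = argparse.ArgumentParser(
        description="Vocational education policy scraper (illustrative sketch)")
    parser.add_argument("--keywords", nargs="+", default=[],
                        help="one or more keywords to match")
    parser.add_argument("--days", type=int, default=30,
                        help="number of days to look back (default 30)")
    parser.add_argument("--category", choices=CATEGORIES,
                        help="restrict results to one category")
    parser.add_argument("--output", help="path of the JSON file to write")
    # Assumption: the real default for --lang is not documented.
    parser.add_argument("--lang", choices=["zh", "en"], default="zh",
                        help="output language")
    return parser


args = build_parser().parse_args(
    ["--keywords", "双高计划", "产教融合", "--days", "7"])
print(args.keywords, args.days, args.category, args.lang)
```

`nargs="+"` is what lets several quoted keywords follow a single `--keywords` flag, as in the examples above.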

工作流程 | Workflow

步骤 1: 确定抓取需求 | Step 1: Determine Scraping Requirements

中文: 明确需要抓取的内容类型、时间范围、关键词和类别。

English: Clarify the content type, time range, keywords, and category needed.

示例 | Example:

  • 内容:双高计划相关政策 (Double High Plan policies)
  • 时间:最近30天 (Last 30 days)
  • 类别:政策文件 (Policy documents)

步骤 2: 运行抓取脚本 | Step 2: Run Scraping Script

中文: 根据需求配置参数,运行抓取脚本。

English: Configure parameters based on requirements and run the scraping script.

python scripts/scrape_voc_ed_policy.py --keywords "双高计划" --category policy --days 30

步骤 3: 查看结果 | Step 3: Review Results

中文: 抓取完成后,查看生成的JSON文件或终端输出摘要。

English: After scraping is complete, review the generated JSON file or terminal summary.

输出格式 | Output Format:

{
  "websites_scraped": 3,
  "total_documents": 45,
  "results": [
    {
      "title": "教育部关于公布中国特色高水平高职学校和专业建设计划名单的通知",
      "url": "https://www.moe.gov.cn/...",
      "date": "2024-01-15",
      "source": "教育部",
      "category": "policy",
      "keywords": ["双高计划", "高职学校"]
    }
  ],
  "errors": [],
  "timestamp": "2024-01-20T10:30:00",
  "filters": {
    "keywords": ["双高计划"],
    "days": 30,
    "category": "policy"
  }
}
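A short sketch of how a report in this shape could be checked and summarized after a run. The helper name `summarize_report` and the list of required fields are assumptions derived from the sample; the inline record is shaped like the sample output, and in practice the report would be read from the `--output` file with `json.load`.

```python
import json
from collections import Counter

# Fields every result record is assumed to carry, based on the sample output.
REQUIRED_FIELDS = {"title", "url", "date", "source", "category", "keywords"}


def summarize_report(report):
    """Return (summary line, documents per source, titles of incomplete records)."""
    results = report["results"]
    incomplete = [r.get("title", "<untitled>") for r in results
                  if not REQUIRED_FIELDS.issubset(r)]
    per_source = Counter(r.get("source", "unknown") for r in results)
    line = (f"{report['total_documents']} documents from "
            f"{report['websites_scraped']} websites, "
            f"{len(report['errors'])} errors")
    return line, dict(per_source), incomplete


report = json.loads("""
{
  "websites_scraped": 3,
  "total_documents": 45,
  "results": [
    {"title": "教育部关于公布中国特色高水平高职学校和专业建设计划名单的通知",
     "url": "https://www.moe.gov.cn/...",
     "date": "2024-01-15", "source": "教育部",
     "category": "double_high", "keywords": ["双高计划", "高职学校"]}
  ],
  "errors": []
}
""")
line, per_source, incomplete = summarize_report(report)
print(line)
```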

步骤 4: 数据分析和汇总 | Step 4: Data Analysis and Summary

中文: 根据抓取结果进行分析,生成汇总报告。

English: Analyze the scraped results and generate summary reports.

高级功能 | Advanced Features

定期抓取设置 | Scheduled Scraping Setup

中文: 使用 cron 设置定期抓取任务。

English: Use cron jobs to schedule recurring scraping tasks. In a crontab entry, % must be escaped as \%.

# 每天早上8点抓取最近30天的政策文件 (Daily at 08:00: scrape the last 30 days)
0 8 * * * python /path/to/scripts/scrape_voc_ed_policy.py --days 30 --output /path/to/results/daily_$(date +\%Y\%m\%d).json

# 每周一早上8点抓取最近7天的政策文件 (Mondays at 08:00: scrape the last 7 days)
0 8 * * 1 python /path/to/scripts/scrape_voc_ed_policy.py --days 7 --output /path/to/results/weekly_$(date +\%Y\%m\%d).json

自定义网站配置 | Custom Website Configuration

中文: 在脚本中添加新的网站配置。

English: Add new website configurations in the script.

EDU_WEBSITES = {
    "新增网站": {
        "base_url": "https://example.gov.cn",
        "policy_url": "https://example.gov.cn/policy/",
        "selectors": {
            "title": "a[title]",
            "date": ".date",
            "link": "a[href]"
        },
        "keywords": ["职业教育", "政策"]
    }
}
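Before running the scraper against a new entry, it can help to check that the entry has the same shape as the example above. The validator below is a hypothetical helper, not part of the script; it assumes every entry needs `base_url`, `policy_url`, the three selectors shown, and at least one keyword.

```python
from urllib.parse import urlparse

REQUIRED_SELECTORS = ("title", "date", "link")


def validate_site(name, cfg):
    """Return a list of problems found in one EDU_WEBSITES entry (empty = OK)."""
    problems = []
    for key in ("base_url", "policy_url"):
        parts = urlparse(cfg.get(key, ""))
        if parts.scheme not in ("http", "https") or not parts.netloc:
            problems.append(f"{name}: {key} is missing or not an http(s) URL")
    selectors = cfg.get("selectors", {})
    for sel in REQUIRED_SELECTORS:
        if not selectors.get(sel):
            problems.append(f"{name}: selector '{sel}' is missing")
    if not cfg.get("keywords"):
        problems.append(f"{name}: no keywords configured")
    return problems


new_site = {
    "base_url": "https://example.gov.cn",
    "policy_url": "https://example.gov.cn/policy/",
    "selectors": {"title": "a[title]", "date": ".date", "link": "a[href]"},
    "keywords": ["职业教育", "政策"],
}
print(validate_site("新增网站", new_site))
```

An empty list means the entry has all the assumed fields; it does not confirm that the selectors match the live page, which still needs a test run.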

结果导出和格式化 | Result Export and Formatting

中文: 将JSON结果转换为其他格式(CSV、Markdown、HTML)。

English: Convert JSON results to other formats (CSV, Markdown, HTML).

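import json

import pandas as pd

# 读取 --output 生成的JSON文件 (Load the JSON file written via --output)
with open('results.json', encoding='utf-8') as f:
    results = json.load(f)

# 导出为CSV (Export to CSV; utf-8-sig lets Excel detect the encoding)
df = pd.DataFrame(results['results'])
df.to_csv('results.csv', index=False, encoding='utf-8-sig')

# 导出为Markdown (Export to Markdown)
def to_markdown(results):
    md = "# 职业教育政策抓取结果\n\n"
    for item in results['results']:
        md += f"## {item['title']}\n"
        md += f"- **来源**: {item['source']}\n"
        md += f"- **日期**: {item['date']}\n"
        md += f"- **链接**: {item['url']}\n\n"
    return md

with open('results.md', 'w', encoding='utf-8') as f:
    f.write(to_markdown(results))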
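The HTML format mentioned above has no example, so here is one possible sketch using only the standard library. The function name `to_html` and the table layout are assumptions; the record fields follow the documented output format, and the sample record is invented.

```python
import html


def to_html(results):
    """Render the scraped records as a simple HTML table."""
    rows = []
    for item in results["results"]:
        # Escape every value so titles containing <, > or & cannot break the markup.
        title = html.escape(item["title"])
        url = html.escape(item["url"], quote=True)
        source = html.escape(item["source"])
        day = html.escape(item["date"])
        rows.append(f'<tr><td><a href="{url}">{title}</a></td>'
                    f"<td>{source}</td><td>{day}</td></tr>")
    header = "<tr><th>Title</th><th>Source</th><th>Date</th></tr>"
    return "<table>\n" + header + "\n" + "\n".join(rows) + "\n</table>"


# Made-up record, for illustration only.
demo = {"results": [{"title": "A <draft> & final notice",
                     "url": "https://example.gov.cn/policy/1.html",
                     "source": "教育部", "date": "2024-01-15"}]}
page = to_html(demo)
print(page)
```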

资源说明 | Resources

scripts/scrape_voc_ed_policy.py

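核心抓取脚本,支持 | Core scraping script. Supports:

  • 多网站并行抓取 (Parallel scraping of multiple websites)
  • 智能关键词匹配 (Intelligent keyword matching)
  • 自动分类 (Automatic classification)
  • 双语输出 (Bilingual output)
  • 错误处理 (Error handling)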

references/edu_websites.md

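教育部、人社部及各省教育厅官网列表,包含 | List of official websites of the Ministry of Education, the Ministry of Human Resources and Social Security, and provincial education departments. Contains:

  • 官网URL (Official website URLs)
  • 职业教育专栏URL (Vocational education section URLs)
  • 常见关键词 (Common keywords)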

i18n_helper.py

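国际化辅助模块,支持 | Internationalization helper module. Supports:

  • 自动语言检测 (Automatic language detection)
  • 双语输出 (Bilingual output)
  • 可扩展的语言支持 (Extensible language support)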

i18n.json

翻译文件,包含 | Translation file. Contains:

  • UI字符串 (UI strings)
  • 错误消息 (Error messages)
  • 分类名称 (Category names)
  • 帮助文本 (Help text)

注意事项 | Important Notes

网站访问限制 | Website Access Restrictions

中文:

  1. 部分政府网站可能有反爬虫机制,需要设置合理的访问频率
  2. 建议每次抓取间隔至少1秒
  3. 避免短时间内大量请求
  4. 遵守网站的robots.txt规则

English:

  1. Some government websites may have anti-scraping mechanisms, so keep the request rate reasonable
  2. Wait at least 1 second between requests
  3. Avoid sending a large number of requests in a short period
  4. Follow each website's robots.txt rules
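The robots.txt and request-interval guidance above can be sketched with the standard library. This is an illustration under assumptions, not the script's own logic: the user agent string, the helper names, and the sample robots.txt rules are hypothetical, and `fetch` stands in for whatever HTTP call the scraper uses.

```python
import time
from urllib.robotparser import RobotFileParser

USER_AGENT = "voc-ed-policy-scraper"  # hypothetical user agent string


def load_robots(lines):
    """Build a parser from robots.txt lines (use set_url() + read() for a live file)."""
    parser = RobotFileParser()
    parser.parse(lines)
    return parser


def seconds_to_wait(last_request, now, min_interval=1.0):
    """How long to pause so that requests are at least min_interval apart."""
    if last_request is None:
        return 0.0
    return max(0.0, min_interval - (now - last_request))


def polite_get(url, robots, fetch, state, min_interval=1.0):
    """Call fetch(url) only if robots.txt allows it, pausing first when needed."""
    if not robots.can_fetch(USER_AGENT, url):
        return None
    time.sleep(seconds_to_wait(state.get("last"), time.monotonic(), min_interval))
    state["last"] = time.monotonic()
    return fetch(url)


# Sample rules, for illustration only.
robots = load_robots(["User-agent: *", "Disallow: /admin/"])
print(robots.can_fetch(USER_AGENT, "https://example.gov.cn/policy/"))
print(robots.can_fetch(USER_AGENT, "https://example.gov.cn/admin/login"))
```

Keeping the timestamp in a small `state` dict lets one interval be shared across all requests to the same site.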

数据准确性 | Data Accuracy

中文:

  1. 官方网站可能随时更新,URL结构可能变化
  2. 不同网站的HTML结构差异大,需要针对性解析
  3. 日期格式不统一,需要灵活解析
  4. 建议定期验证抓取结果的准确性

English:

  1. Official websites may be updated at any time, and URL structures may change
  2. HTML structures vary significantly between websites, requiring targeted parsing
  3. Date formats are not uniform, requiring flexible parsing
  4. Regularly verify the accuracy of scraped results
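As an example of the flexible date parsing mentioned above, the sketch below normalizes a few common layouts (2024-01-15, 2024/01/15, 2024.1.5, 2024年1月15日) into `datetime.date` objects. The function name `parse_date` and the set of supported layouts are assumptions; real pages may use formats this pattern does not cover.

```python
import re
from datetime import date

# Year, month, and day separated by -, /, . or the Chinese 年/月 markers.
_DATE_RE = re.compile(
    r"(\d{4})\s*[-/.年]\s*(\d{1,2})\s*[-/.月]\s*(\d{1,2})\s*日?")


def parse_date(text):
    """Return the first date found in text, or None if nothing valid is found."""
    match = _DATE_RE.search(text)
    if not match:
        return None
    year, month, day = (int(part) for part in match.groups())
    try:
        return date(year, month, day)
    except ValueError:  # e.g. month 13 or day 40
        return None


for raw in ["2024-01-15", "发布时间:2024年1月15日", "[2024/01/15]", "2024.1.5"]:
    print(raw, "->", parse_date(raw))
```

Returning `None` instead of raising lets the caller decide whether to skip the record or log it for manual review.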

法律合规 | Legal Compliance

中文:

  1. 仅用于学习和研究目的
  2. 不得用于商业用途
  3. 遵守相关法律法规
  4. 尊重网站版权和使用条款

English:

  1. For learning and research purposes only
  2. Not for commercial use
  3. Comply with relevant laws and regulations
  4. Respect website copyright and terms of use

常见问题 | FAQ

Q1: 如何添加新的网站? | How do I add a new website?

中文:

  1. 在 references/edu_websites.md 中添加网站信息
  2. 在 scripts/scrape_voc_ed_policy.py 的 EDU_WEBSITES 字典中添加配置
  3. 测试新的网站配置
  4. 更新相关文档

English:

  1. Add website information in references/edu_websites.md
  2. Add configuration in EDU_WEBSITES dictionary in scripts/scrape_voc_ed_policy.py
  3. Test the new website configuration
  4. Update relevant documentation

Q2: 如何优化抓取速度? | How do I speed up scraping?

中文:

  1. 使用多线程/异步请求
  2. 缓存已抓取的内容
  3. 减少不必要的数据提取
  4. 使用更快的HTML解析库

English:

  1. Use multi-threading/async requests
  2. Cache scraped content
  3. Reduce unnecessary data extraction
  4. Use faster HTML parsing libraries
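The first two suggestions, concurrent requests and caching, can be combined in a few lines. This is a sketch, not the script's code: `fetch_all` is a hypothetical helper, the cache is a plain in-memory dict, and `fetch` is any function that takes a URL and returns its content (a stand-in is used here so the example runs offline).

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(urls, fetch, cache=None, max_workers=4):
    """Fetch urls concurrently, skipping anything already in the cache."""
    cache = {} if cache is None else cache
    # dict.fromkeys removes duplicates while keeping the original order.
    pending = [u for u in dict.fromkeys(urls) if u not in cache]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for url, body in zip(pending, pool.map(fetch, pending)):
            cache[url] = body
    return {u: cache[u] for u in urls}


def demo_fetch(url):
    """Stand-in for a real HTTP request."""
    return f"<html>{url}</html>"


cache = {}
pages = fetch_all(["https://example.gov.cn/a", "https://example.gov.cn/b",
                   "https://example.gov.cn/a"], demo_fetch, cache)
print(len(pages), "pages,", len(cache), "cached")
```

Concurrency raises the request rate, so `max_workers` should stay small, and requests to the same site should still respect the interval recommended under Website Access Restrictions.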

Q3: 如何处理抓取错误? | How do I handle scraping errors?

中文:

  1. 查看错误日志
  2. 检查网络连接
  3. 验证网站URL和选择器
  4. 增加重试机制和错误处理

English:

  1. Check error logs
  2. Verify network connection
  3. Validate website URLs and selectors
  4. Add retry mechanisms and error handling
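A retry mechanism of the kind suggested in step 4 might look like the sketch below. The helper name `with_retries`, the three-attempt limit, and the doubling delay are assumptions chosen for illustration; `OSError` is used as the default because `urllib.error.URLError` and timeout errors derive from it.

```python
import time


def with_retries(func, attempts=3, base_delay=1.0, sleep=time.sleep,
                 retry_on=(OSError,)):
    """Call func(), retrying on the given exceptions with a doubling delay."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except retry_on as exc:
            if attempt == attempts:
                raise  # out of attempts: let the caller record the error
            delay = base_delay * 2 ** (attempt - 1)
            print(f"attempt {attempt} failed ({exc}); retrying in {delay}s")
            sleep(delay)


# Stand-in for a network call that fails twice before succeeding.
outcomes = iter([OSError("timeout"), OSError("reset"), "page content"])


def flaky():
    result = next(outcomes)
    if isinstance(result, Exception):
        raise result
    return result


# A no-op sleep keeps the demonstration instant.
print(with_retries(flaky, sleep=lambda seconds: None))
```

Errors that still fail after the last attempt propagate to the caller, which can then append them to the `errors` list in the output.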

扩展建议 | Extension Suggestions

未来可能的改进 | Potential Future Improvements

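  1. 增量抓取 (Incremental scraping): 只抓取新增和更新的内容 (fetch only new and updated content)
  2. 智能推荐 (Smart recommendations): 基于用户历史行为推荐相关政策 (recommend related policies based on user history)
  3. 全文搜索 (Full-text search): 支持对抓取内容的全文检索 (search the full text of scraped content)
  4. 可视化分析 (Visual analysis): 生成图表和可视化报告 (generate charts and visual reports)
  5. 邮件通知 (Email notifications): 新政策发布时自动发送通知 (send a notification automatically when a new policy is published)
  6. 多格式输出 (Multi-format output): 支持导出为PDF、Word等格式 (export to PDF, Word, and other formats)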

贡献指南 | Contribution Guidelines

中文: 欢迎提交问题和改进建议。在提交PR之前,请确保:

  1. 代码符合PEP 8规范
  2. 添加必要的注释和文档
  3. 测试新增功能
  4. 更新相关文档

English: Issues and improvement suggestions are welcome. Before submitting a PR, make sure that:

  1. The code follows PEP 8
  2. Necessary comments and documentation are added
  3. New features are tested
  4. Relevant documentation is updated

版本: 1.0.0 | Version: 1.0.0
最后更新: 2024年 | Last Updated: 2024