Install
openclaw skills install chat2duckdb基于 DuckDB 引擎的高效数据分析工具;当用户需要对 CSV/JSON/Parquet/Excel 等数据文件进行 SQL 查询、数据分析、数据抽样或需要自动纠错的查询执行时使用
openclaw skills install chat2duckdbduckdb>=1.5.0
pandas>=2.0.0
确认数据文件路径(CSV/JSON/Parquet/Excel 等格式)
# 完整统计模式(推荐)
python scripts/duckdb_analyzer.py --file_path ./data.csv --mode describe
# 简单模式(仅基本信息)
python scripts/duckdb_analyzer.py --file_path ./data.csv --mode describe --simple
# 导出分析报告
python scripts/duckdb_analyzer.py --file_path ./data.csv --mode describe --output report.json
# Excel 文件(默认读取第一个工作表)
python scripts/duckdb_analyzer.py --file_path ./data.xlsx --mode describe
# Excel 文件(指定工作表)
python scripts/duckdb_analyzer.py --file_path ./data.xlsx --excel_sheet "sheetTitle" --mode describe
# 基础查询
python scripts/duckdb_analyzer.py --file_path ./data.csv --mode query \
--sql "SELECT * FROM data LIMIT 10"
# 聚合查询
python scripts/duckdb_analyzer.py --file_path ./data.csv --mode query \
--sql "SELECT category, SUM(price * quantity) as total_sales FROM data GROUP BY category"
# 抽样验证(先在小样本上测试)
python scripts/duckdb_analyzer.py --file_path ./data.csv --mode query \
--sql "SELECT * FROM data WHERE price > 100" --sample_fraction 0.1
# 导出查询结果(支持 CSV/Excel/JSON/Parquet)
python scripts/duckdb_analyzer.py --file_path ./data.csv --mode query \
--sql "SELECT * FROM data" --output result.csv
python scripts/duckdb_analyzer.py --file_path ./data.csv --mode query \
--sql "SELECT * FROM data" --output result.xlsx
# 持久化到 DuckDB 文件(后续可直接关联查询)
python scripts/duckdb_analyzer.py --file_path ./data.csv --mode query \
--persist_db_path ./analysis.duckdb --persist_table \
--sql "SELECT category, SUM(price * quantity) as total_sales FROM data GROUP BY category"
--persist_db_path:指定 DuckDB 数据库文件路径--persist_table:将注册表持久化为普通表(默认是临时表)--sample_fraction 参数在小样本上验证--max_retries 参数调整重试次数场景:拿到新数据集,需要了解数据结构和质量
命令:
python scripts/duckdb_analyzer.py --file_path ./sales_data.csv --mode describe
输出包含:
场景:分析各类别产品的销售表现
命令:
python scripts/duckdb_analyzer.py --file_path ./sales_data.csv --mode query \
--sql "SELECT category, COUNT(*) as num_products, SUM(price * quantity) as total_revenue, AVG(price) as avg_price FROM data GROUP BY category ORDER BY total_revenue DESC"
输出:
执行 SQL: SELECT category, COUNT(*) as num_products, SUM(price * quantity) as total_revenue, AVG(price) as avg_price FROM data GROUP BY category ORDER BY total_revenue DESC
【查询结果】
执行时间:0.05 秒
重试次数:0
结果行数:2
数据预览:
category num_products total_revenue avg_price
Electronics 12 42938.24 356.99
Furniture 8 19949.23 356.99
业务洞察:
场景:分析不同区域的销售情况
命令:
python scripts/duckdb_analyzer.py --file_path ./sales_data.csv --mode query \
--sql "SELECT region, COUNT(*) as num_orders, SUM(price * quantity) as total_sales, AVG(price) as avg_order_value FROM data GROUP BY region ORDER BY total_sales DESC"
场景:找出高价产品(price > 200),先在 10% 样本上验证
命令:
# 先在样本上验证
python scripts/duckdb_analyzer.py --file_path ./sales_data.csv --mode query \
--sql "SELECT product_name, category, price FROM data WHERE price > 200" --sample_fraction 0.1
# 验证无误后执行完整查询
python scripts/duckdb_analyzer.py --file_path ./sales_data.csv --mode query \
--sql "SELECT product_name, category, price FROM data WHERE price > 200 ORDER BY price DESC"
场景:SQL 有语法错误(多余分号),系统自动校正
命令:
python scripts/duckdb_analyzer.py --file_path ./sales_data.csv --mode query \
--sql "SELECT * FROM data WHERE price > 100;"
输出:
【SQL 校正记录】
✓ 语法校正:;\s*$ ->
【查询结果】
执行时间:0.03 秒
重试次数:1
结果行数:15
场景:将查询结果保存为 CSV、Excel、JSON 或 Parquet 文件
CSV 导出:
python scripts/duckdb_analyzer.py --file_path ./sales_data.csv --mode query \
--sql "SELECT category, region, SUM(price * quantity) as sales FROM data GROUP BY category, region" \
--output sales_summary.csv
Excel 导出:
python scripts/duckdb_analyzer.py --file_path ./sales_data.csv --mode query \
--sql "SELECT category, region, SUM(price * quantity) as sales FROM data GROUP BY category, region" \
--output sales_summary.xlsx
JSON 导出:
python scripts/duckdb_analyzer.py --file_path ./sales_data.csv --mode query \
--sql "SELECT category, region, SUM(price * quantity) as sales FROM data GROUP BY category, region" \
--output sales_summary.json
Parquet 导出:
python scripts/duckdb_analyzer.py --file_path ./sales_data.csv --mode query \
--sql "SELECT category, region, SUM(price * quantity) as sales FROM data GROUP BY category, region" \
--output sales_summary.parquet
结果:根据文件扩展名自动选择导出格式,保存为相应文件
场景:分析销售趋势
命令:
python scripts/duckdb_analyzer.py --file_path ./sales_data.csv --mode query \
--sql "SELECT DATE_TRUNC('month', sale_date) as month, SUM(price * quantity) as monthly_sales FROM data GROUP BY month ORDER BY month"
Q1: 文件找不到?
错误:数据文件不存在:./data.csv
解决:检查文件路径是否正确,使用绝对路径试试
Q2: Excel 读取失败?
错误:无法注册数据表:...
解决:
.xlsx 或 .xls--excel_sheet "工作表名"openpyxlQ3: SQL 执行失败? 系统会自动重试和校正 SQL,如果仍然失败,检查:
Q4: 内存不足? 解决:
--sample_fraction 0.1先用 describe 了解数据,再执行多个查询:
# 步骤 1:探索数据
python scripts/duckdb_analyzer.py --file_path ./data.csv --mode describe
# 步骤 2:基于了解执行针对性查询
python scripts/duckdb_analyzer.py --file_path ./data.csv --mode query \
--sql "SELECT category, AVG(price) as avg_price FROM data GROUP BY category"
对于大数据集:
# 先用 1% 样本快速验证
python scripts/duckdb_analyzer.py --file_path ./large_data.csv --mode query \
--sql "SELECT ..." --sample_fraction 0.01
# 验证通过后再执行完整查询
python scripts/duckdb_analyzer.py --file_path ./large_data.csv --mode query \
--sql "SELECT ..."
python scripts/duckdb_analyzer.py --file_path ./data.csv --mode describe | grep "缺失"
,;() 转 ,;())datainf/-inf -> NULL)data 表生成SELECT "销售渠道", SUM("售后退款-仅退货金额") AS total_return
FROM data
GROUP BY "销售渠道"
ORDER BY total_return DESC
LIMIT 10
SELECT 销售渠道, SUM(售后退款-仅退货金额) AS total_return
FROM data
GROUP BY 销售渠道