Install

openclaw skills install clean-csv-toolkit

clean-csv-toolkit v0.1.0

Local CSV / TSV / JSONL inspection and cleanup toolkit. Profile a tabular file (row count, auto-detected column types, nulls, distincts, samples), validate it against a small JSON schema, deduplicate by full row or key columns, diff two files by key, and convert between csv/tsv/jsonl/json/markdown. Pure Python 3 standard library, no pandas, no remote calls.
A small honest toolkit for the work agents end up doing constantly: read a CSV someone sent you, work out what's in it, clean it up, and forward only the safe rows downstream. Built on Python 3 standard library only. No pandas, no numpy, no pip installs, no remote calls.
- scripts/inspect.py — profile a .csv / .tsv / .jsonl file: row count, auto-detected column types (int, float, bool, date, datetime, string, empty), null counts per column, distinct value counts (capped), three sample values per column, file size, and detected encoding.
- scripts/validate.py — check the file against a small JSON schema (required columns, per-column type, min/max, enum, regex, unique). Exits 0/1 so it slots into CI.
- scripts/dedupe.py — remove duplicate rows by full-row match or by key columns. Optional --keep first|last, --case-insensitive, --trim, and a JSONL report of every removed row.
- scripts/diff.py — compare two files by key column(s) and classify every row as added / removed / changed / unchanged, with a per-column before/after diff for changed rows.
- scripts/convert.py — convert between CSV, TSV, JSON Lines, JSON array, and GitHub-flavored Markdown table.
- scripts/check_deps.sh — verify python3 is available:

bash scripts/check_deps.sh
Only python3 is required. The skill uses csv, json, re, pathlib, argparse, datetime, collections — all stdlib.
python3 scripts/inspect.py customers.csv
Output:
file: /path/customers.csv
size: 284 B (284 bytes)
encoding: utf-8
kind: csv
rows: 5
columns: 6
# name type nulls null% distinct sample
----------------------------------------------------------------------------------------------------
1 id int 0 0.00 5 '1', '2', '3'
2 email string 0 0.00 5 'alice@example.com', ...
3 name string 0 0.00 5 'Alice', 'Bob', 'Carol'
4 amount float 1 20.00 4 '42.50', '100.00', '7.25'
5 status string 0 0.00 3 'approved', 'pending', ...
6 signup_date date 0 0.00 5 '2025-01-15', ...
Pass --json for machine-readable output that pipes into other tools.
The script auto-detects the dialect (CSV vs TSV vs JSON Lines) and a sensible encoding (utf-8, utf-8-sig, cp1252, latin-1). Type inference takes up to 1000 non-empty values per column and picks the most specific type that fits all of them.
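The "most specific type that fits all sampled values" rule can be sketched roughly like this (simplified: the real script also infers bool and datetime, and the helper names here are illustrative, not the script's internals):

```python
import re

def is_int(v):
    return re.fullmatch(r"[+-]?\d+", v) is not None

def is_date(v):
    return re.fullmatch(r"\d{4}-\d{2}-\d{2}", v) is not None

def is_float(v):
    try:
        float(v)
        return True
    except ValueError:
        return False

def infer_type(values, cap=1000):
    # Take up to `cap` non-empty values, then return the first (most
    # specific) type that every sampled value satisfies.
    sample = [v for v in values if v][:cap]
    if not sample:
        return "empty"
    for name, check in (("int", is_int), ("date", is_date), ("float", is_float)):
        if all(check(v) for v in sample):
            return name
    return "string"
```

Ordering matters: every int also parses as a float, so the int check must run first for the result to be the most specific type.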
Write a schema.json:
{
"required_columns": ["id", "email", "amount", "status"],
"columns": {
"id": {"type": "int", "required": true, "unique": true, "min": 1},
"email": {"type": "string", "required": true, "regex": ".+@.+\\..+"},
"amount": {"type": "float", "min": 0, "max": 100000},
"status": {"type": "string", "enum": ["pending", "approved", "rejected"]},
"signup_date": {"type": "date"}
}
}
Then:
python3 scripts/validate.py customers.csv --schema schema.json
A clean file exits 0 with verdict: pass. A bad file exits 1 with a detailed error table:
row column kind detail
------------------------------------------------------------------------------------------------
2 email regex_mismatch value did not match regex | value='not-an-email'
2 amount bad_type value does not match type 'float' | value='abc'
3 amount below_min value -50.0 < min 0 | value='-50.00'
3 status not_in_enum value not in allowed set | value='unknown_status'
4 id duplicate_unique value already seen earlier in this column | value='1'
Pass --json for a structured report and --max-errors N to cap collection on huge files.
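The per-column rules compose in a straightforward way. A minimal sketch of one value check, returning the same error "kind" strings the report uses (the `check_value` function is hypothetical, not the script's actual API):

```python
import re

def check_value(value, rule):
    # Returns an error kind string, or None if the value passes.
    num = None
    if rule.get("type") == "float":
        try:
            num = float(value)
        except ValueError:
            return "bad_type"
    if "regex" in rule and not re.fullmatch(rule["regex"], value):
        return "regex_mismatch"
    if "enum" in rule and value not in rule["enum"]:
        return "not_in_enum"
    if num is not None:
        if "min" in rule and num < rule["min"]:
            return "below_min"
        if "max" in rule and num > rule["max"]:
            return "above_max"
    return None
```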
By full-row match (any two rows identical in every column):
python3 scripts/dedupe.py messy.csv clean.csv
By a key column (only one canonical row per id):
python3 scripts/dedupe.py messy.csv clean.csv --key id \
--removed-report removed.jsonl
--keep first (default) keeps the earlier-occurring row; --keep last keeps the later one — useful when later rows are corrections. --case-insensitive and --trim normalise key values before comparison so " alice@example.com" and "ALICE@example.com" collapse to one row.
The --removed-report writes one JSON object per removed row, with the original 1-based row index, the key tuple that was duplicated, and the full row, so the dedup decision is auditable.
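The keep/normalise semantics can be sketched in a few lines (illustrative names, assuming rows as dicts; the real script streams CSV and also writes the removed-row report):

```python
def dedupe(rows, key_cols, keep="first", trim=False, casefold=False):
    def norm(v):
        if trim:
            v = v.strip()
        if casefold:
            v = v.casefold()
        return v

    seen = {}   # normalised key tuple -> the row we keep
    order = []  # key tuples in first-seen order, so output order is stable
    for row in rows:
        k = tuple(norm(row[c]) for c in key_cols)
        if k not in seen:
            order.append(k)
            seen[k] = row
        elif keep == "last":
            seen[k] = row  # later duplicate overrides: treated as a correction
    return [seen[k] for k in order]
```

Note that even with --keep last, the kept row stays at the position where the key first appeared, so the output order is deterministic.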
python3 scripts/diff.py customers_old.csv customers_new.csv --key id
Output:
added: 1
removed: 1
changed: 1
--- ADDED (1) ---
+ 6
--- REMOVED (1) ---
- 4
--- CHANGED (1) ---
~ 2
amount: '100.00' -> '150.00'
status: 'pending' -> 'approved'
Multi-column keys are supported: --key customer_id,date. Exit codes are 0 if the files are identical on the key columns, 1 if they differ — so this also works as a CI guard ("fail the build if the snapshot file changed").
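The added/removed/changed/unchanged classification is essentially a set comparison on key tuples. A rough sketch, assuming rows as dicts (names are illustrative):

```python
def diff_by_key(old_rows, new_rows, key_cols):
    def key(row):
        return tuple(row[c] for c in key_cols)

    old = {key(r): r for r in old_rows}
    new = {key(r): r for r in new_rows}
    added = [k for k in new if k not in old]
    removed = [k for k in old if k not in new]
    changed, unchanged = [], []
    for k in old.keys() & new.keys():
        # Per-column (before, after) pairs for columns whose values differ.
        delta = {c: (old[k][c], new[k][c])
                 for c in old[k] if old[k][c] != new[k].get(c)}
        (changed if delta else unchanged).append((k, delta))
    return added, removed, changed, unchanged
```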
python3 scripts/convert.py data.csv data.jsonl # row -> JSON Lines
python3 scripts/convert.py data.jsonl data.csv # back
python3 scripts/convert.py data.csv data.json --pretty
python3 scripts/convert.py data.csv data.md # GitHub-flavored table
python3 scripts/convert.py data.tsv data.csv # delimiter change
Output format is picked from the extension. Allowed extensions: .csv, .tsv, .jsonl, .json, .md. The Markdown writer escapes | and \n in cell values so the table stays well-formed.
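The escaping the Markdown writer needs is small: a literal pipe would end the cell and a raw newline would end the row. A sketch of the idea (function names are illustrative):

```python
def md_escape(cell):
    # Backslash-escape pipes and flatten newlines so the table row
    # stays a single well-formed line.
    return cell.replace("|", "\\|").replace("\n", "\\n")

def md_row(cells):
    return "| " + " | ".join(md_escape(c) for c in cells) + " |"
```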
| Code | Meaning |
|---|---|
| 0 | success / validation pass / files identical |
| 1 | validation fail / files differ / no rows in input |
| 2 | bad arguments / unsafe path / missing input / unsupported extension / schema malformed |
This 0/1/2 split is consistent across all five scripts, so they slot into shell pipelines cleanly:
python3 scripts/validate.py incoming.csv --schema schema.json \
&& python3 scripts/dedupe.py incoming.csv clean.csv --key id \
&& python3 scripts/inspect.py clean.csv
Safety: scripts are meant to be run as direct subprocess calls. No shell invocation, so shell metacharacters in filenames and arguments (;, |, &, >, <, $, `, backslash-newline, etc.) are never interpreted.

Performance: inspect.py profiles 10,000 rows in well under one second on a single core (single-pass streaming read). The exceptions are dedupe.py and diff.py, which build an in-memory dict keyed by row identity — fine for hundreds of thousands of rows on a typical laptop.

Known limitation: a locale-formatted number like "1,234.56" is detected as string, not float. Re-export with a different number format if you need different inference.

License: MIT. See LICENSE.