Clean CSV Toolkit

Local CSV / TSV / JSONL inspection and cleanup toolkit. Profile a tabular file (row count, auto-detected column types, nulls, distincts, samples), validate it against a small JSON schema, deduplicate by full row or key columns, diff two files by key, and convert between csv/tsv/jsonl/json/markdown. Pure Python 3 standard library, no pandas, no remote calls.

Install

openclaw skills install clean-csv-toolkit

clean-csv-toolkit v0.1.0

A small honest toolkit for the work agents end up doing constantly: read a CSV someone sent you, work out what's in it, clean it up, and forward only the safe rows downstream. Built on Python 3 standard library only. No pandas, no numpy, no pip installs, no remote calls.

What this skill does

  • scripts/inspect.py — profile a .csv / .tsv / .jsonl file: row count, auto-detected column types (int, float, bool, date, datetime, string, empty), null counts per column, distinct value counts (capped), three sample values per column, file size, and detected encoding.
  • scripts/validate.py — check the file against a small JSON schema (required columns, per-column type, min/max, enum, regex, unique). Exits 0/1 so it slots into CI.
  • scripts/dedupe.py — remove duplicate rows by full-row match or by key columns. Optional --keep first|last, --case-insensitive, --trim, and a JSONL report of every removed row.
  • scripts/diff.py — compare two files by key column(s) and classify every row as added / removed / changed / unchanged, with a per-column before/after diff for changed rows.
  • scripts/convert.py — convert between CSV, TSV, JSON Lines, JSON array, and GitHub-flavored Markdown table.
  • scripts/check_deps.sh — verify python3 is available.

What this skill does not do

  • It does not call any LLM, web service, or remote API.
  • It does not load a full dataframe into memory just to do simple structural work; the helpers stream rows where possible.
  • It does not write outside the input/output paths the caller provides.
  • It does not do statistical analysis (mean, percentile, correlation). For that, use a dataframe library.
  • It does not parse Excel files (.xls / .xlsx). Export to CSV first.

Required dependencies

bash scripts/check_deps.sh

Only python3 is required. The skill uses csv, json, re, pathlib, argparse, datetime, collections — all stdlib.

Workflows

1. Profile an unknown CSV

python3 scripts/inspect.py customers.csv

Output:

file:      /path/customers.csv
size:      284 B (284 bytes)
encoding:  utf-8
kind:      csv
rows:      5
columns:   6

  #  name                          type           nulls   null%    distinct  sample
----------------------------------------------------------------------------------------------------
  1  id                            int                0    0.00           5  '1', '2', '3'
  2  email                         string             0    0.00           5  'alice@example.com', ...
  3  name                          string             0    0.00           5  'Alice', 'Bob', 'Carol'
  4  amount                        float              1   20.00           4  '42.50', '100.00', '7.25'
  5  status                        string             0    0.00           3  'approved', 'pending', ...
  6  signup_date                   date               0    0.00           5  '2025-01-15', ...

Pass --json for machine-readable output that pipes into other tools.

The script auto-detects the dialect (CSV vs TSV vs JSON Lines) and a sensible encoding (utf-8, utf-8-sig, cp1252, latin-1). Type inference takes up to 1000 non-empty values per column and picks the most specific type that fits all of them.
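
Conceptually, the inference is a most-specific-first ladder of shape checks. A minimal sketch of the idea (not inspect.py's actual code; the regexes here are assumptions that merely fit the type list above):

import re

# Most-specific-first ladder of shape regexes. A hypothetical
# approximation, not the script's actual table.
SHAPES = [
    ("int",      re.compile(r"[+-]?\d+")),
    ("float",    re.compile(r"[+-]?(\d+\.\d*|\.\d+|\d+)([eE][+-]?\d+)?")),
    ("bool",     re.compile(r"(?i)(true|false)")),
    ("date",     re.compile(r"\d{4}-\d{2}-\d{2}")),
    ("datetime", re.compile(r"\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}(:\d{2})?")),
]

def infer_type(values, cap=1000):
    # Pick the most specific type that fits every sampled non-empty value.
    sample = [v.strip() for v in values if v.strip()][:cap]
    if not sample:
        return "empty"
    for name, pattern in SHAPES:
        if all(pattern.fullmatch(v) for v in sample):
            return name
    return "string"

This also explains the limitation noted under Known limitations: "1,234.56" fails every shape and falls through to string.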

2. Validate against a schema

Write a schema.json:

{
  "required_columns": ["id", "email", "amount", "status"],
  "columns": {
    "id":     {"type": "int", "required": true, "unique": true, "min": 1},
    "email":  {"type": "string", "required": true, "regex": ".+@.+\\..+"},
    "amount": {"type": "float", "min": 0, "max": 100000},
    "status": {"type": "string", "enum": ["pending", "approved", "rejected"]},
    "signup_date": {"type": "date"}
  }
}

Then:

python3 scripts/validate.py customers.csv --schema schema.json

On a clean file the script exits 0 and prints verdict: pass. On a bad file it exits 1 and prints a detailed error table:

   row  column                  kind                    detail
------------------------------------------------------------------------------------------------
     2  email                   regex_mismatch          value did not match regex | value='not-an-email'
     2  amount                  bad_type                value does not match type 'float' | value='abc'
     3  amount                  below_min               value -50.0 < min 0 | value='-50.00'
     3  status                  not_in_enum             value not in allowed set | value='unknown_status'
     4  id                      duplicate_unique        value already seen earlier in this column | value='1'

Pass --json for a structured report and --max-errors N to cap collection on huge files.
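
Under the hood, each cell is checked against its column rule and every failure becomes one line of that table. A rough single-cell sketch (the kind names shown in the sample output above are taken from it; missing_required and above_max are guesses, and validate.py's internals may differ):

import re

def check_cell(value, rule):
    # Yield a (kind, detail) pair for every rule the value violates.
    if value.strip() == "":
        if rule.get("required"):
            yield "missing_required", "required value is empty"
        return
    typ = rule.get("type")
    if typ in ("int", "float"):
        try:
            num = int(value) if typ == "int" else float(value)
        except ValueError:
            yield "bad_type", f"value does not match type '{typ}'"
            return
        if "min" in rule and num < rule["min"]:
            yield "below_min", f"value {num} < min {rule['min']}"
        if "max" in rule and num > rule["max"]:
            yield "above_max", f"value {num} > max {rule['max']}"
    if "enum" in rule and value not in rule["enum"]:
        yield "not_in_enum", "value not in allowed set"
    if "regex" in rule and not re.fullmatch(rule["regex"], value):
        yield "regex_mismatch", "value did not match regex"

Uniqueness (duplicate_unique) is the one stateful check: it needs the set of values already seen in the column, so it cannot live in a per-cell function like this one.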

3. Remove duplicates

By full-row match (any two rows identical in every column):

python3 scripts/dedupe.py messy.csv clean.csv

By a key column (only one canonical row per id):

python3 scripts/dedupe.py messy.csv clean.csv --key id \
  --removed-report removed.jsonl

--keep first (default) keeps the earlier-occurring row; --keep last keeps the later one — useful when later rows are corrections. --case-insensitive and --trim normalize key values before comparison so " alice@example.com" and "ALICE@example.com" collapse to one row, as sketched below.
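
In other words, duplicate detection compares a canonical key tuple rather than the raw cell text. A plausible sketch (the function name is hypothetical):

def canonical_key(row, key_columns, trim=False, case_insensitive=False):
    # Build the hashable tuple used to detect duplicates. Sketch only.
    parts = []
    for col in key_columns:
        value = row[col]
        if trim:
            value = value.strip()
        if case_insensitive:
            value = value.casefold()
        parts.append(value)
    return tuple(parts)

# With trim=True and case_insensitive=True, " alice@example.com" and
# "ALICE@example.com" produce the same key and collapse to one row.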

The --removed-report writes one JSON object per removed row, with the original 1-based row index, the key tuple that was duplicated, and the full row, so every dedupe decision is auditable.
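
Because the report is plain JSON Lines, auditing it takes a few lines of stdlib Python. A sketch that tallies which keys lost the most rows (the field name key is an assumption; check one line of your report for the actual names):

import json
from collections import Counter

removed_per_key = Counter()
with open("removed.jsonl", encoding="utf-8") as fh:
    for line in fh:
        entry = json.loads(line)
        # "key" is a hypothetical field name; confirm against your report.
        removed_per_key[tuple(entry["key"])] += 1

for key, count in removed_per_key.most_common(5):
    print(f"{key}: {count} duplicate(s) removed")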

4. Diff two files

python3 scripts/diff.py customers_old.csv customers_new.csv --key id

Output:

added:      1
removed:    1
changed:    1

--- ADDED (1) ---
  + 6
--- REMOVED (1) ---
  - 4
--- CHANGED (1) ---
  ~ 2
      amount: '100.00' -> '150.00'
      status: 'pending' -> 'approved'

Multi-column keys are supported: --key customer_id,date. Exit codes are 0 if the files are identical on the key columns, 1 if they differ — so this also works as a CI guard ("fail the build if the snapshot file changed").
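
The classification itself is the standard keyed-dictionary comparison. A minimal sketch of the idea (an approximation, not diff.py's code):

def classify(old_rows, new_rows, key_cols):
    # Index both files by composite key, then compare the key sets.
    def make_key(row):
        return tuple(row[c] for c in key_cols)

    old = {make_key(r): r for r in old_rows}
    new = {make_key(r): r for r in new_rows}
    added = [k for k in new if k not in old]
    removed = [k for k in old if k not in new]
    changed = {}
    for k in old.keys() & new.keys():
        # Per-column before/after diff for rows present in both files.
        delta = {c: (old[k][c], new[k][c])
                 for c in old[k] if old[k][c] != new[k].get(c)}
        if delta:
            changed[k] = delta
    return added, removed, changed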

5. Convert between formats

python3 scripts/convert.py data.csv data.jsonl       # row -> JSON Lines
python3 scripts/convert.py data.jsonl data.csv       # back
python3 scripts/convert.py data.csv data.json --pretty
python3 scripts/convert.py data.csv data.md          # GitHub-flavored table
python3 scripts/convert.py data.tsv data.csv         # delimiter change

Output format is picked from the extension. Allowed extensions: .csv, .tsv, .jsonl, .json, .md. The Markdown writer escapes | and flattens newlines in cell values so the table stays well-formed.
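
An unescaped pipe would split a cell into two columns, so the writer rewrites cell text before emitting it. A sketch of the rule as described (the real writer may handle more cases):

def md_cell(value):
    # Escape pipes; flatten newlines to spaces (see Known limitations).
    return value.replace("|", "\\|").replace("\r\n", " ").replace("\n", " ")

def md_row(cells):
    return "| " + " | ".join(md_cell(c) for c in cells) + " |"

print(md_row(["a|b", "line1\nline2"]))  # | a\|b | line1 line2 |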

Exit codes

Code  Meaning
0     success / validation pass / files identical
1     validation fail / files differ / no rows in input
2     bad arguments / unsafe path / missing input / unsupported extension / schema malformed

This 0/1/2 split is consistent across all five scripts, so they slot into shell pipelines cleanly:

python3 scripts/validate.py incoming.csv --schema schema.json \
  && python3 scripts/dedupe.py incoming.csv clean.csv --key id \
  && python3 scripts/inspect.py clean.csv

Safety properties

  • Pure Python 3 standard library. No third-party dependencies.
  • No subprocess calls. No shell invocation.
  • All file paths are validated against a strict allowlist regex that rejects shell metacharacters (;, |, &, >, <, $, `, backslash-newline, etc.); see the sketch after this list.
  • Scripts only read the input paths the caller provides and write to the output paths the caller provides. No temp files outside the system's tempdir.
  • All inputs and outputs use UTF-8 by default; CSV reads fall back automatically through utf-8-sig, cp1252, and latin-1 when the file is not valid UTF-8.
  • Deterministic: the same input produces the same output every time.
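
The allowlist check mentioned above rejects anything outside a known-good character set instead of trying to enumerate every dangerous one. A hypothetical approximation (the actual pattern in the scripts may be stricter):

import re
import sys

# Assumed allowlist: ordinary path characters only. Anything else,
# including ; | & > < $ ` and newlines, fails the match.
SAFE_PATH = re.compile(r"[A-Za-z0-9_.\-/~]+")

def require_safe(path):
    if not SAFE_PATH.fullmatch(path):
        print(f"unsafe path: {path!r}", file=sys.stderr)
        sys.exit(2)  # exit code 2: bad arguments / unsafe path
    return path

An allowlist is deliberately over-cautious: under this assumed pattern even a space in a filename is rejected.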

Performance

  • inspect.py profiles 10,000 rows in well under one second on a single core (single-pass streaming read).
  • Scripts stream rows where possible and do not load the entire file into memory for processing. The exceptions are dedupe.py and diff.py, which build an in-memory dict keyed by row identity — fine for hundreds of thousands of rows on a typical laptop.
  • No background threads, no process pool, no caching.

Known limitations

  • Type inference uses regex-shape matching, not locale-aware parsing. "1,234.56" is detected as string, not float. Re-export with a different number format if you need different inference.
  • The Markdown writer flattens multi-line cells to single lines (newlines become spaces).
  • JSON Lines input must have one JSON object per line. Multi-line JSON arrays are not supported; use the regular CSV/JSONL pipeline.

License

MIT. See LICENSE.