Deduplication reference — exact matching, fuzzy matching, hash-based dedup, bloom filters, and data quality. Use when removing duplicate records, files, or data entries.

Install

openclaw skills install dedupe

Dedupe — Data Deduplication Reference

Quick-reference skill for deduplication strategies, algorithms, and data quality patterns.

When to Use

  • Removing duplicate rows from datasets or databases
  • Deduplicating files in storage systems
  • Implementing fuzzy matching for near-duplicate detection
  • Choosing between exact and probabilistic dedup methods
  • Building ETL pipelines with deduplication stages

Commands

intro

scripts/script.sh intro

Overview of deduplication — types, strategies, and tradeoffs.

exact

scripts/script.sh exact

Exact deduplication — hash-based, key-based, and sorting approaches.
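
As an illustration of the hash-based approach, here is a minimal Python sketch that fingerprints selected fields and keeps the first record per fingerprint. The helper names (record_key, dedupe_exact), the sample rows, and the strip/casefold normalisation are assumptions made for this example, not output of the skill's script.

```python
import hashlib
import json


def record_key(record, fields):
    """Hash the chosen fields into a stable fingerprint."""
    # Assumption: values are compared after trimming whitespace and casefolding.
    normalised = [str(record.get(f, "")).strip().casefold() for f in fields]
    payload = json.dumps(normalised, ensure_ascii=False).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def dedupe_exact(records, fields):
    """Keep the first record seen for each fingerprint, preserving input order."""
    seen = set()
    unique = []
    for record in records:
        key = record_key(record, fields)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


rows = [
    {"email": "ann@example.com", "name": "Ann"},
    {"email": "ANN@example.com ", "name": "Ann"},
    {"email": "bob@example.com", "name": "Bob"},
]
unique_rows = dedupe_exact(rows, ["email"])
print(len(unique_rows))  # 2
```

Storing fingerprints rather than whole records keeps memory use proportional to the number of distinct keys, which is usually the reason to prefer hashing over sorting when the data does not fit comfortably in memory.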

fuzzy

scripts/script.sh fuzzy

Fuzzy deduplication — similarity measures, blocking, and record linkage.
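
To show how blocking and a similarity measure fit together, the sketch below groups names by a cheap blocking key and only compares pairs inside each block. It uses difflib.SequenceMatcher from the Python standard library; the blocking rule (first letter), the 0.85 threshold, and the function names are assumptions chosen for the example.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def block_key(name):
    # Hypothetical blocking rule: first letter of the normalised name.
    return name.strip().casefold()[:1]


def find_near_duplicates(names, threshold=0.85):
    """Return (a, b, score) for pairs in the same block with score >= threshold."""
    blocks = defaultdict(list)
    for name in names:
        blocks[block_key(name)].append(name)

    pairs = []
    for candidates in blocks.values():
        # Only compare within a block, which avoids the full n*(n-1)/2 comparisons.
        for a, b in combinations(candidates, 2):
            score = SequenceMatcher(None, a.casefold(), b.casefold()).ratio()
            if score >= threshold:
                pairs.append((a, b, round(score, 2)))
    return pairs


names = ["Jon Smith", "John Smith", "Jane Doe", "Bob Stone"]
matches = find_near_duplicates(names)
print(matches)
```

Blocking trades recall for speed: two records that land in different blocks are never compared, so the blocking key should be something near-duplicates are very likely to share.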

files

scripts/script.sh files

File-level deduplication — fdupes, jdupes, rdfind, and storage dedup.
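
One common way to find duplicate files is to group by size first and hash only the files that share a size. The Python sketch below follows that idea; it is not how fdupes, jdupes, or rdfind are implemented, and the function names and the temporary demo files are assumptions for illustration.

```python
import hashlib
import os
import tempfile
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """SHA-256 of a file, read in chunks so large files are not loaded whole."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicate_files(root):
    """Return groups of paths under root whose contents hash identically."""
    by_size = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            path = os.path.join(dirpath, filename)
            if os.path.isfile(path) and not os.path.islink(path):
                by_size[os.path.getsize(path)].append(path)

    by_digest = defaultdict(list)
    for paths in by_size.values():
        if len(paths) < 2:
            continue  # a unique size cannot have a duplicate
        for path in paths:
            by_digest[file_digest(path)].append(path)
    return [sorted(group) for group in by_digest.values() if len(group) > 1]


# Demo on throwaway files: a.txt and b.txt share content, c.txt does not.
with tempfile.TemporaryDirectory() as root:
    for name, data in [("a.txt", b"same"), ("b.txt", b"same"), ("c.txt", b"diff")]:
        with open(os.path.join(root, name), "wb") as handle:
            handle.write(data)
    groups = find_duplicate_files(root)
    duplicate_names = [[os.path.basename(p) for p in group] for group in groups]

print(duplicate_names)  # [['a.txt', 'b.txt']]
```

The sketch only reports duplicates. Deleting or hard-linking them is left out on purpose, since that step deserves a byte-for-byte comparison or at least a manual review first.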

algorithms

scripts/script.sh algorithms

Dedup algorithms — bloom filters, HyperLogLog, MinHash, SimHash.
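
As a concrete instance of probabilistic dedup, here is a small Bloom filter sketch in Python used to drop repeats from a stream. The class, the double-hashing scheme built on SHA-256, and the default 1% false-positive target are assumptions for this example. A Bloom filter never misses an item it has seen, but it can occasionally report an unseen item as seen, so a few unique items may be dropped.

```python
import hashlib
import math


class BloomFilter:
    def __init__(self, expected_items, false_positive_rate=0.01):
        expected_items = max(1, expected_items)
        # Standard sizing: m = -n * ln(p) / (ln 2)^2 bits, k = (m / n) * ln 2 hashes.
        self.size = max(1, math.ceil(
            -expected_items * math.log(false_positive_rate) / (math.log(2) ** 2)
        ))
        self.hash_count = max(1, round(self.size / expected_items * math.log(2)))
        self.bits = bytearray((self.size + 7) // 8)

    def _positions(self, item):
        # Double hashing: derive k positions from two 64-bit slices of one digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.size for i in range(self.hash_count)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


def dedupe_stream(items, expected_items):
    """Yield items the filter has not seen yet (may rarely drop a unique item)."""
    seen = BloomFilter(expected_items)
    for item in items:
        if item not in seen:
            seen.add(item)
            yield item


result = list(dedupe_stream(["a", "b", "a", "c", "b"], expected_items=1000))
print(result)
```

Memory stays fixed at the size chosen up front regardless of how many items pass through, which is the main reason to accept the false-positive risk over an exact set of seen keys.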

sql

scripts/script.sh sql

SQL deduplication patterns — ROW_NUMBER, DISTINCT, GROUP BY strategies.
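
The ROW_NUMBER pattern can be tried out with Python's built-in sqlite3 module, as in the sketch below. The customers table, its columns, and the rule "keep the most recently updated row per email" are assumptions for the example; window functions require SQLite 3.25 or newer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, email TEXT, updated_at TEXT);
    INSERT INTO customers (email, updated_at) VALUES
        ('ann@example.com', '2024-01-01'),
        ('ann@example.com', '2024-03-01'),
        ('bob@example.com', '2024-02-01');
""")

# Rank rows within each email, newest first, and delete everything ranked below 1st.
conn.execute("""
    DELETE FROM customers
    WHERE id IN (
        SELECT id FROM (
            SELECT id,
                   ROW_NUMBER() OVER (
                       PARTITION BY email
                       ORDER BY updated_at DESC, id DESC
                   ) AS rn
            FROM customers
        )
        WHERE rn > 1
    )
""")
conn.commit()

remaining = conn.execute(
    "SELECT email, updated_at FROM customers ORDER BY email"
).fetchall()
print(remaining)
# [('ann@example.com', '2024-03-01'), ('bob@example.com', '2024-02-01')]
conn.close()
```

Compared with SELECT DISTINCT or GROUP BY, ROW_NUMBER lets the ORDER BY clause decide which duplicate survives, which matters when duplicates differ in columns outside the matching key.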

cli

scripts/script.sh cli

Command-line dedup tools — sort, uniq, awk, and stream processing.

checklist

scripts/script.sh checklist

Deduplication quality checklist and validation steps.

help

scripts/script.sh help

version

scripts/script.sh version

Configuration

Variable      Description
DEDUPE_DIR    Data directory (default: ~/.dedupe/)

Powered by BytesAgain | bytesagain.com | hello@bytesagain.com