Pandas Skill
Expert pandas skill for data manipulation, cleaning, analysis, and transformation. Use this skill when working with tabular data, CSV/Excel files, data analy...
Like a lobster shell, security has layers — review code before you run it.
License
SKILL.md
Pandas Data Processing Skill
English | 简体中文
This skill provides comprehensive pandas data processing capabilities through executable scripts and reference documentation. Use this skill whenever tasks involve data manipulation, cleaning, analysis, or transformation of tabular data.
When to Use This Skill
Activate this skill when the user requests:
- Data cleaning operations (handling missing values, duplicates, outliers)
- Data analysis and statistical summaries
- Format conversions (CSV ↔ Excel ↔ JSON ↔ Parquet)
- Data transformation (filtering, sorting, aggregation, pivoting)
- Merging or combining multiple datasets
- Generating data quality reports
- Any pandas DataFrame operations
Core Capabilities
1. Data Cleaning (scripts/data_cleaner.py)
Handles common data cleaning tasks with a single command:
Usage:
python scripts/data_cleaner.py input.csv output.csv [options]
Available Options:
--remove-duplicates: Remove duplicate rows--handle-missing [strategy]: Handle missing values- Strategies:
drop,fill,forward,backward,mean,median
- Strategies:
--fill-value [value]: Custom fill value for missing data--remove-outliers: Remove outliers using IQR or Z-score method--outlier-method [method]: Chooseiqrorzscore(default: iqr)--standardize-columns: Standardize column names (lowercase, underscores)
Example:
python scripts/data_cleaner.py data.csv cleaned_data.csv \
--remove-duplicates \
--handle-missing mean \
--remove-outliers \
--standardize-columns
2. Data Analysis (scripts/data_analyzer.py)
Generates comprehensive data analysis reports:
Usage:
python scripts/data_analyzer.py input.csv [options]
Available Options:
--output, -o [file]: Save report to file--format [format]: Output format (jsonortext, default: json)
Report Includes:
- Basic information (rows, columns, memory usage)
- Data type distribution
- Missing values analysis
- Numeric column statistics (mean, std, min, max, quartiles, skewness, kurtosis)
- Categorical column statistics (unique values, value counts)
- Correlation analysis
- Outlier detection
Example:
python scripts/data_analyzer.py sales_data.csv -o report.json --format json
3. Data Transformation (scripts/data_transformer.py)
Performs various data transformation operations through subcommands:
Convert Format
python scripts/data_transformer.py convert input.csv output.xlsx
Supports: CSV, Excel (.xlsx/.xls), JSON, Parquet, HTML
Merge Files
python scripts/data_transformer.py merge file1.csv file2.csv file3.csv \
--output merged.csv \
--how outer \
--on key_column
Filter Data
python scripts/data_transformer.py filter data.csv \
--query "age > 18 and city == 'Beijing'" \
--output filtered.csv
Sort Data
python scripts/data_transformer.py sort data.csv \
--by sales quantity \
--descending \
--output sorted.csv
Select Columns
python scripts/data_transformer.py select data.csv \
--columns name age city \
--output selected.csv
Reference Documentation
The references/ directory contains detailed documentation:
references/common_operations.md
Comprehensive reference covering:
- Data reading/saving (CSV, Excel, JSON, SQL, Parquet)
- Data exploration (head, info, describe, dtypes)
- Data selection and filtering (loc, iloc, boolean indexing, query)
- Data cleaning (handling missing/duplicate values, type conversion)
- Data transformation (apply, map, sorting, column operations)
- Groupby and aggregation operations
- Pivot tables
- Merging and joining (concat, merge, join)
- Time series operations
- String operations
- Performance optimization tips
When to use: When Claude needs to understand pandas syntax or find the right method for a specific operation.
references/data_cleaning_best_practices.md
Best practices guide covering:
- Data quality check checklist
- Missing value handling strategies with decision tree
- Outlier detection methods (IQR, Z-Score, percentile)
- Data type optimization for memory efficiency
- String cleaning techniques
- Date/time standardization
- Complete cleaning pipeline template
- Common problems and solutions
- Data validation methods
When to use: When designing a data cleaning workflow or deciding on the best approach for specific data quality issues.
Workflow Guidelines
Step 1: Initial Assessment
Always start by analyzing the data:
python scripts/data_analyzer.py input_file.csv -o analysis_report.json
Review the report to understand data quality, types, missing values, and potential issues.
Step 2: Plan Cleaning Strategy
Based on the analysis report:
- Identify missing value strategy (reference:
data_cleaning_best_practices.md) - Determine if duplicates should be removed
- Decide on outlier handling approach
- Plan any necessary type conversions
Step 3: Execute Cleaning
Run the data cleaner with appropriate options:
python scripts/data_cleaner.py input.csv cleaned.csv [options]
Step 4: Transform as Needed
Apply any transformations (filtering, sorting, format conversion, merging):
python scripts/data_transformer.py [subcommand] [options]
Step 5: Validate Results
Re-run analysis on the cleaned data to verify improvements:
python scripts/data_analyzer.py cleaned.csv -o final_report.json
Common Patterns
Pattern 1: Quick Data Quality Report
python scripts/data_analyzer.py data.csv --format text
Pattern 2: Standard Cleaning Pipeline
python scripts/data_cleaner.py raw_data.csv clean_data.csv \
--standardize-columns \
--remove-duplicates \
--handle-missing median \
--remove-outliers
Pattern 3: Excel to CSV with Filtering
# Convert
python scripts/data_transformer.py convert data.xlsx data.csv
# Filter
python scripts/data_transformer.py filter data.csv \
--query "status == 'active'" \
--output filtered.csv
Pattern 4: Merge Multiple CSVs
python scripts/data_transformer.py merge *.csv \
--output combined.csv
Dependencies
Ensure pandas is installed:
pip install pandas numpy openpyxl
Optional for specific formats:
pip install pyarrow # For Parquet support
pip install xlrd # For older Excel files (.xls)
Tips for Effective Use
- Start with analysis: Always run the analyzer first to understand the data
- Incremental cleaning: Apply cleaning operations step by step, verify each step
- Preserve originals: Never overwrite original data files
- Check references: Consult reference docs for complex operations or best practices
- Validate results: Use the analyzer to verify cleaning effectiveness
- Memory efficiency: For large files, consider using the data type optimization techniques in the reference docs
- Combine operations: Chain multiple transformer commands for complex workflows
Limitations
- Scripts work with single-machine memory constraints (for very large datasets, consider Dask)
- Time series resampling and rolling operations require custom pandas code
- Complex statistical modeling beyond basic descriptive statistics requires additional libraries
- For advanced visualizations, use matplotlib/seaborn directly
Troubleshooting
Import errors: Ensure pandas and dependencies are installed
Memory errors: Process data in chunks or optimize dtypes (see references)
Encoding issues: Add encoding='utf-8' parameter when loading CSVs
Date parsing issues: Use pd.to_datetime() with explicit format string
For detailed pandas operations and troubleshooting, always refer to references/common_operations.md and references/data_cleaning_best_practices.md.
Files
14 totalComments
Loading comments…
