Data Cleaning & Annotation Workflow

Complete workflow for taking time series datasets (Energy, Manufacturing, Climate) from Kaggle to the Data Annotation platform (data.smlcrm.com). Includes downloading, cl...

MIT-0 · Free to use, modify, and redistribute. No attribution required.

by Yash Deshmukh @Deyashmukh

Security Scan

VirusTotal: Suspicious

OpenClaw: Benign (high confidence)

ℹ

Purpose & Capability

The skill’s name/description match the provided files: a downloader helper and a pandas cleaning script plus detailed upload instructions to data.smlcrm.com. One small expectation gap: the SKILL.md suggests using the Kaggle CLI, which implicitly requires the user's Kaggle API token/config (~/.kaggle/kaggle.json); the skill does not declare or attempt to manage that token (this is normal but worth noting).

ℹ

Instruction Scope

Runtime instructions are limited to finding/downloading data, running the local clean_dataset.py script, and manually uploading/configuring datasets on data.smlcrm.com. That stays within the stated purpose. One procedural oddity: it instructs assigning 'ALL group tags' to ALL variables including target variables — this is unexpected from an ML-quality standpoint (could harm labeling/analysis) but is not a security/privacy exfiltration action.

✓

Install Mechanism

No install specification: instruction-only with two small helper scripts. Nothing is downloaded from arbitrary URLs or written/executed by an installer; therefore low install risk.

✓

Credentials

The skill does not request environment variables, credentials, or config paths. The only implicit requirement is that the user may need the Kaggle CLI configured locally (user-managed kaggle.json) to run download_kaggle.sh — this is proportional to the stated task.

✓

Persistence & Privilege

The skill's always setting is false, and the skill does not request persistent/system-wide changes or access to other skills' configs. It does not perform autonomous network exfiltration or modify agent settings.

Assessment

This skill appears to do what it says: local downloading (via your Kaggle CLI), pandas-based cleaning, and manual upload to data.smlcrm.com. Before installing or running:

1. Verify that data.smlcrm.com is the intended, legitimate platform.
2. Review the scripts locally (they are short and readable) and run them in a safe environment.
3. Ensure your Kaggle CLI is configured securely (kaggle.json in ~/.kaggle); the skill won't manage that token.
4. Be cautious about the 'assign ALL group tags to ALL variables' step; it may produce incorrect metadata or labels for ML models.
5. Check dataset licenses and privacy constraints before downloading or uploading any datasets.

Current version: v1.0.0

latest: vk97dnsh2r10pte1r9k1aav2s6n81dezw

License

MIT-0 · Terms: https://spdx.org/licenses/MIT-0.html

SKILL.md

Simulacrum Data Annotation Workflow

Complete end-to-end workflow for time series dataset preparation and annotation on the Data Annotation platform (data.smlcrm.com).

What This Skill Does

This skill captures the precise workflow for processing time series datasets (Energy, Manufacturing, Climate) from discovery to CLEAN status:

Find Dataset: Search Kaggle for Energy/Manufacturing/Climate time series data
Download: Get CSV files via browser or Kaggle CLI
Clean: Run Python/pandas script to handle missing values, duplicates, formatting
Upload RAW: Upload original CSV with metadata (name, domain, source URL, description)
Configure Headers: Set column types (Time, Target, Covariate, Group) and units
Assign Groups: Select ALL variables (target + covariates), apply ALL group tags
Upload Cleaned: Final upload → CLEAN status

Supported Domains

Energy: Power consumption, utilities, renewable energy, grid data
Manufacturing: Industrial processes, steel production, emissions, equipment data
Climate: CO2 emissions, environmental monitoring, weather correlation data

Quick Start

For the full pipeline from Kaggle to annotated dataset:

1. Find dataset on Kaggle
2. Download (browser or kaggle CLI)
3. Clean with scripts/clean_dataset.py
4. Upload RAW dataset to data.smlcrm.com (with metadata)
5. Click "Clean" and upload cleaned file
6. Configure column metadata (types, units)
7. Assign groups to variables
8. Click "Upload Cleaned Dataset" to finalize → CLEAN status

Workflow Steps

Step 1: Find and Download Dataset

From Kaggle (Browser Method):

Navigate to kaggle.com/datasets
Search for relevant dataset (e.g., "steel industry energy consumption", "manufacturing emissions", "climate CO2")
Review data description, file list, and preview
Click "Download" button
Extract CSV file from downloaded zip

Alternative: Kaggle CLI

# Install if needed: pip install kaggle
# Configure: place your API token at ~/.kaggle/kaggle.json, then verify with: kaggle competitions list

scripts/download_kaggle.sh <dataset-name> [output-dir]
# Example: scripts/download_kaggle.sh csafrit2/steel-industry-energy-consumption
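
The contents of scripts/download_kaggle.sh are not shown here, so the following is only a rough Python sketch of what such a helper might do, assuming it wraps the `kaggle datasets download` command. The function names and the default `data` output directory are hypothetical.

```python
import subprocess
from pathlib import Path


def kaggle_token_path() -> Path:
    """Default location of the user-managed Kaggle API token."""
    return Path.home() / ".kaggle" / "kaggle.json"


def build_download_command(dataset: str, output_dir: str = "data") -> list[str]:
    """Build the Kaggle CLI call for a dataset slug of the form 'owner/name'."""
    if dataset.count("/") != 1 or not all(dataset.split("/")):
        raise ValueError(f"expected 'owner/name', got {dataset!r}")
    return ["kaggle", "datasets", "download", "-d", dataset, "-p", output_dir, "--unzip"]


def download(dataset: str, output_dir: str = "data") -> None:
    """Download and unzip a dataset; requires a configured Kaggle CLI."""
    # Assumption: credentials live in kaggle.json. The CLI can also read
    # credentials from environment variables, which this check ignores.
    if not kaggle_token_path().exists():
        raise FileNotFoundError(f"missing Kaggle token: {kaggle_token_path()}")
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(build_download_command(dataset, output_dir), check=True)
```

Calling `download("csafrit2/steel-industry-energy-consumption")` would then fetch the example dataset into `data/`, provided the CLI is installed and configured.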

Step 2: Clean the Dataset

Always run the cleaning script before uploading the cleaned file in Step 4:

python3 scripts/clean_dataset.py <input.csv> [-o <output.csv>]

What the script does:

Strips whitespace from column names
Removes duplicate rows
Fills missing numeric values with median
Fills missing categorical values with mode or 'Unknown'
Converts timestamp columns to datetime format
Outputs column summary for metadata configuration

Output:

Cleaned CSV file ready for upload
Column summary printed to console (save this for metadata config)
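
The source of scripts/clean_dataset.py is not reproduced here. As a minimal sketch of the steps listed above, a pandas version might look like the following. Detecting timestamp columns by name (via the hypothetical `time_hints` parameter) is an assumption, and the real script may use a different rule.

```python
import pandas as pd


def clean(df: pd.DataFrame, time_hints=("date", "time")) -> pd.DataFrame:
    """Apply the listed cleaning steps to a copy of df."""
    df = df.copy()
    # Strip whitespace from column names.
    df.columns = [str(c).strip() for c in df.columns]
    # Remove duplicate rows.
    df = df.drop_duplicates().reset_index(drop=True)
    for col in df.columns:
        if any(hint in col.lower() for hint in time_hints):
            # Assumed heuristic: treat columns named like timestamps as datetimes.
            df[col] = pd.to_datetime(df[col], errors="coerce")
        elif pd.api.types.is_numeric_dtype(df[col]):
            # Fill missing numeric values with the column median.
            df[col] = df[col].fillna(df[col].median())
        else:
            # Fill missing categorical values with the mode, else 'Unknown'.
            mode = df[col].mode(dropna=True)
            df[col] = df[col].fillna(mode.iloc[0] if not mode.empty else "Unknown")
    return df


def column_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column dtype and unique-value count, useful for metadata configuration."""
    return pd.DataFrame({"dtype": df.dtypes.astype(str), "unique": df.nunique()})
```

A name-based heuristic can misfire on columns such as a numeric "runtime", so it is worth checking the summary before relying on the result.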

Step 3: Upload Raw Dataset to Platform

1. Navigate to data.smlcrm.com/dashboard
2. Click "Upload Dataset" button
3. Fill in metadata for the RAW dataset:
   - Name: Descriptive dataset name
   - Domain: Category (Energy, Manufacturing, Climate, etc.)
   - Source URL: Kaggle or original source URL
   - Description: Brief summary of the dataset
4. Upload the original/raw CSV file (not cleaned yet)
5. Click Upload

Result: Dataset appears in list with RAW status

Step 4: Upload Cleaned File & Configure Metadata

Find the RAW dataset in the list
Click "Clean" button
Upload the cleaned CSV file (from Step 2)
Configure headers for each column:

| Setting | Description |
|---|---|
| Name | Column name (editable) |
| Units | Measurement units (kWh, °C, %, ratio, tCO2, etc.) |
| Type | Time / Target / Covariate / Group |

Column Type Guide:

Time: Timestamp/datetime columns (usually required)
Target: Variable to predict (at least one required)
Covariate: Input features/independent variables
Group: Categorical segment variables (WeekStatus, Day_of_week, Load_Type, etc.)
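
To prepare for this step offline, one could draft a first guess at column types from the cleaned data before opening the platform. The sketch below is an assumed heuristic, not something the platform does: datetime columns become Time, other numeric columns become Covariate, non-numeric columns become Group, and the target is named explicitly. The function name `suggest_types` is hypothetical.

```python
import pandas as pd


def suggest_types(df: pd.DataFrame, target: str) -> dict[str, str]:
    """Draft a column -> type mapping to review before configuring headers."""
    if target not in df.columns:
        raise KeyError(f"target column not found: {target!r}")
    types = {}
    for col in df.columns:
        if col == target:
            types[col] = "Target"
        elif pd.api.types.is_datetime64_any_dtype(df[col]):
            types[col] = "Time"
        elif pd.api.types.is_numeric_dtype(df[col]):
            types[col] = "Covariate"
        else:
            # Assumption: remaining text columns are categorical segments.
            types[col] = "Group"
    return types
```

The result is only a starting point. Numeric codes that are really categories, for example, would need to be switched to Group by hand.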

Bulk Configuration:

Select multiple rows via checkboxes
Use "Apply" dropdown to set type for selected columns
Set units individually or in bulk

Common Unit Patterns:

Energy: kWh, MWh
Power: kW, MW
Reactive energy: kVarh
Emissions: tCO2, kgCO2
Ratios: ratio, %
Time: seconds, minutes, hours

Step 5: Assign Groups to Variables

Purpose: Group variables define how data is segmented for analysis.

Exact Workflow:

1. Select ALL variables by checking their checkboxes:
   - Target variable(s)
   - ALL covariate variables
2. Apply ALL group tags to the selected variables:
   - Click first group tag (e.g., WeekStatus) → all selected get this group
   - Click second group tag (e.g., Day_of_week) → all selected get this group
   - Click third group tag (e.g., Load_Type) → all selected get this group
   - Continue for all available group tags
3. Result: All variables have all groups assigned (e.g., "WeekStatus × Day_of_week × Load_Type")

Important: Assign groups to BOTH target variables AND all covariates.
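
The assignment itself happens in the platform UI, but the expected end state can be written down as a small sketch. The function below is hypothetical and only models the result described above: every Target and Covariate column ends up with every Group column attached.

```python
def assign_all_groups(column_types: dict[str, str]) -> dict[str, str]:
    """Map each target/covariate column to the combined label of all group columns."""
    groups = [col for col, kind in column_types.items() if kind == "Group"]
    variables = [col for col, kind in column_types.items() if kind in ("Target", "Covariate")]
    label = " × ".join(groups)
    return {var: label for var in variables}


# Column types taken from the steel industry example in this document.
steel_types = {
    "Timestamps": "Time",
    "Usage_kWh": "Target",
    "NSM": "Covariate",
    "WeekStatus": "Group",
    "Day_of_week": "Group",
    "Load_Type": "Group",
}
for variable, label in assign_all_groups(steel_types).items():
    print(f"{variable}: {label}")
```

Comparing this mapping with what the platform shows after Step 5 is one way to confirm that no variable was left unselected.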

Step 6: Final Upload

Click "Upload Cleaned Dataset" button
Wait for processing
Dataset status changes from RAW → CLEAN
Verify data points count is correct
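
To have a number to compare against, the cleaned file can be measured locally. How the platform defines "data points" (rows, or individual cells) is not stated here, so this sketch reports both. The function name `csv_shape` is hypothetical.

```python
import csv


def csv_shape(path: str) -> dict[str, int]:
    """Count data rows, columns, and cells (rows x columns) in a CSV file."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        header = next(reader, None)
        if header is None:
            return {"rows": 0, "columns": 0, "cells": 0}
        # Skip completely blank lines when counting data rows.
        rows = sum(1 for row in reader if row)
    return {"rows": rows, "columns": len(header), "cells": rows * len(header)}
```

Whichever figure matches the platform's count for a known dataset indicates which definition it uses.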

Example: Steel Industry Energy Dataset

Source: https://www.kaggle.com/datasets/csafrit2/steel-industry-energy-consumption

Metadata:

Name: Steel Industry Energy Consumption (South Korea)
Domain: Energy
Data Points: 350,400

Column Configuration:

| Column | Type | Units |
|---|---|---|
| Timestamps | Time | - |
| Usage_kWh | Target | kWh |
| Lagging_Current_Reactive.Power_kVarh | Covariate | kVarh |
| Leading_Current_Reactive_Power_kVarh | Covariate | kVarh |
| CO2(tCO2) | Covariate | tCO2 |
| Lagging_Current_Power_Factor | Covariate | ratio |
| Leading_Current_Power_Factor | Covariate | ratio |
| NSM | Covariate | seconds |
| WeekStatus | Group | - |
| Day_of_week | Group | - |
| Load_Type | Group | - |

Group Assignment:

Select: Usage_kWh, Lagging_Current_Reactive.Power_kVarh, Leading_Current_Reactive_Power_kVarh, CO2(tCO2), Lagging_Current_Power_Factor, Leading_Current_Power_Factor, NSM
Click: WeekStatus → all selected get WeekStatus
Click: Day_of_week → all selected get Day_of_week
Click: Load_Type → all selected get Load_Type
Final: All variables show "WeekStatus × Day_of_week × Load_Type"

Reference Materials

For detailed platform configuration guidance, see references/platform_guide.md.

Troubleshooting

"Next" button disabled:

Check at least one Time column is set
Check at least one Target column is set
Verify all columns have types assigned
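
These three checks can also be run against a planned configuration before touching the UI. The sketch below assumes the configuration is kept as a plain column-to-type mapping. The function name and message wording are hypothetical.

```python
VALID_TYPES = {"Time", "Target", "Covariate", "Group"}


def config_problems(column_types: dict) -> list[str]:
    """Return reasons a column configuration would be incomplete."""
    problems = []
    for col, kind in column_types.items():
        if kind not in VALID_TYPES:
            problems.append(f"{col}: type not assigned or unknown ({kind!r})")
    assigned = set(column_types.values())
    if "Time" not in assigned:
        problems.append("no Time column set")
    if "Target" not in assigned:
        problems.append("no Target column set")
    return problems
```

An empty list means the mapping satisfies all three conditions listed above.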

Groups not appearing:

Columns must be marked as "Group" type first
Proceed to next step after setting Group types

Upload fails:

Re-run cleaning script
Check CSV format (comma-delimited)
Verify no empty column names
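
The last two checks are easy to automate. The following is a minimal sketch that inspects only the header line. The function name is hypothetical, and the delimiter test is a simple assumption rather than a full format detector.

```python
import csv


def header_problems(path: str) -> list[str]:
    """Check a CSV header for a non-comma delimiter and empty column names."""
    # utf-8-sig drops a leading byte-order mark if one is present.
    with open(path, newline="", encoding="utf-8-sig") as f:
        first_line = f.readline()
    if not first_line.strip():
        return ["file is empty"]
    header = next(csv.reader([first_line]))
    problems = []
    # A single parsed field that contains another separator suggests
    # the file is not comma-delimited.
    if len(header) == 1 and any(sep in first_line for sep in (";", "\t", "|")):
        problems.append("header does not look comma-delimited")
    empty = [i for i, name in enumerate(header) if not name.strip()]
    if empty:
        problems.append(f"empty column name at position(s) {empty}")
    return problems
```

One common source of an empty first column name is saving a pandas DataFrame with `to_csv` without passing `index=False`, which writes the index as an unnamed column.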

Scripts

| Script | Purpose |
|---|---|
| `scripts/clean_dataset.py` | Clean and prepare CSV for upload |
| `scripts/download_kaggle.sh` | Download datasets via Kaggle CLI |

Platform URL

Data Annotation Platform: https://data.smlcrm.com

Files

4 total
