{"skill":{"slug":"bigdata","displayName":"Bigdata","summary":"Split large files, run parallel processing, and stream batch analysis. Use when sampling datasets, aggregating logs, or transforming bulk data.","description":"---\nname: bigdata\nversion: \"2.0.0\"\nauthor: BytesAgain\nhomepage: https://bytesagain.com\nsource: https://github.com/bytesagain/ai-skills\nlicense: MIT-0\ntags: [bigdata, tool, utility]\ndescription: \"Split large files, run parallel processing, and stream batch analysis. Use when sampling datasets, aggregating logs, or transforming bulk data.\"\n---\n\n# BigData\n\nA comprehensive data processing toolkit for ingesting, transforming, querying, filtering, aggregating, and managing data workflows — all from the command line with local timestamped log storage.\n\n## Commands\n\n| Command | Description |\n|---------|-------------|\n| `bigdata ingest <input>` | Ingest raw data into the system. Without args, shows recent ingest entries |\n| `bigdata transform <input>` | Record a data transformation step. Without args, shows recent transforms |\n| `bigdata query <input>` | Log and track data queries. Without args, shows recent queries |\n| `bigdata filter <input>` | Apply and record data filters. Without args, shows recent filters |\n| `bigdata aggregate <input>` | Record aggregation operations. Without args, shows recent aggregations |\n| `bigdata visualize <input>` | Log visualization tasks. Without args, shows recent visualizations |\n| `bigdata export <input>` | Log export operations. Without args, shows recent exports |\n| `bigdata sample <input>` | Record data sampling operations. Without args, shows recent samples |\n| `bigdata schema <input>` | Track schema definitions and changes. Without args, shows recent schemas |\n| `bigdata validate <input>` | Log data validation checks. Without args, shows recent validations |\n| `bigdata pipeline <input>` | Record pipeline configurations. Without args, shows recent pipelines |\n| `bigdata profile <input>` | Log data profiling operations. Without args, shows recent profiles |\n| `bigdata stats` | Show summary statistics across all entry types |\n| `bigdata search <term>` | Search across all log entries for a keyword |\n| `bigdata recent` | Show the 20 most recent activity entries from the history log |\n| `bigdata status` | Health check — version, data dir, total entries, disk usage, last activity |\n| `bigdata help` | Show all available commands |\n| `bigdata version` | Print version (v2.0.0) |\n\nEach data command (ingest, transform, query, etc.) works the same way:\n- **With arguments**: saves the entry with a timestamp to its dedicated `.log` file and records it in the activity history\n- **Without arguments**: displays the 20 most recent entries from that command's log\n\n## Data Storage\n\nAll data is stored locally in plain-text log files:\n\n```\n~/.local/share/bigdata/\n├── ingest.log          # Ingested data entries\n├── transform.log       # Transformation records\n├── query.log           # Query log\n├── filter.log          # Filter operations\n├── aggregate.log       # Aggregation records\n├── visualize.log       # Visualization tasks\n├── export.log          # Export operations\n├── sample.log          # Sampling records\n├── schema.log          # Schema definitions\n├── validate.log        # Validation checks\n├── pipeline.log        # Pipeline configurations\n├── profile.log         # Profiling results\n└── history.log         # Unified activity log with timestamps\n```\n\nEach entry is stored as `YYYY-MM-DD HH:MM|<value>` for easy parsing and export.\n\n## Requirements\n\n- **Bash** 4.0+ (uses `set -euo pipefail`)\n- Standard UNIX utilities: `date`, `wc`, `du`, `grep`, `head`, `tail`, `cat`\n- No external dependencies or API keys required\n- Works offline — all data stays on your machine\n\n## When to Use\n\n1. **Data pipeline tracking** — Record each step of a multi-stage data workflow (ingest → transform → validate → export) with full timestamps for audit trails\n2. **Quick data logging** — Capture observations, measurements, or notes about datasets directly from the terminal without opening a separate app\n3. **Schema management** — Keep track of schema definitions, changes, and validation rules as your data evolves over time\n4. **Data quality monitoring** — Log validation checks and profiling results to build a history of data quality metrics\n5. **Workflow documentation** — Use search and recent commands to review what data operations were performed, when, and in what order\n\n## Examples\n\n### Log a complete data workflow\n\n```bash\n# Ingest raw data\nbigdata ingest \"customer_orders_2024.csv — 1.2M rows loaded\"\n\n# Transform it\nbigdata transform \"normalize dates to ISO-8601, trim whitespace, deduplicate\"\n\n# Validate the output\nbigdata validate \"all required fields present, no nulls in customer_id\"\n\n# Record the schema\nbigdata schema \"orders: id(int), customer_id(int), amount(decimal), date(date)\"\n\n# Export when ready\nbigdata export \"final dataset pushed to analytics warehouse\"\n```\n\n### Search and review activity\n\n```bash\n# Search across all logs for a keyword\nbigdata search \"customer\"\n\n# Check overall statistics\nbigdata stats\n\n# View recent activity across all commands\nbigdata recent\n\n# Health check\nbigdata status\n```\n\n### Pipeline and profiling\n\n```bash\n# Define a pipeline\nbigdata pipeline \"daily-etl: ingest → clean → validate → load — runs at 02:00 UTC\"\n\n# Profile a dataset\nbigdata profile \"users table: 500K rows, 12 columns, 0.3% nulls in email field\"\n\n# Sample data for testing\nbigdata sample \"random 10% sample from transactions for QA testing\"\n\n# Record an aggregation\nbigdata aggregate \"monthly revenue by region — Q1 totals computed\"\n```\n\n### Filter and query tracking\n\n```bash\n# Log a filter operation\nbigdata filter \"removed records older than 2020-01-01, kept 850K of 1.2M rows\"\n\n# Track a query\nbigdata query \"SELECT region, SUM(revenue) FROM orders GROUP BY region\"\n\n# Log a visualization\nbigdata visualize \"bar chart: monthly revenue trend, exported as PNG\"\n```\n\n## Output\n\nAll commands print confirmation to stdout. Data is persisted in `~/.local/share/bigdata/`. Use `bigdata stats` for a summary or `bigdata search <term>` to find specific entries across all logs.\n\n---\n\n*Powered by BytesAgain | bytesagain.com | hello@bytesagain.com*\n","tags":{"latest":"2.0.1"},"stats":{"comments":0,"downloads":644,"installsAllTime":1,"installsCurrent":1,"stars":0,"versions":7},"createdAt":1773545027120,"updatedAt":1778491917527},"latestVersion":{"version":"2.0.1","createdAt":1773830131728,"changelog":"update","license":"MIT-0"},"metadata":{"setup":[],"os":null,"systems":null},"owner":{"handle":"bytesagain3","userId":"s17b0rz2zaqen3pqpq807a6q6983fgq5","displayName":"bytesagain3","image":"https://avatars.githubusercontent.com/u/218212813?v=4"},"moderation":null}