Install
openclaw skills install dataset-search
Find, compare, and obtain datasets or data lakes across ML repositories, cloud public data registries, government portals, scientific archives, geospatial/climate catalogs, NLP corpora, and generic web dataset indexes.
Use this skill when the user needs a dataset, benchmark, public data lake, open-data portal, or data source for analysis, ML, BI, RAG, geospatial work, climate, NLP, multimodal projects, or data engineering.
python3 skills/dataset-search/scripts/dataset_search.py search "solar radiation hourly Brazil agriculture" --profile climate --region BR --limit 8 --format markdown
python3 skills/dataset-search/scripts/dataset_search.py search "credit card fraud transactions" --source kaggle,huggingface,openml --limit 10
python3 skills/dataset-search/scripts/dataset_search.py search "sentinel crop classification" --profile geospatial --source aws-open-data,copernicus,huggingface
scripts/dataset_search.py is a standard-library Python helper. It queries direct APIs/CLIs where practical and emits resilient fallback search links for sources without a stable public search API.
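The adapter-or-fallback behavior described above can be sketched roughly as follows. This is a minimal illustration, not the script's actual code: the function names, the result dictionary shape, and the portal.example.org URL are all hypothetical.

```python
from urllib.parse import urlencode


def fallback_link(base_url, query):
    """Build a plain search URL for a source without a usable search API."""
    return f"{base_url}?{urlencode({'q': query})}"


def search_source(query, base_url, adapter=None):
    """Try a direct adapter first; fall back to a search link if it fails.

    `adapter` is a hypothetical callable that takes the query and returns
    a list of results, raising OSError on network problems.
    """
    if adapter is not None:
        try:
            return {"kind": "results", "items": adapter(query)}
        except OSError:
            # Network or API failure: degrade to a link the user can open.
            pass
    return {"kind": "search_link", "url": fallback_link(base_url, query)}


# Hypothetical portal URL, used only for illustration.
print(search_source("sentinel crop classification",
                    "https://portal.example.org/search"))
```

The point of the pattern is that a source never drops out of the output entirely: either the adapter yields results, or the caller still receives a link to search manually.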
Common commands:
python3 skills/dataset-search/scripts/dataset_search.py sources
python3 skills/dataset-search/scripts/dataset_search.py search "income inequality Brazil time series" --profile economics --format json --output /tmp/dataset-results.json
python3 skills/dataset-search/scripts/dataset_search.py search "large multilingual instruction dataset" --profile nlp --offline --format markdown
python3 skills/dataset-search/scripts/dataset_search.py download --from-results /tmp/dataset-results.json --index 0 --output-dir /tmp/datasets
python3 skills/dataset-search/scripts/dataset_search.py download --from-results /tmp/dataset-results.json --index 0 --output-dir /tmp/datasets --yes
By default, download only prints a safe acquisition plan. It executes downloads or source CLIs only when --yes is passed.
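The plan-first default can be illustrated with a small argparse sketch. It assumes nothing about how dataset_search.py is written internally: build_plan, run_plan, and the example.org URL are hypothetical, and run_plan is a placeholder that downloads nothing.

```python
import argparse


def build_plan(url, output_dir):
    """Describe the steps a download would take, without performing them."""
    return [
        f"create directory {output_dir}",
        f"fetch {url} into {output_dir}",
    ]


def run_plan(plan):
    """Placeholder executor: a real tool would download files here."""
    return [f"done: {step}" for step in plan]


def main(argv):
    parser = argparse.ArgumentParser(description="plan-first download sketch")
    parser.add_argument("url")
    parser.add_argument("--output-dir", default="/tmp/datasets")
    parser.add_argument("--yes", action="store_true",
                        help="execute the plan instead of only printing it")
    args = parser.parse_args(argv)

    plan = build_plan(args.url, args.output_dir)
    if not args.yes:
        # Default path: show what would happen and stop.
        print("Acquisition plan (nothing executed):")
        for step in plan:
            print("  " + step)
        return "planned"

    for line in run_plan(plan):
        print(line)
    return "executed"


# Without --yes only the plan is printed.
main(["https://example.org/data.csv"])
```

Making execution opt-in means a mistyped index or an unexpectedly large dataset costs nothing until the plan has been read.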
Useful options:
--profile: general, ml, nlp, geospatial, climate, economics, government, brazil, biomed, multimodal, cloud
--region: country, state, language, or geographic hint, such as BR, EU, US, Ceara, Portuguese
--source: comma-separated source ids, or all
--brief: JSON file with structured fields such as question, domain, task, geography, period, format, license, must_have, avoid, preferred_sources
--offline: do not call the network; return source-specific search URLs and acquisition guidance
The script has direct adapters for Hugging Face Datasets, Kaggle CLI, OpenML, UCI when its API is reachable, Zenodo, Figshare, data.gov CKAN, NASA/CDC Socrata catalogs, Harvard Dataverse, GBIF, and generic CKAN-style portals when configured in the script.
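As an illustration of the --brief option listed above, a brief file could be generated like this. The field names come from the option description; every value, and the choice of strings versus lists for each field, is an assumption made for the example rather than a documented schema.

```python
import json
import tempfile
from pathlib import Path

# Field names follow the --brief description; values and value types
# (string vs. list) are illustrative assumptions, not a documented schema.
brief = {
    "question": "How has income inequality in Brazil changed over time?",
    "domain": "economics",
    "task": "time series analysis",
    "geography": "BR",
    "period": "2000-2020",
    "format": "csv",
    "license": "open",
    "must_have": ["annual observations"],
    "avoid": ["paywalled sources"],
    "preferred_sources": ["kaggle", "openml"],
}

# Write the brief to the system temp directory.
path = Path(tempfile.gettempdir()) / "dataset-brief.json"
path.write_text(json.dumps(brief, indent=2), encoding="utf-8")
print(path)
```

The printed path can then be supplied through --brief.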
It also produces guided search/acquisition entries for AWS Registry of Open Data, Google Cloud Public Datasets, Azure Open Datasets, Databricks Marketplace, Snowflake Marketplace, World Bank Open Data, data.europa.eu, IBGE, dados.gov.br, Eurostat, UN Data, WHO GHO, FRED, IMF, Our World in Data, CERN Open Data, NOAA, Copernicus, NASA POWER, NASA Earthdata, USGS, OpenStreetMap/Geofabrik, OpenAQ, Google Dataset Search, DataHub, data.world, Dryad, Mendeley Data, OpenAIRE, Awesome Public Datasets, Common Crawl, The Pile/EleutherAI, LAION, Nasdaq Data Link, and other registry-style sources.
kaggle CLI and local credentials.
huggingface-cli; gated datasets require authentication and acceptance of the dataset terms.