Install
openclaw skills install discovery-engine

Automatically discover novel, statistically validated patterns in tabular data. Find insights you'd otherwise miss, far faster and cheaper than doing it yourself (or prompting an agent to do it). Disco systematically searches for feature interactions, subgroup effects, and conditional relationships you wouldn't think to look for, validates each on hold-out data with FDR-corrected p-values, and checks every finding against academic literature for novelty. Returns structured patterns with conditions, effect sizes, citations, and novelty scores.

- Hosted MCP server: https://disco.leap-labs.com/mcp, no install required. Best for datasets at a URL.
- Python SDK: pip install discovery-engine-api. Use this for local files of any size. Runs on your machine and streams files directly: no base64, no size limits.

Quick rule: if the data is at a URL, use file_url in discovery_upload. If it's a local file, use the Python SDK; if Python isn't available, upload directly via the presign API and pass the result to discovery_analyze. Don't use file_content (base64) unless the file is already in memory and tiny.
Follow this flow when helping a user analyze data with Disco. Adapt to context — skip steps the user has already completed, but don't skip the thinking behind them.
Ask the user what they want to analyze. Help them get their data into a usable form:
- If the data is at a URL, use file_url in discovery_upload.

Upload the dataset with discovery_upload and show the user what Disco sees: column names, types (continuous vs categorical), and row count. This is their chance to catch issues before running, such as misdetected types, unexpected columns, or encoding problems.
Help the user choose the column they want to understand or predict. This is the outcome Disco will find patterns for. Ask: "What are you trying to explain? What outcome matters to you?" The target must have at least 2 distinct values.
Walk through the columns and identify any that should be excluded via excluded_columns:
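- Identifiers: row IDs, patient IDs, UUIDs, and similar arbitrary labels.
- Columns that restate the target (e.g. diagnosis_text when the target is diagnosis_code).
- Tautological columns: if the target is serious, exclude serious_outcome, not_serious, and death; they're all part of the same seriousness classification.

This is the most important step for getting meaningful results. Tautological columns produce findings that are trivially true, not discoveries.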
Ask the user whether they want a public or private analysis:

- Public: free, but the results are published.
- Private: costs credits.
Ask what analysis depth they want (default is 2). Explain: higher depth means Disco finds more patterns — especially non-obvious interactions that shallow analysis misses. Maximum depth is the number of columns minus 2.
For a first run, depth 2 is a good starting point. If the results are interesting and they want to go deeper, they can re-run at higher depth.
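The depth limit above can be checked before submitting. The helper below is a minimal sketch: `clamp_analysis_depth` is a hypothetical name, and the lower bound of 1 is an assumption (this document only states the default of 2 and the maximum of columns minus 2).

```python
def clamp_analysis_depth(requested_depth: int, num_columns: int) -> int:
    """Clamp a requested depth to the range this dataset allows."""
    max_depth = num_columns - 2  # upper bound stated in this document
    if max_depth < 1:
        raise ValueError("dataset has too few columns for an analysis")
    # Lower bound of 1 is an assumption, not a documented limit.
    return max(1, min(requested_depth, max_depth))


print(clamp_analysis_depth(2, 25))   # default depth fits a 25-column dataset
print(clamp_analysis_depth(50, 25))  # clamped to 23
```

Clamping rather than rejecting keeps a re-run at "higher depth" from failing on narrow datasets.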
If the user doesn't have a Disco API key:
- Call discovery_signup with their email. They'll get a verification code; then call discovery_signup_verify with the code to get a disco_ API key. No password, no credit card required.

If they already have an account but lost their key, use discovery_login / discovery_login_verify (same OTP flow).
Before submitting a private run, always call discovery_estimate first and show the cost to the user. Let them confirm before you proceed.
Submit the analysis with discovery_analyze. Use discovery_status to poll — do not block, continue the conversation.
Poll with discovery_status until complete, then fetch with discovery_get_results. Present results clearly:
- Share the report_url so the user can explore the interactive web report. Private reports require sign-in at the dashboard using the same email.

Adapt the order to what the user asked. If they said "what drives X?", lead with feature importance. If they said "find something new", lead with novel patterns.
Every pattern in result.patterns is a dict (MCP) / Pattern dataclass (SDK) with at least:
| Field | Type | Meaning |
|---|---|---|
| description | str | Pre-rendered sentence describing the pattern in plain English. Use this verbatim; don't try to compose your own from conditions. |
| conditions | list[dict] | Feature ranges/values that define the pattern. Each item has feature, type ("continuous"/"categorical"), and either min_value/max_value or values. |
| p_value | float | FDR-adjusted p-value on hold-out data. |
| novelty_type | str | "novel" (not in existing literature) or "confirmatory" (validates known finding). |
| novelty_explanation | str | Why Disco classified it as novel/confirmatory, with citations. |
| target_change_direction | str | "max" (increases target) or "min" (decreases target). |
| abs_target_change | float | Magnitude of effect, in target units. |
| support_count / support_percentage | int / float | Rows matching the pattern. |
| citations | list[dict] | Academic citations with title, year, doi, etc. |
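When results arrive as plain dicts (the MCP shape), the presentation order can be decided from these fields alone. The following is a sketch, not part of the Disco API: `order_patterns` is a hypothetical helper, and the 0.05 cutoff is simply the threshold used elsewhere in this document.

```python
def order_patterns(patterns, lead_with_novel=True, alpha=0.05):
    """Keep significant patterns and put the requested novelty type first."""
    significant = [p for p in patterns if p["p_value"] < alpha]

    def sort_key(p):
        is_novel = p["novelty_type"] == "novel"
        # Rank 0 for the group the user asked about, 1 for the rest.
        novelty_rank = 0 if is_novel == lead_with_novel else 1
        return (novelty_rank, p["p_value"])

    return sorted(significant, key=sort_key)


sample = [
    {"id": "p-3", "p_value": 0.001, "novelty_type": "confirmatory"},
    {"id": "p-2", "p_value": 0.008, "novelty_type": "novel"},
    {"id": "p-1", "p_value": 0.003, "novelty_type": "novel"},
]
for p in order_patterns(sample):
    print(p["id"], p["novelty_type"], p["p_value"])
```

Within each group the patterns are sorted by FDR-adjusted p-value, so the strongest evidence comes first.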
After presenting results, let the user know:
- Check hints and hidden_deep_count in the results; if patterns were hidden, let them know.
- Point them to discovery_subscribe or discovery_purchase_credits if interested.

Help the user dig into the results:

- Link to specific report sections using dashboard_urls.

Add to your MCP config:
{
  "mcpServers": {
    "discovery-engine": {
      "url": "https://disco.leap-labs.com/mcp",
      "env": { "DISCOVERY_API_KEY": "disco_..." }
    }
  }
}
| Tool | Purpose |
|---|---|
| discovery_upload | Upload a dataset. Supports URL download (file_url), local path (file_path), or base64 content (file_content). Returns a file_ref for use with discovery_analyze. |
| discovery_analyze | Submit a dataset for analysis using a file_ref from discovery_upload. Returns a run_id. |
| discovery_status | Poll a running analysis by run_id. |
| discovery_get_results | Fetch completed results: patterns, p-values, citations, feature importance. |
| discovery_estimate | Estimate the credit cost before committing to a run. |
| Tool | Purpose |
|---|---|
| discovery_signup | Start account creation; sends verification code to email. |
| discovery_signup_verify | Complete signup by submitting the verification code. Returns API key. |
| discovery_login | Get a new API key for an existing account; sends verification code to email. |
| discovery_login_verify | Complete login by submitting the verification code. Returns a new API key. |
| discovery_account | Check credits, plan, and usage. |
| discovery_list_plans | View available plans and pricing. |
| discovery_subscribe | Subscribe to or change plan. |
| discovery_purchase_credits | Buy credit packs. |
| discovery_add_payment_method | Attach a Stripe payment method. |
Analyses can take a while depending on dataset size and depth. Do not block — submit, continue other work, poll for completion.
1. discovery_estimate → Check credit cost (always do this for private runs)
2. discovery_upload → Upload the dataset, get file_ref
3. discovery_analyze → Submit for analysis using file_ref, get run_id
4. discovery_status → Poll until status is "completed"
Returns: status, queue_position, current_step,
estimated_wait_seconds
5. discovery_get_results → Fetch patterns, summary, feature importance
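Step 4 of the workflow above can be sketched as a bounded polling loop. This is an illustration under assumptions: `poll_until_done` is a hypothetical helper, `get_status` stands in for however your client invokes discovery_status, and the terminal status values ("completed", "failed") are taken from the status field documented for EngineResult.

```python
import time
from typing import Callable


def poll_until_done(get_status: Callable[[str], dict], run_id: str,
                    interval_seconds: float = 10.0, max_polls: int = 180) -> dict:
    """Call get_status(run_id) until the run reaches a terminal status."""
    for _ in range(max_polls):
        status = get_status(run_id)
        if status["status"] in ("completed", "failed"):
            return status
        time.sleep(interval_seconds)
    raise TimeoutError(f"run {run_id} did not finish after {max_polls} polls")


# Stand-in for discovery_status: reports "processing" twice, then "completed".
_responses = iter(["processing", "processing", "completed"])
final = poll_until_done(lambda run_id: {"status": next(_responses)},
                        "run-123", interval_seconds=0)
print(final["status"])
```

An agent that has other work to do can instead make a single status call between tasks; the loop is only appropriate when nothing else is pending.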
Choose the right path for your situation:
| Situation | Best approach |
|---|---|
| Data is at an http/https URL | file_url in discovery_upload |
| Local file, Python available | Python SDK (engine.discover(...)) |
| Local file, MCP server running locally | file_path in discovery_upload |
| Local file, hosted MCP, no Python | Direct upload API (3 steps — see below) |
| Small file, any language | POST /api/data/upload/direct (single step — see below) |
| Tiny file already in memory | file_content in discovery_upload (last resort) |
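The first four rows of the table can be expressed as a small decision function. This is a sketch only: `choose_upload_path` and its parameters are hypothetical names, and the two "small/tiny file" rows are left out because the table gives no size thresholds to branch on.

```python
def choose_upload_path(location: str, python_available: bool = False,
                       local_mcp_server: bool = False) -> str:
    """Map a data location and environment to an upload approach from the table."""
    if location.startswith(("http://", "https://")):
        return "file_url in discovery_upload"
    if python_available:
        return "Python SDK"
    if local_mcp_server:
        return "file_path in discovery_upload"
    # Hosted MCP, no Python: presign, PUT, finalize.
    return "direct upload API"


print(choose_upload_path("https://example.com/dataset.csv"))
print(choose_upload_path("/home/user/data/dataset.csv", python_available=True))
```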
Data at a URL:
discovery_upload(file_url="https://example.com/dataset.csv")
→ {"file": {...}, "columns": [{"name": "col1", "type": "continuous"}, ...], "rowCount": 5000}
discovery_analyze(file_ref=<result above>, target_column="outcome")
The server downloads the file directly — nothing passes through the agent or the model context. Works with public URLs, S3 presigned URLs, or any accessible http/https link.
Local file — Python SDK (recommended for any local file):
from discovery import Engine
engine = Engine(api_key="disco_...")
result = await engine.discover("data.csv", target_column="outcome")
Handles upload, polling, and results in one call. No size limit. See the Python SDK section for full documentation.
Local file — MCP server running locally (cloned from GitHub, stdio transport):
If you've cloned the repo and are running server.py locally, the process can read your filesystem directly:
discovery_upload(file_path="/home/user/data/dataset.csv")
→ {"file": {...}, "columns": [...], "rowCount": 5000}
discovery_analyze(file_ref=<result above>, target_column="outcome")
Reads the file locally and streams it directly to cloud storage — nothing passes through the model context. No size limit. file_path only works with a locally-running server — calling it against the hosted server at disco.leap-labs.com/mcp returns a File not found error (the user's path doesn't exist on the hosted machine's filesystem).
Local file — hosted MCP, direct upload (works from any language):
If you're using the hosted MCP server and Python isn't available, you can upload directly via the REST API in three steps, then pass the result to discovery_analyze as normal.
# 1. Get a presigned upload URL
curl -X POST https://disco.leap-labs.com/api/data/upload/presign \
-H "Authorization: Bearer disco_..." \
-H "Content-Type: application/json" \
-d '{"fileName": "data.csv", "contentType": "text/csv", "fileSize": 1048576}'
# → {"uploadUrl": "https://storage.googleapis.com/...", "key": "uploads/abc/data.csv", "uploadToken": "tok_..."}
# 2. PUT the file directly to cloud storage (the uploadUrl is pre-signed — no auth header needed)
curl -X PUT "<uploadUrl from step 1>" \
-H "Content-Type: text/csv" \
--data-binary @data.csv
# 3. Finalize the upload
curl -X POST https://disco.leap-labs.com/api/data/upload/finalize \
-H "Authorization: Bearer disco_..." \
-H "Content-Type: application/json" \
-d '{"key": "uploads/abc/data.csv", "uploadToken": "tok_..."}'
# → {"ok": true, "file": {...}, "columns": [...], "rowCount": 5000}
Pass the finalize response directly to discovery_analyze as file_ref. No size limit.
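For callers assembling these requests in code rather than curl, the two JSON bodies can be built as below. This is a sketch: the helper names are hypothetical, the field names are copied from the curl examples above, and treating fileSize as a byte count is an assumption based on the example value.

```python
import os


def build_presign_request(path: str, content_type: str = "text/csv") -> dict:
    """Body for POST /api/data/upload/presign."""
    return {
        "fileName": os.path.basename(path),
        "contentType": content_type,
        "fileSize": os.path.getsize(path),  # bytes (assumed unit)
    }


def build_finalize_request(presign_response: dict) -> dict:
    """Body for POST /api/data/upload/finalize: echo back key and uploadToken."""
    return {
        "key": presign_response["key"],
        "uploadToken": presign_response["uploadToken"],
    }
```

Only key and uploadToken are carried from the presign response to the finalize call; uploadUrl is used solely for the PUT in step 2.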
Small file — direct upload (single HTTP call, simpler than presign):
curl -X POST https://disco.leap-labs.com/api/data/upload/direct \
-H "Authorization: Bearer disco_..." \
-H "Content-Type: application/json" \
-d '{"fileName": "data.csv", "content": "<base64-encoded file content>"}'
# → {"ok": true, "file": {...}, "columns": [...], "rowCount": 5000}
Pass the response directly to discovery_analyze as file_ref. Simpler than the 3-step presign flow but the entire file must fit in the request body. For large files, use presigned uploads or the Python SDK.
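The request body for the single-step upload can be produced with the standard library. A minimal sketch, assuming the two fields shown in the curl example are the only ones required; `build_direct_upload_body` is a hypothetical helper name.

```python
import base64
import json


def build_direct_upload_body(file_name: str, raw: bytes) -> str:
    """JSON body for POST /api/data/upload/direct."""
    encoded = base64.b64encode(raw).decode("ascii")
    return json.dumps({"fileName": file_name, "content": encoded})


body = build_direct_upload_body("data.csv", b"a,b\n1,2\n")
print(len(body), "bytes of JSON")
```

Base64 encodes every 3 bytes as 4 characters, so the request body is roughly a third larger than the file itself, which is one reason this path suits small files only.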
Last resort — tiny file already in memory:
Only use this if the file is already loaded into memory and none of the above options apply. The base64-encoded content passes through the model's context window, so this only works for very small files.
import base64
content = base64.b64encode(open("data.csv", "rb").read()).decode()
discovery_upload(file_content=content, file_name="data.csv")
→ {"file": {...}, "columns": [...], "rowCount": 500}
discovery_analyze(file_ref=<result above>, target_column="outcome")
discovery_upload:
Provide exactly one of file_url, file_path, or file_content.
- file_url: http/https URL. The server downloads it directly. Best option for hosted MCP.
- file_path: Absolute path to a local file. Only works when the MCP server is running locally. Against the hosted server, returns a File not found error.
- file_content: File contents, base64-encoded. Last resort only; the content passes through the model's context window, so this only works for very small files.
- file_name: Filename with extension (e.g. "data.csv"), used for format detection. Required with file_content. Default: "data.csv".

Returns a file_ref (pass it directly to discovery_analyze) and columns (list of column names and types, useful if you need to inspect before choosing a target column).
discovery_analyze:
- file_ref: File reference returned by discovery_upload. Required.
- target_column: The column to predict/explain.
- analysis_depth: 2 = default, higher = deeper analysis. Max: num_columns - 2.
- visibility: "public" (free, results published) or "private" (costs credits).
- column_descriptions: JSON object mapping column names to descriptions. Significantly improves pattern explanations; always provide if column names are non-obvious.
- excluded_columns: JSON array of column names to exclude from analysis (see Preparing Your Data below).
- title: Optional title for the analysis.
- description: Optional description of the dataset.
- use_llms: false (default) or true. Setting true is slower and more expensive, but gives smarter pre-processing, literature context, and novelty assessment. Public runs always use LLMs regardless of this setting. Tradeoffs when false: pattern descriptions are generic, novelty is not assessed (no citations), report summaries are omitted, ambiguous integer columns (e.g. "month" 1-12) may be misclassified as categorical, and text cluster names are generic.
- author: Optional author name for the dataset.
- source_url: Optional URL of the original data source.

New account: Call discovery_signup with the user's email. This sends a verification code, so the user must check their email. Then call discovery_signup_verify with the code to receive a disco_ API key. Free tier: 10 credits/month, unlimited public runs. No password, no credit card.
Existing account (lost key or new session): Call discovery_login with the user's email. Same OTP flow — sends a code, then call discovery_login_verify to get a new API key.
- Use discovery_estimate to show what it would cost.
- Use discovery_purchase_credits / discovery_subscribe to add credits.

Before running an analysis, you must exclude columns that would produce meaningless findings. Disco finds statistically real patterns, but if the input includes columns that are definitionally related to the target, the patterns will be true by definition, not by discovery.
Always exclude these column types via excluded_columns:
Row IDs, patient IDs, UUIDs, accession numbers, sample codes. These are arbitrary labels with no analytical signal.
Columns that are the target column renamed, reformatted, or binned. Example: diagnosis_text when the target is diagnosis_code.
This is the most important category. Columns that encode the same underlying construct as the target — through alternative classifications, component parts, or derived calculations. These produce findings that are trivially true.
Examples:
- If the target is serious, then serious_outcome (categories like death, disability, hospitalisation), not_serious, and death are all part of the same seriousness classification. A finding that "death predicts seriousness" is a tautology, not a discovery.
- If the target is response, then response_category, responder_flag, and RECIST_response are all encodings of the same outcome.
- If the target is profit, then revenue and cost together compose it (profit = revenue − cost).

How to identify them: Ask "is this column just a different way of expressing what the target already measures?" If yes, exclude it.
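Identifier columns are the one category that can be flagged mechanically before asking the user. The sketch below is a heuristic of my own, not something Disco provides: `suggest_identifier_columns` is a hypothetical helper that flags string or integer columns whose values are unique on every row. Tautological columns still need the human judgement described above.

```python
def suggest_identifier_columns(rows, target_column):
    """Return names of columns that look like per-row identifiers."""
    if len(rows) < 2:
        return []
    suggestions = []
    for name in rows[0]:
        if name == target_column:
            continue
        values = [row.get(name) for row in rows]
        # bool is a subclass of int, so exclude it explicitly.
        label_like = all(
            isinstance(v, (str, int)) and not isinstance(v, bool) for v in values
        )
        if label_like and len(set(values)) == len(values):
            suggestions.append(name)
    return suggestions


rows = [
    {"case_id": "A1", "age": 40, "serious": 1},
    {"case_id": "A2", "age": 40, "serious": 0},
    {"case_id": "A3", "age": 52, "serious": 1},
]
print(suggest_identifier_columns(rows, target_column="serious"))
```

Treat the output as candidates to confirm with the user; a genuinely informative integer column can also happen to be unique on a small sample.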
# Example: FAERS adverse event analysis
excluded_columns=["serious_outcome", "not_serious", "death", "hospitalization",
"disability", "congenital_anomaly", "life_threatening",
"required_intervention", "case_id", "report_id"]
Disco is not another AI data analyst that writes pandas or SQL for you. It is a discovery pipeline — it finds patterns in data that you, the user, and other analysis tools would miss because they don't know to look for them.
Use it when you need to go beyond answering questions about data, and start finding things nobody thought to ask:
Use Disco when the user says: "what's really driving X?", "are there patterns we're missing?", "find something new in this data", "what predicts Y that we haven't considered?", "go deeper than correlation", "discover non-obvious relationships"
Use pandas/SQL instead when the user says: "summarize this data", "make a chart", "what's the average?", "filter rows where X > 5", "show me the distribution"
Disco finds complex patterns in your data — feature interactions, nonlinear thresholds, and meaningful subgroups — without requiring prior hypotheses about what matters. Each pattern is validated on hold-out data, corrected for multiple testing, and checked for novelty against academic literature with citations.
This is a computational pipeline, not prompt engineering over data. You cannot replicate what it does by writing pandas code or asking an LLM to look at a CSV. It finds structure that hypothesis-driven analysis misses because it doesn't start with hypotheses.
Programmatic (for agents): POST the email to /api/signup. The server either returns the API key directly (free-tier email-only signup, the common case) or asks for OTP verification — branch on the response shape.
curl -X POST https://disco.leap-labs.com/api/signup \
-H "Content-Type: application/json" \
-d '{"email": "agent@example.com"}'
Response — direct key (no verification needed):
{"key": "disco_...", "key_id": "...", "organization_id": "...", "tier": "free_tier", "credits": 10}
Response — verification required (only when the server explicitly asks):
{"status": "verification_required", "email": "agent@example.com"}
If you get the latter, the user must read the 6-digit code from their email, then submit it:
curl -X POST https://disco.leap-labs.com/api/signup/verify \
-H "Content-Type: application/json" \
-d '{"email": "agent@example.com", "code": "123456"}'
# → {"key": "disco_...", "tier": "free_tier", "credits": 10}
The Python SDK's Engine.signup() and Engine.login() already handle both shapes — prefer them over raw HTTP if you can.
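If raw HTTP is unavoidable, the branch on response shape might look like the following. This is a sketch over the two response bodies shown above; `interpret_signup_response` is a hypothetical helper and makes no network calls itself.

```python
from typing import Optional, Tuple


def interpret_signup_response(body: dict) -> Tuple[str, Optional[str]]:
    """Classify a parsed /api/signup response as one of the two documented shapes."""
    if "key" in body:
        return ("key_issued", body["key"])
    if body.get("status") == "verification_required":
        # Caller must now collect the emailed code and POST to /api/signup/verify.
        return ("verification_required", None)
    raise ValueError(f"unrecognised signup response: {sorted(body)}")


print(interpret_signup_response({"status": "verification_required",
                                 "email": "agent@example.com"}))
```

Raising on an unrecognised shape, rather than guessing, keeps an agent from storing something that is not an API key.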
Existing account (lost key or new session): Same OTP flow via /api/login and /api/login/verify, or in the SDK:
engine = await Engine.login(email="agent@example.com")
Manual (for humans): Sign up at https://disco.leap-labs.com/sign-up, create key at https://disco.leap-labs.com/developers.
pip install discovery-engine-api
Disco runs are async and can take a while. Do not block on them — submit the run, continue with other work, and retrieve results when ready.
from discovery import Engine
# If you already have an API key:
engine = Engine(api_key="disco_...")
# Or sign up for one.
# Sends a code to the email address and prompts for it interactively.
# Requires a terminal — for fully automated agents, use the two-step REST API
# in the "Getting an API Key" section above instead.
engine = await Engine.signup(email="agent@example.com")
# One-call method: submit, poll, and return results automatically
result = await engine.discover(
file="data.csv",
target_column="outcome",
)
# result.patterns contains the discovered patterns
for pattern in result.patterns:
    if pattern.p_value < 0.05 and pattern.novelty_type == "novel":
        print(f"{pattern.description} (p={pattern.p_value:.4f})")
If you need to see the dataset's columns before choosing a target column, upload first and inspect:
# Upload once and get the server's parsed column list
upload = await engine.upload_file(file="data.csv", title="My dataset")
print(upload["columns"]) # [{"name": "col1", "type": "continuous", ...}, ...]
print(upload["rowCount"]) # e.g., 5000
# Pass the result to avoid re-uploading
result = await engine.run_async(
file="data.csv",
target_column="col1",
wait=True,
upload_result=upload, # skips the upload step
)
If you need to do other work while Disco runs (recommended for agent workflows):
# Submit and return immediately (wait=False is the default for run_async)
run = await engine.run_async(file="data.csv", target_column="outcome")
print(f"Submitted run {run.run_id}, continuing with other work...")
# ... do other things ...
# Check back later
result = await engine.wait_for_completion(run.run_id, timeout=1800)
This is the preferred pattern for agents. engine.discover() is a convenience wrapper that does this internally with wait=True.
Non-async contexts: use engine.discover_sync() — same signature as discover(), runs in a managed event loop.
Here's a truncated real response from a crop yield analysis (target column: yield_tons_per_hectare). This is what engine.discover() returns:
EngineResult(
run_id="a1b2c3d4-...",
status="completed",
task="regression",
total_rows=5012,
report_url="https://disco.leap-labs.com/reports/a1b2c3d4-...",
summary=Summary(
overview="Disco identified 14 statistically significant patterns in this "
"agricultural dataset. 5 patterns are novel — not reported in existing literature. "
"The strongest driver of crop yield is a previously unreported interaction between "
"humidity and wind speed at specific thresholds.",
key_insights=[
"Humidity alone is a known predictor, but the interaction with low wind speed at "
"72-89% humidity produces a 34% yield increase — a novel finding.",
"Soil nitrogen above 45 mg/kg shows diminishing returns when phosphorus is below "
"12 mg/kg, contradicting standard fertilization guidelines.",
"Planting density has a non-linear effect: the optimal range (35-42 plants/m²) is "
"narrower than current recommendations suggest.",
],
novel_patterns=PatternGroup(
pattern_ids=["p-1", "p-2", "p-5", "p-9", "p-12"],
explanation="5 of 14 patterns have not been reported in the agricultural literature. "
"The humidity × wind interaction (p-1) and the nitrogen-phosphorus "
"diminishing returns effect (p-2) are the most significant novel findings."
),
),
patterns=[
# Pattern 1: Novel multi-condition interaction
Pattern(
id="p-1",
task="regression",
target_column="yield_tons_per_hectare",
description="When humidity is between 72-89% AND wind speed is below 12 km/h, "
"crop yield increases by 34% above the dataset average",
conditions=[
{"type": "continuous", "feature": "humidity_pct",
"min_value": 72.0, "max_value": 89.0, "min_q": 0.55, "max_q": 0.88},
{"type": "continuous", "feature": "wind_speed_kmh",
"min_value": 0.0, "max_value": 12.0, "min_q": 0.0, "max_q": 0.41},
],
p_value=0.003, # FDR-corrected
p_value_raw=0.0004,
novelty_type="novel",
novelty_explanation="Published studies examine humidity and wind speed as independent "
"predictors of crop yield, but this interaction effect — where "
"low wind amplifies the benefit of high humidity within a specific "
"range — has not been reported in the literature.",
citations=[
{"title": "Effects of relative humidity on cereal crop productivity",
"authors": ["Zhang, L.", "Wang, H."], "year": "2021",
"journal": "Journal of Agricultural Science", "doi": "10.1017/S0021859621000..."},
{"title": "Wind exposure and grain yield: a meta-analysis",
"authors": ["Patel, R.", "Singh, K."], "year": "2019",
"journal": "Field Crops Research", "doi": "10.1016/j.fcr.2019.03..."},
],
target_change_direction="max",
abs_target_change=0.34,
target_score=0.81,
support_count=847,
support_percentage=16.9,
target_mean=8.7,
target_std=1.2,
),
# Pattern 2: Novel — contradicts existing guidelines
Pattern(
id="p-2",
task="regression",
target_column="yield_tons_per_hectare",
description="When soil nitrogen exceeds 45 mg/kg AND soil phosphorus is below "
"12 mg/kg, crop yield decreases by 18% — a diminishing returns effect "
"not captured by standard fertilization models",
conditions=[
{"type": "continuous", "feature": "soil_nitrogen_mg_kg",
"min_value": 45.0, "max_value": 98.0, "min_q": 0.72, "max_q": 1.0},
{"type": "continuous", "feature": "soil_phosphorus_mg_kg",
"min_value": 1.0, "max_value": 12.0, "min_q": 0.0, "max_q": 0.31},
],
p_value=0.008,
p_value_raw=0.0012,
novelty_type="novel",
novelty_explanation="Nitrogen-phosphorus balance is studied extensively, but the "
"specific threshold at which high nitrogen becomes counterproductive "
"under low phosphorus conditions has not been quantified in field studies.",
citations=[
{"title": "Nitrogen-phosphorus interactions in cereal cropping systems",
"authors": ["Mueller, T.", "Fischer, A."], "year": "2020",
"journal": "Nutrient Cycling in Agroecosystems", "doi": "10.1007/s10705-020-..."},
],
target_change_direction="min",
abs_target_change=0.18,
target_score=0.74,
support_count=634,
support_percentage=12.7,
target_mean=5.3,
target_std=1.8,
),
# Pattern 3: Confirmatory — validates known finding
Pattern(
id="p-3",
task="regression",
target_column="yield_tons_per_hectare",
description="When soil organic matter is above 3.2% AND irrigation is 'drip', "
"crop yield increases by 22%",
conditions=[
{"type": "continuous", "feature": "soil_organic_matter_pct",
"min_value": 3.2, "max_value": 7.1, "min_q": 0.61, "max_q": 1.0},
{"type": "categorical", "feature": "irrigation_type",
"values": ["drip"]},
],
p_value=0.001,
p_value_raw=0.0001,
novelty_type="confirmatory",
novelty_explanation="The positive interaction between soil organic matter and drip "
"irrigation efficiency is well-documented in the literature.",
citations=[
{"title": "Drip irrigation and soil health: a systematic review",
"authors": ["Kumar, S.", "Patel, A."], "year": "2022",
"journal": "Agricultural Water Management", "doi": "10.1016/j.agwat.2022..."},
],
target_change_direction="max",
abs_target_change=0.22,
target_score=0.69,
support_count=1203,
support_percentage=24.0,
target_mean=7.9,
target_std=1.5,
),
# ... 11 more patterns omitted
],
feature_importance=FeatureImportance(
kind="global",
baseline=6.5, # Mean yield across the dataset
scores=[
FeatureImportanceScore(feature="humidity_pct", score=1.82),
FeatureImportanceScore(feature="soil_nitrogen_mg_kg", score=1.45),
FeatureImportanceScore(feature="soil_organic_matter_pct", score=1.21),
FeatureImportanceScore(feature="irrigation_type", score=0.94),
FeatureImportanceScore(feature="wind_speed_kmh", score=-0.67),
FeatureImportanceScore(feature="planting_density_per_m2", score=0.58),
# ... more features
],
),
columns=[
Column(name="yield_tons_per_hectare", type="continuous", data_type="float",
mean=6.5, median=6.2, std=2.1, min=1.1, max=14.3),
Column(name="humidity_pct", type="continuous", data_type="float",
mean=65.3, median=67.0, std=18.2, min=12.0, max=99.0),
Column(name="irrigation_type", type="categorical", data_type="string",
approx_unique=4, mode="furrow"),
# ... more columns
],
)
Key things to notice:
- report_url links to an interactive web report. Drop this in your response so the user can explore visually. Private runs require sign-in: tell the user to sign in at the dashboard using the same email address the account was created with (email verification code, no password needed). Public runs are accessible to anyone.

engine.discover(
file: str | Path | pd.DataFrame, # Dataset to analyze
target_column: str, # Column to predict/analyze
analysis_depth: int = 2, # 2=default, higher=deeper analysis (max: num_columns - 2)
visibility: str = "public", # "public" (free, results will be published) or "private" (costs credits)
title: str | None = None, # Dataset title
description: str | None = None, # Dataset description
column_descriptions: dict[str, str] | None = None, # Column descriptions for better pattern explanations
excluded_columns: list[str] | None = None, # Columns to exclude from analysis
use_llms: bool = False, # True = LLM explanations (costs more) — see below
timeout: float = 1800, # Max seconds to wait for completion
)
Tip: Providing column_descriptions significantly improves pattern explanations. If your columns have non-obvious names (e.g., col_7, feat_a), always describe them.
- Use discovery_estimate to check cost before running.

Agents can attach a payment method and purchase credits entirely via the API; no browser required.
Step 1 — Get your Stripe publishable key
account = await engine.get_account()
stripe_pk = account["stripe_publishable_key"]
stripe_customer_id = account["stripe_customer_id"]
Or via REST:
curl https://disco.leap-labs.com/api/account \
-H "Authorization: Bearer disco_..."
# → { "stripe_publishable_key": "pk_live_...", "stripe_customer_id": "cus_...", "credits": {...}, ... }
Step 2 — Tokenize a card using the Stripe API
Use the publishable key to create a Stripe PaymentMethod. Card data goes directly to Stripe — Disco never sees it.
import requests
pm_response = requests.post(
"https://api.stripe.com/v1/payment_methods",
auth=(stripe_pk, ""), # publishable key as username, empty password
data={
"type": "card",
"card[number]": "4242424242424242",
"card[exp_month]": "12",
"card[exp_year]": "2028",
"card[cvc]": "123",
},
)
payment_method_id = pm_response.json()["id"] # "pm_..."
Step 3 — Attach the payment method
result = await engine.add_payment_method(payment_method_id)
# → {"payment_method_attached": True, "card_last4": "4242", "card_brand": "visa"}
Or via REST:
curl -X POST https://disco.leap-labs.com/api/account/payment-method \
-H "Authorization: Bearer disco_..." \
-H "Content-Type: application/json" \
-d '{"payment_method_id": "pm_..."}'
Step 4 — Purchase credits
Credits are sold in packs of 100 ($10/pack, $0.10/credit).
result = await engine.purchase_credits(packs=1)
# → {"purchased_credits": 100, "total_credits": 110, "charge_amount_usd": 10.0, "stripe_payment_id": "pi_..."}
Or via REST:
curl -X POST https://disco.leap-labs.com/api/account/credits/purchase \
-H "Authorization: Bearer disco_..." \
-H "Content-Type: application/json" \
-d '{"packs": 1}'
Subscriptions (optional)
For regular usage, subscribe to a paid plan instead of buying packs:
# Plans: free_tier ($0, 10 cr/mo), tier_1 ($49, 500 cr/mo), tier_2 ($199, 2000 cr/mo)
result = await engine.subscribe(plan="tier_1")
# → {"plan": "tier_1", "name": "Researcher", "monthly_credits": 500, "price_usd": 49}
Requires a payment method on file. See GET /api/plans for full plan details.
Before submitting a private analysis, estimate the credit cost:
estimate = await engine.estimate(
file_size_mb=10.5,
num_columns=25,
analysis_depth=2,
visibility="private",
)
# estimate["cost"]["credits"] → 55
# estimate["account"]["sufficient"] → True/False
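When the estimate exceeds the available balance, the number of credit packs to buy follows from the pack size stated earlier (100 credits for $10). A small sketch; `packs_needed` is a hypothetical helper, and the prices are the ones quoted in this document and may change.

```python
import math

CREDITS_PER_PACK = 100  # pack size quoted in this document
USD_PER_PACK = 10.0     # pack price quoted in this document


def packs_needed(cost_credits: int, available_credits: int) -> int:
    """Smallest number of packs that covers the shortfall (0 if none)."""
    shortfall = max(0, cost_credits - available_credits)
    return math.ceil(shortfall / CREDITS_PER_PACK)


packs = packs_needed(cost_credits=55, available_credits=10)
print(packs, "pack(s), about", packs * USD_PER_PACK, "USD")
```

Show the user both the pack count and the dollar amount before calling purchase_credits, in line with the confirm-before-spending rule above.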
from __future__ import annotations  # lets the classes below reference each other before definition
from dataclasses import dataclass

@dataclass
class EngineResult:
    run_id: str
    report_id: str | None  # Report UUID (used in report_url)
    status: str  # "pending", "processing", "completed", "failed"
    dataset_title: str | None  # Title of the dataset
    dataset_description: str | None  # Description of the dataset
    total_rows: int | None
    target_column: str | None  # Column being predicted/analyzed
    task: str | None  # "regression", "binary_classification", "multiclass_classification"
    summary: Summary | None  # LLM-generated insights
    patterns: list[Pattern]  # Discovered patterns (the core output)
    columns: list[Column]  # Feature info and statistics
    correlation_matrix: list[CorrelationEntry]  # Feature correlations
    feature_importance: FeatureImportance | None  # Global importance scores
    job_id: str | None  # Job ID for tracking
    job_status: str | None  # Job queue status
    queue_position: int | None  # Position in queue when pending (1 = next up)
    current_step: str | None  # Active pipeline step (preprocessing, training, interpreting, reporting)
    current_step_message: str | None  # Human-readable description of the current step
    estimated_wait_seconds: int | None  # Estimated queue wait time in seconds (pending only)
    error_message: str | None
    report_url: str | None  # Shareable link to interactive web report
    dashboard_urls: dict[str, dict[str, str]] | None  # Direct links to report sections (summary, patterns, territory, features)
    hints: list[str]  # Upgrade hints (non-empty for free-tier users with hidden patterns)
    hidden_deep_count: int  # Patterns hidden for free-tier accounts (upgrade to see all)
    hidden_deep_novel_count: int  # Novel patterns hidden for free-tier accounts

@dataclass
class Pattern:
    id: str
    description: str  # Human-readable description of the pattern
    conditions: list[dict]  # Conditions defining the pattern (feature ranges/values)
    p_value: float  # FDR-adjusted p-value (lower = more significant)
    p_value_raw: float | None  # Raw p-value before FDR adjustment
    novelty_type: str  # "novel" (new finding) or "confirmatory" (known in literature)
    novelty_explanation: str  # Why this is novel or confirmatory
    citations: list[dict]  # Academic citations supporting novelty assessment
    target_change_direction: str  # "max" (increases target) or "min" (decreases target)
    abs_target_change: float  # Magnitude of effect
    support_count: int  # Number of rows matching this pattern
    support_percentage: float  # Percentage of dataset
    target_score: float  # Mean target value (regression) or class fraction (classification) in the subgroup
    task: str
    target_column: str
    target_class: str | None  # For classification tasks
    target_mean: float | None  # For regression tasks
    target_std: float | None

@dataclass
class PatternGroup:
    pattern_ids: list[str]  # IDs of patterns in this group
    explanation: str  # Why these patterns are grouped

@dataclass
class Summary:
    overview: str  # High-level summary
    key_insights: list[str]  # Main takeaways
    novel_patterns: PatternGroup  # Novel pattern IDs and explanation
    selected_pattern_id: str | None

@dataclass
class CorrelationEntry:
    feature_x: str
    feature_y: str
    value: float

@dataclass
class Column:
    id: str
    name: str
    display_name: str
    type: str  # "continuous" or "categorical"
    data_type: str  # "int", "float", "string", "boolean", "datetime"
    enabled: bool
    description: str | None
    mean: float | None
    median: float | None
    std: float | None
    min: float | None
    max: float | None
    iqr_min: float | None
    iqr_max: float | None
    mode: str | None  # Most common value (categorical columns)
    approx_unique: int | None  # Approximate distinct value count
    null_percentage: float | None
    feature_importance_score: float | None

@dataclass
class FeatureImportance:
    kind: str  # "global"
    baseline: float
    scores: list[FeatureImportanceScore]

@dataclass
class FeatureImportanceScore:
    feature: str
    score: float  # Signed importance score
# Filter for significant novel patterns
novel = [p for p in result.patterns if p.p_value < 0.05 and p.novelty_type == "novel"]

# Get patterns that increase the target
increasing = [p for p in result.patterns if p.target_change_direction == "max"]

# Get the most important features
if result.feature_importance:
    top_features = sorted(result.feature_importance.scores, key=lambda s: abs(s.score), reverse=True)

# Access pattern conditions (the "rules" defining the pattern)
for pattern in result.patterns:
    for cond in pattern.conditions:
        # cond has: type ("continuous"/"categorical"), feature, min_value/max_value or values
        print(f" {cond['feature']}: {cond}")
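The condition dicts above can also be turned into readable rule strings. The following is a minimal sketch, assuming each condition carries the keys named in the comment above (`type`, `feature`, and either `min_value`/`max_value` or `values`) and that a missing bound means "unbounded". The helper names `describe_condition` and `describe_pattern` and the sample data are hypothetical and not part of the SDK.

```python
def describe_condition(cond: dict) -> str:
    """Render one pattern condition as a short human-readable rule."""
    feature = cond["feature"]
    if cond.get("type") == "categorical":
        values = ", ".join(str(v) for v in cond.get("values", []))
        return f"{feature} in {{{values}}}"
    # Continuous: either bound may be absent (assumed here to mean "unbounded").
    lo, hi = cond.get("min_value"), cond.get("max_value")
    if lo is not None and hi is not None:
        return f"{lo} <= {feature} <= {hi}"
    if lo is not None:
        return f"{feature} >= {lo}"
    if hi is not None:
        return f"{feature} <= {hi}"
    return f"{feature} (any value)"


def describe_pattern(conditions: list[dict]) -> str:
    """Join all conditions of a pattern into a single rule string."""
    return " AND ".join(describe_condition(c) for c in conditions)


# Hypothetical conditions, shaped like the dicts described above.
example = [
    {"type": "continuous", "feature": "age", "min_value": 30, "max_value": 45},
    {"type": "categorical", "feature": "region", "values": ["north", "east"]},
]
print(describe_pattern(example))  # 30 <= age <= 45 AND region in {north, east}
```

With a real result, the same helper could be called as `describe_pattern(pattern.conditions)` inside the loop over `result.patterns`.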
from discovery.errors import (
    AuthenticationError,
    InsufficientCreditsError,
    RateLimitError,
    RunFailedError,
    RunNotFoundError,
    PaymentRequiredError,
)

try:
    result = await engine.discover(file="data.csv", target_column="target")
except AuthenticationError as e:
    pass  # Invalid or expired API key — check e.suggestion
except InsufficientCreditsError as e:
    pass  # Not enough credits — e.credits_required, e.credits_available, e.suggestion
except RateLimitError as e:
    pass  # Too many requests — retry after e.retry_after seconds
except RunFailedError as e:
    pass  # Run failed server-side — e.run_id
except RunNotFoundError as e:
    pass  # Run not found — e.run_id (may have been cleaned up)
except PaymentRequiredError as e:
    pass  # Payment method needed — check e.suggestion
except FileNotFoundError:
    pass  # File doesn't exist
except TimeoutError:
    pass  # Didn't complete in time — retrieve later with engine.wait_for_completion(run_id)
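All errors imported from discovery.errors inherit from DiscoveryError and include a suggestion field with actionable instructions. FileNotFoundError and TimeoutError in the example above are Python built-ins.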
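One way to act on `retry_after` is a small retry wrapper. This is a sketch only: the `RateLimitError` class below is a local stand-in so the example runs without the SDK installed (its constructor is an assumption, not the SDK's), and `call_with_retry` and `fake_discover` are hypothetical names.

```python
import asyncio


class RateLimitError(Exception):
    """Stand-in for discovery.errors.RateLimitError so the sketch runs alone."""

    def __init__(self, retry_after: float):
        super().__init__(f"rate limited, retry after {retry_after}s")
        self.retry_after = retry_after


async def call_with_retry(make_call, max_attempts: int = 3, sleep=asyncio.sleep):
    """Await make_call(), sleeping e.retry_after seconds after each rate limit."""
    for attempt in range(1, max_attempts + 1):
        try:
            return await make_call()
        except RateLimitError as e:
            if attempt == max_attempts:
                raise  # Out of attempts: surface the error to the caller.
            await sleep(e.retry_after)


# Demo with a fake call that is rate limited twice, then succeeds.
calls = {"n": 0}


async def fake_discover():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RateLimitError(retry_after=0.01)
    return "ok"


result = asyncio.run(call_with_retry(fake_discover))
print(result, calls["n"])  # ok 3
```

With the real SDK, the stand-in class would be replaced by the import from `discovery.errors`, and `make_call` could be `lambda: engine.discover(file="data.csv", target_column="target")`.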
Disco expects a flat table — columns for features, rows for samples.
Supported formats: CSV, TSV, Excel (.xlsx), JSON, Parquet, ARFF, Feather. Max 5 GB.
Not supported: images, raw text documents, nested/hierarchical JSON, multi-sheet Excel (use the first sheet or export to CSV).
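Nested JSON can often be flattened into the flat table Disco expects before upload. The sketch below uses only the Python standard library; the dotted column naming, the choice to store lists as JSON text, and the function names `flatten` and `records_to_csv` are illustrative assumptions rather than anything Disco requires.

```python
import csv
import io
import json


def flatten(record: dict, prefix: str = "") -> dict:
    """Collapse nested dicts into one level, joining keys with dots."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        elif isinstance(value, list):
            # Lists have no single obvious flat form; store them as JSON text.
            flat[name] = json.dumps(value)
        else:
            flat[name] = value
    return flat


def records_to_csv(records: list[dict]) -> str:
    """Flatten each record and write them all as CSV text with a shared header."""
    rows = [flatten(r) for r in records]
    # Union of all keys, in first-seen order, so every row has the same columns.
    columns = list(dict.fromkeys(k for row in rows for k in row))
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, restval="", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical nested records; the second one is missing patient.sex.
nested = [
    {"id": 1, "patient": {"age": 54, "sex": "F"}, "outcome": 1},
    {"id": 2, "patient": {"age": 61}, "outcome": 0},
]
csv_text = records_to_csv(nested)
print(csv_text)
```

Missing nested fields become empty cells, so every row keeps the same columns.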
When presenting Disco results, render interactive visualizations; don't just dump text. Default order (adapt based on what the user asked):
1. Summary: show summary.overview and summary.key_insights. Link to dashboard_urls.summary.url.
2. Pattern plots: for the top patterns, render a violin plot with one violin per condition, one for all conditions combined, and one for the overall dataset. The Y-axis is the target variable. This shows how each condition narrows the distribution. Link to dashboard_urls.patterns.url.
3. Territory map: a 3D surface where the X and Y axes are two features from a pattern's conditions and the Z axis is the target. This shows the interaction landscape and works best when patterns involve feature interactions. Link to dashboard_urls.territory.url.
4. Feature importance: horizontal waterfall bars floating from zero, sorted by absolute contribution. Link to dashboard_urls.features.url.
5. Correlation heatmap: a square matrix of feature correlations, sorted by correlation with the target. Link to dashboard_urls.features.url.
Use judgment: if the user asked "what drives X?", lead with feature importance. If they asked "find something new", lead with novel patterns. If they're exploring interactions, lead with territory.
For exact colors, scales, and layout details, follow the full visualization spec: https://disco.leap-labs.com/visualization-spec
Always link to the relevant dashboard_urls page so users can explore the full interactive version.
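The data behind the pattern violin plot described above can be prepared as follows. This is a sketch under assumptions: rows are plain dicts, condition dicts use the `type`, `feature`, `min_value`/`max_value`, and `values` keys, numeric bounds are treated as inclusive, and the names `matches` and `violin_groups` are hypothetical. Layout and colors are left to the visualization spec.

```python
def matches(row: dict, cond: dict) -> bool:
    """True if a row satisfies one pattern condition."""
    value = row.get(cond["feature"])
    if value is None:
        return False
    if cond.get("type") == "categorical":
        return value in cond.get("values", [])
    # Continuous: bounds assumed inclusive; a missing bound means unbounded.
    lo, hi = cond.get("min_value"), cond.get("max_value")
    return (lo is None or value >= lo) and (hi is None or value <= hi)


def violin_groups(rows: list[dict], conditions: list[dict], target: str) -> dict:
    """Return {label: [target values]} for each violin, in plotting order."""
    groups = {}
    for cond in conditions:
        # One violin per individual condition, labeled by its feature name.
        groups[cond["feature"]] = [r[target] for r in rows if matches(r, cond)]
    groups["all conditions"] = [
        r[target] for r in rows if all(matches(r, c) for c in conditions)
    ]
    groups["overall"] = [r[target] for r in rows]
    return groups


# Hypothetical dataset and pattern conditions.
rows = [
    {"age": 25, "region": "north", "y": 1.0},
    {"age": 35, "region": "north", "y": 3.0},
    {"age": 40, "region": "south", "y": 2.0},
    {"age": 50, "region": "east", "y": 0.5},
]
conditions = [
    {"type": "continuous", "feature": "age", "min_value": 30, "max_value": 45},
    {"type": "categorical", "feature": "region", "values": ["north", "east"]},
]
groups = violin_groups(rows, conditions, target="y")
print(groups)
```

Each list in `groups` is one violin. The values can then be handed to a plotting library, for example as the dataset sequence for matplotlib's `Axes.violinplot`. If two conditions share a feature, the labels would need to be made unique.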