Install
openclaw skills install data-scientist
Provides statistical analysis and predictive modeling expertise, specializing in machine learning, experimental design, and causal inference. Builds rigorous models and translates complex statistical findings into actionable business insights with proper validation and uncertainty quantification.
Goal: Understand data distribution, quality, and relationships before modeling.
# Load and profile
import pandas as pd, numpy as np, seaborn as sns, matplotlib.pyplot as plt
df = pd.read_csv("data.csv")
df.info()  # info() prints its own summary and returns None, so do not wrap it in print()
print(df.describe())
missing = df.isnull().sum() / len(df)
print(missing[missing > 0].sort_values(ascending=False))
# Univariate analysis
num_cols = df.select_dtypes(include=[np.number]).columns
for col in num_cols:
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
    sns.histplot(df[col], kde=True, ax=ax1)  # distribution shape
    sns.boxplot(x=df[col], ax=ax2)           # outliers
    plt.show()
# Correlation
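# Restrict to numeric columns: in pandas 2.0+, df.corr() raises on non-numeric columns
sns.heatmap(df[num_cols].corr(), annot=True, cmap='coolwarm')
plt.show()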
# Cleaning
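# Assign back instead of inplace=True; in-place fillna on a column selection may not update df
df['age'] = df['age'].fillna(df['age'].median())
# Cap income at the 99th percentile (missing values stay missing)
cap = df['income'].quantile(0.99)
df['income'] = df['income'].clip(upper=cap)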
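# A/B test: one-sided two-proportion z-test (H1: group B converts better than group A)
from statsmodels.stats.proportion import proportions_ztest, proportion_confint
results = df.groupby('group')['converted'].agg(['count', 'sum', 'mean'])
control, treatment = results.loc['A'], results.loc['B']
# Order is [treatment, control], so alternative='larger' tests treatment > control
count = np.array([treatment['sum'], control['sum']])
nobs = np.array([treatment['count'], control['count']])
stat, p_value = proportions_ztest(count, nobs, alternative='larger')
# Bounds come back in the same [treatment, control] order
(lt, lc), (ut, uc) = proportion_confint(count, nobs, alpha=0.05)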
If p_value < 0.05, reject H0: the difference is statistically significant at the 5% level. Then check practical significance, i.e. whether the lift is large enough to matter.
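
To make the practical-significance check concrete, here is a minimal sketch that reports absolute and relative lift alongside the one-sided z-test, using only the standard library. The function name `ab_summary`, its parameter names, and the counts are hypothetical and chosen for illustration; the interval on the difference is a plain Wald interval, which is one common choice rather than the only one.

```python
import math

def ab_summary(conv_c, n_c, conv_t, n_t, z_crit=1.96):
    """One-sided two-proportion z-test plus absolute and relative lift.

    conv_* are conversion counts, n_* are sample sizes (hypothetical names).
    """
    p_c, p_t = conv_c / n_c, conv_t / n_t
    diff = p_t - p_c
    # Pooled standard error under H0: p_t == p_c
    p_pool = (conv_c + conv_t) / (n_c + n_t)
    se_pool = math.sqrt(p_pool * (1 - p_pool) * (1 / n_c + 1 / n_t))
    z = diff / se_pool
    # One-sided p-value for H1: p_t > p_c, from the standard normal CDF
    p_value = 0.5 * (1 - math.erf(z / math.sqrt(2)))
    # Unpooled standard error for a Wald interval on the difference
    se = math.sqrt(p_c * (1 - p_c) / n_c + p_t * (1 - p_t) / n_t)
    return {
        "abs_lift": diff,
        "rel_lift": diff / p_c,
        "z": z,
        "p_value": p_value,
        "ci_diff": (diff - z_crit * se, diff + z_crit * se),
    }

# Made-up counts, for illustration only
summary = ab_summary(conv_c=200, n_c=2000, conv_t=250, n_t=2000)
for key, value in summary.items():
    print(key, value)
```

A result can clear the 5% threshold and still be too small to act on, so it helps to compare `abs_lift` and the lower end of `ci_diff` against a minimum effect size agreed on before the test.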
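# Propensity score matching: effect of premium membership on spend
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors
# Propensity scores
confounders = ['age', 'income', 'tenure']
logit = LogisticRegression().fit(df[confounders], df['is_premium'])
df['pscore'] = logit.predict_proba(df[confounders])[:, 1]
# Split after scoring so both groups carry pscore (assumes is_premium is coded 0/1)
treatment = df[df['is_premium'] == 1]
control = df[df['is_premium'] == 0]
# Nearest neighbor matching (1:1, with replacement)
nn = NearestNeighbors(n_neighbors=1).fit(control[['pscore']])
_, indices = nn.kneighbors(treatment[['pscore']])
matched_control = control.iloc[indices.flatten()]
# Matching each treated unit to a control estimates the effect on the treated (ATT), not the ATE
att = treatment['spend'].mean() - matched_control['spend'].mean()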
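
A matched estimate is only as credible as the covariate balance behind it. The sketch below computes the standardized mean difference (SMD) for one confounder before and after matching, using only the standard library. The helper name and the age values are made up for illustration, and the 0.1 cutoff is a common rule of thumb, not a hard standard.

```python
import math
from statistics import mean, variance

def standardized_mean_difference(treated, control):
    """SMD = (mean_t - mean_c) / sqrt((var_t + var_c) / 2), with sample variances."""
    pooled_sd = math.sqrt((variance(treated) + variance(control)) / 2)
    return (mean(treated) - mean(control)) / pooled_sd

# Made-up ages, for illustration only
treated_age = [34, 41, 38, 45, 50]
control_age_before = [22, 25, 31, 28, 24]   # all controls, before matching
control_age_matched = [35, 40, 39, 44, 49]  # matched controls

smd_before = standardized_mean_difference(treated_age, control_age_before)
smd_after = standardized_mean_difference(treated_age, control_age_matched)
print(f"SMD before matching: {smd_before:.2f}")
print(f"SMD after matching:  {smd_after:.2f}")
```

Running the same check for every confounder shows whether matching actually made the groups comparable; a large remaining SMD suggests revisiting the propensity model or the matching settings.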
| Anti-Pattern | Problem | Fix |
|---|---|---|
| Data Leakage | Scaling/encoding before split | Pipeline; fit only on train |
| P-Hacking | Testing 50 hypotheses, reporting only those with p < 0.05 | Bonferroni/FDR correction; pre-register |
| Imbalanced Classes | 99.9% accuracy on 0.1% fraud | Use PR-AUC, F1; SMOTE; class_weight |
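
The Bonferroni/FDR fix in the table can be sketched in a few lines of plain Python. This is an illustrative implementation of the Bonferroni threshold and the Benjamini-Hochberg step-up procedure; the function names and the p-values are made up for the example.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject H0 only where p <= alpha / m (controls the family-wise error rate)."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure (controls the false discovery rate)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k whose sorted p-value satisfies p_(k) <= (k / m) * alpha
    max_k = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * alpha:
            max_k = rank
    # Reject every hypothesis ranked at or below k
    reject = [False] * m
    for i in order[:max_k]:
        reject[i] = True
    return reject

# Made-up p-values from ten hypothetical tests
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]

naive = [p < 0.05 for p in p_values]
print("uncorrected:", sum(naive))
print("Bonferroni: ", sum(bonferroni(p_values)))
print("BH (FDR):   ", sum(benjamini_hochberg(p_values)))
```

Bonferroni is the stricter of the two; Benjamini-Hochberg trades some of that strictness for power, which tends to suit exploratory work with many tests.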