Cost Prediction

Predict construction project costs using Machine Learning. Use Linear Regression, K-Nearest Neighbors, and Random Forest models on historical project data. Train, evaluate, and deploy cost prediction models.

MIT-0 · Free to use, modify, and redistribute. No attribution required.

⭐ 0 · 1.1k · 5 current installs · 5 all-time installs

by@datadrivenconstruction

MIT-0

Security Scan

VirusTotal

Benign

View report →

OpenClaw

Benign

high confidence

✓

Purpose & Capability

Name and description (train/evaluate/deploy cost-prediction models) match the SKILL.md and instructions.md. The declared filesystem permission aligns with reading historical CSVs and saving models. The only mild ambiguity is the word 'deploy' — the instructions constrain computation to local filesystem (no external APIs), so 'deploy' appears to mean saving models locally rather than deploying to a remote service.

✓

Instruction Scope

Instructions focus on preparing local datasets, feature engineering, training LinearRegression/KNN/RandomForest, evaluating metrics, and saving models. They do not instruct reading unrelated system files, accessing environment secrets, or contacting external endpoints. The guidance to 'gather project parameters' means soliciting user input, not scanning system data.

ℹ

Install Mechanism

There is no install spec (instruction-only), which is low risk because nothing is written or downloaded by the skill itself. However, runtime Python libraries (pandas, scikit-learn, numpy) are used in the examples but not declared; users must ensure those dependencies are present. This is a minor coherence/usability gap but not a security red flag.

✓

Credentials

The skill requests no environment variables or external credentials. The filesystem permission declared in claw.json is proportionate to the stated need to read historical datasets and save trained models. Note: filesystem permission inherently allows reading any local files the agent process can access, so users should be aware of privacy of local data used for training.

✓

Persistence & Privilege

always:false and normal autonomous-invocation settings are appropriate. The skill does not request to modify other skills or system-wide settings and has no install step that would create persistent background components.

Assessment

This skill appears coherent and local-only. Before using it, confirm you trust the CSVs or other local data you will load (they may contain sensitive project or personnel information). Ensure the Python environment has pandas, numpy, and scikit-learn installed (the SKILL.md examples assume these). If you prefer stricter isolation, run the code in a controlled environment (container/VM) so filesystem access is limited. Finally, clarify what 'deploy' means for your workflow — the skill saves models locally but does not include steps to publish to a remote service.

Like a lobster shell, security has layers — review code before you run it.

Current versionv2.0.0

Download zip

latestvk97e74df71jdktyfd7hqj86xq58125pp

License

MIT-0

Free to use, modify, and redistribute. No attribution required.

Termshttps://spdx.org/licenses/MIT-0.html

SKILL.md

Construction Cost Prediction with Machine Learning

Overview

Based on DDC methodology (Chapter 4.5), this skill enables predicting construction project costs using historical data and machine learning algorithms. The approach transforms traditional expert-based estimation into data-driven prediction.

Book Reference: "Будущее: прогнозы и машинное обучение" / "Future: Predictions and Machine Learning"

"Предсказания и прогнозы на основе исторических данных позволяют компаниям принимать более точные решения о стоимости и сроках проектов." — DDC Book, Chapter 4.5

Core Concepts

Historical Data → Feature Engineering → ML Model → Cost Prediction
    │                    │                │              │
    ▼                    ▼                ▼              ▼
Past projects      Prepare data      Train model    New project
with costs         for ML            on history     cost forecast

Quick Start

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score

# Load historical project data
df = pd.read_csv("historical_projects.csv")

# Features and target
X = df[['area_m2', 'floors', 'complexity_score']]
y = df['total_cost']

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Train model
model = LinearRegression()
model.fit(X_train, y_train)

# Predict
predictions = model.predict(X_test)
print(f"R² Score: {r2_score(y_test, predictions):.2f}")
print(f"MAE: ${mean_absolute_error(y_test, predictions):,.0f}")

# Predict new project
new_project = [[5000, 10, 3]]  # area, floors, complexity
cost = model.predict(new_project)
print(f"Predicted cost: ${cost[0]:,.0f}")

Data Preparation

Prepare Historical Dataset

import pandas as pd
import numpy as np

def prepare_cost_dataset(df):
    """Prepare historical project data for ML"""
    # Select relevant features
    features = [
        'area_m2',
        'floors',
        'building_type',
        'location',
        'year_completed',
        'complexity_score',
        'material_quality',
        'total_cost'
    ]

    df = df[features].copy()

    # Handle missing values
    df = df.dropna(subset=['total_cost'])
    df['complexity_score'] = df['complexity_score'].fillna(df['complexity_score'].median())

    # Encode categorical variables
    df = pd.get_dummies(df, columns=['building_type', 'location'])

    # Calculate derived features
    df['cost_per_m2'] = df['total_cost'] / df['area_m2']
    df['cost_per_floor'] = df['total_cost'] / df['floors']

    # Adjust for inflation (to current year prices)
    current_year = 2024
    inflation_rate = 0.03  # 3% annual
    df['years_ago'] = current_year - df['year_completed']
    df['adjusted_cost'] = df['total_cost'] * (1 + inflation_rate) ** df['years_ago']

    return df

# Usage
df = pd.read_csv("projects_history.csv")
df_prepared = prepare_cost_dataset(df)

Feature Engineering

def engineer_features(df):
    """Create additional features for better predictions"""
    # Interaction features
    df['area_x_floors'] = df['area_m2'] * df['floors']
    df['area_x_complexity'] = df['area_m2'] * df['complexity_score']

    # Polynomial features
    df['area_squared'] = df['area_m2'] ** 2

    # Log transforms (for skewed features)
    df['log_area'] = np.log1p(df['area_m2'])

    # Binned features
    df['size_category'] = pd.cut(
        df['area_m2'],
        bins=[0, 1000, 5000, 10000, float('inf')],
        labels=['small', 'medium', 'large', 'xlarge']
    )

    return df

Machine Learning Models

Linear Regression

from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

def train_linear_model(X_train, y_train):
    """Train Linear Regression model with scaling"""
    pipeline = Pipeline([
        ('scaler', StandardScaler()),
        ('regressor', LinearRegression())
    ])

    pipeline.fit(X_train, y_train)

    # Feature importance (coefficients)
    coefficients = pd.DataFrame({
        'feature': X_train.columns,
        'coefficient': pipeline.named_steps['regressor'].coef_
    }).sort_values('coefficient', key=abs, ascending=False)

    return pipeline, coefficients

# Usage
model, importance = train_linear_model(X_train, y_train)
print("Feature Importance:")
print(importance)

K-Nearest Neighbors (KNN)

from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV

def train_knn_model(X_train, y_train):
    """Train KNN model with optimal k"""
    # Scale features
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X_train)

    # Find optimal k using cross-validation
    param_grid = {'n_neighbors': range(3, 20)}
    knn = KNeighborsRegressor()
    grid_search = GridSearchCV(knn, param_grid, cv=5, scoring='neg_mean_absolute_error')
    grid_search.fit(X_scaled, y_train)

    print(f"Best k: {grid_search.best_params_['n_neighbors']}")
    print(f"Best MAE: ${-grid_search.best_score_:,.0f}")

    return grid_search.best_estimator_, scaler

# Usage
knn_model, scaler = train_knn_model(X_train, y_train)

Random Forest

from sklearn.ensemble import RandomForestRegressor

def train_random_forest(X_train, y_train):
    """Train Random Forest model"""
    rf = RandomForestRegressor(
        n_estimators=100,
        max_depth=10,
        min_samples_split=5,
        random_state=42
    )

    rf.fit(X_train, y_train)

    # Feature importance
    importance = pd.DataFrame({
        'feature': X_train.columns,
        'importance': rf.feature_importances_
    }).sort_values('importance', ascending=False)

    return rf, importance

# Usage
rf_model, importance = train_random_forest(X_train, y_train)
print("Feature Importance:")
print(importance.head(10))

Gradient Boosting

from sklearn.ensemble import GradientBoostingRegressor

def train_gradient_boosting(X_train, y_train):
    """Train Gradient Boosting model"""
    gb = GradientBoostingRegressor(
        n_estimators=200,
        learning_rate=0.1,
        max_depth=5,
        random_state=42
    )

    gb.fit(X_train, y_train)
    return gb

# Usage
gb_model = train_gradient_boosting(X_train, y_train)

Model Evaluation

Comprehensive Evaluation

from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np

def evaluate_model(model, X_test, y_test, model_name="Model"):
    """Comprehensive model evaluation"""
    predictions = model.predict(X_test)

    metrics = {
        'MAE': mean_absolute_error(y_test, predictions),
        'RMSE': np.sqrt(mean_squared_error(y_test, predictions)),
        'R²': r2_score(y_test, predictions),
        'MAPE': np.mean(np.abs((y_test - predictions) / y_test)) * 100
    }

    print(f"\n{model_name} Evaluation:")
    print(f"  MAE:  ${metrics['MAE']:,.0f}")
    print(f"  RMSE: ${metrics['RMSE']:,.0f}")
    print(f"  R²:   {metrics['R²']:.3f}")
    print(f"  MAPE: {metrics['MAPE']:.1f}%")

    return metrics, predictions

# Usage
metrics, predictions = evaluate_model(model, X_test, y_test, "Linear Regression")

Compare Multiple Models

def compare_models(models, X_test, y_test):
    """Compare multiple models"""
    results = []

    for name, model in models.items():
        metrics, _ = evaluate_model(model, X_test, y_test, name)
        metrics['Model'] = name
        results.append(metrics)

    comparison = pd.DataFrame(results)
    comparison = comparison.set_index('Model')

    print("\nModel Comparison:")
    print(comparison.round(2))

    return comparison

# Usage
models = {
    'Linear Regression': linear_model,
    'KNN': knn_model,
    'Random Forest': rf_model,
    'Gradient Boosting': gb_model
}
comparison = compare_models(models, X_test, y_test)

Cross-Validation

from sklearn.model_selection import cross_val_score

def cross_validate_model(model, X, y, cv=5):
    """Perform cross-validation"""
    scores = cross_val_score(model, X, y, cv=cv, scoring='neg_mean_absolute_error')
    mae_scores = -scores

    print(f"Cross-Validation MAE: ${mae_scores.mean():,.0f} (+/- ${mae_scores.std():,.0f})")
    return mae_scores

# Usage
cv_scores = cross_validate_model(rf_model, X, y)

Prediction Pipeline

Complete Prediction Function

import joblib

def create_prediction_pipeline(model, feature_names, scaler=None):
    """Create a reusable prediction pipeline"""

    def predict_cost(project_data):
        """
        Predict cost for new project

        Args:
            project_data: dict with project features

        Returns:
            Predicted cost and confidence interval
        """
        # Create DataFrame from input
        df = pd.DataFrame([project_data])

        # Ensure all required features
        for col in feature_names:
            if col not in df.columns:
                df[col] = 0

        df = df[feature_names]

        # Scale if necessary
        if scaler:
            df = scaler.transform(df)

        # Predict
        prediction = model.predict(df)[0]

        # Confidence interval (simple estimation)
        confidence = 0.15  # 15% margin
        lower = prediction * (1 - confidence)
        upper = prediction * (1 + confidence)

        return {
            'predicted_cost': prediction,
            'lower_bound': lower,
            'upper_bound': upper,
            'confidence_level': f"{(1-confidence)*100:.0f}%"
        }

    return predict_cost

# Usage
predictor = create_prediction_pipeline(rf_model, X.columns.tolist())

# Predict new project
new_project = {
    'area_m2': 5000,
    'floors': 8,
    'complexity_score': 3,
    'material_quality': 2
}

result = predictor(new_project)
print(f"Predicted Cost: ${result['predicted_cost']:,.0f}")
print(f"Range: ${result['lower_bound']:,.0f} - ${result['upper_bound']:,.0f}")

Save and Load Model

import joblib

# Save model
def save_model(model, filepath):
    """Save trained model to file"""
    joblib.dump(model, filepath)
    print(f"Model saved to {filepath}")

# Load model
def load_model(filepath):
    """Load model from file"""
    model = joblib.load(filepath)
    print(f"Model loaded from {filepath}")
    return model

# Usage
save_model(rf_model, "cost_prediction_model.pkl")
loaded_model = load_model("cost_prediction_model.pkl")

Using with ChatGPT

# Prompt for ChatGPT to help with cost prediction

prompt = """
I have historical construction project data with these columns:
- area_m2: Building area in square meters
- floors: Number of floors
- building_type: residential, commercial, industrial
- total_cost: Total project cost in USD

Write Python code using scikit-learn to:
1. Prepare the data for machine learning
2. Train a Random Forest model
3. Evaluate the model
4. Predict cost for a new 3000 m² commercial building with 5 floors
"""

Quick Reference

Task	Code
Split data	`train_test_split(X, y, test_size=0.2)`
Linear Regression	`LinearRegression().fit(X, y)`
KNN	`KNeighborsRegressor(n_neighbors=5)`
Random Forest	`RandomForestRegressor(n_estimators=100)`
Predict	`model.predict(X_new)`
MAE	`mean_absolute_error(y_true, y_pred)`
R² Score	`r2_score(y_true, y_pred)`
Cross-validate	`cross_val_score(model, X, y, cv=5)`
Save model	`joblib.dump(model, 'file.pkl')`

Best Practices

Data Quality: More historical data = better predictions
Feature Selection: Include relevant project characteristics
Inflation Adjustment: Normalize costs to current prices
Regular Retraining: Update model with new completed projects
Ensemble Methods: Combine multiple models for robustness
Confidence Intervals: Always provide prediction ranges

Resources

Book: "Data-Driven Construction" by Artem Boiko, Chapter 4.5
Website: https://datadrivenconstruction.io
scikit-learn: https://scikit-learn.org

Next Steps

See duration-prediction for project duration forecasting
See ml-model-builder for custom ML workflows
See kpi-dashboard for visualization
See big-data-analysis for large dataset processing

Files

3 total

Select a file

Select a file to preview.

Comments

Loading comments…