House Price Prediction — Kaggle
Regression project for the Kaggle “House Prices: Advanced Regression Techniques” competition. Compared Linear Regression, Decision Tree, and Random Forest models, with emphasis on clean preprocessing, sensible feature engineering, and robust evaluation.
Overview
I built a regression pipeline to predict `SalePrice` from the Ames housing dataset. The workflow emphasized transparent preprocessing (missing values, encoding), light feature engineering, and fair model comparison using cross-validation.
Data Cleaning
- Loaded the Kaggle `train.csv`/`test.csv`; set `SalePrice` as the target.
- Handled missing values with semantic rules (e.g., NA = “None” vs. true missing) and simple imputers for numeric features.
- Dropped obvious ID-only fields; trimmed extreme outliers in the `GrLivArea` and `SalePrice` tails.
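The semantic-NA rule above can be sketched as follows. This is a minimal illustration, not the project's actual code: the `na_means_none` column list and the `clean_missing` helper name are assumptions based on the Ames data dictionary, where NA in columns like `PoolQC` means the feature is absent rather than unrecorded.

```python
import numpy as np
import pandas as pd

def clean_missing(df: pd.DataFrame) -> pd.DataFrame:
    """Fill NAs: 'feature absent' categoricals get the literal 'None';
    remaining numeric NAs are treated as true missing and median-imputed."""
    df = df.copy()
    # Example columns where the data dictionary defines NA as "no such feature"
    na_means_none = ["PoolQC", "Fence", "FireplaceQu", "GarageType"]
    for col in na_means_none:
        if col in df:
            df[col] = df[col].fillna("None")
    # Simple median imputation for numeric columns with true missing values
    num_cols = df.select_dtypes(include=np.number).columns
    df[num_cols] = df[num_cols].fillna(df[num_cols].median())
    return df
```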
Exploratory Analysis
- Inspected target skew; applied `log1p(SalePrice)` for more Gaussian-like residuals.
- Visualized key relationships (`OverallQual`, `GrLivArea`, `TotalBsmtSF`).
- Checked a correlation heatmap to guide feature selection and avoid leakage.
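The target transform is a one-liner worth pinning down: train on `log1p(SalePrice)` and invert with `expm1` when reporting. A toy sketch (the sample prices are illustrative):

```python
import numpy as np

# Toy SalePrice values; log1p compresses the right-skewed tail
y = np.array([85_000, 120_000, 150_000, 300_000])
y_log = np.log1p(y)        # target used for modeling
y_back = np.expm1(y_log)   # exact inverse, used for submissions/reporting
```

`expm1(log1p(x)) == x` exactly (up to float precision), so nothing is lost in the round trip.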
Modeling
- Built a ColumnTransformer with: numeric imputation & scaling; categorical imputation & one-hot encoding.
- Compared models: Linear Regression (baseline), Decision Tree, Random Forest.
- Tuned key hyperparameters (e.g., `n_estimators`, `max_depth`) via grid/random search with CV.
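A minimal sketch of the ColumnTransformer pipeline described above. The column lists and hyperparameter values here are placeholders, not the tuned ones from the project:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Illustrative column lists; the real ones cover all Ames features used
numeric = ["GrLivArea", "TotalBsmtSF", "OverallQual"]
categorical = ["Neighborhood", "MSZoning"]

preprocess = ColumnTransformer([
    # Numeric branch: impute then scale
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), numeric),
    # Categorical branch: impute then one-hot encode
    ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      ("onehot", OneHotEncoder(handle_unknown="ignore"))]), categorical),
])

model = Pipeline([
    ("prep", preprocess),
    ("rf", RandomForestRegressor(n_estimators=200, max_depth=12, random_state=0)),
])
```

Keeping imputation and encoding inside the pipeline means every CV fold refits the preprocessors on its own training split, avoiding leakage.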
Results (CV / Hold-out)
- Metric: RMSE on log1p scale (back-transformed for interpretability).
- Random Forest performed best among tested models.
- Kaggle Public Leaderboard (RMSE): 0.16699
Takeaways
- Signal: quality and size features dominate; the log transform stabilizes variance.
- Prep: thoughtful NA handling and one-hot encoding materially improved the baseline.
- Next: try regularized linear models (Ridge/Lasso), Gradient Boosting/LightGBM, and target encoding.
Key Steps
Feature Engineering
- Constructed `TotalSF = TotalBsmtSF + 1stFlrSF + 2ndFlrSF`.
- Encoded ordinal categories (e.g., quality/condition) on an ordered scale where appropriate.
- Created age features (e.g., `HouseAge = YrSold - YearBuilt`).
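The engineered features above can be sketched in a few lines of pandas. The `add_features` helper name is hypothetical; the column names follow the Ames data dictionary, and the quality-code mapping (`Po` through `Ex`) is the standard ordinal scale used by those columns:

```python
import pandas as pd

def add_features(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    # Combined living area across basement and both floors
    df["TotalSF"] = df["TotalBsmtSF"] + df["1stFlrSF"] + df["2ndFlrSF"]
    # Age of the house at time of sale
    df["HouseAge"] = df["YrSold"] - df["YearBuilt"]
    # Ordinal quality codes mapped to an ordered integer scale
    qual_map = {"Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}
    df["ExterQualOrd"] = df["ExterQual"].map(qual_map)
    return df
```

Mapping quality codes to integers (rather than one-hot) preserves their natural ordering, which tree models in particular can exploit with fewer splits.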
Evaluation Protocol
- 5-fold cross-validation on `train` with consistent preprocessing kept inside the pipeline.
- Monitored RMSE/R²; inspected residuals vs. fitted values for bias/heteroscedasticity.
- Generated Kaggle submission with back-transformed predictions.
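The protocol above can be sketched as follows. This uses synthetic data and a plain `LinearRegression` stand-in for the tested models; the scoring string is scikit-learn's built-in negated RMSE:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the training matrix and log1p(SalePrice) target
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y_log = X @ np.array([0.5, -0.2, 0.1, 0.3, 0.0]) + 12.0

# 5-fold CV scoring RMSE on the log scale (sklearn returns it negated)
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LinearRegression(), X, y_log, cv=cv,
                         scoring="neg_root_mean_squared_error")
rmse = -scores.mean()

# Back-transform predictions to the dollar scale for the Kaggle submission
model = LinearRegression().fit(X, y_log)
submission_preds = np.expm1(model.predict(X))
```

In the real pipeline, `X` would be the preprocessed training features, so the CV estimate matches the leaderboard metric (RMSE on the log scale).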