House Price Prediction — Kaggle

Regression project for the Kaggle “House Prices: Advanced Regression Techniques” competition. Compared Linear Regression, Decision Tree, and Random Forest models, with emphasis on clean preprocessing, sensible feature engineering, and robust evaluation.

Role: Data Cleaning · Feature Engineering · Modeling
Stack: Python (pandas, scikit-learn, numpy, matplotlib)
Data: train.csv / test.csv (Kaggle)
Task: Regression (SalePrice)

Overview

I built a regression pipeline to predict SalePrice from the Ames housing dataset. The workflow emphasized transparent preprocessing (missing values, encoding), light feature engineering, and fair model comparison using cross-validation.

Data Cleaning

  • Loaded the Kaggle train.csv/test.csv files; set SalePrice as the target.
  • Handled missing values with semantic rules (e.g., NA meaning “None” vs. truly missing) plus simple imputers for numeric features; see the sketch after this list.
  • Dropped obvious ID-only fields; trimmed extreme outliers in the GrLivArea and SalePrice tails.
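
A minimal sketch of the NA handling, assuming an illustrative column grouping (`NONE_MEANS_ABSENT` is a hypothetical name, and the 4500 sqft outlier cutoff is illustrative; the project's actual rules may differ):

```python
import pandas as pd

# Columns where NA encodes "feature absent" rather than true missingness
# (illustrative subset; the project's grouping may cover more columns).
NONE_MEANS_ABSENT = ["PoolQC", "Alley", "Fence", "FireplaceQu", "GarageType"]

def clean(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    # Semantic rule: NA here means the house simply lacks the feature.
    df[NONE_MEANS_ABSENT] = df[NONE_MEANS_ABSENT].fillna("None")
    # True missingness in numeric columns gets simple median imputation.
    num_cols = df.select_dtypes("number").columns
    df[num_cols] = df[num_cols].fillna(df[num_cols].median())
    return df

train = clean(pd.read_csv("train.csv")).drop(columns=["Id"])
# Trim the extreme GrLivArea outliers (threshold is illustrative).
train = train[train["GrLivArea"] < 4500]
```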

Exploratory Analysis

  • Inspected target skew; applied log1p(SalePrice) for more Gaussian-like residuals (sketched below).
  • Visualized key relationships (OverallQual, GrLivArea, TotalBsmtSF).
  • Checked a correlation heatmap to guide feature selection and avoid leakage.
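
A quick sketch of the skew check and target transform (it uses scipy for the skew statistic, which is an addition beyond the listed stack):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import skew

train = pd.read_csv("train.csv")
print("raw skew:", skew(train["SalePrice"]))    # strongly right-skewed
y = np.log1p(train["SalePrice"])                # compresses the heavy right tail
print("log1p skew:", skew(y))                   # much closer to symmetric

# Scatter the key drivers against the transformed target.
fig, axes = plt.subplots(1, 3, figsize=(12, 3))
for ax, col in zip(axes, ["OverallQual", "GrLivArea", "TotalBsmtSF"]):
    ax.scatter(train[col], y, s=4, alpha=0.4)
    ax.set(xlabel=col, ylabel="log1p(SalePrice)")
plt.tight_layout()
plt.show()
```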

Modeling

  • Built a ColumnTransformer combining numeric imputation and scaling with categorical imputation and one-hot encoding; see the sketch after this list.
  • Compared models: Linear Regression (baseline), Decision Tree, Random Forest.
  • Tuned key hyperparameters (e.g., n_estimators, max_depth) via grid/random search with cross-validation.
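
A condensed sketch of the preprocessing-plus-model pipeline and search; the grid shown is illustrative, not the exact search space used, and the same pipeline pattern applies to the Linear Regression and Decision Tree baselines:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

train = pd.read_csv("train.csv")
y = np.log1p(train.pop("SalePrice"))
X = train.drop(columns=["Id"])

preprocess = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]),
     X.select_dtypes("number").columns),
    ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      ("onehot", OneHotEncoder(handle_unknown="ignore"))]),
     X.select_dtypes("object").columns),
])

# Random Forest shown; swap the final step for the other candidates.
rf = Pipeline([("prep", preprocess),
               ("model", RandomForestRegressor(random_state=42))])
grid = GridSearchCV(rf,
                    {"model__n_estimators": [200, 500],
                     "model__max_depth": [None, 12]},
                    scoring="neg_root_mean_squared_error", cv=5)
grid.fit(X, y)
print(grid.best_params_, "CV RMSE:", -grid.best_score_)
```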

Results (CV / Hold-out)

  • Metric: RMSE on the log1p scale, back-transformed for interpretability (see the snippet below).
  • Random Forest performed best among tested models.
  • Kaggle Public Leaderboard (RMSE): 0.16699
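
For reference, how the metric is computed on the log1p scale and how predictions are back-transformed with expm1 (the arrays below are placeholder values, not the project's predictions):

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Placeholder values; not the project's actual predictions.
y_true = np.array([208500.0, 181500.0, 223500.0])   # dollar prices
y_pred_log = np.array([12.21, 12.13, 12.35])        # model output on log1p scale

# Competition-style metric: RMSE on the log1p scale.
rmse_log = np.sqrt(mean_squared_error(np.log1p(y_true), y_pred_log))
print(f"RMSE (log1p scale): {rmse_log:.5f}")

# expm1 inverts log1p, giving interpretable dollar predictions.
print("dollar predictions:", np.expm1(y_pred_log))
```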

Takeaways

  • Signal: Quality and size features dominate; the log transform stabilizes variance.
  • Prep: Thoughtful NA handling and one-hot encoding materially improved the baseline.
  • Next: Try regularized linear models (Ridge/Lasso), Gradient Boosting/LightGBM, and target encoding.

Feature Engineering

  • Constructed TotalSF = TotalBsmtSF + 1stFlrSF + 2ndFlrSF.
  • Encoded ordinal categories (e.g., quality/condition) in an ordered fashion where appropriate.
  • Created age features (e.g., HouseAge = YrSold - YearBuilt); see the sketch after this list.
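
A sketch of these features, assuming the standard Ames Po/Fa/TA/Gd/Ex quality scale; the ordinal column subset shown is illustrative:

```python
import pandas as pd

# Standard Ames quality scale; ORDINAL_COLS is an illustrative subset.
QUAL_ORDER = {"Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}
ORDINAL_COLS = ["ExterQual", "ExterCond", "KitchenQual", "HeatingQC"]

def engineer(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    # Total above- and below-grade living area.
    df["TotalSF"] = df["TotalBsmtSF"] + df["1stFlrSF"] + df["2ndFlrSF"]
    # Age of the house at the time of sale.
    df["HouseAge"] = df["YrSold"] - df["YearBuilt"]
    # Map quality/condition ratings onto an ordered integer scale.
    for col in ORDINAL_COLS:
        df[col] = df[col].map(QUAL_ORDER)
    return df

train = engineer(pd.read_csv("train.csv"))
```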

Evaluation Protocol

  • 5-fold cross-validation on train, with preprocessing kept inside the pipeline so every fold is fit consistently.
  • Monitored RMSE and R²; inspected residuals vs. fitted values for bias and heteroscedasticity.
  • Generated the Kaggle submission from back-transformed (expm1) predictions; see the sketch after this list.
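
An end-to-end sketch of the protocol, from CV scoring to the submission file (the forest's hyperparameters are placeholders, not the tuned values):

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

train = pd.read_csv("train.csv")
y = np.log1p(train.pop("SalePrice"))
X = train.drop(columns=["Id"])

# Preprocessing lives inside the pipeline, so each CV fold re-fits it (no leakage).
model = make_pipeline(
    make_column_transformer(
        (make_pipeline(SimpleImputer(strategy="median"), StandardScaler()),
         make_column_selector(dtype_include="number")),
        (make_pipeline(SimpleImputer(strategy="most_frequent"),
                       OneHotEncoder(handle_unknown="ignore")),
         make_column_selector(dtype_include="object")),
    ),
    RandomForestRegressor(n_estimators=300, random_state=42),
)

scores = cross_val_score(model, X, y, cv=5, scoring="neg_root_mean_squared_error")
print(f"5-fold CV RMSE (log scale): {-scores.mean():.4f} ± {scores.std():.4f}")

# Fit on all of train, predict test, back-transform, and write the submission.
model.fit(X, y)
test = pd.read_csv("test.csv")
preds = np.expm1(model.predict(test.drop(columns=["Id"])))
pd.DataFrame({"Id": test["Id"], "SalePrice": preds}).to_csv("submission.csv", index=False)
```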