House Price Prediction — Kaggle
Regression project for the Kaggle “House Prices: Advanced Regression Techniques” competition. Compared Linear Regression, Decision Tree, and Random Forest models, with emphasis on clean preprocessing, sensible feature engineering, and robust evaluation.
Overview
I built a regression pipeline to predict `SalePrice` from the Ames housing dataset. The workflow emphasized transparent preprocessing (missing values, encoding), light feature engineering, and fair model comparison using cross-validation.
Data Cleaning
- Loaded the Kaggle `train.csv`/`test.csv`; set `SalePrice` as the target.
- Handled missing values with semantic rules (e.g., NA = “None” vs. true missing) and simple imputers for numeric features.
- Dropped obvious ID-only fields; trimmed extreme outliers in the `GrLivArea` and `SalePrice` tails.
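The semantic-NA rule above can be sketched as follows. This is a minimal illustration, not the project's actual code: the `na_means_none` column list and the `clean_missing` helper name are assumptions based on the Ames data dictionary, where NA in columns like `PoolQC` means the feature is absent rather than unrecorded.

```python
import numpy as np
import pandas as pd

def clean_missing(df: pd.DataFrame) -> pd.DataFrame:
    """Fill NAs: 'feature absent' categoricals get the literal 'None';
    remaining numeric NAs are treated as true missing and median-imputed."""
    df = df.copy()
    # Example columns where the data dictionary defines NA as "no such feature"
    na_means_none = ["PoolQC", "Fence", "FireplaceQu", "GarageType"]
    for col in na_means_none:
        if col in df:
            df[col] = df[col].fillna("None")
    # Simple median imputation for numeric columns with true missing values
    num_cols = df.select_dtypes(include=np.number).columns
    df[num_cols] = df[num_cols].fillna(df[num_cols].median())
    return df
```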
Exploratory Analysis
- Inspected target skew; applied `log1p(SalePrice)` for more Gaussian-like residuals.
- Visualized key relationships (`OverallQual`, `GrLivArea`, `TotalBsmtSF`).
- Checked a correlation heatmap to guide feature selection and avoid leakage.
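The target transform is a one-liner worth pinning down: train on `log1p(SalePrice)` and invert with `expm1` when reporting. A toy sketch (the sample prices are illustrative):

```python
import numpy as np

# Toy SalePrice values; log1p compresses the right-skewed tail
y = np.array([85_000, 120_000, 150_000, 300_000])
y_log = np.log1p(y)        # target used for modeling
y_back = np.expm1(y_log)   # exact inverse, used for submissions/reporting
```

`expm1(log1p(x)) == x` exactly (up to float precision), so nothing is lost in the round trip.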
Modeling
- Built a ColumnTransformer with: numeric imputation & scaling; categorical imputation & one-hot encoding.
- Compared models: Linear Regression (baseline), Decision Tree, Random Forest.
- Tuned key hyperparameters (e.g., `n_estimators`, `max_depth`) via grid/random search with CV.
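A minimal sketch of the ColumnTransformer pipeline described above. The column lists and hyperparameter values here are placeholders, not the tuned ones from the project:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Illustrative column lists; the real ones cover all Ames features used
numeric = ["GrLivArea", "TotalBsmtSF", "OverallQual"]
categorical = ["Neighborhood", "MSZoning"]

preprocess = ColumnTransformer([
    # Numeric branch: impute then scale
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), numeric),
    # Categorical branch: impute then one-hot encode
    ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      ("onehot", OneHotEncoder(handle_unknown="ignore"))]), categorical),
])

model = Pipeline([
    ("prep", preprocess),
    ("rf", RandomForestRegressor(n_estimators=200, max_depth=12, random_state=0)),
])
```

Keeping imputation and encoding inside the pipeline means every CV fold refits the preprocessors on its own training split, avoiding leakage.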
Results (CV / Hold-out)
- Metric: RMSE on log1p scale (back-transformed for interpretability).
- Random Forest performed best among tested models.
- Kaggle Public Leaderboard (RMSE): 0.16699
Takeaways
- Signal: quality and size features dominate; the log transform stabilizes variance.
- Prep: thoughtful NA handling and one-hot encoding materially improved the baseline.
- Next: try regularized linear models (Ridge/Lasso), Gradient Boosting/LightGBM, and target encoding.
Key Steps
Feature Engineering
- Constructed `TotalSF = TotalBsmtSF + 1stFlrSF + 2ndFlrSF`.
- Encoded ordinal categories (e.g., quality/condition) on an ordered scale where appropriate.
- Created age features (e.g., `HouseAge = YrSold - YearBuilt`).
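The engineered features above can be sketched in a few lines of pandas. The `add_features` helper name is hypothetical; the column names follow the Ames data dictionary, and the quality-code mapping (`Po` through `Ex`) is the standard ordinal scale used by those columns:

```python
import pandas as pd

def add_features(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    # Combined living area across basement and both floors
    df["TotalSF"] = df["TotalBsmtSF"] + df["1stFlrSF"] + df["2ndFlrSF"]
    # Age of the house at time of sale
    df["HouseAge"] = df["YrSold"] - df["YearBuilt"]
    # Ordinal quality codes mapped to an ordered integer scale
    qual_map = {"Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}
    df["ExterQualOrd"] = df["ExterQual"].map(qual_map)
    return df
```

Mapping quality codes to integers (rather than one-hot) preserves their natural ordering, which tree models in particular can exploit with fewer splits.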
Evaluation Protocol
- 5-fold cross-validation on `train` with consistent preprocessing kept inside the pipeline.
- Monitored RMSE/R²; inspected residuals vs. fitted values for bias/heteroscedasticity.
- Generated Kaggle submission with back-transformed predictions.
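The protocol above can be sketched as follows. This uses synthetic data and a plain `LinearRegression` stand-in for the tested models; the scoring string is scikit-learn's built-in negated RMSE:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the training matrix and log1p(SalePrice) target
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y_log = X @ np.array([0.5, -0.2, 0.1, 0.3, 0.0]) + 12.0

# 5-fold CV scoring RMSE on the log scale (sklearn returns it negated)
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LinearRegression(), X, y_log, cv=cv,
                         scoring="neg_root_mean_squared_error")
rmse = -scores.mean()

# Back-transform predictions to the dollar scale for the Kaggle submission
model = LinearRegression().fit(X, y_log)
submission_preds = np.expm1(model.predict(X))
```

In the real pipeline, `X` would be the preprocessed training features, so the CV estimate matches the leaderboard metric (RMSE on the log scale).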