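The California housing price prediction problem comes with labeled training samples, so it is a typical supervised learning task.
Because several features are used for prediction, it can be solved with a multiple regression model.
Here is a Jupyter notebook that predicts house prices with several regression models and compares their performance!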
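[1] Predicting House Prices
◆ Date: 2021.10.17. ◆ Author: 오민지
CONTENTS
1. Introduction
1.1. Loading and splitting the data
1.2. Understanding the data 1 - descriptive statistics
1.3. Understanding the data 2 - visualization
1.4. Creating derived variables and checking correlations
1.5. Filling missing values with the column mean
1.6. One-hot encoding the categorical data
2. Machine learning
2.1. Linear regression
2.2. Decision tree
2.3. Random forest
2.4. XGBoost regression
2.5. k-fold cross-validation
3. Final evaluation
3.1. Performance of the final model
3.2. Predicted vs. actual values - visualization
3.3. Feature importance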
In [2]:
# Download the data
import os
import tarfile
import urllib.request
DOWNLOAD_ROOT = "https://raw.githubusercontent.com/rickiepark/handson-ml2/master/"
HOUSING_PATH = os.path.join("datasets", "housing")
HOUSING_URL = DOWNLOAD_ROOT + "datasets/housing/housing.tgz"
def fetch_housing_data(housing_url=HOUSING_URL, housing_path=HOUSING_PATH):
    # Create the target directory if it does not exist yet
    os.makedirs(housing_path, exist_ok=True)
    tgz_path = os.path.join(housing_path, "housing.tgz")
    urllib.request.urlretrieve(housing_url, tgz_path)
    # Extract housing.csv next to the downloaded archive
    with tarfile.open(tgz_path) as housing_tgz:
        housing_tgz.extractall(path=housing_path)
In [3]:
fetch_housing_data()
In [4]:
# Load the data
import pandas as pd
def load_housing_data(housing_path=HOUSING_PATH):
    csv_path = os.path.join(housing_path, "housing.csv")
    return pd.read_csv(csv_path)
In [5]:
housing = load_housing_data()
In [26]:
housing.head()
Out[26]:
| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY |
| 1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY |
| 2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY |
| 3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | NEAR BAY |
| 4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY |
1.2. Understanding the data 1 - descriptive statistics
In [27]:
housing.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 longitude 20640 non-null float64
1 latitude 20640 non-null float64
2 housing_median_age 20640 non-null float64
3 total_rooms 20640 non-null float64
4 total_bedrooms 20433 non-null float64
5 population 20640 non-null float64
6 households 20640 non-null float64
7 median_income 20640 non-null float64
8 median_house_value 20640 non-null float64
9 ocean_proximity 20640 non-null object
dtypes: float64(9), object(1)
memory usage: 1.6+ MB
In [28]:
housing.isna().sum()
Out[28]:
longitude 0
latitude 0
housing_median_age 0
total_rooms 0
total_bedrooms 207
population 0
households 0
median_income 0
median_house_value 0
ocean_proximity 0
dtype: int64
In [29]:
housing.ocean_proximity.unique()
Out[29]:
array(['NEAR BAY', '<1H OCEAN', 'INLAND', 'NEAR OCEAN', 'ISLAND'],
dtype=object)
In [30]:
housing.describe()
Out[30]:
| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value |
|---|---|---|---|---|---|---|---|---|---|
| count | 20640.000000 | 20640.000000 | 20640.000000 | 20640.000000 | 20433.000000 | 20640.000000 | 20640.000000 | 20640.000000 | 20640.000000 |
| mean | -119.569704 | 35.631861 | 28.639486 | 2635.763081 | 537.870553 | 1425.476744 | 499.539680 | 3.870671 | 206855.816909 |
| std | 2.003532 | 2.135952 | 12.585558 | 2181.615252 | 421.385070 | 1132.462122 | 382.329753 | 1.899822 | 115395.615874 |
| min | -124.350000 | 32.540000 | 1.000000 | 2.000000 | 1.000000 | 3.000000 | 1.000000 | 0.499900 | 14999.000000 |
| 25% | -121.800000 | 33.930000 | 18.000000 | 1447.750000 | 296.000000 | 787.000000 | 280.000000 | 2.563400 | 119600.000000 |
| 50% | -118.490000 | 34.260000 | 29.000000 | 2127.000000 | 435.000000 | 1166.000000 | 409.000000 | 3.534800 | 179700.000000 |
| 75% | -118.010000 | 37.710000 | 37.000000 | 3148.000000 | 647.000000 | 1725.000000 | 605.000000 | 4.743250 | 264725.000000 |
| max | -114.310000 | 41.950000 | 52.000000 | 39320.000000 | 6445.000000 | 35682.000000 | 6082.000000 | 15.000100 | 500001.000000 |
1.3. Understanding the data 2 - visualization
In [11]:
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore') # suppress warnings
plt.rc('font', family='Malgun Gothic')
plt.rcParams['axes.unicode_minus'] = False # keep the minus sign from breaking when a Korean font is used
In [31]:
housing.plot(kind='scatter', x='longitude', y='latitude', figsize=(8,7), alpha=0.1)
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.title('Scatter plot by latitude and longitude (dense areas)')
Out[31]:
Text(0.5, 1.0, 'Scatter plot by latitude and longitude (dense areas)')
In [32]:
housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.4, figsize=(10,7),
c="median_house_value", cmap=plt.get_cmap("jet"), colorbar=True,
sharex=False)
plt.xlabel('Longitude')
plt.ylabel('Latitude')
#plt.clabel('Median house value')
#plt.legend()
#save_fig("housing_prices_scatterplot")
Out[32]:
Text(0, 0.5, 'Latitude')
In [33]:
sns.boxplot(x="ocean_proximity", y="median_house_value", data=housing)
plt.show()
In [34]:
sns.boxplot(x="ocean_proximity", y="total_bedrooms", data=housing)
plt.show()
In [35]:
sns.pairplot(housing, corner=True)
plt.show()
1.4. Creating derived variables and checking correlations
In [36]:
housing['rooms_p_household']=housing.total_rooms / housing.households
housing['bedrooms_p_room']=housing.total_bedrooms / housing.total_rooms
housing['population_p_household']=housing.population / housing.households
In [37]:
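# numeric_only=True skips the text column ocean_proximity (recent pandas versions raise an error otherwise)
corr = housing.corr(numeric_only=True)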
corr['median_house_value'].sort_values(ascending=False)
Out[37]:
median_house_value 1.000000
median_income 0.688075
rooms_p_household 0.151948
total_rooms 0.134153
housing_median_age 0.105623
households 0.065843
total_bedrooms 0.049686
population_p_household -0.023737
population -0.024650
longitude -0.045967
latitude -0.144160
bedrooms_p_room -0.255880
Name: median_house_value, dtype: float64
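The ratio features and the correlation column above can be reproduced by hand. The sketch below is a plain-Python illustration, not the notebook's code: `pearson` is a hypothetical helper, and the five rows are the ones shown by `housing.head()`, so the resulting coefficient says nothing about the full dataset.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# First five rows of the dataset, copied from the housing.head() output
total_rooms = [880.0, 7099.0, 1467.0, 1274.0, 1627.0]
households = [126.0, 1138.0, 177.0, 219.0, 259.0]
median_house_value = [452600.0, 358500.0, 352100.0, 341300.0, 342200.0]

# Derived feature: average number of rooms per household in each district
rooms_p_household = [r / h for r, h in zip(total_rooms, households)]

# Correlation of the derived feature with the target on these five rows only
r = pearson(rooms_p_household, median_house_value)
print(rooms_p_household[0], r)
```

A coefficient near +1 or -1 indicates a strong linear relationship, which is why `median_income` (0.69) stands out in the output above while most other columns sit close to 0.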
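In [41]:
# Run this cell after the imputation cells in section 1.5 (In [38]-[39]), which still use total_bedrooms
housing.drop(['total_rooms', 'households', 'total_bedrooms', 'population_p_household'], axis=1, inplace=True)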
1.5. Filling missing values with the column mean
In [38]:
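# Assign the result back instead of calling fillna(inplace=True) on a selected column
housing['total_bedrooms'] = housing['total_bedrooms'].fillna(housing['total_bedrooms'].mean())
housing['bedrooms_p_room'] = housing['bedrooms_p_room'].fillna(housing['bedrooms_p_room'].mean())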
In [39]:
housing.isna().sum()
Out[39]:
longitude 0
latitude 0
housing_median_age 0
total_rooms 0
total_bedrooms 0
population 0
households 0
median_income 0
median_house_value 0
ocean_proximity 0
rooms_p_household 0
bedrooms_p_room 0
population_p_household 0
dtype: int64
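What mean imputation does can be shown in a few lines of plain Python. This is a minimal sketch under assumptions: `fill_with_mean` is a hypothetical helper, missing values are represented as `None`, and the gap at index 1 is made up for illustration. Like the pandas `mean()` call above, it averages only the observed values.

```python
def fill_with_mean(values):
    """Replace None entries with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

# Three observed values and one hypothetical missing entry
total_bedrooms = [129.0, None, 190.0, 235.0]
filled = fill_with_mean(total_bedrooms)
print(filled)
```

Here the mean is taken over the whole column before any train/test split. Computing it on the training rows only and reusing it for the test rows is a common alternative that keeps test information out of preprocessing.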
1.6. One-hot encoding the categorical data
In [45]:
one_hot = pd.get_dummies(housing['ocean_proximity'])
housing = pd.concat([housing, one_hot], axis=1)
housing.drop(['ocean_proximity'], axis=1, inplace=True)
In [46]:
housing.head()
Out[46]:
| | longitude | latitude | housing_median_age | population | median_income | median_house_value | rooms_p_household | bedrooms_p_room | <1H OCEAN | INLAND | ISLAND | NEAR BAY | NEAR OCEAN |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -122.23 | 37.88 | 41.0 | 322.0 | 8.3252 | 452600.0 | 6.984127 | 0.146591 | 0 | 0 | 0 | 1 | 0 |
| 1 | -122.22 | 37.86 | 21.0 | 2401.0 | 8.3014 | 358500.0 | 6.238137 | 0.155797 | 0 | 0 | 0 | 1 | 0 |
| 2 | -122.24 | 37.85 | 52.0 | 496.0 | 7.2574 | 352100.0 | 8.288136 | 0.129516 | 0 | 0 | 0 | 1 | 0 |
| 3 | -122.25 | 37.85 | 52.0 | 558.0 | 5.6431 | 341300.0 | 5.817352 | 0.184458 | 0 | 0 | 0 | 1 | 0 |
| 4 | -122.25 | 37.85 | 52.0 | 565.0 | 3.8462 | 342200.0 | 6.281853 | 0.172096 | 0 | 0 | 0 | 1 | 0 |
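The encoding produced by `pd.get_dummies` can be sketched without pandas. The `one_hot` function below is a hypothetical stand-in written for illustration: it sorts the distinct categories, which gives the same column order as the table above, and emits one 0/1 indicator per category.

```python
def one_hot(values):
    """Return the sorted category list and one 0/1 indicator row per value."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows

# The five ocean_proximity categories listed earlier, plus a repeated one
values = ['NEAR BAY', '<1H OCEAN', 'INLAND', 'NEAR OCEAN', 'ISLAND', 'NEAR BAY']
categories, rows = one_hot(values)
print(categories)
print(rows[0])  # indicator row for 'NEAR BAY'
```

Each row contains exactly one 1, so the five indicator columns always sum to 1. For linear regression this makes one column redundant; dropping one category is a common option, though the notebook keeps all five.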
2. Machine learning
In [60]:
X, y = housing.drop(['median_house_value'], axis=1), housing['median_house_value']
In [134]:
# Split into train and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
# Min-max scaling (normalization)
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_normal = scaler.fit_transform(X_train)
# Carve a validation set out of the training set
x_train, x_valid, y_train, y_valid = train_test_split(X_normal, y_train, test_size=0.2, random_state=1)
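The difference between `fit_transform` on the training data and a later `transform` on held-out data is easier to see in a plain-Python sketch. `fit_min_max` and `transform` are hypothetical helpers and the numbers are made up; returning 0.0 for a constant column is an assumption of this sketch.

```python
def fit_min_max(rows):
    """Learn the per-column minimum and maximum from the training rows."""
    columns = list(zip(*rows))
    return [min(c) for c in columns], [max(c) for c in columns]

def transform(rows, mins, maxs):
    """Scale each column with the stored minimum and maximum."""
    scaled = []
    for row in rows:
        scaled.append([
            (v - lo) / (hi - lo) if hi > lo else 0.0  # constant column: assumed 0.0
            for v, lo, hi in zip(row, mins, maxs)
        ])
    return scaled

train = [[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]]
held_out = [[20.0, 10.0]]

mins, maxs = fit_min_max(train)                      # statistics come from train only
train_scaled = transform(train, mins, maxs)          # every value lands in [0, 1]
held_out_scaled = transform(held_out, mins, maxs)    # may fall outside [0, 1]
print(train_scaled, held_out_scaled)
```

Because the held-out rows reuse the training minimum and maximum, a value larger than anything seen in training is scaled above 1. That is expected and is the reason the test set is only ever passed to `transform`.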
2.1. Linear regression
In [135]:
from sklearn.linear_model import LinearRegression
linear = LinearRegression()
linear.fit(x_train, y_train)
Out[135]:
LinearRegression()
In [136]:
y_pred_li = linear.predict(x_train)
In [137]:
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import r2_score
lin_R2 = r2_score(y_train, y_pred_li)
lin_mae = mean_absolute_error(y_train, y_pred_li)
print(lin_R2)
print(lin_mae)
0.624211813272781
51516.31446363971
In [138]:
y_pred_li = linear.predict(x_valid)
In [139]:
lin_R2 = r2_score(y_valid, y_pred_li)
lin_mae = mean_absolute_error(y_valid, y_pred_li)
print(lin_R2)
print(lin_mae)
0.6276659644943523
50978.07769321241
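The two numbers printed for each model are R² and the mean absolute error. The sketch below writes both out in plain Python to show what they measure; `r2` and `mae` are hypothetical helper names and the values are toy data, not housing results.

```python
def mae(y_true, y_pred):
    """Mean absolute error: average size of the prediction errors."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

y_true = [1.0, 2.0, 3.0, 4.0]
perfect = [1.0, 2.0, 3.0, 4.0]       # predicts every value exactly
always_mean = [2.5, 2.5, 2.5, 2.5]   # always predicts the mean of y_true

print(r2(y_true, perfect), mae(y_true, perfect))
print(r2(y_true, always_mean), mae(y_true, always_mean))
```

A model that always predicts the mean scores an R² of 0, so the linear model's R² of about 0.63 means it explains roughly 63% of the variance in house values, while its MAE of about 51,000 is in the same units as `median_house_value`.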
2.2. Decision tree
In [140]:
from sklearn.tree import DecisionTreeRegressor
tree = DecisionTreeRegressor()
tree.fit(x_train, y_train)
Out[140]:
DecisionTreeRegressor()
In [141]:
y_pred_tr = tree.predict(x_train)
tree_R2 = r2_score(y_train, y_pred_tr)
tree_mae = mean_absolute_error(y_train, y_pred_tr)
print(tree_R2)
print(tree_mae)
1.0
0.0
In [142]:
y_pred_tr = tree.predict(x_valid)
In [143]:
tree_R2 = r2_score(y_valid, y_pred_tr)
tree_mae = mean_absolute_error(y_valid, y_pred_tr)
print(tree_R2)
print(tree_mae)
0.6040317121249402
45843.157129881925
2.3. Random forest
In [144]:
from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor()
rf.fit(x_train, y_train)
Out[144]:
RandomForestRegressor()
In [145]:
# Predict with rf, not tree. The output recorded for these cells came from tree.predict
# and only repeated the decision-tree scores, so it is omitted; rerun to get the random-forest scores.
y_pred_rf = rf.predict(x_train)
rf_R2 = r2_score(y_train, y_pred_rf)
rf_mae = mean_absolute_error(y_train, y_pred_rf)
print(rf_R2)
print(rf_mae)
In [146]:
y_pred_rf = rf.predict(x_valid)
In [147]:
rf_R2 = r2_score(y_valid, y_pred_rf)
rf_mae = mean_absolute_error(y_valid, y_pred_rf)
print(rf_R2)
print(rf_mae)
2.4. XGBoost regression
In [149]:
import xgboost
xgb_model = xgboost.XGBRegressor()
xgb_model.fit(x_train, y_train)
Out[149]:
XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
importance_type='gain', interaction_constraints='',
learning_rate=0.300000012, max_delta_step=0, max_depth=6,
min_child_weight=1, missing=nan, monotone_constraints='()',
n_estimators=100, n_jobs=8, num_parallel_tree=1, random_state=0,
reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
tree_method='exact', validate_parameters=1, verbosity=None)
In [150]:
y_pred_xgb = xgb_model.predict(x_train)
xgb_R2 = r2_score(y_train, y_pred_xgb)
xgb_mae = mean_absolute_error(y_train, y_pred_xgb)
print(xgb_R2)
print(xgb_mae)
0.9390500039169059
20266.895483932592
In [151]:
y_pred_xgb = xgb_model.predict(x_valid)
In [152]:
xgb_R2 = r2_score(y_valid, y_pred_xgb)
xgb_mae = mean_absolute_error(y_valid, y_pred_xgb)
print(xgb_R2)
print(xgb_mae)
0.8195479642727931
32955.129073001815
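To compare the models side by side, the validation scores printed above for linear regression, the decision tree, and XGBoost can be collected in one place. This is only a bookkeeping sketch: the numbers are copied from the outputs, and the dictionary name is hypothetical.

```python
# Validation scores copied from the printed outputs above: (R^2, MAE)
valid_scores = {
    'linear regression': (0.6276659644943523, 50978.07769321241),
    'decision tree': (0.6040317121249402, 45843.157129881925),
    'XGBoost': (0.8195479642727931, 32955.129073001815),
}

# Higher R^2 is better, lower MAE is better
best_by_r2 = max(valid_scores, key=lambda name: valid_scores[name][0])
best_by_mae = min(valid_scores, key=lambda name: valid_scores[name][1])
print(best_by_r2, best_by_mae)
```

XGBoost leads on both metrics. The decision tree has a lower MAE than linear regression but also a lower R², which suggests it makes smaller typical errors but a few large ones.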
2.5. k-fold cross-validation
In [104]:
def display_scores(model, scores):
    print('<<', model, 'model evaluation results >>')
    print('Mean RMSE:', scores.mean())
    print('Standard deviation:', scores.std())
In [113]:
import numpy as np
from sklearn.model_selection import cross_val_score
tree_scores = cross_val_score(tree, x_valid, y_valid, scoring='neg_mean_squared_error', cv=10)
tree_rmse_scores = np.sqrt(-tree_scores)
lin_scores = cross_val_score(linear, x_valid, y_valid, scoring='neg_mean_squared_error', cv=10)
lin_rmse_scores = np.sqrt(-lin_scores)
rf_scores = cross_val_score(rf, x_valid, y_valid, scoring='neg_mean_squared_error', cv=10)
rf_rmse_scores = np.sqrt(-rf_scores)
xgb_scores = cross_val_score(xgb_model, x_valid, y_valid, scoring='neg_mean_squared_error', cv=10)
xgb_rmse_scores = np.sqrt(-xgb_scores)
In [114]:
display_scores('Linear regression', lin_rmse_scores)
print('\n')
display_scores('Decision tree', tree_rmse_scores)
print('\n')
display_scores('Random forest', rf_rmse_scores)
print('\n')
display_scores('XGBoost', xgb_rmse_scores)
<< Linear regression model evaluation results >>
Mean RMSE: 70134.82258901293
Standard deviation: 3135.2154009929136
<< Decision tree model evaluation results >>
Mean RMSE: 77048.8584526432
Standard deviation: 3991.6789207226984
<< Random forest model evaluation results >>
Mean RMSE: 57317.818981409866
Standard deviation: 1593.355992301959
<< XGBoost model evaluation results >>
Mean RMSE: 54391.60931727962
Standard deviation: 2778.9638841159435
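Two things happen inside the cross-validation cell: the data is cut into k folds, and the negated MSE scores are turned back into RMSE. The sketch below illustrates both in plain Python. `k_fold_indices` and `train_valid_splits` are hypothetical helpers that assume contiguous, unshuffled folds, and the score list is made up.

```python
import math

def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous folds; the first n_samples % k folds get one extra item."""
    base, extra = divmod(n_samples, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def train_valid_splits(n_samples, k):
    """Yield (train indices, validation indices), holding out each fold once."""
    folds = k_fold_indices(n_samples, k)
    for i, valid_idx in enumerate(folds):
        train_idx = [j for f_i, fold in enumerate(folds) if f_i != i for j in fold]
        yield train_idx, valid_idx

fold_sizes = [len(f) for f in k_fold_indices(10, 3)]
splits = list(train_valid_splits(10, 3))

# scikit-learn scorers follow "greater is better", so MSE comes back negated
neg_mse_scores = [-4.0, -9.0, -16.0]  # hypothetical per-fold scores
rmse_scores = [math.sqrt(-s) for s in neg_mse_scores]
print(fold_sizes, rmse_scores)
```

Each sample is used for validation exactly once, so the mean RMSE averages over k different held-out sets and the standard deviation shows how much the score depends on which fold was held out.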
3. Final evaluation
3.1. Performance of the final model
In [153]:
x_test_nor = scaler.transform(X_test)
In [155]:
final_pred = xgb_model.predict(x_test_nor)
In [162]:
from sklearn.metrics import mean_squared_error
final_mse = mean_squared_error(y_test, final_pred)
final_rmse = np.sqrt(final_mse)
final_r2 = r2_score(y_test, final_pred)
In [166]:
print('RMSE: ',final_rmse)
print('R2: ',final_r2)
RMSE: 48553.95504760515
R2: 0.8202711742377262
3.2. Predicted vs. actual values - visualization
In [175]:
pred = pd.DataFrame(final_pred, columns=['prediction'])
actual = pd.DataFrame(y_test)
actual.reset_index(inplace=True, drop=True)
table = pd.concat([pred, actual], axis=1)
In [178]:
table
Out[178]:
| | prediction | median_house_value |
|---|---|---|
| 0 | 344882.437500 | 355000.0 |
| 1 | 62316.609375 | 70700.0 |
| 2 | 225855.718750 | 229400.0 |
| 3 | 146715.187500 | 112500.0 |
| 4 | 247561.625000 | 225400.0 |
| ... | ... | ... |
| 4123 | 73654.890625 | 68200.0 |
| 4124 | 346308.218750 | 225000.0 |
| 4125 | 283500.062500 | 350000.0 |
| 4126 | 246082.078125 | 227300.0 |
| 4127 | 145000.578125 | 141700.0 |
4128 rows × 2 columns
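The table can also be read as a list of per-row errors. As a small illustration, the sketch below takes the first five rows shown above and computes the error, MAE, and RMSE on just those rows; the variable names are hypothetical and five rows are far too few to stand in for the full test set.

```python
import math

# First five rows of the table above
predicted = [344882.4375, 62316.609375, 225855.71875, 146715.1875, 247561.625]
actual = [355000.0, 70700.0, 229400.0, 112500.0, 225400.0]

# Positive error: the model overestimates; negative error: it underestimates
errors = [p - a for p, a in zip(predicted, actual)]

mae = sum(abs(e) for e in errors) / len(errors)
rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
print(errors)
print(mae, rmse)
```

On these rows the MAE is about 15,684, while the fourth row alone is overestimated by about 34,215. RMSE squares the errors before averaging, so a single large miss like that one pulls it above the MAE.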
In [189]:
table.iloc[2000:2100,:].plot(figsize=(30,7))
Out[189]:
<AxesSubplot:>
3.3. Feature importance
In [193]:
from xgboost import plot_importance
plot_importance(xgb_model)
Out[193]:
<AxesSubplot:title={'center':'Feature importance'}, xlabel='F score', ylabel='Features'>
- f0 : longitude
- f1 : latitude
- f4 : median_income
- f6 : bedrooms_p_room
- f3 : population
- f5 : rooms_p_household
- f2 : housing_median_age
Interpreting the feature importance
- Location (longitude, latitude) has a major influence on house prices
- Median income has a major influence on house prices
In [199]:
X_test.columns
Out[199]:
Index(['longitude', 'latitude', 'housing_median_age', 'population',
'median_income', 'rooms_p_household', 'bedrooms_p_room', '<1H OCEAN',
'INLAND', 'ISLAND', 'NEAR BAY', 'NEAR OCEAN'],
dtype='object')
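The model was trained on the NumPy array returned by the scaler, so XGBoost labels the features `f0`, `f1`, and so on by position. A short sketch of the lookup from those labels back to the column names listed above (the `feature_names` dictionary is a hypothetical name):

```python
# Column order copied from X_test.columns above
columns = ['longitude', 'latitude', 'housing_median_age', 'population',
           'median_income', 'rooms_p_household', 'bedrooms_p_room', '<1H OCEAN',
           'INLAND', 'ISLAND', 'NEAR BAY', 'NEAR OCEAN']

# f0, f1, ... follow the column positions of the training matrix
feature_names = {f'f{i}': name for i, name in enumerate(columns)}
print(feature_names['f4'], feature_names['f6'])
```

The labels `f7` to `f11` are the five one-hot `ocean_proximity` columns.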