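Predicting California housing prices is a typical supervised learning task because labeled training samples are available.

Because several features are used for the prediction, the problem can be solved with a multiple regression model.

This post shares a Jupyter notebook that predicts housing prices with several regression models and compares their performance.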

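[1] Predicting Housing Prices

◆ Date: 2021.10.17. ◆ Author: 오민지

CONTENTS

1. Introduction

1.1. Loading and splitting the data
1.2. Understanding the data, part 1 - descriptive statistics
1.3. Understanding the data, part 2 - visualization
1.4. Creating derived variables and checking correlations
1.5. Filling missing values with the column mean
1.6. One-hot encoding the categorical data

2. Machine learning

2.1. Linear regression model
2.2. Decision tree
2.3. Random forest
2.4. XGBoost regression
2.5. k-fold cross-validation

3. Final evaluation

3.1. Performance of the final model
3.2. Predicted vs. actual values - visualization
3.3. Feature importance

1. Introduction

1.1. Loading and splitting the data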

In [2]:

# Download the data
import os
import tarfile
import urllib.request

DOWNLOAD_ROOT = "https://raw.githubusercontent.com/rickiepark/handson-ml2/master/"
HOUSING_PATH = os.path.join("datasets", "housing")
HOUSING_URL = DOWNLOAD_ROOT + "datasets/housing/housing.tgz"

def fetch_housing_data(housing_url=HOUSING_URL, housing_path=HOUSING_PATH):
    # Create the target directory if needed, then download and unpack the archive
    os.makedirs(housing_path, exist_ok=True)
    tgz_path = os.path.join(housing_path, "housing.tgz")
    urllib.request.urlretrieve(housing_url, tgz_path)
    with tarfile.open(tgz_path) as housing_tgz:
        housing_tgz.extractall(path=housing_path)

In [3]:

fetch_housing_data()

In [4]:

# Load the data
import pandas as pd

def load_housing_data(housing_path=HOUSING_PATH):
    csv_path = os.path.join(housing_path, "housing.csv")
    return pd.read_csv(csv_path)

In [5]:

housing = load_housing_data()

In [26]:

housing.head()

Out[26]:

	longitude	latitude	housing_median_age	total_rooms	total_bedrooms	population	households	median_income	median_house_value	ocean_proximity
0	-122.23	37.88	41.0	880.0	129.0	322.0	126.0	8.3252	452600.0	NEAR BAY
1	-122.22	37.86	21.0	7099.0	1106.0	2401.0	1138.0	8.3014	358500.0	NEAR BAY
2	-122.24	37.85	52.0	1467.0	190.0	496.0	177.0	7.2574	352100.0	NEAR BAY
3	-122.25	37.85	52.0	1274.0	235.0	558.0	219.0	5.6431	341300.0	NEAR BAY
4	-122.25	37.85	52.0	1627.0	280.0	565.0	259.0	3.8462	342200.0	NEAR BAY

1.2. Understanding the data, part 1 - descriptive statistics

In [27]:

housing.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   longitude           20640 non-null  float64
 1   latitude            20640 non-null  float64
 2   housing_median_age  20640 non-null  float64
 3   total_rooms         20640 non-null  float64
 4   total_bedrooms      20433 non-null  float64
 5   population          20640 non-null  float64
 6   households          20640 non-null  float64
 7   median_income       20640 non-null  float64
 8   median_house_value  20640 non-null  float64
 9   ocean_proximity     20640 non-null  object 
dtypes: float64(9), object(1)
memory usage: 1.6+ MB

In [28]:

housing.isna().sum()

Out[28]:

longitude               0
latitude                0
housing_median_age      0
total_rooms             0
total_bedrooms        207
population              0
households              0
median_income           0
median_house_value      0
ocean_proximity         0
dtype: int64

In [29]:

housing.ocean_proximity.unique()

Out[29]:

array(['NEAR BAY', '<1H OCEAN', 'INLAND', 'NEAR OCEAN', 'ISLAND'],
      dtype=object)

In [30]:

housing.describe()

Out[30]:

	longitude	latitude	housing_median_age	total_rooms	total_bedrooms	population	households	median_income	median_house_value
count	20640.000000	20640.000000	20640.000000	20640.000000	20433.000000	20640.000000	20640.000000	20640.000000	20640.000000
mean	-119.569704	35.631861	28.639486	2635.763081	537.870553	1425.476744	499.539680	3.870671	206855.816909
std	2.003532	2.135952	12.585558	2181.615252	421.385070	1132.462122	382.329753	1.899822	115395.615874
min	-124.350000	32.540000	1.000000	2.000000	1.000000	3.000000	1.000000	0.499900	14999.000000
25%	-121.800000	33.930000	18.000000	1447.750000	296.000000	787.000000	280.000000	2.563400	119600.000000
50%	-118.490000	34.260000	29.000000	2127.000000	435.000000	1166.000000	409.000000	3.534800	179700.000000
75%	-118.010000	37.710000	37.000000	3148.000000	647.000000	1725.000000	605.000000	4.743250	264725.000000
max	-114.310000	41.950000	52.000000	39320.000000	6445.000000	35682.000000	6082.000000	15.000100	500001.000000

1.3. Understanding the data, part 2 - visualization

In [11]:

import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')          # suppress warnings
plt.rc('font', family='Malgun Gothic')     # Korean-capable font that ships with Windows
plt.rcParams['axes.unicode_minus'] = False # keep the minus sign from breaking with the Korean font

In [31]:

housing.plot(kind='scatter', x='longitude', y='latitude', figsize=(8,7), alpha=0.1)
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.title('Scatter plot by latitude and longitude (dense areas)')

Out[31]:

Text(0.5, 1.0, 'Scatter plot by latitude and longitude (dense areas)')

In [32]:

housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.4, figsize=(10,7),
             c="median_house_value", cmap=plt.get_cmap("jet"), colorbar=True,
             sharex=False)
plt.xlabel('Longitude')
plt.ylabel('Latitude')
#plt.clabel('Median house value')
#plt.legend()
#save_fig("housing_prices_scatterplot")

Out[32]:

Text(0, 0.5, 'Latitude')

In [33]:

sns.boxplot(x="ocean_proximity", y="median_house_value", data=housing)
plt.show()

In [34]:

sns.boxplot(x="ocean_proximity", y="total_bedrooms", data=housing)
plt.show()   

In [35]:

sns.pairplot(housing, corner=True)
plt.show()

1.4. Creating derived variables and checking correlations

In [36]:

housing['rooms_p_household']=housing.total_rooms / housing.households
housing['bedrooms_p_room']=housing.total_bedrooms / housing.total_rooms
housing['population_p_household']=housing.population / housing.households

In [37]:

corr = housing.select_dtypes('number').corr()  # numeric columns only; ocean_proximity is text
corr['median_house_value'].sort_values(ascending=False)

Out[37]:

median_house_value        1.000000
median_income             0.688075
rooms_p_household         0.151948
total_rooms               0.134153
housing_median_age        0.105623
households                0.065843
total_bedrooms            0.049686
population_p_household   -0.023737
population               -0.024650
longitude                -0.045967
latitude                 -0.144160
bedrooms_p_room          -0.255880
Name: median_house_value, dtype: float64
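The values above are Pearson correlation coefficients. As a minimal sketch of what `corr()` computes, the snippet below takes only the five rows shown by `housing.head()`, derives `rooms_p_household` the same way as the notebook, and checks a hand-written Pearson formula against pandas. The helper `pearson` is a hypothetical name introduced for illustration, and five rows are far too few to say anything about the real relationship.

```python
import numpy as np
import pandas as pd

# The five rows printed by housing.head() earlier in the notebook
toy = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0, 1627.0],
    "households": [126.0, 1138.0, 177.0, 219.0, 259.0],
    "median_house_value": [452600.0, 358500.0, 352100.0, 341300.0, 342200.0],
})
toy["rooms_p_household"] = toy["total_rooms"] / toy["households"]

def pearson(x, y):
    """Pearson correlation: co-variation divided by the product of the spreads."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))

manual = pearson(toy["rooms_p_household"], toy["median_house_value"])
from_pandas = toy["rooms_p_household"].corr(toy["median_house_value"])
print(manual, from_pandas)
```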

1.5. Filling missing values with the column mean

In [38]:

housing['total_bedrooms'] = housing['total_bedrooms'].fillna(housing['total_bedrooms'].mean())
housing['bedrooms_p_room'] = housing['bedrooms_p_room'].fillna(housing['bedrooms_p_room'].mean())

In [39]:

housing.isna().sum()

Out[39]:

longitude                 0
latitude                  0
housing_median_age        0
total_rooms               0
total_bedrooms            0
population                0
households                0
median_income             0
median_house_value        0
ocean_proximity           0
rooms_p_household         0
bedrooms_p_room           0
population_p_household    0
dtype: int64

In [41]:

# Drop the raw count columns and the weakly correlated ratio; this runs after the
# fill above because total_bedrooms is one of the dropped columns
housing.drop(['total_rooms', 'households', 'total_bedrooms', 'population_p_household'], axis=1, inplace=True)
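Filling with the mean of the whole column lets rows that later end up in the test set influence the fill value. A possible alternative, sketched here on a tiny invented array, is scikit-learn's `SimpleImputer`, which learns the mean from the training rows only and reuses it for new rows. The arrays `train` and `new_rows` are hypothetical and do not come from the housing data.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Invented single-column data with one missing value in each array
train = np.array([[100.0], [200.0], [np.nan], [300.0]])
new_rows = np.array([[np.nan], [50.0]])

imputer = SimpleImputer(strategy="mean")
train_filled = imputer.fit_transform(train)   # learns the mean of the non-missing training values
new_filled = imputer.transform(new_rows)      # reuses that training mean for unseen rows

print(imputer.statistics_)
print(new_filled.ravel())
```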

1.6. One-hot encoding the categorical data

In [45]:

one_hot = pd.get_dummies(housing['ocean_proximity'], dtype=int)  # 0/1 integer columns
housing = pd.concat([housing, one_hot], axis=1)
housing.drop(['ocean_proximity'], axis=1, inplace=True)

In [46]:

housing.head()

Out[46]:

	longitude	latitude	housing_median_age	population	median_income	median_house_value	rooms_p_household	bedrooms_p_room	NEAR BAY
0	-122.23	37.88	41.0	322.0	8.3252	452600.0	6.984127	0.146591	1
1	-122.22	37.86	21.0	2401.0	8.3014	358500.0	6.238137	0.155797	1
2	-122.24	37.85	52.0	496.0	7.2574	352100.0	8.288136	0.129516	1
3	-122.25	37.85	52.0	558.0	5.6431	341300.0	5.817352	0.184458	1
4	-122.25	37.85	52.0	565.0	3.8462	342200.0	6.281853	0.172096	1
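`pd.get_dummies` creates columns only for the categories present in the data it is given. If the same encoding has to be reapplied to new data later, one option is scikit-learn's `OneHotEncoder`, sketched below with the five `ocean_proximity` categories listed earlier. The small `train` and `new` frames are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# One row per category, so the encoder learns all five levels
train = pd.DataFrame({"ocean_proximity": ["NEAR BAY", "<1H OCEAN", "INLAND", "NEAR OCEAN", "ISLAND"]})
new = pd.DataFrame({"ocean_proximity": ["INLAND", "NEAR BAY"]})

# handle_unknown="ignore" encodes an unseen category as all zeros instead of raising
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train)

categories = list(encoder.categories_[0])  # learned levels, in sorted order
encoded = pd.DataFrame(encoder.transform(new).toarray().astype(int), columns=categories)
print(encoded)
```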

2. Machine learning

In [60]:

X, y = housing.drop(['median_house_value'], axis=1), housing['median_house_value']

In [134]:

# Split into train and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# Min-max scaling (normalization), fitted on the training split (test rows excluded)
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_normal=scaler.fit_transform(X_train)

# Create a validation set from the scaled training data
x_train, x_valid, y_train, y_valid = train_test_split(X_normal, y_train, test_size=0.2, random_state=1)
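In the cell above the scaler is fitted before the validation rows are split off, so the validation rows contribute to the minimum and maximum it learns. A minimal sketch of one way to avoid this, using synthetic data rather than the housing set: split first, then wrap the scaler and the model in a pipeline so that `fit` only ever sees the training rows. All variable names and the data-generating formula below are assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic regression data (not the housing data)
rng = np.random.default_rng(1)
X = rng.uniform(0, 100, size=(500, 3))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0, 5, size=500)

# Split into train+validation and test, then train and validation
X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
X_tr, X_val, y_tr, y_val = train_test_split(X_trainval, y_trainval, test_size=0.2, random_state=1)

# The pipeline fits the scaler on X_tr only and reuses it when predicting
model = make_pipeline(MinMaxScaler(), LinearRegression())
model.fit(X_tr, y_tr)

fitted_scaler = model.named_steps["minmaxscaler"]
val_r2 = model.score(X_val, y_val)  # R^2 on the validation rows
print(val_r2)
```

With this layout the test rows can be passed straight to `model.predict`, and the scaling learned from the training rows is applied automatically.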

2.1. Linear regression model

In [135]:

from sklearn.linear_model import LinearRegression

linear = LinearRegression()
linear.fit(x_train, y_train)

Out[135]:

LinearRegression()

In [136]:

y_pred_li = linear.predict(x_train)

In [137]:

from sklearn.metrics import mean_absolute_error
from sklearn.metrics import r2_score

lin_R2 = r2_score(y_train, y_pred_li)
lin_mae = mean_absolute_error(y_train, y_pred_li)
print(lin_R2)
print(lin_mae)

0.624211813272781
51516.31446363971

In [138]:

y_pred_li = linear.predict(x_valid)

In [139]:

lin_R2 = r2_score(y_valid, y_pred_li)
lin_mae = mean_absolute_error(y_valid, y_pred_li)
print(lin_R2)
print(lin_mae)

0.6276659644943523
50978.07769321241
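The two numbers printed for each model are the coefficient of determination (R²) and the mean absolute error (MAE). As a small sketch of what `r2_score` and `mean_absolute_error` compute, the functions below implement the standard definitions with NumPy. The four target and prediction values are invented for the example, and the function names `mae` and `r2` are hypothetical.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error: average size of the prediction errors."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))

def r2(y_true, y_pred):
    """R^2: 1 minus (residual sum of squares / total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Invented values: every prediction is off by exactly 10
y_true = [100.0, 200.0, 300.0, 400.0]
y_pred = [110.0, 190.0, 310.0, 390.0]

print(mae(y_true, y_pred))
print(r2(y_true, y_pred))
```

MAE is expressed in the same unit as the target (dollars in this notebook), while R² is unitless, which is why the notebook reports both.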

2.2. Decision tree

In [140]:

from sklearn.tree import DecisionTreeRegressor

tree = DecisionTreeRegressor()
tree.fit(x_train, y_train)

Out[140]:

DecisionTreeRegressor()

In [141]:

y_pred_tr = tree.predict(x_train)

# Training-set scores for the decision tree
tree_R2 = r2_score(y_train, y_pred_tr)
tree_mae = mean_absolute_error(y_train, y_pred_tr)
print(tree_R2)
print(tree_mae)

1.0
0.0

In [142]:

y_pred_tr = tree.predict(x_valid)

In [143]:

tree_R2 = r2_score(y_valid, y_pred_tr)
tree_mae = mean_absolute_error(y_valid, y_pred_tr)
print(tree_R2)
print(tree_mae)

0.6040317121249402
45843.157129881925
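A training R² of 1.0 next to a validation R² of about 0.60 indicates that the unconstrained tree memorizes the training set. One common way to limit this is to cap the tree's depth or minimum leaf size. The sketch below illustrates the effect on synthetic data; the values `max_depth=5` and `min_samples_leaf=10` are arbitrary assumptions, not settings tuned for the housing data.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic noisy data (not the housing data)
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(600, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(0, 0.5, size=600)

X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

# An unconstrained tree grows until every training sample is fitted exactly
full = DecisionTreeRegressor(random_state=0).fit(X_tr, y_tr)
# A depth and leaf-size limit forces the tree to average over several samples
limited = DecisionTreeRegressor(max_depth=5, min_samples_leaf=10, random_state=0).fit(X_tr, y_tr)

for name, model in [("unconstrained", full), ("limited", limited)]:
    # score() returns R^2
    print(name, model.score(X_tr, y_tr), model.score(X_val, y_val))
```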

2.3. Random forest

In [144]:

from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor()
rf.fit(x_train, y_train)

Out[144]:

RandomForestRegressor()

In [145]:

y_pred_rf = rf.predict(x_train)  # predict with the random forest (rf), not the decision tree

rf_R2 = r2_score(y_train, y_pred_rf)
rf_mae = mean_absolute_error(y_train, y_pred_rf)
print(rf_R2)
print(rf_mae)

In [146]:

y_pred_rf = rf.predict(x_valid)

In [147]:

rf_R2 = r2_score(y_valid, y_pred_rf)
rf_mae = mean_absolute_error(y_valid, y_pred_rf)
print(rf_R2)
print(rf_mae)

2.4. XGBoost regression

In [149]:

import xgboost

xgb_model = xgboost.XGBRegressor()
xgb_model.fit(x_train, y_train)

Out[149]:

XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
             importance_type='gain', interaction_constraints='',
             learning_rate=0.300000012, max_delta_step=0, max_depth=6,
             min_child_weight=1, missing=nan, monotone_constraints='()',
             n_estimators=100, n_jobs=8, num_parallel_tree=1, random_state=0,
             reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
             tree_method='exact', validate_parameters=1, verbosity=None)

In [150]:

y_pred_xgb = xgb_model.predict(x_train)

xgb_R2 = r2_score(y_train, y_pred_xgb)
xgb_mae = mean_absolute_error(y_train, y_pred_xgb)
print(xgb_R2)
print(xgb_mae)

0.9390500039169059
20266.895483932592

In [151]:

y_pred_xgb = xgb_model.predict(x_valid)

In [152]:

xgb_R2 = r2_score(y_valid, y_pred_xgb)
xgb_mae = mean_absolute_error(y_valid, y_pred_xgb)
print(xgb_R2)
print(xgb_mae)

0.8195479642727931
32955.129073001815

2.5. k-fold cross-validation

In [104]:

def display_scores(model, scores):
    print('<<', model, 'model evaluation results >>')
    print('Mean RMSE:', scores.mean())
    print('Standard deviation:', scores.std())

In [113]:

import numpy as np
from sklearn.model_selection import cross_val_score

# Note: cross-validation here runs on the validation split (x_valid, y_valid)
tree_scores = cross_val_score(tree, x_valid, y_valid, scoring='neg_mean_squared_error', cv=10)
tree_rmse_scores = np.sqrt(-tree_scores)

lin_scores = cross_val_score(linear, x_valid, y_valid, scoring='neg_mean_squared_error', cv=10)
lin_rmse_scores = np.sqrt(-lin_scores)

rf_scores = cross_val_score(rf, x_valid, y_valid, scoring='neg_mean_squared_error', cv=10)
rf_rmse_scores = np.sqrt(-rf_scores)

xgb_scores = cross_val_score(xgb_model, x_valid, y_valid, scoring='neg_mean_squared_error', cv=10)
xgb_rmse_scores = np.sqrt(-xgb_scores)

In [114]:

display_scores('Linear regression', lin_rmse_scores)
print('\n')
display_scores('Decision tree', tree_rmse_scores)
print('\n')
display_scores('Random forest', rf_rmse_scores)
print('\n')
display_scores('XGBoost', xgb_rmse_scores)

<< Linear regression model evaluation results >>
Mean RMSE: 70134.82258901293
Standard deviation: 3135.2154009929136


<< Decision tree model evaluation results >>
Mean RMSE: 77048.8584526432
Standard deviation: 3991.6789207226984


<< Random forest model evaluation results >>
Mean RMSE: 57317.818981409866
Standard deviation: 1593.355992301959


<< XGBoost model evaluation results >>
Mean RMSE: 54391.60931727962
Standard deviation: 2778.9638841159435
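To make explicit what `cross_val_score` does, the sketch below runs the k-fold loop by hand on synthetic data and compares the per-fold RMSE with the library result. It uses 5 folds and a plain linear regression to stay small; the data and the coefficient vector are invented for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic linear data with noise (not the housing data)
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(0, 0.3, size=200)

kfold = KFold(n_splits=5, shuffle=True, random_state=42)

# Manual loop: train on k-1 folds, measure RMSE on the held-out fold
manual_rmse = []
for train_idx, test_idx in kfold.split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    errors = y[test_idx] - model.predict(X[test_idx])
    manual_rmse.append(np.sqrt(np.mean(errors ** 2)))
manual_rmse = np.array(manual_rmse)

# Library call: scores are negative MSE, so negate before taking the root
neg_mse = cross_val_score(LinearRegression(), X, y,
                          scoring="neg_mean_squared_error", cv=kfold)
library_rmse = np.sqrt(-neg_mse)

print(manual_rmse.mean(), manual_rmse.std())
```

The scoring name is `neg_mean_squared_error` because scikit-learn treats higher scores as better, which is why the notebook applies `np.sqrt(-scores)`.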

3. Final evaluation

3.1. Performance of the final model

In [153]:

x_test_nor = scaler.transform(X_test)

In [155]:

final_pred = xgb_model.predict(x_test_nor)

In [162]:

from sklearn.metrics import mean_squared_error
final_mse = mean_squared_error(y_test, final_pred)
final_rmse = np.sqrt(final_mse)
final_r2 = r2_score(y_test, final_pred)

In [166]:

print('RMSE: ',final_rmse)
print('R2: ',final_r2)

RMSE:  48553.95504760515
R2:  0.8202711742377262

3.2. Predicted vs. actual values - visualization

In [175]:

pred = pd.DataFrame(final_pred, columns=['prediction'])
actual = pd.DataFrame(y_test)
actual.reset_index(inplace=True, drop=True)
table = pd.concat([pred, actual], axis=1)

In [178]:

table

Out[178]:

	prediction	median_house_value
0	344882.437500	355000.0
1	62316.609375	70700.0
2	225855.718750	229400.0
3	146715.187500	112500.0
4	247561.625000	225400.0
...	...	...
4123	73654.890625	68200.0
4124	346308.218750	225000.0
4125	283500.062500	350000.0
4126	246082.078125	227300.0
4127	145000.578125	141700.0

4128 rows × 2 columns

In [189]:

table.iloc[2000:2100,:].plot(figsize=(30,7))

Out[189]:

<AxesSubplot:>

3.3. Feature importance

In [193]:

from xgboost import plot_importance
plot_importance(xgb_model)

Out[193]:

<AxesSubplot:title={'center':'Feature importance'}, xlabel='F score', ylabel='Features'>

f0 : longitude
f1 : latitude
f4 : median_income
f6 : bedrooms_p_room
f3 : population
f5 : rooms_p_household
f2 : housing_median_age
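XGBoost labels the features `f0`, `f1`, and so on because the model was trained on a NumPy array without column names. As a small sketch, the mapping listed above can be derived from the column order instead of being written by hand. The `columns` list below is copied from the `X_test.columns` output in this notebook; the dictionary name `feature_names` is a hypothetical helper.

```python
# Column order of the feature matrix, copied from the X_test.columns output
columns = ['longitude', 'latitude', 'housing_median_age', 'population',
           'median_income', 'rooms_p_household', 'bedrooms_p_room', '<1H OCEAN',
           'INLAND', 'ISLAND', 'NEAR BAY', 'NEAR OCEAN']

# XGBoost's default name for the i-th column is "f<i>"
feature_names = {f"f{i}": name for i, name in enumerate(columns)}

for key in ["f0", "f1", "f4", "f6", "f3", "f5", "f2"]:
    print(key, ":", feature_names[key])
```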

Interpreting feature importance

Location (longitude, latitude) has a major influence on housing prices.
Median income has a major influence on housing prices.
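The F score drawn by `plot_importance` counts, by default, how often a feature is used to split (importance type `weight`), which is not the same as how much each feature improves the predictions. As a hedged cross-check, permutation importance measures how much a validation score drops when one feature's values are shuffled. The sketch below uses synthetic data and a small random forest as a stand-in model; it is not the notebook's XGBoost model, and the data-generating formula is an assumption.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: feature 0 is strong, feature 1 is weak, feature 2 is pure noise
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3))
y = 5.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 0.5, size=400)

X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25, random_state=0)
model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Shuffle each feature several times on held-out data and record the score drop
result = permutation_importance(model, X_val, y_val, n_repeats=5, random_state=0)
ranking = np.argsort(result.importances_mean)[::-1]  # most important first

print(ranking)
print(result.importances_mean)
```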

In [199]:

X_test.columns

Out[199]:

Index(['longitude', 'latitude', 'housing_median_age', 'population',
       'median_income', 'rooms_p_household', 'bedrooms_p_room', '<1H OCEAN',
       'INLAND', 'ISLAND', 'NEAR BAY', 'NEAR OCEAN'],
      dtype='object')
