Files

Loading a file

data = pd.read_csv('G:/내 드라이브/bb/cc/data.csv')

Exporting a file

test.to_csv('test.csv', index=False)
  • The file is written to the directory containing the source code.
test.to_csv('G:/내 드라이브/Github/TIL-Blog/test.csv', index=False)
  • The file is written to the specified path.

Basic libraries

import pandas as pd # pandas
import numpy as np # numpy
import matplotlib.pyplot as plt # matplotlib
import matplotlib

import seaborn as sns # seaborn

Preprocessing

train / validation set split

train = pd.read_csv('https://bit.ly/fc-ml-titanic')
feature = [
    'Pclass', 'Sex', 'Age', 'Fare'
]
label = [
    'Survived'
]
from sklearn.model_selection import train_test_split
  • test_size: fraction allocated to the validation set (20% -> 0.2)
  • shuffle: whether to shuffle before splitting (default True)
  • random_state: random seed
x_train, x_valid, y_train, y_valid = train_test_split(train[feature], train[label], test_size=0.2, shuffle=True, random_state=30)

Handling missing values

from sklearn.impute import SimpleImputer

1. Numeric columns

Single column

train['Age'] = train['Age'].fillna(train['Age'].mean())  # fillna returns a new Series, so assign it back

Multiple columns

imputer = SimpleImputer(strategy='median')  # handles several columns at once; strategy: 'median', 'mean', ...
result = imputer.fit_transform(train[['Age', 'Pclass']])
train[['Age', 'Pclass']] = result

2. Categorical columns

train = pd.read_csv('https://bit.ly/fc-ml-titanic')

Single column

train['Embarked'] = train['Embarked'].fillna('S')  # assign back so the fill actually persists

Multiple columns

imputer = SimpleImputer(strategy='most_frequent')
result = imputer.fit_transform(train[['Embarked', 'Cabin']])
train[['Embarked', 'Cabin']] = result

Label Encoding: converting strings to numbers

from sklearn.preprocessing import LabelEncoder
train['Embarked_num'] = LabelEncoder().fit_transform(train['Embarked'])
train['Embarked_num'].value_counts()
2    646
0    168
1     77
Name: Embarked_num, dtype: int64
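
To see which integer maps to which original category, it helps to keep a reference to the fitted encoder (a small variation on the snippet above):

le = LabelEncoder()
train['Embarked_num'] = le.fit_transform(train['Embarked'])
le.classes_  # index = encoded value; here array(['C', 'Q', 'S'], dtype=object), so 'S' -> 2, matching the counts above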

One-Hot Encoding

pd.get_dummies(train['Embarked_num'], prefix = 'Embarked')
Embarked_0 Embarked_1 Embarked_2
0 0 0 1
1 1 0 0
2 0 0 1
3 0 0 1
4 0 0 1
... ... ... ...
886 0 0 1
887 0 0 1
888 0 0 1
889 1 0 0
890 0 1 0

891 rows × 3 columns
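
get_dummies returns a new DataFrame, so to actually use the encoded columns, one option (a sketch) is to concatenate them back onto the original:

one_hot = pd.get_dummies(train['Embarked_num'], prefix='Embarked')
train = pd.concat([train, one_hot], axis=1)  # append the one-hot columns to train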

Normalization (min 0, max 1)

movie = {'naver': [2, 4, 6, 8, 10], 
         'netflix': [1, 2, 3, 4, 5]}
movie = pd.DataFrame(data=movie)
from sklearn.preprocessing import MinMaxScaler
min_max_movie = MinMaxScaler().fit_transform(movie)
pd.DataFrame(min_max_movie, columns=['naver', 'netflix'])
naver netflix
0 0.00 0.00
1 0.25 0.25
2 0.50 0.50
3 0.75 0.75
4 1.00 1.00

Standardization, Standard Scaling (mean 0, std 1)

from sklearn.preprocessing import StandardScaler
x = np.arange(10)
# add an outlier
x[9] = 1000
x = x.reshape(-1, 1)
scaled = StandardScaler().fit_transform(x)
round(scaled.mean(), 2), scaled.std()
(0.0, 1.0)
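
The outlier is the point of this example: on the same x, min-max scaling squashes the first nine values into [0, 0.008] while the outlier maps to 1.0, whereas standardization still yields mean 0 / std 1. A quick comparison sketch:

from sklearn.preprocessing import MinMaxScaler
MinMaxScaler().fit_transform(x)  # first nine values land near 0 because the outlier dominates the range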

Validation and Tuning

Cross Validation

from sklearn.datasets import load_boston  # note: load_boston was removed in scikit-learn 1.2
data = load_boston()
df = pd.DataFrame(data['data'], columns=data['feature_names'])
df['MEDV'] = data['target']
from lightgbm import LGBMRegressor, LGBMClassifier
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(df.drop('MEDV', axis=1), df['MEDV'], random_state=42)
from sklearn.model_selection import KFold
n_splits = 5
kfold = KFold(n_splits=n_splits, shuffle = True, random_state=42)
X = np.array(df.drop('MEDV', axis=1))
Y = np.array(df['MEDV'])
lgbm_fold = LGBMRegressor(random_state=42)
i = 1
total_error = 0
for train_index, test_index in kfold.split(X):
    x_train_fold, x_test_fold = X[train_index], X[test_index]
    y_train_fold, y_test_fold = Y[train_index], Y[test_index]
    lgbm_pred_fold = lgbm_fold.fit(x_train_fold, y_train_fold).predict(x_test_fold)
    error = mean_squared_error(y_test_fold, lgbm_pred_fold)  # y_true first by convention (MSE is symmetric anyway)
    print('Fold = {}, prediction score = {:.2f}'.format(i, error))
    total_error += error
    i+=1
print('---'*10)
print('Average Error: %s' % (total_error / n_splits))
Fold = 1, prediction score = 8.34
Fold = 2, prediction score = 10.40
Fold = 3, prediction score = 17.58
Fold = 4, prediction score = 6.94
Fold = 5, prediction score = 12.16
------------------------------
Average Error: 11.083201392666322
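
The same K-fold evaluation can be written more compactly with scikit-learn's cross_val_score; note that the 'neg_mean_squared_error' scorer returns negated errors, so flip the sign. A sketch reusing X, Y, and kfold from above:

from sklearn.model_selection import cross_val_score
scores = cross_val_score(LGBMRegressor(random_state=42), X, Y, cv=kfold, scoring='neg_mean_squared_error')
print('Average Error: %s' % -scores.mean())  # negate to recover the MSE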

Hyperparameter Tuning

1. RandomizedSearchCV

params = {
    'n_estimators': [200, 500, 1000, 2000], 
    'learning_rate': [0.1, 0.05, 0.01], 
    'max_depth': [6, 7, 8], 
    'colsample_bytree': [0.8, 0.9, 1.0], 
    'subsample': [0.8, 0.9, 1.0],
}

Key Hyperparameters (LGBM)

  • random_state: fixes the random seed. Fix it before tuning!
  • n_jobs: number of CPU cores to use
  • learning_rate: the learning rate. Too large hurts performance; too small makes training slow. Find an appropriate value and tune it together with n_estimators. default=0.1
  • n_estimators: number of boosting stages (similar in spirit to the number of trees in a random forest). default=100
  • max_depth: tree depth; used to prevent overfitting. default=-1 (no limit) in LightGBM.
  • colsample_bytree: fraction of features sampled per tree (similar to max_features); used to prevent overfitting. default=1.0 (see the sketch after this list)
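
As a quick reference, a sketch instantiating LGBMRegressor with the hyperparameters above (the values are illustrative starting points, not recommendations):

model = LGBMRegressor(
    random_state=42,       # fix the seed before tuning
    n_jobs=-1,             # use all available CPU cores
    learning_rate=0.05,    # tune together with n_estimators
    n_estimators=500,      # number of boosting stages
    max_depth=7,           # cap tree depth to curb overfitting
    colsample_bytree=0.8,  # sample 80% of features per tree
)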
from sklearn.model_selection import RandomizedSearchCV

Set n_iter to control how many parameter combinations are tried in total.

(More attempts raise the chance of finding better parameters, but take correspondingly longer.)

clf = RandomizedSearchCV(LGBMRegressor(), params, random_state=42, cv=3, n_iter=25, scoring='neg_mean_squared_error')
clf.fit(x_train, y_train)
RandomizedSearchCV(cv=3, estimator=LGBMRegressor(), n_iter=25,
                   param_distributions={'colsample_bytree': [0.8, 0.9, 1.0],
                                        'learning_rate': [0.1, 0.05, 0.01],
                                        'max_depth': [6, 7, 8],
                                        'n_estimators': [200, 500, 1000, 2000],
                                        'subsample': [0.8, 0.9, 1.0]},
                   random_state=42, scoring='neg_mean_squared_error')
clf.best_score_
-13.707228623244996
clf.best_params_
{'subsample': 0.9,
 'n_estimators': 2000,
 'max_depth': 6,
 'learning_rate': 0.01,
 'colsample_bytree': 0.8}
lgbm_best = LGBMRegressor(n_estimators=2000, subsample=0.9, max_depth=6, learning_rate=0.01, colsample_bytree=0.8)  # matches best_params_ above
lgbm_best_pred = lgbm_best.fit(x_train, y_train).predict(x_test)

2. GridSearchCV

  • Performs an exhaustive search over every combination of the given parameter values.
  • With many parameters to optimize, it therefore takes very long.
params = {
    'n_estimators': [500, 1000], 
    'learning_rate': [0.1, 0.05, 0.01], 
    'max_depth': [7, 8], 
    'colsample_bytree': [0.8, 0.9], 
    'subsample': [0.8, 0.9,],
}
from sklearn.model_selection import GridSearchCV
grid_search = GridSearchCV(LGBMRegressor(), params, cv=3, n_jobs=-1, scoring='neg_mean_squared_error')
grid_search.fit(x_train, y_train)
GridSearchCV(cv=3, estimator=LGBMRegressor(), n_jobs=-1,
             param_grid={'colsample_bytree': [0.8, 0.9],
                         'learning_rate': [0.1, 0.05, 0.01],
                         'max_depth': [7, 8], 'n_estimators': [500, 1000],
                         'subsample': [0.8, 0.9]},
             scoring='neg_mean_squared_error')
grid_search.best_score_
-13.598939419010335
grid_search.best_params_
{'colsample_bytree': 0.8,
 'learning_rate': 0.05,
 'max_depth': 7,
 'n_estimators': 500,
 'subsample': 0.8}
lgbm_best = LGBMRegressor(n_estimators=500, subsample=0.8, max_depth=7, learning_rate=0.05, colsample_bytree=0.8)
lgbm_best_pred = lgbm_best.fit(x_train, y_train).predict(x_test)

Model

CatBoost + example

from catboost import CatBoostRegressor # CatBoost regression
from catboost import CatBoostClassifier # CatBoost classification
model = CatBoostRegressor()
model.fit(X_train, y_train, silent=True)

pred = model.predict(X_test)

rmse = np.sqrt(mean_squared_error(y_test, pred))  # mean_squared_error already returns a scalar
rmse

Random Forest

from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=123, max_depth=6)

rf.fit(X_train, y_train)
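
Since oob_score=True was passed, the out-of-bag accuracy estimate is available after fitting. A usage sketch, assuming an X_test split as in the earlier examples:

pred = rf.predict(X_test)
rf.oob_score_  # out-of-bag accuracy, estimated from samples left out of each bootstrap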

XGBoost, LightGBM

from xgboost import XGBRegressor
from xgboost import XGBClassifier

from lightgbm import LGBMRegressor
from lightgbm import LGBMClassifier

Evaluation Metrics

RMSE

from sklearn.metrics import mean_squared_error

rmse = np.sqrt(mean_squared_error(y_test, pred))
rmse

Accuracy

from sklearn.metrics import accuracy_score

accuracy = accuracy_score(y_test, predicted)