2021.06.29

EDA

Importing libraries

Import the pandas and scikit-learn libraries.

import pandas as pd
from sklearn.tree import DecisionTreeRegressor

Loading the data

Load train.csv and test.csv as DataFrame objects.

pd.read_csv()

train = pd.read_csv('data/train.csv') 
test = pd.read_csv('data/test.csv')

Checking the number of rows and columns

Use shape to inspect the size of the data.

df.shape

train.shape

(1459, 11)

test.shape

(715, 10)

Checking for missing values

Use info() to check whether there are any missing values.

df.info()

train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1459 entries, 0 to 1458
Data columns (total 11 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   id                      1459 non-null   int64  
 1   hour                    1459 non-null   int64  
 2   hour_bef_temperature    1457 non-null   float64
 3   hour_bef_precipitation  1457 non-null   float64
 4   hour_bef_windspeed      1450 non-null   float64
 5   hour_bef_humidity       1457 non-null   float64
 6   hour_bef_visibility     1457 non-null   float64
 7   hour_bef_ozone          1383 non-null   float64
 8   hour_bef_pm10           1369 non-null   float64
 9   hour_bef_pm2.5          1342 non-null   float64
 10  count                   1459 non-null   float64
dtypes: float64(9), int64(2)
memory usage: 125.5 KB

test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 715 entries, 0 to 714
Data columns (total 10 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   id                      715 non-null    int64  
 1   hour                    715 non-null    int64  
 2   hour_bef_temperature    714 non-null    float64
 3   hour_bef_precipitation  714 non-null    float64
 4   hour_bef_windspeed      714 non-null    float64
 5   hour_bef_humidity       714 non-null    float64
 6   hour_bef_visibility     714 non-null    float64
 7   hour_bef_ozone          680 non-null    float64
 8   hour_bef_pm10           678 non-null    float64
 9   hour_bef_pm2.5          679 non-null    float64
dtypes: float64(8), int64(2)
memory usage: 56.0 KB
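
info() reports non-null counts, so the number of missing values has to be worked out by subtraction. A minimal sketch of counting them directly with isnull().sum(), using a small made-up DataFrame (the values are illustrative and not taken from train.csv):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for train; the values are made up for illustration.
toy = pd.DataFrame({
    'hour': [0, 1, 2, 3],
    'hour_bef_temperature': [20.1, np.nan, 19.5, 18.7],
    'hour_bef_ozone': [np.nan, 0.03, np.nan, 0.04],
})

# Number of missing values per column.
missing_counts = toy.isnull().sum()
print(missing_counts)

# Fraction of missing values per column (mean of the True/False mask).
missing_ratio = toy.isnull().mean()
print(missing_ratio)
```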

df['a'].value_counts()

train['Embarked'].value_counts()

S    644
C    168
Q     77
Name: Embarked, dtype: int64

df['a'].unique()

train['Embarked'].unique()

array(['S', 'C', 'Q', nan], dtype=object)
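
The value_counts() output above lists only S, C and Q, while unique() also returns nan. This is because value_counts() skips missing values by default. A small sketch with a made-up Series (not the real Embarked column) showing how dropna=False makes the missing values visible:

```python
import numpy as np
import pandas as pd

# Made-up values for illustration only.
s = pd.Series(['S', 'C', 'S', np.nan, 'Q', 'S'], name='Embarked')

counts = s.value_counts()                       # NaN is excluded by default
counts_with_nan = s.value_counts(dropna=False)  # NaN gets its own row

print(counts)
print(counts_with_nan)
```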

pd.Series.plot(kind = "bar")

Bar chart
The index values map to the x-axis and the values map to the y-axis.
Useful for showing the result of value_counts().
Useful for showing the result of a groupby.

train.groupby('Pclass')['Survived'].mean().plot(kind='bar', rot=0)  # rot=0: keep the x tick labels horizontal

<AxesSubplot:xlabel='Pclass'>

pd.Series.plot(kind = 'hist')

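Histogram: visualizes the number of rows that fall into each bin.
Works only for numeric data, not for categorical data!

train['Age'].plot(kind='hist', bins=30)  # bins: number of bins (how fine the histogram is)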

<AxesSubplot:ylabel='Frequency'>

Grid lines => grid = True

train['Age'].plot(kind='hist', bins=30, grid=True)  # bins: number of bins (how fine the histogram is)

<AxesSubplot:ylabel='Frequency'>
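
The per-bin counts that a histogram visualizes can also be computed directly, which helps when checking what the bins argument does. A minimal sketch using np.histogram on made-up ages (not the real Age column):

```python
import numpy as np
import pandas as pd

# Made-up ages for illustration only.
ages = pd.Series([2, 5, 18, 22, 25, 31, 38, 47, 63, 80], name='Age')

# np.histogram splits [min, max] into equal-width bins and returns
# the number of values per bin plus the bin edges.
counts, edges = np.histogram(ages, bins=4)

print(edges)   # 5 edges for 4 bins
print(counts)  # number of rows falling into each bin
```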

pd.DataFrame.plot(x, y, kind = 'scatter')

Scatter plot: visualizes the relationship between two variables

train.plot(x = 'Age', y = 'Fare', kind = 'scatter')

<AxesSubplot:xlabel='Age', ylabel='Fare'>

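Preprocessing

Handling missing values

Use dropna() to remove rows with missing values from the train data, and
use fillna() to replace missing values in the test data with 0.
Then print the number of missing values to confirm.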

train = train.dropna()
test = test.fillna(0)
print(train.isnull().sum())

id                        0
hour                      0
hour_bef_temperature      0
hour_bef_precipitation    0
hour_bef_windspeed        0
hour_bef_humidity         0
hour_bef_visibility       0
hour_bef_ozone            0
hour_bef_pm10             0
hour_bef_pm2.5            0
count                     0
dtype: int64
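
A small sketch of how the two approaches differ, on a made-up DataFrame: dropna() reduces the number of rows, while fillna(0) keeps every row. Keeping every row matters for test, presumably because each test row needs a prediction.

```python
import numpy as np
import pandas as pd

# Toy frame; the values are made up for illustration.
toy = pd.DataFrame({
    'hour': [0, 1, 2, 3],
    'hour_bef_windspeed': [1.5, np.nan, 2.0, np.nan],
})

dropped = toy.dropna()   # rows containing any NaN are removed
filled = toy.fillna(0)   # NaN is replaced with 0, row count is preserved

print(dropped.shape)
print(filled.shape)
```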

Filling missing values with the mean

train.fillna({'hour_bef_temperature' : int(train['hour_bef_temperature'].mean())},inplace=True)
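
Note that the line above wraps the mean in int(), which truncates the decimal part. As a variation (not the code from these notes), the sketch below fills with the untruncated mean on a made-up column, so the difference is visible:

```python
import numpy as np
import pandas as pd

# Toy column; the values are made up for illustration.
toy = pd.DataFrame({'hour_bef_temperature': [10.0, np.nan, 13.0, 14.0]})

# mean() ignores NaN: (10 + 13 + 14) / 3
col_mean = toy['hour_bef_temperature'].mean()

# Fill with the float mean; int(col_mean) would fill with 12 instead.
toy['hour_bef_temperature'] = toy['hour_bef_temperature'].fillna(col_mean)
print(toy)
```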

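Filling missing values by interpolation

Interpolation is used to preserve the information a feature carries.
When the rows are in time order, filling a missing value with the average of the previous row (the preceding hour) and the next row (the following hour) is quite reasonable.
How to fill missing values depends on the data, and the choice is up to the engineer.
The code below uses pandas interpolate() to fill missing values in the DataFrame by linear interpolation.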
```
train.interpolate(inplace=True)
```
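
A worked example of what linear interpolation does, on a made-up Series: a single gap is filled with the average of its neighbors, and a longer gap is filled in equal steps.

```python
import numpy as np
import pandas as pd

# Made-up values; think of each row as one consecutive hour.
s = pd.Series([10.0, np.nan, 14.0, np.nan, np.nan, 20.0])

# The default method is 'linear': gaps are filled on a straight line
# between the nearest known values on either side.
interp = s.interpolate()
print(interp)
```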

pd.Series.map()

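A function for transforming the values in a Series.
Strings must be replaced with numbers before they can be fed to a model.

train['Sex'] = train['Sex'].map({'male': 0, 'female': 1})  # assign back to the column, not to train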
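
One thing to watch with map(): any value that is missing from the dictionary becomes NaN. A small sketch with a made-up Series (the 'unknown' value is hypothetical):

```python
import pandas as pd

# Made-up values; 'unknown' is deliberately not in the mapping.
sex = pd.Series(['male', 'female', 'female', 'unknown'], name='Sex')

encoded = sex.map({'male': 0, 'female': 1})
print(encoded)

# Count the values that the mapping did not cover.
print(encoded.isnull().sum())
```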

Training the model

Assign the train data without the count feature to X_train.
Assign only the count feature of the train data to Y_train.
Declare a decision tree regressor and train it with fit().

X_train = train.drop(['count'], axis=1)
Y_train = train['count']
model = DecisionTreeRegressor()
model.fit(X_train, Y_train)

DecisionTreeRegressor(ccp_alpha=0.0, criterion='mse', max_depth=None,
                      max_features=None, max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=1, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, presort='deprecated',
                      random_state=None, splitter='best')
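
The notes train on all of train and go straight to prediction. As an optional extra step (not part of the original workflow), a hold-out split gives a rough error estimate before submitting. A minimal sketch on synthetic data. The data, the 80/20 split and the choice of RMSE as the metric are all assumptions:

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for train; not the real data.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'hour': rng.integers(0, 24, size=200),
    'hour_bef_temperature': rng.normal(15, 5, size=200),
})
toy['count'] = 5.0 * toy['hour'] + rng.normal(0, 3, size=200)

X = toy.drop(['count'], axis=1)
y = toy['count']

# Hold out 20% of the rows for validation.
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

model = DecisionTreeRegressor(random_state=0)
model.fit(X_tr, y_tr)

# RMSE on the held-out rows (assumed metric).
val_pred = model.predict(X_val)
rmse = np.sqrt(mean_squared_error(y_val, val_pred))
print(rmse)
```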

Predicting on the test file

Use predict() to generate an array of predictions for the test data with the trained model.

pred = model.predict(test)

Creating the submission file

Load submission.csv.
Overwrite the count feature of the submission df with the predictions.
Use to_csv() to write the submission df to a csv file (index=False).

submission = pd.read_csv('data/submission.csv')
submission['count'] = pred
submission.to_csv('sub.csv',index=False)
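
A sketch of what index=False changes, written to an in-memory buffer instead of a file. The submission frame and pred array below are hypothetical stand-ins, and the 'id' column is an assumption about the file layout:

```python
import io

import numpy as np
import pandas as pd

# Hypothetical stand-ins for data/submission.csv and the model output.
submission = pd.DataFrame({'id': [0, 1, 2], 'count': [np.nan, np.nan, np.nan]})
pred = np.array([12.0, 30.5, 7.0])

# The prediction array must have one value per submission row.
assert len(pred) == len(submission)
submission['count'] = pred

# index=False: only the real columns are written.
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

# Default (index=True): an extra unnamed index column is written first.
buffer_with_index = io.StringIO()
submission.to_csv(buffer_with_index)
csv_with_index = buffer_with_index.getvalue()

print(csv_text)
print(csv_with_index)
```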

2021.07.24 addition

Modifying data

Changing the value at a specific row of a specific column

train.loc[973, '석식계'] = 479.8605851979346  # '석식계' is the dinner total column; .loc avoids chained assignment

Modifying data by filtering (position-based)

df.iloc[ , ]

train.iloc[1,2] = 498
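
iloc selects by position, while loc selects by label. The two give different rows once the index no longer starts at 0 with no gaps, for example after dropna(). A small sketch on a made-up frame:

```python
import pandas as pd

# Made-up frame whose index labels do not match the row positions.
toy = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30]}, index=[100, 101, 102])

toy.loc[101, 'b'] = 99   # label-based: row labeled 101, column 'b'
toy.iloc[0, 0] = -1      # position-based: first row, first column

print(toy)
```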

Extracting and indexing data

df.groupby()

train.groupby('년')[['중식계','석식계']].mean()  # '년': year, '중식계': lunch total, '석식계': dinner total

	중식계	석식계
년
2016	932.792952	519.418502
2017	897.614754	459.015822
2018	882.903766	465.547534
2019	850.512195	447.336832
2020	882.267241	432.736468
2021	1009.705882	396.588235
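
A minimal sketch of the same groupby pattern on a made-up frame, with hypothetical English column names. Selecting the numeric columns before calling mean() keeps text columns out of the aggregation:

```python
import pandas as pd

# Made-up data; 'year', 'lunch', 'dinner' and 'note' are hypothetical names.
toy = pd.DataFrame({
    'year': [2019, 2019, 2020, 2020],
    'lunch': [800, 900, 1000, 1100],
    'dinner': [400, 500, 450, 550],
    'note': ['a', 'b', 'c', 'd'],
})

# Group by year, then average only the selected numeric columns.
yearly = toy.groupby('year')[['lunch', 'dinner']].mean()
print(yearly)
```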

df.query

df.sort_values()

train.query('월<4 & 월>1 & 년 == 2020')['중식계'].mean()  # '월': month, '년': year

952.4285714285714

train.query('년>2016').groupby('월')['중식계'].mean().sort_values()

월
11    815.963415
12    834.473684
8     838.253012
7     839.523256
6     840.333333
5     849.460526
4     880.225000
10    894.666667
9     901.842857
1     934.278351
3     952.829268
2     998.042857
Name: 중식계, dtype: float64
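
The query, groupby and sort_values chain above can be traced step by step on a made-up frame (hypothetical English column names):

```python
import pandas as pd

# Made-up data; the column names are hypothetical.
toy = pd.DataFrame({
    'year': [2016, 2017, 2017, 2018, 2018],
    'month': [1, 1, 2, 1, 2],
    'lunch': [900, 950, 1000, 930, 980],
})

monthly = (
    toy.query('year > 2016')      # keep rows after 2016
       .groupby('month')['lunch'] # group the lunch column by month
       .mean()                    # average per month
       .sort_values()             # ascending by the averaged value
)
print(monthly)
```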

Dropping specific data

df.drop()

Dropping specific rows

train = train.drop(train.index[[204, 224, 244, 262, 281, 306, 327, 346, 366, 392, 410, 828, 853, 872, 890, 912, 932, 955, 973, 993, 1166]])

Dropping a specific column

df.drop('column name', axis=1)
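
A small sketch of both kinds of drop on a made-up frame. train.index[[...]] turns row positions into index labels, which is what drop() expects, and drop() returns a new DataFrame rather than changing the original:

```python
import pandas as pd

# Made-up frame whose index labels differ from the row positions.
toy = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [5, 6, 7, 8]}, index=[10, 11, 12, 13])

# Positions 0 and 2 are translated to labels 10 and 12, then dropped.
rows_dropped = toy.drop(toy.index[[0, 2]])

# axis=1 drops a column instead of a row.
col_dropped = toy.drop('b', axis=1)

print(rows_dropped)
print(col_dropped)
```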