[데이터시각화] 3. Pandas - lambda df:의 활용, MultiIndex의 이해, tidydata의 이해

Author

고경수

Published

January 25, 2024

전북대학교 통계학과 최규빈 교수님 강의 정리내용입니다.

https://guebin.github.io/DV2023/posts/07wk-1.html

imports

import numpy as np
import pandas as pd

import plotly.io as pio
pio.renderers.default = "notebook_connected"
pd.options.plotting.backend = "plotly"
pio.templates.default = "plotly_white"

Pandas - lambda df: 의 활용

추후 작성

Pandas - MultiIndex의 이해

추후 작성

Pandas - tidydata

tidydata의 개념

  • 아래의 자료는 처리가 불편하다.
df = pd.read_csv("https://raw.githubusercontent.com/guebin/DV2022/master/posts/Simpson.csv",index_col=0,header=[0,1])\
.stack().stack().reset_index()\
.rename({'level_0':'department','level_1':'result','level_2':'gender',0:'count'},axis=1).pivot_table(index=['gender','department'], columns='result',values='count',aggfunc=sum)
df 
result fail pass
gender department
female A 19 89
B 7 18
C 391 202
D 244 131
E 299 94
F 103 238
male A 314 511
B 208 352
C 204 121
D 279 138
E 137 54
F 149 224

- 가정1: 만약에 A학과에 해당하는 결과만 뽑고 싶다면? → department가 column으로 있어야함..

- 가정2: 이 데이터를 바탕으로 합격한사람만 bar plot을 그리고 싶다면? → department, gender, pass 가 column으로 있어야함..

tidydata = df['pass'].reset_index()
tidydata
gender department pass
0 female A 89
1 female B 18
2 female C 202
3 female D 131
4 female E 94
5 female F 238
6 male A 511
7 male B 352
8 male C 121
9 male D 138
10 male E 54
11 male F 224
tidydata.plot.bar(
    x='gender', y='pass',
    color='gender',
    facet_col ='department'
)

- tidydata 정의: https://r4ds.had.co.nz/tidy-data.html

  1. Each variable must have its own column.
  2. Each observation must have its own row.
  3. Each value must have its own cell.

tidydata가 아닌 예시

예시1 - MultiIndex 구조를 가지면 무조건 tidydata가 아니다.

df = pd.read_csv("https://raw.githubusercontent.com/guebin/DV2022/master/posts/Simpson.csv",index_col=0,header=[0,1])\
.stack().stack().reset_index()\
.rename({'level_0':'department','level_1':'result','level_2':'gender',0:'count'},axis=1).pivot_table(index=['gender','department'], columns='result',values='count',aggfunc=sum)
df
result fail pass
gender department
female A 19 89
B 7 18
C 391 202
D 244 131
E 299 94
F 103 238
male A 314 511
B 208 352
C 204 121
D 279 138
E 137 54
F 149 224
  • 이건 tidydata가 아니고
tidydata = df.stack().reset_index().rename({0:'applicant_count'},axis=1)
tidydata 
gender department result applicant_count
0 female A fail 19
1 female A pass 89
2 female B fail 7
3 female B pass 18
4 female C fail 391
5 female C pass 202
6 female D fail 244
7 female D pass 131
8 female E fail 299
9 female E pass 94
10 female F fail 103
11 female F pass 238
12 male A fail 314
13 male A pass 511
14 male B fail 208
15 male B pass 352
16 male C fail 204
17 male C pass 121
18 male D fail 279
19 male D pass 138
20 male E fail 137
21 male E pass 54
22 male F fail 149
23 male F pass 224
  • 이게 tidydata

- 구분하는 방법1: 직관에 의한 설명

  • query쓰기 불편: 남성 지원자만 뽑고 싶다면?, 학과A만 뽑고 싶다면? 탈락한 지원자만 뽑고싶다면? 학과A에서 탈락한 지원자만 뽑고싶다면??
  • 시각화하기 불편
  • 다루기가 불편하다…

- 구분하는 방법 2 - df는 원칙 1에 위배된다. (왜냐하면 gender, department, result, applicant_count에 해당하는 변수는 하나의 컬럼을 차지하지 못함) - df는 원칙 2에 위배된다. (왜냐하면 하나의 행에 2개의 applicant_count observation이 존재함)

예시2 - 아래의 자료는 tidydata가 아니다.

df=pd.read_csv("https://raw.githubusercontent.com/guebin/DV2022/master/posts/Simpson.csv",index_col=0,header=[0,1])\
.stack().stack().reset_index()\
.rename({'level_0':'department','level_1':'result','level_2':'gender',0:'count'},axis=1).pivot_table(index='gender', columns='result', values='count', aggfunc=sum)\
.assign(pass_fail = lambda df: list(map(lambda x,y: (y,x),df['fail'],df['pass']))).drop(['fail','pass'],axis=1).reset_index()
df
result gender pass_fail
0 female (772, 1063)
1 male (1400, 1291)
  • df는 원칙 3에 위배된다.

예시3 - wide df

df=pd.read_csv('https://raw.githubusercontent.com/guebin/2021DV/master/_notebooks/phone.csv')
df
Date Samsung Apple Huawei Xiaomi Oppo Mobicel Motorola LG Others Realme Google Nokia Lenovo OnePlus Sony Asus
0 2019-10 461 324 136 109 76 81 43 37 135 28 39 14 22 17 20 17
1 2019-11 461 358 167 141 86 61 29 36 141 27 29 20 23 10 19 27
2 2019-12 426 383 143 105 53 45 51 48 129 30 20 26 28 18 18 19
3 2020-01 677 494 212 187 110 79 65 49 158 23 13 19 19 22 27 22
4 2020-02 593 520 217 195 112 67 62 71 157 25 18 16 24 18 23 20
5 2020-03 637 537 246 187 92 66 59 67 145 21 16 24 18 31 22 14
6 2020-04 647 583 222 154 98 59 48 64 113 20 23 25 19 19 23 21
7 2020-05 629 518 192 176 91 87 50 66 150 43 27 15 18 19 19 13
8 2020-06 663 552 209 185 93 69 54 60 140 39 16 16 17 29 25 16
9 2020-07 599 471 214 193 89 78 65 59 130 40 27 25 21 18 18 12
10 2020-08 615 567 204 182 105 82 62 42 129 47 16 23 21 27 23 20
11 2020-09 621 481 230 220 102 88 56 49 143 54 14 15 17 15 19 15
12 2020-10 637 555 232 203 90 52 63 49 140 33 17 20 22 9 22 21
  • 이건 tidydata가 아니고
tidydata = df.melt(id_vars='Date').assign(Date = lambda _df: _df.Date.apply(pd.to_datetime))
tidydata
Date variable value
0 2019-10-01 Samsung 461
1 2019-11-01 Samsung 461
2 2019-12-01 Samsung 426
3 2020-01-01 Samsung 677
4 2020-02-01 Samsung 593
... ... ... ...
203 2020-06-01 Asus 16
204 2020-07-01 Asus 12
205 2020-08-01 Asus 20
206 2020-09-01 Asus 15
207 2020-10-01 Asus 21

208 rows × 3 columns

  • 이건 tidydata이다.

- df를 가지고 아래와 같은 그림을 그릴 수 있겠어?

tidydata.plot.line(
    x='Date', y='value', 
    color='variable',
    width=600
)

Pandas - melt/stack

reset_index()

중첩구조를 가지는 series일 경우 .reset_index()를 사용하면 쉽게 tidydata를 얻을 수 있다.

- 예시 1

dct = {'43052': 80, '43053': 90, '43054': 50}
s = pd.Series(dct)
s
43052    80
43053    90
43054    50
dtype: int64

- 예시 2

dct = {('43052',4): 80, ('43053',1): 90, ('43054',2): 50} # (학번,학년)
s = pd.Series(dct)
s
43052  4    80
43053  1    90
43054  2    50
dtype: int64
s.reset_index()
level_0 level_1 0
0 43052 4 80
1 43053 1 90
2 43054 2 50

- 예시 3

df=pd.read_csv("https://raw.githubusercontent.com/guebin/DV2022/master/posts/Simpson.csv",index_col=0,header=[0,1]).stack().stack()
df
A  fail  female     19
         male      314
   pass  female     89
         male      511
B  fail  female      7
         male      208
   pass  female     18
         male      352
C  fail  female    391
         male      204
   pass  female    202
         male      121
D  fail  female    244
         male      279
   pass  female    131
         male      138
E  fail  female    299
         male      137
   pass  female     94
         male       54
F  fail  female    103
         male      149
   pass  female    238
         male      224
dtype: int64
df.reset_index()
level_0 level_1 level_2 0
0 A fail female 19
1 A fail male 314
2 A pass female 89
3 A pass male 511
4 B fail female 7
5 B fail male 208
6 B pass female 18
7 B pass male 352
8 C fail female 391
9 C fail male 204
10 C pass female 202
11 C pass male 121
12 D fail female 244
13 D fail male 279
14 D pass female 131
15 D pass male 138
16 E fail female 299
17 E fail male 137
18 E pass female 94
19 E pass male 54
20 F fail female 103
21 F fail male 149
22 F pass female 238
23 F pass male 224

- 예시4 .reset_index()는 말그대로 index를 reset 하는 명령어. 꼭 pd.Series에만 쓰는건 아니다.

df=pd.read_csv("https://raw.githubusercontent.com/guebin/DV2022/master/posts/Simpson.csv",index_col=0,header=[0,1]).stack()
df
female male
A fail 19 314
pass 89 511
B fail 7 208
pass 18 352
C fail 391 204
pass 202 121
D fail 244 279
pass 131 138
E fail 299 137
pass 94 54
F fail 103 149
pass 238 224
df.reset_index()
level_0 level_1 female male
0 A fail 19 314
1 A pass 89 511
2 B fail 7 208
3 B pass 18 352
4 C fail 391 204
5 C pass 202 121
6 D fail 244 279
7 D pass 131 138
8 E fail 299 137
9 E pass 94 54
10 F fail 103 149
11 F pass 238 224

melt()

예시1: 아래의 자료를 tidydata로 만들자

df = pd.read_csv('https://raw.githubusercontent.com/guebin/2021DV/master/_notebooks/phone.csv')
df
Date Samsung Apple Huawei Xiaomi Oppo Mobicel Motorola LG Others Realme Google Nokia Lenovo OnePlus Sony Asus
0 2019-10 461 324 136 109 76 81 43 37 135 28 39 14 22 17 20 17
1 2019-11 461 358 167 141 86 61 29 36 141 27 29 20 23 10 19 27
2 2019-12 426 383 143 105 53 45 51 48 129 30 20 26 28 18 18 19
3 2020-01 677 494 212 187 110 79 65 49 158 23 13 19 19 22 27 22
4 2020-02 593 520 217 195 112 67 62 71 157 25 18 16 24 18 23 20
5 2020-03 637 537 246 187 92 66 59 67 145 21 16 24 18 31 22 14
6 2020-04 647 583 222 154 98 59 48 64 113 20 23 25 19 19 23 21
7 2020-05 629 518 192 176 91 87 50 66 150 43 27 15 18 19 19 13
8 2020-06 663 552 209 185 93 69 54 60 140 39 16 16 17 29 25 16
9 2020-07 599 471 214 193 89 78 65 59 130 40 27 25 21 18 18 12
10 2020-08 615 567 204 182 105 82 62 42 129 47 16 23 21 27 23 20
11 2020-09 621 481 230 220 102 88 56 49 143 54 14 15 17 15 19 15
12 2020-10 637 555 232 203 90 52 63 49 140 33 17 20 22 9 22 21

(풀이1) .melt() - 실패

df.melt()
variable value
0 Date 2019-10
1 Date 2019-11
2 Date 2019-12
3 Date 2020-01
4 Date 2020-02
... ... ...
216 Asus 16
217 Asus 12
218 Asus 20
219 Asus 15
220 Asus 21

221 rows × 2 columns

(풀이2) .melt(id_vars=) - 성공

df.melt(id_vars='Date')
Date variable value
0 2019-10 Samsung 461
1 2019-11 Samsung 461
2 2019-12 Samsung 426
3 2020-01 Samsung 677
4 2020-02 Samsung 593
... ... ... ...
203 2020-06 Asus 16
204 2020-07 Asus 12
205 2020-08 Asus 20
206 2020-09 Asus 15
207 2020-10 Asus 21

208 rows × 3 columns

stack() + reset_index()

교수님이 가장 많이 쓰는 테크닉: DataFrame을 MultiIndex를 가지는 Series로 “일부러” 변환하고 reset_index()를 시킴

예시1: 아래의 자료를 tidydata로 만들자.

df = pd.read_csv('https://raw.githubusercontent.com/guebin/2021DV/master/_notebooks/phone.csv')
df
Date Samsung Apple Huawei Xiaomi Oppo Mobicel Motorola LG Others Realme Google Nokia Lenovo OnePlus Sony Asus
0 2019-10 461 324 136 109 76 81 43 37 135 28 39 14 22 17 20 17
1 2019-11 461 358 167 141 86 61 29 36 141 27 29 20 23 10 19 27
2 2019-12 426 383 143 105 53 45 51 48 129 30 20 26 28 18 18 19
3 2020-01 677 494 212 187 110 79 65 49 158 23 13 19 19 22 27 22
4 2020-02 593 520 217 195 112 67 62 71 157 25 18 16 24 18 23 20
5 2020-03 637 537 246 187 92 66 59 67 145 21 16 24 18 31 22 14
6 2020-04 647 583 222 154 98 59 48 64 113 20 23 25 19 19 23 21
7 2020-05 629 518 192 176 91 87 50 66 150 43 27 15 18 19 19 13
8 2020-06 663 552 209 185 93 69 54 60 140 39 16 16 17 29 25 16
9 2020-07 599 471 214 193 89 78 65 59 130 40 27 25 21 18 18 12
10 2020-08 615 567 204 182 105 82 62 42 129 47 16 23 21 27 23 20
11 2020-09 621 481 230 220 102 88 56 49 143 54 14 15 17 15 19 15
12 2020-10 637 555 232 203 90 52 63 49 140 33 17 20 22 9 22 21
df.set_index('Date').stack().reset_index()
Date level_1 0
0 2019-10 Samsung 461
1 2019-10 Apple 324
2 2019-10 Huawei 136
3 2019-10 Xiaomi 109
4 2019-10 Oppo 76
... ... ... ...
203 2020-10 Nokia 20
204 2020-10 Lenovo 22
205 2020-10 OnePlus 9
206 2020-10 Sony 22
207 2020-10 Asus 21

208 rows × 3 columns

예시2: 아래의 자료를 tidydata로 만들어라.

df = pd.read_csv("https://raw.githubusercontent.com/guebin/DV2022/master/posts/Simpson.csv",index_col=0,header=[0,1])
df
male female
fail pass fail pass
A 314 511 19 89
B 208 352 7 18
C 204 121 391 202
D 279 138 244 131
E 137 54 299 94
F 149 224 103 238
df.stack().stack().reset_index()
level_0 level_1 level_2 0
0 A fail female 19
1 A fail male 314
2 A pass female 89
3 A pass male 511
4 B fail female 7
5 B fail male 208
6 B pass female 18
7 B pass male 352
8 C fail female 391
9 C fail male 204
10 C pass female 202
11 C pass male 121
12 D fail female 244
13 D fail male 279
14 D pass female 131
15 D pass male 138
16 E fail female 299
17 E fail male 137
18 E pass female 94
19 E pass male 54
20 F fail female 103
21 F fail male 149
22 F pass female 238
23 F pass male 224

unstack() + reset_index()

예시1 - .stack().unstack()은 반대연산

df=pd.read_csv('https://raw.githubusercontent.com/PacktPublishing/Pandas-Cookbook/master/data/flights.csv')\
.groupby(["AIRLINE","WEEKDAY"]).agg({"CANCELLED":[np.mean,"count"],"DIVERTED":[np.mean,"count"]})
df
CANCELLED DIVERTED
mean count mean count
AIRLINE WEEKDAY
AA 1 0.032106 1277 0.004699 1277
2 0.007341 1226 0.001631 1226
3 0.011949 1339 0.001494 1339
4 0.015004 1333 0.003751 1333
5 0.014151 1272 0.000786 1272
... ... ... ... ... ...
WN 3 0.014118 1275 0.001569 1275
4 0.007911 1264 0.003165 1264
5 0.005828 1201 0.000000 1201
6 0.010132 987 0.003040 987
7 0.006066 1154 0.002600 1154

98 rows × 4 columns

df.stack().unstack()
CANCELLED DIVERTED
mean count mean count
AIRLINE WEEKDAY
AA 1 0.032106 1277.0 0.004699 1277.0
2 0.007341 1226.0 0.001631 1226.0
3 0.011949 1339.0 0.001494 1339.0
4 0.015004 1333.0 0.003751 1333.0
5 0.014151 1272.0 0.000786 1272.0
... ... ... ... ... ...
WN 3 0.014118 1275.0 0.001569 1275.0
4 0.007911 1264.0 0.003165 1264.0
5 0.005828 1201.0 0.000000 1201.0
6 0.010132 987.0 0.003040 987.0
7 0.006066 1154.0 0.002600 1154.0

98 rows × 4 columns

- 이 자료를 tidydata로 만들자.

(풀이1) - stack 2번

df.stack().stack().reset_index()
AIRLINE WEEKDAY level_2 level_3 0
0 AA 1 mean CANCELLED 0.032106
1 AA 1 mean DIVERTED 0.004699
2 AA 1 count CANCELLED 1277.000000
3 AA 1 count DIVERTED 1277.000000
4 AA 2 mean CANCELLED 0.007341
... ... ... ... ... ...
387 WN 6 count DIVERTED 987.000000
388 WN 7 mean CANCELLED 0.006066
389 WN 7 mean DIVERTED 0.002600
390 WN 7 count CANCELLED 1154.000000
391 WN 7 count DIVERTED 1154.000000

392 rows × 5 columns

(풀이2) - unstack 2번

df.unstack().unstack().reset_index()
level_0 level_1 WEEKDAY AIRLINE 0
0 CANCELLED mean 1 AA 0.032106
1 CANCELLED mean 1 AS 0.000000
2 CANCELLED mean 1 B6 0.000000
3 CANCELLED mean 1 DL 0.006068
4 CANCELLED mean 1 EV 0.034130
... ... ... ... ... ...
387 DIVERTED count 7 OO 924.000000
388 DIVERTED count 7 UA 1038.000000
389 DIVERTED count 7 US 263.000000
390 DIVERTED count 7 VX 135.000000
391 DIVERTED count 7 WN 1154.000000

392 rows × 5 columns

예시2 - 아래의 자료를 tidydata로 만들어라.

df=pd.read_csv("https://raw.githubusercontent.com/guebin/DV2022/master/posts/Simpson.csv",index_col=0,header=[0,1])
df
male female
fail pass fail pass
A 314 511 19 89
B 208 352 7 18
C 204 121 391 202
D 279 138 244 131
E 137 54 299 94
F 149 224 103 238

(풀이1) - stack 2번

df.stack().stack().reset_index()
level_0 level_1 level_2 0
0 A fail female 19
1 A fail male 314
2 A pass female 89
3 A pass male 511
4 B fail female 7
5 B fail male 208
6 B pass female 18
7 B pass male 352
8 C fail female 391
9 C fail male 204
10 C pass female 202
11 C pass male 121
12 D fail female 244
13 D fail male 279
14 D pass female 131
15 D pass male 138
16 E fail female 299
17 E fail male 137
18 E pass female 94
19 E pass male 54
20 F fail female 103
21 F fail male 149
22 F pass female 238
23 F pass male 224

(풀이2) - unstack 1번

df.unstack().reset_index()
level_0 level_1 level_2 0
0 male fail A 314
1 male fail B 208
2 male fail C 204
3 male fail D 279
4 male fail E 137
5 male fail F 149
6 male pass A 511
7 male pass B 352
8 male pass C 121
9 male pass D 138
10 male pass E 54
11 male pass F 224
12 female fail A 19
13 female fail B 7
14 female fail C 391
15 female fail D 244
16 female fail E 299
17 female fail F 103
18 female pass A 89
19 female pass B 18
20 female pass C 202
21 female pass D 131
22 female pass E 94
23 female pass F 238