데이터 전처리¶

(1)

데이터 전처리 ¶

누락데이터 처리 중복 데이터 처리 데이터 표준화 범주형 데이터 처리 정규화

시계열 데이터

누락 데이터 처리

(2)

In [1]:

(891, 15)

survived 0

pclass 0

sex 0

age 177

sibsp 0

parch 0

fare 0

embarked 2

class 0

who 0

adult_male 0

deck 688

embark_town 2

alive 0

alone 0 dtype: int64

RangeIndex: 891 entries, 0 to 890 Data columns (total 15 columns):

survived 891 non-null int64 pclass 891 non-null int64 sex 891 non-null object age 714 non-null float64 sibsp 891 non-null int64 parch 891 non-null int64 fare 891 non-null float64 embarked 889 non-null object class 891 non-null category who 891 non-null object adult_male 891 non-null bool deck 203 non-null category embark_town 889 non-null object

survived pclass sex age sibsp parch fare embarked class who adult_male

0 0 3 male 22.0 1 0 7.2500 S Third man True

1 1 1 female 38.0 1 0 71.2833 C First woman False

2 1 3 female 26.0 0 0 7.9250 S Third woman False

3 1 1 female 35.0 1 0 53.1000 S First woman False

4 0 3 male 35.0 0 0 8.0500 S Third man True

import

seaborn

as

sns # seaborn 은 그래프화 해주는 라이브러리. 얘가 dataset을 제공을 해준다.

#titanic 데이터셋 가져오기

titanic_df

=

sns.load_dataset('titanic') display(titanic_df.head())

print(titanic_df.shape)

# 891행, 15열

display(titanic_df.isnull().sum()) # null이 각각 몇개인지 display(titanic_df.info())

print(type(titanic_df))

(3)

In [2]:

Q.deck 열의 NaN갯수를 계산하세요.

In [3]:

Q.titanic_df의 처음 5개 행에서 null값을 찾아 출력하세요(True/False)

None

Out[2]:

0 549 1 342

Name: survived, dtype: int64

NaN 688 C 59 B 47 D 33 E 32 A 15 F 13 G 4

Name: deck, dtype: int64 0 True

1 False 2 True 3 False 4 True

Name: deck, dtype: bool 688

#value_counts

titanic_df.survived.value_counts()

#deck 열의 NaN 개수 계산하기

nan_deck

=

titanic_df['deck'].value_counts(dropna

=False

)

#nan_deck=df['deck'].value_counts()

print(nan_deck)

display(titanic_df['deck'].isnull().head())

print(titanic_df['deck'].isnull().sum(axis = 0))

#행방향으로 찾기

(4)

In [4]:

Q.titanic_df의 'deck' 칼럼의 null의 갯수를 구하세요

In [5]:

survived pclass sex age sibsp parch fare embarked class who adult_male de

0 False False False False False False False False False False False Tr 1 False False False False False False False False False False False Fal 2 False False False False False False False False False False False Tr 3 False False False False False False False False False False False Fal 4 False False False False False False False False False False False Tr

survived pclass sex age sibsp parch fare embarked class who adult_male deck 0 True True True True True True True True True True True False 1 True True True True True True True True True True True True 2 True True True True True True True True True True True False 3 True True True True True True True True True True True True 4 True True True True True True True True True True True False

0 688

# isnull()메소드로 누락데이터 찾기 display(titanic_df.head().isnull())

print()

display(titanic_df.head().notnull())

# isnull()메소드로 누락데이터 개수 구하기

print(titanic_df.survived.isnull().sum(axis = 0))

print(titanic_df.deck.isnull().sum(axis = 0))

(5)

Q. titanic_df의 각 칼럼별 null 의 갯수를 for반복문을 사용해서 구한 후 출력하세 요.

(missing_count는 예외처리하고 처리 방식은 0을 출력함)

In [7]:

Q. embark_town 열의 NaN값을 바로 앞에 있는 828행의 값으로 변경한 후 출력하세 요.

survived 0

pclass 0

sex 0

age 177

sibsp 0

parch 0

fare 0

embarked 2

class 0

who 0

adult_male 0

deck 688

embark_town 2

alive 0

alone 0 dtype: int64

Index(['survived', 'pclass', 'sex', 'age', 'sibsp', 'parch', 'fare', 'embarked', 'class', 'who', 'adult_male', 'embark_town', 'alive', 'alone'],

dtype='object')

# 각 칼럼별 null 개수

print(titanic_df.isnull().sum(axis = 0))

# thresh = 500: NaN값이 500개 이상인 열을 모두 삭제

# - deck 열(891개 중 688개의 NaN 값)

# how = 'any' : NaN 값이 하나라도 있으면 삭제

# how = 'all': 모든 데이터가 NaN 값일 경우에만 삭제 df_thresh

=

titanic_df.dropna(axis

= 1, thresh = 500)

# df1.dropna(axis=0) # NaN row 삭제

# df1.dropna(axis=1) # NaN column 삭제

print(df_thresh.columns)

(6)

In [8]:

중복 데이터 처리

In [9]:

In [10]:

827 Cherbourg 828 Queenstown 829 NaN 830 Cherbourg

Name: embark_town, dtype: object 827 Cherbourg

828 Queenstown 829 Queenstown 830 Cherbourg

Name: embark_town, dtype: object

c1 c2 c3 0 a 1 1 1 a 1 1 2 b 1 2 3 a 2 2 4 b 2 2

0 False 1 True 2 False 3 False 4 False dtype: bool

# embark_town 열의 NaN값을 바로 앞에 있는 828행의 값으로 변경하기

import

seaborn

as

sns

titanic_df

=

sns.load_dataset('titanic') titanic_df.embark_town.isnull().sum()

#print(titanic_df['embark_town'][827:831])

print(titanic_df.embark_town.iloc[827:831]) print()

titanic_df['embark_town'].fillna(method

= 'ffill', inplace =True

)

# Nan 데이터 0으로 채워줌

print(titanic_df['embark_town'][827:831])

# 중복데이터를 갖는 데이터 프레임 만들기

import

pandas

as

pd

df

=

pd.DataFrame({'c1':['a','a','b','a','b'], 'c2':[1,1,1,2,2,],

'c3':[1,1,2,2,2]})

print(df)

#데이터 프레임 전체 행 데이터 중에서 중복값 찾기 df_dup

=

df.duplicated()

print(df_dup)

(7)

Q. df에서 중복행을 제거한 후 df2에 저장하고 출력하세요.

In [12]:

Q. df에서 c2, c3열을 기준으로 중복행을 제거 한 후 df3에 저장하고 출력하세요.

In [13]:

3

0 False 1 True 2 True 3 False 4 True

Name: c2, dtype: bool

c1 c2 c3 0 a 1 1 1 a 1 1 2 b 1 2 3 a 2 2 4 b 2 2 c1 c2 c3 0 a 1 1 2 b 1 2 3 a 2 2 4 b 2 2

c1 c2 c3 0 a 1 1 2 b 1 2 3 a 2 2

# 데이터프레임의 특정 열 데이터에서 중복값 찾기 col_dup_sum

=

df['c2'].duplicated().sum() col_dup

=

df['c2'].duplicated()

print(col_dup_sum,'\n')

# c2에서 중복되는 값이 몇개인지 확인

print(col_dup)

# 데이터 프레임에서 중복행 제거 : drop_duplicates()

print(df)

print()

df2

=

df.drop_duplicates()

print(df2)

# print(df)

# print()

# df3= df[['c2','c3']].drop_duplicates()

# print(df3)

df3

=

df.drop_duplicates(subset

=

['c2','c3'])

print(df3)

#[1,1] , [1,2],[2,2] 를 한묶음으로 봐서 같은 행을 삭제 한다.

(8)

데이터 단위 변경

In [14]:

In [15]:

Q. 'mpg'를 'kpl'로 환산하여 새로운 열을 생성하고 처음 3개 행을 소수점 아래 둘쨰 자 리에서 반올림하여 출력하시요.

Out[14]:

18.0 8 307.0 130.0 3504. 12.0 70 1 chevrolet chevelle malibu 0 15.0 8 350.0 165.0 3693.0 11.5 70 1 buick skylark 320 1 18.0 8 318.0 150.0 3436.0 11.0 70 1 plymouth satellite 2 16.0 8 304.0 150.0 3433.0 12.0 70 1 amc rebel sst 3 17.0 8 302.0 140.0 3449.0 10.5 70 1 ford torino 4 15.0 8 429.0 198.0 4341.0 10.0 70 1 ford galaxie 500

0.42514285110463595

# read_csv() 함수로 df 생성

import

pandas

as

pd

auto_df

=

pd.read_csv('auto-mpg.csv') auto_df.head()

#mpg(mile per gallon)를 kpl(kilometer per liter)로 반환 (mmpg_to_kpl=0.425) mpg_to_kpl

=

1.60934

/ 3.78541

print(mpg_to_kpl)

(9)

mpg cylinders displacement horsepower weight acceleration model

year origin name

0 18.0 8 307.0 130.0 3504.0 12.0 70 1 chevrolet

chevelle malibu

1 15.0 8 350.0 165.0 3693.0 11.5 70 1 buick

skylark 320

2 18.0 8 318.0 150.0 3436.0 11.0 70 1 plymouth

satellite

mpg cylinders displacement horsepower weight acceleration model

year origin name k

0 18.0 8 307.0 130.0 3504.0 12.0 70 1

chevrolet chevelle malibu

7.6 1 15.0 8 350.0 165.0 3693.0 11.5 70 1

buick skylark 320

6.3 2 18.0 8 318.0 150.0 3436.0 11.0 70 1 plymouth

satellite 7.6 import

pandas

as

pd

auto_df

=

pd.read_csv('auto-mpg.csv', header

=None

)

# 열 이름을 지정

auto_df.columns

=

['mpg','cylinders','displacement','horsepower','weight','acceleration','model year display(auto_df.head(3))

print('\n')

auto_df['kpl']

=

round((auto_df['mpg']

*

mpg_to_kpl),2) display(auto_df.head(3))

(10)

In [17]:

Q.horsepower 열의 고유값을 출력하세요.

In [18]:

Q. horsepower 열의 누락 데이터 '?'을 삭제한 후 NaN 값의 갯수를 출력하세요.

In [19]:

Q. horsepower'문자열을 실수 형으로 변환 후 자료형을 확인하세요.

mpg float64 cylinders int64 displacement float64 horsepower object weight float64 acceleration float64 model year int64 origin int64 name object kpl float64 dtype: object

['130.0' '165.0' '150.0' '140.0' '198.0' '220.0' '215.0' '225.0' '190.0' '170.0' '160.0' '95.00' '97.00' '85.00' '88.00' '46.00' '87.00' '90.00' '113.0' '200.0' '210.0' '193.0' '?' '100.0' '105.0' '175.0' '153.0' '180.0' '110.0' '72.00' '86.00' '70.00' '76.00' '65.00' '69.00' '60.00' '80.00' '54.00' '208.0' '155.0' '112.0' '92.00' '145.0' '137.0' '158.0' '167.0' '94.00' '107.0' '230.0' '49.00' '75.00' '91.00' '122.0' '67.00' '83.00' '78.00' '52.00' '61.00' '93.00' '148.0' '129.0' '96.00' '71.00' '98.00' '115.0' '53.00' '81.00' '79.00' '120.0' '152.0' '102.0' '108.0' '68.00' '58.00' '149.0' '89.00' '63.00' '48.00' '66.00' '139.0' '103.0' '125.0' '133.0' '138.0' '135.0' '142.0' '77.00' '62.00' '132.0' '84.00' '64.00' '74.00' '116.0' '82.00']

object

# 각 열의 자료형 확인

print(auto_df.dtypes)

print(auto_df['horsepower'].unique())

# '?' 때문에 horsepower object로 모든 것을 인식한다.

# 누락 데이터 ('?') 삭제

import

numpy

as

np

auto_df['horsepower'].replace('?',np.nan,inplace

=True

) #'?'을 np.nan으로 변경 auto_df.dropna(subset

=

['horsepower'], axis

= 0, inplace =True

) # 누락데이터 행을 제거 auto_df['horsepower'].isnull().sum()

print(auto_df.horsepower.dtypes)

(11)

Q.아래사항을 처리하세요

-

In [21]:

origin 열의 자료형을 확인하고 범주형으로 변환하여 출력하세요.

In [22]:

In [23]:

Q.origin열을 범주형에서 문자열로 변환한 후 자료형을 출력하세요.

0 130.0 1 165.0 2 150.0 3 150.0 4 140.0 ...

393 86.0 394 52.0 395 84.0 396 79.0 397 82.0

Name: horsepower, Length: 392, dtype: float64 float64

[1 3 2]

['USA' 'JAPEN' 'EU']

object

=

auto_df['horsepower'].astype('float')

print(auto_dff)

print(auto_dff.dtypes)

# origin 열의 고유값 확인

print(auto_df['origin'].unique())

# 정수형 데이터를 문자형 데이터로 변환

auto_df['origin'].replace({1:'USA',2:'EU',3:'JAPEN'},inplace

=True

)

print(auto_df['origin'].unique())

# 연속형 (1,2,3,4,5..)/범주형('AG','BG'..)

print(auto_df['origin'].dtypes)

#origin 열의 문자열 자료형을 범주형으로 전환

auto_df['origin']

=

auto_df['origin'].astype('category')

print(auto_df['origin'].dtypes)

(12)

In [24]:

Q.model year열의 정수형을 범주형으로 변환한 후 출력하세요.

In [25]:

In [26]:

object

215 76 90 73 142 74

Name: model year, dtype: int64 72 72

232 77 370 82

Name: model year, dtype: category

Categories (13, int64): [70, 71, 72, 73, ..., 79, 80, 81, 82]

Int64Index: 392 entries, 0 to 397 Data columns (total 11 columns):

mpg 392 non-null float64 cylinders 392 non-null int64 displacement 392 non-null float64 horsepower 392 non-null float64 weight 392 non-null float64 acceleration 392 non-null float64 model year 392 non-null category origin 392 non-null object name 392 non-null object kpl 392 non-null float64 hp 392 non-null int32

dtypes: category(1), float64(6), int32(1), int64(1), object(2) memory usage: 33.3+ KB

### origin 열의 자료형을 확인하고 문자형에서 범주형으로 변환하여 출력하세요.

auto_df['origin']

=

auto_df['origin'].astype('str') #str 대신 object를 사용 할 수 있다.

print(auto_df['origin'].dtypes)

# auto_df['model year'].unique() : [70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82]

display(auto_df['model year'].sample(3))

auto_df['model year']

=

auto_df['model year'].astype('category') display(auto_df['model year'].sample(3))

# 범주형(카테고리)데이터 처리

auto_df['horsepower'].replace('?',np.nan,inplace

=True

) auto_df.dropna(subset

=

= 0, inplace =True

) auto_df['horsepower']

=

auto_df['horsepower'].astype('float') auto_df['hp']

=

auto_df['horsepower'].astype('int')

print()

auto_df.info()

(13)

Int64Index: 392 entries, 0 to 397 Data columns (total 11 columns):

mpg 392 non-null float64 cylinders 392 non-null int64 displacement 392 non-null float64 horsepower 392 non-null float64 weight 392 non-null float64 acceleration 392 non-null float64 model year 392 non-null category origin 392 non-null object name 392 non-null object kpl 392 non-null float64 hp 392 non-null int32

dtypes: category(1), float64(6), int32(1), int64(1), object(2) memory usage: 33.3+ KB

auto_df['horsepower'].replace('?', np.nan, inplace

=True

) auto_df.dropna(subset

=

= 0, inplace =True

) auto_df['horsepower']

=

auto_df['horsepower'].astype('float') auto_df['hp']

=

auto_df['horsepower'].astype('int')

print()

auto_df.info()

(14)

In [28]:

더미 변수

-카테고리를 나타내는 범주형 데이터를 회귀분석 등 ml알고리즘에 바로 사용할 수 없는 경우 -컴퓨터가 인식 가 능한 입력값으로 변환 숫자0or1로 표현된느 더미 변수 사용. -0,1은 크고 작음과 상관없고 어떤 특성 여부만 표시 (존재1,비존재 0) -> one hot encoding(one hot vector로 변환한다는 의미)

array([ 46. , 107.33333333, 168.66666667, 230. ])

horsepower hp_bin

0 130.0 보통출력

1 165.0 보통출력

2 150.0 보통출력

3 150.0 보통출력

4 140.0 보통출력

# np.histogram 함수로 3개의 bin으로 나누는 경계값의 리스트 구하기

count, bin_dividers

=

np.histogram(auto_df['horsepower'], bins

= 3)

# 3개로 나누기 때문에 bins=3 display(bin_dividers) # [ 46. , 107.33333333, 168.66666667, 230.] : 46 ~ 107, 107 ~ 168, 168 ~ 230 총

print()

# 3개의 bin에 이름 지정

bin_names

=

['저출력', '보통출력', '고출력']

# pd.cut 함수로 각 데이터를 3개의 bin에 할당

auto_df['hp_bin']

=

pd.cut(x

=

auto_df['horsepower'], # 데이터 배열 bins

=

bin_dividers, # 경계 값 리스트 labels

=

bin_names, # bin 이름 include_lowest

=True

) # 첫 경계값 포함

# horsepower 열, hp_bin 열의 첫 5행을 출력 display(auto_df[['horsepower', 'hp_bin']].head())

(15)

In [32]:

Out[29]:

hp_bin 저출력 보통출력 고출력

0 0 1 0

1 0 1 0

2 0 1 0

3 0 1 0

4 0 1 0

5 0 0 1

6 0 0 1

7 0 0 1

8 0 0 1

9 0 0 1

10 0 0 1

11 0 1 0

12 0 1 0

13 0 0 1

14 1 0 0

[1 1 1 1 1 0 0 0 0 0 0 1 1 0 2]

# hp_bin 열의 범주형 데이터를 더미 변수로 변환

horsepower_dummies

=

pd.get_dummies(auto_df['hp_bin']) horsepower_dummies.head(15)

# sklearn 라이브러리 불러오기

from

sklearn

import

preprocessing

# 전처리를 위한 encoder 객체 만들기

label_encoder

=

preprocessing.LabelEncoder() #label encoder 생성 onehot_encoder

=

preprocessing.OneHotEncoder() #one hot encoder 생성

# label encoder로 문자열 범주를 숫자형 범주로 변환

onehot_labeled

=

label_encoder.fit_transform(auto_df['hp_bin'].head(15))

print(onehot_labeled)

print(type(onehot_labeled))

(16)

In [33]:

In [36]:

정규화

각변수의 숫자 데이터의 상대적 크기 차이 때문에 ml분석결과가 달라질 수 있음.(a 변수 1 ~ 1000, b변수 0 ~ 1)

숫자 데이터의 상대적 크기 차이를 제거 할 필요가 있으며 각 열(변수)에 속하는 데이터 값을 동일한 크기 기 분즈오 나누어 정교화 함

[[1]

[1]

[0]

[1]

[0]

[2]]

(0, 1) 1.0 (1, 1) 1.0 (2, 1) 1.0 (3, 1) 1.0 (4, 1) 1.0 (5, 0) 1.0 (6, 0) 1.0 (7, 0) 1.0 (8, 0) 1.0 (9, 0) 1.0 (10, 0) 1.0 (11, 1) 1.0 (12, 1) 1.0 (13, 0) 1.0 (14, 2) 1.0

# 2차원 행렬로 형태 변경

onehot_reshaped

=

onehot_labeled.reshape(len(onehot_labeled),1)

print(onehot_reshaped)

print(type(onehot_reshaped))

#희소행렬로 변환

import

warnings

warnings.filterwarnings('ignore')

onehot_fitted

=

onehot_encoder.fit_transform(onehot_reshaped)

print(onehot_fitted)

print(type(onehot_fitted))

(17)

표준화

평균이 0이고 분산(or표준편차)이 1인 가우시간 표준 정규분포를 가진 값으로 변환하는 것

In [41]:

count 392.000000 mean 104.469388 std 38.491160 min 46.000000 25% 75.000000 50% 93.500000 75% 126.000000 max 230.000000

Name: horsepower, dtype: float64 0 0.565217

1 0.717391 2 0.652174 3 0.652174 4 0.608696

Name: horsepower, dtype: float64 count 392.000000

mean 0.454215 std 0.167353 min 0.200000 25% 0.326087 50% 0.406522 75% 0.547826 max 1.000000

Name: horsepower, dtype: float64

# horsepower 열의 누락 데이터('?') 삭제하고 실수형으로 변환 auto_df

=

=None

)

auto_df.columns

=

['mpg','cylinders','displacement','horsepower','weight','acceleration','model year auto_df['horsepower'].replace('?', np.nan, inplace

=True

) #'?'을 np.nan으로 변경

auto_df.dropna(subset

=

= 0, inplace =True

) # 누락데이터 행을 제거 auto_df['horsepower']

=

auto_df['horsepower'].astype('float') # 문자열을 실수형으로 변경

#horsepower 열의 통계 요약정보로 최대값(max)을 확인

print(auto_df.horsepower.describe())

print()

# horsepower 열의 최대값의 절대값으로 모든 데이터를 나눠서 저장

auto_df.horsepower

=

auto_df.horsepower

/

abs(auto_df.horsepower.max())

print(auto_df.horsepower.head())

print()

print(auto_df.horsepower.describe())

(18)

In [43]:

count 392.000000 mean 104.469388 std 38.491160 min 46.000000 25% 75.000000 50% 93.500000 75% 126.000000 max 230.000000

Name: horsepower, dtype: float64 0 0.456522

1 0.646739 2 0.565217 3 0.565217 4 0.510870

Name: horsepower, dtype: float64 count 392.000000

mean 0.317768 std 0.209191 min 0.000000 25% 0.157609 50% 0.258152 75% 0.434783 max 1.000000

Name: horsepower, dtype: float64

auto_df

=

=None

)

auto_df.columns

=

['mpg','cylinders','displacement','horsepower','weight','acceleration','model year

#horsepower 열의 누락 데이터('?') 삭제하고 실수형으로 변환

auto_df['horsepower'].replace('?', np.nan, inplace

=True

) #'?'을 np.nan으로 변경 auto_df.dropna(subset

=

= 0, inplace =True

) # 누락데이터 행을 제거 auto_df['horsepower']

=

auto_df['horsepower'].astype('float') # 문자열을 실수형으로 변환

# horsepower 열의 통계 요약정보로 최대값 (max)과 최소값(min)을 확인

print(auto_df.horsepower.describe())

print()

# horsepower 각 열 데이터에서 해당 열의 최소값을 뺀 값을 분자, 해당 열의 최대값 - 최소값을 분모

# 가장 큰 값은 역시 1

min_x

=

auto_df.horsepower

-

auto_df.horsepower.min()

min_max

=

auto_df.horsepower.max()

-

auto_df.horsepower.min() auto_df.horsepower

=

min_x

/

min_max