Summarizing Data

Academic year: 2022

(2)

Statistics

(3)

[Diagram: probability vs. statistics, connected by sampling and inference]

(4)

Distribution ?

(5)

Distribution :

A mathematical way to represent the diversity of characteristics of a group.

The group may be a sample or a population.

• population distribution

• distribution of a sample

(6)

distribution of a sample    population distribution
realistic                   imaginary
data                        theory (model)

statistics

(7)
(8)

Statistics starts from data.

(9)

Data are clues to the truth,

and they tell us about the truth.

Data are not just sets of numbers.

(10)

The 1st principle of statistics :

The sample is not the same as the population,

but the sample represents the population

sufficiently well.

(11)

(12)

Datawork

(13)

Woodwork & Datawork

Woodwork                    Datawork
• From forest               • From real world
• Making timber             • Data collecting
• Inspecting wood grain     • Exploring data
• Cutting                   • Reducing data
• Structuring               • Modeling
• Finishing                 • Evaluating

(14)

Craft & Endeavor

(15)

Tools & Skills

(16)

• Paper, pencil & calculator

• Spreadsheet SW (Excel)

• Minitab, SPSS, SAS, R

• DBMS ( Access, Oracle, …)

• C/C++, Java, Python, …

Statistical tools

You need skill to use these.

(17)

You also need craft and experience.

However, the more important point in datawork is to try to gain perspective on the data at hand.

(18)

There is no standard recipe for good datawork.

Think, think and think !

That’s the only way.

(19)
(20)

Datawork is not magic. It's hard work.

Salagadoola mechicka boola,

bibbidi-bobbidi-boo --

(21)

Wood grain ?

(22)

Grain of data ?

(23)

Seeing the grain of data

Exploratory Data Analysis

(24)

The step in which we check the basic properties of the data using basic statistical methods.

With EDA, we aim to develop insight into the data as a first step toward more specific analysis.

Exploratory Data Analysis (EDA)

(25)

Qualitative variable

• frequency table

• crosstabulation (contingency table)

• bar chart, pie chart, ….

Basic Statistical Methods

(26)

• (cumulative) frequency distribution

• histogram

• dot-plot

• stem & leaf diagram

• scatter plot

• box plot, ….

Quantitative scale

Basic Statistical Methods

(27)

• 12 variables and 100 observations

• Many types of ‘offer’ to cardholders

• Goal: find the type of ‘offer’ that increases cardholders’ usage the most.

Credit_Card_Bank: p22 of SVV

Example Data

(28)

[1] "Offer.Status" (Categorical)
[2] "Charges.Aug.2008" (Quantitative)
[3] "Charges.Sept.2008" (Quantitative)
[4] "Charges.Oct.2008" (Quantitative): oct08, with loct08 = log(oct08)
[5] "Marketing.Segment" (Categorical): mseg
[6] "Industry.Segment" (Categorical): iseg
[7] "Spendlift.After.Promotion" (Quantitative)
[8] "Pre.Promotion.Avg.Spend" (Quantitative)
[9] "Post.Promotion.Avg.Spend" (Quantitative)
[10] "Retail.Customer" (Yes, No)
[11] "Enrolled.in.Program" (Yes, No)
[12] "Spendlift.Positive" (Yes, No)

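data.svv <- dir("c:/temp/text")                          # file names in the data folder
dfile.svv <- paste("c:/temp/text/", data.svv, sep="")    # full paths to those files
dsv <- read.table(dfile.svv[1], header=TRUE, sep="\t")   # first file: tab-separated, with header row
names(dsv)
oct08 <- dsv[,4]; loct08 <- log(oct08); xoct08 <- loct08[oct08>0]   # keep logs of positive charges only
mseg <- dsv[,5]; iseg <- dsv[,6]                         # marketing and industry segments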

(29)

log(oct08), rounded to the 2nd decimal place (log(0) = -Inf):

round(loct08,2)

 [1] -Inf  6.21  3.96  3.84  6.96  6.95  7.89  7.35  3.97  5.97
[11]  5.50  8.00  6.30  3.13  4.58  8.89  3.81  7.00  8.37  7.85
[21]  7.42  6.86  8.45  6.12  5.62  8.21  6.91  6.87  7.15  5.46
[31]  6.71  6.12  -Inf  7.68  9.08  5.91  3.42  6.12  8.05  7.03
[41]  6.02  2.51  7.20  3.29  7.44  5.88  6.33  6.24  4.33  5.93
[51]  5.25  7.85  8.76  7.15  7.95  7.13  -Inf  7.13  8.11  8.05
[61]  9.11  5.56  8.24  -Inf  7.47  6.70  7.52  6.53  8.33  4.63
[71]  6.80  5.72  7.54  3.48  7.57  8.42  8.16  4.67  7.16  5.61
[81]  -Inf 10.42  8.73  4.85  -Inf  6.63  5.48  4.89  8.35  4.65
[91]  5.56  7.39  3.11  3.90  5.72  7.10  -Inf  7.58  8.15  6.30

(30)

Sorted values of log(oct08), after deleting the 7 cases of -Inf:

round(sort(xoct08),2)

 [1]  2.51  3.11  3.13  3.29  3.42  3.48  3.81  3.84  3.90  3.96
[11]  3.97  4.33  4.58  4.63  4.65  4.67  4.85  4.89  5.25  5.46
[21]  5.48  5.50  5.56  5.56  5.61  5.62  5.72  5.72  5.88  5.91
[31]  5.93  5.97  6.02  6.12  6.12  6.12  6.21  6.24  6.30  6.30
[41]  6.33  6.53  6.63  6.70  6.71  6.80  6.86  6.87  6.91  6.95
[51]  6.96  7.00  7.03  7.10  7.13  7.13  7.15  7.15  7.16  7.20
[61]  7.35  7.39  7.42  7.44  7.47  7.52  7.54  7.57  7.58  7.68
[71]  7.85  7.85  7.89  7.95  8.00  8.05  8.05  8.11  8.15  8.16
[81]  8.21  8.24  8.33  8.35  8.37  8.42  8.45  8.73  8.76  8.89
[91]  9.08  9.11 10.42
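The handling of log(0) can be sketched on a small made-up vector. The values below are hypothetical and are not taken from the Credit_Card_Bank file; only the pattern (take logs, drop the -Inf cases, sort and round) follows the slides.

```r
# Hypothetical charges; the zeros mimic cardholders with no spending.
charges <- c(0, 500, 52.5, 0, 1050)

lcharges <- log(charges)            # log(0) evaluates to -Inf in R
xcharges <- lcharges[charges > 0]   # keep only the logs of positive charges

n.dropped <- sum(is.infinite(lcharges))   # number of -Inf cases removed
n.dropped
round(sort(xcharges), 2)                  # sorted, rounded to 2 decimals
```

Filtering on the original scale (charges > 0) rather than on the log scale keeps the selection rule easy to read.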

(31)

iseg

 [1] B B A T A T B A A T B T T B B B B B T B R A T A A
[26] R B B R T T T A A B T B A R B B A T B B R T T A A
[51] B A B B T A A T B A B A B R B A A R A T B T T B R
[76] T A T A A B B B T R T T R T B A A A A A A B T A T
Levels: A B R T

The meanings of the levels are not known.

(32)

mseg

L: low, B: below medium, M: medium, A: above medium, H: high

levels(mseg)<-c("M","H","L","A","B")
mseg<-factor(mseg, levels=c("L","B","M","A","H"), ordered=TRUE)   # ordered=TRUE gives the "<" in the printed levels
mseg

 [1] M L L M B A L A M H M L A M M B L B H L
[21] H B L H H M A B H L A H A B L H L B A A
[41] A H A L L H L A B A A B A B B A M A B L
[61] L B B H B A B A B L B A H L M L L M A B
[81] A L L M H A H H L A H L B A H A L L L H
Levels: L < B < M < A < H
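The relabelling and reordering of the levels can be sketched as follows. The raw labels are an assumption based on the row names of the frequency table later in the slides, and `seg` is a hypothetical stand-in for mseg.

```r
# Hypothetical raw labels, assumed to match the spender categories in the data.
seg <- factor(c("Low Spender", "High Spender", "Average Spender",
                "Med Low Spender", "Med High Spender", "Low Spender"))

levels(seg)   # alphabetical: Average, High, Low, Med High, Med Low

# Rename the levels in their current (alphabetical) order ...
levels(seg) <- c("M", "H", "L", "A", "B")

# ... then impose the meaningful order L < B < M < A < H.
seg <- factor(seg, levels = c("L", "B", "M", "A", "H"), ordered = TRUE)
seg
table(seg)
```

Assigning to `levels()` only renames; it is the second `factor()` call that changes the order in which tables and plots list the categories.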

(33)

Histogram of loct08

[Figure: histogram; x-axis: loct08 (2 to 10), y-axis: Frequency (0 to 20)]

hist(xoct08,col="grey")

(34)

Stem and leaf display:

stem(xoct08)

leaf unit = 0.1

  2 | 5
  3 | 11345889
  4 | 003667789
  5 | 3555666677999
  6 | 0011122333567789999
  7 | 000111122244445556678999
  8 | 0001222234444789
  9 | 11
 10 | 4

In the first row, 2 is a stem and 5 is a leaf: 2 | 5 stands for 2.5.

(35)

stem(10*xoct08)

leaf unit = 1

  2 | 5
  3 | 11345889
  4 | 003667789
  5 | 3555666677999
  6 | 0011122333567789999
  7 | 000111122244445556678999
  8 | 0001222234444789
  9 | 11
 10 | 4

Here 2 | 5 stands for 25.

(36)

5-number summary of log(oct08):

summary(xoct08)

Min.     Q1       Median   Q3       Max.
2.509    5.563    6.864    7.682    10.420

IQR = Q3 - Q1 = 2.119

(37)

Quartiles: Q1, Q2, Q3

Q1: the value ranked at 25% from the lowest
Q2: the value ranked at 50% from the lowest
Q3: the value ranked at 75% from the lowest

Median = Q2

IQR (Inter-Quartile Range) = Q3 – Q1

(38)

How to take Q1, Q2, Q3

Q1: $c = 0.25(n+1)$
Q2: $c = 0.5(n+1)$
Q3: $c = 0.75(n+1)$

If $c$ is an integer, take the $c$-th ranked value $x[c]$.

If $c$ is not an integer, take $(x[c^-] + x[c^+])/2$.

$c^-$: the largest integer below $c$
$c^+$: the smallest integer above $c$

(39)

Sorted values of log(oct08), after deleting the 7 cases of -Inf:

 [1]  2.51  3.11  3.13  3.29  3.42  3.48  3.81  3.84  3.90  3.96
[11]  3.97  4.33  4.58  4.63  4.65  4.67  4.85  4.89  5.25  5.46
[21]  5.48  5.50  5.56  5.56  5.61  5.62  5.72  5.72  5.88  5.91
[31]  5.93  5.97  6.02  6.12  6.12  6.12  6.21  6.24  6.30  6.30
[41]  6.33  6.53  6.63  6.70  6.71  6.80  6.86  6.87  6.91  6.95
[51]  6.96  7.00  7.03  7.10  7.13  7.13  7.15  7.15  7.16  7.20
[61]  7.35  7.39  7.42  7.44  7.47  7.52  7.54  7.57  7.58  7.68
[71]  7.85  7.85  7.89  7.95  8.00  8.05  8.05  8.11  8.15  8.16
[81]  8.21  8.24  8.33  8.35  8.37  8.42  8.45  8.73  8.76  8.89
[91]  9.08  9.11 10.42

n = 93, so 0.25*94 = 23.5, 0.5*94 = 47, 0.75*94 = 70.5
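The (n+1) rule can be written as a small function. This is a minimal sketch: `quartile_rule` is a hypothetical helper name, the test vectors are made up, and positions that fall outside 1..n (possible for very small n) are not handled. Note that R's `summary()` and `quantile()` interpolate differently by default, so their Q1 and Q3 need not agree exactly with this rule.

```r
# Quartile by the (n + 1) rule: c = p * (n + 1).
# If c is an integer, floor(c) == ceiling(c), so the average is just x[c];
# otherwise it is the mean of the two neighbouring ranked values.
quartile_rule <- function(x, p) {
  x <- sort(x)
  pos <- p * (length(x) + 1)
  (x[floor(pos)] + x[ceiling(pos)]) / 2
}

x10 <- 1:10                                      # n = 10: c = 2.75, 5.5, 8.25
sapply(c(0.25, 0.5, 0.75), function(p) quartile_rule(x10, p))

x11 <- c(2, 4, 4, 5, 7, 8, 9, 11, 12, 13, 15)    # n = 11: c = 3, 6, 9
sapply(c(0.25, 0.5, 0.75), function(p) quartile_rule(x11, p))

quantile(x10, c(0.25, 0.5, 0.75))                # R's default, for comparison
```

For the 93 log-charges, the positions 23.5, 47 and 70.5 mean that Q1 averages the 23rd and 24th ranked values, Q2 is the 47th, and Q3 averages the 70th and 71st.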

(40)

Dot plot

[Figure: dot plot; x-axis: loct08 (2 to 12)]

(41)

Box plot of oct08

[Figure: box plot; axis from 0 to 30000]

boxplot(oct08)

Box plot of log(oct08)

[Figure: box plot; axis from 4 to 10]

boxplot(xoct08)

(42)

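[Diagram: anatomy of a box plot. The box spans Q1 to Q3, so its length is the IQR, with a line at Q2. Each whisker reaches the most extreme non-outlier, min(non-outlier) on the low side and max(non-outlier) on the high side, and extends at most 1.5 IQR beyond the box. Points marked * beyond the whiskers are mild outliers or extreme outliers.]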
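The 1.5 IQR rule can be sketched by hand on a made-up sample. The data are hypothetical, and `quantile()` is used with R's default method, so the fences may differ slightly from the hinges that `boxplot()` computes internally.

```r
# Hypothetical sample with one unusually large value.
x <- c(2.5, 3.1, 4.6, 5.5, 6.1, 6.9, 7.2, 7.7, 8.4, 14.0)

q <- quantile(x, c(0.25, 0.75))   # Q1 and Q3
iqr <- q[2] - q[1]
lower <- q[1] - 1.5 * iqr         # fences 1.5 IQR beyond each end of the box
upper <- q[2] + 1.5 * iqr

outliers <- x[x < lower | x > upper]             # points drawn individually
whiskers <- range(x[x >= lower & x <= upper])    # min and max non-outliers

outliers
whiskers
```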

(43)

Frequency table

                    freq   %freq   cum. freq   %cum. freq
Low Spender          26    0.26       26          0.26
Med Low Spender      20    0.20       46          0.46
Average Spender      11    0.11       57          0.57
Med High Spender     25    0.25       82          0.82
High Spender         18    0.18      100          1.00
---
Total               100    1.00

table(mseg)
table(mseg)/length(mseg)
cumsum(table(mseg))
cumsum(table(mseg))/length(mseg)
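The four columns can also be assembled into one table. This is a sketch that rebuilds a stand-in for mseg from the counts shown in the slide's frequency table; `seg`, `n` and `ftab` are hypothetical names.

```r
# Counts copied from the frequency table; seg is a stand-in for mseg.
counts <- c(L = 26, B = 20, M = 11, A = 25, H = 18)
seg <- factor(rep(names(counts), counts), levels = names(counts))

freq <- table(seg)
n <- setNames(as.vector(freq), names(freq))   # plain named vector of counts

ftab <- data.frame(freq     = n,
                   rel.freq = n / sum(n),
                   cum.freq = cumsum(n),
                   cum.rel  = cumsum(n) / sum(n),
                   row.names = names(n))
ftab
```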

(44)

Bar chart of log(oct08)

[Figure: bar chart of binned log(oct08); bins (2,3], (3,4], (4,5], (5,6], (6,7], (7,8], (8,9], (9,10], (10,11]; frequency axis 0 to 20]
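One way to obtain such a chart is to bin the values with `cut()` and plot the resulting counts. The slides do not show the code used, so this is only a sketch, and the vector `x` is a hypothetical stand-in for xoct08.

```r
# Hypothetical log-charges; use xoct08 instead once the data are loaded.
x <- c(2.51, 3.11, 3.96, 4.33, 5.25, 5.97, 6.12, 6.86,
       7.00, 7.85, 8.00, 9.11, 10.42)

bins <- cut(x, breaks = 2:11)   # right-closed bins (2,3], (3,4], ..., (10,11]
bin.freq <- table(bins)         # frequency of each bin, empty bins included
bin.freq

barplot(bin.freq)               # bars are drawn with gaps, unlike hist()
```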

(45)

Histogram & Bar chart

Histogram: for quantitative variables; connected bars

Bar chart: for categorical variables; disconnected bars

(46)

Contingency table of mseg and iseg

mseg \ iseg     A     B     R     T   Total
L               5    13     0     8     26
B              11     8     0     1     20
M               2     4     2     3     11
A               8     7     2     8     25
H               5     0     6     7     18
Total          31    32    10    27    100

table(mseg,iseg)
apply(table(mseg,iseg),1,sum)
apply(table(mseg,iseg),2,sum)
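The margins can also be attached to the table in one step with `addmargins()`. The sketch below re-enters the cell counts from the slide instead of reading the data file; `tab` is a hypothetical name.

```r
# Cell counts copied from the contingency table (rows: mseg, columns: iseg).
tab <- as.table(matrix(c( 5, 13, 0, 8,
                         11,  8, 0, 1,
                          2,  4, 2, 3,
                          8,  7, 2, 8,
                          5,  0, 6, 7),
                       nrow = 5, byrow = TRUE,
                       dimnames = list(mseg = c("L", "B", "M", "A", "H"),
                                       iseg = c("A", "B", "R", "T"))))

addmargins(tab)                 # appends row and column totals
round(prop.table(tab, 1), 2)    # row proportions: iseg mix within each mseg
```

With the real data, `addmargins(table(mseg, iseg))` would give the same layout directly.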

(47)

Pie chart of iseg

[Figure: pie chart with slices A (31), B (32), R (10), T (27)]

pie(table(iseg),col=c("red","light green","green","blue"))

(48)

Segmented bar chart of (mseg, iseg) - serial

[Figure: one stacked bar for each of iseg = A, B, R, T, segmented by mseg; count axis 0 to 30]

barplot(table(mseg,iseg),col=c("red","light green","green","blue","purple"))

(49)

Segmented bar chart of (mseg, iseg) - parallel

[Figure: side-by-side bars for the mseg levels within each of iseg = A, B, R, T; count axis 0 to 12]

barplot(table(mseg,iseg),col=c("red","light green","green","blue","purple"),beside=TRUE)

(50)

Mosaic Plot

[Figure: mosaic plot; horizontal axis: iseg (A, B, R, T), vertical axis: mseg (L, B, M, A, H)]

mosaicplot(~iseg+mseg,col=rainbow(5))

(51)

Box plot of log(oct08) by mseg

[Figure: side-by-side box plots for mseg = L, B, M, A, H; vertical axis 4 to 10]

boxplot(loct08[oct08>0]~mseg[oct08>0])

(52)

 A    B    C    D    E    F
10   11    0    3    3   11
 7   17    1    5    5    9
20   21    7   12    3   15
14   11    2    6    5   22
14   16    3    4    3   15
12   14    1    3    6   16
10   17    2    5    1   13
23   17    1    5    1   10
17   19    3    5    3   26
20   21    0    5    2   26
14    7    1    2    6   24
13   13    4    4    4   13

(53)

InsectSprays data

[Figure: plot of insect counts by spray; x-axis: Type of spray (A to F), y-axis: Insect count (0 to 25)]
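InsectSprays ships with R in the datasets package, so this example can be reproduced without any external file. The slides do not show which plotting call was used; a grouped box plot is assumed here as one reasonable choice.

```r
data(InsectSprays)    # 72 observations: count (numeric), spray (factor A to F)
str(InsectSprays)

# Wide layout, one column per spray, as in the table of counts.
wide <- unstack(InsectSprays, count ~ spray)
wide

# One box per spray type.
boxplot(count ~ spray, data = InsectSprays,
        xlab = "Type of spray", ylab = "Insect count")

# Median insect count for each spray.
tapply(InsectSprays$count, InsectSprays$spray, median)
```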

(54)

Thank you !!
