Summarizing Data
Statistics
Probability vs. statistics
[Diagram: probability and statistics, connected by sampling and inference]
Distribution?
Distribution: a mathematical way to represent the diversity of characteristics of a group.
The group may be a sample or a population.
• population distribution
• distribution of a sample
Distribution of a sample | Population distribution
real | imaginary
data | theory (model)
Statistics links the two.
Statistics starts from data.
Data are clues to the truth and tell us something about it.
Data are not just sets of numbers.
The 1st principle of statistics:
The sample is not the same as the population, but the sample represents the population sufficiently well:
sample ≈ population
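The principle can be illustrated with a small simulation. The sketch below is only an illustration under assumptions that are not in the slides: the population is an artificial set of 100,000 normal values, and the sample is a simple random sample of 1,000 of them.

```r
# Hypothetical population (an assumption for illustration, not real data)
set.seed(1)
population <- rnorm(100000, mean = 50, sd = 10)

# A simple random sample, drawn without replacement
smp <- sample(population, size = 1000)

# The sample is not the population, but its summaries are close
pop_summary <- c(mean = mean(population), sd = sd(population),
                 median = median(population))
smp_summary <- c(mean = mean(smp), sd = sd(smp), median = median(smp))
print(round(rbind(population = pop_summary, sample = smp_summary), 2))
```

The two rows of the printed table differ slightly, which is the point: the sample approximates the population without reproducing it.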
Woodwork & Datawork
Woodwork
• From forest
• Making timber
• Inspecting wood grain
• Cutting
• Structuring
• Finishing
Datawork
• From real world
• Data collecting
• Exploring data
• Reducing data
• Modeling
• Evaluating
Craft & Endeavor
Tools & Skills
• Paper, pencil & calculator
• Spreadsheet SW (Excel)
• Minitab, SPSS, SAS, R
• DBMS ( Access, Oracle, …)
• C/C++, Java, Python, …
Statistical tools
You need skill to use these tools.
You also need craft and experience.
More important in datawork, however, is trying to gain perspective on the data in your hands.
There is no standard recipe for good datawork.
Think, think, and think!
That's the only way.
Datawork is not magic. It's a hard job.
Salagadoola mechicka boola, bibbidi-bobbidi-boo
Wood grain? Grain of data?
Seeing the grain of data ≈ Exploratory Data Analysis
EDA is the step that checks the basic properties of the data, using basic statistical methods.
Through EDA we aim to develop insight into the data, as a first step toward more specific analysis.
Exploratory Data Analysis (EDA)
Basic Statistical Methods
Qualitative variable
• frequency table
• crosstabulation (contingency table)
• bar chart, pie chart, …
Quantitative variable
• (cumulative) frequency distribution
• histogram
• dot plot
• stem-and-leaf diagram
• scatter plot
• box plot, …
Example Data
Credit_Card_Bank: p22 of SVV
• 12 variables and 100 observations
• Many types of 'offer' to cardholders
• Goal: to find the type of 'offer' that increases cardholders' usage the most.
[1] "Offer.Status" (Categorical)
[2] "Charges.Aug.2008" (Quantitative)
[3] "Charges.Sept.2008" (Quantitative)
[4] "Charges.Oct.2008" (Quantitative)
[5] "Marketing.Segment" (Categorical)
[6] "Industry.Segment" (Categorical)
[7] "Spendlift.After.Promotion" (Quantitative)
[8] "Pre.Promotion.Avg.Spend" (Quantitative)
[9] "Post.Promotion.Avg.Spend" (Quantitative)
[10] "Retail.Customer" (Yes, No)
[11] "Enrolled.in.Program" (Yes, No)
[12] "Spendlift.Positive" (Yes, No)
Short names used in the R code:
oct08 = Charges.Oct.2008
mseg = Marketing.Segment
iseg = Industry.Segment
loct08 = log(oct08)
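# List the data files and build their full paths
data.svv <- dir("c:/temp/text")
dfile.svv <- paste("c:/temp/text/", data.svv, sep="")
# Read the first file (tab-separated, with a header row);
# stringsAsFactors=TRUE keeps text columns as factors (the default before R 4.0)
dsv <- read.table(dfile.svv[1], header=TRUE, sep="\t", stringsAsFactors=TRUE)
names(dsv)
# October charges, their logs, and the logs of the positive charges only
oct08 <- dsv[,4]; loct08 <- log(oct08); xoct08 <- loct08[oct08>0]
# Marketing and industry segments
mseg <- dsv[,5]; iseg <- dsv[,6]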
log(oct08), rounded to the 2nd decimal place (log(0) = -Inf):
round(loct08,2)
 [1] -Inf  6.21  3.96  3.84  6.96  6.95  7.89  7.35  3.97  5.97
[11] 5.50  8.00  6.30  3.13  4.58  8.89  3.81  7.00  8.37  7.85
[21] 7.42  6.86  8.45  6.12  5.62  8.21  6.91  6.87  7.15  5.46
[31] 6.71  6.12  -Inf  7.68  9.08  5.91  3.42  6.12  8.05  7.03
[41] 6.02  2.51  7.20  3.29  7.44  5.88  6.33  6.24  4.33  5.93
[51] 5.25  7.85  8.76  7.15  7.95  7.13  -Inf  7.13  8.11  8.05
[61] 9.11  5.56  8.24  -Inf  7.47  6.70  7.52  6.53  8.33  4.63
[71] 6.80  5.72  7.54  3.48  7.57  8.42  8.16  4.67  7.16  5.61
[81] -Inf 10.42  8.73  4.85  -Inf  6.63  5.48  4.89  8.35  4.65
[91] 5.56  7.39  3.11  3.90  5.72  7.10  -Inf  7.58  8.15  6.30
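The -Inf entries come from customers with zero charges. A minimal sketch of why they appear and how they are dropped, using made-up charges (the vector `charges` is hypothetical, not the bank data):

```r
# Hypothetical charges, including zeros (not the real data)
charges <- c(0, 500, 52, 0, 1050)

log_charges <- log(charges)          # log(0) is -Inf in R
print(round(log_charges, 2))

# Keep only the logs of strictly positive charges, as xoct08 does
positive_logs <- log_charges[charges > 0]

# Equivalent filter that looks at the transformed values directly
finite_logs <- log_charges[is.finite(log_charges)]

n_dropped <- sum(charges == 0)       # number of -Inf cases removed
print(n_dropped)
```

Filtering on the original scale (`charges > 0`) and on the log scale (`is.finite`) gives the same result here; the slides use the first form.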
Sorted values of log(oct08), after deleting the 7 cases of -Inf:
round(sort(xoct08),2)
 [1] 2.51 3.11 3.13 3.29 3.42 3.48 3.81 3.84 3.90 3.96
[11] 3.97 4.33 4.58 4.63 4.65 4.67 4.85 4.89 5.25 5.46
[21] 5.48 5.50 5.56 5.56 5.61 5.62 5.72 5.72 5.88 5.91
[31] 5.93 5.97 6.02 6.12 6.12 6.12 6.21 6.24 6.30 6.30
[41] 6.33 6.53 6.63 6.70 6.71 6.80 6.86 6.87 6.91 6.95
[51] 6.96 7.00 7.03 7.10 7.13 7.13 7.15 7.15 7.16 7.20
[61] 7.35 7.39 7.42 7.44 7.47 7.52 7.54 7.57 7.58 7.68
[71] 7.85 7.85 7.89 7.95 8.00 8.05 8.05 8.11 8.15 8.16
[81] 8.21 8.24 8.33 8.35 8.37 8.42 8.45 8.73 8.76 8.89
[91] 9.08 9.11 10.42
iseg (the meaning of the levels is not known):
iseg
 [1] B B A T A T B A A T B T T B B B B B T B R A T A A
[26] R B B R T T T A A B T B A R B B A T B B R T T A A
[51] B A B B T A A T B A B A B R B A A R A T B T T B R
[76] T A T A A B B B T R T T R T B A A A A A A B T A T
Levels: A B R T
mseg
L: low, B: below medium, M: medium, A: above medium, H: high
# Rename the original levels, in their stored order
levels(mseg)<-c("M","H","L","A","B")
# Reorder the levels from low to high; ordered=TRUE gives the "<" display below
mseg<-factor(mseg, levels=c("L","B","M","A","H"), ordered=TRUE)
mseg
 [1] M L L M B A L A M H M L A M M B L B H L
[21] H B L H H M A B H L A H A B L H L B A A
[41] A H A L L H L A B A A B A B B A M A B L
[61] L B B H B A B A B L B A H L M L L M A B
[81] A L L M H A H H L A H L B A H A L L L H
Levels: L < B < M < A < H
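The two steps applied to mseg do different things: assigning to `levels()` renames the levels, while `factor(..., levels=)` reorders them. A small sketch with made-up labels (the vector `seg` is hypothetical):

```r
# Hypothetical segment labels (not the real data)
seg <- factor(c("Low Spender", "High Spender", "Average Spender", "Low Spender"))
print(levels(seg))   # stored alphabetically: Average, High, Low

# Step 1: rename the levels, in their stored (alphabetical) order
levels(seg) <- c("M", "H", "L")

# Step 2: reorder the levels from low to high; the data values do not change
seg <- factor(seg, levels = c("L", "M", "H"), ordered = TRUE)

print(seg)           # Levels: L < M < H
print(table(seg))    # counts now appear in the order L, M, H
```

Getting the order of the new names wrong in step 1 silently mislabels the data, so it is worth printing `levels()` before renaming.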
Histogram of loct08
hist(xoct08,col="grey")
[Figure: histogram; x-axis: loct08 (2 to 10); y-axis: Frequency (0 to 20)]
Stem and leaf display:
stem(xoct08)
leaf unit = 0.1
   2 | 5
   3 | 11345889
   4 | 003667789
   5 | 3555666677999
   6 | 0011122333567789999
   7 | 000111122244445556678999
   8 | 0001222234444789
   9 | 11
  10 | 4
Each row has a stem (left of |) and leaves (right of |). The first row, 2 | 5, is the value 2.5.
stem(10*xoct08)
leaf unit = 1
   2 | 5
   3 | 11345889
   4 | 003667789
   5 | 3555666677999
   6 | 0011122333567789999
   7 | 000111122244445556678999
   8 | 0001222234444789
   9 | 11
  10 | 4
Here the first row, 2 | 5, is the value 25.
5 number summary of log(oct08):
summary(xoct08)
  Min.     Q1  Median     Q3    Max.
 2.509  5.563   6.864  7.682  10.420
IQR = Q3 - Q1 = 2.119
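The same summaries can be obtained with a few base R functions. A minimal sketch on a small made-up vector (`x` is an assumption for illustration, not the bank data):

```r
# Small made-up data set (an assumption for illustration)
x <- c(7, 1, 9, 3, 5, 2, 8, 4, 6)

five <- fivenum(x)   # Tukey's five numbers: min, lower hinge, median, upper hinge, max
print(five)

print(summary(x))    # also reports the mean, which the slide omits

iqr <- IQR(x)        # Q3 - Q1 using R's default quantile rule
print(iqr)
```

`fivenum()` and `summary()` use slightly different quartile definitions, so their middle values can differ on other data.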
Quartiles: Q1, Q2, Q3
Q1: the value ranked at 25% from the lowest
Q2: the value ranked at 50% from the lowest
Q3: the value ranked at 75% from the lowest
Median = Q2
IQR (Inter-Quartile Range) = Q3 - Q1
How to take Q1, Q2, Q3
Q1: c = 0.25*(n+1)
Q2: c = 0.5*(n+1)
Q3: c = 0.75*(n+1)
If c is an integer, take the c-th ranked value x[c].
If c is not an integer, take (x[c-] + x[c+])/2, where
c-: the largest integer below c
c+: the smallest integer above c
Example: the sorted values of log(oct08) listed earlier, after deleting the 7 cases of -Inf, have n = 93.
Q1: c = 0.25*94 = 23.5, so Q1 = (x[23] + x[24])/2 = (5.56 + 5.56)/2 = 5.56
Q2: c = 0.5*94 = 47, so Q2 = x[47] = 6.86
Q3: c = 0.75*94 = 70.5, so Q3 = (x[70] + x[71])/2 = (7.68 + 7.85)/2 = 7.765
R's summary() uses a different interpolation rule, which is why it reports Q3 = 7.682.
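The rank rule can be written as a short function and compared with R's built-in `quantile()`. This is a sketch, not code from the slides: `slide_quantile` is a hypothetical helper name, and the data vector is made up.

```r
# Quartile by the rank rule: rank = p * (n + 1).
# slide_quantile is a hypothetical helper; it assumes 1 <= rank <= n.
slide_quantile <- function(x, p) {
  x <- sort(x)
  pos <- p * (length(x) + 1)                 # the rank called c in the rule
  if (pos == floor(pos)) {
    x[pos]                                   # integer rank: take x[c]
  } else {
    (x[floor(pos)] + x[ceiling(pos)]) / 2    # average the two neighbours
  }
}

# Small made-up data set with n = 9, so the ranks are 2.5, 5 and 7.5
x <- c(2, 4, 4, 5, 7, 8, 9, 11, 12)
by_slide <- sapply(c(0.25, 0.5, 0.75), function(p) slide_quantile(x, p))
by_r <- unname(quantile(x, c(0.25, 0.5, 0.75)))   # R's default rule (type 7)

print(rbind(by_slide, by_r))
```

On this vector the two rules agree for Q1 and Q2 but not for Q3, the same kind of small discrepancy seen between the hand calculation and `summary(xoct08)`.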
Dot plot of loct08
[Figure: dot plot; axis: loct08 (2 to 12)]
Box plot of oct08
boxplot(oct08)
[Figure: box plot; axis: 0 to 30000]
Box plot of log(oct08)
boxplot(xoct08)
[Figure: box plot; axis: 4 to 10]
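Reading a box plot
• The box runs from Q1 to Q3, so its length is the IQR; the line inside the box is Q2.
• The whiskers extend to min(non-outlier) and max(non-outlier).
• Points more than 1.5 IQR beyond the box are outliers, marked *: mild outliers and, farther out, extreme outliers.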
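The 1.5 IQR rule can be checked by hand. A minimal sketch on made-up data (`x` is hypothetical); note that `boxplot()` itself takes the box edges from `fivenum()` hinges, so its fences can differ slightly from the ones computed here.

```r
# Made-up data with one unusually large value (an assumption for illustration)
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 100)

q <- unname(quantile(x, c(0.25, 0.75)))   # Q1 and Q3 by R's default rule
iqr <- q[2] - q[1]

# Fences 1.5 IQR beyond the box
lower_fence <- q[1] - 1.5 * iqr
upper_fence <- q[2] + 1.5 * iqr

outliers <- x[x < lower_fence | x > upper_fence]
# Whisker ends: the smallest and largest non-outliers
whisker_ends <- range(x[x >= lower_fence & x <= upper_fence])

print(outliers)
print(whisker_ends)
```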
Frequency table
                  freq  rel. freq  cum. freq  cum. rel. freq
Low Spender         26       0.26         26            0.26
Med Low Spender     20       0.20         46            0.46
Average Spender     11       0.11         57            0.57
Med High Spender    25       0.25         82            0.82
High Spender        18       0.18        100            1.00
Total              100       1.00
table(mseg)
table(mseg)/length(mseg)
cumsum(table(mseg))
cumsum(table(mseg))/length(mseg)
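The four columns can also be assembled into one table. The sketch below starts from the five counts shown in the frequency table rather than from the raw data; the object names are assumptions.

```r
# Counts taken from the frequency table above
freq <- c("Low Spender" = 26, "Med Low Spender" = 20, "Average Spender" = 11,
          "Med High Spender" = 25, "High Spender" = 18)

freq_table <- data.frame(
  freq = freq,
  rel_freq = freq / sum(freq),               # relative frequency
  cum_freq = cumsum(freq),                   # cumulative frequency
  cum_rel_freq = cumsum(freq) / sum(freq)    # cumulative relative frequency
)
print(freq_table)
```

Cumulative columns are meaningful here only because the segments have a natural order from low to high.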
Bar chart of log(oct08)
[Figure: bar chart of counts per bin; bins (2,3], (3,4], (4,5], (5,6], (6,7], (7,8], (8,9], (9,10], (10,11]; y-axis: 0 to 20]
Histogram & Bar chart
Histogram: for quantitative variables; connected bars
Bar chart: for categorical variables; disconnected bars
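A quantitative variable becomes a categorical one once it is binned. The bin labels such as (2,3] look like the output of `cut()`, but that is an assumption; the slides do not show the call. A minimal sketch using a few of the log(oct08) values:

```r
# A few of the log(oct08) values listed earlier, used only as an illustration
x <- c(2.51, 3.11, 6.21, 7.00, 10.42)

# Bin into the intervals (2,3], (3,4], ..., (10,11]
bins <- cut(x, breaks = 2:11)
counts <- table(bins)
print(counts)

# Draw only in an interactive session
if (interactive()) {
  hist(x, breaks = 2:11, col = "grey")   # quantitative variable: bars touch
  barplot(counts)                         # binned categories: bars are separated
}
```

Each interval is closed on the right, so the value 7.00 is counted in (6,7], not in (7,8].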
Contingency table of mseg (rows) and iseg (columns)
          A    B    R    T  Total
L         5   13    0    8     26
B        11    8    0    1     20
M         2    4    2    3     11
A         8    7    2    8     25
H         5    0    6    7     18
Total    31   32   10   27    100
table(mseg,iseg)
# Row totals (by mseg)
apply(table(mseg,iseg),1,sum)
# Column totals (by iseg)
apply(table(mseg,iseg),2,sum)
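The margins can also be added in one call. The sketch below rebuilds the table from the counts shown above instead of the raw data; `counts`, `tab`, and `row_share` are assumed names.

```r
# Counts taken from the contingency table of mseg (rows) and iseg (columns)
counts <- matrix(c( 5, 13, 0, 8,
                   11,  8, 0, 1,
                    2,  4, 2, 3,
                    8,  7, 2, 8,
                    5,  0, 6, 7),
                 nrow = 5, byrow = TRUE,
                 dimnames = list(mseg = c("L", "B", "M", "A", "H"),
                                 iseg = c("A", "B", "R", "T")))
tab <- as.table(counts)

print(addmargins(tab))                     # adds the row and column totals

row_share <- prop.table(tab, margin = 1)   # each row rescaled to sum to 1
print(round(row_share, 2))
```

The row shares make the segments comparable even though their sizes differ (26 customers in L but only 11 in M).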
Pie chart of iseg
pie(table(iseg),col=c("red","light green","green","blue"))
[Figure: pie chart with slices A = 31, B = 32, R = 10, T = 27]
Segmented bar chart of (mseg, iseg) - serial
barplot(table(mseg,iseg),col=c("red","light green","green","blue","purple"))
[Figure: stacked bars for iseg = A, B, R, T, segmented by mseg; y-axis: 0 to 30]
Segmented bar chart of (mseg, iseg) - parallel
barplot(table(mseg,iseg),col=c("red","light green","green","blue","purple"),beside=TRUE)
[Figure: side-by-side bars for iseg = A, B, R, T, one bar per mseg level; y-axis: 0 to 12]
Mosaic Plot
mosaicplot(~iseg+mseg,col=rainbow(5))
[Figure: mosaic plot; iseg (A, B, R, T) on one axis, mseg (L, B, M, A, H) on the other]
Box plot of log(oct08) by mseg
boxplot(loct08[oct08>0]~mseg[oct08>0])
[Figure: box plots for mseg = L, B, M, A, H; y-axis: 4 to 10]
InsectSprays data
  A   B   C   D   E   F
 10  11   0   3   3  11
  7  17   1   5   5   9
 20  21   7  12   3  15
 14  11   2   6   5  22
 14  16   3   4   3  15
 12  14   1   3   6  16
 10  17   2   5   1  13
 23  17   1   5   1  10
 17  19   3   5   3  26
 20  21   0   5   2  26
 14   7   1   2   6  24
 13  13   4   4   4  13
[Figure: insect count (0 to 25) by type of spray (A to F)]
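InsectSprays ships with R, so the comparison by spray can be tried directly. The plot type on the slide cannot be recovered from the text; the box plot below is one plausible choice, not necessarily the one shown.

```r
# InsectSprays is a data set shipped with R (columns: count, spray)
str(InsectSprays)

# Median insect count for each type of spray
med <- tapply(InsectSprays$count, InsectSprays$spray, median)
print(med)

# Draw only in an interactive session
if (interactive()) {
  boxplot(count ~ spray, data = InsectSprays,
          xlab = "Type of spray", ylab = "Insect count")
}
```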