• 검색 결과가 없습니다.

ƒ Decision Support Systems

N/A
N/A
Protected

Academic year: 2022

Share "ƒ Decision Support Systems"

Copied!
54
0
0

로드 중.... (전체 텍스트 보기)

전체 글

(1)

Advanced DB

CHAPTER 18

DATA ANALYSIS & MINING

(2)

Chapter 18: Data Analysis and Mining

ƒ Decision Support Systems

ƒ Data Analysis and OLAP

ƒ Data Warehousing

ƒ Data Mining

(3)

Online Transaction Processing

ƒ Transaction processing activities are needed to capture and process data in ol ed in transactions

process data involved in transactions

ƒ Online transaction processing systems

à Real time systems that capt re and process transactions immediately à Real-time systems that capture and process transactions immediately

à Can help provide superior service to customers and other trading partners

ƒ An integral part of the every day business of an enterprise An integral part of the every day business of an enterprise

à Order processing system of an electronics distributor à Banking system of a bankg

à Order control and billing system of a hospital

à Switches and billing system of a telephone company

(4)

Decision Support Systems

ƒ Decision-support systems are used to make business decisions

à What items to stock?What items to stock?

à What insurance premium to change?

à To whom to send advertisements?

ƒ Examples of data used for making decisions

à Retail sales transaction details

à Customer profiles (income age gender etc )Customer profiles (income, age, gender, etc.)

ƒ Identifying patterns in customer behavior and Using the patterns to make business decisions

à A Sudden spurt in purchases of flannel shirts

à Most of small sports cars are bought by young woman with more than $50,000 annual salaryy

(5)

Multidimensional View

ƒ Data that can be modeled as dimension attributes and measure attributes

attributes

sales(item-name, color, size, number)

ƒ Measure attributes

à example: number

Sales

Seoul

example: number à Measure some value à Can be aggregated upon

Item

1 2 3

A B

C Busan

Kwangju

ƒ Dimension attributes

à example: item name color size

Month

1 2 3

à example: item-name, color, size

à Define the dimensions on which measure attributes, and summaries of measure attributes

(6)

Online Analytical Processing (OLAP)

ƒ Interactive analysis of summary information

ƒ Statistical analysis often requires grouping on multiple attributes

ƒ Although complex statistical analysis is best left to statistics

packages, database should support simple, commonly used, forms of data analysis

ƒ Several SQL extensions have been developed to support OLAP

tools

(7)

Cross Tabulation of sales by item-name and color

ƒ

An example of a cross-tabulation (cross-tab)

à also referred to as a pivot-table.

à Values for one of the dimension attributes form the row headers

à Values for another dimension attribute form the column headers

à Other dimension attributes are listed on top

à Values in individual cells are (aggregates of) the values of the dimension attributes that specify the cell.

dimension attributes that specify the cell.

(8)

Relational Representation of Cross-tabs

„ Cross-tabs can be represented as relations

relations

„ We use the value all is used to represent aggregates

„ Th SQL 1999 t d d t ll

„ The SQL:1999 standard actually uses null values in place of all

despite confusion with regular null values

(9)

Data Cube

„ A data cube is a multidimensional generalization of a cross-tab

„ C h di i h 3 b l

„ Can have n dimensions; we show 3 below

„ Cross-tabs can be used as views on a data cube

(10)

Online Analytical Processing Operations

ƒ Pivoting: changing the dimensions used in a cross-tab is called

ƒ Slicing: creating a cross-tab for fixed values only

à Sometimes called dicing, particularly when values for multiple dimensions are fixed

are fixed.

ƒ Rollup: moving from finer-granularity data to a coarser granularity

g y

ƒ Drill down: The opposite operation - that of moving from

coarser-granularity data to finer-granularity data g y g y

(11)

Hierarchies on Dimensions

ƒ Hierarchy on dimension attributes: lets dimensions to be viewed at different le els of detail

at different levels of detail

à E.g. the dimension DateTime can be used to aggregate by hour of day, date, day of week, month, quarter or yeary q y

(12)

Cross Tabulation With Hierarchy

ƒ Cross-tabs can be easily extended to deal with hierarchies

C d ill d ll hi h

à Can drill down or roll up on a hierarchy

(13)

OLAP Implementation

ƒ MOLAP (multidimensional OLAP)

à Multidimensional arrays in memory to store data cubes (earliest OLAPMultidimensional arrays in memory to store data cubes (earliest OLAP systems)

ƒ ROLAP (relational OLAP)

à Data stored in relational database à Data stored in relational database

ƒ HOLAP (hybrid OLAP)

à Store some summaries in memory

à Store the base data and other summaries in rdb

ƒ Many OLAP systems are implemented as client-server systems

(14)

OLAP Implementation (Cont.)

ƒ Precompute all possible aggregates in order to provide online response

response

à 2n combinations of group by

ƒ Precompute some aggregates, and compute others on demand p gg g p from one of the precomputed aggregates

à Can compute aggregate on (item-name, color) from an aggregate on (item- name color size)

name, color, size)

ƒ Several optimizations available for computing multiple aggregates

à Can compute aggregate on (item-name, color) from an aggregate on p gg g gg g (item-name, color, size)

à Can compute aggregates on (item-name, color, size), (item-name, color), and (item-name) using a single sorting of the base data

( ) g g g

(15)

Extended Aggregation in SQL:1999

ƒ The cube operation computes union of group by’s on every subset of the specified attributes

subset of the specified attributes

ƒ E.g. consider the query

select item-name, color, size, sum(number) select item name, color, size, sum(number) from sales

group by cube(item-name, color, size)

computes the union of eight different groupings of the sales relation:

{ (it l i ) (it l ) (it i )

{ (item-name, color, size), (item-name, color), (item-name, size), (color, size), (item-name), (color), (size), ( ) }

where ( ) denotes an empty group by list where ( ) denotes an empty group by list.

ƒ For each grouping, the result contains the null value for attributes not present in the grouping.

not present in the grouping.

(16)

Extended Aggregation (Cont.)

ƒ Relational representation of cross-tab of figure 18.2 can be computed by

computed by

select item-name color sum(number) select item name, color, sum(number) from sales

group by cube(item-name, color)

(17)

Data Warehouse - Motivation

ƒ Decision Support Problems in conventional systems (W. H. Inmon)

à Credibility of data: no time basis, algorithmic difference, no common sourcey , g ,

à Productivity: multiple sources in multiple environment, insufficient information, repeated work

à Data into information: data from multiple applications, consistency, past historyp pp y p y

ƒ Conventional DSS Cycle

Sales

request

Item

Month 1 2 3

A B C

executive modifications IS / IT

executive

(18)

Solution 1: Rebuild System

ƒ Pros

A & i i d d d l

à A new & consistent integrated data model

à Opportunity adopt new HW, SW, & methodology

ƒ Cons

ƒ Cons

à Large long-term project

à Change in management/organization/operations during projectg g / g / p g p j à Transition risks

à Still operations (OLTP) oriented

‚ DSS is second priority at best

(19)

Solution 2: DW-based Restructuring

DSS DSS

DSS DSS

EIS DW

Extract Transform Load

Legacy OLTP

EIS

g y

Reports Data Processing

(Transaction) Information Processing

(20)

Data Warehouse – Features

ƒ A repository (or archive) of information gathers from multiple sources stored under a unified schema at a single site

sources, stored under a unified schema, at a single site

à Once gathered, the data are stored for a long time, permitting access to historical data

à Provide the user a single consolidated interface to data, making decision- support queries easier to write

à Online transaction-processing systems are not affected by the decision-Online transaction processing systems are not affected by the decision support workload

ƒ Key Features

à Subject oriented à Integrated

à Time variant historical à Time-variant, historical

(21)

Data Warehouse – Features

ƒ Subject oriented

D i OLTP d i d/ d/ d i h

à Data in a OLTP system are designed/managed/used with a process/function-oriented view

à Data in a DW are designed/managed/used with a subject-oriented viewg g j

‚ Focus of a DSS is in the subjects of an organization (not the processes)

‚ Not all data from OLTP need to be moved to the DW

Operational Data Warehouse

Loan proc.

Deposit proc.

customer

account

Credit review operations

(22)

Data Warehouse – Features

ƒ Integrated

N bl d l

à Names: table names and column names à Units, Values and codes

à Aggregation and consolidationAggregation and consolidation

à A consistent source of information

balance

amount balance

A l A l th cm

amount

Curr. amount Appl. A length:cm

Appl. B length:inch

cm

transformation

(23)

Data Warehouse – Features

ƒ Time Variant

OLTP h f h d i ld

à OLTP: represents the current state of the domain world à DW: need to log the history of a subject

‚ Monthly call charges for K. Kim for last yeary g y

‚ Weekly sales for each brand (or item) à Time element must be present in the key

K Kim K Kim append

K. Kim addr:Busan K. Kim

addr:Seoul

append K. Kim

addr:Busan

K. Kim addr:Seoul

date:2008-4-5 update

Moved from Seoul to Busan on

2008-4-5

date:2002-7-15 OLTP

DW K. Kim

(24)

Data Warehouse – Features

ƒ Nonvolatile

Li l ( ) d

à Little (or no) updates

à Most operations are new inserts (load) à Normalization is not importantNormalization is not important

OLTP

update OLTP DW

insert

update

(25)

Data Modeling

ƒ DB Design for OLTP

à Normalization: minimize update anomalies by eliminating redundancy à Normalization: minimize update anomalies by eliminating redundancy à Transactions usually access (relatively) small number of records

à Node – table / Link – join path

à Ad hoc join paths – not easy to predict access patterns

ƒ DB Design for Data warehouse

ƒ DB Design for Data warehouse

à Historical and summary oriented à Analytical “views”

à Non-volatile

à Dimensional model

‚ Fact tables and dimensional tablesFact tables and dimensional tables

(26)

Data Modeling – OLTP

Shi Ship

Type Shipper Ship To Product

District Order Contact Product

District Credit

Order Item

Contact Loc

Product Line

Sales O d

Cust

Loc Product

Contrct

Order

Contrct

Type Customer

Loc

Contact

Product Group

Sales Rep

Sales District

Sales Division Sales

Region

(27)

Multi-dimensional Modeling – Fact table

ƒ Store all transactions for factual data about a business

ƒ Have a multi-part key (composite key)

ƒ Each element of the key is a FK to a single dimension table

ƒ The remaining fields in fact table are known as Facts

ƒ Facts are almost numeric – target of aggregation (count, sum, average, …)

ƒ much larger than dimension tables

(28)

Multi-dimensional Modeling – Dimension table

ƒ Dimensions are attribute of facts

ƒ Store hierarchical relationships between different grouping or characteristics of those transactions

Di i hi h

ƒ Dimension hierarchy

‚ Time : Year, Quarter, Week, Day

‚ Product : Group, Subgroup, Department, Class, Itemp, g p, p , ,

‚ Territory : Division, Country, Region, Distinct à Unit of drill-down & roll-up

ƒ Have single key that is joined to a Fact table

ƒ The remaining fields are called attributes

(29)

Hierarchies on Dimensions

ƒ The different levels of detail for an attribute can be organized into a hierarch

into a hierarchy

(30)

Data Modeling Example

ƒ Requirements

à A convenience store chainA convenience store chain

à Sale results by date, item, and store

à Sales by area, month, item or brand

à Performance of promotions

ƒ Fact table

à Fact event: sale transactionFact event: sale transaction

à Atomic unit: item, day, store, …

à Fact table attributes: sales quantity, amount, cost

ƒ Dimension tables

à time: key, date, month, quarter, year, day-of-week, holiday-flag, season

à item: key name category package-type size weight coloritem: key, name, category, package type, size, weight, color

(31)

Data Modeling – Multi-dim. Star Schema

item#

item name timeKey

date

Item dimension Time dimension

item name item class

brand package

date dayofWeek

month quarter timeKey

item#

store#

i # size

weight color

q year holiday season promotion#

items_sold amount_sold

cost promotion#

promo_name store#

Promotion dimension Store dimension

cost p

promo_type discount%

displays promo cost

store_name address

city zipcode

Fact table

promo_cost promo_dates

zipcode area manager

Q: The top selling brands on holidays in Q1?

Q: The top selling brands on holidays in Q1?

(32)

Example Queries

ƒ For the past 3 months, compare holiday and workday sales for each brand

each brand

à Time: restriction (month) & grouping (holiday) à Item: grouping (brand)Item: grouping (brand)

à Measure: sum(amount_sold)

ƒ Last quarter profitability per promotion type q p y p p yp

à Time: restriction (quarter)

à Promotion : grouping (promo_type)

à Measure : (sum(amount_sold)-sum(cost))/sum(promo_cost)

(33)

Data Warehouse Tuning

ƒ Characteristics

à Data Wareho ses are bigger terab tes petab tes à Data Warehouses are bigger – terabytes~petabytes à They change less often

à Queries use aggregation or complex WHERE clausesgg g p

ƒ Implications

à Scanning all data is very very slow à We can keep redundant data

à Index is important

ƒ T i t h i

ƒ Tuning techniques

à Bitmap index à Materialized viewate a ed v ew à Approximate answer

(34)

Bitmaps

ƒ Order of magnitude improvement

R ƒ Order of magnitude improvement

compared to scan.

ƒ Bitmaps are best suited for multiple conditions on several attributes, each

B Y

color

conditions on several attributes, each having a low selectivity.

Y N

holiday_flag

10 12 14

ec) bitmap - two

attributes

10 12

sec)

4 6 8 10

ponse Time (se attributes

r-tre e

4 6 8

ghput (Queries/s

line ar scan

bitmap p

bitmap

(35)

Materialized Views

highly

i d monthly sales per item-type

summarized monthly sales per item type

for past 10 years

M

lightly

summarized

weekly sales for past 5 years

Metadata

sales detail of current year Current

Detail ETL from

OLTP system

sales detail of current year (current detail)

sales detail for past 00 years (older detail)

OLTP system

(36)

Materialized Views

ƒ Graph:

14 p

à Oracle9i on Linux

à Total sale by vendor is materialized

ƒ Trade-off between query speed-up and view

6 8 10 12 14

Time (sec)

q y p p

maintenance:

à The impact of incremental maintenance on performance is significant.

R b ild i hi d

0 2 4 6

M aterialized View No M aterialized View

Response

à Rebuild maintenance achieves a good throughput.

à A static data warehouse offers a good trade-off.

1200 1600

sec)

M aterialized View No M aterialized View

400 800 1200

t (statements/

(37)

Approximations

ƒ Ideal application parameters: broad, dynamic, large data

ƒ Rippling

ƒ Rippling

à When computing average of all salaries

à If average value changes little with increasing size, then it is probably accurate for all data (within some error bound)

( )

à Can be extended to multiple joins

(38)

Approximations (cont.)

ƒ Good approximation for query Q1 on

T PC-H Query Q1: lineitem ƒ Good approximation for query Q1 on

lineitem

ƒ The aggregated values obtained on a query with a 6-way join are significantly different

y

20 40 60

mated e Values Base e Value) 1% sample 10% sample

with a 6 way join are significantly different from the actual values -- for some

applications may still be good enough.

-40 -20 0

1 2 3 4 5 6 7 8

Aggregate values Approxi Aggregate (% of B Aggregate

T PC-H Query Q5: 6 way join

60

gg g

3 3.5 4

ec) base re lations

0 20 40 60

proximated gated values of Base ate Value)

0.5 1 1.5 2 2.5 3

sponse Time (se

1 % sample 10% sample

(39)

Other Issues

ƒ When and how to gather data

à Source-driven architecture

à Destination-driven architecture

à Data warehouses typically have slightly out-of-date data

ƒ What schema to use

à Usually an integration of multiple sources

à Must decide on a unified (consolidated) view of the data

ƒ Data cleansingg

à The process of correcting and preprocessing data

à Merge-purge operation

à Householdingg

ƒ How to propagate updates

à View-maintenance problem

ƒ What data to summarizeWhat data to summarize

à Maintain only summary data obtained by aggregation on a relation

à How do we decide on the unit granularity?

(40)

Data Mining

ƒ Data mining is the process of semi-automatically analyzing large databases to find sef l patterns

databases to find useful patterns

ƒ Knowledge Discovery in (Large) Database (KDD)

à "knowledge extraction" "data dredging" "data archaeology"

à knowledge extraction , data dredging , data archaeology ,

"data/pattern analysis“, etc.

ƒ Prediction based on past history p y

à Predict if a credit card applicant poses a good credit risk, based on some attributes (income, job type, age, ..) and past history

à Predict if a pattern of phone calling card usage is likely to be fraudulent

(41)

Data Mining (Cont.)

Some examples of prediction mechanisms:

Cl ifi i

ƒ

Classification

à Given a new item whose class is unknown, predict to which class it belongs

ƒ

Regression formulae

ƒ

Regression formulae

à Given a set of mappings for an unknown function, predict the function result for a new parameter value

ƒ

Associations

à Find books that are often bought by “similar” customers. If a new such customer buys one such book suggest the others too

customer buys one such book, suggest the others too.

à Associations may be used as a first step in detecting causation

à E.g. association between exposure to chemical X and cancer,

ƒ

Clusters

à E.g. typhoid cases were clustered in an area surrounding a contaminated well

à Detection of clusters remains important in detecting epidemics

(42)

Classification Rules

ƒ Classification rules help assign new objects to classes.

E i bil i li h ld h h b

à E.g., given a new automobile insurance applicant, should he or she be classified as low risk, medium risk or high risk?

ƒ Classification rules for above example could use a variety of data Classification rules for above example could use a variety of data, such as educational level, salary, age, etc.

à ∀ person P, P.degree = masters and P.income > 75,000

⇒ P.credit = excellent à ∀ person P, P.degree = bachelors and

(Pincome ≥ 25 000 and Pincome ≤ 75 000) (P.income ≥ 25,000 and P.income ≤ 75,000)

⇒ P.credit = good

ƒ Rules are not necessarily exact: there may be some y y

(43)

Decision Tree

ƒ Classification rules can be shown compactly as a decision tree.

(44)

Other Types of Classifiers

ƒ Neural net classifiers are studied in artificial intelligence and are not covered here

not covered here

ƒ Bayesian classifiers use Bayes theorem, which says p (c

j

| d ) = p (d | c

j

) p (c

j

)

p (c

j

| d ) p (d | c

j

) p (c

j

) p ( d )

where

p (c

j

| d ) = probability of instance d being in class c

j

,

p (d | c

j

) = probability of generating instance d given class c

j

, p (c

j

) = probability of occurrence of class c

j

p (d ) = probability of instance d occurring

(45)

Naïve Bayesian Classifiers

ƒ Bayesian classifiers require

i f (d | ) à computation of p (d | cj) à precomputation of p (cj )

à p (d ) can be ignored since it is the same for all classesp (d ) can be ignored since it is the same for all classes

ƒ To simplify the task, assume attributes have independent distributions;;

p (d | c

j

) = p (d

1

| c

j

) * p (d

2

| c

j

) * ….* (p (d

n

| c

j

)

à Each of the p (di| c| jj) can be estimated from a histogram on dg ivalues for each class cj

à the histogram is computed from the training instances

(46)

Association Rules

ƒ Examples

à Someone who buys bread is quite likely also to buy milk à Someone who buys bread is quite likely also to buy milk

à A person who bought the book Database System Concepts is quite likely also to buy the book Operating System Concepts.

ƒ Association rules:

bread milk

DB-Concepts, OS-Concepts Networks

à Left hand side: antecedent, right hand side: consequent

à An association rule must have an associated population; the population p p ; p p consists of a set of instances

‚ E.g. each transaction (sale) at a shop is an instance, and the set of all transactions is the populationp p

(47)

Association Rules (Cont.)

ƒ Rules have an associated support, as well as an associated confidence

confidence.

ƒ Support

à P(X ∩ Y) ho often does this p ttern o r à P(X ∩ Y): how often does this pattern occur

à what fraction of the population satisfies both the antecedent and the consequent of the rule

à E.g. suppose only 0.001 percent of all purchases include milk and screwdrivers. The support for the rule is milk ⇒ screwdrivers is low.

C fid

ƒ Confidence

à P(Y | X): how prevalent is the pattern given X

à how often the consequent is true when the antecedent is true à how often the consequent is true when the antecedent is true.

à E.g. the rule bread ⇒ milk has a confidence of 80 percent if 80 percent of the purchases that include bread also include milk.

(48)

Finding Association Rules

ƒ

To discover association rule,

à (i( 11 , i, 22 , …, i, , nn) => i) 00

à Find large itemsets

‚ large itemset S: sets of items with sufficient support à Find rule (S – s)⇒s for every subset s⊂S( ) y

where (S – s)⇒s has sufficient confidence

B1 = {m, c, b} B2 = {m, p, j}

+ { , , } { , p, j}

B3 = {m, b} B4 = {c, j}

B5 = {m, p, b} B6 = {m, c, b, j}

B7 = {c, b, j} B8 = {b, c}

– +

{ , , j} { , }

ƒ

An association rule: {m, b} -> c.

à Confidence = 2/4 = 50%

à Support = 2/8 = 25%

à Support = 2/8 = 25%

(49)

Finding Association Rules (cont.)

ƒ Sample Database

1. A supermarket sells 10,000 items

2. The average basket has 10 items

3. There are 10,000,000 baskets in the database

ƒ Table structure

à Baskets_long(BID, item1, item2, …, item10)

à Baskets(BID, item) ( , )

‚ 10 X 10,000,000 = 100,000,000 records

ƒ The support threshold is 1% of the baskets pp

à i.e., X=>Y is interesting only if X, Y are purchased together

at least 1% of the time (100,000 baskets) (

(50)

Frequent Pairs in SQL

SELECT b1.item, b2.item

Look for two

Basket tuples

FROM Baskets b1, Baskets b2

Basket tuples with the same

basket and different items.

WHERE b1.BID = b2.BID AND b1.item < b2.item

First item must precede second,

so we don’t

AND b1.item b2.item GROUP BY b1.item, b2.item

count the same pair twice.

HAVING COUNT(*) >= 100000;

Create a group for each pair of items each pair of items

(51)

A-Priori Algorithm

ƒ The naïve way (query in previous page)

à Join complexity: 100,000,000 X 100,000,000

à Each basket contributes C(10, 2) = 45 pairs

à Result of join has 45 X 10,000,000 records

à Need to perform Group by and Having

ƒ The A-Priori algorithm

à Key idea: If item i does not appear in s baskets, then no pair y pp , p including i can appear in s baskets

à So, do not consider items that appear in less than s baskets

(52)

A-Priori Algorithm (cont.)

ƒ The new query

INSERT INTO Baskets1(basket item) INSERT INTO Baskets1(basket, item) SELECT * FROM Baskets

WHERE item IN ( SELECT item FROM Baskets GROUP BY item

HAVING COUNT(*) >= 100000 );

ƒ Perform previous query with Baskets1

ƒ Join computation

à Computing Baskets1 is cheap since there is no join

‚ Or could have built index from beginning à At most 1 000 items can be frequentAt most 1,000 items can be frequent

(53)

Other Types of Mining

ƒ Text mining: application of data mining to textual documents

l W b fi d l d

à cluster Web pages to find related pages

à cluster pages a user has visited to organize their visit history à classify Web pages automatically into a Web directoryclassify Web pages automatically into a Web directory

ƒ Data visualization systems help users examine large volumes of data and detect patterns visually p y

à Can visually encode large amounts of information on a single screen à Humans are very good a detecting visual patterns

(54)

END OF CHAPTER 18

참조

관련 문서

Family support was selected as an independent variable, probability game item addiction was placed as a dependent variable, and friendship support,

Full name, resident registration number (In case of foreign students: alien registration number or passport number), address, phone number, cell phone number,

As the results from this study, it proved the usefulness of color psychological counseling through the effect of color stimuli on brain waves to

• 화장품 산업은 cosmetics와 toiletries로 분류되고 헤어케어, 스킨케어, 향수, color-cosmetics, 방취제, 베이비 케어, 탈모제, 선 케어, 구강위생, 남성

This designation is without prejudice to positions on status, and is in line with United Nations Security Council Resolution 1244/99 and the Opinion of the International

 IR color (Yellow filter used to eliminate blue and create IR color (or false-color infrared) of 05-1.0 m, or green, red, and IR).  4 bands (blue, green,

복측 흐름 Item identification (무엇 what 경로) 배측 흐름 Spatial location

v EMP 테이블에서 DEPTNO의 오름차순으로 정렬하고 같은 경우 JOB의 오름차순으로 JOB이 같은 경우에는 SAL의 내림차순으로 DEPTNO, JOB, SAL, EMPNO, ENAME, HIREDATE